Large Scale Distributed Acoustic Modeling With Back-off N-grams

Google Search by Voice Ciprian Chelba, Peng Xu, Fernando Pereira, Thomas Richardson

04/23/2013 Ciprian Chelba et al., BAM: Large Scale Distributed Acoustic Modeling – p. 1

Statistical Modeling in Automatic Speech Recognition

[Diagram: source-channel model of ASR. Speaker's Mind -> W -> Speech Producer -> Speech -> Acoustic Processor -> A -> Linguistic Decoder -> Ŵ. The speech producer and acoustic processor form the acoustic channel; the acoustic processor and linguistic decoder form the speech recognizer.]

Ŵ = argmax_W P(W|A) = argmax_W P(A|W) · P(W)

P(A|W): acoustic model (Hidden Markov Model)
P(W): language model (Markov chain)

search for the most likely word string Ŵ
due to the large vocabulary size—1M words—an exhaustive search is intractable

Voice Search LM Training Setup

training data: 230B words of correct google.com queries, normalized for ASR, e.g. 5th -> fifth
vocabulary size: 1M words, OoV rate 0.57% (!), excellent n-gram hit ratios

Order | no. n-grams | pruning   | PPL | n-gram hit-ratios
3     | 15M         | entropy   | 190 | 47/93/100
3     | 7.7B        | none      | 132 | 97/99/100
5     | 12.7B       | 1-1-2-2-2 | 108 | 77/88/97/99/100


Is a Bigger LM Better? YES!

[Figure: Perplexity (left axis, 100–200) and WER (right axis, 16.5%–19%) as a function of 5-gram LM size (# 5-grams in billions, log scale from 10^-2 to 10^1); both decrease steadily as the LM grows.]

PPL is well correlated with WER. It is critical to let model capacity (number of parameters) grow with the data.

Back to Acoustic Modeling: How Much Model Can We Afford?

typical amounts of AM training data in ASR vary from 100 to 1,000 hours
frame rate in most systems is 100 Hz (one frame every 10 ms)
assuming 1,000 frames are sufficient for robustly estimating a single Gaussian, 1,000 hours of speech would allow training about 0.36 million Gaussians (quite close to actual systems!)
We have 100,000 hours of speech! Where is the 40-million-Gaussian AM?
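The Gaussian budget above is simple arithmetic; a quick sketch (the 1,000-frames-per-Gaussian figure is the slide's working assumption):

```python
# Back-of-the-envelope Gaussian budget from training data size.
FRAMES_PER_SEC = 100        # 10 ms frame shift
FRAMES_PER_GAUSSIAN = 1000  # assumed sufficient for robust estimation

def gaussian_budget(hours):
    """Number of Gaussians supportable by `hours` of speech."""
    frames = hours * 3600 * FRAMES_PER_SEC
    return frames // FRAMES_PER_GAUSSIAN

print(gaussian_budget(1_000))    # 360000: the ~0.36M figure on the slide
print(gaussian_budget(100_000))  # 36000000: the scale of a 40M-Gaussian AM
```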


Previous Work

GMM sizing rule [a]: log(num. components) = log(β) + α · log(n)
typical values: α = 0.3, β = 2.2 or α = 0.7, β = 0.1
same approach to getting training data as CU-HTK [b]
they report diminishing returns past 1,350 hours, 9k states / 300k Gaussians
we use 87,000 hours and build models up to 1.1M states / 40M Gaussians

[a] Kim et al., "Recent advances in broadcast news transcription," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.
[b] Gales et al., "Progress in the CU-HTK broadcast news transcription system," IEEE Transactions on Audio, Speech, and Language Processing, 2006.
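The sizing rule is equivalent to num. components = β · n^α; a minimal sketch, assuming n counts the frames aligned to the state (the function name and the floor of 1 component are ours):

```python
def num_components(n_frames, alpha, beta):
    """GMM sizing rule: log(c) = log(beta) + alpha * log(n), i.e. c = beta * n^alpha."""
    return max(1, round(beta * n_frames ** alpha))

# The two typical settings quoted on the slide, for 100k frames:
print(num_components(100_000, alpha=0.3, beta=2.2))
print(num_components(100_000, alpha=0.7, beta=0.1))
```

Note how the two settings trade off: the α = 0.7 rule grows much faster with data, which matters once training sets reach tens of thousands of hours.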

Back-off N-gram Acoustic Model (BAM)

W = action, phone string: sil ae k sh ih n sil

BAM with M = 3 extracts (each M-phone paired with its frames):

  ih_1 / ae k sh ___ n sil
  ih_1 / k sh ___ n sil
  ih_1 / sh ___ n

Back-off strategy:
back off at both ends if the M-phone is symmetric
if not, back off from the longer end until the M-phone becomes symmetric

Rich Schwartz et al., "Improved Hidden Markov modeling of phonemes for continuous speech recognition," in Proceedings of ICASSP, 1984.
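The back-off order above can be sketched as follows (a minimal illustration; the function name and tuple representation are ours, not the paper's):

```python
def backoff_m_phones(center, left, right):
    """Yield back-off M-phones from the maximal context down to the
    context-independent phone: trim the longer side first, and trim
    both ends at once when the context is symmetric."""
    l, r = list(left), list(right)
    while l or r:
        if len(l) > len(r):
            l = l[1:]                 # trim the outermost left phone
        elif len(r) > len(l):
            r = r[:-1]                # trim the outermost right phone
        else:
            l, r = l[1:], r[:-1]      # symmetric: trim both ends
        yield (tuple(l), center, tuple(r))

# The example from the slide: ih_1 / ae k sh ___ n sil
for m in backoff_m_phones("ih_1", ["ae", "k", "sh"], ["n", "sil"]):
    print(m)
```

This reproduces the two back-offs listed on the slide (k sh ___ n sil, then sh ___ n), ending with the context-independent phone.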

Back-off Acoustic Model Training

generate a context-dependent state-level Viterbi alignment using H ◦ C ◦ L ◦ W and the first-pass AM
extract maximal-order M-phones along with their speech frames, and output (M-phone key, frames) pairs
compute back-off M-phones and output (M-phone key, empty) pairs
to avoid sending the frame data M times, sort the stream of M-phones arriving at a Reducer in nesting order
cache frames arriving on maximal-order M-phones for use with lower-order M-phones when they arrive

MapReduce for BAM Training

Chunked input data:

  "action"  --- frames
  "fashion" --- frames
  "faction" --- frames

Mapper for "action":
- generate alignment: sil ae k sh ih n sil
- extract and emit M-phones:
    (ih_1 / ae k sh ___ n sil ~, frames_A)
    (ih_1 / k sh ___ n sil, )
    (ih_1 / sh ___ n, )
    ...

Mapper for "fashion":
- generate alignment: sil f ae sh ih n sil
- extract and emit M-phones:
    (ih_1 / f ae sh ___ n sil ~, frames_B)
    (ih_1 / ae sh ___ n sil, )
    (ih_1 / sh ___ n, )
    ...

Mapper for "faction":
- generate alignment: sil f ae k sh ih n sil
- extract and emit M-phones:
    (ih_1 / ae k sh ___ n sil ~, frames_C)
    (ih_1 / k sh ___ n sil, )
    (ih_1 / sh ___ n, )
    ...

Shuffling:
- M-phones are sent to their Reduce shard, as determined by the partitioning key shard(ih_1 / sh ___ n)
- the M-phone stream arriving at a given Reduce shard is sorted in lexicographic order

Reducer for partition shard(ih_1 / sh ___ n):
- maintains a stack of nested M-phones in reverse order, along with a frames reservoir:
    ... ae k sh ___ n sil ~, frames_A
    ... f ae sh ___ n sil ~, frames_B
    ... k sh ___ n sil,      frames_A | frames_C
    ... ae sh ___ n sil,     frames_B
    ... sh ___ n,            frames_A | frames_B | frames_C
- when a new M-phone arrives: pop the top entry, estimate a GMM, and output the (M-phone, GMM) pair

Partition shard(ih_1 / sh ___ n) of the associative array (M-phone, GMM) storing BAM

SSTable output stores BAM as a distributed associative array of (M-phone, GMM) pairs
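A minimal sketch of the reducer's frame-caching logic (names such as `reduce_shard` and the list-of-frames representation are ours; the real implementation streams MapReduce shards rather than building dicts in memory):

```python
from collections import defaultdict

def reduce_shard(sorted_pairs, backoff_keys, estimate_gmm):
    """Reduce one shard of (M-phone key, frames) pairs.

    sorted_pairs -- pairs in nesting order; maximal-order M-phones carry
                    frames, back-off M-phones carry None
    backoff_keys -- maps a maximal-order key to its back-off keys
    estimate_gmm -- GMM estimator run on the pooled frames
    """
    reservoir = defaultdict(list)  # back-off key -> frames cached so far
    models = {}
    for key, frames in sorted_pairs:
        if frames is not None:                # maximal order: frames attached
            for bk in backoff_keys(key):      # cache them for the back-offs,
                reservoir[bk].extend(frames)  # so frames are sent only once
            models[key] = estimate_gmm(frames)
        else:                                 # lower order: use cached frames
            models[key] = estimate_gmm(reservoir.pop(key))
    return models
```

With the slide's example, frames_A and frames_C both land in the reservoir for k sh ___ n sil before its empty-payload record arrives.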


N-best Rescoring

load the model into an in-memory key-value serving system (SSTable service) with S servers, each holding 1/S-th of the data
query the SSTable service with batch requests for all M-phones (including back-offs) in an N-best list

log P_AM(A|W) = λ · log P_first-pass(A|W) + (1 − λ) · log P_second-pass(A|W)
log P(W, A) = 1/lmw · log P_AM(A|W) + log P_LM(W)
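The score combination can be written out directly (a sketch; argument names are ours):

```python
def rescore(logp_am_first, logp_am_second, logp_lm, lam, lmw):
    """Combine first- and second-pass AM log-scores, then add the LM score.

    lam -- AM interpolation weight (lambda); lmw -- language model weight.
    """
    logp_am = lam * logp_am_first + (1.0 - lam) * logp_am_second
    return logp_am / lmw + logp_lm

# Pick the hypothesis with the highest combined score from an N-best list,
# each entry being (first-pass AM, second-pass AM, LM) log-scores.
nbest = [(-120.0, -118.0, -9.0), (-121.0, -115.0, -8.5)]
best = max(nbest, key=lambda s: rescore(*s, lam=0.8, lmw=12.0))
```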


Experimental Setup

training data:
- baseline ML AM: 1 million manually transcribed Voice Search spoken queries—approx. 1,000 hours of speech
- filtered logs: 110 million Voice Search spoken queries + 1-best ASR transcript, filtered at 0.8 confidence (approx. 87,000 hours)
dev/test data: manually transcribed, each about 27,000 spoken queries (87,000 words)
N = 10-best rescoring: 7% oracle WER on the dev set, against a 15% WER baseline
80% of the test set has 0% WER at 10-best
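The oracle WER quoted above is the error rate obtained by always picking the best hypothesis in each N-best list; a minimal sketch (function names are ours):

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))          # row for the empty ref prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i               # prev holds old d[j-1]
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution / match
    return d[-1]

def oracle_wer(nbest_lists, references):
    """Lowest achievable WER if the best N-best hypothesis were always chosen."""
    errors = words = 0
    for hyps, ref in zip(nbest_lists, references):
        errors += min(edit_distance(ref, h) for h in hyps)
        words += len(ref)
    return errors / words
```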

Experimental Results: Maximum Likelihood Baseline

Model | λ   | Train (hrs) | Source    | WER (%) | No. Gaussians | M
ML    | 0.6 | 1k          | base AM   | 11.6    | 327k          | —
ML    | 1.0 | 1k          | base AM   | 11.9    | 327k          | —
BAM   | 0.8 | 1k          | base AM   | 11.5    | 490k          | 1
BAM   | 0.8 | 1k          | 1% logs   | 11.3    | 600k          | 2
BAM   | 0.8 | 1k          | 1% logs   | 11.4    | 720k          | 1
BAM   | 0.6 | 9k          | 10% logs  | 10.9    | 3,975k        | 2
BAM   | 0.6 | 9k          | 10% logs  | 10.9    | 4,465k        | 1
BAM   | 0.6 | 87k         | 100% logs | 10.6    | 22,210k       | 2
BAM   | 0.6 | 87k         | 100% logs | 10.6    | 14,435k       | 1

BAM steadily improves with more data; modeling phonetic context beyond triphones does not really help. 1.3% absolute (11% relative) WER reduction over the ML baseline.

Experimental Results: WER with Model Size

[Figure: WER (%) vs. model size (number of Gaussians, log scale, roughly 10^5 to 10^8) for M=1 (α = 0.7, β = 0.1) and M=2 (α = 0.3, β = 2.2); WER falls from about 11.5% to 10.6% as the model grows.]


Experimental Results: WER with Data Size

[Figure: WER (%) vs. training data size (hours, log scale, roughly 10^2 to 10^5) for M=1 (α = 0.7, β = 0.1) and M=2 (α = 0.3, β = 2.2); WER falls from about 11.5% to 10.5% as the data grows.]


Experimental Results: bMMI Baseline

Model | λ   | Train (hrs) | Source    | WER (%) | No. Gaussians | M
bMMI  | 0.6 | 1k          | base AM   | 9.7     | 327k          | —
bMMI  | 1.0 | 1k          | base AM   | 9.8     | 327k          | —
BAM   | 0.8 | 87k         | 100% logs | 9.2     | 40,360k       | 3

0.6% absolute (6% relative) WER reduction over the tougher 9.8% bMMI baseline


Experimental Results: M-phone Hit Ratios

10-best hypotheses for the test data, BAM with M = 3 (7-phones), trained on the filtered logs data (87,000 hours):

left \ right context size | 0    | 1     | 2     | 3
0                         | 1.1% | 0.1%  | 0.2%  | 4.3%
1                         | 0.1% | 26.0% | 0.9%  | 3.4%
2                         | 0.7% | 0.9%  | 27.7% | 2.2%
3                         | 3.8% | 2.9%  | 2.0%  | 23.6%

For large amounts of data, DT clustering of triphone states is not needed.


Experimental Results: Validation Setup

train on the dev set with N_min = 1
test on the subset of the dev set with 0% WER at 10-best (80% of utterances; 1st-pass AM: 7.6% WER)
use only the BAM AM score, with a very small LM weight

Context type    | M | WER (%)
CI phones       | 1 | 4.5
CI phones       | 5 | 1.5
+ word boundary | 1 | 1.8
+ word boundary | 5 | 0.6

triphones do not overtrain

BAM: Conclusions and Future Work

distributed acoustic modeling is promising for improving ASR
expanding phonetic context is not really productive, whereas more Gaussians do help
Future work: bring BAM to the new world of (D)NN-AMs; discriminative training
wish: steeper learning rate as we add more training data


Parting Thoughts on ASR Core Technology

Current state:
- automatic speech recognition is incredibly complex
- the problem is fundamentally unsolved
- data availability and computing have changed significantly: 2-3 orders of magnitude more of each

Challenges and directions:
- re-visit (simplify!) modeling choices made on corpora of modest size
- multi-linguality built in from the start
- better modeling: feature extraction, acoustic, pronunciation, and language modeling


ASR Success Story: Google Search by Voice

What contributed to success:
- DNN acoustic models
- user expectations clearly set by the existing text app
- an excellent language model built from the query stream
- clean speech: users are motivated to articulate clearly
- app phones do high-quality speech capture
- speech transferred error-free to the ASR server over IP

