Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition

Navdeep Jaitly¹, Patrick Nguyen², Andrew Senior², Vincent Vanhoucke²

¹Department of Computer Science, University of Toronto   ²Google Inc.


Abstract

The use of Deep Belief Networks (DBNs) to pretrain Deep Neural Networks (DNNs) has recently led to a resurgence in the use of Artificial Neural Network / Hidden Markov Model (ANN/HMM) hybrid systems for Automatic Speech Recognition (ASR). In this paper we report results of a DBN-pretrained context-dependent DNN/HMM system trained on two datasets that are much larger than any reported previously with DBN-pretrained ANN/HMM systems: 5870 hours of Voice Search and 1400 hours of YouTube data. On the first dataset, the pretrained ANN/HMM system outperforms the best Gaussian Mixture Model / Hidden Markov Model (GMM/HMM) baseline, built with a much larger dataset, by 3.7% absolute WER; on the second dataset, it outperforms the GMM/HMM baseline by 4.7% absolute. Maximum Mutual Information (MMI) fine-tuning and model combination using Segmental Conditional Random Fields (SCARF) give additional gains of 0.1% and 0.4% absolute on the first dataset and 0.5% and 0.9% absolute on the second.

Summary of Baselines

Name           # of hours   CMLLR?   WER (%)
Voice Search   6K           No       16.0
YouTube        1400         Yes      52.3

Table 1: Baselines used for this study.

Overview

Introduction

Recent advances in machine learning have led to the development of algorithms that can be used to train deep models. One of these approaches is the Deep Belief Network (DBN), a multi-layered generative model that can be trained greedily, layer by layer, using a Restricted Boltzmann Machine (RBM) at each layer [1]. It has been observed empirically that using the parameters of a DBN to initialize (i.e., "pretrain") a deep neural network before fine-tuning with backpropagation improves the network's performance on discriminative tasks. The successful training of deep neural networks (DNNs) on several tasks, with or without pretraining, has led to their widespread adoption in speech recognition systems, where DNN/HMM hybrids have demonstrated tremendous gains [2, 4, 5, 3]. In this paper we report results of DNN/HMM hybrid experiments on Google datasets and language models that are much larger than those previously reported in this area.

Methods and Experiments

Datasets and Baselines

Voice Search: The training data for the Voice Search system consisted of approximately 5780 hours of data from mobile Voice Search and Android Voice Input. The baseline model was a triphone HMM with decision-tree clustered states. The acoustic data was contiguous frames of PLP features transformed by Linear Discriminant Analysis (LDA). Semi-Tied Covariances (STCs) were used in the GMMs to model the LDA-transformed features. The model was trained discriminatively with Boosted MMI (BMMI), yielding a context-dependent (CD) model with 7969 states.

YouTube: The training data for the YouTube system consisted of approximately 1400 hours of data from YouTube. The system used 9 frames of MFCCs transformed by LDA, and Speaker Adaptive Training (SAT) was performed. Decision-tree clustering was used to obtain 17552 triphone states, and STCs were used in the GMMs to model the features. The acoustic models were further improved with BMMI. During decoding, Constrained Maximum Likelihood Linear Regression (CMLLR) and Maximum Likelihood Linear Regression (MLLR) transforms were applied.
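Both baselines build their observation vectors by stacking contiguous frames and projecting them with LDA. As a purely illustrative NumPy sketch (the context width, feature dimensionality, and the random stand-in for an estimated LDA matrix are assumptions, not the systems' actual configuration), that preprocessing step might look like this:

# Illustrative front end: stack each frame with +/-4 neighbours and apply
# a precomputed LDA projection. All dimensions here are assumptions.
import numpy as np

def stack_frames(feats, context=4):
    """feats: (T, D) per-frame features. Returns (T, D * (2*context + 1)),
    concatenating each frame with `context` frames on either side and
    repeating edge frames at utterance boundaries."""
    T = len(feats)
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Toy usage: 13-dim frames, a 9-frame window, LDA down to 40 dimensions.
rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 13))
lda = rng.standard_normal((13 * 9, 40))  # stand-in for an estimated LDA matrix
inputs = stack_frames(feats) @ lda       # (500, 40) inputs to the GMMs or DNN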

Figure 1: Pipeline for training the ANN/HMM hybrid system.

The ANN/HMM hybrid models were trained in stages, as shown in Figure 1. First, a baseline GMM/HMM system was trained, and forced alignment was used to associate each frame of data with a target HMM state. Next, a DBN was trained on the acoustic data (which may be MFCC vectors, log filterbanks, or speaker-adapted features stacked together), and the weights of the DBN were used to initialize a neural network, which was then trained with backpropagation to predict the HMM state from the acoustic data. Further discriminative training of the learned network was performed using MMI. Finally, SCARF was used for model combination of the DNN/HMM results with the GMM/HMM system.
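To make the pretraining and fine-tuning stages concrete, here is a minimal NumPy sketch: greedy pretraining of a stack of binary RBMs with one-step contrastive divergence (CD-1, the standard procedure from [1]), followed by one epoch of cross-entropy backpropagation against the forced-alignment state targets, using the momentum and per-layer learning rates quoted in the Neural Network Training section below. The layer sizes, toy data, sparsity threshold, and function names are illustrative assumptions, not the authors' implementation: the real system ran on GPUs via CUDAMat [8], and a Gaussian-Bernoulli RBM would normally be used in the first layer for real-valued features.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(data, n_hidden, lr=0.01, batch=200):
    """One epoch of CD-1 pretraining for a single binary RBM layer."""
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hidden)
    for i in range(0, len(data), batch):
        v0 = data[i:i + batch]
        h0 = sigmoid(v0 @ W + b_hid)                        # positive phase
        h_samp = (rng.random(h0.shape) < h0).astype(float)  # sample hiddens
        v1 = sigmoid(h_samp @ W.T + b_vis)                  # reconstruction
        h1 = sigmoid(v1 @ W + b_hid)                        # negative phase
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        b_vis += lr * (v0 - v1).mean(axis=0)
        b_hid += lr * (h0 - h1).mean(axis=0)
    return W, b_hid

def greedy_pretrain(data, layer_sizes):
    """Train RBMs layer by layer; each layer's hidden probabilities
    become the training data for the next layer."""
    Ws, bs, x = [], [], data
    for n_hidden in layer_sizes:
        W, b = pretrain_rbm(x, n_hidden)
        Ws.append(W)
        bs.append(b)
        x = sigmoid(x @ W + b)
    return Ws, bs

def fine_tune_epoch(Ws, bs, X, y, n_states,
                    lr_low=0.02, lr_top=0.04, momentum=0.9, batch=200):
    """One epoch of cross-entropy backprop against HMM-state targets:
    momentum 0.9, learning rate 0.04 on the top two layers and 0.02
    below, as quoted in the Neural Network Training section."""
    Ws = Ws + [0.01 * rng.standard_normal((Ws[-1].shape[1], n_states))]
    bs = bs + [np.zeros(n_states)]
    vel = [np.zeros_like(W) for W in Ws]
    lrs = [lr_low] * (len(Ws) - 2) + [lr_top, lr_top]
    for i in range(0, len(X), batch):
        xb, yb = X[i:i + batch], y[i:i + batch]
        acts = [xb]                                   # forward pass
        for W, b in zip(Ws[:-1], bs[:-1]):
            acts.append(sigmoid(acts[-1] @ W + b))
        logits = acts[-1] @ Ws[-1] + bs[-1]
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)             # softmax over states
        delta = p
        delta[np.arange(len(yb)), yb] -= 1.0          # dLoss/dlogits
        delta /= len(yb)
        for l in reversed(range(len(Ws))):            # backward pass
            gW, gb = acts[l].T @ delta, delta.sum(axis=0)
            if l > 0:                                 # propagate before update
                delta = (delta @ Ws[l].T) * acts[l] * (1.0 - acts[l])
            vel[l] = momentum * vel[l] - lrs[l] * gW
            Ws[l] += vel[l]
            bs[l] = bs[l] - lrs[l] * gb
    return Ws, bs

# Toy usage: random data stands in for stacked acoustic frames and
# forced-alignment targets; sizes are far smaller than the real systems.
X = rng.random((2000, 360))
y = rng.integers(0, 50, size=2000)
Ws, bs = greedy_pretrain(X, [256, 256, 256, 256])
Ws, bs = fine_tune_epoch(Ws, bs, X, y, n_states=50)
# Sparsify as described below: zero small weights (threshold assumed here,
# the poster leaves it unspecified), then train a further quarter epoch.
for W in Ws:
    W[np.abs(W) < 0.01] = 0.0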

Results

Name           Model                                   WER (%)
Voice Search   GMM/HMM baseline                        16.0
               DBN-pretrained ANN/HMM with sparsity    12.3
               + MMI                                   12.2
               + system combination with SCARF         11.8
YouTube        GMM/HMM baseline                        52.3
               DBN-pretrained ANN/HMM with sparsity    47.6
               + MMI                                   47.1
               + system combination with SCARF         46.2

Table 2: Summary of results.

Neural Network Architecture

Based on exploratory experiments with the Broadcast News database, we chose four hidden layers with 2560 units per layer as the architecture for Voice Search. For YouTube we also used a neural network with four hidden layers, but used 1000 units in all layers except the lowest (where we used 2000 units) for computational reasons: the targets had a very high output dimensionality of 17552.

Neural Network Training

The models were trained on a dual-CPU Intel Xeon DP Quad Core E5640 machine running Ubuntu, equipped with four NVIDIA Tesla C2070 GPUs. Each job ran on a single CPU with a single GPU board. Data were loaded into CPU memory in large batches of 20 hours for Voice Search and 17.5 hours for YouTube, then transferred to the GPU and randomly permuted. Mini-batches of size 200 for Voice Search and 500 for YouTube were built by cycling through the permuted vectors. Model parameters were kept and updated in GPU memory. Average gradients were computed over each mini-batch, and parameters were updated with a learning rate of 0.04 for the top two layers of the network and 0.02 for the others, with a momentum of 0.9. Each DBN layer was pretrained for one epoch as an RBM, and the resulting ANN was then discriminatively fine-tuned for one epoch. Weights with magnitudes below a threshold were then permanently set to zero before a further quarter epoch of training. All computations involved in training the DBN (matrix multiplications, sampling, etc.) and the neural network were performed on the GPU using the CUDAMat library [8].

Discriminative Training with MMI

Discriminative training of the DNN/HMM model was performed using a gradient update rule similar to that described in [6]. Gradient descent with momentum was performed over large mini-batches, each equal to 1/20th of the entire training set.

Model Combination with GMM/HMMs using SCARF

SCARF was used to combine the results of the GMM/HMM model with those of the DNN/HMM model [7].

References

[1] G. E. Hinton, S. Osindero, and Y. Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18:1527–1554, 2006.
[2] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition. IEEE Trans. Audio, Speech, and Language Processing, 2012.
[3] F. Seide, G. Li, and D. Yu. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. Proc. Interspeech, 437–440, 2011.
[4] A. Mohamed, G. E. Dahl, and G. E. Hinton. Acoustic Modeling Using Deep Belief Networks. IEEE Trans. Audio, Speech, and Language Processing, 2012.
[5] A. Mohamed, T. N. Sainath, G. E. Dahl, B. Ramabhadran, G. E. Hinton, and M. A. Picheny. Deep Belief Networks Using Discriminative Features for Phone Recognition. Proc. ICASSP, 5060–5063, 2011.
[6] B. Kingsbury. Lattice-Based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling. Proc. ICASSP, 3761–3764, 2009.
[7] G. Zweig, P. Nguyen, et al. Speech Recognition with Segmental Conditional Random Fields: A Summary of the JHU CLSP 2010 Summer Workshop. Proc. ICASSP, 5044–5047, 2011.
[8] V. Mnih. CUDAMat: A CUDA-Based Matrix Class for Python. Technical Report 004, University of Toronto, 2009.
