End-to-End Attention-based Large Vocabulary Speech Recognition

D Bahdanau, JK Chorowski, D Serdyuk, P Brakel, Y Bengio

End-to-end trainable systems

What is end-to-end:
• “training all the modules to optimize a global performance criterion” (“Gradient-based learning applied to document recognition”, LeCun et al., 98)
• LeCun et al. present a system for recognizing checks in which segmentation and character recognition are trained jointly, with word constraints taken into account (the approach would now be called Conditional Random Fields)

Not end-to-end: hand-crafted feature engineering, manual integration of separately trained modules.

Why end-to-end: better performance, better portability.

End-to-end trainable systems are the future

Recent examples of end-to-end systems:
• convolutional networks for object recognition from raw pixels (Krizhevsky et al., 12)
• Neural Machine Translation: takes raw words as the input, all components trained together (Sutskever et al., 14, Bahdanau et al., 15)
• Neural Caption Generation: produces image descriptions from raw images (many recent papers)

Are DNN-HMM systems end-to-end trainable?

Without sequence discriminative training: no
• Lexicon and HMM structure are not optimized with the rest of the system
• Acoustic model (DNN) is trained to predict the states of the HMM in isolation from the language model

With sequence discriminative training: more end-to-end, but still no
• Lexicon and HMM structure …

Our (more-) end-to-end approach
• Direct transduction from speech to characters (Graves et al., 14)
• Based on Bidirectional Recurrent Neural Networks (BiRNN) and an attention mechanism, as in the Neural Machine Translation approach of Bahdanau et al., 14
• An additional language model is added after training, so the system is not yet fully end-to-end

Recurrent Neural Networks

Generic RNN (outputs are optional): $y_t \sim g(s_t)$, $s_{t+1} = f(s_t, x_{t+1}, y_t)$

“Simple RNN” with hyperbolic tangent units:
$p(y_t \mid s_t) = \mathbf{1}(y_t)^T \mathrm{softmax}(V s_t + b)$
$s_{t+1} = \tanh(W s_t + x_{t+1} + U \mathbf{1}(y_t) + b_s)$

We use Gated Recurrent Units (GRU).
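A minimal sketch of one step of the simple RNN above, in plain Python. The function names, the list-of-rows matrix layout, and the name `b_s` for the bias of the state update are assumptions made for illustration; the actual system uses GRUs rather than this cell.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def simple_rnn_step(s, x_next, y, W, U, V, b, b_s):
    """Return (output distribution given s, next state) for symbol index y."""
    # Output distribution: softmax(V s + b); p(y | s) is its y-th entry.
    probs = softmax([a + c for a, c in zip(matvec(V, s), b)])
    # U 1(y) selects column y of U, because 1(y) is a one-hot vector.
    u_col = [row[y] for row in U]
    # State update: tanh(W s + x_next + U 1(y) + b_s).
    s_next = [math.tanh(a + xi + ui + bi)
              for a, xi, ui, bi in zip(matvec(W, s), x_next, u_col, b_s)]
    return probs, s_next
```

Note that the input is added to the state pre-activation directly, as on the slide, so in this sketch `x_next` must have the same dimensionality as the state.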

Deep Bidirectional RNN with Subsampling

BiRNN = forward RNN + backward RNN (both without outputs)

Deep BiRNN = the states of layer K are the inputs of layer K + 1
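The stacking described above can be sketched as follows. This is an illustration under assumptions: `toy_step` is a hypothetical scalar tanh cell standing in for a GRU, the forward and backward states are combined by concatenation, and subsampling is assumed to keep every second frame. The defaults (4 layers, subsampling after layers 2 and 3) follow the configuration given later in the slides.

```python
import math

def toy_step(s, x):
    """Toy recurrent cell standing in for a GRU: scalar tanh update."""
    return [math.tanh(s[0] + sum(x))]

def birnn_layer(xs, step=toy_step):
    """Run a forward and a backward RNN over xs; pair their states per frame."""
    fwd, s = [], [0.0]
    for x in xs:
        s = step(s, x)
        fwd.append(s)
    bwd, s = [], [0.0]
    for x in reversed(xs):
        s = step(s, x)
        bwd.append(s)
    bwd.reverse()  # align backward states with the original frame order
    # Concatenate forward and backward states for each frame.
    return [f + b for f, b in zip(fwd, bwd)]

def deep_birnn(xs, n_layers=4, subsample_after=(2, 3)):
    """Stack BiRNN layers; the states of layer k are the inputs of layer k + 1."""
    hs = xs
    for k in range(1, n_layers + 1):
        hs = birnn_layer(hs)
        if k in subsample_after:
            hs = hs[::2]  # assumed scheme: keep every second frame
    return hs
```

With two subsampling steps, the top layer is four times shorter than the input, which shortens the sequence the attention mechanism has to scan.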

Attention Mechanism

At each step:
1. Compute attention weights $\alpha_{t,j}$ with an MLP
2. Compute the weighted sum $c_t$
3. Use the weighted sum $c_t$ as the input to the RNN
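The steps above can be sketched as a single function. This is a simplified, content-based illustration: the scoring MLP `w . tanh(Ws s + Wh h_j)` and the parameter names `Ws`, `Wh`, `w` are assumptions, and the scorer used in the actual system may take additional inputs.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attend(s, hs, Ws, Wh, w):
    """Return (alpha, c): attention weights over hs and their weighted sum."""
    # Step 1: score every encoder state with a one-hidden-layer MLP.
    projected_s = matvec(Ws, s)
    scores = []
    for h in hs:
        hidden = [math.tanh(a + b) for a, b in zip(projected_s, matvec(Wh, h))]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Normalize the scores with a softmax so the weights sum to one.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Step 2: weighted sum of the encoder states.
    c = [sum(a * h[d] for a, h in zip(alpha, hs)) for d in range(len(hs[0]))]
    # Step 3 happens in the caller: c is fed to the RNN as its input.
    return alpha, c
```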

The System at a Glance

The network defines $P_\theta(Y \mid X)$, where
• $Y = y_1 y_2 \dots y_T$ are characters
• $X = x_1 x_2 \dots x_L$ are feature vectors
• $\theta$ are the parameters

Components:
• Attention mechanism
• 4 GRU BiRNN layers with 250 units each, subsampling after layers 2 and 3
• GRU RNN with 250 units

Training

We train the network to maximize the log-likelihood of the correct output:

$\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(Y_i \mid X_i) \to \max$

$P_\theta$ is differentiable with respect to $\theta$, so we can use gradient-based methods.
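As a small illustration of the training criterion, the sketch below evaluates the average log-likelihood from per-character output distributions. It assumes the sequence probability factorizes over characters by the chain rule, with each distribution already conditioned on the input and the preceding characters; the function names and the data layout are hypothetical.

```python
import math

def sequence_log_likelihood(step_probs, targets):
    """log P(Y | X) as the sum of per-character log-probabilities."""
    return sum(math.log(p[y]) for p, y in zip(step_probs, targets))

def mean_log_likelihood(batch):
    """Average log-likelihood over (step_probs, targets) pairs: the training criterion."""
    return sum(sequence_log_likelihood(p, y) for p, y in batch) / len(batch)
```

In practice a framework with automatic differentiation would compute this quantity and its gradient with respect to the parameters.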

Decoding

We use beam search to find $\hat{Y} = \arg\max_Y \log P_\theta(Y \mid X)$.

Problem: not enough text in the training data.
Workaround: add an additional language model:
$\hat{Y} = \arg\max_Y \left( \alpha \log P_\theta(Y \mid X) + \beta \log P_{LM}(Y) + (1 - \alpha - \beta) |Y| \right)$

Problem: we have a word-level language model, but we need a character-level one.
Workaround:
• compose a “spelling” FST with the FST obtained from the n-gram LM
• minimize the new FST and push the weights
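The combined decoding criterion can be illustrated by rescoring a few complete hypotheses. This is only a sketch: the hypothesis tuples and the scores in the usage example are made up, and a real decoder applies the criterion inside beam search rather than to a finished list.

```python
def combined_score(log_p_model, log_p_lm, length, alpha, beta):
    """Weighted sum of model score, LM score, and a length term."""
    return alpha * log_p_model + beta * log_p_lm + (1 - alpha - beta) * length

def best_hypothesis(hypotheses, alpha, beta):
    """Pick the best (text, log_p_model, log_p_lm) triple; |Y| is the character count."""
    return max(
        hypotheses,
        key=lambda h: combined_score(h[1], h[2], len(h[0]), alpha, beta),
    )[0]

# Made-up scores: the LM prefers "a cat", the network slightly prefers "a cut".
hyps = [("a cat", -2.0, -6.0), ("a cut", -1.8, -9.0)]
with_lm = best_hypothesis(hyps, alpha=0.5, beta=0.3)
without_lm = best_hypothesis(hyps, alpha=1.0, beta=0.0)
```

The length term offsets the bias toward short outputs that arises because every additional character adds a negative log-probability.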

Comparison with other approaches

Our approach: the alignment $\alpha$ is computed.

More traditional approach: make $\alpha$ a latent variable and marginalize it out:
$P_\theta(Y \mid X) = \sum_\alpha P_\theta(Y, \alpha \mid X)$
• Connectionist Temporal Classification (CTC, Graves et al., 06): $P_\theta(Y \mid \alpha, X)$ is factorized
• RNN Transducer (Graves et al., 12): another RNN is used to model dependencies in $Y$

Our model and RNN Transducer are equal in terms of expressive power.

Experiment

Data details:
• Wall Street Journal (WSJ) dataset, ~80 hours of training data
• mel-scale filterbank + deltas + delta-deltas + energies = 123 features
• model selection on dev93, evaluation on eval92

Training details:
• ADADELTA learning rule, anneal $\epsilon$ from $10^{-8}$ to $10^{-10}$
• adaptive gradient clipping
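The 123-dimensional input can be illustrated with a short sketch. It assumes 40 filterbank coefficients plus one energy value per frame (41 static features, a split not stated on the slide), and it uses simple central differences for the deltas; toolkits typically use a regression window instead.

```python
def deltas(frames):
    """First-order temporal differences, with the edge frames repeated."""
    padded = [frames[0]] + frames + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
        for t in range(len(frames))
    ]

def add_deltas(frames):
    """Append deltas and delta-deltas to every static feature vector."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]

# Five toy frames of 41 static features each (assumed: 40 filterbank + 1 energy).
static = [[float(t)] * 41 for t in range(5)]
features = add_deltas(static)  # each frame now has 41 * 3 = 123 features
```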

Tricks of the Trade
• Windowing: let $p_t = \mathrm{median}(\alpha_t)$. Set all $\alpha_{t,j}$ outside of $[p_t - l;\, p_t + r]$ to zero. Used during training and during testing, especially important when decoding with an LM. Latest findings: not so necessary when subsampling is used.
• Regularization: constraining the norm of incoming weights to 1 for every unit of the network brings ~30% (!!!) performance improvement.
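Both tricks can be sketched in a few lines. The details are assumptions: the median is taken as the first position where the cumulative attention mass reaches one half, the windowed weights are not renormalized (the slide does not say whether they are), and the norm constraint is implemented as a rescaling applied only when a unit's incoming weight vector exceeds the limit.

```python
import math

def median_position(alpha):
    """Index where the cumulative attention mass first reaches one half."""
    cumulative = 0.0
    for j, a in enumerate(alpha):
        cumulative += a
        if cumulative >= 0.5:
            return j
    return len(alpha) - 1

def window(alpha, left, right):
    """Zero all attention weights outside [p - left, p + right]."""
    p = median_position(alpha)
    return [a if p - left <= j <= p + right else 0.0 for j, a in enumerate(alpha)]

def constrain_incoming_norms(W, max_norm=1.0):
    """Rescale each row of W (one unit's incoming weights) whose L2 norm exceeds max_norm."""
    constrained = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max_norm / norm if norm > max_norm else 1.0
        constrained.append([w * scale for w in row])
    return constrained
```

A projection like `constrain_incoming_norms` would typically be applied after every parameter update.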

Results

Model | Language Model
… | extended trigram
CTC, phonemes, Miao et al. (2015) |
CTC, characters, Miao et al. (2015) |
CTC, characters, Miao et al. (2015) | extended trigram
DNN-HMM (Kaldi), Miao et al. (2015) |
DNN-HMM, seq. dis. training, extended lexicon |


Alignment example

Discussion and Future Work

What is better for LVCSR, alignment computation or alignment marginalization?
• Enough evidence that both are feasible (this paper, “Listen Attend and Spell”, “DeepSpeech2”, “EESEN”)
• But fully end-to-end training is yet to be tried for both

In our future work we want to train the network to work well with the LM.

Thank you for your attention!

Code: https://github.com/rizar/attention-lvcsr (research quality…)

