Gated Recurrent Units Based Hybrid Acoustic Models for Robust Speech Recognition

Jian Kang, Wei-Qiang Zhang and Jia Liu

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
[email protected]

Abstract

Recurrent neural networks (RNNs) have shown an ability to model temporal dependencies. However, the problem of exploding or vanishing gradients has limited their application. In recent years, long short-term memory RNNs (LSTM RNNs) have been proposed to solve this problem and have achieved excellent results. However, because of the large size of LSTM RNNs, they more easily suffer from overfitting, especially on low resource tasks. In addition, because the output of the LSTM unit is bounded, a vanishing gradient issue often remains across multiple layers. In this work, we evaluate an architecture called gated recurrent units (GRU) to address these two problems. Compared with the LSTM RNN, the GRU network is smaller, so the model can more easily avoid overfitting. Furthermore, the output of the GRU is not constrained to be bounded, which helps alleviate the negative impact of vanishing gradients across multiple layers. We propose using deep bidirectional GRUs as hybrid acoustic models and compare the similarities and differences between LSTM and GRU. We evaluate this architecture on the CHiME2 dataset, a robust low resource speech recognition task. Results demonstrate that our architecture outperforms LSTM, relatively decreasing the WER by about 6% in the bidirectional case, and, when combined with a baseline GMM system, achieves a 1.1% absolute WER reduction compared to a strong baseline combination system.

Index Terms: gated recurrent units, long short-term memory, robust speech recognition

1. Introduction

In recent years, deep neural networks (DNNs) combined with hidden Markov models (HMMs) have become the dominant approach in acoustic modeling [1, 2, 3]. With increased computational power and quantities of data, substantial error rate reductions have been achieved for speech recognition tasks. Recurrent neural networks (RNNs) are neural networks with self connections that feed the previous time step's hidden state back as input, so the hidden state carries a dynamic history of the input feature sequence rather than a fixed-size window. RNN architectures have performed better than traditional DNNs because they can detect events outside a fixed temporal window and are less affected by temporal distortion. For these reasons, RNNs, as well as long short-term memory RNNs (LSTM RNNs) [4], are more suitable for sequence tasks such as sequence modeling and prediction [5, 6, 7, 8], and have helped improve robustness in speech recognition [9, 10, 11, 12, 13].

However, traditional LSTM RNNs are sophisticated and require complex training mechanisms and tricks, which make them difficult to implement. To address this, another variant of the conventional RNN, the gated recurrent unit (GRU) [14], has recently been proposed. It has been shown [15, 16] that, for a fixed number of parameters, GRU models can outperform LSTM units on polyphonic music modeling and speech signal modeling. Although some early attempts have used GRUs in end-to-end recognition systems [17, 18], GRU networks have rarely been applied as hybrid acoustic models to predict HMM states. In this paper, we evaluate GRU networks and bidirectional GRU variants as hybrid acoustic models that estimate posterior probabilities of physical states on the CHiME2 dataset, a robust low resource speech recognition task. Our results demonstrate that GRU networks achieve better results than LSTM networks, with about 6% average relative error reduction in the bidirectional case. Furthermore, we combine our models with the baseline model of [19] and show that the combined model outperforms the best result of [19]. We also analyze two training parameters of bidirectional GRU networks and demonstrate their roles. The remainder of this paper is organized as follows: in Section 2 we briefly review the LSTM unit; in Section 3 we present the GRU, use it to construct deep GRU and bidirectional GRU networks, and analyze these architectures; we report our experimental results in detail in Section 4, with conclusions in Section 5.

2. Review of LSTM unit

In this section, we give a brief introduction to the LSTM structure. LSTM units were first introduced in [4], and a popular LSTM structure is shown in Fig. 1. The key ideas behind the LSTM unit are a memory cell that maintains temporal information and non-linear activation gates that control the information flowing into the memory cell and out of the unit. Each LSTM unit consists of one cell and three control gates, which control the input, output and forgetting activations of the memory cell, respectively. The memory cell can also directly influence the gates: the LSTM unit implements this through peephole connections from the memory cell to the gates, which help learn precise timing information. Although LSTM RNNs have achieved excellent results, the architecture has some weaknesses. It is sophisticated enough to overfit relatively easily, especially on low resource tasks. In addition, training requires several complex mechanisms and computational tricks [11], which can make the parameters difficult to tune. Furthermore, although the LSTM unit alleviates the vanishing gradient problem with respect to time, the gradients may still easily become small when the error signal passes through multiple LSTM layers, because the output, and thus the gradient, of the LSTM unit is bounded. To address these problems, we adopt a simpler architecture, described in the next section.
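Before moving on, the following is a minimal NumPy sketch of a single forward step of one common peephole LSTM formulation, consistent with the description of Fig. 1 above; the parameter names and toy dimensions are illustrative assumptions, not the exact parameterization used in this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with peephole connections.

    x_t: input vector; h_prev/c_prev: previous output and cell state.
    W_* act on x_t, U_* on h_prev, w_* are per-dimension peephole weights,
    b_* are biases.
    """
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["w_ic"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["w_fc"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["w_oc"] * c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)  # output is bounded by tanh, as discussed above
    return h_t, c_t

# Toy usage with random parameters (input dim 40, hidden dim 8).
rng = np.random.default_rng(0)
d, n = 40, 8
p = {k: rng.standard_normal((n, d)) * 0.1 for k in ["W_i", "W_f", "W_c", "W_o"]}
p.update({k: rng.standard_normal((n, n)) * 0.1 for k in ["U_i", "U_f", "U_c", "U_o"]})
p.update({k: np.zeros(n) for k in ["b_i", "b_f", "b_c", "b_o", "w_ic", "w_fc", "w_oc"]})
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.standard_normal(d), h, c, p)
```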

This work is supported by the National Natural Science Foundation of China under Grants No. 61370034, No. 61273268 and No. 61403224.


Figure 1: Long short-term memory unit. 'T' denotes a delay of one time step, σ denotes the sigmoid function, 'tanh' denotes the hyperbolic tangent function.

Figure 2: Gated recurrent unit. σ denotes the sigmoid function, 'tanh' denotes the hyperbolic tangent function.

3. GRUs and their variants

In this section, we present the gated recurrent unit (GRU), another type of recurrent unit.

3.1. GRU

The GRU was recently proposed by Cho et al. [14]. Like the LSTM, it was designed to adaptively reset or update its memory content. As shown in Fig. 2, each GRU has a reset gate and an update gate, which control the memory flow. The GRU fully exposes its memory content at each time step and balances its output between the previous memory state and a new candidate memory state. The reset gate r_t is computed by

    r_t = σ(W_r x_t + U_r h_{t−1} + b_r),                      (1)

where σ is the sigmoid function, and x_t and h_{t−1} are the input to the GRU and its previous output. W_r, U_r and b_r are the forward matrix, recurrent matrix and bias for the reset gate, respectively. Similarly, the update gate z_t is computed by

    z_t = σ(W_z x_t + U_z h_{t−1} + b_z),                      (2)

where the parameters are defined as above. Next, the candidate memory state m_t is calculated by

    m_t = φ(W x_t + U (r_t ∗ h_{t−1}) + b),                    (3)

where φ is the hyperbolic tangent function and ∗ denotes element-wise multiplication. Lastly, the output of the GRU is calculated by

    h_t = z_t ∗ h_{t−1} + (1 − z_t) ∗ m_t.                     (4)
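As a concrete illustration of equations (1)-(4), here is a minimal NumPy sketch of a single GRU forward step; the parameter names mirror the notation above, and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step implementing equations (1)-(4)."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])    # (1) reset gate
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])    # (2) update gate
    m_t = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev) + p["b"])  # (3) candidate memory
    h_t = z_t * h_prev + (1.0 - z_t) * m_t                          # (4) interpolate state and candidate
    return h_t

# Toy usage (input dim 40, hidden dim 8) with random parameters.
rng = np.random.default_rng(0)
d, n = 40, 8
p = {k: rng.standard_normal((n, d)) * 0.1 for k in ["W_r", "W_z", "W"]}
p.update({k: rng.standard_normal((n, n)) * 0.1 for k in ["U_r", "U_z", "U"]})
p.update({k: np.zeros(n) for k in ["b_r", "b_z", "b"]})
h = gru_step(rng.standard_normal(d), np.zeros(n), p)
```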

3.2. Deep GRU and bidirectional GRU

In this subsection, we apply deep and bidirectional extensions to the GRU model. Inspired by DNNs, multiple GRU layers can be stacked to build a deep GRU, analogous to deep LSTM RNNs [9]. As the input features propagate through a recurrent layer, the output at each time step incorporates the history of the temporal features. Compared to a standard shallow GRU, the features generated by a deep GRU are more abstract and better suited for prediction. A deep GRU framework therefore takes advantage of the merits of both DNNs and conventional GRUs.

Another GRU variant is the bidirectional GRU (BGRU). One shortcoming of the conventional GRU is that it can only make use of previous context, not future context. In speech recognition, there is no reason not to exploit future context as well. The BGRU does this by processing the data in both directions with two separate sets of parameters. After propagating through both the forward and backward directions, the outputs of the two directions are concatenated and fed forward to the next hidden layer. Because bidirectional systems can capture a greater diversity of time dependencies, they exploit the input features more fully.

As a hybrid acoustic model, the network is trained to predict HMM states obtained from a forced alignment. For both networks, a softmax layer is added on top of the recurrent layers to generate posterior probabilities. The output of the softmax layer provides an estimate of the posterior probability P(s|o) of state s given features o:

    P(s|o) = softmax(W_s h_out + b_s),                          (5)

where W_s and b_s are the connection weight matrix and bias vector of the softmax layer, and h_out is the output of the top recurrent layer.
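The following sketch stacks the GRU step above into a two-layer bidirectional network with a softmax output over HMM states, as described in this subsection. The layer sizes, the initialization helper, and the absence of any training code are illustrative simplifications, not this paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    # Same GRU step as in the previous sketch, equations (1)-(4).
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    m_t = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev) + p["b"])
    return z_t * h_prev + (1.0 - z_t) * m_t

def init_gru_params(rng, d_in, n):
    p = {k: rng.standard_normal((n, d_in)) * 0.1 for k in ["W_r", "W_z", "W"]}
    p.update({k: rng.standard_normal((n, n)) * 0.1 for k in ["U_r", "U_z", "U"]})
    p.update({k: np.zeros(n) for k in ["b_r", "b_z", "b"]})
    return p

def gru_layer(xs, p, reverse=False):
    # Run a GRU over a sequence (list of frame vectors), optionally right-to-left.
    h = np.zeros(len(p["b"]))
    hs = []
    for x in (reversed(xs) if reverse else xs):
        h = gru_step(x, h, p)
        hs.append(h)
    return hs[::-1] if reverse else hs

def bgru_softmax(xs, fwd_params, bwd_params, W_s, b_s):
    # Stacked bidirectional GRU layers followed by a softmax over states, eq. (5).
    feats = xs
    for p_f, p_b in zip(fwd_params, bwd_params):
        fwd = gru_layer(feats, p_f)
        bwd = gru_layer(feats, p_b, reverse=True)
        feats = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
    posteriors = []
    for h in feats:
        logits = W_s @ h + b_s
        e = np.exp(logits - logits.max())
        posteriors.append(e / e.sum())
    return posteriors

# Toy usage: 40-dim inputs, 8 units per direction, 10 HMM states, 5 frames.
rng = np.random.default_rng(1)
d, n, n_states, T = 40, 8, 10, 5
fwd = [init_gru_params(rng, d, n), init_gru_params(rng, 2 * n, n)]
bwd = [init_gru_params(rng, d, n), init_gru_params(rng, 2 * n, n)]
W_s, b_s = rng.standard_normal((n_states, 2 * n)) * 0.1, np.zeros(n_states)
frames = [rng.standard_normal(d) for _ in range(T)]
posteriors = bgru_softmax(frames, fwd, bwd, W_s, b_s)
```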

3.3. Analysis and comparisons between GRU and LSTM

Both GRU and LSTM recurrent structures use gates to control the information flow and effectively create shortcut paths across multiple time steps. These gates and shortcuts help to detect and retain important features in the input sequence. In addition, they allow the error to be back-propagated easily, reducing the difficulty caused by vanishing or exploding gradients with respect to time [20]. The update gate helps the GRU capture long-term dependencies and plays a role similar to the forget gate in the LSTM. The reset gate helps the GRU reset whenever a detected feature is no longer necessary. So when the GRU learns temporally changing features, these gates activate differently.

The main difference between LSTM units and GRUs is that the GRU has no output activation function or output gate to control its output. Intuitively, one might expect the potentially unbounded output to hurt performance significantly. However, this is not the case for GRUs, perhaps because the coupling of the reset gate and update gate avoids this problem and makes an output gate or output activation function less valuable [15, 21]. Furthermore, because no output gate is used, the total size of GRU layers is smaller than that of LSTM layers, which helps GRU networks avoid overfitting.
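To illustrate the size difference mentioned above, the following sketch counts the weights in one recurrent layer of each type under common parameterizations (a GRU with three gated transformations as in equations (1)-(3), versus a peephole LSTM with four transformations as sketched in Section 2); the exact counts of this paper's models may differ slightly with implementation details.

```python
def gru_layer_params(d_in, n):
    # Three blocks (reset, update, candidate), each with W (n x d_in), U (n x n), b (n).
    return 3 * (n * d_in + n * n + n)

def lstm_layer_params(d_in, n):
    # Four blocks (input, forget, cell, output) plus three peephole vectors.
    return 4 * (n * d_in + n * n + n) + 3 * n

d_in, n = 40, 500
print("GRU layer: ", gru_layer_params(d_in, n))   # 811500 weights, ~0.81M
print("LSTM layer:", lstm_layer_params(d_in, n))  # 1083500 weights, ~1.08M
```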

4. Experimental results

4.1. Experiment setup

To evaluate our model, we run experiments on Track 2 of the CHiME2 challenge [22], a medium vocabulary (5k) task in a reverberated and noisy environment. The task provides three sets of data: clean, reverberated and isolated (noisy). The clean data are extracted from the WSJ0 database. Clean speech utterances are reverberated and mixed with background noise to obtain SNR levels of −6, −3, 0, 3, 6 and 9 dB. The training set contains 7138 utterances with SNRs ranging from −6 to 9 dB; the development and evaluation sets contain 2460 and 1980 utterances, respectively.

A GMM-HMM system is built using the Kaldi toolkit [23]. The number of triphone states is 1992, optimized on the training set with MFCC features plus LDA and MLLT; fMLLR is used for speaker adaptation. Details of the baseline system are described in [6]. All experiments use a standard 5k trigram language model. All alignments are generated from the clean signals, under the assumption that clean speech data are available.

4.2. Experimental results for different GRU parameters

In these experiments, we evaluate the impact of two hyperparameters of the BGRU: the number of sentences processed in parallel and the hidden layer size. When testing the first parameter, we fix the hidden layer size at 300; when testing the second, we fix the number of parallel sentences at 20. All these BGRU systems use two recurrent layers and a softmax layer, and all parameters are randomly initialized. The results are shown in Table 1 and Table 2. They show that as the hidden layer size increases, the results become more stable, a general phenomenon in deep networks. We can also see that increasing the number of parallel sentences decreases the WER to some extent, which may be surprising, since a better result is obtained with less computation time.
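The "number of parallel sentences" above is the number of utterances whose frames are processed together in one update. A minimal sketch of how variable-length utterances might be padded and masked for such parallel processing is shown below; the padding scheme and helper names are assumptions, not this paper's implementation.

```python
import numpy as np

def batch_utterances(utts):
    """Pad a list of (T_i, feat_dim) arrays into one (B, T_max, feat_dim) tensor
    plus a (B, T_max) mask marking real frames, so B utterances can be processed
    in parallel and padded frames ignored in the loss."""
    feat_dim = utts[0].shape[1]
    t_max = max(u.shape[0] for u in utts)
    batch = np.zeros((len(utts), t_max, feat_dim), dtype=np.float32)
    mask = np.zeros((len(utts), t_max), dtype=np.float32)
    for i, u in enumerate(utts):
        batch[i, : u.shape[0]] = u
        mask[i, : u.shape[0]] = 1.0
    return batch, mask

# Toy usage: 20 utterances of random lengths with 40-dim fBank-like features.
rng = np.random.default_rng(0)
utts = [rng.standard_normal((rng.integers(200, 600), 40)) for _ in range(20)]
batch, mask = batch_utterances(utts)
```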

Table 1: WER results (%) vs. the number of parallel sentences.

Parallel sentences | 9 dB  | 6 dB  | 3 dB  | 0 dB  | −3 dB | −6 dB | Avg   | Calc. time
4                  | 13.14 | 14.43 | 17.04 | 22.53 | 27.26 | 35.83 | 21.71 | 100%
20                 | 13.04 | 14.48 | 17.68 | 22.70 | 26.40 | 35.82 | 21.69 | 35.3%
50                 | 13.58 | 14.32 | 17.91 | 22.98 | 26.88 | 36.09 | 21.96 | 33.5%

Table 2: WER results (%) vs. hidden layer size.

Unit number | 9 dB  | 6 dB  | 3 dB  | 0 dB  | −3 dB | −6 dB | Avg
300         | 13.04 | 14.48 | 17.68 | 22.70 | 26.40 | 35.82 | 21.69
400         | 13.53 | 14.00 | 16.95 | 21.24 | 25.69 | 34.70 | 21.02
500         | 12.30 | 14.09 | 16.89 | 21.34 | 26.48 | 34.60 | 20.95
600         | 12.90 | 14.13 | 16.88 | 21.64 | 26.35 | 34.79 | 21.12
700         | 11.98 | 14.30 | 16.85 | 20.80 | 26.55 | 34.28 | 20.79
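The multi-condition data described in Section 4.1 is defined by its SNR levels. As a generic illustration of SNR-controlled mixing only (the actual CHiME2 data uses real room impulse responses and recorded living-room noise, and the signals below are made up), a noise signal can be scaled so that the mixture reaches a target SNR:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s toy "speech"
noise = rng.standard_normal(16000)
for snr in (-6, -3, 0, 3, 6, 9):                            # the CHiME2 SNR levels
    noisy = mix_at_snr(clean, noise, snr)
```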

4.3. Experimental results for comparisons between LSTM and GRU networks

In this section, we compare deep LSTM networks with deep GRU networks, in both the unidirectional and the bidirectional case. All models use 40-dimensional fBank features normalized to zero mean and unit variance as input, a learning rate of 0.0001, and an update block size of 20 sentences. All parameters are randomly initialized. Results are shown in Table 3 and Fig. 3.

In Table 3, we test the effect of the number of recurrent layers, with a hidden layer size of 500 in both the unidirectional and bidirectional cases. For both unidirectional and bidirectional LSTM layers, performance declines when the number of layers exceeds two. In comparison, the GRU layers remain stable and even achieve some improvements. In Fig. 3, we test the effect of the hidden layer size, fixing the number of hidden layers at two according to the experiments above. Fig. 3 confirms that the GRU based systems outperform the LSTM based systems: they match LSTM in the unidirectional case and relatively decrease the WER by about 6% in the bidirectional case. This may be because the GRU based systems better avoid overfitting, and the unbounded output and gradient help them train sufficiently, so the results do not decline as the model size increases. In addition, because training tricks are not needed, the GRU models are easier to tune. In the unidirectional case, both architectures are small, so the merits of GRU networks are not prominent; in the bidirectional case, these merits are fully demonstrated.

4.4. Experimental results for different systems

In these experiments, we compare deep GRU and BGRU networks with several other systems. Based on the results above, balancing training time and performance, we select the best configuration for both the GRU and the LSTM systems.
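A minimal sketch of the zero-mean, unit-variance normalization applied to the 40-dimensional fBank features mentioned above; whether the statistics are computed globally or per utterance is not specified in the paper, so per-matrix statistics are assumed here for illustration.

```python
import numpy as np

def normalize_features(feats, eps=1e-8):
    """Normalize a (num_frames, 40) fBank matrix to zero mean and unit variance
    per dimension, using statistics computed over the given frames."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / (std + eps)

# Toy usage on random "fBank" frames.
rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 40)) * 3.0 + 1.5
norm = normalize_features(feats)
assert np.allclose(norm.mean(axis=0), 0.0, atol=1e-6)
```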

Table 3: Comparisons between LSTM and GRU networks (WER %) with varying numbers of layers.

Model | 1 layer | 2 layers | 3 layers | 4 layers
GRU   | 24.64   | 22.45    | 22.40    | 22.42
LSTM  | 24.12   | 22.50    | 22.50    | 22.55
BGRU  | 23.04   | 20.95    | 20.88    | 20.90
BLSTM | 23.28   | 21.87    | 21.90    | 21.95

Table 4: Comparisons using different systems (WER %). Here, 'DL' means using discriminative learning methods, 'FT' means using feature transforms, 'enh' means using speech enhancement algorithms.

System                      | 9 dB  | 6 dB  | 3 dB  | 0 dB  | −3 dB | −6 dB | Avg   | Model size (M)
GMM (DL+FT) [19]            | 15.7  | 17.9  | 21.6  | 28.5  | 36.2  | 46.4  | 27.7  | –
GMM (DL+FT,enh) [19]        | 14.1  | 15.7  | 18.8  | 24.5  | 30.8  | 40.0  | 24.0  | –
rDNN [6]                    | 13.02 | 14.50 | 17.32 | 22.40 | 29.42 | 39.14 | 22.63 | 131
LSTM500                     | 13.37 | 14.80 | 17.98 | 22.63 | 29.52 | 36.72 | 22.50 | 20.7
GRU500                      | 13.10 | 14.64 | 17.78 | 23.01 | 29.59 | 36.57 | 22.45 | 18.1
BLSTM500 [13]               | 13.18 | 14.93 | 17.81 | 22.42 | 27.35 | 35.53 | 21.87 | 39.7
BGRU500                     | 12.30 | 14.09 | 16.89 | 21.34 | 26.48 | 34.60 | 20.95 | 31.7
BGRU700                     | 11.98 | 14.30 | 16.85 | 20.80 | 26.55 | 34.28 | 20.79 | 57.1
GMM (DL+FT,enh) + LSTM [19] | 11.90 | 13.00 | 15.50 | 20.30 | 25.70 | 33.80 | 20.00 | –
GMM (DL+FT) + BGRU500       | 11.12 | 12.87 | 15.36 | 19.76 | 25.13 | 33.81 | 19.67 | –
GMM (DL+FT) + BGRU700       | 10.82 | 12.39 | 14.87 | 18.83 | 24.06 | 32.65 | 18.94 | –

We build our systems as follows: all recurrent systems contain two hidden layers and a softmax layer, and all parameters are randomly initialized. For both the deep GRU network and the deep LSTM network, each hidden layer contains 500 units. For the BGRU networks we test 500 and 700 units per direction, and for the BLSTM we test 500 units per direction. For all these recurrent systems, the input features are 40-dimensional fBank features with zero mean and unit variance, the learning rate is 0.0001, and the update block size is 20 sentences calculated in parallel. For a direct comparison of the acoustic models, we use the same alignments as the systems above. Results are shown in Table 4.

From the results, we can see that our BGRU systems achieve a better WER than the rDNN [6], with a smaller model size. For a wider comparison, we also combine our models with the baseline GMM of [19]. The combination uses minimum Bayes risk decoding [24], which creates a union of lattices and normalizes the scores calculated from the separate lattices of the different systems. The baseline GMM is discriminatively trained, and feature transformations (LDA, MLLT, SAT) are applied. With this combination, the new models outperform the best result in [19]. Because we use features generated from the original noisy signals as input and do not use any enhancement or separation algorithms, our results do not match those of [25, 26], but we plan to add such input enhancement in the future.
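The lattice-based MBR combination of [24] operates on full decoding lattices. As a loose, hypothesis-level simplification of the underlying idea only (normalize each system's scores into posteriors, pool them across systems, and pick the best-scoring hypothesis), and with entirely hypothetical inputs, a sketch might look like this; it is not the algorithm used in the paper.

```python
import math
from collections import defaultdict

def nbest_combination(nbest_lists):
    """Combine n-best lists from several systems.

    Each list contains (hypothesis, log_score) pairs. Scores are normalized to
    posteriors within each system (so systems with different score ranges are
    comparable), summed per hypothesis, and the best hypothesis returned.
    """
    combined = defaultdict(float)
    for nbest in nbest_lists:
        max_log = max(s for _, s in nbest)
        total = sum(math.exp(s - max_log) for _, s in nbest)
        for hyp, s in nbest:
            combined[hyp] += math.exp(s - max_log) / total
    return max(combined.items(), key=lambda kv: kv[1])

# Hypothetical n-best lists from two systems (hypotheses and log scores).
system_a = [("the cat sat", -10.2), ("a cat sat", -11.0), ("the cat sad", -12.5)]
system_b = [("a cat sat", -20.1), ("the cat sat", -20.3), ("the bat sat", -23.0)]
best_hyp, score = nbest_combination([system_a, system_b])
```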

5. Conclusion

In this paper, we have applied an RNN architecture based on gated recurrent units (GRUs) to hybrid HMM acoustic models. GRU networks use gates to control the information flow, rather than letting information propagate uncontrolled over time.

Figure 3: Comparisons between LSTM and GRU networks with varying units per layer.

In addition, because there is no output gate to control the output of the GRU, GRU networks are smaller, so they alleviate overfitting better than LSTM based systems. Furthermore, the output of the GRU is not bounded by an output activation, which helps alleviate the negative impact of vanishing gradients across multiple layers. We have evaluated this architecture on the CHiME2 dataset. Results show that our architecture matches LSTM performance in the unidirectional case and obtains about 6% average relative error reduction in the bidirectional case. Moreover, combined with a baseline GMM, our model outperforms baseline systems that use speech enhancement algorithms, achieving a 1.1% absolute WER reduction. In the future, this architecture will be applied to large vocabulary continuous speech recognition as well as other speech tasks such as speech enhancement. We will also work to make bidirectional GRUs, as well as LSTMs, more suitable for online decoding.

6. References

[1] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Interspeech, 2011, pp. 437–440.
[2] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. IEEE, 2011, pp. 24–29.
[3] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Interspeech, vol. 2, 2010, p. 3.
[6] C. Weng, D. Yu, S. Watanabe, and B.-H. F. Juang, "Recurrent deep neural networks for robust speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5532–5536.
[7] N. Boulanger-Lewandowski, J. Droppo, M. Seltzer, and D. Yu, "Phone sequence modeling with recurrent neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5417–5421.
[8] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4580–4584.
[9] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
[10] A. Graves, N. Jaitly, and A.-r. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
[11] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Interspeech, 2014, pp. 338–342.
[12] A. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," 2012.
[13] J. T. Geiger, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling," in Interspeech, 2014, pp. 631–635.
[14] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[15] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[16] W. Zaremba, "An empirical exploration of recurrent network architectures," 2015.
[17] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," arXiv preprint arXiv:1512.02595, 2015.
[18] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: First results," arXiv preprint arXiv:1412.1602, 2014.
[19] J. T. Geiger, F. Weninger, J. F. Gemmeke, M. Wöllmer, B. Schuller, and G. Rigoll, "Memory-enhanced neural networks and NMF for robust ASR," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 6, pp. 1037–1046, 2014.
[20] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[21] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," arXiv preprint arXiv:1503.04069, 2015.
[22] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 126–130.
[23] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[24] V. Goel, S. Kumar, and W. Byrne, "Segmental minimum Bayes-risk decoding for automatic speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 3, pp. 234–249, 2004.
[25] A. Narayanan and D. Wang, "Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 92–101, 2015.
[26] Z. Chen, S. Watanabe, H. Erdoğan, and J. R. Hershey, "Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks," ISCA, 2015.
