Sparse Multilayer Perceptron for Phoneme Recognition

G.S.V.S. Sivaram, Student Member, IEEE, and Hynek Hermansky, Fellow, IEEE

Abstract—This paper introduces the sparse multilayer perceptron (SMLP), which jointly learns a sparse feature representation and nonlinear classifier boundaries to optimally discriminate multiple output classes. SMLP learns the transformation from the inputs to the targets as in a multilayer perceptron (MLP), while the outputs of one of the internal hidden layers are forced to be sparse. This is achieved by adding a sparse regularization term to the cross-entropy cost and updating the parameters of the network to minimize the joint cost. On the TIMIT phoneme recognition task, SMLP based systems trained on individual speech recognition feature streams perform significantly better than the corresponding MLP based systems. A phoneme error rate of 19.6% is achieved using the combination of SMLP based systems, a relative improvement of 3.0% over the combination of MLP based systems.

Index Terms—Phoneme recognition, sparse features, multilayer perceptron.

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

G.S.V.S. Sivaram and H. Hermansky are affiliated with the ECE Dept., Center for Language and Speech Processing, and Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, USA (phone: +1-410-516-7031; fax: +1-410-516-5566; email: sivaram, [email protected]). The research presented in this paper was partially funded by IARPA BEST program under contract Z857701 and DARPA RATS program under D10PC20015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the IARPA or DARPA.

I. INTRODUCTION

HIERARCHIES of multilayer perceptron (MLP) classifiers have been shown to be useful for acoustic modeling in speech recognition [1], [2], [3], model adaptation [4] and language identification [5]. A hierarchical MLP consists of two MLPs in series which are sequentially trained. The first MLP uses standard acoustic feature vectors to estimate the posterior probabilities of various output classes such as phonemes. The second MLP is then trained on the same targets using long temporal spans of posterior probabilities estimated by the first MLP as inputs.

Sparse feature representations have been shown to have biological underpinnings in the visual area of the mammalian cortex [14], [15], and recently, many pattern classification applications have made use of sparse signal representations [6], [7], [8], [9], [10], [11], [12], [13]. Most of these methods treat the sparse representation as features and train an additional classifier for making decisions. Only in a few instances have sparse representations been optimized in conjunction with the classifier for discriminative classification. For example, a two-class classification problem with a linear or bilinear classifier has been considered in [9]. In a different work, Fisher's linear discrimination criterion with sparsity is used [6].

In this paper, we propose to jointly learn both sparse features and nonlinear classifier boundaries that best discriminate multiple output classes. Specifically, we propose to learn sparse features at the output of a hidden layer of an MLP trained to discriminate multiple output classes. This is achieved by adding a sparse regularization term to the conventional cross-entropy cost between the target values and their predicted values at the output layer. The parameters of the MLP are learned to minimize the joint cost using the standard back-propagation algorithm, modified to take the additional sparse regularization term into consideration. The resultant model is referred to as the sparse multilayer perceptron (SMLP) throughout this paper. Further, under certain conditions the SMLP estimates the Bayesian a posteriori probabilities of the output classes conditioned on the sparse representation.

The SMLP is tested on the TIMIT phoneme recognition task using three different conventional speech recognition features. Estimates of the posterior probabilities of SMLP are refined by training a second MLP on these estimates as in hierarchical estimation of posterior probabilities [2]. The phoneme recognition system used in our experiments is based on a hybrid Hidden Markov Model (HMM)-MLP approach [16], where the posterior probability estimates are converted to scaled likelihoods and then used to model the HMM states. Experimental results indicate the effectiveness of the SMLP, as it consistently outperforms MLP based systems for each feature stream. The combination of SMLP based systems yields a 19.6% phoneme error rate, a relative improvement of 3.0% over the combination of MLP based systems. Preliminary phoneme recognition experiments with SMLP on the PLP feature stream [17] have been reported in [18].
The rest of the paper is organized as follows. Related work is summarized in section II. Section III presents the theory of SMLP. In section IV, we describe the phoneme recognition system and the experimental results achieved by the proposed SMLP classifier. Finally, conclusions are provided in section VI.

II. RELATED WORK

A. MLP based acoustic modeling

It has been shown that an MLP, trained to minimize the cross-entropy cost between outputs and hard targets¹ using a sufficient amount of data, accurately estimates the posterior probabilities of output classes conditioned on the input feature vector in a discriminative manner [19]. This has led to an extensive use of MLPs in state-of-the-art automatic speech recognition systems [16], [20], [21], [22], [23], [24], [25], [26]. In [16], an MLP is used as an alternative to the Gaussian Mixture Model (GMM) to model the HMM states. In another approach, MLP posterior probability estimates are first gaussianized and then modeled using GMMs [21]. They can also be directly modeled using a Dirichlet Mixture Model (DMM) [22]. Alternatively, discriminative features derived from the hidden layer of an MLP [23], [24] are used as features in large vocabulary continuous speech recognition [25], [26]. Further improvements are reported with the hierarchical estimation of the posterior probabilities [1], [2], [3]. Recently, the Restricted Boltzmann Machine (RBM) has also been used for estimating the posterior probabilities for phoneme recognition [27], [28].

¹ A hard target vector consists of all zeros except a one at the index corresponding to the phoneme to which the current input feature vector belongs.

B. Sparse features in machine learning

Sparse features have been successfully used in many machine learning applications such as face recognition [8], handwritten digit recognition [6], [7], [9], and phoneme recognition [10], [12], [13]. Most of the approaches obtain sparse features by expressing a signal as a linear combination of a minimal number of vectors in a (possibly overcomplete) basis [8], [9], [10], [12], [29]. Alternatively, the transformation from input to sparse features can be learned using a trainable network that is optimized to minimize the reconstruction error [7], [13]. Once these sparse features are derived, a separate classifier is then used for making decisions. The overall classification accuracy depends on the choice of the basis, the algorithm for identifying sparse linear combination weights, and the final classification decision algorithm.

The aforementioned works differ in some of these aspects. For example, a predefined basis is used in [6], [8], [11]. Alternatively, learning of a basis is unsupervised in [7], [12], [13] and supervised in [9]. Sparse representation is obtained by solving an ℓ1 norm minimization problem in [8], [12], non-negative matrix factorization in [11], or by solving an optimization problem which has both the reconstruction and class discrimination terms in [6], [9]. The final classifier is residual energy based in [8], [11], linear in [7], [9], a multilayer perceptron in [12], [13], and a support vector machine in [6].

III. THEORY OF SMLP

The notations used in this paper are as follows.
$m$ - number of layers (including input and output layers)
$N_l$ - number of neurons (or nodes) in the $l$th layer
$\phi_l$ - output nonlinearity at the $l$th layer
$x_j^l$ - input to the $j$th neuron in the $l$th layer
$y_j^l = \phi_l(x_j^l)$ - output of the $j$th neuron in the $l$th layer
$w_{ij}^{l-1}$ - weight connecting the $i$th neuron in the $(l-1)$th layer and the $j$th neuron in the $l$th layer
$d_j$ - target of the $j$th neuron in the output layer
$e_j = d_j - y_j^m$ - error of the $j$th neuron in the output layer

The goal of an SMLP classifier is to jointly learn sparse features at the output of its $p$th layer and estimate posterior probabilities of multiple classes at its output layer. In the case of MLP, estimates of the posterior probabilities are typically obtained by minimizing the cross-entropy cost between the output layer values (after the softmax) and the hard targets. We modify this cost function for SMLP as follows.

A. Cost function

The two objectives of the SMLP are
• minimize the cross-entropy cost between the output layer values and the hard targets, and
• force the outputs of the $p$th layer to be sparse for a particular $p \in \{2, 3, ..., m-1\}$.

The instantaneous² cross-entropy cost is

$$ L = -\sum_{j=1}^{N_m} \left[ d_j \log y_j^m + (1 - d_j) \log\left(1 - y_j^m\right) \right]. \tag{1} $$

To obtain the SMLP instantaneous cost function, we add a sparse regularization term to the cross-entropy cost (1), yielding

$$ \tilde{L} = L + \frac{\lambda}{2} \sum_{j=1}^{N_p} \log\left(1 + (y_j^p)^2\right), \tag{2} $$

where λ is a positive scalar controlling the trade-off between the sparsity and the cross-entropy cost. The function $\sum_{j=1}^{N_p} \log\left(1 + (y_j^p)^2\right)$, which is continuous and differentiable everywhere, was successfully used in previous works to obtain a sparse representation [14], [7], [12], [13]. The weights of the SMLP are adjusted to minimize (2), as discussed below.

B. Error back-propagation training
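As a concrete reference point for the derivation in this subsection, the joint cost (2) can be evaluated in a few lines. The following is a minimal Python sketch, not the authors' Quicknet-based implementation; the function names and the numeric values are illustrative assumptions.

```python
import math

def cross_entropy(y_out, d):
    """Instantaneous cross-entropy cost (1) for a single input pattern."""
    return -sum(dj * math.log(yj) + (1.0 - dj) * math.log(1.0 - yj)
                for yj, dj in zip(y_out, d))

def sparse_penalty(y_p):
    """Sum over the layer-p outputs of log(1 + y^2), the regularizer in (2)."""
    return sum(math.log(1.0 + yj * yj) for yj in y_p)

def smlp_cost(y_out, d, y_p, lam):
    """Joint cost (2): cross-entropy plus (lambda / 2) times the sparse penalty."""
    return cross_entropy(y_out, d) + 0.5 * lam * sparse_penalty(y_p)

# Illustrative values only (not taken from the paper).
y_out = [0.7, 0.2, 0.1]       # output layer values, strictly between 0 and 1
d = [1.0, 0.0, 0.0]           # hard target vector
y_p = [0.9, 0.05, 0.0, 0.6]   # outputs of the sparse hidden layer p
print(smlp_cost(y_out, d, y_p, lam=0.05))
```

Setting `lam=0` recovers the plain cross-entropy cost, which corresponds to the baseline MLP.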

Stochastic gradient descent is applied for updating the SMLP weights. The conventional error back-propagation training algorithm is a result of applying the chain rule of calculus to compute the gradient of the cross-entropy cost function (1) with respect to the weights. For training SMLP, the error back-propagation must be modified in order to accommodate the additional sparse regularization term. In the rest of this section, we derive update equations for training the SMLP by minimizing the cost function (2) with respect to weights³ over the training data. Since the learning is based on stochastic gradient descent, the key is to determine the gradient of the cost function (2) with respect to the weights.

² By instantaneous we mean corresponding to a single input pattern.
³ The bias values at any layer can be interpreted as weights connecting an imaginary node in the previous layer, with its output being unity, and all the nodes in the current layer.

1) Gradient of $\tilde{L}$ w.r.t. $y_j^l$: From (1) and (2), $\forall l \in \{p+1, p+2, ..., m\}$, $\forall j \in \{1, 2, ..., N_l\}$,

$$ \frac{\partial \tilde{L}}{\partial y_j^l} = \frac{\partial L}{\partial y_j^l}. \tag{3} $$

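Using (2), for layer $p$, $\forall j \in \{1, 2, ..., N_p\}$,

$$ \frac{\partial \tilde{L}}{\partial y_j^p} = \frac{\partial L}{\partial y_j^p} + \lambda \left( \frac{y_j^p}{1 + (y_j^p)^2} \right). \tag{4} $$

Using (2) and the chain rule of calculus, $\forall (l-1) \in \{2, 3, ..., p-1\}$, $\forall i \in \{1, 2, ..., N_{l-1}\}$,

$$ \frac{\partial \tilde{L}}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \left( \frac{\partial y_j^l}{\partial x_j^l} \right) \left( \frac{\partial x_j^l}{\partial y_i^{l-1}} \right) = \sum_{j=1}^{N_l} \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \phi_l'\left(x_j^l\right) w_{ij}^{l-1}. \tag{5} $$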

The above equations (3), (4) and (5) indicate that the gradients of $\tilde{L}$ w.r.t. $y_j^l$ can be computed from the gradients of $L$ w.r.t. $y_j^l$. Specifically, we need the gradients of $L$ w.r.t. $y_j^l$, $\forall l \in \{p, p+1, ..., m\}$, $\forall j \in \{1, 2, ..., N_l\}$ in order to compute gradients of $\tilde{L}$ w.r.t. $y_j^l$, $\forall l \in \{2, 3, ..., m\}$, $\forall j \in \{1, 2, ..., N_l\}$. The computation of these gradients is described in Appendix A.

2) Gradient of $\tilde{L}$ w.r.t. $w_{ij}^{l-1}$: By definition, $w_{ij}^{l-1}$ denotes the weight connecting the $i$th neuron in the $(l-1)$th layer and the $j$th neuron in the $l$th layer. Thus by using the chain rule,

$$ \frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} = \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \left( \frac{\partial y_j^l}{\partial x_j^l} \right) \left( \frac{\partial x_j^l}{\partial w_{ij}^{l-1}} \right) = \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \phi_l'\left(x_j^l\right) y_i^{l-1}. \tag{6} $$

3) Update equations: SMLP weights are updated using stochastic gradient descent. The gradient of the cost function (6) with respect to a particular weight is accumulated for several input patterns and then the weight is updated using

$$ w_{ij}^{l-1} \leftarrow w_{ij}^{l-1} - \eta \left\langle \frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} \right\rangle, \tag{7} $$

where $\eta$ is a small positive learning rate, and $\left\langle \partial \tilde{L} / \partial w_{ij}^{l-1} \right\rangle$ is the accumulated value of the gradient.
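The recursion in (3) to (7) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the modified Quicknet code: every layer (including the output) uses a sigmoid so that the per-neuron chain rule of (5) and the gradient (9) of the appendix apply directly, whereas the systems in this paper use a softmax output; biases are omitted; and all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, x):
    """Return [y^1, ..., y^m]; weights[k] connects layer k+1 to layer k+2."""
    ys = [x]
    for W in weights:
        ys.append(sigmoid(ys[-1] @ W))
    return ys

def joint_cost(weights, x, d, p, lam):
    """Joint cost (2) for one pattern, with layer p (1-indexed) regularized."""
    ys = forward(weights, x)
    y_out, y_p = ys[-1], ys[p - 1]
    ce = -np.sum(d * np.log(y_out) + (1 - d) * np.log(1 - y_out))
    return ce + 0.5 * lam * np.sum(np.log1p(y_p ** 2))

def gradients(weights, x, d, p, lam):
    """Gradient of the joint cost w.r.t. each weight matrix."""
    ys = forward(weights, x)
    y_out = ys[-1]
    g_y = (y_out - d) / (y_out * (1 - y_out))       # gradient w.r.t. y^m, as in (9)
    grads = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        y_l, y_prev = ys[k + 1], ys[k]              # layers k+2 and k+1
        delta = g_y * y_l * (1 - y_l)               # dL~/dy^l times phi'(x^l)
        grads[k] = np.outer(y_prev, delta)          # weight gradient (6)
        g_y = weights[k] @ delta                    # back-propagate to y^{l-1}, (5)
        if k + 1 == p:                              # y_prev is the sparse layer
            g_y = g_y + lam * y_prev / (1 + y_prev ** 2)   # extra term of (4)
    return grads

def sgd_step(weights, batch, p, lam, eta):
    """Accumulate gradients over several patterns, then apply update (7)."""
    acc = [np.zeros_like(W) for W in weights]
    for x, d in batch:
        for a, g in zip(acc, gradients(weights, x, d, p, lam)):
            a += g
    return [W - eta * a for W, a in zip(weights, acc)]

# Toy four layer network (m = 4) with a sparse first hidden layer (p = 2).
rng = np.random.default_rng(0)
sizes = [5, 5, 4, 3]
weights = [rng.normal(scale=0.5, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
batch = [(rng.normal(size=5), np.array([0.0, 1.0, 0.0]))]
weights = sgd_step(weights, batch, p=2, lam=0.1, eta=0.01)
```

In this sketch the regularization term changes only the gradient of the weights feeding layer p and below, while the weights above layer p receive the same gradient as in a plain MLP.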

C. SMLP as a posterior probability estimator

The number of input and output nodes of the SMLP is set to be equal to the dimensionality of its input acoustic feature vector and the number of output phoneme classes, respectively. Softmax nonlinearity is used at its output layer, and weights are adjusted to minimize (2) when the hard targets are being used.

Note from the equations (3), (4), (5) and (6) that the sparse regularization term affects the update of only those weights $w_{ij}^l$, $\forall l \in \{1, 2, ..., p-1\}$. This implies that the weights $w_{ij}^l$, $\forall l \in \{p, p+1, ..., m-1\}$ can be adjusted to minimize the cross-entropy term of (2) without affecting the sparse regularization term. If $p < m-1$ and one of the hidden layers between the $p$th and $m$th layers is sigmoidal (nonlinear), then the $p$th layer outputs can be nonlinearly transformed to the SMLP outputs. Therefore, in such a case, SMLP estimates the posterior probabilities of output classes conditioned on the $p$th layer outputs (sparse representation). This follows from the fact that an MLP with a single nonlinear hidden layer estimates the posterior probabilities of output classes conditioned on the input features [19], and SMLP outputs are independent of the inputs given the outputs of the $p$th layer.

IV. EXPERIMENTAL RESULTS

A. Database

Phoneme recognition experiments are conducted on the TIMIT database [30]. It consists of 630 speakers with 10 utterances per speaker sampled at 16 kHz. The two SA dialect sentences per speaker are excluded from the setup as they are identical across all the speakers. The original TIMIT train and test sets consist of 462 and 168 speakers respectively [30]. We further divide the original train set into training and validation sets having 425 and 37 speakers, and keep the original test set unchanged. Thus in all our experiments, the training, validation and test sets consist of 3400, 296 and 1344 utterances from 425, 37 and 168 speakers, respectively.

B. Feature streams

Speech recognition features are usually extracted from a two-dimensional representation of speech such as the spectrogram. Depending on the manner in which features are derived, they can be broadly classified into spectral [17], [31], temporal [32], [33] or spectro-temporal [34], [35], [36] features. Alternatively, discriminative features derived using an MLP [21], [23], [24] are also extensively used in large vocabulary continuous speech recognition [25], [26].

To test the proposed SMLP classifier, we developed systems using three different feature streams, namely PLP cepstral coefficients [17], FDLP temporal features [33] and MLDA spectro-temporal features [36]. These features are extracted for every 10 ms of speech, and they are normalized for speaker specific mean and variance. A detailed description of each feature stream is provided below.

1) PLP cepstral coefficients: Short Time Fourier Transform (STFT) is applied on the speech signal with an analysis window of length 25 ms and a frame shift of 10 ms.
The squared magnitude values of the STFT output are then projected on a set of frequency weights which are equally spaced on the Bark frequency scale to obtain the spectral energies in various critical bands. Nonlinear compressions such as equal loudness and cubic root are applied for reducing the dynamic range. The resultant spectral envelopes are smoothed by the twelfth order linear prediction analysis [17]. The top 13 cepstral coefficients are concatenated with the corresponding delta and delta-delta features to obtain a 39-dimensional feature vector. A nine frame context of these vectors is used as the input PLP feature stream.

2) FDLP temporal features: Speech is transformed to the frequency domain by applying discrete cosine transform (DCT) on the full utterance. The full band DCT signal is windowed into critical band DCTs and linear prediction analysis is done to obtain the smooth sub-band temporal envelopes. These temporal envelopes are passed through nonlinearities such as logarithmic and adaptive compression loops. The resultant compressed sub-band envelopes are divided into 200 ms segments with a shift of 10 ms. DCT is applied on each segment to derive the FDLP modulation spectra. The first 14 modulation frequencies are concatenated from each sub-band to form the FDLP temporal feature stream [33].

3) MLDA spectro-temporal features: Spectro-temporal patterns corresponding to each phoneme are obtained from the spectro-temporal (log critical band energies) representation of speech. A set of spectro-temporal discriminative patterns are designed using modified linear discriminant analysis (MLDA) to separate each phoneme from the rest of the phonemes. Projections of a given spectro-temporal patch on these discriminative patterns are concatenated to form the MLDA feature stream [36].

[Figure 1: block diagram. Acoustic features feed an SMLP (input, sparse hidden outputs, hidden layer-2, output) that produces 3-state phoneme posterior probabilities; a temporal context of 230 ms (23 frames) of these posteriors feeds an MLP (input, hidden layer, output) that produces single state phoneme posterior probabilities.]
Fig. 1. SMLP based hierarchical estimation of posterior probabilities of phonemes. Though both the networks are fully connected, only a portion of the connections are shown for clarity.

C. System description

The phoneme recognition system in our experiments is based on a hybrid HMM/MLP approach, where the posterior probability estimates of various phonemes are used to model the HMM states [16]. In all our experiments, posterior probabilities are estimated in a hierarchical manner for each acoustic feature stream as shown in Fig. 1. Initially, a four layer SMLP is trained to estimate the 3-state phoneme posterior probabilities. Subsequently, another three layer MLP is trained on a long temporal span of these posteriors to estimate the single state phoneme posterior probabilities. Both these networks are initialized randomly using uniform noise and trained using back-propagation. We have modified the Quicknet package [37] (software for MLP training) to perform SMLP training. For each feature stream, the value of λ in the SMLP cost function (2) is chosen to minimize the phoneme error rate (PER) on the validation data. Fig. 2 shows the effect of λ on PER of the validation data for various feature streams.
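Both networks in the hierarchy operate on a temporal context of frames (for instance, the nine frame context of 39-dimensional PLP vectors described in section IV-B1). A minimal NumPy sketch of speaker-level normalization and context stacking is given below. The function names are hypothetical, the random input stands in for real features, and the edge handling (repeating the first and last frame) is an assumption, since the text does not specify how utterance boundaries are treated.

```python
import numpy as np

def normalize_per_speaker(frames):
    """Zero-mean, unit-variance normalization over all frames of one speaker."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return (frames - mean) / np.maximum(std, 1e-8)

def stack_context(frames, context):
    """Concatenate every frame with its neighbors (context must be odd)."""
    half = context // 2
    # Assumption: pad the utterance edges by repeating the first/last frame.
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    windows = [padded[k:k + len(frames)] for k in range(context)]
    return np.concatenate(windows, axis=1)

# 100 frames of 39-dimensional vectors (random placeholders, not real PLP).
plp = np.random.default_rng(0).normal(size=(100, 39))
x1 = stack_context(normalize_per_speaker(plp), context=9)
print(x1.shape)   # (100, 351)
```

The same helper could in principle be applied to the posterior vectors of the first network to form the long temporal span used by the second MLP.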
1) Estimation of 3-state phoneme posterior probabilities: The 61 hand-labeled phone symbols are mapped to 49 phoneme classes by treating each of the following sets of phonemes as a single class: {/tcl/, /pcl/, /kcl/}, {/gcl/, /dcl/, /bcl/}, {/h#/, /pau/}, {/eng/, /ng/}, {/axr/, /er/}, {/axh/, /ah/}, {/ux/, /uw/}, {/nx/, /n/}, {/hv/, /hh/}, and {/em/, /m/}.
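The class merging listed above can be written as a small lookup table. In the sketch below, the groups are copied from the text, while the choice of the first label of each group as the class name is an arbitrary assumption made for illustration.

```python
# Groups of hand-labeled phone symbols that are treated as a single class.
MERGED_GROUPS = [
    ("tcl", "pcl", "kcl"),
    ("gcl", "dcl", "bcl"),
    ("h#", "pau"),
    ("eng", "ng"),
    ("axr", "er"),
    ("axh", "ah"),
    ("ux", "uw"),
    ("nx", "n"),
    ("hv", "hh"),
    ("em", "m"),
]

# Assumption: the first label of each group names the merged class.
PHONE_TO_CLASS = {label: group[0] for group in MERGED_GROUPS for label in group}

def to_class(label):
    """Map a phone symbol to its class; symbols outside the groups are unchanged."""
    return PHONE_TO_CLASS.get(label, label)

# Each group of size n removes n - 1 symbols from the original 61.
removed = sum(len(group) - 1 for group in MERGED_GROUPS)
print(61 - removed)   # 49
```

The count confirms the arithmetic in the text: the ten groups remove 12 symbols, leaving 49 classes.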

As shown in Fig. 1, the SMLP used for estimating the 3-state phoneme posterior probabilities consists of four layers (m = 4): an input layer to receive a given feature stream, two hidden layers with a sigmoid nonlinearity, and an output layer with a softmax nonlinearity. The number of nodes in the input and output layers is set to be equal to the dimensionality of the input feature vector and the number of phoneme states (i.e., 49 x 3 = 147), respectively. The outputs of the first hidden layer (p = 2) are forced to be sparse, with the number of nodes in it being the same as that of the input layer. The number of nodes in the second hidden layer is chosen to be 1000.

In the first pass of SMLP classifier training, 3-state hard phoneme targets are obtained by segmenting each phoneme in the training data equally into three states, i.e., start, middle and end. This classifier is retrained in a second pass using the hard targets corresponding to the best state alignment obtained by applying the Viterbi algorithm on 3-state posterior probability estimates of the first pass. Frame classification accuracy on the validation set is used to control the learning rate and to terminate training.

In order to gauge the effect of the sparse regularization term, an identically configured four layer MLP (i.e., λ = 0) is also trained to estimate 3-state phoneme posteriors. For an additional comparison, we also estimate the 3-state phoneme posteriors using a conventional three layer MLP which has a sigmoid nonlinearity at the hidden layer and a softmax nonlinearity at the output layer. The number of hidden layer nodes in this system is chosen such that the total number of parameters approximately matches that of the SMLP.

2) Hierarchical estimation of posterior probabilities: Better estimates of the posterior probabilities often improve the performance of the phoneme recognition system.
One way to enhance these estimates is to train a second MLP on a relatively larger context of posterior probabilities estimated from the first classifier. This is known as the hierarchical estimation of posterior probabilities and has been shown to be useful for speech recognition in [1], [2]. As shown in Fig. 1, 3-state phoneme posterior probability estimates are mapped to single state phoneme posterior probability estimates by training an MLP which operates on a context of 230 ms or 23 posterior probability vectors. Its hidden layer consists of 3500 nodes with a sigmoid nonlinearity, and its output layer consists of 49 nodes with a softmax nonlinearity.

3) Hybrid HMM decoding: The 49 phoneme classes are mapped to 39 phoneme classes for decoding⁴ [38]. The posterior probabilities of phoneme classes are converted to scaled likelihoods by dividing them by the corresponding prior probabilities of phonemes obtained from the training data. A 3-state HMM (connected from left to right) with equal self-transition and state transition probabilities is used to model each phoneme. The emission likelihood of each state is set to be the scaled likelihood. A bigram phonotactic language model is used in all the experiments. Finally, the Viterbi algorithm is applied for decoding the phoneme sequence. The PER is obtained by comparing the decoded phoneme sequence against the reference sequence. While evaluating the performance on the test set, the language model scaling factor is chosen to minimize the PER of the validation data.

D. System combination

Hierarchically estimated posterior probabilities (section IV-C2) corresponding to each feature stream are combined using the Dempster-Shafer (DS) theory of evidence [39]. The DS theory is a generalization of the Bayesian probability framework which allows characterization of ignorance through basic probability assignments (BPAs). The individual posterior streams are converted to BPAs which are then combined using the DS orthogonal sum. The resultant BPAs are renormalized to obtain the estimates of posterior probabilities, which are decoded as described in section IV-C3.

E. Results

Table I shows the PER of the proposed SMLP based hierarchical hybrid system and the baseline MLP based hierarchical hybrid systems for various feature streams on the TIMIT test set. As described earlier in section IV-C1, the proposed and baseline systems differ only in the way 3-state phoneme posteriors are estimated. Results indicate that the four layer MLP based system performs better than the conventional three layer MLP based system for each feature stream. Moreover, the SMLP based system outperforms the baseline four layer MLP based system for each feature stream. This improved performance can be attributed to the sparse regularization term.

[Figure 2: plot of PER (vertical axis, roughly 20.0 to 22) against Lambda (horizontal axis, ticks at 0, 0.01, 0.05 and 0.08), with one curve each for PLP, FDLP and MLDA.]
Fig. 2. Phoneme error rate of the validation data as a function of λ for various feature streams.

TABLE I
PER (IN %) ON TIMIT TEST SET FOR VARIOUS ACOUSTIC FEATURE STREAMS USING HIERARCHY OF MULTILAYER PERCEPTRONS. LAST COLUMN INDICATES THE RESULTS OF FEATURE STREAM COMBINATION AT THE HIERARCHICAL POSTERIOR LEVEL USING THE DEMPSTER-SHAFER THEORY OF EVIDENCE.

                   PLP     FDLP    MLDA    PLP+FDLP+MLDA
MLP (3 layers)     22.9    23.2    22.8    20.5
MLP (4 layers)     22.6    22.8    22.4    20.2
SMLP (4 layers)    21.9    22.1    21.9    19.6

⁴ The appropriate subsets of 49 phoneme posterior probability estimates are summed to get 39 phoneme probability estimates.
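The two posterior manipulations used before decoding (section IV-C3), namely summing the posteriors of merged classes and dividing by class priors to obtain scaled likelihoods, can be sketched as follows. The class names, groups and prior values here are toy placeholders rather than the actual 49-to-39 mapping of [38], and applying the merge before the prior division is an assumption about the order of operations.

```python
def merge_posteriors(posteriors, groups):
    """Sum the posteriors of the classes that belong to each merged class."""
    return {name: sum(posteriors[c] for c in members)
            for name, members in groups.items()}

def scaled_likelihoods(posteriors, priors):
    """Divide each class posterior by the prior probability of that class."""
    return {c: posteriors[c] / priors[c] for c in posteriors}

# Toy example with hypothetical classes and values.
posteriors = {"a": 0.5, "b": 0.25, "c": 0.25}
groups = {"ab": ["a", "b"], "c": ["c"]}
priors = {"ab": 0.5, "c": 0.5}   # would be estimated from the training data

merged = merge_posteriors(posteriors, groups)
print(scaled_likelihoods(merged, priors))   # {'ab': 1.5, 'c': 0.5}
```

A scaled likelihood above one indicates that the network assigns the class more probability than its prior alone would.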

To exploit the complementary nature of these acoustic feature streams, they are combined at the hierarchical posterior probability level as described in section IV-D. The system combination results are shown in the last column of Table I. It can be observed that the combination of SMLP based systems yields a PER of 19.6%, a relative improvement of 3.0% over the combination of four layer MLP (i.e., λ = 0) based systems. On the TIMIT core test set consisting of 192 utterances (a subset of the test set provided by LDC [30]), we obtain a PER of 20.7% using the combination of SMLP based systems. These performances are comparable to the existing state-of-the-art systems⁵.

F. Analysis

First, we quantify the sparsity of the $p$th hidden layer outputs using the following measure (κ) [41]:

$$ \kappa = \frac{1}{\sqrt{N_2} - 1} \left( \sqrt{N_2} - \frac{\sum_{i=1}^{N_2} |y_i^p|}{\sqrt{\sum_{i=1}^{N_2} (y_i^p)^2}} \right). \tag{8} $$
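The measure (8) can be computed directly from a vector of hidden layer outputs. The sketch below is a plain Python illustration with a hypothetical function name; it assumes a vector with at least two entries that is not all zeros, since the measure is undefined otherwise.

```python
import math

def sparsity(y):
    """Sparsity measure (8) for one vector of layer outputs."""
    n = len(y)
    l1 = sum(abs(v) for v in y)               # sum of magnitudes
    l2 = math.sqrt(sum(v * v for v in y))     # Euclidean norm
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)

print(sparsity([0.0, 0.0, 1.0, 0.0]))   # 1.0, a single active node
print(sparsity([0.5, 0.5, 0.5, 0.5]))   # 0.0, all nodes equally active
```

The two extreme cases bracket the range of the measure; any other nonzero vector falls between them.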

It is to be noted that $y_i^p$ represents the output of node $i$ in layer $p$ of an SMLP. Furthermore, 0 ≤ κ ≤ 1, and the value of κ is one for maximally sparse and close to zero for minimally sparse representations. Table II lists the average κ value of the first hidden layer outputs over the validation data for various phoneme recognition systems. As expected, SMLP based systems have significantly higher κ values than four layer MLP systems, which indicates the effectiveness of the sparse regularization term.

TABLE II
AVERAGE MEASURE OF SPARSITY OF THE FIRST HIDDEN LAYER OUTPUTS OF SMLP AND FOUR LAYER MLP FOR VARIOUS FEATURE STREAMS.

                   PLP      FDLP     MLDA
MLP (4 layers)     0.275    0.278    0.282
SMLP (4 layers)    0.496    0.552    0.540

Second, we experimentally verify whether sparse features tend to be more linearly separable than non-sparse counterparts. After training the hierarchical system for each feature stream, the first hidden layer outputs of the SMLP or four layer MLP classifier are used as input features for training a single layer perceptron (linear classifier) to estimate the 3-state phoneme posterior probabilities. The second MLP in the hierarchy remains unchanged. Table III shows the PER of the resulting hierarchical system for various acoustic feature streams. The linear classifier is able to model the SMLP features better than it models the MLP features.

TABLE III
PER (IN %) ON TIMIT TEST SET FOR VARIOUS ACOUSTIC FEATURE STREAMS WHEN THE 3-STATE PHONEME POSTERIORS ARE OBTAINED USING A SINGLE LAYER PERCEPTRON TRAINED ON THE FIRST HIDDEN LAYER OUTPUTS OF SMLP OR MLP CLASSIFIER.

                                              PLP     FDLP    MLDA
Hidden layer features from MLP (4 layers)     26.9    27.0    26.5
Hidden layer features from SMLP (4 layers)    25.0    25.5    24.6

Finally, results using only a single multilayer perceptron (without hierarchy) are analyzed. A single multilayer perceptron (SMLP or MLP) is trained directly to estimate the single state phoneme posterior probabilities, which are decoded as described in section IV-C3. The value of λ in the SMLP cost function is optimized on the validation data. Table IV summarizes the PER of various feature streams. It can be observed from this table that the SMLP system consistently outperforms the corresponding baseline MLP systems.

TABLE IV
PER (IN %) ON TIMIT TEST SET USING A SINGLE MULTILAYER PERCEPTRON (WITHOUT HIERARCHY).

                   PLP     FDLP    MLDA
MLP (3 layers)     27.2    26.3    27.1
MLP (4 layers)     27.3    27.5    27.0
SMLP (4 layers)    26.6    25.8    26.0

⁵ Note that some of the TIMIT phoneme recognition systems use part of the original test set as a validation set [10], [27], [28]. However, as mentioned earlier, we kept the original test set unchanged as in [40].

V. DISCUSSION

In order to estimate posterior probabilities in a hierarchical manner, an MLP is trained on a long temporal span of posterior probabilities estimated by the first SMLP as shown in Fig. 1. We noticed that replacing the second MLP with an SMLP yields comparable results. This may be attributed to the sparsity of the input posterior probability vectors. In other words, the gain is limited when the input to the SMLP is already sparse.

The weights of the network are initialized randomly in this work. Another aspect that needs to be explored is the usefulness of unsupervised auto-encoder pre-training when a sufficiently large amount of data is available for training.

VI. CONCLUSIONS

In this paper, we introduced the theory of SMLP, in which the outputs of one of its hidden layers are forced to be sparse while learning the mapping from the inputs to the targets. We proposed a new cost function and derived the update equations for training the SMLP. Further, we experimentally showed that an SMLP based system outperforms the state-of-the-art MLP based phoneme recognition system on TIMIT and achieves a PER of 19.6%.

APPENDIX
STANDARD ERROR BACK-PROPAGATION

Gradients of $L$ w.r.t. $y_j^m$ can be computed from the equation (1) as follows, $\forall j \in \{1, 2, ..., N_m\}$,

$$ \frac{\partial L}{\partial y_j^m} = \frac{y_j^m - d_j}{y_j^m \left(1 - y_j^m\right)}. \tag{9} $$

The standard error back-propagation algorithm expresses gradients of $L$ w.r.t. $y_j^{m-1}$ in terms of the previously computed gradients in (9). In general, given gradients of $L$ w.r.t. $y_j^l$, $\forall j \in \{1, 2, ..., N_l\}$, the gradient of $L$ w.r.t. $y_i^{l-1}$ for any $i \in \{1, 2, ..., N_{l-1}\}$ is given by (similar to (5)),

$$ \frac{\partial L}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \left( \frac{\partial L}{\partial y_j^l} \right) \left( \frac{\partial y_j^l}{\partial x_j^l} \right) \left( \frac{\partial x_j^l}{\partial y_i^{l-1}} \right) = \sum_{j=1}^{N_l} \left( \frac{\partial L}{\partial y_j^l} \right) \phi_l'\left(x_j^l\right) w_{ij}^{l-1} = \sum_{j=1}^{N_l} \left( \frac{\partial L}{\partial y_j^l} \right) y_j^l \left(1 - y_j^l\right) w_{ij}^{l-1}, $$

where

$$ y_j^l = \phi_l\left(x_j^l\right) = \frac{1}{1 + \exp(-x_j^l)}. $$
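The last equality in the appendix relies on the derivative of the sigmoid. For completeness, the step can be written out as below; the final line, which combines it with (9), holds for a sigmoidal output neuron and is included here only as an illustrative special case, not as a statement about the softmax output layer used in the experiments.

```latex
\begin{align*}
\phi_l'(x_j^l)
  &= \frac{d}{dx_j^l}\,\frac{1}{1 + \exp(-x_j^l)}
   = \frac{\exp(-x_j^l)}{\left(1 + \exp(-x_j^l)\right)^2} \\
  &= \frac{1}{1 + \exp(-x_j^l)} \cdot \frac{\exp(-x_j^l)}{1 + \exp(-x_j^l)}
   = y_j^l \left(1 - y_j^l\right), \\[4pt]
\frac{\partial L}{\partial x_j^m}
  &= \frac{\partial L}{\partial y_j^m}\,\phi_m'(x_j^m)
   = \frac{y_j^m - d_j}{y_j^m (1 - y_j^m)} \; y_j^m (1 - y_j^m)
   = y_j^m - d_j .
\end{align*}
```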

ACKNOWLEDGMENT Authors would like to thank Samuel Thomas and Balakrishnan Varadarajan for sharing some scripts used in the baseline phoneme recognition system. R EFERENCES [1] J. Pinto, B. Yegnanarayana, H. Hermansky, and M. Magimai.-Doss, “Exploiting contextual information for improved phoneme recognition,” Proc. of INTERSPEECH-2007, pp. 1817–1820. [2] J. Pinto, G.S.V.S. Sivaram, M. Magimai.-Doss, H. Hermansky and H. Bourlard, “Analyzing MLP Based Hierarchical Phoneme Posterior Probability Estimator,” IEEE Transactions on Audio, Speech, and Language Processing, 2010. [3] H. Ketabdar and H. Bourlard, “Enhanced Phone Posteriors for Improving Speech Recognition Systems,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1094–1106, Aug. 2010. [4] J. Pinto, M. Magimai.-Doss, and H. Bourlard, “MLP Based Hierarchical System for Task Adaptation in ASR,” Proc. of IEEE workshop on Automatic Speech Recognition and Understanding, pp. 365–370, 2009. [5] D. Imseng, M. Magimai.-Doss, and H. Bourlard, “Hierarchical Multilayer Perceptron based Language Identification,” Proc. of INTERSPEECH2010. [6] K. Huang and S. Aviyente, “Sparse representation for signal classification,” Advances in neural information processing systems 19, pp. 609– 616, 2006. [7] M. Ranzato, Y. Boureau, and Y. LeCun, “Sparse Feature Learning for Deep Belief Networks,” Advances in neural information processing systems 20, pp. 1185–1192, 2007. [8] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2008.

7

[9] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised dictionary learning,” Advances in neural information processing systems 21, pp. 1033–1040, 2008.
[10] T.N. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, “Bayesian compressive sensing for phonetic classification,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4370–4373, 2010.
[11] J.F. Gemmeke and T. Virtanen, “Noise robust exemplar-based connected digit recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4546–4549, 2010.
[12] G.S.V.S. Sivaram, S.K. Nemala, M. Elhilali, T. Tran, and H. Hermansky, “Sparse Coding for Speech Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4346–4349, 2010.
[13] G.S.V.S. Sivaram, G. Sriram, and H. Hermansky, “Sparse Autoassociative Neural Networks: Theory and Application to Speech Recognition,” Proc. of INTERSPEECH-2010.
[14] B.A. Olshausen and D.J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?,” Vision research, vol. 37, no. 23, pp. 3311–3325, 1997.
[15] H. Lee, C. Ekanadham, and A. Ng, “Sparse deep belief net model for visual area V2,” Advances in neural information processing systems 20, 2007.
[16] H. Bourlard and N. Morgan, “Connectionist speech recognition: a hybrid approach,” Springer, 1994.
[17] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[18] G.S.V.S. Sivaram and H. Hermansky, “Multilayer Perceptron with Sparse Hidden Outputs for Phoneme Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
[19] M.D. Richard and R.P. Lippmann, “Neural network classifiers estimate Bayesian a posteriori probabilities,” Neural computation, vol. 3, no. 4, pp. 461–483, 1991.
[20] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf, P. Jain, H. Hermansky, D.P.W. Ellis, G. Doddington, B. Chen, O. Cretin, H. Bourlard, and M. Athineos, “Pushing the Envelope - Aside,” IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 81–88, Sept. 2005.
[21] H. Hermansky, D.P.W. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1635–1638, 2000.
[22] Balakrishnan V., G.S.V.S. Sivaram, and S. Khudanpur, “Dirichlet Mixtures to Model Neural Network Posteriors in the HMM Framework,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
[23] B. Chen, Q. Zhu, and N. Morgan, “Learning long-term temporal features in LVCSR using neural networks,” Proc. of INTERSPEECH-2004, pp. 925–928.
[24] F. Grézl, M. Karafiát, S. Kontár, and J. Cernocky, “Probabilistic and bottle-neck features for LVCSR of meetings,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 757–760, 2007.
[25] Q. Zhu, B. Chen, N. Morgan, and A. Stolcke, “On using MLP features in LVCSR,” Proc. of INTERSPEECH-2004, pp. 921–924.
[26] J. Park, F. Diehl, M.J.F. Gales, M. Tomalin, and P.C. Woodland, “Training and adapting MLP features for Arabic speech recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4461–4464, 2009.
[27] G.E. Dahl, M. Ranzato, A. Mohamed, and G. Hinton, “Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine,” Advances in neural information processing systems, 2010.
[28] A. Mohamed and G. Hinton, “Phone recognition using Restricted Boltzmann Machines,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4354–4357, 2010.
[29] R. Grosse, R. Raina, H. Kwong, and A.Y. Ng, “Shift-invariant sparse coding for audio classification,” Conference on Uncertainty in Artificial Intelligence (UAI), 2007.
[30] http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1
[31] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[32] B.E.D. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech Communication, vol. 25, no. 1-3, pp. 117–132, 1998.

[33] S. Ganapathy, S. Thomas, and H. Hermansky, “Modulation Frequency Features For Phoneme Recognition In Noisy Speech,” Journal of Acoustical Society of America - Express Letters, vol. 125, no. 1, pp. 8–12, Jan 2009.
[34] M. Kleinschmidt and D. Gelbart, “Improving word accuracy with Gabor feature extraction,” Proc. of ICSLP-2002, pp. 25–28.
[35] S. Zhao and N. Morgan, “Multi-stream spectro-temporal features for robust speech recognition,” Proc. of INTERSPEECH-2008, pp. 898–901.
[36] N. Mesgarani, G.S.V.S. Sivaram, S.K. Nemala, M. Elhilali, and H. Hermansky, “Discriminant Spectrotemporal Features for Phoneme Recognition,” Proc. of INTERSPEECH-2009, pp. 2983–2986.
[37] “The ICSI Quicknet Software Package,” Available: http://www.icsi.berkeley.edu/Speech/qn.html
[38] K.F. Lee and H.W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989.
[39] F. Valente and H. Hermansky, “Combination of acoustic classifiers based on Dempster-Shafer theory of evidence,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1129–1132, 2007.
[40] P. Schwarz, P. Matejka, and J. Cernocky, “Hierarchical structures of neural networks for phoneme recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006.
[41] P.O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.

G.S.V.S. Sivaram is a PhD student in the Department of Electrical and Computer Engineering and a research assistant in the Center for Language and Speech Processing at the Johns Hopkins University in Baltimore, Maryland. He received the M.E. degree in Signal Processing from the Indian Institute of Science, Bangalore, India in 2006.
From September 2006 to August 2007, he was a software engineer at Muvee Technologies, Singapore, where he worked on video transitions and effects. From September 2007 to January 2009, he was a research assistant at the IDIAP Research Institute, Switzerland, where he worked on automatic speech recognition. His research interests include acoustic modeling for speech recognition, speaker verification, and machine learning.

Hynek Hermansky is a Professor of Electrical and Computer Engineering and the Interim Director of the Center for Language and Speech Processing at the Johns Hopkins University in Baltimore, Maryland. He is also a Professor at the Brno University of Technology, Czech Republic, an Adjunct Professor at the Oregon Health and Sciences University, Portland, Oregon, and an External Fellow at the International Computer Science Institute at Berkeley, California. He is a Fellow of the IEEE for invention and development of perceptually-based speech processing methods, is in charge of plenary sessions at the upcoming 2011 ICASSP in Prague, and was the Technical Chair at the 1998 ICASSP in Seattle and an Associate Editor for IEEE Transactions on Speech and Audio Processing. Further, he is a Member of the Editorial Board of Speech Communication, holds 6 US patents, and has authored or co-authored over 200 papers in reviewed journals and conference proceedings. He has been working in speech processing for over 30 years, previously as a Director of Research at the IDIAP Research Institute, Martigny, and an Adjunct Professor at the Swiss Federal Institute of Technology in Lausanne, Switzerland; a Professor and Director of the Center for Information Processing at OHSU, Portland, Oregon; a Senior Member of Research Staff at U S WEST Advanced Technologies in Boulder, Colorado; a Research Engineer at Panasonic Technologies in Santa Barbara, California; and a Research Fellow at the University of Tokyo. He holds a Dr.Eng. degree from the University of Tokyo and a Dipl. Ing. degree from Brno University of Technology, Czech Republic. His main research interests are in acoustic processing for speech recognition.