AUDIO-VISUAL SPEECH INTEGRATION USING COUPLED HIDDEN MARKOV MODELS FOR CONTINUOUS SPEECH RECOGNITION

Subramanya Amarnag, Sabri Gurbuz, Eric Patterson and John N. Gowdy
Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA
Email: {asubram, sabrig, epatter, jgowdy}@clemson.edu

Abstract

In recent years, considerable interest has developed in the use of visual information for Automatic Speech Recognition (ASR), making ASR systems bimodal in nature. The advantage of using visual features for ASR rests on the fact that visual features are far less susceptible to additive noise than their audio counterparts. In this paper we propose an audio-visual fusion technique that uses a Coupled Hidden Markov Model (CHMM) together with the audio SNR and noise type for audio-visual integration. The properties of the CHMM allow us to model the asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time. Our experimental results indicate that a CHMM system trained with knowledge of the SNR and the noise type outperforms the conventional Multistream Hidden Markov Model (MSHMM) by as much as 8% at an audio SNR of 6 dB.

1. Introduction

Automatic Speech Recognition (ASR) systems have undergone many advances over the last few decades. In recent years the field of Audio-Visual Speech Recognition (AVSR) has received much attention. An audio-visual speech recognizer uses the visual modality to aid the conventional audio modality and thereby improve the performance of ASR systems. There is evidence that humans use visual information in this way [1]: for example, a listener in an airport, where heavy background noise almost completely 'masks' the audio information, tends to look at the speaker's lips for additional cues to understand what is being said. Audio-visual speech recognition (AV-ASR) systems have been shown to outperform audio-only ASR systems. However, there are essentially two problems associated with AV-ASR systems. First, we need algorithms that extract lip information in real time, and second, we need to integrate the audio and visual modalities to obtain a working AV-ASR system [2]. A number of algorithms have been developed to solve the first problem [3]; the second problem, however, is still open to research.

In AV-ASR systems the audio and visual modalities carry both complementary and supplementary information; that is, there are times when we over-define the integration problem by relying on both the audio and visual modalities, and times when we under-specify it by relying on only one modality. Another interesting issue in audio-visual speech integration is that the audio and visual modalities are inherently both 'synchronous and asynchronous'. For example, it has been shown that at the onset of speech the audio and visual modalities tend to be asynchronous [4]. These issues make audio-visual speech integration (AVSI) a very interesting and challenging problem. In general there have been two approaches to AVSI: early integration and late integration [2]; most other approaches are variations of these two. The late integration approach assumes that the two modalities are asynchronous (which is not strictly true) and consists of two recognizers, one for the audio stream and one for the video stream. The early integration approach assumes that the audio and visual streams are synchronous [5]. In most early integration systems the visual features are interpolated to make them synchronous with the audio features.

2. Audio-Visual Speech Integration Using CHMM

CHMMs are a subset of a larger class of networks known as Dynamic Bayesian Networks (DBNs) [6]. A DBN is essentially a directed graph over a set of variables, with the edges of the graph defining the influence that each variable has on the others. A CHMM may be viewed as two HMMs in which the state sequence of each chain depends on the state sequence of the other. DBNs generalize HMMs by representing the hidden states as state variables and allowing the states to have complex interdependencies. Figure 1 shows a sample AVSI setup using a continuous-mixture, two-stream coupled HMM, where the circles represent the hidden discrete nodes and the squares represent the continuous observable nodes.

As stated earlier, any integration strategy for AVSR should take into account the fact that the audio and visual modalities are both synchronous and asynchronous. Modeling the integration with a CHMM accounts for this fact and hence proves to be an efficient technique for AVSI. For example, if the state sequence in Figure 1 is A1V1-A2V2-A3V3, the CHMM is modeling the synchrony between the two streams, whereas if the state sequence is A1V1-A2V1-A3V2, the CHMM is modeling their asynchrony.

Figure 1: Coupled Hidden Markov Model (CHMM), showing the audio channel with hidden states A1-A3 and the visual channel with hidden states V1-V3.

It should be noted that there are two state variables in this directed graph. Also, the state of each variable during a time slice depends on the states of both of its parents during the previous time slice.
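To make this coupling concrete, the sketch below (our own illustration under stated assumptions, not code from the paper) shows how a coupled transition model can be factored so that each chain's next state is conditioned on the previous states of both chains, which is exactly the parent structure described above.

```python
import numpy as np

# Illustrative sketch of a coupled-HMM transition model (not the authors' code).
# Each chain's next state depends on the previous states of BOTH chains:
#   P(a_t | a_{t-1}, v_{t-1}) and P(v_t | a_{t-1}, v_{t-1})
N_AUDIO, N_VIDEO = 4, 3  # states per chain, as used later in the paper

rng = np.random.default_rng(0)

def random_conditional(shape):
    """Random conditional probability table normalized over the last axis."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)

# P_audio[i, j, k] = P(a_t = k | a_{t-1} = i, v_{t-1} = j)
P_audio = random_conditional((N_AUDIO, N_VIDEO, N_AUDIO))
# P_video[i, j, k] = P(v_t = k | a_{t-1} = i, v_{t-1} = j)
P_video = random_conditional((N_AUDIO, N_VIDEO, N_VIDEO))

def joint_transition(a_prev, v_prev, a_next, v_next):
    """Joint transition probability of the coupled pair of chains."""
    return P_audio[a_prev, v_prev, a_next] * P_video[a_prev, v_prev, v_next]

# A synchronous step (both chains advance) and an asynchronous step
# (audio advances while video stays) are both valid under this model.
print(joint_transition(0, 0, 1, 1))  # e.g. A1V1 -> A2V2
print(joint_transition(0, 0, 1, 0))  # e.g. A1V1 -> A2V1
```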

2.1. Stream Weighting

Any integration strategy for AVSR must take into account the fact that the reliability of the audio and video streams changes continuously (with changes in the audio SNR), and hence the system must adapt to such changes. Therefore, the observation probability at time t for the observation vector $O_t^{a,v}$ becomes

$b_t^{a,v}(i) = \big[\, b_t(O_t^{a,v} \mid q_t^{a,v} = i) \,\big]^{\lambda_{a,v}}$    (1)

where $\lambda_a + \lambda_v = 1$ and $\lambda_a, \lambda_v \ge 0$ are the exponents of the audio and video streams. These values are obtained experimentally so as to maximize the average recognition rate.
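As a simple illustration of equation (1) (a sketch under our own assumptions, not the authors' implementation), the weighted combination is typically applied in the log domain, where the stream exponents become multiplicative weights on the per-stream log-likelihoods.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the stream-weighted observation score of equation (1),
# assuming single-Gaussian state output densities for brevity.
def weighted_obs_logprob(o_audio, o_video, audio_pdf, video_pdf,
                         lambda_a=0.65, lambda_v=0.35):
    """log b_t(i) = lambda_a * log p(o_a | state) + lambda_v * log p(o_v | state)."""
    assert abs(lambda_a + lambda_v - 1.0) < 1e-9 and lambda_a >= 0 and lambda_v >= 0
    return (lambda_a * audio_pdf.logpdf(o_audio)
            + lambda_v * video_pdf.logpdf(o_video))

# Toy densities: 30-dimensional audio features and 35-dimensional video features,
# matching the feature dimensions described in Section 2.2.
audio_pdf = multivariate_normal(mean=np.zeros(30), cov=np.eye(30))
video_pdf = multivariate_normal(mean=np.zeros(35), cov=np.eye(35))

score = weighted_obs_logprob(np.zeros(30), np.zeros(35), audio_pdf, video_pdf)
print(score)
```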

2.2. Audio and Video Feature Extraction

The speech input was processed using a 30 ms Hamming window with a frame period of 10 ms. For each frame, 15 MFCC coefficients were calculated. Delta coefficients were appended to the static features, resulting in a 30-dimensional audio feature vector. The visual features were extracted using the two-dimensional Discrete Cosine Transform (DCT). The region around the lips of each speaker was first located, this area was down-sampled to a 16x16 grayscale intensity image, and the DCT was applied to this matrix. The upper-left 6x6 block of the resulting matrix was retained and the DC value dropped, giving 35 DCT coefficients. However, since it has been shown that dynamic features yield better results than static features [10], each visual feature vector consisted of 35 dynamic DCT features.
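This visual front end can be sketched as follows (our own illustrative code, with the lip region assumed to be already located; the helper names such as extract_visual_features are hypothetical): a 16x16 lip image is transformed with a 2-D DCT, the upper-left 6x6 block is kept, and the DC term is discarded.

```python
import numpy as np
from scipy.fftpack import dct

# Illustrative sketch of the visual feature extraction described above.
# Assumes the lip region has already been located and down-sampled to 16x16.
def extract_visual_features(lip_image_16x16):
    """Return 35 static DCT coefficients (6x6 low-frequency block minus DC)."""
    # 2-D DCT: apply a type-II DCT along rows, then along columns.
    coeffs = dct(dct(lip_image_16x16, axis=0, norm='ortho'), axis=1, norm='ortho')
    block = coeffs[:6, :6].flatten()
    return block[1:]          # drop the DC value -> 35 coefficients

# Dynamic (delta) features, computed here as a simple first difference over time;
# the paper does not specify the exact delta computation, so this is an assumption.
def delta_features(static_sequence):
    return np.diff(static_sequence, axis=0, prepend=static_sequence[:1])

frames = np.stack([extract_visual_features(np.random.rand(16, 16))
                   for _ in range(10)])      # 10 frames x 35 static coefficients
dynamic = delta_features(frames)             # 10 frames x 35 dynamic coefficients
print(dynamic.shape)
```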

2.3. Database Description

All experiments in this research were carried out using the Clemson University Audio Visual Experiments (CUAVE) database [8]. The database consists of 36 speakers (19 male and 17 female) uttering connected digits, zero through nine, in a continuous format, and it is MPEG-2 encoded.

2.4. Training Phase: Parameter Estimation

The CHMM parameters are estimated using the expectation-maximization (EM) algorithm; the process of training a DBN is explained in [6]. Iterative maximum likelihood estimation of the parameters depends on the starting point and converges to a local optimum. However, [7] presents a novel technique for initializing the parameters before training that guarantees convergence to the global optimum. The optimal weighting value and the HMM parameters P for clean speech are calculated using

$P = \arg\max_{\lambda \in [0,1]} (\mathrm{WA} \mid \text{clean speech})$    (2)

where WA denotes the word recognition accuracy. It should be noted that the stream weights are not a by-product of the EM algorithm. The resulting AV model parameters P (transition probabilities, state means and variances) from (2) are used for all conditions.

2.5. Stream Weight Estimation

In this part of the training we estimate the behavior of the stream confidence value under various acoustic conditions. Using the clean-speech CHMM parameters estimated above, the optimal stream weighting value for a given SNR and noise type is estimated as

$\lambda_{opt} = \arg\max_{\lambda \in [0,1]} (\mathrm{WA} \mid P, \mathrm{SNR}, \text{noise type})$    (3)

In practice, λ is varied in steps of 0.05 from 0 to 1, and the value that gives the best results on the training set is chosen as $\lambda_{opt}$. It should be noted that although P is estimated using clean speech, $\lambda_{opt}$ is estimated using corrupted speech.
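A minimal sketch of this grid search is given below (our own illustration; evaluate_word_accuracy is a hypothetical stand-in for decoding the noise-corrupted training set with the clean-speech parameters P and a candidate weight).

```python
import numpy as np

# Sketch of the stream-weight grid search of equation (3).
# evaluate_word_accuracy(P, lambda_a, snr, noise_type) is assumed to decode the
# corrupted training set with audio exponent lambda_a and video exponent
# (1 - lambda_a), returning the word accuracy WA.
def estimate_lambda_opt(P, snr, noise_type, evaluate_word_accuracy, step=0.05):
    best_lambda, best_wa = None, -np.inf
    for lambda_a in np.arange(0.0, 1.0 + 1e-9, step):
        wa = evaluate_word_accuracy(P, lambda_a, snr, noise_type)
        if wa > best_wa:
            best_lambda, best_wa = lambda_a, wa
    return best_lambda, best_wa

# Example with a dummy accuracy function peaking near lambda_a = 0.35 (cf. Table 2 at 6 dB).
dummy_wa = lambda P, la, snr, noise: -(la - 0.35) ** 2
print(estimate_lambda_opt(None, snr=6, noise_type="speech",
                          evaluate_word_accuracy=dummy_wa))
```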

3. Experimental Goals

We had three experimental goals. The first was to investigate the improvement in recognition scores when the noise type is included in the estimation of the stream weight value. The second was to assess the dependence of the stream weighting value on the type of additive noise. The third was to compare the performance of CHMM-based AVSI with that of the MSHMM at various SNRs. To assess the performance of the system at various SNRs, we used the NOISEX database to corrupt the clean speech. The MSHMM setup used in this paper is as described in [5]. The CHMM consisted of four states in the audio channel and three states in the video channel, with no back transitions and four mixtures per state. A CHMM can be transformed into an HMM by considering all combinations of the state sequences, as explained in [9]; when the above CHMM is transformed in this way, it results in an HMM with 12 hidden states. Both the MSHMM and the CHMM systems were implemented using the Hidden Markov Model Toolkit (HTK). All recognition accuracies in this paper are calculated as 100[(N - S - D - I)/N], where N is the number of tokens and S, D and I are the numbers of substitution, deletion and insertion errors, respectively.
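The transformation of the coupled model into an equivalent single-chain HMM can be sketched as follows (an illustrative assumption about the construction, not the authors' HTK setup): every pair (audio state, video state) becomes one composite state, so 4 audio states and 3 video states give 12 composite states, and the composite transition matrix is built from the coupled transition tables.

```python
import numpy as np
from itertools import product

# Sketch: fold a 4-state audio chain and a 3-state video chain into one
# 12-state product HMM, using coupled transition tables as in the earlier sketch.
N_AUDIO, N_VIDEO = 4, 3
rng = np.random.default_rng(1)

def random_conditional(shape):
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)

P_audio = random_conditional((N_AUDIO, N_VIDEO, N_AUDIO))  # P(a_t | a_{t-1}, v_{t-1})
P_video = random_conditional((N_AUDIO, N_VIDEO, N_VIDEO))  # P(v_t | a_{t-1}, v_{t-1})

composite_states = list(product(range(N_AUDIO), range(N_VIDEO)))  # 12 states
A = np.zeros((len(composite_states), len(composite_states)))
for i, (a_prev, v_prev) in enumerate(composite_states):
    for j, (a_next, v_next) in enumerate(composite_states):
        A[i, j] = P_audio[a_prev, v_prev, a_next] * P_video[a_prev, v_prev, v_next]

print(len(composite_states))          # 12
print(np.allclose(A.sum(axis=1), 1))  # each composite row is a valid distribution
```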

We used 20 speakers to train the models; the remaining 16 speakers formed the testing set.

4. Experimental Results

The stream weight values obtained for the MSHMM and the CHMM are shown in Table 1 and Table 2, respectively. It is interesting to note that the different integration systems yield similar stream weight values, which leads to the conclusion that the stream weight values depend on the reliability of the audio stream (its SNR and noise type) rather than on the technique used to combine the audio and video streams.

SNR (dB)   18     12     6
λa         0.65   0.50   0.35

Table 1: Stream weight values for the MSHMM.

SNR (dB)   18     12     6
λa         0.65   0.45   0.35

Table 2: Stream weight values for the CHMM.

4.1. Comparison of a CHMM System Implemented with Knowledge of the Noise Type

Figure 2 plots the recognition scores at various SNRs for two CHMM setups. Here, 'speech' noise from the NOISEX database was used to corrupt the clean speech. In the first setup the stream weight values were estimated with knowledge of the noise type; in the second setup they were estimated without any prior knowledge of the noise type. The system that had knowledge of the noise type outperformed the system 'trained' without it by as much as 5% at an audio SNR of 6 dB.

Figure 2: Comparison of recognition accuracy for a CHMM trained with knowledge of the noise type and a CHMM trained without knowledge of the noise type.

4.2. Dependence of the Stream Weight Value on the Noise Type

Figure 3 shows the stream weight values estimated for the CHMM with the audio stream corrupted by speech noise and by car noise, respectively. Although the stream weight value is independent of the type of integration used, the type of noise corrupting the audio stream has a strong influence on the stream weight value.

Figure 3: Dependence of the stream weight value on the type of additive noise.

Hence, the stream weight values may be estimated for one integration technique and then reused for all other integration techniques. This is particularly useful when stream weight estimation for a particular technique is difficult to implement in real time.

4.3. Comparison of CHMM with MSHMM for Continuous Speech Recognition

The recognition scores are shown in Table 3. It can be seen that the audio-only recognition score drops as the SNR decreases. However, both the CHMM and the MSHMM outperform the audio-only recognizer, supporting our claim that AVSR systems perform better than audio-only recognizers. Another important point is that the video-only recognition scores are independent of the SNR of the audio signal. It can also be seen that the CHMM-based integration scheme does better than the MSHMM. Moreover, as the audio SNR drops, the drop in MSHMM recognition accuracy is much greater than that of the CHMM: for a change in SNR from 12 dB to 6 dB, the MSHMM recognition score drops by 12.6% whereas the CHMM score drops by only 6.4%.

SNR (dB)   Audio Only   Video Only   MSHMM    CHMM
18         61.2%        36.8%        67.4%    69.2%
12         48.7%        36.8%        53.8%    56.2%
6          28.9%        36.8%        41.2%    49.8%

Table 3: Recognition results for the various integration schemes on the testing set.

5. Conclusions

Our results indicate that the CHMM performs better than the MSHMM for integrating the audio and visual modalities in audio-visual speech recognition. Also, since knowledge of the noise type improves the recognition scores, it would be useful in applications where prior knowledge of the recognizer's operating environment is available; for example, for a speech recognition application deployed in a shopping mall, the stream weight reliability values could be estimated based on speech noise. Another important finding is that the stream weight value tends to be independent of the integration scheme. Hence, the implementation of any integration scheme would require only a simple lookup table with the noise type and SNR as inputs, which is useful in cases where real-time computation of the stream weight value is not feasible. Future work will include the use of local stream weight values (we presently use a global stream weight value). In addition, the EM algorithm will be modified to include the estimation of the stream weights within the algorithm itself, and the effect of video noise on the system will be assessed.
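As a closing illustration (our own sketch; the "car" entries are hypothetical placeholders, while the "speech" entries echo Table 2), the deployment-time logic reduces to a lookup keyed on noise type and SNR:

```python
# Sketch of the stream-weight lookup table proposed in the conclusions.
# In practice the table would be filled with the lambda_opt values estimated
# offline via equation (3) for each noise type and SNR condition.
STREAM_WEIGHT_TABLE = {
    ("speech", 18): 0.65,   # values in the spirit of Table 2 (CHMM)
    ("speech", 12): 0.45,
    ("speech", 6): 0.35,
    ("car", 18): 0.70,      # hypothetical entries for a second noise type
    ("car", 12): 0.55,
    ("car", 6): 0.40,
}

def lookup_stream_weight(noise_type, snr_db):
    """Return (lambda_a, lambda_v) for the closest tabulated SNR of this noise type."""
    candidates = [(abs(snr - snr_db), w) for (n, snr), w in STREAM_WEIGHT_TABLE.items()
                  if n == noise_type]
    if not candidates:
        raise KeyError(f"no entries for noise type '{noise_type}'")
    _, lambda_a = min(candidates)
    return lambda_a, 1.0 - lambda_a

print(lookup_stream_weight("speech", 8))   # nearest tabulated SNR is 6 dB -> (0.35, 0.65)
```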

6. Acknowledgements

We would like to thank Sabri Gurbuz and Eric Patterson [5,10] for providing the video feature extraction algorithms and the initial MSHMM setups.

7. References

[1] H. McGurk and J. MacDonald, "Hearing Lips and Seeing Voices," Nature, vol. 264, pp. 746-748, 1976.
[2] C. Neti, G. Potamianos, J. Luettin, I. Matthews, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, "Audio-Visual Speech Recognition," Final Workshop 2000 Report, 2000.

[3] G. Potamianos, J. Luettin, and C. Neti, "Hierarchical Discriminative Features for Audio-Visual LVCSR," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, 2001.
[4] V. I. Pavlovic, "Dynamic Bayesian Networks for Information Fusion with Applications to Human-Computer Interaction," PhD Dissertation, University of Illinois at Urbana-Champaign, 1999.
[5] S. Gurbuz, Z. Tufekci, E. Patterson, and J. Gowdy, "Multi-Stream Product Modal Audio-Visual Integration for Robust Adaptive Speech Recognition," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Orlando, 2002.
[6] F. V. Jensen, An Introduction to Bayesian Networks, UCL Press Limited, London, UK, 1998.
[7] A. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy, "A Coupled HMM for Audio-Visual Speech Recognition," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Orlando, 2002.
[8] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interaction Research," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Orlando, 2002.
[9] S. Chu and T. Huang, "Audio-Visual Speech Modeling Using Coupled Hidden Markov Models," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Orlando, 2002.
[10] E. K. Patterson, "Audio-Visual Speech Recognition for Difficult Environments," PhD Dissertation, Clemson University, 2002.
