Eurospeech 2001 - Scandinavia

Robust Speech Recognition in Noise: An Evaluation using the SPINE Corpus† John H.L. Hansen, Ruhi Sarikaya, Umit Yapanel, Bryan Pellom CSLR: Center for Spoken Language Research; Robust Speech Processing Laboratory University of Colorado at Boulder, Boulder, Colorado 80309-0594, USA {jhlh, sarikaya, yapanel, bp}@cslr.colorado.edu, http://cslr.colorado.edu

ABSTRACT
In this paper, methodologies for effective speech recognition are considered, along with evaluations on an NRL speech in noise corpus entitled SPINE. When speech is produced in adverse conditions that include high levels of noise, workload task stress, and Lombard effect, new challenges arise concerning how best to improve recognition performance. Here, we consider trade-offs in (i) robust features, (ii) front-end noise suppression, (iii) model adaptation, and (iv) training and testing in the same conditions. The type of noise and recording conditions can significantly impact which signal processing and speech modeling methods are most effective in achieving robust speech recognition. We considered alternative frequency scales (M-MFCC, ExpoLog), feature processing (CMN, VCMN, LP vs. FFT MFCCs), model adaptation (PMC), and combinations of gender dependent and gender independent models. Computational speed and the availability of adaptation data greatly impact final recognition performance. In particular, while reliable algorithm formulations for addressing specific types of distortion can improve recognition rates, these algorithms cannot reach their full potential without proper front-end data analysis to direct the compensation. And while parallel banks of speech recognizers can improve recognition performance, their significant computational requirements can render the recognizer useless in actual speech applications.

1. Introduction
The issue of robustness in speech recognition can take on a broad range of problems. A speech recognizer may be robust in one environment and yet be inappropriate for another. The main reason for this is that the performance of existing recognition systems, which assume a noise-free tranquil environment, degrades rapidly in the presence of noise, distortion, and speaker stress. In Fig. 1, a general speech recognition scenario is presented which considers a variety of speech signal distortions. For this scenario, we assume that a speaker is exposed to some adverse environment, where ambient noise is present and a stress-inducing task must be performed. The adverse environment could be a noisy automobile where hands-free cellular communication is used, military or commercial environments such as high-stress, noisy helicopter or aircraft cockpits, factory environments, and others. Since the user task can be demanding, the speaker is required to divert a measured level of cognitive processing to it, leaving formulation of speech for recognition as a secondary task. Therefore, while microphone mismatch can have a measurable impact on speech recognition performance, this fixed mismatch is virtually trivial when compared to the variability introduced by background noise and speaker stress. Here we emphasize that speech recognition in actual noisy environments cannot be effectively addressed unless speech production variability due to background noise is also addressed, and to a lesser degree the level of speaker task demands while using a speech recognizer.

Fig. 1 illustrates how acoustic background noise influences speech in several areas prior to speech recognition. Since the talker is producing speech in the presence of noise, he will experience the Lombard effect, a condition where speech production is altered in an effort to communicate more effectively across a noisy environment. The level of Lombard effect may depend on the type and level of ambient noise d1(n). Workload task stress in the presence of high noise levels has also been shown to significantly impact recognition performance [1]. This situational stress or workload task demand will alter the manner in which speech is produced. If s(n) represents neutral noise-free speech, then the acoustic signal at the microphone includes distortion due to additive noise and speaker variability. If the speech recognizer is trained with one microphone and another is used for testing, then distortion due to microphone mismatch can be modeled with a frequency distortion impulse response hMIKE(n). There can be other microphone factors such as partial speech cut-off when using a push-to-talk microphone. Background noise d1(n) will also degrade the speech signal, as illustrated in Eqn. 1. Cellular channel effects can be modeled as either additive noise d2(n) or a frequency distortion with impulse response hCHAN(n). Furthermore, noise could also be present at the receiver, d3(n). Therefore, the neutral noise-free, distortionless speech signal s(n), having been produced and transmitted under noisy adverse conditions, is transformed into the degraded signal y(n) as shown in Eqn. 1 in Fig. 1. This paper focuses on addressing trade-offs in how to improve speech recognition in noisy environments based on the SPINE corpus [2]. We begin with a brief overview of methods for addressing noise in speech recognition. Next, we offer some comments on the SPINE corpus, followed by an analysis of the different audio conditions. Sec. 5 focuses on the CU Speech Recognizer used for this study, with evaluation results presented. Finally, Sec. 6 summarizes the work with several comments on future efforts for speech recognition in noise.

2. Why Speech Recognizers Break
There are a variety of approaches that can be used to improve the robustness of speech recognition. These can be partitioned into four general areas as follows:
♦ Robust Speech Features: these methods focus on developing features which are inherently less sensitive to noise/distortion.
♦ Speech & Feature Enhancement: these methods focus on front-end signal or feature processing to suppress the impact of noise or distortion prior to speech recognition.
♦ Recognizer Model Adaptation: these methods focus on adapting recognition models to the noisy speaker conditions.
♦ Modified Training Methods: these approaches consider alternative training using either noisy data, mismatch between training/test data, or modifications which cause the trained models to be more effective for recognizing noisy speech.

This work was supported by DARPA through SPAWAR under Grant No. N66001-002-8906.


Fig. 1. Types of distortion which can affect speech for the problem of robust speech recognition. (The figure depicts the speaker, subject to workload/task stress and Lombard effect, with noise entering at the microphone, channel, and receiver, summarized as:)

y(n) = (({[s(n)] + d1(n)} * hMIKE(n) + d2(n)) * hCHAN(n)) + d3(n)    [1]

If we ask ourselves why speech recognizers break, the answer comes from a range of issues. Significant research has been done in this area, so it is not possible to cite all studies; however, the following references point to several works that are representative, though clearly not exhaustive [1,3-13]. The main point to emphasize is that for effective speech recognition in noise, it may not be possible to formulate one universally successful signal processing solution. This is the fundamental reason why a speech recognizer which is successful in one scenario cannot hope to be successful when noise or speaking conditions change [1].
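To make the degradation model concrete, the following toy simulation applies the chain of Eqn. 1 to a synthetic signal. It is a minimal sketch: all signals, impulse responses, and noise levels are illustrative placeholders, and speaker-dependent effects (Lombard effect, task stress) are not modeled, since they alter production itself rather than the transmission chain.

```python
import numpy as np

def degrade(s, d1, h_mike, d2, h_chan, d3):
    """Apply the degradation chain of Eqn. 1 to a 1-D signal."""
    x = s + d1                                         # additive ambient noise d1(n)
    x = np.convolve(x, h_mike, mode="full")[:len(s)]   # microphone response hMIKE(n)
    x = x + d2                                         # channel additive noise d2(n)
    x = np.convolve(x, h_chan, mode="full")[:len(s)]   # channel response hCHAN(n)
    return x + d3                                      # receiver-side noise d3(n)

rng = np.random.default_rng(0)
n = 16000
s = np.sin(2 * np.pi * 440 * np.arange(n) / 8000)      # stand-in for clean speech s(n)
d1, d2, d3 = (0.05 * rng.standard_normal(n) for _ in range(3))
h_mike = np.array([1.0, 0.3])                          # toy impulse responses
h_chan = np.array([1.0, -0.2, 0.05])
y = degrade(s, d1, h_mike, d2, h_chan, d3)             # degraded signal y(n)
```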

3. Purpose of SPINE
In an effort to better understand the impact of military noise conditions on present-day speech recognition systems, NRL organized the SPINE corpus [2]. The strengths of SPINE include: (i) balance between male/female speakers, (ii) controlled 2-way communication scenarios where speakers were required to complete a Battleship game task, and (iii) actual military background noise played through loudspeakers in sound rooms with various microphones (i.e., push-to-talk mikes, etc.). Since the goal was to establish recognizer performance in actual military scenarios, some weaknesses did exist, including: (i) noise recordings were fixed and playback was the same for each scenario (i.e., no randomization of the noise field for each condition), (ii) the task consisted of a Battleship-like game, which in theory would be much less taxing than some military task stress conditions (i.e., flying a helicopter, etc.), and (iii) essentially no task stress or penalty assigned for task completion, and no Lombard effect included. While the data was originally collected for a task other than speech recognition evaluation, it does serve as an important first step in focusing the research community on speech recognition in adverse conditions.

4. Analysis of Acoustic Conditions
In this section, we summarize results from our acoustic analysis of the SPINE corpus. There were six recording conditions in the corpus: four for training and two additional conditions in the evaluation test set. Training set noise includes: quiet, office noise, Army jeep/Humwv, and Navy aircraft carrier (intercom speaker). In Table 1, we summarize results from a statistical power spectral analysis of data for speakers on channels A and B. We can see that for condition DD, the separation between speech and noise was (A) 23-30 dB and (B) 20-42 dB; for AR, the separation was (A) 18-38 dB and (B) 37-46 dB; and for NV, the separation was (A) 32-51 dB and (B) 17-36 dB (all over the 0-4 kHz range). Therefore, for office or quiet conditions, the background noise was quite flat, with good speech-to-noise separation. The Humwv noise was basically confined to a 4 kHz bandwidth and reasonably stationary, with the additional challenge of a push-to-talk microphone. Finally, the aircraft carrier intercom condition (NV (A)) had noise which was highly non-stationary. Our analysis suggests that the three two-way conversation sessions contained very different noise and microphone conditions.

PSD INTENSITY (dB) OVER FREQUENCY (kHz)

CONDITION         0    1    2    3    4    5    6    7    8
DD (A)  SPEECH   95   78   88   85   75   51   42   37   30
DD (A)  NOISE    72   59   60   57   45   36   33   32   25
DD (B)  SPEECH   82   86   80   79   72   59   58   52   46
DD (B)  NOISE    60   44   43   53   48   35   34   35   27
AR (A)  SPEECH   90   95  104   96   96   87   70   62   54
AR (A)  NOISE    68   72   80   69   59   37   33   29   25
AR (B)  SPEECH   98  104  108  107   91   92   87   72   64
AR (B)  NOISE    61   56   53   49   45   43   41   42   31
NV (A)  SPEECH  101  100   97   96   94   78   65   64   55
NV (A)  NOISE    69   56   57   65   43   33   30   30   22
NV (B)  SPEECH   81   75   81   77   77   78   76   72   58
NV (B)  NOISE    63   49   49   44   40   41   40   41   30

Table 1: Summary of acoustic analysis of SPINE training data. DD(A) is quiet, DD(B) is office, AR(A) is Humwv, AR(B) is quiet, NV(A) has aircraft carrier intercom noise, NV(B) is office noise.
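The band levels in Table 1 come from long-term power spectral analysis. Below is a minimal sketch of this kind of measurement, assuming speech-active and noise-only segments of a channel have already been separated; the segmentation and the Welch parameters here are illustrative assumptions, not the analysis settings actually used.

```python
import numpy as np
from scipy.signal import welch

def band_levels_db(x, fs, band_edges_hz):
    """Average Welch PSD level (dB) in each frequency band."""
    f, pxx = welch(x, fs=fs, nperseg=1024)
    levels = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        mask = (f >= lo) & (f < hi)
        levels.append(10 * np.log10(np.mean(pxx[mask]) + 1e-12))
    return np.array(levels)

# Example: speech-to-noise separation per 1 kHz band, as in the 0-4 kHz
# ranges quoted in the text. Placeholder segments stand in for real audio.
fs = 16000
edges = np.arange(0, 9000, 1000)               # 0, 1, ..., 8 kHz band edges
rng = np.random.default_rng(1)
speech = rng.standard_normal(fs * 5)           # placeholder speech segment
noise = 0.1 * rng.standard_normal(fs * 5)      # placeholder noise segment
separation = band_levels_db(speech, fs, edges) - band_levels_db(noise, fs, edges)
```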

5. Speech Recognizer Formulation
The baseline speech recognition system used was developed at CSLR - Univ. Colorado for the purposes of exploring trade-offs in features, training, and adaptation methods.
Training & LVCSR Formulation: Before training could take place, phone-level transcriptions were needed. This was performed using an alignment tool developed at RSPL-CSLR. The alignment tool uses 12 MFCCs with 12 deltas, energy, and delta energy. Our aligner employs gender dependent monophones trained with 8kHz TIMIT data using a 3-state, left-to-right topology. SPINE training data was downsampled to 8kHz to obtain transcriptions. During segmentation, features were extracted every 5ms. A text-to-phoneme conversion was used based on a pronunciation dictionary made available by CMU. A set of 16 mixture densities was used to characterize observation probability densities. For the recognizer, we used 12 MFCCs, deltas, and delta-deltas, along with energy, delta energy, and delta-delta energy (a 39 dimensional vector). The HMMs are trained for base phones and cross-word triphones using only the SPINE training data (some additional DRT data was later provided, but we opted not to include this in the training scenario). A total of 8 left and right phone classes were defined for triphone modeling, using the CMU Sphinx phone set. Finally, all HMMs have a 3-state, left-to-right topology with 4-to-16 mixtures per state, depending on the available training data. The recognizer lexicon uses a linear phone sequence to represent the pronunciation of each word in the vocabulary. The vocabulary size is 5249 (with 10 hours of training data, and 20 hours of test data). The language model is a trigram model provided by CMU. Finally, a single-pass beam search Viterbi decoding process is used in the recognition phase.
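For illustration, the sketch below assembles the 39-dimensional observation vector described above from a base matrix of 12 MFCCs plus log energy per frame; the regression window length for the delta computation is an assumption, not a documented system parameter.

```python
import numpy as np

def deltas(feat, k=2):
    """Standard regression-based delta features over +/-k frames."""
    padded = np.pad(feat, ((k, k), (0, 0)), mode="edge")
    num = sum(t * (padded[k + t:len(feat) + k + t] - padded[k - t:len(feat) + k - t])
              for t in range(1, k + 1))
    return num / (2 * sum(t * t for t in range(1, k + 1)))

def observation_vectors(mfcc_e):
    """Stack statics, deltas, and delta-deltas into a [frames, 39] matrix."""
    d1 = deltas(mfcc_e)
    d2 = deltas(d1)
    return np.hstack([mfcc_e, d1, d2])

frames = np.random.randn(100, 13)      # placeholder 12 MFCCs + energy per frame
obs = observation_vectors(frames)      # shape (100, 39)
```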


In our simulations we excluded non-speech sounds (cough, breath, laugh, etc.). The total number of aligned utterances is 11947, with 5657 utterances from channel A and 6290 from channel B. The EVAL1 data set partitioning contains 13565 utterances, whereas the EVAL2 data set partitioning contains 13265 utterances. In EVAL2, 5896 utterances belong to channel A and 7369 to channel B.

(a) BASELINE RECOGNITION EVALUATIONS
SYSTEM          FEATURE SCALE   WER    SUBS.   DEL.   INS.
EVAL1 (8kHz)    MEL             47.2   26.4    14.6   6.2
EVAL2 (8kHz)    MEL             43.0   25.7    13.2   4.2
EVAL2 (16kHz)   MEL             41.6   24.8    12.4   4.4

(b) PROCESSING METHODS (8kHz)
SCALE   PROCESSING      WER (%)
MEL     FP, FFT         43.5
MEL     FP, CMN, FFT    43.5
MEL     FP, VCMN, FFT   44.3
MEL     FP, LP          44.9

(c) ALTERNATIVE FREQUENCY SCALES (8kHz)
SCALE     PROCESSING   WER (%)
MEL       FP, FFT      43.5
M-MEL     FP, FFT      42.6
EXPOLOG   FP, FFT      44.4
UNIFORM   FP, FFT      43.0

(d) SPEECH ENHANCEMENT PROCESSING (8kHz)
SCALE   ENHANCEMENT   WER (%)
MEL     LSS           45.1
MEL     NSS           42.6
MEL     GSS           43.1

(e) MODEL ADAPTATION METHODS (16kHz)
SYSTEM              WER    SUBS.   DEL.   INS.
WITH-C0             43.1   25.2    13.4   4.6
PMC-CLEAN           51.2   29.3    13.7   8.2
PMC-ALL             42.8   25.0    13.7   4.1
NOISY               61.5   33.4    23.4   4.8
PMC-CLEAN + NOISY   44.8   26.9    13.4   4.5
WITH-C0 + PMC-ALL   42.0   24.7    13.3   4.0

(f) MALE/FEMALE GENDER DEPENDENT MODELS
SYSTEM           WER    SUBS.   DEL.   INS.
MALE             60.0   36.8    17.8   5.4
FEMALE           53.1   31.2    16.8   5.2
MALE+FEMALE      42.5   25.5    12.4   4.6
MALE+FEMALE+GI   40.5   24.2    12.0   4.3

Table 2: LVCSR Recognition Results for SPINE Corpus

Baseline Evaluations: Baseline experiments used both EVAL1 and EVAL2 data sets. An early set of recognition results produced an unusually high 74.5% WER. However, after inspecting the insertion and deletion rates (27.4%, 25.9%), it was clear that the scoring scheme used by NRL required a different time-stamp marking scheme, since most of the speech was recognized correctly. Revising our output hypothesis transcription time-stamps for an 8kHz trained system produced a WER of 47.2%, with more reasonable deletion and insertion rates (14.6%, 6.2%) (see Table 2(a)). The EVAL2 test set improved by 4.2% over the EVAL1 data, which we believe was due to the excess amount of silence present at the beginning and end of each utterance in the EVAL1 data (i.e., the gap insertion rate was higher for EVAL1, and with gap silence durations removed for EVAL2, gap insertions were reduced). While we believe an 8kHz system is more appropriate for voice communication scenarios, we did consider evaluations with 16kHz data. Using full bandwidth data improved the WER by 1.4%. Therefore, the baseline system WER is 41.6%. Although our baseline system was not optimized for speed like some systems in the SPINE NRL evaluations, it was the fastest recognizer system, with a 2.83 real-time factor. All remaining experiments used EVAL2 data.
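For readers unfamiliar with the error breakdown reported in Table 2, the sketch below computes WER and its substitution/deletion/insertion components via Levenshtein alignment. This is generic scoring logic, not the NRL time-stamp-based scoring scheme discussed above; the word lists are made-up examples.

```python
def wer_counts(ref, hyp):
    """Levenshtein alignment returning (substitutions, deletions, insertions)."""
    # dp[i][j] = (cost, subs, dels, inss) for ref[:i] vs hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)                     # all deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)                     # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            match = 0 if ref[i - 1] == hyp[j - 1] else 1
            c, s, dl, ins = dp[i - 1][j - 1]
            best = (c + match, s + match, dl, ins)  # substitution or match
            c, s, dl, ins = dp[i - 1][j]
            if c + 1 < best[0]:
                best = (c + 1, s, dl + 1, ins)      # deletion
            c, s, dl, ins = dp[i][j - 1]
            if c + 1 < best[0]:
                best = (c + 1, s, dl, ins + 1)      # insertion
            dp[i][j] = best
    _, s, dl, ins = dp[len(ref)][len(hyp)]
    return s, dl, ins

ref = "the humvee idles in heavy noise".split()     # hypothetical example
hyp = "a humvee idles heavy noise noise".split()
s, dl, ins = wer_counts(ref, hyp)
wer = 100.0 * (s + dl + ins) / len(ref)             # WER = (S + D + I) / N
```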

Robust Features & Feature Processing: The area of robust features focuses on formulating signal processing methods which produce parameters that are inherently less sensitive to background distortion and noise. While features that are less sensitive to noise are important, it should be noted that such features are not necessarily less sensitive to speaker variability under noise and stress, as pointed out in [12]. In this section, the results of a number of evaluations for different frequency scales and parameter processing schemes are presented; in the next section, different front-end enhancement techniques are considered. Based on results in [12], we felt it would be useful to explore speech recognition performance using SPINE data for the following features and processing methods: (a) fixed pre-emphasis (FP), (b) cepstrum mean normalization (CMN), (c) variable cepstrum mean normalization (VCMN) (i.e., different normalization based on voiced/unvoiced information from [7,1]), and (d) use of the Fast Fourier Transform (FFT) vs. Linear Prediction (LP) spectrum in the computation of cepstrum coefficients. The results for the Mel frequency scale are given in Table 2(b), using fixed pre-emphasis (FP) with either FFT or LP based MFCCs, and fixed or variable cepstral mean normalization (CMN, VCMN). For this data, parameter processing and using the LP vs. FFT spectrum did not give measurable improvement.
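As a reference point, CMN and a VCMN-style variant can be sketched as below; the energy-threshold voicing decision is an illustrative stand-in for the voiced/unvoiced information of [7,1], and the data is a random placeholder.

```python
import numpy as np

def cmn(cep):
    """Subtract the utterance-level cepstral mean from every frame."""
    return cep - cep.mean(axis=0)

def vcmn(cep, voiced):
    """Normalize voiced and unvoiced frames with separate means (VCMN-style)."""
    out = cep.copy()
    for v in (True, False):
        mask = voiced == v
        if mask.any():
            out[mask] -= cep[mask].mean(axis=0)
    return out

cep = np.random.randn(200, 13)            # placeholder cepstral frames
voiced = cep[:, 0] > cep[:, 0].mean()     # crude stand-in for a voicing detector
normalized = vcmn(cep, voiced)
```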
Next, we considered alternative filter bank frequency partitions in computing cepstrum coefficients. Two alternative scales, the Modified Mel scale and the ExpoLog scale, proposed for robust stressed speech recognition [12], were used for evaluation. For completeness, we provide the mathematical expressions for these frequency scales [12]:

Mel(f) = 2595 · log10(1 + f/700)                                    [2]

Modified-Mel(f) = 3070 · log10(1 + f/1000)                          [3]

ExpoLog(f) = 700 · (10^(f/3988) − 1),       0 ≤ f ≤ 2000            [4]
           = 2595 · log10(1 + f/700),       2000 ≤ f ≤ 4000
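Written out directly (log base 10, f in Hz), the three warpings can be compared numerically; a filter bank for M-MFCC or ExpoLog places its triangular filters uniformly along the warped axis rather than along the Mel scale. This is a sketch of the scale definitions only, not of the full front end. Note that the two ExpoLog branches meet, to within rounding, at f = 2000 Hz.

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def modified_mel(f):
    return 3070.0 * np.log10(1.0 + f / 1000.0)

def expolog(f):
    f = np.asarray(f, dtype=float)
    return np.where(f <= 2000.0,
                    700.0 * (10.0 ** (f / 3988.0) - 1.0),   # exponential branch
                    2595.0 * np.log10(1.0 + f / 700.0))     # log branch above 2 kHz

freqs = np.linspace(0, 4000, 9)
print(np.round(mel(freqs)), np.round(modified_mel(freqs)), np.round(expolog(freqs)))
```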
Results from Table 2(c) show that the Modified Mel scale (M-MFCC) provides a 0.9% improvement over the usual Mel scale (MFCCs), and a uniform equal-partition scale provides a 0.5% improvement. The M-MFCC improvement can be attributed to emphasis of the mid frequencies, thereby addressing, to some extent, variability due to speaker stress: the second formant location does not change as much with stress, and these frequencies are emphasized by the Modified Mel scale.

Speech Enhancement Processing: Next, we consider different front-end speech enhancement techniques prior to recognition feature extraction. Three enhancement schemes were considered: (a) Linear Spectral Subtraction (LSS), (b) Non-Linear Spectral Subtraction (NSS), and (c) Generalized Spectral Subtraction (GSS) (a sketch of this family of subtraction rules appears at the end of this section). Other methods such as constrained iterative Auto-LSP [8] and MCE [7] were not considered because of measurable differences in computing requirements. Since the three spectral subtraction methods all have comparable CPU requirements, a direct comparison was deemed appropriate. Results from Table 2(d) show that the NSS scheme improved recognition results by 0.9%, while the other two enhancement schemes did not give substantial improvement.

Model Adaptation Methods: Next, we used parallel model combination (PMC) [4] to compensate acoustic models (see the sketch at the end of this section). Two issues influence PMC's effectiveness: first, it assumes acoustic models are trained using clean data, while SPINE training data contains both clean and noisy data. Second, PMC requires models trained with the 0th order cepstral coefficient. We addressed the first issue by using only the clean portion of the training data and performing PMC on the clean models. We used PMC-compensated models and models trained on noisy data in parallel during decoding. However, splitting the training data into two classes resulted in an insufficient amount of training data to obtain reliable acoustic models. The first row of Table 2(e) shows that using C0 in place of normalized log energy resulted in a 1.5% increase in WER (over the baseline 16kHz system with a WER of 41.6%). The second row shows performance using PMC-compensated clean models, where the WER increased to 51.2%. From this result we cannot conclude that PMC is degrading the results; the source of the problem is insufficient clean data to train reliable models. In the third row, we performed PMC on the models trained using all the data. Comparing the third row with the first, we observe a small 0.3% improvement in WER. Note that the assumption of having clean models is violated here for performing PMC. However, the degradation introduced by having clean as well as noisy data was smaller than the gain in accuracy from having reliable acoustic models. Next, we performed recognition using models trained only with noisy data. Using a parallel decoding strategy with both PMC-compensated clean models and noisy models, the WER improved but remained higher than 43.1%. Finally, using the two parallel model sets of PMC-ALL and WITH-C0 improved the WER by 0.8%.

Other Considerations: Next, we trained gender dependent acoustic models to reduce gender variations and to increase the acoustic resolution of the models (see Table 2(f)). Models trained on the male data did poorly on the female data, and vice versa. We then used the gender dependent acoustic models in parallel, expecting male models to do better on male data and female models to do better on female data. However, using male and female models in parallel did not provide improvement over a gender independent model set. We believe this is due to an insufficient amount of gender dependent training data to obtain robust acoustic models. Next, we used gender dependent and gender independent (GI) models in parallel during decoding. We obtained the lowest WER in our simulations: 40.5%. However, the drawback here is an increase in decoding time, with the real-time factor increasing roughly by a factor of 3 when the three model sets are used in parallel.
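As promised above, here is a sketch of the generalized magnitude-domain subtraction family behind LSS/NSS/GSS. The over-subtraction factor, spectral floor, and power exponent are illustrative values, not the tuned parameters of the systems compared in Table 2(d); with alpha = 1, p = 1, and no floor, this reduces to plain linear subtraction.

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, beta=0.02, p=2.0):
    """Subtract a noise magnitude estimate in the |X|^p domain, with flooring."""
    clean_p = noisy_mag ** p - alpha * noise_mag ** p   # over-subtraction
    floor_p = beta * noise_mag ** p                     # spectral floor
    return np.maximum(clean_p, floor_p) ** (1.0 / p)

# Usage on one analysis frame: the noise estimate would normally come from
# leading noise-only frames of the utterance. Random data used as placeholder.
noisy = np.abs(np.fft.rfft(np.random.randn(256)))
noise = np.full_like(noisy, noisy.mean() * 0.3)
enhanced = spectral_subtract(noisy, noise)
```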

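Similarly, the core PMC combination step can be sketched for static cepstral means only. Real PMC [4] also compensates variances and dynamic coefficients (e.g., via the log-normal approximation); the filter-bank size below is an arbitrary illustrative choice, and the input means are random placeholders rather than trained model parameters.

```python
import numpy as np
from scipy.fftpack import dct, idct

def pmc_static_mean(clean_cep_mean, noise_cep_mean, n_filters=24):
    """Combine clean-speech and noise cepstral means by adding in the linear domain."""
    def cep_to_lin(c):
        log_spec = idct(c, n=n_filters, norm="ortho")   # cepstra -> log filter bank
        return np.exp(log_spec)                         # log -> linear spectral domain
    combined = cep_to_lin(clean_cep_mean) + cep_to_lin(noise_cep_mean)
    return dct(np.log(combined), norm="ortho")[:len(clean_cep_mean)]

clean_mean = 0.1 * np.random.randn(13)
noise_mean = 0.1 * np.random.randn(13)
noisy_mean = pmc_static_mean(clean_mean, noise_mean)    # compensated model mean
```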
6. Discussion
In this paper, we addressed a number of issues relating to robust speech recognition in noise. We emphasize here that while noise impacts the speech features used for recognition, it also has a measurable impact on the speaker, causing production variability that results in the Lombard effect and speaker stress. This variability must be addressed in order for effective speech recognition to be achieved in adverse environments. Our study focused on the SPINE speech in noise corpus developed by NRL. We developed a baseline recognizer and then considered evaluation performance for various (i) robust speech features, (ii) speech and feature processing schemes, and (iii) recognizer model adaptation methods. Our acoustic analysis of SPINE showed significant differences in noise power spectral and stationarity characteristics. It is important to realize that while these methods have previously been shown to be effective for speech recognition, the so-called "data-processing set-up", which is based on extracting reliable information on where noise is present, as well as where speech and speaker data are located for adaptation, can significantly influence final word error rates (WER). Our baseline WER was 43.0% for an 8kHz trained system, and 41.6% for the 16kHz system. Feature processing methods such as fixed or variable cepstral mean normalization and fixed pre-emphasis provided small improvement. Some improvement was also observed with front-end speech enhancement methods, but again their impact is heavily dependent on reliable noise estimates. This observation was also true for model adaptation using PMC for the 16kHz system. Finally, with gender dependent models, there was no improvement, most likely due to the limited amount of training data. However, by combining gender based models with a gender independent model set, we were able to reduce the overall WER by 1.1% over the baseline system. Our system proved to be the fastest (2.83 times real-time) of all sites participating in the SPINE summer evaluation at NRL. The results here suggest several points that should be emphasized for future efforts in robust speech recognition. First, while using multiple recognizers in parallel (i.e., the "ROVER" solution) can measurably reduce WERs, this comes at a significant increase in computational requirements (many sites had real-time factors of 50-100). Second, model adaptation using MLLR can clearly improve speech recognition performance, but it might be pointed out that in some respects the entire evaluation test data is being folded back into the training corpus, and clearly these computational costs should be factored into the overall system costs. We conclude that to achieve effective speech recognition performance, computational speed and the availability of significant levels of adaptation data can greatly impact final recognition performance, and that in adverse environments it may not be possible to formulate one universal robust feature set or adaptation method. We feel that approaches which integrate environmental sniffing, as discussed in [14,15], can have a more pronounced impact on improving recognition performance while maintaining a reasonable real-time computational factor.

REFERENCES
[1] J.H.L. Hansen, "Analysis and Compensation of Speech under Stress and Noise for Environmental Robustness in Speech Recognition," Speech Communication, 20(2):151-170, Nov. 1996.
[2] SPINE Corpus: www.ldc.upenn.edu/Catalog/LDC2000S87.html
[3] A. Acero, R.M. Stern, "Environmental Robustness in Automatic Speech Recognition," IEEE ICASSP-90, pp. 849-852, 1990.
[4] M. Gales, S. Young, "Robust continuous speech recognition using parallel model combination," IEEE Trans. SAP, 4(5):352-359, 1996.
[5] V. Digalakis, D. Rtischev, L. Neumeyer, "Speaker adaptation using constrained estimation of Gaussian mixtures," IEEE Trans. SAP, 3(5):357-366, 1995.
[6] Y. Gong, "Speech recognition in noisy environments: A survey," Speech Communication, 16:261-291, 1995.
[7] J.H.L. Hansen, "Morphological constrained enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect," IEEE Trans. SAP, 2(4):598-614, 1994.
[8] J.H.L. Hansen, M.A. Clements, "Constrained Iterative Speech Enhancement with Application to Speech Recognition," IEEE Trans. Signal Proc., 39(4):795-805, 1991.
[9] H. Hermansky, N. Morgan, "RASTA processing of speech," IEEE Trans. SAP, 2(4):578-589, 1994.
[10] C.-H. Lee, C. Lin, B. Juang, "A study on speaker adaptation of the parameters of continuous HMMs," IEEE Trans. Signal Proc., 39(4):806-814, 1991.
[11] P. Lockwood, J. Boudy, "Experiments with a Nonlinear Spectral Subtractor (NSS), HMMs and the projection, for robust speech recognition in cars," Speech Communication, 11:215-228, 1992.
[12] S.E. Bou-Ghazale, J.H.L. Hansen, "A Comparative Study of Traditional and Newly Proposed Features for Recognition of Speech Under Stress," IEEE Trans. SAP, 8(4):429-442, 2000.
[13] D. Mansour, B.H. Juang, "A family of distortion measures based upon projection operation for robust speech recognition," IEEE Trans. ASSP, 37:1659-1671, 1989.
[14] J.H.L. Hansen, J. Plucienkowski, S. Gallant, B.L. Pellom, W. Ward, "CU-Move: Robust Speech Processing for In-Vehicle Speech Systems," ICSLP-2000, (1):524-527, China, Oct. 2000.
[15] J.H.L. Hansen, B. Zhou, M. Akbacak, R. Sarikaya, B. Pellom, "Audio Stream Phrase Recognition for a National Gallery of the Spoken Word," ICSLP-2000, (3):1089-1092, China, Oct. 2000.
