Temporal Feature Selection for Noisy Speech Recognition

Ludovic Trottier, Brahim Chaib-draa, Philippe Giguère
[email protected] {chaib,philippe.giguere}@ift.ulaval.ca
Laval University, Québec, Canada

June 4, 2015


Outline

1. Automatic Speech Recognition
2. Temporal Feature Selection
3. Experiments


Problem Description

Automatic speech recognition (ASR) tries to answer the question: What is the person saying?

What does it mean? Don't care, as long as it is statistically plausible.

How does it work?

Automatic Speech Recognition

Feature Extraction: Transform the one-dimensional audio signal into a series of D-dimensional feature vectors.
Acoustic Model: Quantify the relationship between the audio signature and the words.
Language Model: Assign a likelihood to sentences ("recognize speech" vs. "wreck a nice beach").

Our contribution is related to feature extraction.
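As a rough illustration of the feature-extraction step, the following hypothetical front end (a stand-in for the MFCC pipeline used later, not the authors' code) frames a one-dimensional signal into overlapping windows and maps each frame to a D-dimensional log-band-energy vector:

```python
import numpy as np

def frames_to_features(signal, frame_len=400, hop=160, D=13):
    """Frame a 1-D signal and map each frame to a D-dimensional vector.

    Hypothetical front end: Hamming window -> magnitude spectrum ->
    log energy of D coarse frequency bands.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = np.empty((D, n_frames))
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        bands = np.array_split(spectrum, D)  # D coarse frequency bands
        feats[:, t] = [np.log(b.sum() + 1e-10) for b in bands]
    return feats  # shape (D, T): one D-dimensional vector per frame

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)  # 1 s of audio at 16 kHz (synthetic)
Phi = frames_to_features(x)
print(Phi.shape)  # (13, 98)
```

With 25 ms frames and a 10 ms hop at 16 kHz, one second of audio yields 98 feature vectors.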

Speech Feature Extraction

Speech feature extraction is usually divided into two parts:

1. Static feature extraction
2. Dynamic feature extraction

Remark: The differentiation of a noisy signal amplifies the noise.
Question: Can something else be used?
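The remark above is easy to verify numerically: for additive white noise, differencing roughly doubles the noise variance while shrinking a slowly varying signal, so the signal-to-noise ratio drops. A small illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 5 * t)           # slowly varying signal
noise = 0.1 * rng.standard_normal(t.size)   # additive white noise
noisy = clean + noise

# First-order difference (discrete derivative) of each component.
d_clean = np.diff(clean)
d_noisy = np.diff(noisy)

# Signal-to-noise ratio before and after differentiation.
snr_before = clean.var() / noise.var()
snr_after = d_clean.var() / (d_noisy - d_clean).var()
print(snr_before > snr_after)  # differencing degraded the SNR
```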

Contributions

1. Dynamic Features: In place of derivatives, we select adjacent features based on the time and the frequency.
   Motivation: Signal processing theory shows that the rate at which information changes in a signal is proportional to its frequency.

2. Temporal Feature Selection: We use the variance of adjacent feature distances to determine the offsets for selecting the most informative dynamic features.
   Motivation: Since the derivative is a linear transformation, we look for statistical linear dependencies by using a variance-based estimator.


Temporal Feature Selection

Standard dynamic features: Compute discrete time derivatives and concatenate them to the static features.

Proposed dynamic features: Compute the variance-based estimator and select adjacent (in time) static features. Concatenate them to the static features and apply the DCT-II to the final vector.

The static features of utterance n form the matrix

\[
\Phi^{(n)} = \left[\, \phi^{(n)}_{:,1} \;\cdots\; \phi^{(n)}_{:,T_n} \,\right] \in \mathbb{R}^{D \times T_n}, \quad n \in \{1, \dots, N\}
\]
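The proposed construction can be sketched in NumPy. This is a hypothetical reading of the slide, not the authors' code: the edge clamping at utterance boundaries and the unnormalized DCT-II are assumptions, and the offsets z are taken as given (one per feature dimension):

```python
import numpy as np

def dct2(x):
    """Plain (unnormalized) DCT-II: X_k = sum_n x_n cos(pi k (2n+1) / 2N)."""
    N = x.size
    n = np.arange(N)
    k = n[:, None]
    return (x * np.cos(np.pi * k * (2 * n + 1) / (2 * N))).sum(axis=1)

def dynamic_features(Phi, z):
    """Concatenate each static frame with its selected temporal neighbors.

    Phi : (D, T) static features; z : length-D offsets from the estimator.
    Boundary handling (edge clamping) is an assumption -- the slides do
    not specify it.
    """
    D, T = Phi.shape
    out = np.empty((2 * D, T))
    for t in range(T):
        neighbors = [Phi[i, min(t + z[i], T - 1)] for i in range(D)]
        vec = np.concatenate([Phi[:, t], neighbors])  # static + selected
        out[:, t] = dct2(vec)                         # DCT-II of final vector
    return out

Phi = np.arange(12, dtype=float).reshape(3, 4)  # toy (D=3, T=4) features
z = np.array([1, 2, 1])                          # toy offsets
print(dynamic_features(Phi, z).shape)            # (6, 4)
```

The resulting vectors have dimension 2D: the D static coefficients plus the D selected neighbors, decorrelated by the DCT-II.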

Proposed Variance-based Estimator

The sample variance Σ_M of the difference of neighboring static feature frames, taken over all positions t and utterances n:

\[
\Sigma_M =
\begin{bmatrix}
\mathrm{Var}\!\left(\phi^{(n)}_{1,t} - \phi^{(n)}_{1,t+1}\right) & \cdots & \mathrm{Var}\!\left(\phi^{(n)}_{1,t} - \phi^{(n)}_{1,t+M}\right) \\
\vdots & \ddots & \vdots \\
\mathrm{Var}\!\left(\phi^{(n)}_{D,t} - \phi^{(n)}_{D,t+1}\right) & \cdots & \mathrm{Var}\!\left(\phi^{(n)}_{D,t} - \phi^{(n)}_{D,t+M}\right)
\end{bmatrix}
\]

Use Σ_M and the hyper-parameter V_thresh to compute the frame position offsets z = [z_1, ..., z_D]:

\[
z_i = \arg\min_{j} \left| \Sigma^{M}_{i,j} - V_{\text{thresh}} \right|
\]
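A minimal NumPy sketch of the estimator and the offset selection. The pooling of differences across all positions t and utterances n, and the absolute deviation inside the arg min, follow the slide's notation; the exact implementation details are assumptions:

```python
import numpy as np

def variance_estimator(utterances, M):
    """Sigma[i, j-1] = Var(phi[i, t] - phi[i, t+j]) pooled over all t, n."""
    D = utterances[0].shape[0]
    Sigma = np.empty((D, M))
    for j in range(1, M + 1):
        # Differences between frames j steps apart, pooled over utterances.
        diffs = [Phi[:, :-j] - Phi[:, j:] for Phi in utterances]
        Sigma[:, j - 1] = np.concatenate(diffs, axis=1).var(axis=1)
    return Sigma

def select_offsets(Sigma, v_thresh):
    """z_i = argmin_j |Sigma[i, j] - v_thresh| (offsets are 1-based)."""
    return np.abs(Sigma - v_thresh).argmin(axis=1) + 1

# Toy data: three random "utterances" with D = 13 static features.
rng = np.random.default_rng(0)
utterances = [rng.standard_normal((13, T)) for T in (50, 80, 60)]
Sigma = variance_estimator(utterances, M=25)
z = select_offsets(Sigma, v_thresh=1.0)
print(z.shape)  # one offset per static feature dimension
```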

Example on MFCC static features

[Figure: estimated variance Σ_M for the 13 MFCC static features as a function of the position offset (1 to 25). The black dots correspond to the offsets selected when V_thresh = 1.]

Motivations for the Proposed Dynamic Features

Variation of the intensity of different frequency components: long and continuous lines imply slow variation, while short lines imply high variation.


Aurora 2 Task

Vocabulary: 11 spoken digits (zero to nine, plus "oh").
Utterance Type: Up to 7 digits in any order with possible pauses.
Noise: 8 types of noise with SNRs from -5 dB to 20 dB.

Noise Types: suburban train, crowd of people (babble), car, airport, exhibition hall, restaurant, street, train station.

Training set: 8,440 utterances. Test set: 70,070 utterances.

A Hidden Markov Model (HMM) framework was used for recognizing the utterances. We performed two digit recognition tasks:

1. Connected (whole-word modeling): 11 whole-word HMMs with 18 hidden states per HMM.
2. Continuous (phoneme modeling): 19 phoneme HMMs with 5 hidden states per HMM.

Results

Our approach: suffix -T. Original approach: suffix -D-A.

Table: Word accuracy (%) on the Aurora 2 database using whole-word HMMs.

SNR (dB):    Inf    20     15     10     5      0      -5     Avg.   R.I. (%)
MFCC-E-D-A   98.54  97.14  96.02  93.27  84.86  57.47  23.35  78.66
MFCC-E-T     97.64  97.46  96.68  94.39  88.03  71.31  38.93  83.49  22.63

Table: Word accuracy (%) on the Aurora 2 database using phoneme HMMs.

SNR (dB):    Inf    20     15     10     5      0      -5     Avg.   R.I. (%)
MFCC-E-D-A   89.89  87.24  84.41  78.87  63.78  29.86  -5.82  61.17
MFCC-E-T     93.02  94.15  92.65  88.84  79.22  56.42  19.58  74.84  35.20

1. Large improvements in very noisy environments (SNR < 10 dB).
2. Our features outperformed the original dynamic features with phoneme HMMs.
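The R.I. column is consistent with relative improvement measured on the remaining error, 100 × (new − old) / (100 − old); this interpretation is inferred from the numbers, not stated on the slide. Checking it against the table averages:

```python
def relative_improvement(old_acc, new_acc):
    """Relative improvement: fraction of the remaining error removed."""
    return 100.0 * (new_acc - old_acc) / (100.0 - old_acc)

# Average accuracies from the two tables above (MFCC-E-D-A -> MFCC-E-T).
whole_word = relative_improvement(78.66, 83.49)
phoneme = relative_improvement(61.17, 74.84)
print(round(whole_word, 2), round(phoneme, 2))  # matches the R.I. columns
```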

Results

[Figure: word accuracy Wacc (%) as a function of V_thresh (0.2 to 2.0). The seven plots correspond to the seven noise levels (Inf, 20, 15, 10, 5, 0, and -5 dB); the × markers show the maximum value of each plot.]

1. Parameter V_thresh can be adjusted to match the noise intensity.

Conclusion

Contributions
1. In place of derivatives, we used coefficient concatenation based on the time and the frequency.
2. The variance-based estimator automatically selects the optimal neighbors.

Results on the Aurora 2 task:
1. Large word accuracy improvements in very noisy environments.
2. With phoneme HMMs, our dynamic features outperformed the original features based on derivatives.

Future Work: Large-scale speech recognition with deep learning and triphone modeling.
