Temporal Feature Selection for Noisy Speech Recognition

Ludovic Trottier, Brahim Chaib-draa, Philippe Giguère

[email protected] {chaib,philippe.giguere}@ift.ulaval.ca

Laval University, Québec, Canada

June 4, 2015

Ludovic Trottier et al.

Temporal Feature Selection

June 4, 2015

1 / 17

Outline

1. Automatic Speech Recognition
2. Temporal Feature Selection
3. Experiments


Problem Description

Automatic speech recognition (ASR) tries to answer the question: What is the person saying?

What does it mean? Don't care, as long as it is statistically plausible.

How does it work?


Automatic Speech Recognition

Feature Extraction: Transform the one-dimensional audio signal into a series of D-dimensional feature vectors.
Acoustic Model: Quantify the relationship between the audio signature and the words.
Language Model: Assign a likelihood to sentences ("recognize speech" vs. "wreck a nice beach").

Our contribution is related to feature extraction.
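The way these three components fit together is usually written as a single decision rule. The formulation below is the standard statistical one and is not shown on the slides; the symbols X (the sequence of feature vectors) and W (a candidate word sequence) are introduced here only for illustration.

```latex
\hat{W} = \arg\max_{W} \; p(X \mid W)\, P(W)
```

Here feature extraction produces X, the acoustic model supplies p(X | W), and the language model supplies P(W).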


Speech Feature Extraction

Speech feature extraction is usually divided into two parts:

1. Static feature extraction
2. Dynamic feature extraction

Remark: Differentiating a noisy signal amplifies the noise.

Question: Can something else be used?
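A short worked equation can make the remark about noise amplification concrete. It is a sketch under assumptions that are not in the slides: the derivative is taken as a simple first-order difference, and the observed feature y_t is a clean feature x_t plus zero-mean noise e_t with variance σ² that is uncorrelated from frame to frame.

```latex
\begin{aligned}
y_t &= x_t + e_t, \\
y_{t+1} - y_t &= (x_{t+1} - x_t) + (e_{t+1} - e_t), \\
\operatorname{Var}(e_{t+1} - e_t) &= \operatorname{Var}(e_{t+1}) + \operatorname{Var}(e_t) = 2\sigma^2 .
\end{aligned}
```

Under these assumptions the noise variance doubles, while the clean term x_{t+1} - x_t is small whenever the feature changes slowly, so the difference has a worse signal-to-noise ratio than the frames it was computed from.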


Contributions

1. Dynamic Features
In place of derivatives, we select adjacent features based on time and frequency.
Motivation: Signal processing theory shows that the rate at which information changes in a signal is proportional to frequency.

2. Temporal Feature Selection
We use the variance of the distances between adjacent features to determine the offsets for selecting the most informative dynamic features.
Motivation: Since the derivative is a linear transformation, we look for statistical linear dependencies by using a variance-based estimator.


Temporal Feature Selection

Standard dynamic features: Compute discrete time derivatives and concatenate them to the static features.

Proposed dynamic features: Compute the variance-based estimator and select adjacent (in time) static features. Concatenate them to the static features and apply DCT-II to the final vector.

$$\Phi^{(n)} = \left[ \phi^{(n)}_{:,1} \; \dots \; \phi^{(n)}_{:,T_n} \right] \in \mathbb{R}^{D \times T_n}, \quad n \in \{1, \dots, N\}$$
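The proposed construction for a single frame can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say whether neighbors are taken from the past, the future, or both, nor how utterance edges are handled, so taking both t - z_i and t + z_i and clamping at the edges are choices made here. The names `dct_ii`, `dynamic_feature_vector`, `phi` and `offsets` are hypothetical, and the per-coefficient offsets are assumed to be given.

```python
import math


def dct_ii(x):
    """Unnormalised DCT-II: X_k = sum_n x_n * cos(pi / N * (n + 0.5) * k)."""
    n_total = len(x)
    return [
        sum(x[n] * math.cos(math.pi / n_total * (n + 0.5) * k) for n in range(n_total))
        for k in range(n_total)
    ]


def dynamic_feature_vector(phi, t, offsets):
    """Build the frame-t vector: static features plus time-shifted neighbours, then DCT-II.

    phi[i][t] is static coefficient i at frame t; offsets[i] is the frame
    offset chosen for coefficient i. Using both t - z_i and t + z_i, and
    clamping indices at the utterance edges, are assumptions of this sketch.
    """
    num_frames = len(phi[0])
    static = [row[t] for row in phi]
    past = [row[max(t - z, 0)] for row, z in zip(phi, offsets)]
    future = [row[min(t + z, num_frames - 1)] for row, z in zip(phi, offsets)]
    return dct_ii(static + past + future)
```

With D static coefficients, this sketch returns a vector of length 3D for every frame, and each coefficient looks as far away in time as its own offset dictates.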


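Proposed Variance-based Estimator

The sample variance $\Sigma_M$ of the difference of neighboring static feature frames, taken over all positions t and utterances n:

$$\Sigma_M = \begin{bmatrix} \operatorname{Var}\left(\phi^{(n)}_{1,t} - \phi^{(n)}_{1,t+1}\right) & \dots & \operatorname{Var}\left(\phi^{(n)}_{1,t} - \phi^{(n)}_{1,t+M}\right) \\ \vdots & & \vdots \\ \operatorname{Var}\left(\phi^{(n)}_{D,t} - \phi^{(n)}_{D,t+1}\right) & \dots & \operatorname{Var}\left(\phi^{(n)}_{D,t} - \phi^{(n)}_{D,t+M}\right) \end{bmatrix}$$

Use $\Sigma_M$ and the hyper-parameter $V_{\text{thresh}}$ to compute the frame position offsets $z = [z_1, \dots, z_D]$:

$$z_i = \arg\min_j \left| \left(\Sigma_M\right)_{i,j} - V_{\text{thresh}} \right|$$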
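The estimator and the offset selection can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names and the nested-list data layout (one list of D rows per utterance) are assumptions, and the arg min is read as minimizing the absolute distance to the threshold.

```python
from statistics import variance


def variance_matrix(utterances, max_offset):
    """Return sigma[i][j - 1] = Var(phi[i][t] - phi[i][t + j]), pooled over t and utterances.

    Each utterance is a list of D rows (one per static coefficient), and each
    row holds that coefficient's value at every frame. At least two frame
    pairs per (i, j) are needed for the sample variance to be defined.
    """
    num_coeffs = len(utterances[0])
    sigma = []
    for i in range(num_coeffs):
        row = []
        for j in range(1, max_offset + 1):
            diffs = [
                utt[i][t] - utt[i][t + j]
                for utt in utterances
                for t in range(len(utt[i]) - j)
            ]
            row.append(variance(diffs))
        sigma.append(row)
    return sigma


def select_offsets(sigma, v_thresh):
    """z_i = argmin_j |sigma[i][j] - v_thresh|, returned as 1-based frame offsets."""
    offsets = []
    for row in sigma:
        best = min(range(len(row)), key=lambda j: abs(row[j] - v_thresh))
        offsets.append(best + 1)
    return offsets
```

A call such as `select_offsets(variance_matrix(train_utterances, 25), 1.0)` would then give one offset per coefficient, where `train_utterances` and the value 25 for M are placeholders chosen for illustration.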


Example on MFCC static features

[Figure: variance (scale from 0.25 to 2.25) for MFCC coefficients 1 to 13 over position offsets 1 to 25.]

The black dots correspond to the offsets when $V_{\text{thresh}} = 1$.


Motivations for the Proposed Dynamic Features

Variation of the intensity of different frequency components:
Long, continuous lines imply slow variation.
Short lines imply fast variation.


Aurora 2 Task

Vocabulary: 11 spoken digits (zero to nine, plus "oh").
Utterance Type: Up to 7 digits in any order with possible pauses.
Noise: 8 types of noise with SNRs from -5 dB to 20 dB.
Noise Types: suburban train, exhibition hall, crowd of people (babble), restaurant, car, street, airport, train station.

Training set has 8,440 utterances. Test set has 70,070 utterances.

A Hidden Markov Model (HMM) framework was used for recognizing the utterances. We performed two digit recognition tasks:

1. Connected (whole-word modeling): 11 whole-word HMMs with 18 hidden states per HMM.
2. Continuous (phoneme modeling): 19 phoneme HMMs with 5 hidden states per HMM.


Results

Our approach: suffix -T
Original approach: suffix -D-A

Table: Word accuracy (%) on the Aurora 2 database using whole-word HMMs, by SNR (dB).

Features     Inf    20     15     10     5      0      -5     Avg.   R.I. (%)
MFCC-E-D-A   98.54  97.14  96.02  93.27  84.86  57.47  23.35  78.66
MFCC-E-T     97.64  97.46  96.68  94.39  88.03  71.31  38.93  83.49  22.63

Table: Word accuracy (%) on the Aurora 2 database using phoneme HMMs, by SNR (dB).

Features     Inf    20     15     10     5      0      -5     Avg.   R.I. (%)
MFCC-E-D-A   89.89  87.24  84.41  78.87  63.78  29.86  -5.82  61.17
MFCC-E-T     93.02  94.15  92.65  88.84  79.22  56.42  19.58  74.84  35.20

1. Large improvements in very noisy environments (SNR < 10 dB).
2. With phoneme HMMs, outperformed the original dynamic features at every noise level.
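The R.I. column is not defined on the slide. Assuming it denotes the relative reduction in word error rate computed from the average accuracies, which is the reading that reproduces both reported values (22.63 and 35.20), it can be checked with a few lines. The function name is hypothetical.

```python
def relative_improvement(baseline_acc, new_acc):
    """Relative reduction in word error rate (%), from two word accuracies (%)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# Average word accuracies taken from the two tables.
whole_word = relative_improvement(78.66, 83.49)
phoneme = relative_improvement(61.17, 74.84)
print(f"whole-word HMMs: {whole_word:.2f} %")
print(f"phoneme HMMs:    {phoneme:.2f} %")
```

Read this way, an average accuracy gain of 4.83 points with whole-word HMMs removes a little over a fifth of the baseline errors.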


Results

The seven plots correspond to the seven noise levels. The × markers show the maximum value of each plot.

[Figure: word accuracy Wacc (%) as a function of $V_{\text{thresh}}$ (0.2 to 2) for each noise level: Inf, 20, 15, 10, 5, 0 and -5 dB.]

1. The parameter $V_{\text{thresh}}$ can be adjusted to match the noise intensity.


Conclusion

Contributions
1. In place of derivatives, we used coefficient concatenation based on time and frequency.
2. The variance-based estimator automatically selects the optimal neighbors.

Results on the Aurora 2 task
1. Large word accuracy improvements in very noisy environments.
2. With phoneme HMMs, our dynamic features outperformed the original derivative-based features.

Future Work
Large-scale speech recognition with deep learning and triphone modeling.