Temporal Feature Selection for Noisy Speech Recognition
Ludovic Trottier, Brahim Chaib-draa, Philippe Giguère
[email protected], {chaib,philippe.giguere}@ift.ulaval.ca
Laval University, Québec, Canada
June 4, 2015
Outline
1. Automatic Speech Recognition
2. Temporal Feature Selection
3. Experiments
Problem Description
Automatic speech recognition (ASR) tries to answer the question: What is the person saying?
What does it mean? We don't care, as long as the answer is statistically plausible.
How does it work?
Automatic Speech Recognition
Feature Extraction: Transform the one-dimensional audio signal into a series of D-dimensional feature vectors.
Acoustic Model: Quantify the relationship between the audio signature and the words.
Language Model: Assign a likelihood to sentences ("recognize speech" vs "wreck a nice beach").
Our contribution is related to feature extraction.
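The feature extraction step can be illustrated with a minimal sketch. It is not the MFCC front end used in this work: the function `extract_features`, the frame length, the hop size and the band pooling are all assumptions chosen only to show how a one-dimensional signal becomes a D x T matrix of feature vectors.

```python
import numpy as np

def extract_features(signal, frame_len=400, hop=160, n_dims=13):
    """Toy static feature extractor (hypothetical, NOT MFCC).

    Cuts the signal into overlapping frames and returns an
    (n_dims x n_frames) matrix of log band energies.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = np.empty((n_dims, n_frames))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Pool the power spectrum into n_dims contiguous bands.
        bands = np.array_split(power, n_dims)
        feats[:, t] = np.log(np.array([b.sum() for b in bands]) + 1e-10)
    return feats

# One second of synthetic "audio" at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
features = extract_features(signal)
print(features.shape)  # D x T
```

Each column of `features` plays the role of one D-dimensional feature vector; the acoustic model then consumes the sequence of columns.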
Speech Feature Extraction
Speech feature extraction is usually divided into two parts:
1. Static feature extraction
2. Dynamic feature extraction
Remark: The differentiation of a noisy signal amplifies the noise.
Question: Can something else be used?
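The remark on differentiation can be checked numerically. The sketch below uses a synthetic slow sinusoid and white Gaussian noise (both assumptions, not data from this work): a first-order difference roughly doubles the variance of white noise while shrinking a slowly varying signal, so the signal-to-noise ratio drops.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(10000)
clean = np.sin(2 * np.pi * t / 500.0)      # slowly varying "feature" trajectory
noise = 0.1 * rng.standard_normal(t.size)  # additive white noise

def snr_db(sig, noi):
    """Signal-to-noise ratio in dB from signal and noise samples."""
    return 10 * np.log10(np.mean(sig ** 2) / np.mean(noi ** 2))

snr_before = snr_db(clean, noise)
# Differencing is linear, so it acts separately on signal and noise.
snr_after = snr_db(np.diff(clean), np.diff(noise))

# For i.i.d. noise, Var(e[t+1] - e[t]) = 2 * Var(e).
noise_gain = np.var(np.diff(noise)) / np.var(noise)
print(snr_before, snr_after, noise_gain)
```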
Contributions
1. Dynamic Features: In place of derivatives, we select adjacent features based on the time and the frequency.
   Motivations: Signal processing theories show that the rate at which information changes in signals is proportional to frequency.
2. Temporal Feature Selection: We use the variance of adjacent feature distances to determine the offsets for selecting the most informative dynamic features.
   Motivations: Since the derivative is a linear transformation, we look for statistical linear dependencies by using a variance-based estimator.
Temporal Feature Selection
Standard dynamic features: Compute discrete time derivatives and concatenate them to the static features.
Proposed dynamic features: Compute the variance-based estimator and select adjacent (in time) static features. Concatenate them to the static features and apply DCT-II to the final vector.
Static features of utterance n: $\Phi^{(n)} = \left[ \phi^{(n)}_{:,1} \; \dots \; \phi^{(n)}_{:,T_n} \right] \in \mathbb{R}^{D \times T_n}, \quad n \in \{1, \dots, N\}$
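The two kinds of dynamic features can be contrasted in a short sketch. Several details are assumptions that the slides do not specify: the derivative uses the common regression formula over a window of K frames, the proposed features take symmetric neighbors at t - z_i and t + z_i for coefficient i, edges are handled by repeating the first and last frames, and the DCT-II is unnormalized. The names `delta`, `dct2` and `temporal_features` are hypothetical.

```python
import numpy as np

def delta(feats, K=2):
    """Regression-based discrete time derivative of a D x T feature matrix."""
    T = feats.shape[1]
    padded = np.pad(feats, ((0, 0), (K, K)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, K + 1))
    out = np.zeros(feats.shape, dtype=float)
    for k in range(1, K + 1):
        out += k * (padded[:, K + k : K + k + T] - padded[:, K - k : K - k + T])
    return out / denom

def dct2(x):
    """Unnormalized DCT-II along axis 0."""
    L = x.shape[0]
    n = np.arange(L)
    basis = np.cos(np.pi / L * (n[None, :] + 0.5) * n[:, None])  # basis[k, n]
    return basis @ x

def temporal_features(feats, z):
    """Concatenate neighbors at per-coefficient offsets z, then apply DCT-II.

    ASSUMPTION: neighbors are taken symmetrically at t - z[i] and t + z[i].
    """
    D, T = feats.shape
    t = np.arange(T)
    past = np.empty((D, T))
    future = np.empty((D, T))
    for i in range(D):
        past[i] = feats[i, np.clip(t - z[i], 0, T - 1)]
        future[i] = feats[i, np.clip(t + z[i], 0, T - 1)]
    return dct2(np.vstack([past, feats, future]))

rng = np.random.default_rng(0)
static = rng.standard_normal((13, 50))   # placeholder static features
offsets = np.arange(1, 14)               # placeholder offsets, one per coefficient
standard = np.vstack([static, delta(static), delta(delta(static))])
proposed = temporal_features(static, offsets)
print(standard.shape, proposed.shape)
```

Under these assumptions both variants yield 3D-dimensional vectors per frame, but the proposed one never subtracts noisy frames from each other.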
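Proposed Variance-based Estimator
The sample variance $\Sigma_M$ of the difference of neighboring static feature frames, taken over all positions t and utterances n:
$$\Sigma_M = \begin{bmatrix} \mathrm{Var}\left(\phi^{(n)}_{1,t} - \phi^{(n)}_{1,t+1}\right) & \dots & \mathrm{Var}\left(\phi^{(n)}_{1,t} - \phi^{(n)}_{1,t+M}\right) \\ \vdots & & \vdots \\ \mathrm{Var}\left(\phi^{(n)}_{D,t} - \phi^{(n)}_{D,t+1}\right) & \dots & \mathrm{Var}\left(\phi^{(n)}_{D,t} - \phi^{(n)}_{D,t+M}\right) \end{bmatrix}$$
Use $\Sigma_M$ and hyper-parameter $V_{\text{thresh}}$ to compute the frame position offsets $z = [z_1, \dots, z_D]$:
$$z_i = \arg\min_j \left| \Sigma_M^{i,j} - V_{\text{thresh}} \right|$$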
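The estimator and the offset rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: differences are pooled over all utterances before taking an unbiased sample variance, the function names are hypothetical, and the data is a synthetic random walk whose lag-j difference variance grows as j times the squared step size.

```python
import numpy as np

def variance_estimator(utterances, M):
    """Return the D x M matrix Sigma_M from a list of D x T_n arrays.

    Column j-1 holds Var(phi[i, t] - phi[i, t+j]) pooled over t and n.
    """
    D = utterances[0].shape[0]
    sigma = np.empty((D, M))
    for j in range(1, M + 1):
        diffs = np.concatenate(
            [phi[:, :-j] - phi[:, j:] for phi in utterances if phi.shape[1] > j],
            axis=1,
        )
        sigma[:, j - 1] = diffs.var(axis=1, ddof=1)
    return sigma

def select_offsets(sigma, v_thresh):
    """z_i = argmin_j |Sigma_M[i, j] - V_thresh|, with offsets counted from 1."""
    return np.argmin(np.abs(sigma - v_thresh), axis=1) + 1

# Synthetic data: coefficient 0 changes quickly, coefficient 1 slowly.
rng = np.random.default_rng(0)
steps = np.array([[1.0], [0.5]])
utterances = [np.cumsum(steps * rng.standard_normal((2, 400)), axis=1)
              for _ in range(50)]

sigma = variance_estimator(utterances, M=6)
z = select_offsets(sigma, v_thresh=1.0)
print(z)  # the slowly varying coefficient should get the larger offset
```

The slowly varying coefficient needs a larger offset before its difference variance reaches the threshold, which matches the motivation that low-frequency components change more slowly.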
Example on MFCC static features
[Figure: variance (y-axis) versus position offset (x-axis, 1 to 25), one curve per MFCC coefficient (1 to 13).]
The black dots correspond to the offsets when Vthresh = 1.
Motivations for the Proposed Dynamic Features
Variation of the intensity of different frequency components.
Long and continuous lines imply slow variation.
Short lines imply rapid variation.
Aurora 2 Task
Vocabulary: 11 spoken digits (zero to nine, plus "oh").
Utterance Type: Up to 7 digits in any order with possible pauses.
Noise: 8 types of noise with SNRs from -5 dB to 20 dB.
Noise Types: Suburban train, Exhibition hall, Crowd of people (babble), Restaurant, Car, Street, Airport, Train station.
Training set has 8,440 utterances. Test set has 70,070 utterances.
A Hidden Markov Model (HMM) framework was used for recognizing the utterances.
We performed two digit recognition tasks:
1. Connected (whole-word modeling): 11 whole-word HMMs with 18 hidden states per HMM.
2. Continuous (phoneme modeling): 19 phoneme HMMs with 5 hidden states per HMM.
Results
Our approach: suffix -T. Original approach: suffix -D-A.

Table: Word accuracy (%) on the Aurora 2 database using whole-word HMMs.

Features \ SNR (dB) | Inf   | 20    | 15    | 10    | 5     | 0     | -5    | Avg.  | R.I. (%)
MFCC-E-D-A          | 98.54 | 97.14 | 96.02 | 93.27 | 84.86 | 57.47 | 23.35 | 78.66 |
MFCC-E-T            | 97.64 | 97.46 | 96.68 | 94.39 | 88.03 | 71.31 | 38.93 | 83.49 | 22.63

Table: Word accuracy (%) on the Aurora 2 database using phoneme HMMs.

Features \ SNR (dB) | Inf   | 20    | 15    | 10    | 5     | 0     | -5    | Avg.  | R.I. (%)
MFCC-E-D-A          | 89.89 | 87.24 | 84.41 | 78.87 | 63.78 | 29.86 | -5.82 | 61.17 |
MFCC-E-T            | 93.02 | 94.15 | 92.65 | 88.84 | 79.22 | 56.42 | 19.58 | 74.84 | 35.20

1. Large improvements in very noisy environments (SNR < 10 dB).
2. Outperformed the original dynamic features with phoneme HMMs.
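The R.I. column is not defined on the slide. The reported values are consistent with a relative reduction of the average word error, which the following sketch reproduces from the Avg. column. The function name `relative_improvement` is hypothetical, and the formula is an inference from the numbers rather than a definition given by the authors.

```python
def relative_improvement(baseline_acc, proposed_acc):
    """Relative reduction of the word error (100 - accuracy), in percent."""
    return 100.0 * (proposed_acc - baseline_acc) / (100.0 - baseline_acc)

# Average word accuracies taken from the two tables.
whole_word_ri = relative_improvement(78.66, 83.49)
phoneme_ri = relative_improvement(61.17, 74.84)
print(round(whole_word_ri, 2), round(phoneme_ri, 2))
```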
Results
[Figure: word accuracy Wacc (%) versus Vthresh (0.2 to 2), one plot per noise level: Inf, 20, 15, 10, 5, 0 and -5 dB.]
The seven plots correspond to the seven noise levels.
The × markers show the maximum value of each plot.
1. Parameter Vthresh can be adjusted to match the noise intensity.
Conclusion
Contributions
1. In place of derivatives, we used coefficient concatenation based on the time and the frequency.
2. The variance-based estimator automatically selects the optimal neighbors.
Results on the Aurora 2 task:
1. Large word accuracy improvements in very noisy environments.
2. With phoneme HMMs, our dynamic features outperformed the original ones based on derivatives.
Future Work
Large-scale speech recognition with deep learning and triphone modeling.