Discriminative Acoustic Language Recognition via Channel-Compensated GMM Statistics
Niko Brümmer, Albert Strasheim, Valiantsina Hubeika, Pavel Matějka, Lukáš Burget and Ondřej Glembek

Outline
• Introduction
• Relevant prior work
• Proposed method
• Experimental results
• Conclusion

Introduction
1. What is GMM-based acoustic language recognition?
2. Focus of this talk.

General recipe for GMM-based Acoustic Language Recognition
1. Build a feature extractor which maps: speech segment --> sequence of feature vectors.
2. Pretend these features are produced by language-dependent Gaussian mixture models (GMMs).
3. Train GMM parameters on typically several hours of speech per language.
4. For a new test speech segment of unknown language:
   • compute language likelihoods,
   • given priors and costs, make minimum-expected-cost language recognition decisions (see the sketch below).
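To make step 4 concrete, here is a minimal sketch of minimum-expected-cost decision making from language likelihoods, a prior and a cost matrix; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def min_expected_cost_decision(log_likelihoods, log_prior, cost):
    """Pick the language decision with minimum expected cost.

    log_likelihoods: (L,)   log P(test speech | language GMM) for each language
    log_prior:       (L,)   log prior probability of each language
    cost:            (L, L) cost[d, l] = cost of deciding d when the true language is l
    """
    # Posterior over languages via Bayes' rule (normalized in the log domain).
    log_post = log_likelihoods + log_prior
    log_post -= np.logaddexp.reduce(log_post)
    posterior = np.exp(log_post)
    # Expected cost of each candidate decision; choose the cheapest.
    expected_cost = cost @ posterior
    return int(np.argmin(expected_cost))
```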

Introduction
1. What is GMM-based acoustic language recognition?
2. Focus of this talk.

Discriminative Acoustic Language Recognition via Channel-Compensated GMM Statistics
• Channel compensation works for both generatively and discriminatively trained language models.
• This talk will emphasize the more interesting channel-compensation part.
• For details of the discriminative training, please see the full paper.

Outline
• Introduction
• Relevant prior work
• Proposed method
• Experimental results
• Conclusion

GMM generations
Prior work:
• 1G: One MAP-trained GMM per language.
• 2G: One MMI-trained GMM per language.
This paper:
• 3G: One GMM per test segment.

1G: Training
Language GMMs are trained independently for every language, with a MAP criterion, e.g.:
English GMM = arg max P( GMM parameters, English data )

1G: Test
New test speech segments are scored by directly evaluating GMM likelihoods, e.g.:
English score = P( test speech | English GMM )
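For concreteness, a sketch of this direct likelihood evaluation for a diagonal-covariance GMM; the numpy-based helper below is illustrative, not the authors' code.

```python
import numpy as np

def gmm_log_likelihood(features, weights, means, variances):
    """log P(features | GMM) for a diagonal-covariance GMM.

    features:  (T, D) feature vectors of one speech segment
    weights:   (C,)   mixture weights
    means:     (C, D) component means
    variances: (C, D) diagonal component variances
    """
    # Per-frame, per-component Gaussian log-densities.
    diff = features[:, None, :] - means[None, :, :]                      # (T, C, D)
    log_gauss = -0.5 * (np.log(2 * np.pi * variances)[None]
                        + diff ** 2 / variances[None]).sum(-1)           # (T, C)
    # Sum over components (log-sum-exp), then over frames.
    frame_ll = np.logaddexp.reduce(np.log(weights)[None] + log_gauss, axis=1)
    return frame_ll.sum()

# e.g. English score = gmm_log_likelihood(test_features, w_eng, mu_eng, var_eng)
```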

2G: Training
GMM parameters for all languages are adjusted simultaneously with a discriminative MMI criterion, to maximize the product of posteriors:
P( true language | training example, parameters )
over all the training examples of all the languages.
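Written out, the MMI criterion described above takes the standard form (our notation, not copied from the paper):

$$\hat{\Theta} \;=\; \arg\max_{\Theta} \prod_{i} P(\ell_i \mid X_i, \Theta)
\;=\; \arg\max_{\Theta} \prod_{i} \frac{P(X_i \mid \lambda_{\ell_i})\,P(\ell_i)}{\sum_{\ell} P(X_i \mid \lambda_{\ell})\,P(\ell)},$$

where $X_i$ is the $i$-th training segment, $\ell_i$ its true language, and $\lambda_\ell$ the GMM parameters of language $\ell$ (all contained in $\Theta$).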

2G: Test
Test scoring is identical to 1G, e.g.:
English score = P( test speech | English GMM )

Comparison: MMI vs MAP
• Accuracy: MMI is much better than MAP.
• Training: MMI requires significantly more CPU and memory resources than MAP.
• Test scoring: pure GMM solutions are slow, e.g. compared to some GMM-SVM hybrid solutions.

Outline
• Introduction
• Prior work
• Proposed method
• Experimental results
• Conclusion

Proposed method
• Motivation
• Advantages
• Key differences
• Training
• Testing
• Results

Motivation
This work on language recognition was motivated by recent advances in GMM text-independent speaker recognition and is based on Patrick Kenny's work on Joint Factor Analysis.

Proposed Method vs MAP & MMI
• Advantages:
  – Matches or exceeds accuracy of MMI
  – Faster to train than MMI
  – Very fast test scoring, similar to fast SVM solutions.

• Disadvantage:
  – More difficult to explain, but that is what we will attempt in the rest of this talk.

Key differences from prior work
• A simplifying approximation to P(data | GMM) makes training and test scoring fast.
• 2-layer generative modeling

Approximation to P(data | GMM)
• We use the auxiliary function for the classical EM algorithm for GMMs, which is a lower-bound approximation to the GMM log-likelihood.
• The approximation is done relative to a language-independent GMM called the UBM.
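Concretely, the auxiliary function is the standard Jensen lower bound with responsibilities $\gamma_c(x_t)$ computed from the UBM (standard EM notation, not taken verbatim from the paper):

$$\log P(X \mid \lambda) \;\ge\; Q(\lambda;\,\mathrm{UBM}, X) \;=\; \sum_{t=1}^{T}\sum_{c=1}^{C} \gamma_c(x_t)\,\log\frac{w_c\,\mathcal{N}(x_t;\,\mu_c,\Sigma_c)}{\gamma_c(x_t)},$$

with equality when $\lambda$ equals the UBM, and a quadratic dependence on the means $\mu_c$.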

[Figure: GMM likelihood approximation. The log-likelihood log P(data | GMM), plotted over the GMM parameter space, is approximated from below by the quadratic EM auxiliary function Q(GMM; UBM, data), constructed at the UBM.]

Sufficient stats
• The EM-auxiliary approximation allows us to replace the variable-length sequences of feature vectors with sufficient statistics of fixed size.
• The whole input speech segment (e.g. 30 s long) is mapped to a sufficient statistic.
• This allows us to iterate our algorithms over thousands of segment statistics, rather than over millions of feature vectors.
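A sketch of the zero- and first-order statistics collected against the UBM, which is the usual fixed-size representation for this kind of model; a diagonal-covariance UBM and illustrative names are assumed.

```python
import numpy as np

def ubm_sufficient_stats(features, ubm_weights, ubm_means, ubm_variances):
    """Map a whole segment (T frames) to fixed-size statistics (N, F).

    Returns:
      N: (C,)   zero-order stats, expected frame counts per UBM component
      F: (C, D) first-order stats, responsibility-weighted sums of frames
    """
    diff = features[:, None, :] - ubm_means[None, :, :]
    log_gauss = -0.5 * (np.log(2 * np.pi * ubm_variances)[None]
                        + diff ** 2 / ubm_variances[None]).sum(-1)
    log_resp = np.log(ubm_weights)[None] + log_gauss
    log_resp -= np.logaddexp.reduce(log_resp, axis=1, keepdims=True)
    resp = np.exp(log_resp)              # (T, C) UBM responsibilities
    N = resp.sum(axis=0)                 # (C,)
    F = resp.T @ features                # (C, D)
    return N, F
```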

Key differences
• A simplifying approximation to P(data | GMM) makes training and test scoring fast.
• 2-layer generative modeling
  – Generative model for GMMs
  – GMMs generate feature vectors

2-layer generative GMM modeling
1. In the hidden layer, a new GMM is generated for every speech segment, according to a language-conditional probability distribution of GMMs.
2. In the output layer, the segment GMM generates the sequence of feature vectors of a speech segment.

[Figure: in feature space, several English speech segments and several French speech segments each give rise to their own GMM, scattered around the central English GMM and the central French GMM respectively. This scatter is the intersession, or 'channel', variability.]

GMM parameter supervectors
• All GMM variances and weights are constant.
• Different GMMs are represented as the concatenation of the mean vectors of all the components.
• These vectors are known as supervectors.
• We used 2048 GMM components of dimension 56, giving a supervector of size ≈ 10^5.
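A tiny illustration of the supervector representation (illustrative only):

```python
import numpy as np

# With constant weights and variances, a GMM is fully described by its stacked means.
n_components, dim = 2048, 56
component_means = np.zeros((n_components, dim))   # e.g. initialized from the UBM
supervector = component_means.reshape(-1)
print(supervector.size)                           # 114688, i.e. ≈ 10^5 dimensions
```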

[Figure: GMM supervector space, with regions for the English, French, Spanish and German GMMs. The feature sequence of a German speech segment maps to a segment GMM in the German region; another German segment maps to a nearby, but different, segment GMM.]

Segment GMMs are normally distributed with:
• language-dependent means and
• a low-rank, shared, language-independent covariance.
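A minimal sketch of this hidden layer, under the stated model: each segment supervector is drawn from a Gaussian with a language-dependent mean and a shared low-rank covariance U Uᵀ. The names (and the explicit sampling) are illustrative, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_segment_supervector(language_mean, U):
    """Draw one segment GMM supervector.

    language_mean: (S,)   central supervector of the language (S ≈ 1e5)
    U:             (S, R) low-rank 'channel' loading matrix (R e.g. 50)
    """
    x = rng.standard_normal(U.shape[1])   # hidden channel factor, x ~ N(0, I)
    return language_mean + U @ x          # supervector ~ N(language_mean, U @ U.T)
```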

Proposed method
• Motivation
• Advantages
• Key differences
• Training
• Testing
• Results

Training
• Training the language recognizer is the estimation of:
  – the language-dependent means and
  – the shared covariance
  of the GMM distributions.
• Done via an EM algorithm to maximize an ML criterion, over all of the training data for all of the languages.

[Figure: training data for the distributions of GMM supervectors, shown in GMM supervector space. Every dot is a GMM; the dots cluster into English, French, German and other-language segments.]

Problem: the training data is hidden. These GMMs are not given; we are given only the observed feature sequences.

EM algorithm
The problem is solved with an EM algorithm, which iteratively:
1. Estimates the hidden GMMs
2. Estimates the distribution of those GMMs.

EM Algorithm
[Figure: animated illustration with English and French clusters of segment GMMs in supervector space.]
• Initialization: random within-class covariance.
• E-step: estimate hidden GMMs, given the current within-class covariance.
• M-step: maximize the likelihood of the within-class covariance, given the current GMM estimates.
• and so on: E, M, E, M, E, M, …
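As a minimal runnable sketch of this E/M alternation, simplified so that each segment is represented by a point-estimated supervector (the real system works on sufficient statistics) and the residual noise is isotropic; all names are illustrative and the update formulas are the standard factor-analysis ones, not copied from the paper.

```python
import numpy as np

def em_within_class_subspace(Y, labels, means, rank=2, noise_var=1.0, n_iters=20):
    """Toy EM for y = means[label] + U x + noise, with x ~ N(0, I).

    Y:      (n_segments, S)  per-segment supervector estimates (toy observations)
    labels: (n_segments,)    language index of each segment
    means:  (n_languages, S) central supervector per language
    Returns the shared low-rank within-class loading matrix U, (S, rank).
    """
    n, S = Y.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((S, rank))             # random within-class covariance
    R = Y - means[labels]                          # remove the language means
    for _ in range(n_iters):
        # E-step: posterior of each hidden factor, given the current U.
        prec = np.eye(rank) + U.T @ U / noise_var  # posterior precision
        cov = np.linalg.inv(prec)
        X = (R @ U / noise_var) @ cov              # (n, rank) posterior means
        # M-step: maximize the likelihood of U, given the current factor estimates.
        Exx = n * cov + X.T @ X                    # sum of E[x x^T]
        U = (R.T @ X) @ np.linalg.inv(Exx)
    return U
```

The columns of the learned U span the within-class ('channel') subspace used in the next section.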

Proposed method
• Motivation
• Advantages
• Key differences
• Training
• Testing
• Results

To score a new test speech segment of unknown language:
1. 'Channel' compensation: approximately remove intra-class variation from the sufficient statistic of each test segment.
2. Score the compensated statistic against the central GMM of each language, as if there were no intra-class variance.

[Figure: the feature sequence of unknown language gives a segment GMM near the UBM, together with a language-independent estimate of the within-class deviation, indicating the direction of within-class variability.]

'Channel' compensation: modify the statistic to shift the GMM estimate. (The shift is confined to a 50-dimensional subspace of the ≈100 000-dimensional GMM space; the segment GMM and the UBM do not coincide.)
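A sketch of what such a statistics-level shift can look like, under the simplifying assumption of unit-variance UBM components (so covariances drop out of the formulas); this is illustrative rather than the authors' exact recipe.

```python
import numpy as np

def channel_compensate_stats(N, F, ubm_means, U):
    """Shift the first-order statistics to remove the estimated channel offset.

    N:         (C,)     zero-order stats of the test segment
    F:         (C, D)   first-order stats of the test segment
    ubm_means: (C, D)   UBM component means
    U:         (C*D, R) low-rank within-class ('channel') subspace, R e.g. 50
    """
    C, D = F.shape
    # Centre the first-order stats around the UBM and flatten to a supervector.
    f = (F - N[:, None] * ubm_means).reshape(-1)           # (C*D,)
    # MAP point estimate of the channel factor x (unit-variance components assumed).
    NU = np.repeat(N, D)[:, None] * U                       # N-weighted rows of U
    x = np.linalg.solve(np.eye(U.shape[1]) + U.T @ NU, U.T @ f)
    # Remove the estimated channel shift from the statistics.
    F_comp = F - (NU @ x).reshape(C, D)
    return F_comp
```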

To score a new test speech segment of unknown language:
1. Channel compensation: approximately remove intra-class variation from the sufficient stat of each test segment.
2. Score the compensated statistic against the central GMM of each language, as if there were no intra-class variance.

Test scoring
• The channel-compensated test-segment statistic is scored against each language model, using a simplified, fast approximation to the language likelihood, e.g.:
English score ≈ log P( test data | central English GMM )
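A sketch of the kind of fast scoring this enables, again assuming unit-variance UBM components and dropping all language-independent constants (illustrative, not the paper's exact expression). The EM-auxiliary approximation makes the score a simple function of the statistics and the central language means.

```python
import numpy as np

def fast_language_scores(N, F_comp, language_means):
    """Approximate log P(test data | central language GMM), up to a shared constant.

    N:              (C,)      zero-order stats of the test segment
    F_comp:         (C, D)    channel-compensated first-order stats
    language_means: (L, C, D) central component means of each language
    Returns an (L,) vector of scores, one per language.
    """
    # score_l = sum_c [ F_c . mu_lc - 0.5 * N_c * ||mu_lc||^2 ] + const.
    linear = np.einsum('cd,lcd->l', F_comp, language_means)
    quad = 0.5 * np.einsum('c,lcd,lcd->l', N, language_means, language_means)
    return linear - quad
```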

Outline
• Introduction
• Relevant prior work
• Proposed method
• Experimental results
• Conclusion

Does it work?
Error-rate on NIST LRE'07, 14 languages, 30 sec test segments.
• Baseline: one MAP GMM per language: 11.32%
• Proposed method: one MAP GMM per language, with channel compensation of each test segment: 1.74%

Results* for NIST LRE 2009 (not in paper)
Evaluation data, 23 languages

System                              30s       10s       3s
GMM 2048G - Maximum Likelihood      7.33%     10.23%    18.91%
JFA 2048G, U - 200dim               3.25%     6.47%     16.40%

* After bugs were fixed.

Conclusion
We have demonstrated, by experiments on NIST LRE 2007 and 2009, that recipes similar to Patrick Kenny's GMM factor-analysis modeling for speaker recognition, implemented using sufficient statistics, also work to build fast and accurate acoustic language recognizers.
