Acoustical Society of America 156th Meeting, Speech Communication Special Session A Quantal Transition: Ken Stevens in "Retirement"
Consonant Landmarks: Automatic Detection and Interpretation Chiyoun Park, Nancy Chen
This talk is primarily based on Chiyoun Park’s PhD thesis, Consonant Landmark Detection for Speech Recognition. More information can be found at http://www.mit.edu/~nancyc
Overview
Part I: Introduction to Consonant Landmarks
• What are landmarks?
• What information can be predicted from landmarks?
Part II: Landmark Detection
• Step 1: Landmark Candidate Detection
• Step 2: Landmark Sequence Determination
• Step 3: Graphical Representation of Reliability and Ambiguity
PART I
Introduction to Consonant Landmarks
Speech Information: Not Uniformly Distributed
[Figure: spectrogram of “FEEL” (frequency vs. time), marking regions of abrupt discontinuity, steady state, and gradual change]
Landmarks
Consonant
- Information-rich
- Abrupt change
- Focused analysis required
Vowel
- Steady-state, landmark at max. F1 amplitude
- Long time-period
- Robust to noise
Glide
- Slow transition, landmark at min. F1 amplitude
- Limited phonetic contexts
[Figure: spectrogram of “FEEL” (frequency vs. time)]
Speech Information: Not Uniformly Distributed Many acoustic cues near consonant landmarks Spectrum Place of Articulation
Low-Freq Energy Unvoiced
Formants Labial Front / High
Speech Information: Not Uniformly Distributed Much acoustic information at consonant landmarks Spectrum Place of Articulation
Low-Freq Energy Unvoiced
Formants Labial Front / High
Not all speech segments are created equal
• Information near consonants is crucial for recognizing words in perceptual studies (Jenkins et al. 1983, Furui 1986)
– CV transitions are more important than vowels in recognizing words
When do consonant landmarks occur? When the vocal tract changes its shape among the configurations below:
[Diagram of four vocal-tract configurations, each driven by the glottal source]
- Vowel: open oral cavity
- Frication/Burst: turbulence at a constriction in the oral cavity
- Nasal: closure, with a side-branch and the nasal passage
- Stop Closure: closure in the mouth
g-Landmarks
g-landmarks lie between free glottal vibration (sonorant sounds: vowels & sonorant consonants) and suppressed or no glottal vibration (obstruent consonants).
[Same vocal-tract diagram, with the g-landmarks marked]
Direction of Energy Change
[Same vocal-tract diagram; each g-landmark between free glottal vibration (sonorant sounds) and suppressed or no glottal vibration (obstruent consonants) is marked + or - according to the direction of the energy change, with an example]
s-Landmarks
[Same vocal-tract diagram, now also marking s-landmarks (+ and -) within the sonorant sounds (vowels & sonorant consonants), alongside the g-landmarks, with an example]
b-Landmarks
[Same vocal-tract diagram, now also marking b-landmarks (+ and -) within the obstruent consonants, alongside the g-landmarks and s-landmarks, with an example]
What do landmarks tell us? Landmark Types
Broad class of adjacent segments:
Landmark   Segment before    Segment after
+b         Silence           Noise
+g         Obstruent         Sonorant
-s         Vowel             Nasal / Lateral
+s         Nasal / Lateral   Vowel
-g         Sonorant          Obstruent
What do landmarks tell us? Landmark Sequence
+b [Stop/Fricative] +g [Vowel/Glide] -s [Sonorant consonants] +s [Vowel/Glide] -g
Coarse CV Structure
What do landmarks tell us? Possible word candidates
200 candidates out of a 20,000-word dictionary
+b [Stop/Fricative] +g [Vowels/Glides] -s [Sonorant consonants] +s [Vowels/Glides] -g
POSSIBLE: penny, canoe, banner, comma, deny, funny, trainer, piano, tomorrow IMPOSSIBLE: pane, center, comet, media, mini, today, yesterday
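The filtering idea on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' lexical access method: it assumes each word is given as a hypothetical sequence of four coarse acoustic states (`silence`, `noise`, `vowel`, `sonorant_c`), derives landmarks from the landmark-type definitions earlier in the talk, and uses a made-up four-word lexicon.

```python
# Assumption: four coarse acoustic states. "noise" covers bursts and frication,
# "vowel" covers vowels and glides, "sonorant_c" covers nasals and laterals.
OBSTRUENT = {"silence", "noise"}
SONORANT = {"vowel", "sonorant_c"}


def landmark(prev, nxt):
    """Return the landmark label at the boundary prev -> nxt, or None."""
    if prev == "silence" and nxt == "noise":
        return "+b"
    if prev == "noise" and nxt == "silence":
        return "-b"
    if prev in OBSTRUENT and nxt in SONORANT:
        return "+g"
    if prev in SONORANT and nxt in OBSTRUENT:
        return "-g"
    if prev == "vowel" and nxt == "sonorant_c":
        return "-s"
    if prev == "sonorant_c" and nxt == "vowel":
        return "+s"
    return None


def landmark_sequence(states):
    """Landmarks for a word spoken in isolation (silence on both sides)."""
    padded = ["silence"] + list(states) + ["silence"]
    labels = [landmark(a, b) for a, b in zip(padded, padded[1:])]
    return [lab for lab in labels if lab is not None]


# Hypothetical toy lexicon; the state sequences are hand-written guesses.
TOY_LEXICON = {
    "penny": ["noise", "vowel", "sonorant_c", "vowel"],
    "funny": ["noise", "vowel", "sonorant_c", "vowel"],
    "pane": ["noise", "vowel", "sonorant_c"],
    "mini": ["sonorant_c", "vowel", "sonorant_c", "vowel"],
}


def candidates(observed, lexicon):
    """Words whose predicted landmark sequence equals the observed one."""
    return sorted(w for w, s in lexicon.items() if landmark_sequence(s) == observed)


print(candidates(["+b", "+g", "-s", "+s", "-g"], TOY_LEXICON))
```

With the slide's sequence +b +g -s +s -g, the toy lexicon keeps "penny" and "funny" and rejects "pane" and "mini", in line with the POSSIBLE and IMPOSSIBLE lists above.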
What do we know from landmarks? Types of acoustic cues and places for cue estimation
[Landmark sequence +b +g -s +s -g annotated with cues: average spectrum, VOT, formants at vowel onset, voicing]
What do landmarks tell us? Distinctive Features
+b [Stop/Fricative] +g [Vowel/Glide] -s [Sonorant consonants] +s [Vowel/Glide] -g
- Stop/Fricative: continuant, voicing, place of articulation, strident
- Sonorant consonants: nasality, place of articulation
- Vowel/Glide: tongue body position, tense / lax, existence of glide or liquid
Type of features predicted near landmark pairs
PART II
Landmark Detection
Landmark Detection
Landmark Candidate Detection
- Find acoustic discontinuities
- High-sensitivity detection
- Calculate probabilities with cues
Landmark Sequence Determination
- Reduce false alarms by using constraints of landmark sequences
Graphical Representation
- Categorize landmarks into reliable vs. ambiguous regions
Database: TIMIT
Ground-truth landmarks are derived from TIMIT phonetic transcriptions.
[Spectrogram with ground-truth landmark labels: +b +g -g +g -g +g -g +g -g +g -g -b]
Locating landmark candidates
Increased detection sensitivity from Liu’s algorithm (Liu, 1996)
Six frequency bands: 0-400 Hz, 800-1500 Hz, 1200-2000 Hz, 2000-3500 Hz, 3000-5000 Hz, 5000-8000 Hz
[Figure: energy contours in the six frequency bands, aligned with the ground-truth landmark labels]
Energy change: low vs. high frequency regions
[Figure: g-landmark candidates and b-, s-landmark candidates, aligned with the ground-truth landmark labels]
Time points of energy change in frequency bands
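One common way to find time points of energy change in a band is to difference the band-energy contour (in dB) over a short span and keep the local extrema. The sketch below illustrates that idea only; the step size and the 9 dB threshold are arbitrary assumptions, and it is not claimed to match the detector used in the thesis or Liu's algorithm.

```python
def energy_change_peaks(energy_db, step=2, threshold_db=9.0):
    """Find abrupt energy changes in one band.

    energy_db: band energy per frame, in dB.
    Returns a list of (frame_index, signed_change_db); a positive change is an
    energy rise, a negative change an energy fall.
    """
    n = len(energy_db)
    diff = [0.0] * n
    # Two-sided difference: energy `step` frames ahead minus `step` frames behind.
    for i in range(step, n - step):
        diff[i] = energy_db[i + step] - energy_db[i - step]

    peaks = []
    for i in range(1, n - 1):
        mag = abs(diff[i])
        # Local maximum of |diff|; on a flat plateau the last frame is reported.
        if mag >= threshold_db and mag >= abs(diff[i - 1]) and mag > abs(diff[i + 1]):
            peaks.append((i, diff[i]))
    return peaks


# Synthetic contour: quiet, loud, quiet.
contour = [20.0] * 10 + [50.0] * 10 + [20.0] * 10
for frame, change in energy_change_peaks(contour):
    print(frame, "+" if change > 0 else "-")
```

The sign of each peak corresponds to the + or - attached to a landmark: a rise in energy gives a + candidate and a fall gives a - candidate.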
Landmark Candidates
[Figure: g-landmark candidates and b-, s-landmark candidates, aligned with the ground-truth landmark labels]
If three peaks are in a cluster: potential candidate
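The clustering rule can be sketched as follows. The slide does not say how close peaks must be or whether they must come from different bands, so the 20 ms tolerance and the "three distinct bands" requirement below are assumptions made only for illustration, as are the band names.

```python
def cluster_candidates(band_peaks, tolerance=0.02, min_bands=3):
    """Group energy-change peaks from several bands into landmark candidates.

    band_peaks: dict mapping a band name to a list of peak times in seconds.
    A new cluster starts whenever the gap to the previous peak exceeds
    `tolerance`. A cluster becomes a candidate if it contains peaks from at
    least `min_bands` different bands; the candidate time is the mean peak time.
    """
    events = sorted((t, band) for band, times in band_peaks.items() for t in times)
    clusters, current = [], []
    for event in events:
        if current and event[0] - current[-1][0] > tolerance:
            clusters.append(current)
            current = []
        current.append(event)
    if current:
        clusters.append(current)

    candidates = []
    for cluster in clusters:
        bands = {band for _, band in cluster}
        if len(bands) >= min_bands:
            times = [t for t, _ in cluster]
            candidates.append(sum(times) / len(times))
    return candidates


# Hypothetical peak times: three bands agree near 0.105 s, two near 0.5 s.
peaks = {
    "band2": [0.100, 0.500],
    "band3": [0.105],
    "band4": [0.110, 0.900],
    "band5": [0.505],
}
print(cluster_candidates(peaks))
```

Only the cluster near 0.105 s has peaks from three bands, so it is the only candidate returned; the two-band cluster and the isolated peak are dropped.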
Probability Calculation: Acoustic Cues
g-landmark cues: abruptness; low-frequency energy on left/right
b-landmark cues: abruptness; silence on left/right; frication on right/left
s-landmark cues: abruptness; energy on left/right; tilt
Detection Results
[Spectrogram with detected landmark candidates (+g, -g, +b, -b labels)]
g-landmark: 96% detected
b-landmark: 96% detected
s-landmark: 75% detected
Insertions: 100-250%
Detection Results
Detected landmarks: many false alarms, as expected
How can we convert to a more accurate sequence?
[Figure: detected landmarks compared with ground-truth landmarks]
Landmark sequences: “Grammatical” rules
Some sequences of landmarks cannot occur consecutively.
Bigram Restrictions
(-b, -g) pair is illegal
(-g, -g) pair is illegal
…
60% of pairs are impossible!

Rows: first landmark. Columns: next landmark. X = illegal pair, O = possible pair.
          +g   -g   +b   -b   +s   -s   [end]
+g        X    O    X    X    O    O    X
-g        O    X    O    O    X    X    O
+b        O    X    X    O    X    X    X
-b        O    X    O    X    X    X    O
+s        X    O    X    X    O    O    X
-s        X    O    X    X    O    O    X
[start]   O    X    O    X    X    X    X
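Quantifying grammatical rules of landmark pairs
Bigram Restrictions
(-b, -g) pair is illegal
(-g, -g) pair is illegal
…
60% of pairs are impossible

Bigram transition probabilities in percent. Rows: first landmark. Columns: next landmark. X = illegal pair.
          +g     -g     +b     -b     +s     -s     [end]
+g        X      55.8   X      X      9.2    35.0   X
-g        33.6   X      45.2   14.8   X      X      6.4
+b        90.2   X      X      9.8    X      X      X
-b        13.2   X      62.3   X      X      X      24.5
+s        X      66.3   X      X      0.4    33.3   X
-s        X      43.3   X      X      56.0   0.7    X
[start]   40.3   X      59.7   X      X      X      X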
Computed from TIMIT training data
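Bigram probabilities of this kind can be estimated by counting adjacent label pairs in the training sequences and normalizing each row. The sketch below shows the counting step on two made-up label sequences; it is an assumption about how such a table could be built, and the toy input is not TIMIT data.

```python
from collections import Counter, defaultdict


def estimate_bigrams(sequences):
    """Maximum-likelihood bigram probabilities from landmark label sequences.

    Each sequence is padded with "[start]" and "[end]". Pairs that never occur
    are simply absent from the result, i.e. they are treated as illegal.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = ["[start]"] + list(seq) + ["[end]"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


# Hypothetical training sequences, for illustration only.
toy_sequences = [["+b", "+g", "-g"], ["+g", "-g", "-b"]]
bigrams = estimate_bigrams(toy_sequences)
print(bigrams["[start]"])
print(bigrams["-g"])
```

Each row of the result sums to one, which is the property the transition weights need when they are combined with the individual landmark probabilities.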
Transition graph of detected landmarks Detected landmarks on the spectrogram
Nodes: Landmark Candidates Edges: Possible Transitions
Weights: Transition Probability & Individual Probability
Detected landmarks in transition graph
Finding the Best Path
Viterbi Search Result
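A best-path search over the transition graph can be sketched with dynamic programming. This is a simplified illustration under stated assumptions, not the thesis implementation: candidates are taken in time order, an accepted candidate contributes its individual probability, a skipped candidate contributes one minus that probability (an assumed false-alarm cost), and the transition table holds hypothetical round numbers rather than the TIMIT estimates.

```python
import math


def best_landmark_path(candidates, transition):
    """candidates: list of (label, probability) in time order.
    transition: dict prev_label -> {next_label: probability}; missing = illegal.
    Returns (indices of accepted candidates, log-likelihood of the best path)."""
    n = len(candidates)
    eps = 1e-12
    # skip[k]: log-cost of treating candidates 0..k-1 as false alarms.
    skip = [0.0]
    for _, prob in candidates:
        skip.append(skip[-1] + math.log(max(1.0 - prob, eps)))

    def trans(prev, nxt):
        p = transition.get(prev, {}).get(nxt, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    score = [float("-inf")] * n
    back = [None] * n
    for j, (label, prob) in enumerate(candidates):
        own = math.log(max(prob, eps))
        # Option 1: j is the first accepted landmark.
        score[j] = trans("[start]", label) + skip[j] + own
        # Option 2: j follows an earlier accepted landmark i.
        for i in range(j):
            s = score[i] + trans(candidates[i][0], label) + (skip[j] - skip[i + 1]) + own
            if s > score[j]:
                score[j], back[j] = s, i

    best_end, best = None, float("-inf")
    for j, (label, _) in enumerate(candidates):
        s = score[j] + trans(label, "[end]") + (skip[n] - skip[j + 1])
        if s > best:
            best_end, best = j, s

    path = []
    while best_end is not None:
        path.append(best_end)
        best_end = back[best_end]
    return path[::-1], best


# Hypothetical transition probabilities and candidates.
toy_transition = {
    "[start]": {"+b": 0.6, "+g": 0.4},
    "+b": {"+g": 0.9, "-b": 0.1},
    "+g": {"-g": 1.0},
    "-g": {"+g": 0.4, "+b": 0.4, "[end]": 0.2},
}
toy_candidates = [("+b", 0.9), ("+g", 0.95), ("-g", 0.2), ("-g", 0.9)]
print(best_landmark_path(toy_candidates, toy_transition))
```

In the toy input the two consecutive -g candidates cannot both be kept because the (-g, -g) pair is illegal, so the search drops the weaker one and keeps candidates 0, 1, and 3.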
Finding the Best Path
[Spectrogram with landmark labels +b +g -g +g -g +g -g +g -g -b and the Viterbi search result]
Viterbi Search Result
[Figure: detected landmarks compared with ground-truth landmarks]
Performance Evaluation
Error \ Type    g-Landmark   b-Landmark   s-Landmark   Total
Detection       86.2%        74.9%        52.3%        76.8%
Deletion        4.4%         12.6%        30.7%        11.6%
Substitution    9.4%         12.5%        17.0%        11.6%
Insertion       7.6%         27.3%        18.8%        14.7%
• Results consistent across gender and dialects
• Error analysis:
– Errors often indicate systematic variants of canonical landmarks
– Most errors are from a small set of phonetic contexts: flaps, dark /l/’s, syllabic nasals and laterals; heavily voiced obstruents (e.g. /v/)
Categorizing landmark sequences
Reliably Determined
- Clear canonical landmarks
- Word onsets, stressed syllables
- Estimate distinctive features with confidence
Ambiguously Determined
- Variants of canonical landmarks
- Function words, limited set of contexts (flap, voiced /v/, syllabic nasal, …)
- Resolve ambiguity by further inspection
N-best Search Results
Ambiguous
Reliable
Ambiguous
Problems with N-best Search
• How big should N be?
– N increases exponentially with respect to the length of the signal.
• Too much redundant information
– A compact graphical representation is needed
=> Propose Pruning Method
Removing Edges with Small Weights
Cut the edge with a small weight? No! The likelihood of the entire path should be considered.
Likelihood of Path 1: 0.9 x 0.001 x 0.9 = 0.00081
Likelihood of Path 2: 0.09 x 0.01 x 0.9 = 0.00081
Likelihood of Path 3: 0.009 x 0.1 x 0.9 = 0.00081
Convert Likelihoods to Normalized Edge Probability
Weights can now be compared appropriately with each other.
[Graph: after normalization, six edges carry probability 0.33 and one edge carries 1.0]
Graph Pruning Algorithm
• Normalize local likelihoods to global edge probabilities.
• Prune edges and corresponding nodes below threshold.
[Figure: original graph and proposed compact graph]
Original Graph
Nodes: Landmark Candidates Edges: Possible Transitions
Weights: Transition Probability & Individual Probability
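The normalization step of the pruning algorithm can be illustrated with a standard forward-backward pass over the graph, which gives each edge the share of total path likelihood that flows through it. This is a substitute formulation chosen for brevity, not necessarily the procedure used in the thesis; the node names and the graph topology below are an assumed reading of the three-path example from the earlier slide.

```python
from collections import defaultdict


def edge_posteriors(edges, order):
    """edges: list of (src, dst, weight) with at most one edge per node pair.
    order: all nodes in topological order, start node first, end node last.
    Returns {(src, dst): fraction of total path likelihood using that edge}."""
    position = {node: k for k, node in enumerate(order)}
    alpha = defaultdict(float)  # likelihood of reaching a node from the start
    beta = defaultdict(float)   # likelihood of reaching the end from a node
    alpha[order[0]] = 1.0
    beta[order[-1]] = 1.0
    for src, dst, w in sorted(edges, key=lambda e: position[e[0]]):
        alpha[dst] += alpha[src] * w
    for src, dst, w in sorted(edges, key=lambda e: position[e[1]], reverse=True):
        beta[src] += w * beta[dst]
    total = alpha[order[-1]]
    return {(src, dst): alpha[src] * w * beta[dst] / total for src, dst, w in edges}


def prune(edges, order, threshold=0.1):
    """Keep only edges whose global probability reaches the threshold."""
    post = edge_posteriors(edges, order)
    return [(s, d, w) for s, d, w in edges if post[(s, d)] >= threshold]


# Three-path example; node names S, A1..A3, M, E are hypothetical.
edges = [
    ("S", "A1", 0.9), ("A1", "M", 0.001),
    ("S", "A2", 0.09), ("A2", "M", 0.01),
    ("S", "A3", 0.009), ("A3", "M", 0.1),
    ("M", "E", 0.9),
]
order = ["S", "A1", "A2", "A3", "M", "E"]
for edge, p in edge_posteriors(edges, order).items():
    print(edge, round(p, 2))
```

The edge with local weight 0.001 ends up with a global probability of about 0.33, the same as its neighbours, so a threshold of 0.1 keeps all seven edges. Comparing raw local weights would have cut it.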
Proposed Compact Graph
Pruning threshold: 0.1
[Figure: compact graph over the landmark labels, showing reliable landmarks and ambiguous regions]
Conclusions
• Reliable landmark sequences are highly informative
– Detection results are consistent across dialects and gender
– Correspond to canonical landmarks
– Useful in analyzing speech at the phoneme, word, and phrase level
• But so are ambiguous landmark sequences!
– Systematic variants of canonical landmarks
– Only occur in limited phonetic contexts such as flaps and syllabic /l/
References
Furui S. (1986): On the role of spectral transition for speech perception. J. Acoust. Soc. Am., 80(4): 1016-1025.
Jenkins J. et al. (1983): Identification of vowels in vowelless syllables. Perception & Psychophysics, 34(5): 441-450.
Stevens K.N. (2002): Toward a model for lexical access based on acoustic landmarks and distinctive features. J. Acoust. Soc. Am., 111(4): 1872-1891.
Glass J.R. and Zue V.W. (1988): Multi-level acoustic segmentation of continuous speech. Proc. IEEE ICASSP-88, pp. 429-432.
Liu S.A. (1996): Landmark detection for distinctive feature-based speech recognition. J. Acoust. Soc. Am., 100(5): 3417-3430.
Juneja A. and Espy-Wilson C.Y. (2002): Segmentation of continuous speech using acoustic-phonetic parameters and statistical learning. Proc. IEEE ICONIP-2002, pp. 726-730.
Hasegawa-Johnson M. et al. (2004): Landmark-based speech recognition: Report of the 2004 Johns Hopkins summer workshop. Technical Report, Johns Hopkins University.
Jansen A. and Niyogi P. (2007): A probabilistic speech recognition framework based on the temporal dynamics of distinctive feature landmark detectors. Technical Report, University of Chicago.
Discussion Slides
g-Landmark
Sonorant-obstruent boundary
Low-frequency energy change (0-400 Hz)
[Spectrogram of “Did Mary not feel good” with +g and -g landmarks marked]
b-Landmark
Turbulent noise: burst of frication
High-frequency energy change in obstruent region
[Spectrogram of “Did Mary not feel good” with +b landmarks marked]
s-Landmark
-s indicates reduction in sonorant energy
High-frequency energy change in sonorant region
[Spectrogram of “Did Mary not feel good” with -s and +s landmarks marked]
Future Directions
• Improve landmark detector
– Use additional acoustic cues (e.g. timing)
– Incorporate vowel and glide landmarks
– Investigate ambiguous regions
– Analyze systematic error contexts
• Speech analysis applications
– How do landmarks relate to supra-segmental features (e.g. word boundary, lexical stress)?
– What aspects of landmarks are language-independent and language-dependent?
– Analyze “atypical” speech, e.g. children’s speech or speech disorders
Ambiguous Regions
Three alternatives, due to a heavily voiced obstruent:
+g / –g2 / +g : 56.0% likely
+g / –g1 / +g : 31.8% likely
+g / –g / +g / –g / +g : 12.2% likely
Check if sonorants are in ambiguous region
[Spectrogram with phone labels: iy z iy f er ah]
Ambiguous Regions
Two alternatives, depending on the existence of frication
Check frication
Extra Slides: Speech Recognition System
Speech Recognition System Structure (Stevens, 2002)
Stages: Landmark Detection, Feature Detection, Lexical Access
- Landmark detection (e.g. +b +g -s -g): determine some broad class features; limit distinctive features to be evaluated; locate where acoustic cues can be estimated
- Feature detection: sequence of feature bundles (e.g. Labial Voiced Stop; Front High Tense Vowel; Alveolar Nasal)
- Coarse CV structure with broad-class classification; possible word candidates (e.g. bean, been)
Extra Slides: Examples of Graph Pruning Algorithm
Another Example: Shift weights to the front
[Graph edge labels: 0.05, 0.2, 0.5, 0.7 x 0.8, 0.001, 1.0, 0.8]
[Graph edge labels: 0.05 x 0.56, 0.2, 0.5 x 0.56, 1.0, 0.56, 0.001, 1.0]
[Graph edge labels: 0.028, 1.0, 0.2 x (0.28+0.001), 0.28 / (0.28+0.001), 0.001 / (0.28+0.001), 1.0]
Another Example: Normalize Weights to Get Probabilities
[Graph edge labels: 0.028, 0.0562, 0.996, 1.0, 0.004, 1.0]
[Graph edge labels: 0.332, 0.668, 0.996, 1.0, 0.004, 1.0]
Another Example: Redistribute Probabilities to Subsequent Edges
[Graph edge labels: 0.332, 0.668, 0.996 x 0.668, 1.0, 0.004 x 0.668, 1.0]
[Graph edge labels: 0.332, 0.668, 0.665, 1.0 x (0.665+0.332), 0.003, 1.0]
[Graph edge labels: 0.332, 0.668, 0.665, 0.997, 0.003, 1.0 x 0.997]
Another Example: Resulting Edge Probability
[Graph edge labels: 0.332, 0.668, 0.665, 0.997, 0.003 (small probability, pruned: X), 0.997]
Another Example: Recalculated Probabilities in Reduced Graph
[Graph edge labels: 0.33, 0.66, 0.66 in the ambiguous region (probabilities of alternatives); 1.0, 1.0 in the reliable region (probability of one)]