Landmark Detection


Acoustical Society of America 156th Meeting, Speech Communication Special Session A Quantal Transition: Ken Stevens in "Retirement"

Consonant Landmarks: Automatic Detection and Interpretation Chiyoun Park, Nancy Chen

This talk is primarily based on Chiyoun Park’s PhD thesis, "Consonant Landmark Detection for Speech Recognition". More information can be found at http://www.mit.edu/~nancyc

Overview

Part I: Introduction to Consonant Landmarks
• What are landmarks?
• What information can be predicted from landmarks?

Part II: Landmark Detection
• Step 1: Landmark Candidate Detection
• Step 2: Landmark Sequence Determination
• Step 3: Graphical Representation of Reliability and Ambiguity

PART I

Introduction to Consonant Landmarks

Speech Information: Not Uniformly Distributed

[Figure: spectrogram of the word "feel" (frequency vs. time), with regions labeled abrupt discontinuity, steady state, and gradual change.]

Landmarks

Consonant
- Information-rich
- Abrupt change
- Focused analysis required

Vowel
- Steady-state, landmark at max. F1 amplitude
- Long time period
- Robust to noise

Glide
- Slow transition, landmark at min. F1 amplitude
- Limited phonetic contexts

[Figure: spectrogram of the word "feel" (frequency vs. time).]

Speech Information: Not Uniformly Distributed

Many acoustic cues lie near consonant landmarks.

[Figure: spectrogram annotated with cues and the features they indicate: spectrum (place of articulation), low-frequency energy (unvoiced), formants (labial; front / high).]

Not all speech segments are created equal
• Information near consonants is crucial for recognizing words in perceptual studies (Jenkins et al. 1983, Furui 1986)
– CV transitions are more important than vowels in recognizing words

When do consonant landmarks occur?

When the vocal tract changes its shape between the configurations below.

[Figure: four vocal-tract configurations, each showing the glottal source and the oral cavity: open (vowel); constriction with turbulence (frication/burst); closure with a side branch through the nasal passage (nasal); closure in the mouth (stop closure).]

g-Landmarks

[Figure: the four vocal-tract configurations in two groups. Free glottal vibration: sonorant sounds (vowels & sonorant consonants). Suppressed or no glottal vibration: obstruent consonants. g-landmarks are marked between the two groups.]

Direction of Energy Change

[Figure: the same diagram with an example pair of g-landmarks between sonorant sounds and obstruent consonants; each landmark carries a sign, + or -, for the direction of the energy change.]

s-Landmarks

[Figure: the same diagram with s-landmarks added within the sonorant sounds, shown with an example + and - pair alongside the g-landmarks.]

b-Landmarks

[Figure: the same diagram with b-landmarks added within the obstruent consonants, shown with an example + and - pair alongside the g-landmarks and s-landmarks.]

What do landmarks tell us? Landmark Types

Landmark   Segment on the left   Segment on the right
+b         Silence               Noise
+g         Obstruent             Sonorant
-s         Vowel                 Nasal / Lateral
+s         Nasal / Lateral       Vowel
-g         Sonorant              Obstruent

Each landmark type gives the broad class of the adjacent segments.

What do landmarks tell us? Landmark Sequence

+b [Stop/Fricative] +g [Vowel/Glide] -s [Sonorant consonants] +s [Vowel/Glide] -g

The landmark sequence gives the coarse CV structure.

What do landmarks tell us? Possible word candidates

+b [Stop/Fricative] +g [Vowels/Glides] -s [Sonorant consonants] +s [Vowels/Glides] -g

200 candidates out of a 20,000-word dictionary

POSSIBLE: penny, canoe, banner, comma, deny, funny, trainer, piano, tomorrow
IMPOSSIBLE: pane, center, comet, media, mini, today, yesterday
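
The way a landmark sequence narrows down word candidates can be sketched in a few lines. This is only an illustration under stated assumptions: the four broad classes, the rule that -b mirrors +b, and the hand-written broad-class transcriptions in `LEXICON` are hypothetical simplifications, not the lexicon or rules used in the thesis.

```python
# Broad classes (an illustrative simplification, not the thesis inventory):
#   "sil"   silence or stop closure      (obstruent side)
#   "noise" burst or frication           (obstruent side)
#   "vowel" vowel or glide               (sonorant side)
#   "son"   nasal or lateral             (sonorant side)
OBSTRUENT = {"sil", "noise"}
SONORANT = {"vowel", "son"}

def landmark_between(left, right):
    """Return the landmark at the boundary between two broad classes, or None."""
    if left in OBSTRUENT and right in SONORANT:
        return "+g"
    if left in SONORANT and right in OBSTRUENT:
        return "-g"
    if (left, right) == ("sil", "noise"):
        return "+b"
    if (left, right) == ("noise", "sil"):
        return "-b"  # assumed to mirror +b
    if (left, right) == ("vowel", "son"):
        return "-s"
    if (left, right) == ("son", "vowel"):
        return "+s"
    return None

def landmark_sequence(classes):
    """Landmarks produced by a sequence of broad classes."""
    marks = (landmark_between(a, b) for a, b in zip(classes, classes[1:]))
    return [m for m in marks if m is not None]

TARGET = ["+b", "+g", "-s", "+s", "-g"]

# Hypothetical hand-written transcriptions, for illustration only.
LEXICON = {
    "penny": ["sil", "noise", "vowel", "son", "vowel", "sil"],
    "funny": ["sil", "noise", "vowel", "son", "vowel", "sil"],
    "pane":  ["sil", "noise", "vowel", "son", "sil"],
    "today": ["sil", "noise", "vowel", "sil", "noise", "vowel", "sil"],
}

matches = sorted(w for w, c in LEXICON.items() if landmark_sequence(c) == TARGET)
print(matches)
```

With these toy transcriptions, "penny" and "funny" match the target sequence while "pane" and "today" do not, in line with the POSSIBLE and IMPOSSIBLE lists above.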

What do we know from landmarks?
• Types of acoustic cues
• Places for cue estimation

[Figure: landmark sequence +b +g -s +s -g annotated with example cues (average spectrum, VOT, formants at vowel onset) and an example feature (voicing).]

What do landmarks tell us? Distinctive Features

Type of features predicted near landmark pairs (sequence +b +g -s +s -g):

Segment                Features
Stop/Fricative         Continuant, voicing, place of articulation, strident
Sonorant consonants    Nasality, place of articulation
Vowel/Glide            Tongue body position, tense / lax, existence of glide or liquid

PART II

Landmark Detection

Landmark Detection

Landmark Candidate Detection
- Find acoustic discontinuities
- High-sensitivity detection
- Calculate probabilities with cues

Landmark Sequence Determination
- Reduce false alarms by using constraints of landmark sequences

Graphical Representation
- Categorize landmarks into reliable vs. ambiguous regions

Database: TIMIT

Ground-truth landmarks are derived from TIMIT phonetic transcriptions.

[Figure: spectrogram of an example utterance with ground-truth landmarks (+b, +g, -g, -b) marked along the time axis.]

Locating landmark candidates

Detection sensitivity is increased relative to Liu’s algorithm (Liu, 1996).

Six frequency bands
- 0-400 Hz
- 800-1500 Hz
- 1200-2000 Hz
- 2000-3500 Hz
- 3000-5000 Hz
- 5000-8000 Hz

[Figure: energy contours in the six frequency bands, aligned with the ground-truth landmarks.]
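
A minimal sketch of the per-band energy for a single frame, using the six bands listed above. The 16 kHz sampling rate matches TIMIT; the 20 ms frame, the function names, and the plain DFT are illustrative assumptions rather than the detector's actual front end. An energy contour would be obtained by repeating this over successive frames.

```python
import cmath
import math

# Band edges in Hz as listed on the slide (note the 400-800 Hz gap and the overlaps).
BANDS_HZ = [(0, 400), (800, 1500), (1200, 2000),
            (2000, 3500), (3000, 5000), (5000, 8000)]

def band_energies(frame, sample_rate):
    """Energy of one frame in each band, computed with a plain DFT."""
    n = len(frame)
    # Power at each non-negative frequency bin; bin k sits at k * sample_rate / n Hz.
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    # Sum the power of the bins whose frequency falls in [low, high).
    return [sum(p for k, p in enumerate(power)
                if low <= k * sample_rate / n < high)
            for low, high in BANDS_HZ]

def tone(freq_hz, sample_rate=16000, n=320):
    """Hypothetical test signal: a 20 ms sinusoid at 16 kHz."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

low = band_energies(tone(300), 16000)    # energy should concentrate in 0-400 Hz
high = band_energies(tone(6000), 16000)  # energy should concentrate in 5000-8000 Hz
print(low.index(max(low)), high.index(max(high)))
```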

Energy change: low vs. high frequency regions

[Figure: time points of energy change in the frequency bands, shown as two tracks (g-landmark candidates; b- and s-landmark candidates) aligned with the ground-truth landmarks.]

Landmark Candidates

If three peaks are in a cluster, the cluster is a potential candidate.

[Figure: g-landmark candidates and b-, s-landmark candidates, aligned with the ground-truth landmarks.]
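
The three-peaks-in-a-cluster rule can be sketched as follows. The 20 ms window, the greedy grouping, and the use of the mean time as the candidate location are assumptions made for illustration; the slide only states that three clustered peaks make a potential candidate.

```python
def cluster_candidates(peaks_by_band, window=0.02, min_peaks=3):
    """Group energy-change peaks (times in seconds) from all bands and keep
    clusters with at least `min_peaks` peaks.

    `window` (assumed 20 ms) is the maximum spread of a cluster, measured
    from its first peak. Returns the mean time of each qualifying cluster.
    """
    peaks = sorted(t for band in peaks_by_band for t in band)
    clusters, current = [], []
    for t in peaks:
        # Start a new cluster once the peak is too far from the cluster's first peak.
        if current and t - current[0] > window:
            clusters.append(current)
            current = []
        current.append(t)
    if current:
        clusters.append(current)
    return [sum(c) / len(c) for c in clusters if len(c) >= min_peaks]

# Hypothetical peak times for four bands.
example_peaks = [[0.100, 0.300], [0.105], [0.110, 0.500], []]
candidates = cluster_candidates(example_peaks)
print(candidates)
```

Only the three peaks near 0.1 s form a cluster here; the isolated peaks at 0.3 s and 0.5 s are discarded.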

Probability Calculation: Acoustic Cues

g-landmark cues
- Abruptness
- Low-frequency energy on left/right

b-landmark cues
- Abruptness
- Silence on left/right
- Frication on right/left

s-landmark cues
- Abruptness
- Energy on left/right
- Tilt

Detection Results

- g-landmark: 96% detected
- b-landmark: 96% detected
- s-landmark: 75% detected
- Insertions: 100-250%

[Figure: detected landmark candidates compared with the ground-truth landmarks.]

Detection Results

Detected landmarks: many false alarms, as expected.
How can we convert to a more accurate sequence?

[Figure: detected landmarks compared with ground-truth landmarks.]

Landmark sequences: “Grammatical” rules

Some sequences of landmarks cannot occur consecutively.

Bigram Restrictions
- (-b, -g) pair is illegal
- (-g, -g) pair is illegal
- …
- 60% of pairs are impossible!

Possible (O) and impossible (X) landmark pairs. Rows: first landmark. Columns: following landmark.

          +g   -g   +b   -b   +s   -s   [end]
+g        X    O    X    X    O    O    X
-g        O    X    O    O    X    X    O
+b        O    X    X    O    X    X    X
-b        O    X    O    X    X    X    O
+s        X    O    X    X    O    O    X
-s        X    O    X    X    O    O    X
[start]   O    X    O    X    X    X    X
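
The bigram restrictions above can be expressed as a lookup of allowed successors. This is a sketch: the `ALLOWED` dictionary transcribes the O cells of the table, while the function name and the treatment of `[start]` and `[end]` as ordinary symbols are choices made here for illustration.

```python
# Allowed successors of each landmark, transcribed from the O cells of the table.
ALLOWED = {
    "[start]": {"+g", "+b"},
    "+g": {"-g", "+s", "-s"},
    "-g": {"+g", "+b", "-b", "[end]"},
    "+b": {"+g", "-b"},
    "-b": {"+g", "+b", "[end]"},
    "+s": {"-g", "+s", "-s"},
    "-s": {"-g", "+s", "-s"},
}

def is_legal(sequence):
    """True if every consecutive pair, including [start] and [end], is allowed."""
    path = ["[start]"] + list(sequence) + ["[end]"]
    return all(b in ALLOWED.get(a, set()) for a, b in zip(path, path[1:]))

# Share of the 7 x 7 table cells that are impossible.
n_allowed = sum(len(successors) for successors in ALLOWED.values())
impossible_fraction = 1 - n_allowed / 49

print(is_legal(["+b", "+g", "-s", "+s", "-g"]))   # the example sequence from Part I
print(is_legal(["+b", "+g", "-g", "-g"]))         # contains the illegal (-g, -g) pair
print(round(impossible_fraction, 2))
```

The impossible share of the table comes out at 29 of 49 cells, which agrees with the roughly 60% quoted on the slide.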

Quantifying grammatical rules of landmark pairs

Bigram probabilities in percent. Rows: first landmark. Columns: following landmark. A dash marks an impossible pair.

          +g     -g     +b     -b     +s     -s     [end]
+g        -      55.8   -      -      9.2    35.0   -
-g        33.6   -      45.2   14.8   -      -      6.4
+b        90.2   -      -      9.8    -      -      -
-b        13.2   -      62.3   -      -      -      24.5
+s        -      66.3   -      -      0.4    33.3   -
-s        -      43.3   -      -      56.0   0.7    -
[start]   40.3   -      59.7   -      -      -      -

Computed from TIMIT training data

Transition graph of detected landmarks

- Nodes: landmark candidates
- Edges: possible transitions
- Weights: transition probability & individual probability

[Figure: detected landmarks on the spectrogram and the same landmarks in the transition graph.]

Finding the Best Path

[Figure: Viterbi search result on the transition graph, with the detected landmarks compared to the ground-truth landmarks.]
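
A best-path search over landmark candidates can be sketched as below. Several things here are assumptions for illustration and are not taken from the talk: the `TRANS` values are made up (they are not the TIMIT bigram probabilities), skipped candidates are charged a factor of (1 - probability) as false alarms, and candidate probabilities are assumed to lie strictly between 0 and 1.

```python
import math

# Hypothetical bigram probabilities for a few legal pairs (NOT the TIMIT values).
TRANS = {
    ("[start]", "+b"): 0.6, ("[start]", "+g"): 0.4,
    ("+b", "+g"): 0.9, ("+b", "-b"): 0.1,
    ("+g", "-g"): 1.0,
    ("-g", "+g"): 0.4, ("-g", "+b"): 0.5, ("-g", "[end]"): 0.1,
}

def best_path(candidates):
    """candidates: time-ordered list of (label, probability in (0, 1)).
    Returns the indices of the candidates kept on the most likely path."""
    nodes = [("[start]", 1.0)] + list(candidates) + [("[end]", 1.0)]
    best = [-math.inf] * len(nodes)   # best log-score of a path ending at each node
    back = [None] * len(nodes)        # predecessor on that best path
    best[0] = 0.0
    for j in range(1, len(nodes)):
        label_j, prob_j = nodes[j]
        skipped = 0.0  # log-probability that all candidates between i and j are false alarms
        for i in range(j - 1, -1, -1):
            label_i, prob_i = nodes[i]
            trans = TRANS.get((label_i, label_j), 0.0)
            if trans > 0.0 and best[i] > -math.inf:
                score = best[i] + math.log(trans) + math.log(prob_j) + skipped
                if score > best[j]:
                    best[j], back[j] = score, i
            if prob_i >= 1.0:
                break  # a certain node (such as [start]) cannot be skipped
            skipped += math.log(1.0 - prob_i)
    if back[-1] is None:
        return []  # no legal path reaches [end]
    path, j = [], back[-1]
    while j != 0:
        path.append(j - 1)  # convert node index to candidate index
        j = back[j]
    return path[::-1]

# The third candidate is a weak, ungrammatical repeat of +g and should be dropped.
example = [("+b", 0.9), ("+g", 0.95), ("+g", 0.3), ("-g", 0.9)]
kept = best_path(example)
print([example[i][0] for i in kept])
```

In this toy input the path keeps +b, the first +g, and -g, and treats the low-probability second +g as a false alarm, since (+g, +g) is not a legal pair.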

Performance Evaluation

Type           g-Landmark   b-Landmark   s-Landmark   Total
Detection      86.2%        74.9%        52.3%        76.8%
Deletion       4.4%         12.6%        30.7%        11.6%
Substitution   9.4%         12.5%        17.0%        11.6%
Insertion      7.6%         27.3%        18.8%        14.7%

• Results consistent across gender and dialects
• Error analysis:
– Errors often indicate systematic variants of canonical landmarks
– Most errors are from a small set of phonetic contexts
   • Flaps, dark /l/’s, syllabic nasals and laterals
   • Heavily voiced obstruents (e.g. /v/)

Categorizing landmark sequences

Reliably Determined
- Clear canonical landmarks
- Word onsets, stressed syllables
- Estimate distinctive features with confidence

Ambiguously Determined
- Variants of canonical landmarks
- Function words, limited set of contexts (flap, voiced /v/, syllabic nasal, …)
- Resolve ambiguity by further inspection

N-best Search Results

[Figure: N-best search results, with regions along the time axis marked ambiguous, reliable, and ambiguous.]

Problems with N-best Search
• How big should N be?
– N increases exponentially with respect to the length of the signal.
• Too much redundant information
– A compact graphical representation is needed
=> Propose Pruning Method

Removing Edges with Small Weights

Cut the edge with a small weight? No! The likelihood of the entire path should be considered.

Likelihood of Path 1: 0.9 x 0.001 x 0.9 = 0.00081
Likelihood of Path 2: 0.09 x 0.01 x 0.9 = 0.00081
Likelihood of Path 3: 0.009 x 0.1 x 0.9 = 0.00081

Convert Likelihoods to Normalized Edge Probability

Weights can now be compared appropriately with each other: the six edges on the three alternative paths each get 0.33, and the final edge gets 1.0.

Graph Pruning Algorithm
• Normalize local likelihoods to global edge probabilities.
• Prune edges and corresponding nodes below threshold.

[Figure: original graph and proposed compact graph.]
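
The normalization step can be sketched with a forward-backward pass over the graph: the probability of an edge is the total likelihood of all paths through it divided by the total likelihood of all paths. The graph below is an assumed reading of the three-path example (three two-edge branches from S to B, then a shared edge to E); the node names and the function are hypothetical, and this is not necessarily the exact procedure used in the thesis.

```python
def edge_posteriors(edges, order):
    """edges: {(u, v): weight} of a directed acyclic graph.
    order: nodes in topological order, source first and sink last.
    Returns {(u, v): share of total path likelihood passing through the edge}."""
    source, sink = order[0], order[-1]
    # forward[n]: summed likelihood of all partial paths from the source to n.
    forward = dict.fromkeys(order, 0.0)
    forward[source] = 1.0
    for u in order:
        for (a, b), w in edges.items():
            if a == u:
                forward[b] += forward[a] * w
    # backward[n]: summed likelihood of all partial paths from n to the sink.
    backward = dict.fromkeys(order, 0.0)
    backward[sink] = 1.0
    for v in reversed(order):
        for (a, b), w in edges.items():
            if b == v:
                backward[a] += w * backward[b]
    total = forward[sink]
    return {(a, b): forward[a] * w * backward[b] / total
            for (a, b), w in edges.items()}

# Assumed structure of the three-path example.
EDGES = {
    ("S", "A1"): 0.9,   ("A1", "B"): 0.001,
    ("S", "A2"): 0.09,  ("A2", "B"): 0.01,
    ("S", "A3"): 0.009, ("A3", "B"): 0.1,
    ("B", "E"): 0.9,
}
ORDER = ["S", "A1", "A2", "A3", "B", "E"]

posteriors = edge_posteriors(EDGES, ORDER)
THRESHOLD = 0.1  # pruning threshold quoted later in the talk
kept_edges = {edge for edge, p in posteriors.items() if p >= THRESHOLD}
for edge, p in posteriors.items():
    print(edge, round(p, 2))
```

Under this reading, the edge with raw weight 0.001 receives a normalized probability of one third, the same as its siblings, so a threshold applied after normalization keeps it, whereas a threshold on raw weights would have cut it.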

Original Graph

Nodes: Landmark Candidates Edges: Possible Transitions

Weights: Transition Probability & Individual Probability

Proposed Compact Graph

Pruning threshold: 0.1

[Figure: compact graph aligned with the landmark labels, showing reliable landmarks and ambiguous regions.]

Conclusions
• Reliable landmark sequences are highly informative
– Detection results are consistent across dialects and gender
– Correspond to canonical landmarks
– Useful in analyzing speech at the phoneme, word, and phrase level
• But so are ambiguous landmark sequences!
– Systematic variants of canonical landmarks
– Only occur in limited phonetic contexts such as flaps and syllabic /l/

References

Furui S. (1986): On the role of spectral transition for speech perception. J. Acoust. Soc. Am., 80(4):1016-1025.
Jenkins J. et al. (1983): Identification of vowels in vowelless syllables. Perception & Psychophysics, 34(5):441-450.
Stevens K.N. (2002): Toward a model for lexical access based on acoustic landmarks and distinctive features. J. Acoust. Soc. Am., 111(4):1872-1891.
Glass J.R. and Zue V.W. (1988): Multi-level acoustic segmentation of continuous speech. Proc. IEEE ICASSP-88, pp. 429-432.
Liu S.A. (1996): Landmark detection for distinctive feature-based speech recognition. J. Acoust. Soc. Am., 100(5):3417-3430.
Juneja A. and Espy-Wilson C.Y. (2002): Segmentation of continuous speech using acoustic-phonetic parameters and statistical learning. Proc. IEEE ICONIP-2002, pp. 726-730.
Hasegawa-Johnson M. et al. (2004): Landmark-based speech recognition: Report of the 2004 Johns Hopkins summer workshop. Technical Report, Johns Hopkins University.
Jansen A. and Niyogi P. (2007): A probabilistic speech recognition framework based on the temporal dynamics of distinctive feature landmark detectors. Technical Report, University of Chicago.

Discussion Slides

g-Landmark

Sonorant-obstruent boundary
Low-frequency energy change (0-400 Hz)

[Figure: spectrogram of "Did Mary not feel good" with +g and -g landmarks marked.]

b-Landmark

Turbulent noise: burst of frication
High-frequency energy change in obstruent region

[Figure: spectrogram of "Did Mary not feel good" with +b landmarks marked.]

s-Landmark

-s indicates reduction in sonorant energy
High-frequency energy change in sonorant region

[Figure: spectrogram of "Did Mary not feel good" with -s and +s landmarks marked.]

Future Directions
• Improve landmark detector
– Use additional acoustic cues (e.g. timing)
– Incorporate vowel and glide landmarks
– Investigate ambiguous regions
– Analyze systematic error contexts
• Speech analysis applications
– How do landmarks relate to supra-segmental features? (e.g. word boundary, lexical stress)
– What aspects of landmarks are language-independent and language-dependent?
– Analyze “atypical” speech, e.g. children’s speech or speech disorders

Ambiguous Regions

Three alternatives, due to a heavily voiced obstruent:
- +g / –g2 / +g : 56.0% likely
- +g / –g1 / +g : 31.8% likely
- +g / –g / +g / –g / +g : 12.2% likely

Check if sonorants are in the ambiguous region.

[Figure: spectrogram with phone labels iy, z, iy, f, er, ah.]

Ambiguous Regions

Two Alternatives Depending on the existence of frication

Check frication

Extra Slides: Speech Recognition System

Speech Recognition System Structure (Stevens, 2002)

Landmark Detection -> Feature Detection -> Lexical Access

Landmark detection (example output: +b +g -s -g)
- Determine some broad class features
- Limit distinctive features to be evaluated
- Locate where acoustic cues can be estimated
- Coarse CV structure with broad-class classification
- Possible word candidates

Feature detection: sequence of feature bundles
- Labial voiced stop
- Front high tense vowel
- Alveolar nasal

Lexical access (example words: bean, been)

Extra Slides: Examples of Graph Pruning Algorithm

Another Example
1. Shift weights to the front.
2. Normalize weights to get probabilities.
3. Redistribute probabilities to subsequent edges.
4. Resulting edge probability: the edge with a small probability (0.003) is removed.
5. Recalculated probabilities in the reduced graph: the ambiguous region carries the probabilities of the alternatives (0.33 and 0.66); the reliable region has a probability of one.

[Figure: example graph with edge weights at each step.]
