Integrating acoustic cues to phonetic features: A computational approach to cue weighting

Joseph C. Toscano and Bob McMurray
University of Iowa, Dept. of Psychology

A central issue in phonology and speech perception is the relationship between acoustic cues in the speech signal and the phonetic features they correspond to. Voice onset time (VOT), for example, serves as a primary cue to the phonetic feature voicing. These relationships are complicated, however, by the fact that for any given phonetic feature, multiple acoustic cues contribute to its perception (Lisker, 1978). Listeners' use of multiple cues has been demonstrated for a large number of phonological distinctions by examining changes in the location of category boundaries along one dimension as a function of a second or third dimension (see Repp, 1982, for a review). Previous perceptual experiments have demonstrated that listeners weight individual acoustic cues differently as they are combined to form a phonological dimension or feature. In determining voicing, for example, VOT is a primary cue, while F0, F1, and vowel length make minor contributions.

To date, there have been no theoretical accounts that make specific predictions about how each cue might be weighted during perception. Gestural approaches, for example, might posit that the cue weights used for a phonetic feature are determined by the relationship between the articulatory gestures a speaker uses to produce a particular speech sound and the acoustic effects of those gestures. However, this account requires listeners to have knowledge of the relationships between particular articulatory gestures, their acoustic counterparts, and the relevant phonetic distinctions associated with them. Moreover, given that many relationships between cues differ cross-linguistically (e.g., the relationship between VOT and vowel length differs in languages in which vowel length is phonemic), an account in which cue weights can be learned may be preferable.

We propose that these questions can be answered without regard to gestural origins by weighting acoustic cues as a function of their statistical reliability. That is, a cue that is more reliably correlated with phonetic categories along a given dimension should be weighted more heavily than less reliable ones. For example, VOT provides an excellent cue to word-initial voicing because VOT values in speech production tend to cluster into distinct categories (Lisker & Abramson, 1964). Vowel length, while also distinguishing word-initial voicing categories (Allen & Miller, 1999), is more variable and therefore less reliable than VOT. A system using the reliability of acoustic cues to weight them would thus weight VOT higher, reflecting the relationship between these cues observed in perceptual experiments (Summerfield, 1981).

We present a method for computing the reliability of acoustic cues on the basis of their statistical distributions and apply this method in a computational model. The model represents each cue as a mixture of Gaussians (MOG) along an acoustic dimension, with each Gaussian distribution representing one acoustic-phonetic category. The means and variances of these Gaussians can be estimated from phonetic data on the particular cues being modeled. This yields categories structured as graded prototypes along an acoustic dimension. Each individual cue is then weighted using a metric that takes into account the distance between categories, the variability of individual categories, and the relative frequency of each category along each acoustic dimension. Figure 1 shows examples of distributions that are reliable (1A) and unreliable (1B) under this metric.
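The abstract names the ingredients of the weighting metric (between-category distance, within-category variability, category frequency) without giving its exact form. As an illustrative sketch only, the d'-like pairwise separability score below combines those same ingredients; the function name, formula, and example parameter values are our assumptions, not the paper's metric:

```python
import math
from itertools import combinations

def pairwise_separability(means, sds, freqs):
    """Reliability of a cue dimension modeled as a mixture of Gaussians.

    For each pair of categories, compute a d'-like separation: the
    distance between category means divided by the pooled standard
    deviation, weighted by the relative frequencies of the two
    categories. (An illustrative assumption, not the paper's metric.)
    """
    score = 0.0
    for i, j in combinations(range(len(means)), 2):
        pooled_sd = math.sqrt((sds[i] ** 2 + sds[j] ** 2) / 2.0)
        d_prime = abs(means[i] - means[j]) / pooled_sd
        score += freqs[i] * freqs[j] * d_prime
    return score

# VOT-like dimension: two well-separated categories (cf. Figure 1A).
reliable = pairwise_separability(means=[0.0, 50.0], sds=[5.0, 10.0],
                                 freqs=[0.5, 0.5])

# Vowel-length-like dimension: heavily overlapping categories (cf. Figure 1B).
unreliable = pairwise_separability(means=[0.0, 10.0], sds=[15.0, 20.0],
                                   freqs=[0.5, 0.5])

assert reliable > unreliable  # the more separable cue earns the higher weight
```

Any metric with these properties rewards cues whose production distributions cluster tightly and far apart, which is what makes VOT outrank vowel length for word-initial voicing.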
After computing each cue's reliability, cue values are linearly combined to form a graded phonological dimension that is based on the weighted combination of acoustic inputs and corresponds to a particular phonetic feature. From this more abstract dimension, new Gaussian categories are learned and used to categorize speech sounds for that phonetic distinction. Thus, phonetic features are represented as graded distributions along a single dimension, computed from the weighted inputs of different acoustic cues.

We present simulations of several trading relations based on different acoustic cues and phonetic features. The model is able to determine cue weights that produce behavior similar to that of human listeners for these trading relations. Figure 2 shows the results of an identification task for voicing category with stimuli varying in VOT and vowel length; the results indicate a similar trading relation for human listeners and for models trained on these acoustic cues. This suggests that cue weights can be determined from the statistical properties of the acoustic input without positing innate knowledge about how particular cues should be weighted or what their distributions should be. Further, it provides a computationally explicit account of how the speech system can weight and integrate these cues during speech perception. Finally, it suggests that feature dimensions can be constructed solely on the basis of the bottom-up statistical reliability of the acoustic cues they are based on.
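The combination-and-categorization steps can be sketched as follows. This is a minimal illustration under stated assumptions: the cue weights, the normalization of cue values to a common scale, and the category parameters are hypothetical placeholders, not the model's fitted values:

```python
import math

def gaussian_pdf(x, mu, sd):
    """Density of a Gaussian category prototype at point x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def combine_cues(cue_values, cue_weights):
    """Linearly combine (pre-normalized) cue values, each scaled by its
    reliability weight, into a single graded phonological dimension."""
    return sum(w * v for w, v in zip(cue_weights, cue_values)) / sum(cue_weights)

def classify(x, categories):
    """Graded categorization: posterior probability of each Gaussian
    category (mean, sd, prior) at a point on the combined dimension."""
    likelihoods = [p * gaussian_pdf(x, mu, sd) for (mu, sd, p) in categories]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

# Hypothetical reliability weights: VOT weighted far higher than vowel
# length, as its categories are better separated. Cue values are assumed
# already normalized to [0, 1].
voicing_dim = combine_cues(cue_values=[0.8, 0.4], cue_weights=[0.85, 0.15])

# Two Gaussian categories learned along the combined voicing dimension,
# e.g. voiced (low) and voiceless (high); parameters are illustrative.
categories = [(0.2, 0.15, 0.5), (0.8, 0.15, 0.5)]
p_voiced, p_voiceless = classify(voicing_dim, categories)
```

Because the dimension is graded rather than discrete, the posteriors themselves shift as either cue changes, which is what produces trading relations like the one in Figure 2.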

Figure 1. (A) A reliable cue dimension in which the categories are distinct. (B) A cue dimension with highly overlapping categories that would be less reliable using the cue weighting metric.
Figure 2. Identification functions along a VOT continuum for two different vowel lengths (long and short). (A) Behavioral data from human listeners. (B) Simulation results from the model.

References

Allen, J. S., & Miller, J. L. (1999). Effects of syllable-initial voicing and speaking rate on the temporal characteristics of monosyllabic words. Journal of the Acoustical Society of America, 106, 2031-2039.

Lisker, L. (1978). Rapid vs. rabid: A catalogue of acoustic features that may cue the distinction. Haskins Laboratories Status Report on Speech Research, SR-54, 127-132.

Lisker, L., & Abramson, A. S. (1964). A cross-linguistic study of voicing in initial stops: Acoustical measurements. Word, 20, 384-422.

Repp, B. H. (1982). Phonetic trading relations and context effects: New experimental evidence for a speech mode of perception. Psychological Bulletin, 92, 81-110.

Summerfield, Q. (1981). Articulatory rate and perceptual constancy in phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 7, 1074-1095.
