Characteristics of speaking style and implications for speech recognition

Takahiro Shinozaki a)
Department of Computer Science, Tokyo Institute of Technology, Tokyo 152-8552, Japan
a) This work was carried out when the author was at the Department of Electrical Engineering, University of Washington, Seattle, WA 98195-2500.

Mari Ostendorf and Les Atlas
Department of Electrical Engineering, University of Washington, Seattle, Washington 98195-2500

(Received 25 August 2008; revised 5 April 2009; accepted 30 June 2009)

Differences in speaking style are associated with more or less spectral variability, as well as different modulation characteristics. The greater variation in some styles (e.g., spontaneous speech and infant-directed speech) poses challenges for recognition but possibly also opportunities for learning more robust models, as evidenced by prior work and motivated by child language acquisition studies. In order to investigate this possibility, this work proposes a new method for characterizing speaking style (the modulation spectrum), examines spontaneous, read, adult-directed, and infant-directed styles in this space, and conducts pilot experiments in style detection and sampling for improved speech recognizer training. Speaking style classification is improved by using the modulation spectrum in combination with standard pitch and energy variation. Speech recognition experiments on a small vocabulary conversational speech recognition task show that sampling methods for training with a small amount of data benefit from the new features. © 2009 Acoustical Society of America. [DOI: 10.1121/1.3183593]

PACS number(s): 43.72.Ar, 43.72.Ne [DOS]    Pages: 1500-1510

I. INTRODUCTION

It is well known that spoken language varies with the situation, including the formality or informality of the setting, the familiarity of speakers with their conversational partners and their relative seniority, whether or not the listener is a language learner, the noise level of the environment, etc. Both the word choices and the speaking style can vary, where by speaking style the authors mean the quality of articulation as well as prosodic characteristics, including intonation, timing, and energy patterns associated with emphasis and phrasing. Both have an effect on speech recognition performance, but this paper focuses on acoustic characteristics of speaking style, particularly in terms of situational context. Speaking style reflects, in part, the speaker's effort to be understood. For example, a news announcer or someone giving a speech will tend to articulate more clearly than someone engaged in casual conversation, and people hyperarticulate in situations where they think they have been misunderstood. Hyperarticulation also appears in language learning contexts, both in the speech of second-language teachers and in adults talking to children. Studies on infant-directed speech suggest that its prosodic features attract and hold infant attention and that its phonetic cues are exaggerated and more acoustically distinct,1,2 though another study finds that the vowel space is expanded only for pitch-accented words.3 Further, researchers have found a correlation between the clarity of mothers' speech and infants' discrimination capabilities.4 Overemphasized phonetic contrasts also appear to be useful in second-language learning.5

In terms of recognition accuracy, human listeners are relatively insensitive to changes in speaking style: they do not experience special difficulty in listening to either read or conversational speech. In fact, the word error rate (WER) of human listeners on the Switchboard conversational telephone speech (CTS) corpus was 4%,6 not very different from the 2.6% error rate for the read utterances in the Wall Street Journal corpus.7 On the other hand, the recognition performance of automatic speech recognition (ASR) systems is significantly affected by differences in speaking style, and the error rates are often one or more orders of magnitude higher than those of humans.8 When a recognizer that has not been trained with hyperarticulated speech has to recognize it, performance degrades. However, even in matched train/test conditions, style impacts performance. ASR systems designed for conversational speech typically perform much worse than similar systems trained and tested on news recordings, even though the conversational speech task is "simpler" in the sense of language models having lower perplexity. In a study by researchers at SRI,9 the language model was factored out by collecting spontaneous conversational speech and then having the same speakers come back and read their transcripts. The read speech had a lower recognition error rate, even though the words spoken were the same. These findings were confirmed in a subsequent study on pronunciation modeling using the same data.10 Studies using the same recognition technology on different genres show that broadcast news tends to be easier to recognize than conversational speech genres (talk shows, telephone conversations, and meetings) and that even within news broadcasts, professional announcers tend to be associated with lower WERs.

Another contrast in speaking styles is adult-directed vs infant-directed speech. Analogous to the above results for spontaneous speech, Kirchhoff and Schimmel found that infant-directed speech has a higher recognition error rate than adult-directed speech in matched training conditions.3 In these studies, both conversational and infant-directed speech are shown to have more variability than their counterparts in terms of the spectral realization of phonemes. In addition, they are more dynamic prosodically in terms of fundamental frequency (F0) and speaking rate variation, which may be helpful to human listeners.

While variability is problematic for recognition, it can be useful for robust training, i.e., for cases where the ASR system may need to recognize speech in a style that it was not trained on. In an experiment using a Japanese spontaneous speech corpus11 and a Japanese newspaper article sentence corpus,12 the authors observed that training on the spontaneous speech and testing on the read speech gave performance similar to the matched training condition for the read speech, but the reverse led to significant degradation in performance. More precisely, the experiments were performed using spontaneous and read speech models trained, respectively, from 52 h of gender-balanced training data from the corpora and using the standard test sets associated with the corpora. The WER for the read speech with the spontaneous speech model was 9.5%, similar to the 8.7% obtained with the read speech model. On the other hand, the WER for the spontaneous speech with the read speech model was 38.2%, significantly higher than the 25.0% obtained with the spontaneous speech model. Similarly, the mismatched train/test condition for adult- and infant-directed speech shows greater degradation in performance relative to the matched condition when using adult-directed training than when using infant-directed training. One possible hypothesis is that the more careful types of speech lead to models with tighter variances, which are less able to handle cases in the overlap regions associated with less careful speech.

Variability in training is leveraged even in matched training conditions in the sense that it has been proposed to put a greater weight on "difficult cases," either through sampling13 or boosting.14 However, many studies of human language acquisition suggest that infant-directed speech might be useful in providing better prototypes for different speech sounds, assuming that children are focusing on the emphasized examples. These suggest very different methods for sampling speech in learning: bringing in hyperarticulated examples as outliers later in training vs initializing with exaggerated examples.

The prior work thus suggests two possible reasons for recognizing speech style: detecting different styles in order to adjust the recognition models and selecting or weighting speech for training. As a first step in exploring these ideas, this paper proposes a new method for characterizing speaking style, examines spontaneous, read, adult-directed, and infant-directed styles in this space, and conducts pilot experiments in style detection and sampling for improved ASR training. In particular, the authors propose the use of the modulation spectrum to characterize the acoustics of speaking style, with the idea of representing the greater variation that they anecdotally perceive in spontaneous (vs read) and infant-directed (vs adult-directed) speech.

In the sections that follow, the authors begin by motivating the modulation approach to speech analysis and introduce the basic mathematical framework in Sec. II. In Sec. III, they provide analyses of speaking styles in terms of acoustic dynamics, using the modulation spectrum as well as traditional F0 and energy measures. Section IV presents results of style recognition experiments with some of these features. In Sec. V, several sampling methods and training strategies are investigated for ASR. Finally, a summary and conclusions are given in Sec. VI.

II. MODULATION ANALYSIS OF SPEECH

There is substantial evidence that many natural signals can be represented as low frequency modulators, which modulate higher frequency carriers. Many researchers have observed that this concept, loosely called "modulation frequency," is useful for describing, representing, and modifying broadband acoustic signals such as speech and music. Modulation frequency representations usually consist of a transform of a one-dimensional broadband signal into a two-dimensional joint frequency representation, where one dimension is a standard Fourier frequency and the other dimension is a modulation frequency.

In 1939, Dudley concluded his now famous paper on speech analysis15 with

    "... the basic nature of speech as composed of audible sound streams on which the intelligence content is impressed of the true message-bearing waves which, however, by themselves are inaudible."

In other words, he observed that speech and other audio signals, such as music, are actually low bandwidth processes that modulate higher bandwidth carriers. Over the years, research in auditory science has supported this idea, including findings that two-dimensional spectro-temporal modulation transfer functions can model many of the observed effects of auditory sensitivity to amplitude modulation16 and that frequency and modulation periodicity are represented via orthogonal maps in the human auditory cortex.17

In signal processing, the modulation spectrum is a representation of speech that gives both acoustic and modulation frequency information.18 In its simplest form, the modulation spectrum can be considered to be a Fourier transform (in time) of each row of the magnitude of the short-time Fourier transform (STFT), or the magnitude spectrogram. In general, modulation spectral analysis involves a base transform on short-term windows of speech, followed by a nonlinear detection operation, and then a second transform. In the specific implementation used here, the modulation spectrum is obtained by first generating an STFT vector sequence, taking the magnitude of the result for each frequency bin, and then applying a second Fourier transform magnitude to each time series corresponding to a frequency bin. The result is a two-dimensional matrix with frequency and modulation axes. The parameters for modulation spectral analysis consist of those for the base STFT (e.g., window, overlap, and fast Fourier transform (FFT) order), the length of the STFT sequence, and the parameters of the second Fourier transform.

Figure 1 shows an example of a spectrogram and its modulation spectrum, where the base analysis window length and overlap were 16 ms and 75%, respectively, and the second modulation window size was equal to the length of the spectrogram. The speech segment is an instance of an adult female pronouncing the word "socks" sampled at 16 kHz. As the figure illustrates, the second Fourier transform was performed on time sequences of subband energies, e.g., along the arrows overlaid on the spectrogram, to obtain the modulation spectrum. Note that there are other versions of modulation spectral analysis and filtering that use either the Hilbert transform19 or coherent and distortion-free methods20 to modify the modulation spectrum and then produce a new signal with filtered modulations. For this paper, the authors focus only on an energetic interpretation of the modulation spectrum, where, much as with a standard power spectral density estimate, there is no intent to modify the modulation content of a signal. The authors also use only the magnitude after the second Fourier transform (the modulation spectrum magnitude), leaving the phase of the modulation spectrum, which is also known to potentially have importance,21 for future studies.

FIG. 1. A spectrogram of the word "socks" [panel (a), frequency vs. time] and its modulation spectrum [panel (b), acoustic frequency vs. modulation frequency]. The modulation spectrum is obtained from a sequence of STFT vectors.
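To make the construction described above concrete, the following Python sketch computes a magnitude modulation spectrum from a waveform. The 16 ms window, 75% overlap, and FFT sizes mirror the settings reported in the text, but the function and variable names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the modulation spectrum described in Sec. II, assuming the
# 16 ms window / 75% overlap settings given in the text. Names are illustrative.
import numpy as np
from scipy.signal import stft

def modulation_spectrum(x, fs, win_ms=16.0, overlap=0.75, nfft=None):
    """Return |modulation spectrum|: acoustic frequency x modulation frequency."""
    nperseg = int(round(fs * win_ms / 1000.0))        # base STFT window (16 ms)
    noverlap = int(round(nperseg * overlap))          # 75% overlap
    nfft = nfft or (256 if fs >= 16000 else 128)      # base FFT order
    # First transform: short-time Fourier transform, then magnitude (detection).
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap, nfft=nfft)
    mag = np.abs(Z)                                    # (freq bins, time frames)
    # Second transform: FFT magnitude along time for each acoustic frequency bin;
    # the modulation window spans the whole segment, as in the paper.
    mod = np.abs(np.fft.rfft(mag, axis=1))
    return mod                                         # rows: acoustic freq, cols: modulation freq

# Example: 2 s of 16 kHz audio gives roughly a 129 x 251 magnitude array
# (half the FFT size + 1 acoustic bins, about half the frame count + 1 modulation bins).
if __name__ == "__main__":
    fs = 16000
    t = np.arange(0, 2.0, 1.0 / fs)
    x = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
    print(modulation_spectrum(x, fs).shape)
```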

III. ACOUSTIC ANALYSIS


In this section, acoustic analyses of speaking style are performed based on features extracted from utterances. Two different corpora are used, as described next, in order to have a variety of styles and to investigate analogies between the read/spontaneous and adult-directed/infant-directed contrasts.


A. Corpora



The Multi-Register speech corpus (MULTI-REG), which was collected at SRI, includes spontaneous speech and a read version of its transcription pronounced in a dictation manner.9,10 The speech was recorded over telephone and high quality head-mounted microphone channels. The telephone channel data were stored in 8 kHz u-law format, and the high quality channel was recorded as 16 kHz pulse code modulation (PCM) data. In the following experiments, a subset of the corpus was used that has consistent transcriptions across the speech types. It consisted of nine female speakers with 557 speech segments; the authors restrict the analysis to female speech to match the second corpus, which has only female speakers. Due to this constraint, the findings in this paper are biased toward female speakers. Compared to male voices, the most prominent characteristic of female voices is a higher fundamental frequency, which makes it difficult, for example, to accurately estimate formant frequencies due to the wider spacing of pitch harmonics.22

The Motherese corpus has infant-directed and adult-directed utterances provided by the Institute for Learning and Brain Sciences at the University of Washington. The infant-directed and adult-directed utterances are extracted from conversations that a mother has with her infant and with an adult experimenter, respectively. The authors' work used a subset of the data taken from the set used in the Kirchhoff-Schimmel study;3 further details about the data are included therein. Specifically, 12k utterances (6.9k infant-directed and 5.5k adult-directed) from 32 female speakers were used. The data were designed to elicit keywords from the mothers, but the authors' analyses used complete utterances rather than just these keywords. The speech in this corpus was recorded at a 16 kHz sampling rate in 16 bit PCM format using a far-field microphone. In the following experiments, waveforms with an 8 kHz sampling frequency were made by down-sampling the original 16 kHz version with a cutoff frequency of 3.8 kHz.

B. F0 analysis

F0 contours were first estimated for each 10 ms frame using the getf0 command23 from the ESPS package. Then, to reduce estimation error, a mixture model was used to characterize pitch doubling and halving so as to more accurately determine F0, and the contour was stylized using the GRAPHTRACK program.24 It is typical to normalize F0 to account for speaker variability, but it is important to use the same normalization factors for both styles recorded for a speaker. For the experiments here, the authors chose the frame-wise mean and standard deviation as the normalization factors, estimated from the spontaneous or adult-directed utterances, depending on the corpus. The F0 features of both genres from a speaker were normalized by subtracting this mean and dividing by this standard deviation. After the normalization, the mean and standard deviation within a segment were used as the features of that segment, excluding frames in unvoiced regions.

Figure 2 shows the F0 mean and standard deviation for each of the speech types, averaged over the speakers in the corresponding corpus. It is observed that infant-directed utterances have a much higher F0 mean and variance than all other conditions, as expected. The differences between the other three cases are small in comparison.

FIG. 2. (Color online) Normalized F0 statistics (utterance mean vs. utterance standard deviation) estimated for the MULTI-REG corpus ("Dict" and "Spon" are dictation and spontaneous speech, respectively) and the Motherese corpus ("ID" and "AD" are infant-directed and adult-directed speech, respectively).
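A minimal sketch of this per-speaker normalization is given below, assuming F0 contours with zeros marking unvoiced frames; the data layout and function names are hypothetical, and the reference statistics are taken from the spontaneous (or adult-directed) side only, as described above.

```python
# Sketch of the per-speaker F0 normalization in Sec. III.B: statistics are taken
# from the reference style only and applied to both styles for that speaker.
# Data layout and names are assumed for illustration.
import numpy as np

def normalize_f0(f0_by_style, reference_style="spontaneous"):
    """f0_by_style: dict mapping style -> list of per-utterance F0 arrays (0 = unvoiced)."""
    ref = np.concatenate(f0_by_style[reference_style])
    ref = ref[ref > 0]                                  # voiced frames only
    mu, sigma = ref.mean(), ref.std()
    features = {}
    for style, utts in f0_by_style.items():
        feats = []
        for f0 in utts:
            v = f0[f0 > 0]
            z = (v - mu) / sigma                        # same factors for both styles
            feats.append((z.mean(), z.std()))           # utterance mean and std features
        features[style] = feats
    return features
```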

C. Modulation spectrum analysis

As described in Sec. II, the modulation spectrum is obtained by applying the Fourier transform to a slice over time of the STFT, resulting in a two-dimensional matrix with acoustic frequency and modulation frequency axes. The dimensions of the matrix are the number of acoustic frequency bins (half the size of the first Fourier transform) and the number of modulation frequency bins (half the number of time frames in the second Fourier transform). In the studies presented here, the STFT base window width and overlap were 16 ms and 75%, respectively, and the FFT size was 128 or 256 for the 8 and 16 kHz sampling rates, respectively. The modulation window size was equal to the length of the input STFT sequence, which varies with the signal being analyzed. In Figs. 3 and 4, for example, the sampling rate was 16 kHz and the length was 2.0 s, so the resulting modulation spectrum is a 128 x 256 array.

Figure 3 shows the difference of the modulation spectra for infant-directed and adult-directed speech. The figure was made by subtracting the log of the averaged magnitude modulation spectrum of adult-directed speech from that of infant-directed speech, using 1500 segments for each speech type. Similarly, Fig. 4 shows the difference of the averaged modulation spectra of the read and spontaneous speech produced by female speakers from the MULTI-REG corpus, using 700 segments for each speech type. The analysis indicates that both infant-directed and read speech have more energy than their counterparts at higher modulation frequencies for high acoustic frequencies, which the authors hypothesize to be due to a tendency toward more clearly articulated consonants in these genres. The analysis also indicates that infant-directed speech tends to have more energy at low modulation frequencies in the low formant regions, particularly in the region of F0. This phenomenon is explored further in Sec. III D. In addition, for the read vs spontaneous contrast, a difference is observed at high modulation frequencies in the low acoustic frequency region. The authors have as yet no explanation for this difference.

FIG. 3. Difference of the averaged modulation spectrum of infant-directed utterances from adult-directed utterances (acoustic frequency vs. modulation frequency).

FIG. 4. Difference of the averaged modulation spectrum of dictation utterances from spontaneous utterances (acoustic frequency vs. modulation frequency).

D. Spectrogram analysis

To better understand the modulation spectrum differences for adult- and infant-directed speech, the authors inspected several of the target words (covering the vowels /iy/, /uw/, and /aa/). They found that the mother's fundamental frequency frequently aligns with the first formant of the vowel in infant-directed speech, as illustrated in Figs. 5 and 6, which show spectrograms and spectral sections for adult-directed and infant-directed speech segments. All spectrograms were made from speech segments with 8 kHz sampling, and the transformation window width was 32 ms. The vertical line in each spectrogram corresponds to the time position of the spectral section.

The peaks of the spectrum envelope are the formants caused by acoustic resonance of the vocal tract. It is known that most of the discriminative information between vowels is encoded in the lowest two formants, F1 and F2. The finer peaks of the spectrum correspond to the vibration of the vocal folds, whose fundamental frequency is referred to as F0. Usually, the formant frequencies are much higher than F0, as can be seen in the adult-directed speech in Figs. 5(b) and 6(b). However, the authors found that F0 takes approximately the same frequency as F1 in highly exaggerated infant-directed utterances, as can be observed in Figs. 5(d) and 6(d). This phenomenon may be associated with a mother trying to draw her infant's attention to something. The meaning of the pitch-formant matching for infants is not known. Perhaps it helps infants learn to discriminate vowels by giving simpler examples of spectrum shapes. The authors conjecture that this phenomenon might be peculiar to female speakers because the speaking F0 of adult male speakers is usually much lower than F1,25 though the answer is not known yet, since the Motherese corpus does not include male speakers.

FIG. 5. (Color online) Spectrograms and spectral sections of the word "shoes" for adult-directed [(a) and (b)] and infant-directed [(c) and (d)] speech. In the infant-directed /uw/ sound, the fundamental frequency aligns with the first formant.

FIG. 6. (Color online) Spectrograms and spectral sections of the word "sheep" for adult-directed [(a) and (b)] and infant-directed [(c) and (d)] speech. In the infant-directed /iy/ sound, the fundamental frequency aligns with the first formant.

E. MDS analysis

Multidimensional scaling (MDS) is a technique for arranging data points in a space that has a lower dimension than the original data space.26 The arrangement is determined so as to keep the distances between the data points as close as possible to those in the original space. In an MDS analysis, only the distances between points have meaning. Therefore, transformations that do not change distances, such as rotations, give an equivalent arrangement in terms of the MDS analysis, and the axes are arbitrary.

The MDS analysis was performed on a distance matrix of speakers defined by Euclidean distances of modulation-spectrum-based feature vectors. The feature vectors were made by first computing the modulation spectrum for each utterance from the 8 kHz sampled waveform, reducing it to a 5 x 5 matrix, and re-ordering the matrix to form a 25-dimensional vector. For the reduction operation, element (i, j) in the original modulation spectrum matrix was assigned to block (5(i - 1)/128 + 1, 5(j - 1)/128 + 1) (using integer division), and the average within each block was used as the corresponding element of the new matrix. The utterance-level feature vectors were then averaged to make a speaker-level feature vector, and Euclidean distances were computed for all pairs of speakers, both within and across corpora. Instances of the same speaker in different genres are included as well as cross-speaker pairs.

In the modulation spectrum estimation, the authors introduced a channel normalization to compensate for corpus-level recording effects, so that a comparison across corpora would make sense. The proposed normalization algorithm works in the complex spectral domain as follows:

    \hat{S}_i(\omega) = \frac{Y_i(\omega)}{C(\omega)} \approx \frac{Y_i(\omega)}{\exp\left( \langle \log Y_i(\omega) \rangle_1^n \right)},    (1)


where \hat{S}_i(\omega) is the estimated normalized signal associated with the ith time frame, Y_i(\omega) is the corresponding observed signal, and C(\omega) is the unknown constant channel effect, which is approximated using the averaging operation \langle \cdot \rangle_1^n over the observed n-length sequence of spectral vectors. (The same result is obtained by applying cepstral mean normalization27 with complex cepstra and inverting it to the log spectrum domain. Since the cosine transformation from log spectrum to cepstrum is a linear transformation, it is canceled in the inversion.) The normalization is inserted between the first and the second Fourier transforms of the modulation spectrum.

Figure 7 shows the two-dimensional MDS representation of the different speakers for each genre: each point represents a speaker-genre combination, i.e., the transformation of the speaker's average modulation spectrum vector for either infant-directed/adult-directed or dictation/spontaneous speech. As can be seen, the spontaneous and adult-directed samples are arranged close to each other, showing the similarity of the styles. The dictated and infant-directed samples are on opposite sides of these two, representing different extremes of the range of styles examined here. The authors had expected some similarities between read speech and infant-directed speech, assuming that both would have more instances of well-articulated phonemes, but presume that the difference is due to hyperarticulation, which tends to be frequent in infant-directed speech but less so in read speech. The authors conjecture that the MDS dimensions capture the variability of the speech, with dictated (read) speech being the least variable and infant-directed speech being the most variable. The result is consistent with the F0 analysis in that infant-directed speech is at one extreme, but the different genres seem better separated by the MDS analysis.

MDS has also been used in an analysis of speaking style where distances between speakers are defined in terms of hidden Markov model state distribution parameters.28 In that study, all of the speech is read, so it is difficult to make a comparison to the findings here. Further, the approach in that study requires transcribed speech, whereas the approach here is based strictly on acoustic observations. An important difference between the two sets of findings is that speaking rate is an important dimension in the Shozakai and Nagino study,28 but in the authors' study the two slow-rate cases (read speech and motherese) are at opposite ends of the distribution.

FIG. 7. (Color online) MDS representation of modulation spectrum features of speakers associated with dictated (Dict), spontaneous (Spon), infant-directed (ID), and adult-directed (AD) speech.
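The following Python sketch illustrates the feature pipeline behind the MDS analysis: a channel-normalized modulation spectrum per utterance (applying a magnitude-domain simplification of Eq. (1)), block averaging to a 5 x 5 matrix flattened to 25 dimensions, speaker-level averaging, and metric MDS on the Euclidean distance matrix. The use of scipy and scikit-learn, and all names, are assumptions of this sketch rather than the authors' tools.

```python
# Sketch of the speaker-level MDS analysis in Sec. III.E. The 5x5 reduction and
# the placement of the channel normalization follow the description in the text;
# the normalization here acts on magnitudes, a simplification of Eq. (1).
import numpy as np
from scipy.signal import stft
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def ms_feature(x, fs=8000, win_ms=16.0, overlap=0.75, nfft=128, blocks=5):
    nperseg = int(round(fs * win_ms / 1000.0))
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=int(nperseg * overlap), nfft=nfft)
    mag = np.abs(Z)
    # Channel normalization between the two transforms: divide by exp of the
    # time-averaged log spectrum (magnitude-domain version of Eq. (1)).
    mag = mag / np.exp(np.mean(np.log(mag + 1e-10), axis=1, keepdims=True))
    mod = np.abs(np.fft.rfft(mag, axis=1))              # modulation spectrum
    # Reduce to a blocks x blocks matrix by block averaging, then flatten to 25 dims.
    rows = np.array_split(np.arange(mod.shape[0]), blocks)
    cols = np.array_split(np.arange(mod.shape[1]), blocks)
    small = np.array([[mod[np.ix_(r, c)].mean() for c in cols] for r in rows])
    return small.ravel()

def mds_map(speaker_waves, fs=8000):
    """speaker_waves: list of (label, [waveforms]) per speaker-genre combination."""
    feats = np.array([np.mean([ms_feature(w, fs) for w in waves], axis=0)
                      for _, waves in speaker_waves])
    dist = squareform(pdist(feats))                      # Euclidean distance matrix
    return MDS(n_components=2, dissimilarity="precomputed").fit_transform(dist)
```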


IV. AUTOMATIC CLASSIFICATION OF SPEAKING STYLES

The above analysis suggests that the modulation spectrum should provide useful features for recognizing speaking style, but the findings are based on averages of multiple utterances from a speaker, and classification might be better aimed at the utterance level. In this section, the authors compare the performance of different features used in automatic classification of speaking style, with the primary aim of assessing the relative utility of the different features and the secondary aim of assessing the performance that can be achieved with acoustic cues alone. The following three kinds of features and their combinations were used in the analysis.

(1) F0: average F0. Before the features are calculated for an utterance, F0 is normalized for each speaker by subtracting the mean, which is estimated using both genres. Only voiced segments are used.
(2) EN: average log energy. Before the features are calculated for an utterance, the energy is normalized for each speaker by subtracting the mean and dividing by the standard deviation, both estimated using both genres.
(3) MS: modulation spectrum feature vector. The modulation spectrum matrix is first averaged and decimated to obtain a fixed, low-dimension (5 x 5) matrix, represented as a 25-dimensional vector.

These features were calculated from utterance waveforms with 8 kHz sampling. A linear discriminant function was used in a binary classification task: identifying adult- vs infant-directed speech in the Motherese corpus. The simple linear function was chosen because the training set size was not large and because the speaker-level data suggested that it would be effective. (In genre detection, where more labeled data are available, it may be interesting to investigate more complex methods.) The parameters of the linear discriminant functions were estimated, and style classification performance was evaluated, using tenfold cross-validation.

Table I shows the classification error rates. For comparison, guessing based on priors has an error rate of 49%. Among the three single features, F0 gave the lowest error rate of 22.7%, consistent with prior work showing high average F0 associated with infant-directed speech. (When F0 was normalized by both mean and standard deviation, the error rate was 22.9%, similar to but slightly higher than the 22.7% obtained when only the mean was normalized. This is probably because of the difficulty in estimating the normalization factor, since the infant-directed data have much larger variance than the adult-directed data.) The MS feature gave a result slightly higher than that for F0. (The authors also looked at reduced-dimension versions of the MS feature, but this led to increased error.) Using energy alone gave the highest error rate, which is partly because of the difficulty in removing channel effects from a far-field microphone recording. The lowest error rate (17.9%) was obtained by combining F0, energy, and modulation spectrum features.

TABLE I. Utterance features and associated classification error rates (%).

Feature        Error
F0             22.7
EN             40.3
MS             23.7
F0+EN          21.1
F0+MS          18.7
EN+MS          23.0
F0+EN+MS       17.9
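A sketch of this classification setup, assuming precomputed utterance-level feature vectors and using scikit-learn's linear discriminant analysis with tenfold cross-validation, is shown below; the feature extraction itself and all names are illustrative.

```python
# Sketch of the style classification experiment in Sec. IV: a linear classifier
# on F0, energy, and 25-dim modulation spectrum features, scored with tenfold
# cross-validation. Feature extraction and names are assumed for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def style_classification_error(features, labels, folds=10):
    """features: (n_utterances, n_dims) array; labels: 0 = adult-, 1 = infant-directed."""
    clf = LinearDiscriminantAnalysis()
    acc = cross_val_score(clf, features, labels, cv=folds, scoring="accuracy")
    return 1.0 - acc.mean()                     # classification error rate

# Example with random placeholder features (27 dims: F0 + EN + 25-dim MS).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 27))
y = rng.integers(0, 2, size=1000)
print(f"error = {style_classification_error(X, y):.3f}")   # ~0.5 for random data
```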

V. SAMPLING FOR ASR TRAINING

While it is widely accepted that more data lead to improved speech recognition performance, it is also important to have data that are well matched to the target task. For new tasks (particularly new languages) where large amounts of transcribed data are not readily available, the cost of transcription can have a significant impact on overall development costs. In addition, for some tasks, experiments have shown that recognition performance can be improved by omitting certain samples from the training set. For these reasons, researchers have investigated methods for selecting data for training. Early work in speech recognition13 involved selecting utterances based on recognition error or speech recognizer confidence, specifically choosing to add those utterances with high error (or low confidence). With an error criterion, it was shown that improved performance could be achieved with less than the full amount of data, but this approach requires transcriptions for measuring errors and so cannot be used for initial selection. While recognizer confidence was less effective in that study, it has since been used to good effect in identifying utterances to remove from the training set (either because of poor transcriptions or noisy conditions, e.g., Ref. 29) and in active learning.30 An alternative to measuring the confidence of one system is to look at disagreement among multiple systems, as in a study applying hidden Markov models (HMMs) to part-of-speech tagging.31 Another early study32 looked at representing speech from groups of speakers with supervectors of average cepstral parameters from a subset of phones, based on a forced alignment to the transcript, which are reduced in dimension with principal component analysis. These vectors are clustered, and data are selected to best represent the different clusters. A more recent related study seeks to sample the space characterized by distances between HMMs trained for different speakers, finding that the best results are obtained by sampling at the periphery of a small-dimensional space learned via MDS,33 building on the results of Shozakai and Nagino28 and reducing the requirements for speech transcription by using adaptation for training the speaker-dependent models.

These different results are not entirely consistent in their recommendations. The clustering results suggest that one should sample to cover the space. Other results suggest that one should sample to emphasize outliers after some initial training phase, similar to the philosophy of boosting. Looking at human language learning would support the idea of using multiple strategies, in the sense that mothers talk differently to their children when they are very young, but it also suggests the use of sampling in the first stage. In this work, the authors investigate methods of sampling in a single training phase and a two-stage approach. While they include average per-frame forced alignment likelihood (roughly equivalent to confidence) as a baseline criterion for comparison, the goal is to identify acoustic criteria that indicate which utterances to select for transcription for initial training. A secondary goal of investigating multipass strategies is to develop an efficient training strategy that gives a good ASR model at a lower computational cost for training.

A. Utterance features for sampling

The following one-dimensional features are used as the sampling criteria and compared to a random sampling baseline.

(1) LL: average per-frame log acoustic likelihood obtained by forced alignment to the reference transcript.
(2) F0: average normalized fundamental frequency. Only voiced segments are used.
(3) EN: average normalized log energy.
(4) MS: linear discriminant projection of modulation spectrum features designed to separate infant-directed vs adult-directed speech.

LL relies on an ASR model, but the others are acoustically based features. LL was estimated using a model developed for SRI's Decipher, a large vocabulary recognizer, trained on the same data it is used to sample from. LL is included because it is an indicator of "representative" (or "outlier") utterances, in that higher (or lower) log likelihoods occur for utterances that are closer to (or farther from) the mean. However, there are some limitations of the LL measure. First, it cannot be used for the task of selecting data to transcribe, since it requires transcriptions for forced alignment. Second, there is a bias introduced by using a previously trained ASR model to determine what is typical. Hence, LL mainly serves as a comparison point. For the F0 and EN features, the raw features are extracted as described in the genre-classification study and are normalized for each speaker. For the MS feature, the authors first train a 25-dimensional linear classifier to distinguish between infant-directed and adult-directed speech, as in the genre-classification experiments, and then use the resulting score as the feature, which corresponds to a linear discriminant analysis projection.

B. Speech recognition systems

1. HTK-based system

A small CTS task defined in Ref. 34 was used to evaluate the authors' methods with a recognition system based on the HTK toolkit.35 The acoustic model training set of the CTS task consisted of approximately 32 h of speech (16 h for each gender) coming from a mixture of Fisher36 and Switchboard37 training utterances. This baseline training set was selected by uniformly sampling the Switchboard and Fisher training sets, with the constraint that the two sources comprise roughly 40% and 60% of each 16 h subset, respectively. In the following experiments, the male part of this baseline training set was used and compared to various sampled subsets from the same corpora. The acoustic features were 12 perceptual linear predictive (PLP) coefficients38 and energy, with their first two derivatives, computed with vocal tract length normalization39 and mean and variance normalization. The acoustic model was a set of three-state left-to-right tied-state triphone HMMs with 32 mixtures per HMM state and 2000 states across all triphones. The model was trained using the typical HTK "recipe" of initializing triphones from monophones, clustering single Gaussian triphones, and gradually increasing the number of mixtures after clustering. Models were updated with five iterations of expectation-maximization (EM) training at each step.

The small CTS task test sets were selected from the RT03 evaluation test set40 based on constraining the out-of-vocabulary rate associated with a 1k-word vocabulary (the highest frequency words in the full corpus). (The original RT03 evaluation set contains about the same number of utterances from the Fisher and Switchboard corpora.) The male portions of the small CTS task test sets consist of 35 min of data for tuning and 32 min for testing. The dictionary for decoding contains multi-words and multiple pronunciations, so the overall size is 5.1k. A bigram language model was used, made by projecting the 2004 CTS evaluation language model onto the 1k vocabulary.

2. Decipher-based system

In order to investigate how the different sampling methods work in a large vocabulary system, experiments were also conducted using the SRI Decipher41 system. The baseline training sets were randomly sampled from the Fisher and Switchboard corpora, as for the HTK-based system. The test set is the RT04 development test set. The dictionary is based on a 38k-word vocabulary and has 83k entries, including multi-words and multiple pronunciations. Decoding involves rescoring a lattice of initial-pass hypotheses with a speaker-adapted model (using maximum likelihood linear regression) and a 4-gram language model. Note that this system differs from the standard SRI recognition system in that it has only PLP cross-word triphone models and only maximum likelihood (ML) training is used.

C. Sampling for data selection

For each feature considered, utterances in the training set are classified into three equally sized classes of "lowest," "middle," and "highest" in order to assess whether typical utterances (middle category) are more or less useful than outliers.
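A simple sketch of this selection procedure is given below, assuming each utterance is described by an identifier, a one-dimensional score, and a duration; splitting is by equal total duration, and a fixed number of hours is then drawn at random from one class. Names and data layout are illustrative.

```python
# Sketch of the data selection in Sec. V.C: sort utterances by a one-dimensional
# score, split into equal-duration "lowest"/"middle"/"highest" ranges, then
# randomly draw a fixed number of hours from one range.
import random

def split_by_score(utts):
    """utts: list of (utt_id, score, duration_sec) tuples."""
    utts = sorted(utts, key=lambda u: u[1])
    total = sum(u[2] for u in utts)
    classes, names, acc, cur = {"lowest": [], "middle": [], "highest": []}, ["lowest", "middle", "highest"], 0.0, 0
    for u in utts:
        if acc >= total * (cur + 1) / 3 and cur < 2:
            cur += 1                                    # move to the next duration third
        classes[names[cur]].append(u)
        acc += u[2]
    return classes

def sample_hours(utt_class, hours, seed=0):
    """Randomly select utterances from one class until the duration budget is met."""
    pool = utt_class[:]
    random.Random(seed).shuffle(pool)
    selected, acc = [], 0.0
    for u in pool:
        if acc >= hours * 3600:
            break
        selected.append(u)
        acc += u[2]
    return selected
```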

1. Small vocabulary results using HTK

For each of the utterance-level features, utterances from the Fisher and Switchboard corpora were sorted and partitioned into three subsets based on increasing feature ranges: lowest, middle, and highest. These ranges were chosen so that the subsets were the same size in terms of duration.


For each subset, 16 h of segments were randomly selected as training data. The ratio of durations of the two corpora within each of the sampled sets was kept the same as in the original training set. Table II shows the WER of the models trained using the subsets. Random sampling is used to provide the baseline. Generally speaking, the sampled subsets from the middle classes worked better than those from the lowest or highest classes, regardless of the scoring method. All the samplings from the middle classes gave lower WER than the baseline, and among these MS gave the best result, a 2.4% relative WER reduction. For the MS feature, a higher value means that the utterance sounds more infant-directed. For HMM training, however, it can be seen that utterances with mid-range features are more useful than the higher-scoring utterances that the authors presumed to be hyperarticulated. Either hyperarticulation is not as useful in machine learning as it is for human language learners, or the high MS score captures something other than hyperarticulation. The combination of F0, EN, and MS, which led to better classification of infant- vs adult-directed speech, was also tested but did not lead to lower WERs.

TABLE II. Sampling criteria and resulting WER (%). The baseline WER was 41.4.

Scoring measure   Lowest   Middle   Highest
LL                 41.6     41.1     41.7
F0                 41.3     41.0     41.1
EN                 42.7     40.8     42.7
MS                 41.1     40.4     41.9
F0+EN+MS           41.1     41.0     41.4

2. Large vocabulary results using Decipher

Both male and female triphone models were trained, respectively, using 16 h of the sampled training set. In decoding, the system automatically decided which model to use by comparing the likelihoods of Gaussian mixture models (GMMs) associated with male and female speakers. The male/female GMMs were trained from HUB5 data and had 256 mixtures each. Table III shows the WER of the subsets from the middle classes, since this gave the best results in the HTK small task experiments. As can be seen, feature-based sampling also results in lower WER than the baseline in this experiment using the large vocabulary Decipher system. However, the LL criterion gives better results, perhaps because of the match to the Decipher models.

TABLE III. WER (%) for large vocabulary recognition using Decipher.

Sampling measure   WER
Baseline           26.1
LL                 25.7
F0                 25.9
EN                 25.7
MS                 25.9


FIG. 8. One-stage and two-stage training procedures. (One-stage training: train set -> monophone -> triphone -> triphone with GM. Two-stage training: first stage, train set 1 -> monophone -> triphone -> triphone with GM; second stage, train set 2 -> triphone with GM.)

D. Sampling for two-stage training

Usually, acoustic models are trained using a locally optimal iterative method such as the EM algorithm.42 By using an initial model trained on a subset of the data with better separated classes (either prototypical instances or data from the middle region only), the final model may avoid problems of local optima and require less training time (lower computational cost). Thus, to improve efficiency and, hopefully, performance, two-stage training methods are investigated. Figure 8 shows the procedures for conventional one-stage training and the proposed two-stage training. In these experiments, the decision tree designed in stage 1 is fixed in stage 2, though the distributions are re-estimated. The experiments are designed to answer two questions: (1) Does two-stage training with increased amounts of data in the second stage yield improved performance over one-stage training? (2) Is it useful to constrain model means to the prototypes learned initially and update only the variances? The second question was motivated by work on child language acquisition showing early learning of prototypical vowel sounds.
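As a toy illustration of the two-stage schedule (not the HTK triphone recipe itself), the sketch below warm-starts a single hmmlearn GaussianHMM from a stage-1 model and then re-estimates it on a larger set, optionally updating only the covariances; the use of hmmlearn and all names are assumptions of this sketch.

```python
# Toy sketch of the two-stage training idea in Sec. V.D, using a GaussianHMM as a
# stand-in for the triphone system: stage 1 trains on a selected subset, stage 2
# re-estimates on more data, optionally constraining which parameters EM updates.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def two_stage_train(stage1_feats, stage2_feats, n_states=3, variances_only=False):
    """stage*_feats: lists of (frames, dims) feature arrays (e.g., PLP features)."""
    X1, len1 = np.vstack(stage1_feats), [len(x) for x in stage1_feats]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=5)
    model.fit(X1, len1)                                  # stage 1: initial model

    X2, len2 = np.vstack(stage2_feats), [len(x) for x in stage2_feats]
    # Stage 2: keep the stage-1 parameters as initialization (init_params="")
    # and restrict which parameters EM updates via `params`.
    model.init_params = ""
    model.params = "c" if variances_only else "stmc"     # covariances only, or all
    model.n_iter = 2
    model.fit(X2, len2)                                  # stage 2: re-estimation
    return model
```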

1. Results using HTK-based systems

For two-stage training, 48 h of training data were randomly selected as a full set, using the same second-stage random sampling for all of the different initial models. Acoustic models trained on the 16 h subsets of the middle classes were used as the initial models, and their parameters were updated using two EM iterations in the second stage. Both the one-stage and two-stage models had 2k states and 32 mixtures per HMM state. Two types of EM training were conducted: updating only the variances vs updating all the parameters. Note that the subset used to train the initial model was not added in the second stage, to keep the experimental condition the same among the sampling methods except for the parameter initialization.

Table IV shows the WER. In the table, results of mean-only updating are also shown for reference. The baseline "one-stage" model was trained from scratch on the full 48 h set. While some models gave slightly better results when only the variance or mean was updated, it was more often better to update all parameters. When all parameters were updated, all two-stage strategies improved performance over the one-stage baseline, which supports the hypothesis that initial training with better separated classes is helpful. As in the earlier HTK-based experiments, the MS-based initialization gave the best result. The computation time of this two-stage EM training was only 40% of that of one-stage training.

TABLE IV. WER (%) of one- vs two-stage training using HTK, with and without update constraints ("Variance," "Mean," and "All" indicate which parameters were updated in the second stage). The one-stage baseline, trained on the full 48 h set, had a WER of 41.0.

Sampling   Variance   Mean   All
Random       41.6     40.9   40.9
LL           41.4     40.8   40.6
F0           40.6     40.8   40.7
EN           40.3     40.1   40.5
MS           41.5     40.4   40.1

2. Results using Decipher-based systems

The two-stage training strategy was also evaluated with the Decipher system, using 32 h of training data in the first stage and 64 h in the second stage. The results are shown in Table V. When the randomly sampled set was used in the first stage, the WER increased compared to the one-stage baseline. On the other hand, there were small reductions in WER when the initial set was selected based on the utterance features, suggesting that better initialization can improve overall system performance. More significantly, the cost of two-stage training is 65% of that of the one-stage approach.

TABLE V. WER (%) of one- vs two-stage training using Decipher.

Sampling    WER
One-stage   23.8
Random      24.2
LL          23.6
F0          23.7
EN          23.7
MS          23.7

VI. CONCLUSIONS

Motivated by insights from human language acquisition and the high cost of transcribing speech when moving to a new domain, the authors analyzed the characteristics of speaking styles and investigated sampling methods for ASR. The analyses were performed in the first half of the paper using the MULTI-REG corpus and the Motherese corpus. While both infant-directed and dictation speech tend to have more clearly articulated utterances and slower speaking rates, the authors found that these appear on opposite ends of a scale learned through MDS of the modulation spectrum. They also discovered the pitch-formant matching phenomenon in highly exaggerated infant-directed utterances. The modulation spectrum was also shown to be useful in automatic classification of speaking style.

In the second half of the paper, the authors investigated sampling methods for ASR based on features used in the analyses. For the small CTS task, it was shown that sampling from the middle-range classes gave lower WER than sampling from the lowest or highest classes for all of the utterance features. Among the sampling criteria, the lowest WER (a 2.4% relative reduction) was obtained by mid-range sampling using a modulation-spectrum-based feature. Sampling was useful in both a small vocabulary HTK-based system and a more complex large vocabulary Decipher system. While acoustic measures did well with the HTK system, the better-matched likelihood criterion was most useful in the case where a model is trained in advance. In addition, sampling was useful for improving the training schedule, not only reducing the computational cost but also leading to gains in WER in some cases. In the two-stage training experiment using HTK, it was shown that better performance was obtained by initializing with a model trained on data selected to include utterances near the mean of the modulation spectrum. In other words, using lower variance data in initializing the model is helpful. In addition, updating all parameters was better than updating only variances. The computational cost was about 40%-65% of that of conventional one-stage training. The cost savings may increase in scaling to larger data sets, but there are issues to explore related to incrementing model complexity. An open question is how sampling interacts with discriminative training.

ACKNOWLEDGMENT

This work was supported by DARPA Grant No. MDA972-02-1-0024.

P. K. Kuhl, J. E. Andruski, I. A. Chistovich, L. A. Chistovich, E. V. Kozhevnikova, V. L. Ryskina, E. I. Stolyarova, U. Sundberg, and F. Lacerda, “Cross-language analysis of phonetic units in language addressed to infants,” Science 277, 684–686 共1997兲. 2 D. Burnham, C. Kitamura, and U. Vollmer-Conna, “What’s new, pussycat? On talking to babies and animals,” Science 296, 1435 共2002兲. 3 K. Kirchhoff and S. Schimmel, “Statistical properties of infant-directed vs. adult-directed speech: Insights from speech recognition,” J. Acoust. Soc. Am. 117, 2238–2246 共2005兲. 4 H. M. Liu, P. K. Kuhl, and F. M. Tsao, “An association between mothers’ speech clarity and infants’ speech discrimination skills,” Dev. Sci. 6, F1– F10 共2003兲. 5 V. Hazan and A. Simpson, “The effect of cue-enhancement on consonant intelligibility in noise: Speaker and listener effects,” Lang. Speech 43, 273–294 共2000兲. 6 R. P. Lippmann, “Speech recognition by machines and humans,” Speech Commun. 22, 1–15 共1997兲. 7 D. A. van Leeuwen, L. G. V. den Berg, and H. J. M. Steeneken, “Human benchmarks for speaker independent large vocabulary recognition performance,” in Proceedings of Eurospeech 共1995兲, Vol. 2, pp. 1461–1464. 8 O. Scharenborg, “Reaching over the gap: A review of efforts to link human and automatic speech recognition research,” Speech Commun. 49, 336–347 共2007兲. 9 M. Weintraub, K. Taussig, K. Hunicke-Smith, and A. Snodgrass, “Effect of speaking style on LVCSR performance,” in Proceedings of ICSLP, Philadelphia, PA 共1996兲, pp. 16–19. 10 M. Saraclar, H. Nock, and S. Khudanpur, “Pronunciation modeling by sharing Gaussian densities across phonetic models,” Comput. Speech Shinozaki et al.: Speaking style and speech recognition

Downloaded 06 Jan 2012 to 128.95.30.241. Redistribution subject to ASA license or copyright; see http://asadl.org/journals/doc/ASALIB-home/info/terms.jsp

1509

Lang. 14, 137–160 共2000兲. T. Kawahara, H. Nanjo, T. Shinozaki, and S. Furui, “Benchmark test for speech recognition using the Corpus of Spontaneous Japanese,” in Proceedings of SSPR2003 共2003兲, pp. 135–138. 12 K. Itou, M. Yamamoto, K. Takeda, T. Takezawa, T. Matsuoka, T. Kobayashi, K. Shikano, and S. Itahashi, “JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,” J. Acoust. Soc. Jpn. 共E兲 20, 199–206 共1999兲. 13 T. M. Kamm and G. G. L. Meyer, “Selective sampling of training data for speech recognition,” in Proceedings of Human Language and Technology, San Francisco, CA 共2002兲, pp. 20–24. 14 G. Zweig and M. Padmanabhan, “Boosting Gaussian mixtures in a LVCSR system,” in Proceedings of ICASSP, Istanbul, Turkey 共2000兲, pp. 1527–1530. 15 H. Dudley, “Remaking speech,” J. Acoust. Soc. Am. 11, 169–177 共1939兲. 16 N. Kowalski, D. Depireux, and S. Shamma, “Analysis of dynamic spectra in ferret primary auditory cortex: I. Characteristics of single unit responses to moving ripple spectra,” J. Neurophysiol. 76, 3503–3523 共1996兲. 17 G. Langner, M. Sams, P. Heil, and H. Schulze, “Frequency and periodicity are represented in orthogonal maps in the human auditory cortex: Evidence from magnetoencephalography,” J. Comp. Physiol. 181, 665–676 共1997兲. 18 L. Atlas and S. Shamma, “Joint acoustic and modulation frequency,” EURASIP J. Appl. Signal Process. 2003, 668–675 共2003兲. 19 R. Drullman, J. M. Festen, and R. Plomp, “Effect of reducing slow temporal modulations on speech reception,” J. Acoust. Soc. Am. 95, 2670– 2680 共1994兲. 20 S. Schimmel and L. Atlas, “Target talker enhancement in hearing devices,” in Proceedings of ICASSP, Las Vegas, NV 共2008兲, pp. 4201–4204. 21 S. Greenberg and T. Arai, “The relation between speech intelligibility and the complex modulation spectrum,” in Proceedings of Eurospeech, Aalborg, Denmark 共2001兲, pp. 473–476. 22 J. Darch, B. Milner, I. Almajai, and S. Vaseghi, “An investigation into the correlation and prediction of acoustic speech features from MFCC vectors,” in Proceedings of ICASSP 共2007兲, Vol. IV, pp. 465–468. 23 D. Talkin, “A robust algorithm for pitch tracking 共RAPT兲,” in Speech Coding and Synthesis, edited by W. Kleijn and K. Paliwal 共Elsevier Science, Amsterdam, 1995兲, pp. 495–518. 24 K. Sonmez, E. Shriberg, L. Heck, and M. Weintraub, “Modeling dynamic prosodic variation for speaker verification,” in Proceedings of ICSLP, Sydney, Australia 共1998兲, Vol. 7, pp. 3189–3192. 25 E. C. Willis and D. T. Kenny, “Effect of voice change on singing pitch accuracy in young male singers,” J. Interdisciplinary Music Studies 2, 111–119 共2008兲. 26 Modern Multidimensional Scaling: Theory and Applications, edited by I. Borg and P. Groenen 共Springer-Verlag, New York, 1997兲. 27 B. Atal, “Effectiveness of linear prediction characteristics of of the speech 11

1510

J. Acoust. Soc. Am., Vol. 126, No. 3, September 2009

wave for automatic speaker identification and verification,” J. Acoust. Soc. Am. 55, 1304–1312 共1974兲. 28 M. Shozakai and G. Nagino, “Analysis of speaking styles by twodimensional visualization of aggregate of acoustic models,” in Proceedings of ICSLP, Jeju, Korea 共2004兲, Vol. I, pp. 717–720. 29 H. Y. Chan and P. C. Woodland, “Improving broadcast news transcription by lightly supervised discriminative training,” in Proceedings of ICASSP, Quebec, Canada 共2004兲, Vol. I, pp. 737–740. 30 G. Riccardi and D. Hakkani-Tur, “Active learning: Theory and applications to automatic speech recognition,” IEEE Trans. Speech Audio Process. 13, 504–511 共2005兲. 31 I. Dagan and S. Engelson, “Committee-based sampling for training probabilistic classifiers,” in Proceedings of ICML, Tahoe City, CA 共1995兲, pp. 150–157. 32 A. Nagorski, L. Boves, and H. Steeneken, “Optimal selection of speech data for automatic speech recognition systems,” in Proceedings of ICSLP, Denver, CO 共2002兲, pp. 2437–2440. 33 G. Nagino and M. Shozakai, “Building an effective corpus by using acoustic space visualization 共COSMOS兲 method,” in Proceedings of ICASSP, Philadelphia, PA 共2005兲, Vol. I, pp. 449–452. 34 B. Chen, O. Cetin, G. Doddington, N. Morgan, M. Ostendorf, T. Shinozaki, and Q. Zhu, “A CTS task for meaningful fast-turnaround experiments,” in NIST RT-04 Workshop, Palisades, NY 共2004兲. 35 S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book 共Cambridge University Engineering Department, Cambridge, 2006兲. 36 C. Cieri, D. Miller, and K. Walker, “The Fisher corpus: A resource for the next generations of speech-to-text,” in Proceedings of LREC, Lisbon, Portugal 共2004兲, pp. 69–71. 37 J. J. Godfrey, E. C. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in Proc. ICASSP, San Francisco, CA 共1992兲, Vol. I, pp. 517–520. 38 H. Hermansky, “Perceptual linear predictive 共PLP兲 analysis of speech,” J. Acoust. Soc. Am. 87, 1738–1752 共1990兲. 39 E. Eide and H. Gish, “A parametric approach to vocal tract length normalization,” in Proceedings of ICASSP, Atlanta, GA 共1996兲, Vol. I, pp. 346– 348. 40 http://www.nist.gov/speech/tests/rt/ 共Last viewed April 21, 2008兲. 41 A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. R. Gadde, M. Plauche, C. Richey, E. Shriberg, K. Sonmez, F. Weng, and J. Zheng, “The SRI March 2000 Hub-5 conversational speech transcription system,” in Proceedings of NIST Speech Transcription Workshop, College Park, MD 共2002兲. 42 A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Stat. Soc. Ser. B 共Methodol.兲 39, 1–38 共1977兲.
