Paper Title (use style: paper title)


A Method for Modeling and Generating Mandarin Tone Contour with Phrase Intonation Based on the Generation Process Model

Miaomiao WANG1, Miaomiao WEN1, Keikichi HIROSE2, Nobuaki MINEMATSU2

1 Department of Electrical Engineering and Information Systems, The University of Tokyo, Tokyo, Japan
{wangm, wenm}@gavo.t.u-tokyo.ac.jp

2 Department of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan
{hirose, mine}@gavo.t.u-tokyo.ac.jp

Abstract: This paper models the F0 contours of Mandarin Chinese speech as the superposition of syllable-level tone and phrase-level intonation, based on the generation process model. The tone components are realized by concatenating fragments predicted by an HMM-based method, while the phrase components are generated by rules within the generation process model framework. In conventional HMM-based TTS, intonation, especially at the sentence or phrase level, tends to be bland. The multi-space distribution (MSD) used in HMM training and F0 parameter trajectory generation is partially responsible for this blandness. In addition, the F0 trajectory generated in this way has a smaller dynamic range than that of natural speech, which makes the synthesized speech sound less lively. In this paper, in order to model F0 within the standard HMM framework, the F0 generation process model is used to re-estimate the tone components and phrase components. Prior knowledge of the voiced/unvoiced (VU) status of each Mandarin phoneme is imposed and used for the VU decision. We also design a set of syntax features to improve Mandarin phoneme duration prediction.

Keywords: Mandarin speech synthesis; F0 modeling; HMM-based TTS; generation process model; duration prediction

I. INTRODUCTION

Recently, the speech synthesis community has focused on HMM-based speech synthesis, which has been demonstrated to be very effective in synthesizing acceptable speech. In this approach, short-term spectra, fundamental frequency (F0) and duration are simultaneously modeled by the corresponding HMMs. Compared with unit-selection speech synthesis, which relies on a large corpus, HMM-based synthesis is statistically oriented and model based. The speech generated by the HMMs is fairly smooth and exhibits none of the concatenation glitches that occur in unit-selection synthesis. To change the segmental or supra-segmental quality of the generated speech, the HMM parameters can be modified flexibly [1, 2, 3]. However, problems remain from the viewpoint of prosodic features. Although various styles such as attitudes and emotions have been realized with rather high quality by the method, frame-by-frame processing of prosodic features is problematic: prosodic features cover a wider time span than segmental features and should be treated differently.

Although the control of prosodic features is an important issue in speech synthesis for any language, it becomes quite critical for speech quality in the case of Mandarin. As is well known, Mandarin is a typical tonal language, and each syllable with the same phoneme constitution has up to four tone types, each indicating a different meaning. F0 contours of utterances should include these local tonal features in addition to the sentential intonation corresponding to syntactic/utterance structures. This makes the F0 movements of Mandarin sentences more complicated than those of non-tonal languages such as English and Japanese. Therefore, control of F0 contours together with other prosodic features is an important and difficult issue in Mandarin speech synthesis.

In HMM-based synthesis, the modeling of F0 is difficult due to the discontinuity of F0 across voiced and unvoiced regions. The multi-space distribution HMM (MSD-HMM) provides a solution to this problem by using a combination of discrete and continuous distributions [4]. However, although good performance can be achieved using MSD-HMMs, this type of mixed-distribution F0 modeling has some issues arising from the discontinuities at the boundaries of unvoiced regions and the need to keep the discrete and continuous density regions distinct. Therefore, the use of MSD-HMMs makes it more difficult to exploit standard techniques for HMM modeling, such as adaptation, which cannot be readily applied to the mixed discrete and continuous F0 distributions.

Furthermore, accurate prediction of phone durations is essential for high-quality TTS. Unsuitable phoneme durations deteriorate synthesis quality by decreasing the perceived naturalness of the speech. In some F0 generation systems [5], segmental durations of phonemes and pauses are first predicted and then used for the prediction of F0-related parameters.
However, current HMMs cannot predict duration very accurately, and the suprasegmental quality of the synthesized speech suffers as a result. The state duration of a standard HMM is explicitly modeled with a single Gaussian distribution, which is estimated from the state occupancy counts in the Baum-Welch re-estimation procedure. Duration prediction for unseen contexts therefore does not include high-level linguistic knowledge. It is necessary to predict segmental durations (including pauses) according to syntax information from the text.

II. GENERATION PROCESS MODEL FOR TONAL LANGUAGE

In [6], analysis of the laryngeal structure suggests that the movement of the thyroid cartilage relative to the cricoid cartilage has two degrees of freedom [7, 8]. One is horizontal translation, presumably due to the activity of the pars obliqua of the cricothyroid muscle; the other is rotation around the cricothyroid joint due to the activity of the pars recta of the cricothyroid muscle, as shown in Figure 1. The input to the translational mechanism is represented by an impulse function and is named the 'phrase command'. The input to the rotational mechanism is represented by positive and negative pedestal functions and is named the 'tone command' for tonal languages. Thus F0 contours can be described on a logarithmic scale as the superposition of phrase components, tone components and a baseline level Fb. The exact relationships between these components of an F0 contour and the underlying linguistic information have been formulated by Fujisaki and his coworkers [9].

Figure 1. The roles of the pars obliqua and pars recta of the cricothyroid muscle in translating and rotating the thyroid cartilage
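The superposition on the logarithmic scale described above is commonly written in the command-response literature in the following form. The notation is an assumption made here for illustration, since the paper does not print the formula: I phrase commands with magnitudes A_{p,i} at times T_{0i}, J tone commands with amplitudes A_{t,j} (possibly negative for Mandarin) between onsets T_{1j} and offsets T_{2j}, and G_p and G_t for the responses of the phrase and tone control mechanisms.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0i})
  + \sum_{j=1}^{J} A_{t,j} \left[ G_t(t - T_{1j}) - G_t(t - T_{2j}) \right]
```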

The model diagram for Mandarin is shown in Figure 2, where the phrase commands (impulses) produce phrase components through the phrase control mechanism, giving the global shape of the F0 contour at sentence level, while the tone commands generate tone components through the tone control mechanism, characterizing the local F0 changes. Both mechanisms are assumed to be critically-damped second-order linear systems.

Figure 2. The F0 contour generation model for Mandarin
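To make the two critically damped second-order mechanisms concrete, the following minimal sketch evaluates a command-response F0 contour. The closed-form responses are the ones usually given for this model. The constants ALPHA, BETA and GAMMA and the example commands are assumed values chosen for illustration, and all function names are hypothetical; none of them are taken from the paper.

```python
import math

ALPHA = 3.0   # assumed natural angular frequency of the phrase mechanism (1/s)
BETA = 20.0   # assumed natural angular frequency of the tone mechanism (1/s)
GAMMA = 0.9   # assumed ceiling of the tone response


def phrase_response(t, alpha=ALPHA):
    """Impulse response of a critically damped second-order system."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def tone_response(t, beta=BETA, gamma=GAMMA):
    """Step response of a critically damped second-order system, capped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, fb, phrase_cmds, tone_cmds):
    """ln F0(t) = ln Fb + phrase components + tone components.

    phrase_cmds: list of (time, magnitude) impulses
    tone_cmds:   list of (onset, offset, amplitude) pedestals;
                 the amplitude may be negative for Mandarin tones
    """
    value = math.log(fb)
    for t0, ap in phrase_cmds:
        value += ap * phrase_response(t - t0)
    for t1, t2, at in tone_cmds:
        value += at * (tone_response(t - t1) - tone_response(t - t2))
    return value


def f0_contour(times, fb, phrase_cmds, tone_cmds):
    """F0 in Hz at each time instant."""
    return [math.exp(log_f0(t, fb, phrase_cmds, tone_cmds)) for t in times]


if __name__ == "__main__":
    times = [i * 0.01 for i in range(150)]                 # 0 s to 1.49 s
    phrase_cmds = [(0.0, 0.5)]                             # illustrative values
    tone_cmds = [(0.20, 0.45, 0.3), (0.50, 0.80, -0.2)]    # illustrative values
    contour = f0_contour(times, 120.0, phrase_cmds, tone_cmds)
    print(f"min {min(contour):.1f} Hz, max {max(contour):.1f} Hz")
```

Because the components add in the log domain, a negative tone command lowers F0 multiplicatively, which is why the phrase component has to stay large enough to keep the contour above Fb.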

III. CONVENTIONAL F0 AND DURATION MODELING IN HMM-BASED SPEECH SYNTHESIS

A. MSD-HMM for F0 Modeling and Generation

A common assumption is that F0 has a continuous value in voiced regions and no value in unvoiced regions. In order to simultaneously model the discrete VU (voiced/unvoiced) decision and the continuous F0 trajectory, multi-space distribution HMMs (MSD-HMMs) are commonly used [4], with a discrete subspace for the unvoiced regions and a continuous subspace for the voiced regions. In MSD-HMM training, F0 and its first- and second-order time derivatives are modeled in three streams separate from the spectral feature stream. State tying via decision-tree clustering is used to tie the rich context models into generalized ones for predicting unseen contexts in synthesis. In parameter trajectory generation, contextual MSD-HMM parameters are retrieved by traversing the trained decision trees. The voiced or unvoiced decision for a state is determined by the corresponding voiced subspace weight. A maximum-likelihood F0 trajectory is then generated under dynamic feature constraints.

However, many researchers have pointed out that the assumption of undefined F0 in unvoiced regions and the special structure of the MSD-HMM limit the accuracy with which F0 patterns can be modeled [10-12]. Because each observation is assigned either a continuous probability density (voiced) or a discrete probability (unvoiced), the model cannot exploit soft-decision frame occupancy to reduce the effect of F0 extraction errors in the forward-backward training procedure. Moreover, using separate streams for F0 and its dynamic features introduces inconsistent and redundant mixture weights for VU decisions.

B. Duration Modeling and Generation in HMM-based TTS

The duration of a phoneme is typically modeled through HMM state durations: each context-dependent phoneme is modeled as a sequence of states, and the duration of each state is modeled. A state transition probability gives the probability of moving from one state to another. Typically, left-to-right models with no state skips are used, so the probabilities of all transitions other than to the state itself and to the following state are set to zero.
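The left-to-right, no-skip topology just described can be sketched as a transition matrix in which each row has only two non-zero entries. This is a minimal illustration, not code from any toolkit: the function name is hypothetical and the self-loop probabilities are made-up values.

```python
def left_to_right_transitions(self_loop_probs):
    """Build the transition matrix of a left-to-right HMM without skips.

    self_loop_probs[i] is the (hypothetical) probability of staying in
    emitting state i; the remaining mass moves to state i + 1. The extra
    last column stands for leaving the model.
    """
    n = len(self_loop_probs)
    matrix = [[0.0] * (n + 1) for _ in range(n)]
    for i, stay in enumerate(self_loop_probs):
        if not 0.0 <= stay < 1.0:
            raise ValueError("self-loop probability must be in [0, 1)")
        matrix[i][i] = stay            # remain in the same state
        matrix[i][i + 1] = 1.0 - stay  # advance to the following state
    return matrix


if __name__ == "__main__":
    # Five emitting states; the probabilities are illustrative only.
    for row in left_to_right_transitions([0.6, 0.7, 0.8, 0.7, 0.6]):
        print(" ".join(f"{p:.1f}" for p in row))
```

All other entries stay at zero, which is exactly the constraint that rules out skips and backward transitions.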
To model the state durations for synthesis, a duration probability distribution is determined for each state. In HMM-TTS duration modeling, the distributions are formed from the statistics of HMM parameter re-estimation. Each state duration probability distribution is regarded as a single Gaussian with a certain mean and variance. The mean and variance are computed as averages over all possible durations, each weighted by the corresponding state occupancy probability (i.e., the probability of occupying the given state during the given time interval). In speech production, the duration of a short unit such as a state is actually regulated by the durations of longer units, e.g., phone, syllable and word. The duration assignment of different units is done in a highly regulated, hierarchical manner. In addition, syntax features are a useful supplement to prosodic features for Mandarin.

IV. OUR APPROACH FOR PROSODIC FEATURES GENERATION IN HMM-BASED TTS

A. F0 Modeling using Generation Process Model

The previous sections introduced the generation process model, which can generate continuous F0 contours at both the phrase and tone levels, so that the problems encountered with the MSD-HMM are solved. The F0 contours are also smoothed so that no VU decision errors remain in the training data. In the method we propose (Figure 3), the generation process model is used to generate continuous F0 contours at the phrase and tone levels, and F0 is assumed to exist in unvoiced regions as well.

Traditional Chinese philology defines Mandarin phonetics in terms of initials and finals. Initials may be consonants or vowels, and finals are vowels or nasals. In some respects, the phonemic structure of Mandarin is quite simple: a syllable has either a consonant-vowel (CV) structure or a single-vowel (V) structure. Here we define each Mandarin phoneme as either voiced or unvoiced, based on prior knowledge of its waveform, as shown in Table I. After each phoneme is labeled with VU information, Fujisaki parameters are extracted with a FujiPara Editor [13] from these labels together with the F0 values estimated by ESPS waves. A continuous F0 contour at the phrase and tone levels can then be re-estimated from the Fujisaki parameters. Together with the extracted spectral parameters, the continuous F0 contours at the tone level are used for HMM training.

TABLE I. MANDARIN INITIALS AND TONAL FINALS WITH VU DECISIONS

Unvoiced Initials: b, c, ch, d, f, g, h, j, k, p, q, s, sh, t, x, z, zh
Voiced Initials: l, m, n, r, u, y
Voiced Tonal Finals: a, ai, an, ang, ao, e, ei, en, eng, er, i, ia, ian, iang, iao, ie, ii, iii, in, ing, iong, o, ong, ou, u, ua, uai, uan, uang, uei, uen, uo, v, van, ve, vn

A rule-based method was developed to generate the phrase components [14]. In this method, a "prosodic word" is first defined as a chunk of syllables usually uttered in tight connection: a prosodic word can be a word, a compound word, or a word chunk frequently uttered together. Then, at each prosodic word boundary, a phrase command is placed whose magnitude depends on the phrase component value at the boundary. The placement rules were constructed from observations of 100 utterances by a female native speaker of Mandarin. Unlike in non-tonal languages such as English and Japanese, the tone components of Mandarin can take negative values. To prevent the F0 contour from going below the baseline (Fb) even with negative tone components, the phrase components must maintain a certain value at all times. The rules are constructed to satisfy this condition. For details, refer to [15].

B. Duration Prediction

In Mandarin, the factors that influence initial duration differ from those that influence final duration, so in our experiments the initial and final durations are modeled separately. For initial duration prediction, the baseline feature set includes: initial name, initial category, final of the current syllable, final category of the current syllable, tone of the current syllable, tone of the syllables before and after the current syllable, prosodic boundary type of the boundaries before and after the current syllable, syllable number of the current word foot, syllable number of the current prosodic word, syllable number of the current prosodic phrase, syllable number of the current breath group, and POS of the current word. For final duration prediction, the baseline feature set includes: final name, final category, initial of the current syllable, initial category of the current syllable, tone of the current syllable, tone of the syllables before and after the current syllable, prosodic boundary type of the boundaries before and after the current syllable, syllable number of the current word foot, syllable number of the current prosodic word, syllable number of the current prosodic phrase, syllable number of the current breath group, and POS of the current word. For details, refer to [16].

Figure 3. The full algorithm of our proposed method for Mandarin speech synthesis. [Flow diagram; block labels: Speech Corpus, Text Analysis, Excitation Parameter Extraction, Spectral Parameter Extraction, Tone Component, Phrase Component, Rule-based Method, HMM Training, Phrase Data Base, HMM Data Base, Text Analysis & Phoneme UV List, Duration Prediction, Phrase Component, Tone Component, F0 Parameter, Spectrum Parameter, Synthetic Speech]
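The "Phoneme UV List" block in Figure 3 relies on the fixed VU assignment of Table I. A minimal sketch of such a lookup is shown below. The phoneme sets are copied from Table I, while the function names and the (initial, final) input format are hypothetical choices made for this example rather than the authors' implementation.

```python
UNVOICED_INITIALS = frozenset(
    "b c ch d f g h j k p q s sh t x z zh".split()
)
VOICED_INITIALS = frozenset("l m n r u y".split())
VOICED_FINALS = frozenset(
    "a ai an ang ao e ei en eng er i ia ian iang iao ie ii iii in ing "
    "iong o ong ou u ua uai uan uang uei uen uo v van ve vn".split()
)


def is_voiced(phoneme, is_initial):
    """Return True if the phoneme is treated as voiced according to Table I.

    The is_initial flag is needed because "u" appears in Table I both as
    an initial and as a final.
    """
    if is_initial:
        if phoneme in UNVOICED_INITIALS:
            return False
        if phoneme in VOICED_INITIALS:
            return True
    elif phoneme in VOICED_FINALS:
        return True
    raise KeyError(f"phoneme not listed in Table I: {phoneme!r}")


def label_syllable(initial, final):
    """VU labels for one syllable; initial is None for a V-only syllable."""
    labels = []
    if initial is not None:
        labels.append((initial, "V" if is_voiced(initial, True) else "U"))
    labels.append((final, "V" if is_voiced(final, False) else "U"))
    return labels


if __name__ == "__main__":
    print(label_syllable("zh", "ong"))
    print(label_syllable(None, "er"))
```

Since every final in Table I is voiced, only the initial determines whether a syllable contains an unvoiced stretch under this labeling.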

V. EXPERIMENT AND RESULT ANALYSIS

To evaluate the performance of the proposed method against the MSD-HMM, a manually checked corpus of a female speaker is used for both methods. The Mandarin speech corpus, which consists of 270 training and 30 testing sentences, was provided by Prof. Renhua Wang of the University of Science and Technology of China. Figure 4 shows an example of the F0 contours of a Mandarin utterance generated from the extracted tone and phrase parameters.

Figure 4. An example of F0 contour of Chinese utterance "ta1 yi1 jiu3 san1 er4 nian2 si4 yue4 can1 jia1 zhong1 guo2 gong1 nong2 hong2 jun1" (He joined the Chinese Workers and Peasants Red Army in April 1932.). [Plot; top panel: F0 (Hz) from 50.0 to 290.0; two lower panels ranging from -1.0 to 1.0; horizontal axes: t (s) from 1.0 to 4.0]

For the HMM-based method, the HMM-based Speech Synthesis toolkit (HTS Ver. 2.1) [17] is used. Five-state, left-to-right HMM phone models are adopted. The MSD-HMM generates F0 together with 24th-order mel-cepstral coefficients. The ESPS RAPT algorithm [18] is used for automatic F0 extraction. Before training, we found that 22.37% of all syllables had VU decision errors. Among these errors, 33% occurred in T4, 39% in T3, 11% in T0, 12% in T2 and 5% in T1. After MSD-HMM training, the errors increase. In contrast, after smoothing by the generation process model, all the VU errors are fixed before training. Figure 5 compares the F0 contours obtained by the RAPT algorithm and by our method. It shows that the ESPS RAPT algorithm fails to find F0 values in the vowel "u" in T3.

Figure 5. An example of the continuous F0 contours for the Mandarin syllables "shi2+wu3+wan4+mu3". From top to bottom: original waveform, F0 calculated by the RAPT algorithm, phoneme labels, F0 re-estimated by the generation process model

Table II shows the comparison of phone duration and F0 prediction. The comparisons were carried out in three different categories: 1) MSD-HMM, 2) our HMM with continuous F0, and ... After continuous tone-level F0 is used in HMM training, the VU decision errors are significantly reduced. The experimental results show that our approach achieves higher accuracy of F0 contours compared with natural speech, and also better duration generation.

TABLE II. COMPARISON OF F0 AND PHONE DURATION PREDICTION

Method          RMSE of F0    RMSE of phone duration
MSD-HMM         52.8 Hz       28 ms
Our Approach    29.7 Hz       17.5 ms

REFERENCES

[1] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, 2000.
[2] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," in Proc. IEEE ICASSP, pp. 229-232, 1999.
[3] T. Nose, J. Yamagishi, and T. Kobayashi, "A style control technique for HMM-based expressive speech synthesis," IEICE Trans. Inf. & Syst., vol. E90-D, no. 9, pp. 1406-1413, Sep. 2007.
[4] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Multi-space probability distribution HMM," IEICE Trans. Inf. & Syst., vol. E85-D, no. 3, pp. 455-464, 2002.
[5] K. Hirose et al., "Corpus-based generation of prosodic features from text based on generation process model," in Proc. Interspeech, pp. 1274-1277, 2007.
[6] H. Fujisaki, S. Ohno, and W. Gu, "Physiological and physical mechanisms for fundamental frequency control in some tone languages and a command-response model for generation of their F0 contours," in Proc. Int'l Symposium on Tonal Aspects of Languages, pp. 61-64, 2004.
[7] W. R. Zemlin, Speech and Hearing Science: Anatomy and Physiology. New Jersey: Prentice Hall, 1968.
[8] B. R. Fink and R. J. Demarest, Laryngeal Biomechanics. Harvard Univ. Press, 1978.
[9] H. Fujisaki and K. Hirose, "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," J. Acoust. Soc. Japan (E), vol. 5, no. 4, pp. 233-242, 1984.
[10] K. Yu, T. Toda, M. Gasic, S. Keizer, F. Mairesse, B. Thomson, and S. Young, "Probabilistic modelling of F0 in unvoiced regions in HMM based speech synthesis," in Proc. ICASSP, Taipei, Taiwan, Apr. 19-24, 2009.
[11] M. Wang, K. Hirose, and N. Minematsu, "Generation of fundamental frequency contours of Mandarin in HMM-based speech synthesis using generation process model," in Proc. Int. Conf. Speech Prosody, CD-ROM, May 2010.
[12] Q. Zhang, F. K. Soong, Y. Qian, Z. Yan, J. Pan, and Y. Yan, "Improved modeling for F0 generation and V/U decision in HMM-based TTS," in Proc. ICASSP, pp. 4606-4609, 2010.
[13] H. Mixdorff, Y. Hu, and G. Chen, "Towards the automatic extraction of Fujisaki model parameters for Mandarin," in Proc. Eurospeech 2003, Geneva, 2003.
[14] J. Zhang and K. Hirose, "Tone nucleus modeling for Chinese lexical tone recognition," Speech Communication, vol. 42, nos. 3-4, pp. 447-466, 2004.
[15] Q. Sun, K. Hirose, W. Gu, and N. Minematsu, "Rule-based generation of phrase components in two-step synthesis of fundamental frequency contours of Mandarin," in Proc. Speech Prosody, pp. 561-564, 2006.
[16] M. Wen, M. Wang, K. Hirose, and N. Minematsu, "Improving Mandarin segmental duration prediction with automatically extracted syntactic information," in Proc. INTERSPEECH, accepted, Sep. 2010.
[17] "HMM-based Speech Synthesis System (HTS)," http://hts.sp.nitech.ac.jp, 2009.
[18] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, W. Kleijn and K. Paliwal, Eds. Elsevier, 1995, pp. 495-518.
