AUTOMATIC PITCH ACCENT DETECTION USING AUTO-CONTEXT WITH ACOUSTIC FEATURES

Junhong Zhao 1,2, Wei-Qiang Zhang 3, Hua Yuan 3, Jia Liu 3 and Shanhong Xia 1

1 State Key Laboratory on Transducing Technology, Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100190, China
3 Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Email: [email protected]

ABSTRACT

In the field of prosody event detection, many local acoustic features have been proposed to represent the prosodic characteristics of a speech unit. Context information, which captures regularities underlying neighboring prosody events, has not, however, been used effectively. The main difficulty in exploiting prosodic context is that long-distance sequential dependencies are hard to capture. To solve this problem, we introduce a new learning approach: auto-context. In this algorithm, a classifier is first trained on local acoustic features; the discriminative probabilities it produces are selected as context information for the next iteration, and a new classifier is trained on the selected context information together with the local acoustic features. By repeatedly feeding the updated probabilities back as context for the next iteration, the algorithm boosts recognition ability until it converges. The merit of this method is that it selects context information flexibly, retaining reliable context information and discarding unreliable information. Experimental results show that the proposed method improves pitch accent detection accuracy by about 1% absolute.

Index Terms — Pitch accent detection, auto-context, support vector machines (SVMs), acoustic, prosody

1. INTRODUCTION

Spontaneous speech articulation consists of two major procedures: expressing the basic phonetic units at the segmental level, and delivering prosodic information at the supra-segmental level.
As an important aspect of prosody, pitch accent can convey the key points of an utterance; it carries information about the speaker's purpose, attitude, and rhetoric. Although pitch accent detection has many applications, it remains a challenging task. Because of the supra-segmental character of pitch accent, context information can provide a clue for recognition when the local acoustic features are not distinctive enough. Efficient use of such information is therefore very important.

Some studies have reported ways to use context information in pitch accent detection. An n-gram de-lexicalized prosodic language model has been used in a maximum a posteriori (MAP) framework [1]; however, owing to its temporal restriction, this method can only exploit preceding context. More recently, conditional random field (CRF) models have become popular in prosody event detection [2, 3, 4, 5]. CRFs model the relations among sequential labels and can retain long-distance dependency information. Although CRFs have achieved better performance than other methods, they usually have a fixed topology and can model only limited neighborhood relations, which restricts their flexibility.

In this paper, we investigate the use of context information for pitch accent detection with the auto-context algorithm, which trains a series of classifiers iteratively. The classification probabilities obtained in the preceding iteration provide possible contextual clues for the current training process. By combining each sample's own classification probabilities and those of its neighbors with the original features, the potential contextual relationships are strengthened step by step.

(This work was supported by the National Natural Science Foundation of China under Grants No. 60931160443, No. 90920302 and No. 61005019, in part by the National High Technology Development Program of China under Grant No. 2008AA040201, and by the National Science and Technology Pillar Program of China under Grant No. 2009BAH41B01.)

2. PITCH ACCENT MODELING USING THE AUTO-CONTEXT ALGORITHM

The basic objective of the auto-context algorithm [6] is to maximize the marginal distribution of the class labels for all the

samples, which can be denoted by

    p(y_i | X) = \int p(y_i, Y_{-i} | X) \, dY_{-i},        (1)

where X = (x_1, ..., x_n) are the input feature vectors, y_i is the class label of the i-th sample, and Y_{-i} denotes the class labels of all samples other than the i-th. To avoid the challenging task of integrating over Y_{-i}, the algorithm approaches the problem as follows. Given a set of training samples together with their ground-truth labels S = {(Y_i, X_i), i = 1, ..., m}, a classifier is first trained on the local features alone in iteration 0. In each following iteration t, the class probability vector P^{t-1}(i) of the i-th sample, produced by the training and test processes of iteration (t-1), is combined with the original features to construct a new training set S_t = {Y_i, (X_i, P^{t-1}(i))}. This vector includes the probabilities of each class for the current sample and for its surrounding neighbors. Using the updated training set, the next model is trained. The algorithm thus produces a sequence of classifiers until it reaches convergence. More details are given in [6].

The choice of base classifier is also important in the auto-context algorithm; in this paper we use the powerful support vector machine (SVM). The detailed steps for exploiting context information in pitch accent detection are as follows:

Step 1: The acoustic features are first used to train an SVM model for the two-way or four-way pitch accent detection task. After the training and test processes, we obtain accented and unaccented probabilities for each syllable in the two-way case, and probabilities of the high, low, down-stepped and unaccented classes in the four-way case.

Step 2: We extend the probabilities of each syllable by combining them with the probabilities of its neighbors.
The probability set P(i) of the i-th sample is constructed as

    P(i) = {C_*(i-n), ..., C_*(i-1), C_*(i), C_*(i+1), ..., C_*(i+n)},        (2)

where C_*(i) represents any collection of classification probabilities for the i-th syllable. In principle, n can take any suitable value; however, since the most cohesive cluster within a sentence is the phrase, and prosodic sequential dependencies are usually confined to a prosodic phrase, a larger n is not necessarily better. The sequence of neighbors can also be selected in any pattern, depending on the dependency relationships of the pitch accent classes among neighboring syllables; here we simply choose consecutive neighbors. After this extension, the training set for the next stage becomes

    S(i) = {y_i, (X_i, P(i)), i = 1, 2, 3, ..., m},        (3)

where y_i is the ground-truth label of the i-th syllable and X_i is its acoustic feature vector. By integrating the acoustic features and the contextual label probabilities, potential connections between the label of the current syllable and both its own acoustic features and the labels of its neighborhood can be explored during training.

Step 3: Using the training set S(i), we train a new SVM model and update P(i). Steps 2 and 3 are repeated until the algorithm converges. Detection performance improves through this iterative process. Figure 1 shows the flow of the auto-context algorithm.

Fig. 1: Flow of the auto-context algorithm. [Flow diagram: in the training stage, Classifier 0 is trained on the feature vectors X of the training set; each subsequent Classifier t is trained on {X, P_{t-1}}, the features augmented with the labeling probabilities of the previous classifier, up to Classifier N. The testing stage passes the test-set feature vectors through the same cascade, each classifier producing the labeling probabilities consumed by the next.]

3. EXPERIMENTS

3.1. Experimental Setup

The corpus used in this work is the Boston University Radio Speech Corpus (BURSC) [7], composed of news stories read by seven radio news announcers. It is hand-annotated with orthographic transcriptions, phonetic alignments, part-of-speech (POS) tags, and prosodic labels based on the ToBI conventions. To balance the amount of training data against detection fineness, we implement pitch accent detection under both two-way and four-way classification conditions. In the two-way case, we simply divide all syllables into accented and unaccented ones based on the presence or absence of the * mark. In the four-way case, we collapse all pitch accent types into four categories: high (H*), low (L*), down-stepped (!H*), and unaccented. The three uncertain categories labeled *, *? and X*? are merged into the high pitch accent class.

We use four speakers' (F1A, F2B, M1B, M2B) speech for training and testing. The total number of paragraphs used in our

Fig. 2: Performance of auto-context on the training and test data sets for two-way pitch accent detection. [Three panels of accuracy (%) versus number of iterations (0-6): (a) training and test accuracy with a context range of 2; (b) training-set accuracy for context ranges 1-5; (c) test-set accuracy for context ranges 1-5.]

work is 363. To ensure reliable results, the test set includes speech from all four speakers, with each speaker's data selected randomly. In all, we choose 63 paragraphs for testing and use the rest for training. The distribution of the data is listed in Table 1.

In this work, we use LIBSVM [8] with the radial basis function (RBF) kernel to implement the SVM classifier. In the multi-class case of four-way pitch accent detection, we use the one-versus-one mode to decompose the multi-class problem into binary classifications. For all experiments, detection accuracy is employed as the performance evaluation measure.
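To make the procedure of Section 2 concrete, the following is a minimal sketch of the auto-context loop in Python. It is an illustration under stated assumptions, not the paper's implementation: it uses scikit-learn's SVC (which wraps LIBSVM) with Platt-scaled probability estimates in place of a direct LIBSVM call, the function names and zero-padding at sequence boundaries are our own choices, and for brevity the whole sample set is treated as one syllable sequence, whereas in practice the context window would be applied per sentence.

```python
import numpy as np
from sklearn.svm import SVC


def extend_with_context(X, P, n=2):
    """Append to each syllable's features the class probabilities of the
    syllable itself and of its n preceding and n following neighbours
    (Eq. (2)); neighbours outside the sequence are zero-padded."""
    m, k = P.shape
    ctx = np.zeros((m, (2 * n + 1) * k))
    for i in range(m):
        for o, j in enumerate(range(i - n, i + n + 1)):
            if 0 <= j < m:
                ctx[i, o * k:(o + 1) * k] = P[j]
    return np.hstack([X, ctx])


def auto_context(X_train, y_train, X_test, n=2, iters=5):
    """Train a cascade of RBF-kernel SVMs, each on the original features
    augmented with the previous classifier's labeling probabilities."""
    # Iteration 0: local acoustic features only.
    clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
    P_tr = clf.predict_proba(X_train)
    P_te = clf.predict_proba(X_test)
    for _ in range(iters):
        # Augment features with the previous iteration's probabilities.
        F_tr = extend_with_context(X_train, P_tr, n)
        F_te = extend_with_context(X_test, P_te, n)
        clf = SVC(kernel="rbf", probability=True).fit(F_tr, y_train)
        P_tr = clf.predict_proba(F_tr)
        P_te = clf.predict_proba(F_te)
    # Columns of predict_proba follow clf.classes_, so map back to labels.
    return clf.classes_[P_te.argmax(axis=1)]
```

In a real run, the number of iterations would be chosen by monitoring convergence of the accuracy, as in the experiments below.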

3.2. The Acoustic Features

We use 24 features in our work, including:

• Frame-averaged features (4 features): loudness, semitone, spectral emphasis, and duration, all extracted following the method detailed in [9]. To reduce the negative impact of speaker and speech-rate differences, they are normalized by the mean value of the whole sentence.

• TILT features (4 features): Our work follows the rise/fall/connection (RFC) intonational model proposed by Taylor [10] and uses the TILT parameter set [11] to describe the variation of the pitch contour. More details are given in [9].

• Difference features (16 features): the forward and backward differences of the frame-averaged basic features and the TILT features mentioned above.

3.3. Experimental Results and Discussion

3.3.1. Performance of Auto-Context

We investigate the performance of the auto-context algorithm through the following experiments. The results for two-way pitch accent detection are shown in Figure 2(a). In this experiment, we select the two syllables immediately before and after the current syllable as neighbors and include the probabilities of all classes of these syllables in P(i). The curves illustrate that the auto-context algorithm is effective at improving pitch accent detection: accuracy improves by about 1% on the test set and 1.7% on the training set. As accuracy gradually rises on the training set, test accuracy improves accordingly until the algorithm converges. This progressively upward trend is seen in all of our experiments.

3.3.2. Context Choice

Auto-context is sensitive to the choice of context, and different contextual conditions can lead to markedly different performance. Context can be selected in two ways: when choosing the context area, we can try different ranges and any combination of syllables within a range; when choosing classification probabilities for each syllable, we can pick those with high reliability or with significant influence on their surroundings. An outstanding advantage of the auto-context algorithm is that any kind of context can be chosen flexibly. To investigate the effect of different context choices, we carry out the following two experiments.

• Experiment 1: Different context areas. We use from 1 to 5 syllables in both the preceding and following areas of the current syllable as context, choosing all syllables within these ranges consecutively. The results of two-way pitch accent detection on the training and test sets are shown in Figure 2(b)

Table 1: The distribution of data used in our experiment.

                         Train                           Test
               F1     F2     M1     M2       F1     F2     M1     M2
#Utterances    59    149     58     34       15     17     14     17
#Sentences    228   1037    306    142       51    139     85     67
#Words       3238  11305   4023   2396      755   1392   1036   1212
#Syllables   5306  18469   6488   3841     1274   2367   1680   2074
#Accents     1794   6274   1976   1235      459    789    588    698

Fig. 3: Performance of different context areas for four-way pitch accent detection on the training data set. [Accuracy (%) versus number of iterations (0-6) for context ranges 1-5.]

and Figure 2(c), respectively. From Figure 2(b) we can see that as the contextual area increases on the two-way task, training-set accuracy keeps rising, achieving a 5% improvement in all. This shows that auto-context can progressively reinforce whatever prosodic relationships exist in the speech data; the larger the range, the greater the algorithm's ability to dig out these relationships. Figure 2(c) shows that while training accuracy keeps increasing with larger context areas, test accuracy first rises and then declines. The best performance, 82.03%, is achieved with a range of 2 or 3 at iteration 4. When the range exceeds 3, performance becomes unstable, which suggests that, because of the diversity of speech structure, common dependency relationships exist only within a small range. We can also see that in each context case, performance starts to drop after reaching a peak, and the larger the context range, the earlier the peak arrives. This demonstrates that when the underlying relationships are reinforced past a certain degree, over-fitting arises, degrading the generalization of the model and causing the decline in performance.

The results of four-way pitch accent detection are given in Figures 3 and 4. These two figures show that the auto-context algorithm behaves much the same on the four-way task as on the two-way task. The difference is that as the task becomes more complicated, the model over-fits more easily: training-set performance already degrades when the window size is 5.

• Experiment 2: Different collections of classification probabilities. We choose different combinations of syllable classification probabilities for the four-way pitch accent task.
In this experiment, we fix the context range at 2 and choose three different combinations of classification probabilities. The results are

Fig. 4: Performance of different context areas for four-way pitch accent detection on the test data set. [Accuracy (%) versus number of iterations (0-6) for context ranges 1-5.]

Table 2: Performance of different combinations of classification probabilities for four-way pitch accent detection.

                P(full)          P(U,H)         Pc(U,H),Po(H,L)
iteration    Train   Test     Train   Test      Train   Test
0            77.46   76.40    77.46   76.40     77.46   76.40
1            78.72   76.80    78.63   76.81     78.36   76.75
2            79.37   76.92    79.19   77.11     78.74   76.97
3            79.92   76.88    79.68   77.21     79.11   77.27
4            80.13   76.86    79.98   77.19     79.35   77.27
5            80.34   76.93    80.23   77.23     79.52   77.32
6            80.50   76.74    80.40   77.07     79.62   77.34

listed in Table 2. In this table, "P(full)" denotes using the probabilities of all pitch accent classes for all syllables within the contextual area; "P(U,H)" means using only the probabilities of the unaccented and high-level accent classes; and "Pc(U,H),Po(H,L)" means choosing the unaccented and high-level accent probabilities for the current syllable and the high-level and low-level accent probabilities for the other contextual syllables. Table 2 shows that using the probabilities of all classes performs worse than the other two modes, perhaps because the classification probabilities of the down-stepped accent are less reliable than the others. Moreover, "Pc(U,H),Po(H,L)" is the best combination of the three. Comparing the results of iteration 6 with iteration 0, we can see that the auto-context algorithm improves accuracy by about 1%.

4. CONCLUSIONS

In this paper, we introduced a novel strategy, auto-context, to assist prosody event detection by taking advantage of context information. The auto-context algorithm can choose contextual information flexibly: irrelevant or unreliable contextual information can be discarded, and the range of the context area can be freely selected. Experiments showed that auto-context is effective for both two-way and four-way pitch accent detection, improving accuracy by about 1%.

5. REFERENCES

[1] S. Ananthakrishnan and S. S. Narayanan, "Automatic prosodic event detection using acoustic, lexical, and syntactic evidence," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 216-228, Jan. 2008.
[2] R. Fernandez and B. Ramabhadran, "Discriminative training and unsupervised adaptation for labeling prosodic events with limited training data," in Proc. Interspeech, Makuhari, 2010, pp. 1429-1432.
[3] M. L. Gregory and Y. Altun, "Using conditional random fields to predict pitch accents in conversational speech," in Proc. ACL, Barcelona, 2004.
[4] G.-A. Levow, "Automatic prosodic labeling with conditional random fields and rich acoustic features," in Proc. IJCNLP, Hyderabad, 2008.
[5] Y. Qian, Z. Wu, X. Z. Ma, and F. Soong, "Automatic prosody prediction and detection with conditional random field models," in Proc. ISCSLP, Taiwan, 2010, pp. 135-138.

[6] Z. Tu and X. Bai, "Auto-context and its application to high-level vision tasks and 3D brain image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 10, pp. 1744-1757, Oct. 2010.
[7] M. Ostendorf, P. J. Price, and S. Shattuck-Hufnagel, "The Boston University radio news corpus," Tech. Rep. ECS-95-001, Boston University, SRI International, MIT, 1995.
[8] R.-E. Fan, P.-H. Chen, and C.-J. Lin, "Working set selection using second order information for training support vector machines," J. Mach. Learn. Res., vol. 6, pp. 1889-1918, Dec. 2005.
[9] J. Zhao, H. Yuan, J. Liu, and S. Xia, "Automatic lexical stress detection using acoustic features for computer-assisted language learning," in Proc. APSIPA ASC, Xi'an, 2011.
[10] P. Taylor, "The rise/fall/connection model of intonation," Speech Commun., vol. 15, no. 1-2, pp. 169-186, Oct. 1994.
[11] P. Taylor, "The tilt intonation model," in Proc. ICSLP, Pittsburgh, 2006.
