Master's Thesis

Multiple User Intent Understanding for Spoken Dialog System

Hongsuck Seo (서 홍 석) Division of Electrical and Computer Engineering (Computer Science and Engineering) Pohang University of Science and Technology

2013

음성대화시스템을 위한 다수의 사용자 발화의도 이해 Multiple User Intent Understanding for Spoken Dialog System

Multiple User Intent Understanding for Spoken Dialog System by

Hongsuck Seo Division of Electrical and Computer Engineering (Computer Science and Engineering) Pohang University of Science and Technology

A thesis submitted to the faculty of the Pohang University of Science and Technology in partial fulfillment of the requirements for the degree of Master of Science in the Division of Electrical and Computer Engineering (Computer Science and Engineering).

Pohang, Korea 12. 31. 2012 Approved by Prof. Gary Geunbae Lee (Signature) Academic Advisor

Multiple User Intent Understanding for Spoken Dialog System

Hongsuck Seo

The undersigned have examined this thesis and hereby certify that it is worthy of acceptance for a master's degree from POSTECH.

12. 13. 2012

Committee Chair

Gary Geunbae Lee (Seal)

Member

Jong-Hyeok Lee (Seal)

Member

Hwanjo Yu (Seal)

MECE 20110505

서홍석, Hongsuck Seo, Multiple User Intent Understanding for Spoken Dialog System, 음성대화시스템을 위한 다수의 사용자 발화의도 이해, Division of Electrical and Computer Engineering (Computer Science and Engineering), 2012, 33 pages, Advisor: Gary Geunbae Lee. Text in English.

ABSTRACT

One of the main components of spoken language understanding is user intent detection. Common classification approaches to this problem cannot detect multiple user intents. In this thesis, a user intent indicator (UII), which is a phrase that represents user intent in an utterance, is introduced. Instead of tackling user intent detection directly, this approach predicts user intent by detecting UIIs and extracting their types. UII detection is formulated as a sequential labeling problem, and a linear-chain conditional random field is used for the sequential labeling model. Based on the advantages and disadvantages of the traditional and proposed models, a second-level predictor is introduced to capitalize on the strengths and compensate for the weaknesses of each model. A set of experiments showed that the UII detection model outperformed the baselines in F1-score when tested not only on utterances with multiple user intents but also on utterances with a single user intent. With the second-level predictor, the proposed method outperformed the baselines in all evaluation metrics. To demonstrate the effectiveness of this approach in reality, human experiments on an integrated dialog system were also performed, and the proposed method showed a shorter average turn length, a higher successful turn rate, and a higher task completion rate than the baseline system.

Contents

1. INTRODUCTION ......................................................... 1
2. RELATED WORK ......................................................... 3
3. MULTIPLE USER INTENT DETECTION ....................................... 5
   3.1. User Intent Indicator ........................................... 5
   3.2. User Intent Indicator Detection ................................. 6
   3.3. Back-off Model .................................................. 8
        3.3.1. Allocating undefined Class ............................... 8
        3.3.2. Failure Back-off ......................................... 8
   3.4. Second-level Predictor .......................................... 9
4. RELATIONS BETWEEN USER INTENTS AND NAMED ENTITIES ................... 12
5. EXPERIMENTS ......................................................... 14
   5.1. Data Sets and Experimental Setup ............................... 14
   5.2. Evaluation Metric .............................................. 15
   5.3. Results and Analysis ........................................... 16
6. CONCLUSIONS ......................................................... 25
요약문 (Summary in Korean) ............................................. 27
REFERENCES ............................................................. 28


List of Figures

Figure 1. Examples of user intent indicators ..................................................................... 5


List of Tables

Table 1. Results of five-fold cross-validation on USUIIs of the K-EPG ... 17
Table 2. Results of five-fold cross-validation on USUIIs of the E-EPG ... 17
Table 3. Results of models trained on the whole set of USUIIs and tested on the whole set of UMUIIs ... 19
Table 4. Results of five-fold cross-validation on UMUIIs (training set for each fold is extended by the whole set of USUIIs) ... 19
Table 5. Results of five-fold cross-validation on the combined set of UMUIIs and USUIIs ... 21
Table 6. Results of five-fold cross-validation for back-off model and second-level predictor model on the combined set of UMUIIs and USUIIs ... 21
Table 7. Results of human experiments on dialog systems integrated with different user intent detection models ... 24


1. INTRODUCTION

The task of spoken language understanding (SLU), an important part of a spoken dialog system (SDS), is to map a spoken utterance into a formally designed semantic frame. In a goal-oriented dialog system, a semantic frame consists of slot/value pairs that can be divided into two main components: a user intent and named entities (NEs). The user intent, also called the subject slot, represents the goal of the user utterance and enables the system to select what action to take. NEs, the arguments of the user intent, are identifiers of an entity. I specify an NE as the domain-specific semantic meaning of a word for SLU.

In earlier studies, the methods of natural language understanding (NLU) were used to resolve SLU. Researchers developed semantic parsers with either a hand-written knowledge base or a statistical model trained with a large corpus [1-6]. A semantic parser generates tree-structured expressions of meaning, and these expressions represent a semantic frame. However, applying NLU methods to SLU is not a simple matter. In contrast to written language, spoken language has poor grammatical structure because of spontaneous speech phenomena such as the restarting of sentences and the repetition of words. Furthermore, output sentences of automatic speech recognition contain many errors for various reasons.

In many recent studies, intent detection has been handled separately from NE recognition to address the problems mentioned above because user intent corresponds to

top-level nodes affected by all of the words and the overall grammatical structure of a sentence [7]. The main approach to intent detection has been to classify a recognized user utterance into one of several previously defined intent classes [8-11]. Various classifiers and techniques have been used to correctly classify user utterances into these predefined user intent classes. However, these classification approaches cannot handle utterances containing multiple user intents due to the definition of the task.

In this thesis, I introduce an intent detection method that can detect multiple user intents in a speech utterance. The basic assumption of the proposed method is that a phrase, rather than the whole sentence, represents a user intent in a user utterance. User intent detection is accomplished by detecting these phrases. In this framework, multiple user intents can be handled naturally: multiple user intents are represented by multiple phrases and can be identified by detecting these multiple phrases. Because the traditional model has some strengths, including grammatical insensitivity and robustness to errors, some combining methods are introduced to incorporate the advantages of each model. I also introduce a linking heuristic rule, which connects the predicted NEs to a predicted user intent based on the relative positions of the NEs and the user intents.

The remainder of this thesis is organized as follows. Section 2 briefly reviews previous studies. I introduce the proposed methods for multiple user intent detection and for linking detected user intents to NEs in Section 3 and Section 4, respectively. In Section 5, the experimental results are presented. Finally, Section 6 concludes this thesis.

2. RELATED WORK

In numerous previous studies, user intent has been detected using classification techniques [8-11]. However, as mentioned in the previous section, these studies do not account for multiple user intents.

Some early statistical semantic parsing approaches included the hidden understanding model [4], the hidden vector state model [5], and the composite model of knowledge and the statistical model [6]. Although these approaches can address multiple user intent detection by building additional annotated instances with appropriate semantic structures that represent multiple user intents, I introduce the proposed method for the following reasons. First, the proposed approach is less sensitive to the grammaticality of sentences and more robust to recognition errors because the proposed method extracts only phrases expressing user intent, while semantic parsers aim to generate a tree-structured representation of the meaning of the overall sentence. Moreover, the proposed method works more robustly when combined with a traditional classification approach using a second-level predictor¹, which incorporates the strengths of a traditional model. Second, there is no previous research on semantic parsing that specifically addresses the handling of multiple user intents. Finally, tree-structured semantic annotation, which is necessary to adapt semantic parsing to every language, is very laborious and expensive.

¹ The second-level predictor is described in Section 3.4.


The most relevant previous work, performed by Tur et al. [11], used predicate/argument pairs with semantic role labeling and built hand-written rules for mapping each pair to a user intent to automatically annotate a corpus. Although the predicate/argument pair concept is similar to the idea of individual phrases indicating user intents, as proposed in this work, the predicate/argument pair is only used for the annotation step and a classifier is trained on the annotated corpus due to the poor performance of semantic role labeling on speech utterances. Therefore, the trained user intent detection model does not solve the multiple user intent problem.


3. MULTIPLE USER INTENT DETECTION

3.1. User Intent Indicator

Each user intent is represented by words in a user utterance. The classification approach captures the occurrence of all words in utterances with each user intent. However, a user intent is represented not by all of the words in an utterance but by the words in a phrase (Fig. 1). I call this phrase the user intent indicator (UII). The UII type is the user intent class that the UII represents. The basic assumption of this study is that UIIs represent user intents in user utterances, and therefore, user intent detection is accomplished by detecting UIIs.

Figure 1. Examples of user intent indicators


UII detection naturally handles multiple user intents. The existence of multiple user intents in an utterance means that multiple UIIs representing those user intents are present in the utterance, and detecting those UIIs results in detecting the user intents.
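The idea can be made concrete with a small sketch. The utterance, token spans, and intent names below are invented for illustration (they are not drawn from the thesis corpora); each UII is represented as a half-open token span annotated with its type:

```python
# A hypothetical utterance carrying two user intents, each expressed by a
# short phrase (UII) rather than by the sentence as a whole.
utterance = "record the nine o'clock news and turn down the volume".split()

# Each UII is a half-open (start, end) token span plus its UII type.
# The intent names are invented for illustration.
uiis = [
    (0, 1, "RECORD_PROGRAM"),    # "record"
    (6, 10, "CONTROL_VOLUME"),   # "turn down the volume"
]

# The user intents of the utterance are simply the types of its UIIs.
intents = [uii_type for _, _, uii_type in uiis]
```

An utterance with a single intent would carry a single span; the representation itself imposes no limit on the number of UIIs.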

3.2. User Intent Indicator Detection

With the introduction of the UII concept, user intent detection can be replaced by UII detection. UII detection aims to identify one or more phrases and their types in an utterance. The proposed method uses a linear-chain conditional random field (CRF) for this task.

A linear-chain CRF defines a conditional probability [12] for a label sequence y = (y_1, ..., y_T), given an input x:

    p(y | x) = (1 / Z(x)) exp( Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )

where Z(x) is a normalization factor that makes the probability of all label sequences sum to one. f_k(y_{t-1}, y_t, x, t) is an arbitrary feature function that encodes any aspect of a state transition y_{t-1} → y_t and the observation x, centered at the current time t. Feature functions are often binary-valued in NLP tasks. The model is parameterized by Λ = {λ_k}, and each parameter λ_k is associated with the feature function f_k. The parameters of a linear-chain CRF are typically estimated by conditional maximum log-likelihood with appropriate regularization.

The input x, in the proposed method, is the word sequence in an utterance. The output labels are drawn from a set of classes constructed by extending each UII type (i.e., each user intent class) by the Beginning/Inside/Outside (B/I/O) labeling technique, as has previously been used in various tasks [13-14]. The label for the first word of a UII is constructed by concatenating the UII type and the symbol indicating the beginning of a UII. For words other than the first word in a UII, the inside symbol is included in the concatenation instead of the beginning symbol. Finally, words outside of the UIIs are labeled with the outside symbol alone, without concatenation.

I formulate UII detection as a sequential labeling problem. The label sequence y* is chosen with the maximum conditional probability inferred by the linear-chain CRF as follows:

    y* = argmax_y p(y | x)

By this model, the label sequence of an utterance with multiple UIIs will contain more than one UII label.
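The B/I/O construction above can be sketched as a pair of helper functions. This is an illustrative sketch only (CRF training and decoding themselves would be handled by a CRF toolkit); the half-open span format is an assumption of the sketch:

```python
def bio_encode(n_tokens, uiis):
    """Turn half-open UII spans [(start, end, type), ...] into a B/I/O
    label sequence of length n_tokens, following the construction above."""
    labels = ["O"] * n_tokens
    for start, end, uii_type in uiis:
        labels[start] = "B-" + uii_type          # first word of the UII
        for i in range(start + 1, end):
            labels[i] = "I-" + uii_type          # remaining words of the UII
    return labels

def bio_decode(labels):
    """Recover UII spans from a predicted label sequence; a sequence with
    several B- labels yields several UIIs, i.e. several user intents."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):     # sentinel closes open spans
        if start is not None and not lab.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if lab.startswith("B-"):
            start = i
    return spans
```

Decoding the CRF's best label sequence with `bio_decode` yields one span per detected UII, which is exactly how multiple intents fall out of a single sequential-labeling pass.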


3.3. Back-off Model

Because the proposed method solves the task as a sequential labeling problem, a policy is needed for output label sequences that contain only outside labels. In this section, I introduce two ways of handling user utterances with only outside labels.

3.3.1. Allocating undefined Class

The simplest way to address this situation is to put the utterance into an undefined user intent class and treat it as an out-of-domain utterance. The absence of any UII in an utterance is likely to indicate that no appropriate user intent exists. However, the same situation may also arise when unseen words or structures in a UII cause the utterance to be misunderstood.

3.3.2. Failure Back-off

To solve the problem mentioned above, one can consider a back-off model approach, i.e., detecting user intent using another model. A typical classification model can serve as a back-off model. When UII detection has predicted no UII, a classifier is run and the result of the classifier is taken as the output label. The features for the back-off classifier are extracted from the overall utterance so that the classifier can capture some information from words outside of the UIIs. In this way, the back-off classifier relieves the data sparseness problem that can occur with UII detection.
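The failure back-off policy amounts to a short control-flow wrapper. The sketch below assumes placeholder model objects with a `predict` method (the actual thesis models are a CRF and an ME classifier, not shown here):

```python
def detect_intents(tokens, uii_model, backoff_classifier):
    """Failure back-off (Section 3.3.2), sketched with placeholder models:
    trust UII detection first, and back off to a whole-utterance classifier
    only when the predicted sequence contains no UII label at all."""
    labels = uii_model.predict(tokens)                   # B/I/O sequence
    uii_types = [lab[2:] for lab in labels if lab.startswith("B-")]
    if uii_types:
        return uii_types                                 # one intent per UII
    # All-outside output: the utterance either has no in-domain intent or
    # contains unseen words/structures, so consult the back-off classifier,
    # whose features are drawn from the whole utterance.
    return [backoff_classifier.predict(tokens)]
```

The undefined-class variant of Section 3.3.1 would simply return `["undefined"]` in place of the final back-off call.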


3.4. Second-level Predictor

The back-off model framework described in the previous section basically takes the results of UII detection. A result of the user intent detection classifier is taken only when UII detection outputs a label sequence containing only outside labels. This policy implies an assumption that UII detection always predicts user intent better than does a classifier, but this assumption is not true. Each model has advantages and disadvantages.

The classification approach is robust to unseen words or structures and to recognition errors. However, it cannot detect multiple user intents, and the features of words that are irrelevant to the user intent are noise that decreases the overall performance of the classification approach. These problems result in low probabilities of the output labels.

In contrast to the classification approach, UII detection can detect multiple UIIs with features from words and structures previously seen during training. Because UII detection aims to detect phrases and the user intents are predicted only from the detected phrases (i.e., UIIs), UII detection can predict user intent accurately even when an utterance contains some irrelevant words. Nonetheless, UII detection is less robust to unseen words or structures, recognition errors, and data sparseness because its predictive features are local to UIIs. These problems also result in low probabilities of the labels in an output sequence.


To capitalize on the strengths and compensate for the weaknesses of each model, I propose to build a second-level predictor that selects which result to take. The second-level predictor is a binary classifier; I allocate the positive class to instances taking the user intent detection (classifier) result and the negative class to instances taking the UII detection result. The second-level predictor takes the predicted results of the two models as its input and returns which result to select. The training data for the second-level predictor is generated by the following process: 1) User intent detection and UII detection are performed on input utterances, and the results are measured using a certain metric. 2) Instances in which both models generate the same score are removed. 3) Each instance is allocated to the class with the higher score (e.g., the positive class for those having higher user intent detection scores and the negative class for the others).

Logistic regression is used for the second-level binary classifier. A logistic regression model outputs a probability p for one of the two classes given the input features, and the probability for the other class given the same input features is 1 − p. The parameters of a logistic regression model can be estimated by maximum likelihood estimation with an appropriate regularization technique.

The features for the second-level classifier are extracted from the output label of the user intent detection model, the output label sequence of the UII detection model, and the probabilities associated with each of these. The features used in the proposed model are as follows:

- The probability (or score) of the label predicted by the user intent detection model
- The probability (or score) of the label sequence² predicted by the UII detection model
- The geometric mean of the probabilities (or scores) of the UIIs³ predicted by UII detection
- The user intent predicted by the user intent detection model (a binary feature)
- The user intent(s) predicted by the UII detection model (a set of binary features)

² The probability of a predicted label sequence is the geometric mean of the probabilities of each label in the sequence.
³ The probability of a predicted UII is the geometric mean of the probabilities of each label in the UII label sequence.
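The feature construction and the logistic decision can be sketched as follows. All function names and the per-label-marginal representation of the CRF scores are assumptions of this sketch, not the thesis implementation; a trained weight vector would come from maximum likelihood estimation as described above:

```python
from math import exp, prod

def slp_features(clf_prob, clf_intent, crf_label_probs, uii_spans,
                 intent_inventory):
    """Build the second-level predictor's feature vector (Section 3.4).
    clf_prob: probability of the classifier's predicted intent.
    crf_label_probs: per-label probabilities of the CRF output sequence.
    uii_spans: [(start, end, type), ...] predicted by UII detection."""
    n = len(crf_label_probs)
    seq_prob = prod(crf_label_probs) ** (1.0 / n)        # geometric mean (footnote 2)
    uii_probs = [
        prod(crf_label_probs[s:e]) ** (1.0 / (e - s))    # per-UII geometric mean (footnote 3)
        for s, e, _ in uii_spans
    ]
    uii_prob = (prod(uii_probs) ** (1.0 / len(uii_probs))) if uii_probs else 0.0
    crf_intents = {t for _, _, t in uii_spans}
    return ([clf_prob, seq_prob, uii_prob]
            + [1.0 if clf_intent == c else 0.0 for c in intent_inventory]    # binary feature
            + [1.0 if c in crf_intents else 0.0 for c in intent_inventory])  # set of binary features

def slp_choose(weights, bias, features):
    """Logistic regression decision: positive class -> take the classifier's
    result, negative class -> take UII detection's result."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return "classifier" if 1.0 / (1.0 + exp(-z)) >= 0.5 else "uii"
```

At run time, both first-level models are applied to each utterance, their outputs are turned into this feature vector, and `slp_choose` picks which model's prediction becomes the system output.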


4. RELATIONS BETWEEN USER INTENTS AND NAMED ENTITIES

In the previous section, I focused on the detection of multiple user intents. However, the links between detected user intents and NEs need to be resolved when multiple intents are detected. I use language-dependent heuristic rules to address this problem. For example, in Korean, a predicate is usually placed at the end of a sentence, so all NEs before a detected UII are linked to that UII. Similarly, in English, a predicate is typically placed before its arguments other than its subject, so all NEs are linked to the closest forward UII⁴. Although these heuristics look simple, their accuracy when applied to the dataset was very high: only one mistake was made in the Korean corpus, and no mistakes were made in the English corpus.

These heuristic rules are only applicable to methods that can predict the positions of UIIs. They are not directly applicable to a multi-labeling approach, for example, because the output of a multi-label classifier is just a set of labels without any position information.

Because some of the arguments of a user intent may be omitted due to the context of the dialog, many dialog managers retain and manage a history of the semantic frames. If the dialog manager fails the task with the detected NEs, it fills NE slots from the previous semantic frame. The omission of some NEs for the second user intent can be handled in the dialog manager by means of this history maintenance. Moreover, prior to this, I add one more step that fills NE slots from the last semantic frame extracted from the same utterance, because multiple user intents can share one or more of the NEs in an utterance, causing NE omission. The reason for using the last semantic frame is that when NEs are omitted by sharing, they tend to appear after all of the predicates that share them.

Finally, the overall linking process is as follows: 1) UIIs and NEs are linked based on their position information, and the success of the resulting semantic frames on the task is checked. 2) If this fails, NE slots are filled from the last semantic frame and the fulfillment of the task is checked again. 3) Finally, when the second step fails, NE slots are filled from the history and the task is performed.

⁴ This is because most of the utterances in a goal-oriented dialog system do not contain a named entity in their subject position. In the English electronic program guide, the subject is most often either 1) omitted to make imperative or propositive sentences or 2) "you" or "I".
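The position-based linking step (step 1 above) can be sketched as a small function. The half-open span format and the reading of "closest forward UII" as the nearest UII preceding an English NE are assumptions of this sketch:

```python
def link_nes(uii_spans, ne_spans, language):
    """Step 1 of the linking process: attach NEs to UIIs by position.
    Spans are half-open (start, end) token ranges; UII spans carry a type.
    Korean ("ko"): the predicate comes last, so an NE links to the nearest
    UII that starts after it.  English ("en"): the predicate precedes its
    arguments, so an NE links to the nearest UII that ends before it."""
    links = {}
    for ne in ne_spans:
        if language == "ko":
            cands = [u for u in uii_spans if u[0] >= ne[1]]
            links[ne] = min(cands, key=lambda u: u[0]) if cands else None
        else:
            cands = [u for u in uii_spans if u[1] <= ne[0]]
            links[ne] = max(cands, key=lambda u: u[1]) if cands else None
    return links
```

An NE that finds no candidate UII maps to `None`; in the full pipeline those slots would be filled from the last semantic frame or from the dialog history, per steps 2 and 3.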


5. EXPERIMENTS

5.1. Data Sets and Experimental Setup

I evaluated the method described here on the electronic program guide (EPG) domain corpus (the television information access interface domain). The EPG corpus used here consisted of two sub-corpora: the Korean EPG (K-EPG) and the English EPG (E-EPG). In the K-EPG corpus, the number of user intents was 36, including the undefined user intent, and the number of NEs was 20. The K-EPG consisted of 6542 Korean utterances annotated with UIIs and NEs. Among the utterances in the K-EPG, 6401 were utterances with a single UII (USUII) and 141 were utterances with multiple UIIs (UMUII). In the E-EPG corpus, the number of user intents was 35, including the undefined user intent, and the number of NEs was 22. The E-EPG consisted of 1072 English utterances (996 USUIIs and 76 UMUIIs) annotated with UIIs and NEs. Utterances in the E-EPG had relatively few "noisy" words, which are not related to user intent, whereas K-EPG utterances had many noisy words.

I compared the proposed model to two baseline models: a maximum entropy⁵ (ME) model and a multi-label ME (MME) model. The ME model is a multiclass classifier providing the least biased estimate based on the given features [15]. The class with the largest probability is chosen as the output label. The MME model is a multi-label classifier, which predicts a set of labels. The MME model was implemented as a set of independent binary ME models trained for each user intent class; this multiple-binary-model setup is a common method for multi-label classification. The positive examples for each binary model were the utterances with the corresponding user intent class, and all other utterances were negative examples. If the probability for a class was larger than a threshold, the class was added to the output label set. In the experiments, the thresholds for all binary models were set to 0.5. In the results, CRF denotes the proposed method using UII detection. Bag-of-words features were used for the ME and MME models, and local word features with a sliding window of width 3, centered at the target word, were used for the CRF model.

⁵ Maximum entropy classification is also known as multinomial logistic regression.
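The difference between the two feature representations can be sketched as follows; the padding symbol and feature-name templates are assumptions of this sketch rather than the thesis' exact feature format:

```python
def bow_features(tokens):
    """Bag-of-words features (used for the ME/MME baselines): every word in
    the utterance fires one binary feature, regardless of position."""
    return {"w=" + w for w in tokens}

def window_features(tokens, t, width=3):
    """Local word features for the CRF: a sliding window of the given width
    centered at target position t, as used in the experiments.  Positions
    outside the utterance are padded with an assumed <PAD> symbol."""
    feats = {}
    half = width // 2
    for off in range(-half, half + 1):
        i = t + off
        w = tokens[i] if 0 <= i < len(tokens) else "<PAD>"
        feats[f"w[{off}]={w}"] = 1.0
    return feats
```

Note how the bag-of-words set is position-blind (any irrelevant word in the utterance contributes a feature), while the window features see only the local context of each token, which is exactly the trade-off analyzed in Section 5.3.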

5.2. Evaluation Metric

The models for user intent detection are directly evaluated in terms of precision, recall, and the harmonic mean of these two measures, the F1-score:

    F1 = 2 · (precision · recall) / (precision + recall)


To evaluate dialog systems, I used three measures: average turn length (ATL), successful turn rate (STR) and task completion rate (TCR). ATL is the average number of utterances given by users to complete a given set of tasks. STR is the ratio of the number of successfully completed utterances to the total number of utterances. TCR is the number of tasks completed by a dialog divided by the total number of given tasks.
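The intent-level precision/recall/F1 computation can be sketched as below. Treating each utterance's reference and predicted intents as multisets and micro-averaging over the corpus is an assumption of this sketch; the thesis does not spell out the aggregation:

```python
from collections import Counter

def prf1(gold_intents, pred_intents):
    """Corpus-level precision, recall, and F1 over user intents.
    Each element of the two lists holds the intents of one utterance
    (possibly several, for UMUIIs)."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_intents, pred_intents):
        g, p = Counter(gold), Counter(pred)
        hit = sum((g & p).values())      # intents both predicted and in the reference
        tp += hit
        fp += sum(p.values()) - hit      # predicted but not in the reference
        fn += sum(g.values()) - hit      # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this scheme a single-label classifier applied to a two-intent utterance can score at most 0.5 recall on that utterance, which is why the ME baselines' recall stays below 0.5 on the UMUII sets in Section 5.3.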

5.3. Results and Analysis

In Tables 1–6, the symbol +NE indicates that the NE type symbols are substituted for the words of the NEs (for ME and MME, both words and NE type symbols are added to the feature set).

I first performed five-fold cross-validation on the USUIIs of both the K-EPG and the E-EPG using various methods (Tables 1 and 2). The first observation was that the common approach of multi-label classification showed very low F1-scores for both corpora. The F1-scores of the MME models were lowest, and even adding NE features did not improve these models. For the K-EPG, the CRF model outperformed the ME model in all measures regardless of whether NE features were added. In contrast, the ME model performed better than the CRF model in all measures for the E-EPG. I believe these results can be explained by the following three observations. First, the K-EPG contains more irrelevant words than does the E-EPG. Because bag-of-words features are used for the ME model, words in an

Table 1. Results of five-fold cross-validation on USUIIs of the K-EPG

  Model     Precision   Recall    F1-score
  MME       0.7436      0.8207    0.7802
  MME+NE    0.7487      0.8142    0.7801
  ME        0.8502      0.8502    0.8502
  ME+NE     0.8527      0.8527    0.8527
  CRF       0.8589      0.8605    0.8597
  CRF+NE    0.8637      0.8661    0.8649

Table 2. Results of five-fold cross-validation on USUIIs of the E-EPG

  Model     Precision   Recall    F1-score
  MME       0.7004      0.7159    0.7080
  MME+NE    0.7054      0.7189    0.7121
  ME        0.8665      0.8665    0.8665
  ME+NE     0.8715      0.8715    0.8715
  CRF       0.7629      0.7721    0.7675
  CRF+NE    0.8294      0.8394    0.8343


utterance that are irrelevant to a user intent represent noise, making the performance poor. The low rate of irrelevant words in the E-EPG could raise the performance of the ME model. Second, local structures of words are more important in Korean than in English. In Korean, which is an agglutinative language, eojeols play the basic syntactic role in a sentence, and an eojeol is composed of a sequence of words, whereas in English, syntactic roles are encoded in the words themselves. In bag-of-words feature space, the local context cannot be captured, while word-encoded information is captured. Finally, the E-EPG is smaller than the K-EPG. The proposed model needs enough data to generalize the local context of UIIs. Moreover, the syntactic information encoded in English words causes a diversity of surface forms, which gives rise to the need for a bigger dataset. The final observation is that adding NE features improved the overall performance of both the ME and CRF models.

I next trained the models on the whole set of USUIIs and tested them on the whole set of UMUIIs (Table 3). The MME+NE model showed the lowest performance, as with the USUIIs of both corpora. The recall of the ME+NE models is below 0.5 because each UMUII has more than one user intent and the ME+NE model predicts only one user intent. The CRF+NE model showed the best performance in all metrics, although no utterance in the training set contained multiple UIIs. For the E-EPG, the CRF+NE model outperformed the ME model


Table 3. Results of models trained on the whole set of USUIIs and tested on the whole set of UMUIIs

  Dataset   Model     Precision   Recall    F1-score
  K-EPG     MME+NE    0.6037      0.3498    0.4430
            ME+NE     0.8511      0.4240    0.5660
            CRF+NE    0.8629      0.5355    0.6608
  E-EPG     MME+NE    0.6333      0.3775    0.4730
            ME+NE     0.8421      0.4238    0.5639
            CRF+NE    0.8700      0.5762    0.6932

Table 4. Results of five-fold cross-validation on UMUIIs (training set for each fold is extended by the whole set of USUIIs)

  Dataset   Model     Precision   Recall    F1-score
  K-EPG     MME+NE    0.6443      0.4417    0.5241
            ME+NE     0.8511      0.4240    0.5660
            CRF+NE    0.8454      0.6206    0.7157
  E-EPG     MME+NE    0.7981      0.5497    0.6510
            ME+NE     0.9211      0.4636    0.6167
            CRF+NE    0.9307      0.7520    0.8319


(in contrast to the previous results). This is because, in the ME models, words related to each user intent in an utterance increase the ambiguity of classification.

To capture the effect of UMUII inclusion in the training set, five-fold cross-validation on the UMUIIs was performed (Table 4). In every test for each fold, the models were trained on the combination of the whole set of USUIIs and the training set of UMUIIs. Although the MME+NE model showed increased performance in all metrics when the UMUII training set was included, its precision was still low compared to that of the other models. Because the ME+NE model is a single-label classifier, I needed to decide how to utilize the UMUII training set for this model. I duplicated each utterance in the UMUII set as many times as the number of UIIs in the utterance and attached one reference label to each duplicated utterance. In the Korean dataset, the performance of the ME+NE model stayed the same as when it was trained without UMUIIs. In the English dataset, the precision and the recall of the ME+NE model improved substantially. This indicates that the characteristics of USUIIs and UMUIIs are quite different. In spite of the performance gain, the fundamental limitation of classification is not resolved, so the recall of the ME+NE model stayed below 0.5. In the CRF model, however, the inclusion of UMUIIs in the training set improved the F1-score by approximately 5% for the K-EPG and over 15% for the E-EPG, with a large increase in recall and a small decrease in precision.


Table 5. Results of five-fold cross-validation on the combined set of UMUIIs and USUIIs

  Dataset   Model     Precision   Recall    F1-score
  K-EPG     MME+NE    0.7373      0.8003    0.7675
            ME+NE     0.8517      0.8336    0.8426
            CRF+NE    0.8659      0.8604    0.8631
  E-EPG     MME+NE    0.6804      0.6643    0.6723
            ME+NE     0.8694      0.8126    0.8400
            CRF+NE    0.8270      0.8213    0.8241

Table 6. Results of five-fold cross-validation for back-off model and second-level predictor model on the combined set of UMUIIs and USUIIs

  Dataset   Model      Precision   Recall    F1-score
  K-EPG     ME+NE      0.8517      0.8336    0.8426
            CRF+NE     0.8659      0.8604    0.8631
            Back-off   0.8690      0.8636    0.8663
            SLP        0.8824      0.8737    0.8781
  E-EPG     ME+NE      0.8694      0.8126    0.8400
            CRF+NE     0.8270      0.8213    0.8241
            Back-off   0.8569      0.8509    0.8539
            SLP        0.8788      0.8596    0.8691


I finally performed experiments on the combined data sets of USUIIs and UMUIIs (Table 5). The results of five-fold cross-validation showed that the proposed method outperformed both of the baseline models on the K-EPG but yielded a poorer F1-score than did the ME+NE model on the E-EPG. This results from the low performance of the method on the USUIIs of the E-EPG. Although the CRF model outperformed the ME model in all experiments on the K-EPG, on the E-EPG the CRF model outperformed the ME model only on UMUIIs and showed poorer results on the combined set of USUIIs and UMUIIs, which represents the real-world data. Analyzing the results for the E-EPG corpus, we can observe the advantages and disadvantages of each model as described in Section 3.4.

To capture the effect of a back-off model and a second-level predictor (SLP) model⁶, I performed five-fold cross-validation on the combined set of USUIIs and UMUIIs of each corpus (Table 6). Although the CRF+NE model with the ME+NE model as the back-off model yielded a higher F1-score than either the CRF+NE model or the ME+NE model alone, the precision of the model with back-off was lower than that of either the CRF+NE model or the ME+NE model for both the K-EPG and the E-EPG. This implies that the CRF+NE model generates better predictions in some instances and the ME+NE model does so in others. Though back-off predictions are mostly right because the

6

To train the SLP model, relative F1 scores are used to determine class assignments.

- 22 -

targets of the back-off classification are unpredicted instances from the CRF+NE model, it is still possible that those instances belong to the undefined class. Moreover, the ME+NE model generates better predictions not only for unpredicted instances from the CRF+NE model but also for predicted instances from the CRF+NE model. The results of the SLP proved this phenomenon by generating the highest scores in all metrics for both corpora. Finally, to investigate the effectiveness of the proposed SLU model, I integrated a rule-based dialog manager with either the baseline model (ME+NE) or the proposed model (CRF+NE and ME+NE with a second-level predictor), building a dialog system, and conducted human experiments (Table 7). I gave ten subjects five sets of tasks and asked the subjects to use the dialog systems to achieve the given tasks. The average numbers of tasks in a set were 3.6 for the Korean experiments and 3.8 for the English experiments. To focus on multiple user intent detection, I told the subjects that the utterances containing multiple user intents are preferred. The results showed that the ATL for the proposed model is shorter than the ATL for the baseline model for both Korean and English. The proposed model also showed higher STR and TCR than the baseline model as expected based on the results of the previous SLU experiments.
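The two combination schemes compared above can be sketched as follows. This is an illustrative outline, not the thesis implementation: `crf_intents` and `me_intents` stand in for the per-utterance intent lists produced by the CRF+NE and ME+NE models, and `slp` for a trained binary chooser.

```python
def combine_backoff(crf_intents, me_intents):
    """Back-off: keep the CRF+NE output; fall back to the ME+NE
    classifier only when the CRF detects no UII at all."""
    return crf_intents if crf_intents else me_intents

def combine_slp(slp, features, crf_intents, me_intents):
    """Second-level predictor: a binary classifier decides, per
    utterance, which model's output to keep (trained with relative
    F1-scores determining the class labels)."""
    return crf_intents if slp(features) == "CRF" else me_intents

# Toy usage: the CRF found nothing, so back-off uses the ME output;
# the SLP may override the CRF even when the CRF did predict something.
print(combine_backoff([], ["search_program"]))   # ['search_program']
always_me = lambda feats: "ME"
print(combine_slp(always_me, {}, ["set_channel"], ["search_program"]))  # ['search_program']
```

The sketch makes the asymmetry visible: back-off can only ever correct empty CRF output, whereas the SLP can also replace a non-empty but inferior CRF prediction.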


Table 7. Results of human experiments on dialog systems integrated with different user intent detection models

Dataset   SLU Model   ATL    STR      TCR
K-EPG     Baseline    4.43   0.7097   0.8556
K-EPG     Proposed    3.42   0.8129   0.8722
E-EPG     Baseline    5.40   0.5185   0.7368
E-EPG     Proposed    3.68   0.6739   0.8158


6. CONCLUSIONS

I have presented a new method for multiple user intent detection in SLU and introduced the concept of the UII, a phrase that represents a user intent in an utterance. The proposed method detects user intents by detecting UIIs. Because UII detection is treated as a sequential labeling problem and solved with a linear-chain CRF model, multiple user intents can be detected naturally. I have introduced the use of a back-off model to handle instances that are not predicted by the CRF model. I have also noted that the two approaches of (1) solving user intent detection directly with a classifier and (2) solving it indirectly by detecting UIIs each have advantages and disadvantages, and therefore a back-off model is not the best way to combine the two models. To capitalize on the strengths and compensate for the weaknesses of each model, I have also introduced a second-level predictor that chooses which model's results to take.

In a series of experiments, I first compared the UII detection model to two baseline models: an ME model and an MME model. The UII detection model yielded different results depending on the dataset used. In experiments testing the effectiveness of a back-off model and a second-level predictor, the second-level predictor model outperformed the other models in all evaluation metrics. I finally performed human experiments and observed that the dialog system with the proposed user intent detection model outperformed the baseline system, as expected from the results of the previous experiments.
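Treating UII detection as sequential labeling means the model emits one BIO-style tag per word, from which multiple UIIs and their intent types can be read off in a single pass. A minimal decoding sketch; the tag names and tokenization below are illustrative assumptions, not the thesis's actual label set:

```python
def bio_decode(tokens, tags):
    """Recover (phrase, intent_type) UIIs from per-token BIO tags
    such as 'B-search_program', 'I-search_program', and 'O'."""
    uiis, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:                              # close previous UII
                uiis.append((" ".join(cur), cur_type))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur.append(tok)                      # extend current UII
        else:                                    # 'O' or invalid 'I-'
            if cur:
                uiis.append((" ".join(cur), cur_type))
            cur, cur_type = [], None
    if cur:                                      # flush trailing UII
        uiis.append((" ".join(cur), cur_type))
    return uiis

tokens = ["find", "a", "movie", "and", "record", "it"]
tags = ["B-search_program", "I-search_program", "I-search_program",
        "O", "B-record_program", "O"]
print(bio_decode(tokens, tags))
# [('find a movie', 'search_program'), ('record', 'record_program')]
```

Because each "B-" tag opens a new phrase, an utterance with several intents simply yields several decoded UIIs, which is what lets the sequential-labeling formulation handle multiple user intents naturally.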


요 약 문 사용자 발화의도 이해는 음성언어이해 모듈의 주요한 구성요소이다. 분류기를 활용하여 사용자 발화의도를 예측 하는 방법이 널리 사용되고 있으나 사용자 발화에 다수의 발화의도가 포함된 경우에 대한 처리가 불가능하다는 문제를 가지고 있다. 이 논문은 사용자 발화로부터 다수의 발화의도를 이해하는 방법에 관한 연구를 다룬다. 먼저 사용자 발화 안에서 발화의도를 나타내는 구인 사용자 발화의도 지시자를 정의하고 사용자 발화의도 이해 문제를 사용자 발화의도 지시자 예측문제로 변경하여 접근하게 된다. 사용자의 발화의도는 사용자 발화의도 지시자에 의해서 발화 내에 실체화 되기 때문에 다수의 사용자 발화의도 지시자와 그 지시자가 어떤 발화의도를 실체화 하고 있는지를 예측함으로써 사용자 발화의도를 예측한다. 본 연구에서는 사용자 발화의도 지시자 예측 문제를 sequential labeling 문제로 접근하고, linear-chain conditional random field 를 사용하였다. 또 일반적인 분류 방식과 사용자 발화의도 지시자 예측 문제는 서로 다른 강점과 약점을 가지기 때문에 서로를 보완하기 위해 2 차 분류기로 두 결과 중 어떤 결과를 선택할지 결정하여 보다 높은 성능을 얻을 수 있었다. 일련의 실험을 통해서 이 논문에서 제안하는 방법을 통하여 기존의 방법보다 높은 성능을 얻을 수 있는 것을 보였다.


REFERENCES

1. W. Ward and S. Issar, “Recent improvements in the CMU spoken language understanding system,” in Proceedings of the Workshop on Human Language Technology, pp. 213-216, March 1994.
2. S. Seneff, “TINA: a natural language system for spoken language applications,” Computational Linguistics, Vol. 18, No. 1, MIT Press, pp. 61-86, 1992.
3. J. Dowding, J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran, “Gemini: a natural language system for spoken language understanding,” in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Ohio, pp. 54-61, 1993.
4. S. Miller, R. Bobrow, R. Ingria, and R. Schwartz, “Hidden understanding models of natural language,” in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 25-32, 1994.
5. Y. He and S. Young, “Hidden vector state model for hierarchical semantic parsing,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China, pp. 268-271, 2003.
6. Y.-Y. Wang and A. Acero, “Combination of CFG and N-gram modeling in semantic grammar learning,” in Proceedings of Eurospeech, Geneva, Switzerland, pp. 1229-1232, 2003.
7. G. Tur and R. De Mori, Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, Wiley, 2011.
8. C. Lee, J. Eun, M. Jeong, G. G. Lee, Y. G. Hwang, and M. G. Jang, “A multi-strategic concept-spotting approach for robust understanding of spoken Korean,” ETRI Journal, Vol. 29, No. 2, pp. 179-188, 2007.
9. A. Celikyilmaz, D. Hakkani-Tur, G. Tur, A. Fidler, and D. Hillard, “Exploiting distance based similarity in topic models for user intent detection,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 425-430, 2011.
10. M. Jeong and G. G. Lee, “Jointly predicting dialog act and named entity for spoken language understanding,” in Proceedings of the IEEE/ACL 2006 Workshop on Spoken Language Technology, pp. 66-69, Dec. 2006.
11. G. Tur, D. Hakkani-Tur, and A. Chotimongkol, “Semi-supervised learning for spoken language understanding semantic role labeling,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 232-237, 2005.
12. J. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: probabilistic models for segmenting and labeling sequence data,” in Proceedings of the 18th International Conference on Machine Learning, pp. 282-289, 2001.
13. L. A. Ramshaw and M. P. Marcus, “Text chunking using transformation-based learning,” in Proceedings of the Third ACL Workshop on Very Large Corpora, pp. 82-94, 1995.
14. M. Jeong and G. G. Lee, “Exploiting non-local features for spoken language understanding,” in Proceedings of the COLING/ACL Main Conference Poster Sessions, pp. 412-419, 2006.
15. E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, Vol. 106, No. 4, pp. 620-630, 1957.


Acknowledgements (감사의 글)

먼저 저를 이 곳까지 이끄신 하나님 아버지께 감사 드립니다.

실업계고교를 졸업하고 공장에서 생산직으로 일하며 살아가려던 제가 어느새 대학을 졸업하고 이제는 석사학위를 수여 받게 되었습니다. 어느 누가 이런 인생의 반전을 생각해 낼 수 있었을지, 되돌아보면 제 삶의 모든 것이 하나님의 은혜이며 그 계획 아래 있다는 것을 새삼 느낍니다. 어려움이 있을 때마다 예상치 못한 방법으로 저를 이끄시고 위로해 주신 은혜를 기억하며 앞으로의 삶도 주님께 내려놓고 주님 뜻에 따라 살아가길 원합니다.

많이 부족한 저를 연구실 멤버로 받아 주시고 2 년 동안 지도해 주신 이근배 교수님께 감사 드립니다. 또한 수업을 통하여 많은 가르침을 주신 학위 논문 심사위원 이종혁 교수님과 바쁘신 중에도 흔쾌히 학위 논문 심사를 맡아주신 유환조 교수님께도 감사 드립니다.

멀리 창원에서 그리고 포항에서 공부하는 저를 언제나 응원해주시고 반겨주신 동광원 가족 분들께 감사 드립니다. 정말 다양한 방법으로 저를 도와주시고 큰 사랑 베푸신 지준홍 이사장님과 임혜령 원장님, 두 분이 베푸신 큰 사랑을 본 받아 저도 사랑을 베풀 줄 아는 사람이 되겠습니다. 어릴 적부터 제 문제를 상담해 주시고 조언을 아끼지 않으신 이옥재 국장님, 저도 저의 경험과 지식을 남을 위해 사용하는 사람이 되겠습니다. 제 신앙의 처음을 열어주신 지인숙 선생님, 저도 당신처럼 삶을 통해 주님을 드러낼 수 있는 그런 사람이 되도록 기도하고 노력하겠습니다. 항상 뒤에서 응원하고 챙겨준 김소윤, 류윤임, 문소연 선생님 그리고 지금은 수원에서 보지 못하는 많은 선생님들 고맙습니다. 여러분의 응원이 있었기에 지금의 제가 있습니다. 나이 차이가 많이 나서 불편할 텐데도 찾아 갈 때마다 반겨준 동광원 베드로방 친구들에게도 고마움을 표합니다.

포항에서의 2 년을 함께해 온 연구실 식구들 감사합니다. 오랫동안 함께하진 못했지만 저의 학문적 롤 모델 되신 민우형과 성진이형, 항상 밝은 분위기를 이끌어내신 청재형, 처음 포항에 왔을 때 멘토로서 친절히 가르쳐 주던 진식이형, 학업 뿐만 아니라 모든 일에 있어서 같이 고민해 준 석환이형, 책을 많이 읽어 아는 게 많았던 경덕이형, 묵묵히 맡은 일을 해내던 종훈이형, 적절한 타이밍에 개그 공격하는 동현이형, 객관적인 시각으로 모든 일을 평가할 수 있던 형종이형, 긍정과 순수가 무엇인지 보여준 규송이형과 쿨함의 끝을 보여준 준휘, 맛 집이 궁금할 때 늘 답을 주던 인재형, Visual Studio 마스터 세천이형, 힘든 일도 웃어 넘길 줄 아는 용희, 즐기고 노력할 줄 아는 성한이, 힘든 상황에서도 객관적으로 상황을 바라볼 줄 아는 지수, 사람들을 잘 파악하고 적절히 대할 줄 아는 상도, 연구실의 중심 상준이, 그리고 출장 갈 때마다 같은 걸 또 물어도 이해해주시는 이해심 넓은 영선누나까지 연구실에서 함께했던 모두에게 감사 드립니다.

제 포항생활을 더 윤택하게 해준 동아리 보우시즈의 홍중, 창호, 종선, 경훈이와 과제연구 학생으로 인연을 맺게 된 드러머 재연이, 연구실도 전공도 다르지만 대학원에 같이 입학한 동기이기에 친해 질 수 있었던 혜지, 자주 신세 졌던 기타리스트 경훈이, 그 외 저와 인연을 맺은 모든 분들께도 감사의 말을 전합니다.

마지막으로 이곳 포항 생활의 기초가 되어준 포스텍교회 식구들 감사 드립니다. 이곳 포항에 내려와 교회를 개척하시고 매일 같이 영의 양식을 채워주신 강신철 목사님과 최수진 사모님, 웃음으로 리드하는 찬양팀 리더 병남이형과 못 하는 것도 없고 얼굴도 마음도 예쁜 지은이, 뭐든지 쉽게 쉽게 배워버리는 찬오, 하나님의 선물 내 동생 정은이, 한 사람 한 사람 챙겨주는 속 깊은 민경이, 함께 할 수 있었던 기간이 짧아 너무 너무 아쉬운 룸메이트 동진이, 머뭇거리는 찬양팀의 추진력이 되어주는 진욱이, 남을 위해 희생할 줄 아는 미애, 찬양팀의 필수멤버 정리 왕 용진이, 신앙생활의 멘토 성환이형, 연구실도 찬양팀도 함께한 절대음감 지수, 함께 있는 사람들을 언제나 즐겁게 하는 정임이, 웃음 보증 수표 명희누나, 친해지고 얼마 안돼서 미국으로 떠나버린 태연이형, 허락된 달란트 교회에서 사용하는 모습이 보기 좋은 창일이와 그 외의 많은 청년들과 교수님들과 사모님들을 포함한 모든 포스텍 교회 성도 여러분, 제가 얼마나 소중한 존재인지 깨닫게 해주셔서 감사 드립니다. 여러분이 제게 가르쳐준 그 하나님의 사랑, 저도 이제 그 사랑을 전하며 살고 싶습니다.

무엇보다도 이 모든 것을 제 삶에 허락하시어 지금의 저를 있게 하신 하나님 아버지께 다시 한번 감사와 영광을 올립니다.


Curriculum Vitae

Name: Hongsuck Seo

Education
Mar. 2011 – Feb. 2013: M.S. Candidate in Computer Science and Engineering, Pohang University of Science and Technology (POSTECH) (GPA: 4.06/4.3)
Mar. 2006 – Feb. 2011: B.S. in Computer Engineering, Changwon National University (GPA: 4.43/4.5, Summa Cum Laude)

Research Experiences
Internship Program, Jun. 2010 – Aug. 2010: Multi-agent Systems Lab., Gwangju Institute of Science and Technology. Developed an ontology for a cloud service search engine.
Research Assistant, Sep. 2006 – Aug. 2008: Natural Language Processing Lab., Changwon National University. Developed a resource management system that helps researchers manage and share their papers with their coworkers.

Selected Research Projects
2012: Language Analyzer for Japanese TTS (Samsung Electronics)
2011 – 2012: Development of Intelligent Robots for English Conversation Tutoring (Ministry of Knowledge and Economy)
2011 – 2012: Research Laboratory for Natural Language-based Immersive English Tutoring System (Ministry of Education, Science and Technology)
2011: Development of Dialog-based Speech Interfaces for Mobile Platforms (Ministry of Knowledge and Economy)
2011: Development of Language Analyzers for Korean/English/Spanish TTS (Samsung Electronics)
2011: Korean Prosody Modeling for HMM-based Conversational Speech Synthesis (Ministry of Education, Science and Technology)

Publications

International Conferences
- A meta-learning approach to grammatical error correction. Hongsuck Seo, Jonghoon Lee, Seokhwan Kim, Kyusong Lee, Sechun Kang, Gary Geunbae Lee. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju Island, Korea, July 2012.
- Grammatical error annotation for Korean learners of spoken English. Hongsuck Seo, Kyusong Lee, Gary Geunbae Lee, Soo-ok Kweon. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, May 2012.
- Generating grammar questions using corpus data in L2 learning. Kyusong Lee, Soo-ok Kweon, Hongsuck Seo, Gary Geunbae Lee. In Proceedings of the 2012 IEEE Workshop on Spoken Language Technology (SLT 2012), Miami, December 2012.

Domestic Conferences
- Automatic assessment and feedback system for spoken English. Jonghoon Lee, Jinsik Lee, Hongsuck Seo, Gary Geunbae Lee, Seonhee Kim, Minhwa Jeong, Hoyoung Lee. In Proceedings of the HCI Society of Korea – Workshop on Educational Games, Pyeongchang, Jan 2012.
- English pronunciation feedback system without given text. Jonghoon Lee, Jinsik Lee, Hongsuck Seo, Sechun Kang, Gary Geunbae Lee, Seonhee Kim, Minhwa Jeong. In Proceedings of Korean Society of Speech Sciences (KSSS 2011 Fall), Seoul, Dec 2011.
- A sentence stress feedback system for English proficiency improvement. Jinsik Lee, Jonghoon Lee, Hongsuck Seo, Sechun Kang, Gary Geunbae Lee, Ho-young Lee. In Proceedings of Korean Society of Speech Sciences (KSSS 2011 Fall), Seoul, Dec 2011.
- Grammatical error detection system for spoken and written English of L2 learners. Hongsuck Seo, Sungjin Lee, Jinsik Lee, Jonghoon Lee, Gary Geunbae Lee. In Proceedings of the 23rd Annual Conference on Human and Cognitive Language Technology (HCLT 2011), Seoul, Oct 2011.
- POMY: Dialogue system for immersive English education. Hyungjong Noh, Hongsuck Seo, Kyusong Lee, Sungjin Lee, Gary Geunbae Lee. In Proceedings of Korea Computer Congress (KCC 2011), Gyeongju, Jun 2011.
- Language analyzer for Spanish TTS. Hongsuck Seo, Jinsik Lee, Jonghoon Lee, Cong Liu, Gary Geunbae Lee. In Proceedings of Korean Society of Speech Sciences (KSSS 2011 Spring), Iksan, Jun 2011.
- Rule-based implementation of English sentence prediction. Cong Liu, Jonghoon Lee, Jinsik Lee, Hongsuck Seo, Gary Geunbae Lee, Ho-young Lee. In Proceedings of Korean Society of Speech Sciences (KSSS 2011 Spring), Iksan, Jun 2011.

Patents

Domestic Patents
- Method and apparatus of stress feedback for foreign language training. Hongsuck Seo, Jinsik Lee, Jonghoon Lee, Gary Geunbae Lee.
- Method and apparatus of building a phrase-break corpus from labeled transcripts by multiple annotators. Hongsuck Seo, Jinsik Lee, Jonghoon Lee, Gary Geunbae Lee.
- Method and apparatus of machine learning for generation of multiple answers. Hongsuck Seo, Jonghoon Lee, Jinsik Lee, Gary Geunbae Lee.
- Method and apparatus of generating pronunciation variation and detecting pronunciation errors. Hongsuck Seo, Jonghoon Lee, Jinsik Lee, Gary Geunbae Lee.

본 학위논문 내용에 관하여 학술 교육 목적으로 사용할 모든 권리를 포항공과대학교에 위임함
