Insights on Transferability of Dialog-Act Cue-Phrases Across Communication Domains, Modality and Semantically Similar Dialog-Acts

Ashish Sureka and Atul Goyal
{ashish, atul08015}@iiitd.ac.in
Indraprastha Institute of Information Technology, Delhi (IIIT-D)
New Delhi, India

Abstract

The paper presents empirical results, illustrative examples and insights on the topic of domain transferability of dialog-act cue phrases. We analyze four publicly available dialog-act tagged corpora (Switchboard, ICSI MRDA, BC3 Email and NPS Chat) belonging to different communication domains and modalities and study the extent to which corpus-derived n-gram word cue-phrases from one domain can be transferred or ported as discriminatory features for dialog act recognition in other domains. The transferability of cue-phrases is studied with respect to cross-domain settings (discussion on general topics, official meetings), cross-modality settings (transcribed phone conversation, email, online chat) and cross-dialog-act settings (questions and action directives mapping to requests). We describe a dialog act classifier for categorizing sentences in the email domain based on a model trained on pre-annotated data belonging to transcribed spoken conversation. The precision and recall of the proposed method for cross-domain dialog act labeling show that the method is effective.
1 Background and Research Goals
Dialogue acts (also referred to as speech acts or illocutionary acts) refer to the act of communicating the intention of the speaker to the listener (receiver or addressee) (Austin, 1962) (Searle, 1969) (Searle, 1975). A dialogue act represents the communicative function (illocutionary force and user intent) of an utterance in a dialogue. Requesting or asking (for example, the utterance: could you please send me the information) is an example of a directive speech act, whereas promising (for example, the utterance: I will send you the information by tomorrow) is an example of a commissive speech act. Searle suggests multiple classes of speech acts, such as Assertives (suggesting, concluding), Directives (asking, requesting), Commissives (promising, vowing), Expressives (thanking, apologizing) and Declarations (Searle, 1969) (Searle, 1975). Automatic dialog act recognition in free-form text has several practical applications, such as intelligent assistance for email data management, deception detection in online communication, information retrieval based on dialog acts and user intentions, user profiling and identification of roles and relationships between people based on their interactive conversations, and email thread and online threaded discussion summarization. However, automatically identifying dialog act categories (greeting, question, apology, request, commit, warning, deny, propose, promise, suggest, advise, etc.) in natural language text based user conversations is an intellectually challenging problem because of technical issues such as natural language ambiguities at the word and sentence level, the presence of noise in text (incorrect grammar and spelling, short-forms), and the presence of a large number of expressions and language constructs for expressing the same intention. The technical challenges and practical utility of the problem have triggered a considerable amount of research activity on dialog act classification. There have been several research efforts on building dialogue act classifiers and developing natural language processing applications that employ dialogue act classifiers as building blocks. In order to provide experimental and evaluation datasets to the research community, several hand-labeled dialog act corpora annotated according to a pre-defined tag-set have been created for research purposes. For example, the Switchboard Dialog
Act Corpus, consisting of transcribed one-on-one human-to-human telephone conversations, is one such manually annotated corpus. The Switchboard Dialog Act Corpus is annotated with approximately 60 basic dialog act tags. The corpus comprises 205,000 utterances annotated using a tag-set known as the SWBD-DAMSL (Switchboard Database - Discourse Annotation and Markup System of Labeling) labels. However, manual labeling of dialogues with respect to a dialogue act tag-set is a human-effort intensive and time consuming process. There are several application domains for which manually annotated training data is not available, and the lack of sufficient labeled data in such domains makes the task of training or learning a classifier challenging. The research presented in this paper is triggered by the need to investigate solutions that exploit annotated data from one domain to perform dialog act classification in another domain. The focus of this paper is on the domain adaptability of dialog act classifiers. The research aims of the work presented in this paper are the following:

1. To investigate solutions for automatic dialog act classification for a domain for which manually annotated training data is not available. In particular, our interest is to investigate techniques based on learning predictive models from a domain which has a sufficiently large annotated corpus and applying the learnt model to automatically label utterances and sentences (belonging to a different domain than the source domain) with appropriate dialog acts.

2. To provide fresh perspectives and insights on the topic of cross-domain, cross-modality and cross-labeling of dialog act categories by performing empirical analysis on four publicly available pre-annotated corpora belonging to different domains and modalities.
Our research objective is to extend the state-of-the-art in the area of domain adaptation in dialogue act classification by proposing novel methods for automatic dialog act classification based on identifying domain-transferable discriminatory linguistic features.
2 Related Work
In this Section, we present a broad overview of closely related work. We first review work in the area of speech act and dialog act classification, surveying some of the main papers (due to limited space) on real-world applications of dialog act classifiers. After reviewing papers on dialog act classifiers and applications, we describe previous work on the topic of domain adaptation in dialog act classification, which is the primary focus of this paper. We also compare and contrast our work with the closely related work on domain adaptability of dialog act classifiers.

2.1 Dialog Act Classification Techniques and Applications
Corston-Oliver et al. present a machine learning based sentence-level speech act classifier for automatically identifying action items (tasks) in email messages (Corston-Oliver et al., 2004). Feng et al. use a graph-based algorithm and integrate different features such as lexical similarity, poster trustworthiness, and speech act analysis (such as confirm or acknowledge, advise or suggest, question) to analyze human conversation focus in the context of online discussions (Feng et al., 2006). Ravi et al. developed linear SVM based speech act classifiers (exploiting n-gram features) which are used to tag messages as questions and answers and then perform thread analysis to identify threads that may have unanswered questions, for the purpose of automatically profiling student interactions in on-line discussions (Ravi and Kim, 2007). Adkins et al. present methods for discriminating between deceptive and non-deceptive messages in CMC (computer mediated communication) using speech act profiling. Their results show that using speech act profiling as part of deception detection in text-based synchronous conversations is promising (Mark Adkins and Jr., 2004). Cohen et al. present a machine learning based approach to perform email-specific speech act classification (Cohen et al., 2004). Carvalho et al. showed that exploiting sequential correlation among email messages in the same thread can improve email-act classification (Carvalho and Cohen, 2005). Leuski uses statistical classification techniques to detect and assign speech acts to individual email messages for the purpose of extracting semantic roles (for example, graduate student and research assistant) and determining the relationship between a pair of roles (for example, a research adviser and student relation between person A and B) (Leuski, 2005). Lampert et al.
present their research motivation as the development of intelligent and automated assistance to users in understanding the status of their email conversations, and propose a mechanism for tagging email utterances with speech acts (Lampert et al., 2006). They describe a statistical classifier that automatically identifies the literal meaning category of utterances using the Verbal Response Modes (VRM) taxonomy of speech acts (Lampert et al., 2006). Ulrich et al. present a regression-based machine learning approach to email thread summarization and investigate the usefulness of features based on speech acts for solving the problem of summarization of email conversations (Jan Ulrich, 2009). They apply manual annotation and use a system called Ciranda to assign speech act labels at the email and sentence level in their experiments, and show that speech acts are a useful feature for the task of email thread summarization (Jan Ulrich, 2009). Mildinhall et al. discuss a method of partitioning speech acts within email messages consisting of a unidirectional stream of several utterances and successions of speech acts. Their approach is based on examining the structure of messages by comparing probability transition matrices of speech act categories using multidimensional scaling and hierarchical clustering (Mildinhall and Noyes, 2008).

2.2 Learning and Domain Adaptation for Statistical Classifiers
Blitzer et al. introduce a method called structural correspondence learning to adapt existing models from a resource-rich source domain to a resource-poor target domain (Blitzer et al., 2006). They test their technique on the NLP task of part-of-speech tagging and demonstrate encouraging results. Daumé and Marcu present a framework for understanding the domain adaptation problem and apply their proposed novel framework in the context of conditional classification models and conditional linear-chain sequence labeling models (Daumé and Marcu, 2006). Mansour discusses learning and domain adaptation and presents recent theoretical models and results regarding domain adaptation (Mansour, 2009).

2.3 Cross-Domain Adaptability of Dialog Act Classifiers
Cross-domain adaptability of dialog act classification models has been studied by Tur et al. (a study on the portability of a model induced from the SWITCHBOARD corpus to predict the dialog acts of utterances in the ICSI-MRDA corpus) and Webb et al. (a study on the portability of n-gram cue phrases from the Switchboard corpus to the AMITIES corpus) (Tur et al., 2006a), (Tur et al., 2006b), (Webb and Liu, 2008). The work by Tur et al. is based on supervised model adaptation and uses the Boosting family of classifiers, whereas our specific interest is the transferability of n-gram cue-phrases across domains. The most closely related work to this paper is that of Webb et al. (2008). The research motivation of Webb et al. is to study the domain adaptability and transferability of dialogue act predictive models to new domains and corpora (a relatively unexplored area), which is the same as our research motivation. The work of Webb et al. examines the portability of corpus-derived n-gram cue phrases (as the presence of certain domain-independent cue phrases is a strong indicator of dialogue act categories) for cross-domain dialogue act tagging. Their study consists of extracting frequent word n-grams from the Switchboard corpus and examining the transferability and generality of the derived n-grams to the AMITIES (Automated Multilingual Interaction with Information and Services) corpus (Webb and Liu, 2008). The corpora used in the study by Webb et al. consist of two-sided human-to-human spoken dialogue transcriptions. The difference between the work by Webb et al. and this work is that we study the domain adaptability of corpus-derived frequent n-gram cue-phrases from transcribed telephone conversations, which have sufficiently large annotated corpora, to quite diverse domains and modalities such as email and online chat, which do not have large annotated corpora. Also, we study approaches to reduce noise or irrelevant features by combining knowledge from multiple source corpora (and assigning more weight to common features) to label dialog acts in a target corpus, which is not covered by Webb et al.
3 Research Contributions
While there has been a significant amount of work in the area of automatic dialog act classification on free-form text and its applications, the topic of domain adaptability in the context of dialog act classification is a relatively unexplored area. The work presented in this paper is an attempt to advance the state-of-the-art on domain adaptability in dialog act classification and, in the context of the related work, makes the following unique contributions:

1. To the best of our knowledge, this paper is the first study on developing a dialog act classifier using frequent n-gram based features derived from the Switchboard and ICSI MRDA corpora (written transcriptions of two-sided telephone conversation recordings and multi-party meeting recordings) and testing its performance with respect to domain adaptation on corpora belonging to the diverse domains or modalities of email (BC3 corpus) and online chat (NPS corpus). The paper reports empirical results by performing experiments on four corpora: SW, MRDA, BC3 and NPS. The usage of these four publicly available corpora in a single study for dialog act analysis is the first in the literature.

2. A study and empirical analysis of the transferability of n-gram cue phrases derived from utterances annotated with one type of tag (such as action-directive and questions in telephone conversations) to utterances labeled with semantically similar tags (such as requests in formal email communications between team members in an organization).

3. A study and empirical analysis of the effect of combining n-gram cue phrases derived from dialog act tagged corpora from two different communication domains to identify dialog acts on utterances belonging to another application domain.

Yes-No-Question | Or-Question       | Wh-Question    | Rhetorical-Question | Action-Directive | Commit
do they have    | are you talking   | do you think   | do we have          | dont worry about | i have to
do you do       | do you have       | how do you     | how do you          | dont you go      | i will ask
do you have     | do you think      | what are you   | if you dont         | lets talk about  | i will start
do you know     | is it a           | what do you    | is there a          | why dont you     | i will try
do you think    | is it just        | what is it     | the question is     | you go ahead     | i will do
do you want     | is that just      | what is the    | you know what       | you have to      | im going to
is that right   | or do you         | what kind of   | what kind of        | you need to      | will do that
is that the     | or is it          | what was the   |                     | you want to      | will try to
is that what    | or something like | what would you |                     |                  |

Table 1: Illustrative list of common frequent tri-grams between the Switchboard Corpus and the ICSI MRDA Corpus for six dialog acts

Corpus (N-gram) | YN-Q | OR-Q | WH-Q | RH-Q | Open-Q | Directive | Commit
SW bi-gram      | 6%   | 4%   | 8%   | 0%   | 12%    | 3%        | 18%
SW tri-gram     | 7%   | 17%  | 10%  | 5%   | 16%    | 19%       | 11%
MRDA bi-gram    | 5%   | 3%   | 12%  | 4%   | 3%     | 7%        | 16%
MRDA tri-gram   | 20%  | 10%  | 16%  | 4%   | 23%    | 20%       | 6%

Table 2: Percentage of noise in the top 100 n-gram cue phrases derived from the SW and MRDA corpora for various dialog act categories
4 Empirical Analysis and Results
The corpora used in this study belong to transcribed spoken telephone conversations (Switchboard and ICSI-MRDA corpora) (Jurafsky et al., 1997) (Shriberg et al., 2004), emails (BC3 corpus) (Ulrich et al., 2008) (Carvalho and Cohen, 2006) and online chat (NPS corpus) (Forsyth and Martell, 2007). In this Section, we present research questions and our hypotheses, and report empirical results in support of our hypotheses.

Research Question (RQ1): To what extent can irrelevant features (noise) be eliminated by extracting common n-gram cue-phrases from the SW and MRDA corpora?

We derive the top (most frequent) 100 bi-grams and tri-grams from the MRDA and Switchboard (SW) corpora for various dialog act categories (we used LingPipe for frequent n-gram extraction: http://alias-i.com/lingpipe). We notice that there are some n-grams which are frequent in a corpus within a particular dialog act category but are actually not indicators of that dialog act (not considering n-grams which are frequent in general, such as "on the", "in the", "of the", "to the", "at the", "for the", "with the", "for a", "on it", "on this", "to be"). For example, the phrase "a lot of" shows up within the top 10 frequent tri-grams in the SW corpus for the Yes-No Question dialog act category but is actually not a discriminatory feature (n-gram in this case) for the Yes-No Question dialog act (also, it is not a feature which is common to other dialog act categories). Thus, the n-gram "a lot of" can be regarded as noise (a feature which is irrelevant and not a signal or predictor for the dialog act in which it is prevalent) - an artifact of a specific dataset rather than a cue which has generalization power. A visual inspection of the extracted n-grams clearly revealed to us that noise in frequent n-gram cue-phrases can be present even in a sufficiently large corpus. We manually classified each n-gram as relevant or irrelevant to understand the extent to which noise can be present. Table 2 presents the percentage of noise in the top 100 n-gram cue phrases derived from the SW and MRDA corpora for various dialog act categories. The results presented in Table 2 reveal that noise can vary from a minimum of 0% to a maximum of 23%. We observe that several irrelevant n-grams are present at the beginning of the top 100 frequent n-gram list and several relevant n-grams appear at the bottom of the list. The noise is interleaved with the signal and hence it is not straightforward to remove noise by just setting a cut-off (for example, top 30) to separate signal from noise. We hypothesize that extracting common frequent n-grams across two corpora for the same dialog act can remove irrelevant features.

N-gram   | YN-Q      | OR-Q      | WH-Q      | RH-Q      | Open-Q    | Directive  | Commit
bi-gram  | 0% [0/48] | 0% [0/42] | 0% [0/41] | 0% [0/39] | 0% [0/19] | 0% [0/34]  | 0% [0/20]
tri-gram | 4% [1/24] | 0% [0/23] | 0% [0/17] | 0% [0/14] | 12% [1/8] | 10% [1/10] | 0% [0/15]

Table 3: Number of common frequent n-grams between the SW and MRDA corpora for various dialog acts and the percentage of noise within the common n-grams (cells show noise% [noisy/common])
Our hypothesis is based on the premise that an n-gram which is frequent across two corpora for the same dialog act category is a stronger property of the respective dialog act than a frequent n-gram which occurs in only one corpus. Hence, we conduct an experiment to check if noise can be eliminated and
domain-independent cue phrases can be extracted by computing an intersection of the frequent n-grams from two different corpora. Table 1 lists the common frequent top-30 n-grams for six dialog acts between the MRDA and SW corpora. A visual inspection of the result shows that noise can be eliminated by combining frequent n-gram results from different corpora, and n-grams which lie at the intersection can be assigned a higher discriminatory weight. We also notice that there are several important discriminatory n-grams which are present in the MRDA corpus but not in the SW corpus and vice versa, which suggests that combining features from two corpora can help in reducing noise as well as increase the coverage of discriminatory features, since the final features to be used are derived from knowledge gained from two diverse corpora (empirical results in Tables 5 and 6). Table 3 presents empirical results to answer the stated research question: to what extent can irrelevant features (noise) be eliminated by extracting common n-gram cue-phrases from the SW and MRDA corpora? The results in Table 3 reveal that there are 42 (out of 100) common bi-grams for the dialog act OR-Question and all 42 are relevant features (i.e., the percentage of noise is 0%). Similarly, there are 24 (out of 100) common tri-grams for the YN-Question dialog act and the percentage of noise is 4%, i.e., 1 out of the 24 features is irrelevant and not a predictor for the YN-Question dialog act. We conclude that extracting common frequent n-grams for the same dialog act across two corpora can help in eliminating noise. The limitation is that the common frequent n-gram features constitute only a certain percentage of the combined cue phrases that can be derived from the corpora, as a result of which using only common frequent n-gram based features for dialog act categorization will result in low classification accuracy.
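The intersection step described above can be sketched as follows. This is a minimal illustration, not the LingPipe-based pipeline used in the study: the sentence lists are toy stand-ins for dialog-act-tagged utterances, and the top-k cutoff is a parameter (100 in the paper).

```python
from collections import Counter

def top_ngrams(sentences, n, k=100):
    """Return the k most frequent word n-grams in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return [gram for gram, _ in counts.most_common(k)]

def common_cues(corpus_a, corpus_b, n=3, k=100):
    """Cue phrases frequent in BOTH corpora for the same dialog act."""
    return set(top_ngrams(corpus_a, n, k)) & set(top_ngrams(corpus_b, n, k))

# Toy stand-ins for SW and MRDA utterances tagged Yes-No-Question.
sw = ["do you have a pet", "do you think it will rain", "a lot of people do"]
mrda = ["do you have the slides", "do you think we are done"]
print(common_cues(sw, mrda, n=3, k=5))
```

Note how a corpus-specific artifact such as "a lot of" drops out of the intersection because it is frequent in only one of the two corpora, which is exactly the noise-elimination effect observed in Table 3.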
Research Question (RQ2): To what extent are frequent n-gram based features from the source corpora (SW and MRDA) present in a semantically similar dialog act of another corpus belonging to a different modality (BC3 and NPS)?

Table 4 is an illustrative list of sentences in the BC3 Email corpus tagged with the Request dialog act (referred to as an Email Act by (Carvalho and Cohen, 2006)) containing n-gram cue phrases derived from the MRDA and SW corpora. Our interest is to study whether cue-phrases derived from one corpus (MRDA and SW) for a set of dialog acts (Action-Directive, Offer, Rhetorical-Question, Open-Question, Or-Question, Wh-Question and Yes-No-Question) can be used as features for a semantically similar or related (but not the same) dialog act category (Request) across different communication domains and modalities (from transcribed spoken conversations on the phone on general topics and official meetings to email discussions between team members in an organization). Table 4 shows that a speaker's intention of Request can be mapped to dialog acts such as questions, action directives and offers. Our objective is to see if cue-phrases learnt from dialog-acts for which annotated data is available (in the source corpus) can be used for tagging semantically related dialog-acts (in the target corpus) for which no training or annotated data is available in the source corpus (Request is not a dialog act category in the Switchboard and MRDA corpora). We notice that the Proposal request act in the BC3 corpus also contains cue-phrases derived from question, action directive and offer dialog act tagged data from our source corpora. A visual inspection validates our hypothesis that cue phrase transferability between semantically similar dialog acts across different communication domains is possible. We perform an experiment to answer the stated research question RQ2. Tables 5 and 6 present empirical results on the transferability of dialog-act cue phrases across communication domains, modalities and semantically similar dialog-acts.

BC3 Sentence (Email Act: Request) | Corpus | Dialog Act
so now we have to decide on the map which we will use to overlay the dots on | MRDA | Action-Directive
we need to specify the map as soon as possible in order to get postcards printed for www2003 | MRDA | Action-Directive
things you need to do to meet the checkpoint these can be fairly general | MRDA | Action-Directive
for those of you who want to attend the sig let me know if you have any hard constraints | MRDA | Action-Directive
things you need to do to meet the checkpoint these can be fairly general | SW | Action-Directive
look at the source | SW | Action-Directive
for example it would be very useful so that the final wg schedule was available | MRDA | Offer
if we can plan what about planning the year australia included | MRDA | Offer
only if we can go to a certain shop barb knows what i am talking about | MRDA | Offer
maybe we could go there | MRDA | Offer
is there a good time to call in after amsterdam lunch | MRDA | Rhetorical-Question
it seems slightly buggy but what do you think | SW | Rhetorical-Question
it seems slightly buggy but what do you think | SW | Rhetorical-Question
what about the wcag curriculum | MRDA | Open-Question
do you feel we will need the full two days to work on the draft | SW | Open-Question
what are your thoughts | SW | Open-Question
is that a discussion you feel able to contribute to | MRDA | Or-Question
has the decision been made or should i express my preference for seattle | MRDA | Or-Question
what is the relevance to web accessibility | MRDA | Wh-Question
where is the best place to find our specifics about the issues involved with web pages | MRDA | Wh-Question
what are your thoughts | SW | Wh-Question
what is the url | SW | Wh-Question
is there any reason it has to be geographically accurate | MRDA | Yes-No-Question
is there any way to find out how many people have voted | MRDA | Yes-No-Question
would it be helpful if i was present | MRDA | Yes-No-Question
would it be acceptable to provide a warning message something like submitting this form | MRDA | Yes-No-Question
if you have any questions or special requirements please contact me | SW | Yes-No-Question

Table 4: Illustrative list of sentences (having the Request email act) in the BC3 email corpus containing frequent tri-grams of the Switchboard and ICSI MRDA corpora

N-Gram   | BC3 Act | SW  | MRDA | SW+MRDA
Bi-gram  | Request | 58% | 60%  | 67%
Bi-gram  | Commit  | 75% | 64%  | 82%
Bi-gram  | Propose | 60% | 72%  | 79%
Tri-gram | Request | 13% | 25%  | 27%
Tri-gram | Commit  | 14% | 27%  | 34%
Tri-gram | Propose | 12% | 22%  | 28%

Table 5: Empirical results on domain transferability (source: MRDA and SW, target: BC3) for bi-gram and tri-gram cue-phrases

N-Gram   | NPS Act | SW  | MRDA | SW+MRDA
Bi-gram  | yn-Ques | 37% | 38%  | 42%
Bi-gram  | wh-Ques | 40% | 46%  | 50%
Tri-gram | yn-Ques | 05% | 04%  | 07%
Tri-gram | wh-Ques | 09% | 08%  | 13%

Table 6: Empirical results on domain transferability (source: MRDA and SW, target: NPS Chat) for bi-gram and tri-gram cue-phrases

SW+MRDA bi-grams in NPS Corpus: are you, are we, do you, did you, how are, how did, how do, how old, how many, how much, was the, what are, what did, what do, what is, what was, what kind, where are, where did, why not

NPS bi-grams not in SW+MRDA: how come, how is, how was, how you, on what, what happen, what time, when did, where in, where from, who ate, who comes, who is, who said, who says, who wants, who was, why did, why is, why were

Table 7: List of presence and absence of bi-gram cue-phrases across corpora for wh-question and yn-question

As presented in Table 5, we notice that the bi-grams in the SW corpus with the Commit tag are present in 75% of the sentences in the BC3 email corpus (a different domain and modality than the MRDA and SW corpora). Similarly, the tri-grams derived from the MRDA corpus for the Commit tag are present in 27% of the sentences in the BC3 corpus. Table 5 presents results for bi-grams and tri-grams for three dialog acts in the BC3 corpus. Table 5 also presents the improved accuracy obtained by combining n-gram cue-phrases from two corpora. For example, we notice that if the bi-grams from the MRDA and SW corpora are combined then the combined bi-grams are present in 67%, 82% and 79% of the sentences in the BC3 corpus for the Request, Commit and Propose categories respectively. We observe that there is a certain class of sentences in the target domain for which the source domain does not contain enough examples to learn or induce a model for the same or a similar dialog-act category. For example, the following is an illustrative list of sentences in the BC3 corpus labeled with the Request dialog-act tag which do not contain a frequent bi-gram or tri-gram (n-gram cue phrases derived from semantically similar dialog-act categories) from the source corpora:

- anyone know any web sites that are highly
- anyone out there who can help
- can people check their diaries
- how is everyone with 4:00pm edt on monday
- please check out the exact time for when
- please do add yourselves if youve not
- please let us know as soon as possible
- please talk to your sysadm about

This is due to domain-specific linguistic properties (in an email setting, it is common to make a request or an action directive using the "please" keyword). This examination shows that there are some indicators (cue-phrases like the presence of the word "please" to make a request or action-directive) which are not present in the source domain, and hence learning from additional labeled data is required to obtain a high precision and recall classifier for a target domain for which enough labeled data is not available. Also, we notice that the correlation between the presence of phrases like "anyone know" and the request or action-directive speech act is not extracted from a source domain like the SW corpus, as SW consists of conversations between two parties, unlike the target domain where a person sends a request mail to a list or group (hence using the words: anyone, everyone or people). Table 6 presents empirical results on the transferability of bi-grams and tri-grams from the SW and MRDA corpora to the NPS Chat Corpus for YN-Question and WH-Question. Table 7 lists the union of the SW and MRDA n-grams present in the NPS Corpus and the NPS n-grams which are not in SW and
MRDA corpus. We applied text pre-processing to the NPS chat corpus as it contains out-of-vocabulary terms and slang common in casual online chat conversations. The pre-processing consists of replacing slang and short-forms with their correct form (for example, "r u" or "ru" is replaced with "are you", "whats" or "wat" or "wats" is replaced with "what is", and "any 1" is replaced with "anyone"). The results show that domain-specific pre-processing is required and can increase accuracy. The results in Tables 6 and 7 show the extent of domain transferability of cue phrases. Even after combining cue phrases from two large corpora (MRDA and SW), we notice that there are several cue-phrases (refer to Table 7) which are left out. However, across domains and modalities there is transferability of around 40% of the cue-phrases (for bi-grams) for the two question dialog acts (refer to Table 6).

Research Question (RQ3): To what extent can frequent n-gram based features from the SW and MRDA corpora be used to classify sentences in the BC3 corpus?

Ground truth:         | Request | Commit | Propose
Classified as Request | 29      | 2      | 13
Classified as Commit  | 7       | 11     | 2
Classified as Propose | 36      | 8      | 37
Unable to Classify    | 28      | 0      | 6
Total Sentences       | 100     | 21     | 58
Precision             | 40%     | 52%    | 71%
Recall                | 29%     | 52%    | 63%

Table 8: BC3 dialog act classifier performance (based on combined SW and MRDA frequent bi-gram based features) in terms of precision and recall

Ground truth:         | Request | Commit | Propose
Classified as Request | 11      | 1      | 4
Classified as Commit  | 2       | 5      | 1
Classified as Propose | 11      | 5      | 16
Unable to Classify    | 76      | 10     | 37
Total Sentences       | 100     | 21     | 58
Precision             | 45%     | 45%    | 76%
Recall                | 11%     | 24%    | 28%

Table 9: BC3 dialog act classifier performance (based on combined SW and MRDA frequent tri-gram based features) in terms of precision and recall
The main research aim of this work is to develop a dialog act classifier for the BC3 email corpus which is trained on the SW and MRDA transcribed spoken conversation corpora. We develop a predictive model that takes a manually annotated sentence (tagged as either Request, Propose or Commit) from the BC3 corpus and predicts its dialog act category (as Request, Propose or Commit). The performance of the proposed classifier is evaluated based on the percentage of correctly classified sentences for each of the categories. The BC3 corpus consists of 40 email threads (3222 sentences) from the W3C corpus (Ulrich et al., 2008). The annotation consists of Extractive Summaries, Abstractive Summaries with linked sentences, Sentences labeled with Speech Act labels (Propose, Request, Commit, and Meeting), Meta Sentences and Subjectivity Labels. We extract sentences tagged as Request, Propose or Commit by at least two annotators, which act as the test dataset having the ground-truth. We first derive the top 100 frequent bi-grams and tri-grams for the eight dialog act categories (commit, action-directive, offer, yes-no question, wh-question, or-question, open-question and rhetorical-question) and create a matrix that assigns a numeric score to each n-gram and dialog-act pair. The numeric score represents how good a predictor (discriminatory power) the n-gram is for the various dialog act categories. We apply the following formula to compute the numeric score:

score = sqrt(IN^2 / (RA * OT))    (1)
IN takes three values: 0, 1 or 2. If an n-gram is not present in the frequent n-gram lists of SW and MRDA for a particular dialog act category then the value of IN is 0. This means that if an n-gram is not correlated with a dialog act category in the training dataset then it is not a predictor for that dialog act. An n-gram may be present in the frequent n-gram list of SW and not MRDA (or vice versa) for the same dialog act; in such cases, IN is assigned the value of 1. If the n-gram is present in the frequent n-gram lists of both source corpora then IN is assigned the value of 2. The premise is that if an n-gram shows correlation with a particular dialog act in both source corpora then it is probably a good predictor for that dialog act. This links to our previous argument that frequent n-gram features common to two corpora have more generalization power than n-grams present in only one corpus.

RA takes values from 1 to 100 and represents the rank of the n-gram in the frequent n-gram list of the dialog act for which the score is being computed. If the n-gram is present in the frequent n-gram lists of both corpora then we assign the higher (better) of the two ranks to RA. This is based on the premise that the more frequent an n-gram is in sentences belonging to a particular dialog act, the better an indicator it is for that dialog act. OT takes values from 1 to 15 and represents the number of other frequent n-gram lists in which the target n-gram is present. The basic premise is that if an n-gram is present in multiple frequent n-gram lists then its discriminatory power decreases, as it is a property of multiple dialog acts; if an n-gram is unique to a particular dialog act then its discriminatory power for that dialog act is higher. The intention behind Equation (1) (a proposed heuristic) is to capture the discriminatory power of an n-gram (a feature or property of a dialog act) as a function of its frequency for the target dialog act, its commonality across multiple corpora and its presence in the frequent n-gram lists of other dialog acts.

We map the commit dialog act in the source corpora to the commit dialog act in the target corpus. The command, suggestion and question dialog acts in the source corpora are mapped to request and propose in the target corpus. We perform two sets of experiments (one each for combined bi-grams and tri-grams from the SW and MRDA corpora). Tables 8 and 9 show the confusion matrix, the number of sentences that could not be classified (no n-gram from the test sentence was found in the trained model), and the precision and recall for the Request, Propose and Commit dialog acts in the BC3 corpus.
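The scoring heuristic can be sketched as follows. The extracted form of Equation (1) is ambiguous, so this sketch assumes the score is computed as √(IN² / (RA · OT)), with rank 1 denoting the most frequent n-gram; the function name `ngram_score` is a hypothetical label, not from the paper:

```python
import math

def ngram_score(in_val, rank, other_lists):
    """Heuristic discriminatory score of an n-gram for one dialog act.

    in_val: 0, 1 or 2 -- number of source corpora (SW, MRDA) whose frequent
            n-gram list for this dialog act contains the n-gram.
    rank:   1..100 -- best rank of the n-gram in the dialog act's frequent
            n-gram list(s); rank 1 = most frequent (assumption).
    other_lists: 1..15 -- number of other frequent n-gram lists that contain
            the n-gram; appearing in many lists lowers discriminatory power.
    """
    if in_val == 0:
        return 0.0  # absent from both source corpora: not a predictor at all
    return math.sqrt(in_val ** 2 / (rank * other_lists))
```

Under this reading the score rises with corpus agreement (IN) and falls with a worse rank (RA) or broader spread across dialog acts (OT), matching the three premises stated above.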
Tables 8 and 9 reveal the extent to which the Request, Propose and Commit dialog acts in the BC3 corpus can be accurately classified (answering the stated research question RQ3) based on a model learnt from different corpora (SW and MRDA) belonging to a different domain. We observe that the proposed classifier discriminates between commit and request/propose with good accuracy, whereas discriminating between request and propose remains challenging for the classifier.
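The per-category precision and recall reported in Tables 8 and 9 follow the standard definitions over a confusion matrix. A small sketch, using invented counts (not the paper's actual figures) and excluding unclassifiable sentences, as the paper reports those separately:

```python
def precision_recall(confusion, label):
    """Precision and recall for one label, given
    confusion[true_label][predicted_label] = sentence count."""
    labels = list(confusion)
    tp = confusion[label][label]
    predicted = sum(confusion[t][label] for t in labels)  # column total
    actual = sum(confusion[label][p] for p in labels)     # row total
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy confusion matrix (NOT the figures from Tables 8 and 9)
cm = {
    "request": {"request": 30, "propose": 10, "commit": 2},
    "propose": {"request": 12, "propose": 25, "commit": 3},
    "commit":  {"request": 2,  "propose": 3,  "commit": 20},
}
p, r = precision_recall(cm, "commit")
```

With these invented counts, the commit category scores well while request and propose absorb most of each other's errors, mirroring the qualitative pattern described above.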
5
Conclusions
Intersecting the sets of corpus-derived cue-phrases from multiple corpora eliminates noise and non-discriminatory n-gram features. The sufficiently large available annotated corpora (Switchboard and MRDA) alone are not enough for deriving a comprehensive list of cue-phrases for commonly used dialog act categories, and a hybrid solution (combining results from multiple corpora) results in improved accuracy. Results show that transferability of cue-phrases across semantically related dialogue acts, modality and domain is possible. However, due to differences in genre and writing style between two domains, certain word n-gram indicators or clues required to classify sentences in the target domain are unavailable in the source domain and hence cannot be derived or learnt. The paper describes a method to label request, commit and propose dialog acts in the Email domain (target domain) based on a model trained on pre-annotated data belonging to transcribed spoken conversation (source domain). The study presents an empirical analysis to further the understanding of four publicly available labeled corpora belonging to different domains with respect to domain adaptation in dialogue act recognition.
References

John L. Austin. 1962. How to Do Things with Words. Harvard University Press, Cambridge, MA.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In EMNLP '06: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128, Morristown, NJ, USA. Association for Computational Linguistics.

Vitor R. Carvalho and William W. Cohen. 2005. On the collective classification of email "speech acts". In SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 345–352.

Vitor R. Carvalho and William W. Cohen. 2006. Improving email speech act analysis via n-gram selection. In Proceedings of the HLT/NAACL - ACTS Workshop, pages 35–41, New York City, NY. Association for Computational Linguistics.

William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. 2004. Learning to classify email into "speech acts". In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 309–316, Barcelona, Spain, July. Association for Computational Linguistics.

Simon Corston-Oliver, Eric Ringger, Michael Gamon, and Richard Campbell. 2004. Task-focused summarization of email. In Proceedings of the Text Summarization Branches Out ACL Workshop.

Hal Daumé III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26(1):101–126.

Donghui Feng, Erin Shaw, Jihie Kim, and Eduard Hovy. 2006. Learning to detect conversation focus of threaded discussions. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 208–215, Morristown, NJ, USA. Association for Computational Linguistics.

Eric N. Forsyth and Craig H. Martell. 2007. Lexical and discourse analysis of online chat dialog. In ICSC '07: Proceedings of the International Conference on Semantic Computing, pages 19–26, Washington, DC, USA. IEEE Computer Society.

Jan Ulrich, Giuseppe Carenini, Gabriel Murray, and Raymond Ng. 2009. Regression-based summarization of email conversations. In 3rd Int'l AAAI Conference on Weblogs and Social Media (ICWSM-09), San Jose, CA. AAAI.

D. Jurafsky, R. Bates, N. Coccaro, R. Martin, M. Meteer, K. Ries, E. Shriberg, A. Stolcke, P. Taylor, and C. Van Ess-Dykema. 1997. Switchboard discourse language modeling project report. Technical report, Johns Hopkins University, Center for Speech and Language Processing, Baltimore, MD.

Andrew Lampert, Robert Dale, and Cecile Paris. 2006. Classifying speech acts using verbal response modes. In Proceedings of the Australasian Language Technology Workshop 2006, pages 34–41, Sydney, Australia, November.

Anton Leuski. 2005. Context features in email archives.

Yishay Mansour. 2009. Learning and domain adaptation. In ALT '09: Proceedings of the 20th International Conference on Algorithmic Learning Theory, pages 4–6, Berlin, Heidelberg. Springer-Verlag.

Mark Adkins, Douglas P. Twitchell, Judee K. Burgoon, and Jay F. Nunamaker Jr. 2004. Advances in automated deception detection in text-based computer-mediated communication. In Proceedings of the SPIE Defense and Security Symposium, Florida, USA.

John Mildinhall and Jan Noyes. 2008. Toward a stochastic speech act model of email behavior. In Fifth Conference on Email and Anti-Spam, Mountain View, California, USA.

Sujith Ravi and Jihie Kim. 2007. Profiling student interactions in threaded discussions with speech act classifiers. In Proceedings of the 2007 Conference on Artificial Intelligence in Education, pages 357–364, Amsterdam, The Netherlands. IOS Press.

John R. Searle. 1969. Speech Acts. Cambridge University Press.

John R. Searle. 1975. A taxonomy of illocutionary acts. In Language, Mind and Knowledge, Minnesota Studies in the Philosophy of Science, pages 344–369.

Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI Meeting Recorder Dialog Act (MRDA) corpus. In Michael Strube and Candy Sidner, editors, Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, pages 97–100, Cambridge, Massachusetts, USA, April 30 - May 1. Association for Computational Linguistics.

G. Tur, U. Guz, and D. Hakkani-Tur. 2006a. Model adaptation for dialog act tagging. In Proc. IEEE/ACL Workshop on Spoken Language Technology.

Gokhan Tur, Umit Guz, and Dilek Hakkani-Tur. 2006b. Model adaptation for dialog act tagging. In Proceedings of the IEEE Workshop on Spoken Language Technology (SLT 2006), pages 94–97, Palm Beach, Aruba, December. IEEE.

Jan Ulrich, Gabriel Murray, and Giuseppe Carenini. 2008. A publicly available annotated corpus for supervised email summarization. In Proceedings of the AAAI 2008 EMAIL Workshop, pages 77–81, Chicago, USA. AAAI.

Nick Webb and Ting Liu. 2008. Investigating the portability of corpus-derived cue phrases for dialogue act classification. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 977–984, Manchester, UK, August. Coling 2008 Organizing Committee.