A Data-driven method for Adaptive Referring Expression Generation in Automated Dialogue Systems: Maximising Expected Utility Srinivasan Janarthanam & Oliver Lemon School of Informatics, University of Edinburgh www.classic-project.org
Introduction
Adaptive generation of referring expressions in dialogue systems benefits grounding between dialogue partners (Isaacs & Clark 1987). Although adapting to users is beneficial, adapting to an unknown user is difficult, and hand-coding such adaptive REG policies is cumbersome. We present a data-driven framework to automatically learn an adaptive REG (NLG) policy for spoken dialogue systems (fig. 1) using Reinforcement Learning. The learned policy tries to maximise the expected utility of the RE choices made by the system.

Fig 1. Adaptive dialogue system
User Simulation

Step 1.  P(CR_{u,t} | RE_{s,t}, DK_{u,RE}, H)
Step 2a. P(A_{u,t} | A_{s,t}, CR_{u,t})
Step 2b. P(EA_{u,t} | A_{s,t}, CR_{u,t})

where:
• RE_{s,t} – referring expression used in the system’s utterance at turn t
• CR_{u,t} – clarification request by the user u at turn t
• DK_{u,RE} – domain knowledge of the user u on referring expression RE
• H – history of clarifications already given
• A_{s,t} – system action at turn t
• A_{u,t} – user dialogue action at turn t
• EA_{u,t} – user’s environment action at turn t

The user simulation model probabilities were populated from data collected in Wizard-of-Oz experiments with real users; see (Janarthanam and Lemon, ENLG 2009) for more information.
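A minimal sketch of how these distributions could drive one simulated user turn (the probability tables, outcome labels, and the sample/simulate_user_turn helpers are illustrative assumptions, not the authors’ implementation):

import random

def sample(dist):
    # Draw an outcome from a {outcome: probability} table.
    r, cum = random.random(), 0.0
    for outcome, p in dist.items():
        cum += p
        if r < cum:
            return outcome
    return outcome  # guard against floating-point rounding

# Hypothetical tables; in the poster these are estimated from the
# Wizard-of-Oz data.
# Step 1: P(CR_{u,t} | RE_{s,t}, DK_{u,RE}, H)
p_cr = {
    ("jargon", "unknown", "new"): {"clarify": 0.8, "none": 0.2},
    ("jargon", "known",   "new"): {"clarify": 0.1, "none": 0.9},
    ("desc",   "unknown", "new"): {"clarify": 0.1, "none": 0.9},
}
# Step 2a: P(A_{u,t} | A_{s,t}, CR_{u,t})
p_act = {("instruct", "none"):    {"acknowledge": 0.9, "other": 0.1},
         ("instruct", "clarify"): {"request_clarification": 1.0}}
# Step 2b: P(EA_{u,t} | A_{s,t}, CR_{u,t})
p_env = {("instruct", "none"):    {"follow_instruction": 0.95, "none": 0.05},
         ("instruct", "clarify"): {"none": 1.0}}

def simulate_user_turn(re_s, dk_u_re, history, a_s):
    cr = sample(p_cr[(re_s, dk_u_re, history)])   # Step 1
    a_u = sample(p_act[(a_s, cr)])                # Step 2a
    ea_u = sample(p_env[(a_s, cr)])               # Step 2b
    return cr, a_u, ea_u

print(simulate_user_turn("jargon", "unknown", "new", "instruct"))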
NLG module
Translates dialogue acts into system utterances, and identifies the REs used in those utterances to refer to the domain objects, based on the REG policy:
• Jargon – Use technical terms as referring expressions: “Connect one end of the broadband cable to the broadband filter.”
• Descriptive – Use descriptive referring expressions “Connect one end of the thin cable with grey ends to the small white box.”
• Tutorial – Use both to teach technical terms “Connect one end of the broadband cable to the broadband filter. The broadband cable is the thin cable with grey ends. The broadband filter is the small white box.”
The user model (UM_{s,u}), based on which REs are chosen, is dynamically updated with information about the user’s domain knowledge (a sketch of this bookkeeping follows).
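A minimal sketch, assuming a binary known/unknown record per technical term (the DESCRIPTIONS table and the choose_re/update_user_model helpers are hypothetical, not the authors’ implementation):

# Hypothetical descriptive paraphrases for the technical terms above.
DESCRIPTIONS = {
    "broadband cable":  "the thin cable with grey ends",
    "broadband filter": "the small white box",
}

# UM_{s,u}: the system's current belief about the user's knowledge.
user_model = {term: "unknown" for term in DESCRIPTIONS}

def choose_re(term, policy):
    # Realise one referent under the three REG strategies above.
    if policy == "jargon":
        return term
    if policy == "desc":
        return DESCRIPTIONS[term]
    return f"{term}, that is, {DESCRIPTIONS[term]}"  # tutorial

def update_user_model(term, clarification_requested):
    # A clarification request is evidence the user does not know the
    # term; smooth uptake is evidence that they do.
    user_model[term] = "unknown" if clarification_requested else "known"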
Reinforcement Learning
A basic Reinforcement Learning (RL) setup consists of a learning agent, its environment, and a reward model (Sutton and Barto, 1998). The learning agent explores by taking different possible actions in different states and exploits the actions for which the environmental rewards are high. RL has been used successfully to learn dialogue management policies (Levin et al., 1997). In our model, the learning agent is the NLG module of the dialogue system, whose objective is to learn an adaptive REG policy. The environment consists of a user simulation which interacts with the dialogue system (fig 2). During learning, the NLG module explores by choosing different expressions, and the user simulation rewards the system when it chooses appropriate referring expressions. The NLG module thus reinforces the choices that earn higher rewards and avoids those that earn lower ones.

Fig 2. Reinforcement Learning Setup
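A toy version of this loop, written as Monte Carlo value estimation with epsilon-greedy exploration (the state encoding, constants, and simulated clarification probabilities are illustrative assumptions; only the reward scheme, TCR = 1000 with −30 per clarification request, comes from the Training section below):

import random
from collections import defaultdict

ACTIONS = ["jargon", "desc", "tutorial"]
EPSILON = 0.2                       # exploration rate (illustrative)
Q = defaultdict(float)              # Q[(belief, action)] -> value estimate
N = defaultdict(int)                # visit counts for incremental averaging

def choose(belief):
    # Epsilon-greedy exploration over the three REG strategies.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(belief, a)])

def episode(expert_user, n_instructions=5):
    visited, n_cr, belief = [], 0, "unknown"
    for _ in range(n_instructions):
        action = choose(belief)
        visited.append((belief, action))
        # Toy user simulation: novices often ask for clarification of jargon.
        p_cr = 0.05 if (expert_user or action != "jargon") else 0.7
        if random.random() < p_cr:
            n_cr += 1
        elif action in ("jargon", "tutorial"):
            belief = "known"        # the technical term has been grounded
    reward = 1000 - 30 * n_cr       # TCR + CCR (see Training section)
    for sa in visited:              # Monte Carlo update towards final reward
        N[sa] += 1
        Q[sa] += (reward - Q[sa]) / N[sa]

for _ in range(5000):
    episode(expert_user=random.random() < 0.5)   # mixed user population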
Training
Task Completion Reward (TCR) = 1000
Cost of CRs (CCR) = −30 × n(CR)
Final Reward = TCR + CCR

Fig 3. Training graph with a mixed user population of experts and novices.
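For example, a completed dialogue in which the user made two clarification requests earns a final reward of 1000 − 2 × 30 = 940, which is the scale on which the average rewards below should be read.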
Evaluation

Policy        Avg. Reward   Avg. CRs
LP            944.09*       1.8
Random        919.31*       2.69
Desc only     948.19        1.75
Jargon only   885.35*       3.82
The current Learned policy (LP) is better than the Random and Jargon-only policies, and is currently as good as the Desc-only policy. Future work will explore much longer training runs and different learning parameters to produce better policies.

PRE-CogSci 2009, Amsterdam