Reinforcement Learning for Adaptive Dialogue Systems PART III: Simulation-based Dialogue Strategy Optimisation

Oliver Lemon

Verena Rieser

School of Informatics, University of Edinburgh
For updated course materials see: http://sites.google.com/site/olemon/eacl09

EACL tutorial, March 2009

Outline
- Simulation-based Reinforcement Learning
- Data Collection and Corpus requirements
- Simulated environments for dialogue optimisation
  - State-Action Space
  - Noise model
  - Simulated Users
  - Data-driven reward modelling
- Policy training and evaluation


Simulation-based RL

Components of a simulated dialogue environment:
- State-action space and dialogue processing constraints (e.g. Information State Update rules defining the dialogue “skeleton”)
- User simulation
- Noise/error model
- Reward function
→ trained via Supervised Learning (SL).

Dialogue simulation on the Speech Act level

Figure: The user action (a_u), the noisy state estimate of the user action (s~_u), the system action based on the noisy state estimate (a~_s), and the system action given the current state (a_s).
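To make the flow between these components concrete, here is a minimal toy sketch (not from the tutorial) of the simulation loop at the speech act level; the user model, noise model, and policy are illustrative placeholders rather than components trained from data.

```python
import random

# Toy simulation loop at the speech act level (illustrative placeholders only;
# in practice these components are trained from WOZ data as described here).

USER_ACTS = ["provide_info", "confirm", "negate"]

def user_model(state):
    # placeholder user simulation: pick a user act at random
    return random.choice(USER_ACTS)

def noise_model(a_u, confusion_rate=0.2):
    # placeholder channel noise: confuse the user act with some probability
    return a_u if random.random() > confusion_rate else random.choice(USER_ACTS)

def policy(state, s_u_noisy):
    # placeholder system policy: confirm whenever the (noisy) act looks unreliable
    return "explicit_confirm" if s_u_noisy != "provide_info" else "ask_next_slot"

state = {"turn": 0}
for _ in range(5):
    a_u = user_model(state)    # user action a_u
    s_u = noise_model(a_u)     # noisy estimate s~_u of the user action
    a_s = policy(state, s_u)   # system action a_s chosen from the noisy estimate
    state["turn"] += 1
    print(state["turn"], a_u, s_u, a_s)
```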

Advantages of simulation-based RL

- Large amounts of artificial data can be generated.
- Supervised Learning of the simulated components requires less training data.
- Online learning is possible.
- Can discover strategies which are not in the initial data set.

Challenges for simulation-based RL

- The quality of the learned strategy depends on the quality of the simulated environment.
- The reward function needs to be explicitly set.
- Transferability to real dialogue settings?
- Often no suitable in-domain data is available to train the simulations from (a chicken-and-egg problem).

Learning for new applications: A chicken-and-egg problem


Two approaches to overcome the problem:
- Hand-craft probabilities, update and retrain [Schatzmann et al., 2007a].
- Start by exploring Wizard-of-Oz (WOZ) data [Williams and Young, 2004, Prommer et al., 2006, Rieser and Lemon, 2008c].

Wizard-of-Oz data collection [Fraser and Gilbert, 1991]

Example: “Bootstrapping” Approach from WOZ data I

1. Collect WOZ data [Fraser and Gilbert, 1991].
2. Train and test the simulated environment using Supervised Learning.
3. Train and evaluate dialogue policies in simulation using Reinforcement Learning.
4. Evaluate the learned strategies with real users!
5. Meta-evaluate the whole framework, e.g. show that results transfer.

Example: “Bootstrapping” Approach from WOZ data II


Corpus requirements for simulation-based RL

- Define an optimisation task: non-trivial decisions (fix what's obvious!).
- Coverage of the state-action space: explore competing strategies in context.
- User reactions for each state/action pair (→ train the user simulation).
- User ratings for each dialogue (→ model the reward function).

Example: SAMMIE data collection [Kruijff-Korbayová et al., 2006] I

- Domain: multimodal information-seeking dialogue for an in-car MP3 player.
- Actions: multimodal output options.
- 6 wizards, 21 subjects, 4 tasks each, ca. 1700 turns.
- Wizards are not restricted by a script.
- Input noise simulation (also see [Stuttle et al., 2004]).

Example: SAMMIE data collection [Kruijff-Korbayová et al., 2006] II

Example: SAMMIE data collection [Kruijff-Korbayová et al., 2006] III

Example:
User: “Please search for music by Björk.”
Wizard: “I found 43 items. The items are displayed on the screen.” [displays list]
User: “Please select ‘Human Behaviour’.”

Example: Non-trivial trade-offs for multimodal information presentation

- Presentation timing:
  - when to present information,
  - how many pieces of information to present to the user,
  - or whether to ask for further constraints.
- Presentation mode: how the retrieved items are presented:
  - multimodal mode (screen and speech),
  - verbal mode (uni-modal).
→ a hierarchical decision problem.


Action set for dialogue optimisation

Action set: defines the set of possible choices available to the learner at each state.
- Conventionally predefined by the system designer, e.g. [Singh et al., 2002, Walker, 2000, Henderson et al., 2008].
- WOZ data allows one to study human behaviour first, e.g. [Williams and Young, 2004, Levin and Passonneau, 2006].

State space for dialogue optimisation

State space: defines the agent's view of the environment.
- System runtime features, e.g. information_provided = yes/no.
- Often manually selected; see the critique in [Paek, 2006].
- Scalability issue for RL algorithms.
- As few features as possible, each as informative as possible.
- Automatic feature selection techniques [Rieser and Lemon, 2006b].

Example: Hierarchical State-Action space

acquisition    action: { askASlot | implConfAskASlot | explConf | presentInfo }
               state:  { filledSlot 1|2|3|4: 0,1 ; confirmedSlot 1|2|3|4: 0,1 ; DB: 1-438 }

presentation   action: { presentInfoVerbal | presentInfoMM }
               state:  { DB low: 0,1 ; DB med: 0,1 ; DB high: 0,1 }
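One way to hold such a hierarchical state-action space in code is sketched below; the dictionary encoding and the helper function are illustrative choices, not the representation used in the original system.

```python
# Sketch of the hierarchical state-action space above as plain data structures.
# Names follow the slide; the encoding itself is an illustrative assumption.

ACQUISITION_ACTIONS = ["askASlot", "implConfAskASlot", "explConf", "presentInfo"]
PRESENTATION_ACTIONS = ["presentInfoVerbal", "presentInfoMM"]

acquisition_state = {
    "filledSlot":    {1: 0, 2: 0, 3: 0, 4: 0},   # binary: is each slot filled?
    "confirmedSlot": {1: 0, 2: 0, 3: 0, 4: 0},   # binary: is each slot confirmed?
    "DB": 438,                                   # number of matching database items (1-438)
}

presentation_state = {
    "DB_low": 0,    # binary indicators for the discretised database size
    "DB_med": 0,
    "DB_high": 1,
}

def available_actions(level):
    # the learner chooses from a different action set at each level of the hierarchy
    return ACQUISITION_ACTIONS if level == "acquisition" else PRESENTATION_ACTIONS

print(available_actions("presentation"))
```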

Noise models for dialogue optimisation

Noise model: simulates the channel noise, introduced by Automatic Speech Recognition (ASR), and is often measured in terms of word error rate (WER) or concept error rate (CER). The system's next action (a_s) is based on a noisy estimate (s~_u) of the user's action (a_u):

a_u → s~_u → a~_s → a_s
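As a concrete illustration of the slot-level view discussed next, the sketch below corrupts the slot values of a user act with a fixed concept error rate; the error rate and the confusion sets are invented for illustration, not taken from any of the cited systems.

```python
import random

# Slot-level noise model: with a fixed concept error rate (CER), replace the
# value of each slot in the user's act by a confusable value (illustrative).

CONFUSABLE = {
    "Nirvana": ["Madonna", "Bjork"],      # hypothetical confusion sets
    "MTV Unplugged": ["Nevermind"],
}

def corrupt(user_act, cer=0.3):
    """Return a noisy estimate s~_u of the user act a_u (a slot-value dict)."""
    noisy = dict(user_act)
    for slot, value in user_act.items():
        if random.random() < cer:
            noisy[slot] = random.choice(CONFUSABLE.get(value, [value]))
    return noisy

print(corrupt({"artist": "Nirvana", "album": "MTV Unplugged"}))
```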

Previous approaches to noise modelling

- Slot level: Fixed error rate per dialogue, e.g. [Pietquin and Renals, 2002, Henderson et al., 2008, Rieser and Lemon, 2008b].
- String level:
  - Simulate phone-level confusions, e.g. [Pietquin, 2004, Stuttle et al., 2004, Deng et al., 2003].
  - Simulate word-level confusions, e.g. [Pietquin and Dutoit, 2006b, Schatzmann et al., 2007b].

Noise model evaluation

- Intrinsic: Similarity of the generated user utterances between the simulated and the real (initial) corpus, e.g. [Schatzmann et al., 2007b].
- Extrinsic: Policy performance in simulation, e.g. [Pietquin and Beaufort, 2005, Pietquin and Dutoit, 2006a].

Simulated Users for dialogue optimisation

User simulation: a predictive model of the next user action (a_u) in a specific context/state (S):

S → a_u → a_s → S' → ...

Purpose:
- Automatic evaluation, e.g. [Chung, 2004, López-Cózar et al., 2003, Möller et al., 2006] → realistic estimate of the expected results with real users.
- Automatic training, e.g. [Georgila et al., 2006, Schatzmann et al., 2006, Ai et al., 2007] → exploration of the complete set of (all possible) user actions in a state.

Previous approaches to user simulations I

Level of abstraction:
- Acoustic level, e.g. [López-Cózar et al., 2003, Chung, 2004, Filisko and Seneff, 2006].
- Word level, e.g. [Watanabe et al., 1998].
- Intention level, used by most RL approaches, introduced by [Eckert et al., 1997].

Previous approaches to user simulations II

Issues:
- Training using n-grams (see the sketch after this list):
  â_{u,t} ≈ argmax_{a_{u,t}} P(a_{u,t} | a_{s,t−1})
- User goal modelling for task consistency.
- Amount of training data required.
- Intrinsic vs. extrinsic evaluation.
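A minimal sketch of how such a bi-gram user simulation could be estimated by counting system-act/user-act pairs in an annotated corpus; the corpus format and act names are invented for illustration.

```python
import random
from collections import Counter, defaultdict

# Bi-gram user simulation: estimate P(a_u,t | a_s,t-1) by counting
# (system act, following user act) pairs in an annotated corpus (illustrative).

corpus = [  # hypothetical annotated pairs
    ("askASlot", "provide_info"), ("askASlot", "provide_info"),
    ("explConf", "yes_answer"), ("explConf", "negate"),
]

counts = defaultdict(Counter)
for a_s, a_u in corpus:
    counts[a_s][a_u] += 1

def sample_user_act(last_system_act):
    """Sample a user act from the estimated conditional distribution."""
    dist = counts[last_system_act]
    acts, freqs = zip(*dist.items())
    return random.choices(acts, weights=freqs, k=1)[0]

print(sample_user_act("explConf"))
```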

Example: Cluster-based user simulations from small data sets

- Problem: Data sparsity → the user simulation is not complete; bad for training!
- Idea: Similar to real users within similar contexts/clusters (but not identical), to allow exploration of unseen state-action pairs.
- Method: Bi-gram model based on clusters of similar system states [Rieser and Lemon, 2006a] (see the sketch below):
  â_{u,t} ≈ argmax_{a_{u,t}} P(a_{u,t} | cluster_{s,t−1})
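The cluster-based variant can be sketched by conditioning the same counts on a cluster of the system state rather than on the exact previous system act; the clustering function and the counts below are hypothetical stand-ins for what would be learned from data.

```python
import random

# Cluster-based user simulation sketch: condition on a cluster of similar
# system states so that unseen state-action pairs can still be explored.
# cluster_of() and cluster_counts are hypothetical stand-ins.

def cluster_of(system_state):
    # e.g. group states by how many slots are still open
    open_slots = sum(1 for filled in system_state["filledSlot"].values() if filled == 0)
    return "many_open" if open_slots >= 2 else "few_open"

cluster_counts = {
    "many_open": {"provide_info": 8, "ask_help": 1},
    "few_open":  {"yes_answer": 5, "provide_info": 2},
}

def sample_user_act(system_state):
    dist = cluster_counts[cluster_of(system_state)]
    acts, freqs = zip(*dist.items())
    return random.choices(acts, weights=freqs, k=1)[0]

state = {"filledSlot": {1: 1, 2: 0, 3: 0, 4: 0}}
print(sample_user_act(state))
```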

Example: Cluster-based vs. bi-gram user simulations

Table: Bi-gram model (left) vs. cluster-based model (right)

Evaluating user simulations I

Intrinsic:
- (Expected) accuracy, recall, and precision with respect to the user population in the initial data set [Schatzmann et al., 2005a, Georgila et al., 2006].
- Perplexity [Georgila et al., 2005].
- ‘High-level’ comparison of generated and real corpora [Scheffler and Young, 2001, Schatzmann et al., 2005a, Ai and Litman, 2006].
- HMM similarity of real and generated dialogues [Cuayáhuitl et al., 2005].
- Cramér-von Mises divergence [Williams, 2007].

Extrinsic:
- Policy performance in simulation [Schatzmann et al., 2005b, Ai et al., 2007, Lemon and Liu, 2007].

Example: SUPER evaluation

- Simulated User Pragmatic Error Rate (SUPER) [Rieser and Lemon, 2006a].
- Simulated users should show varying, but also complete and consistent, behaviour in a certain context.
- Variation (V), consistency (no insertions, I), completeness (no deletions, D):

  SUPER = (1/m) Σ_{k=1}^{m} (V + I + D) / n

  ...where n is the number of possible actions in a context, and m is the number of contexts.
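A sketch of how the SUPER score could be computed, assuming the per-context variation, insertion, and deletion counts have already been determined (how those counts are derived follows [Rieser and Lemon, 2006a] and is not reproduced here).

```python
# SUPER = (1/m) * sum over the m contexts of (V + I + D) / n,
# where n is the number of possible actions in a context.
# The per-context counts below are hypothetical.

def super_score(contexts):
    """contexts: list of dicts with keys 'V', 'I', 'D' and 'n'."""
    m = len(contexts)
    return sum((c["V"] + c["I"] + c["D"]) / c["n"] for c in contexts) / m

contexts = [
    {"V": 2, "I": 0, "D": -1, "n": 4},
    {"V": 1, "I": -1, "D": 0, "n": 3},
]
print(super_score(contexts))
```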

Example: SUPER results

SUPER scores for the different user simulations:

parameters   smoothed   cluster   random    majority
Training       59.47     19.15     -0.83     -24.90
Testing        -5.74     -4.89    -17.38     -29.90

- Training parameters: more variation, ε = 0.1 and δ = 0.4.
- Testing parameters: more realistic, ε = 0.05 and δ = 0.1.

Reward function for dialogue optimisation

Reward function: defines a mapping r(d, i) from a dialogue d and a position i in that dialogue to a reward value r.
- Final value of the “goodness” of a dialogue, e.g. task success, user satisfaction, etc.
- Also known as: objective function, evaluation function.
- Reading material: [Walker, 2005].

Previous approaches to reward modelling I

- “The most hand-crafted aspect of Reinforcement Learning” [Paek, 2006].
- Manually constructed, e.g. [Levin et al., 2000, Frampton and Lemon, 2006, Williams and Young, 2007].
  - Example: (−1) for each turn; (+10) for task success (see the sketch below).
- Obtained from data [Walker, 2000, Rieser and Lemon, 2008c].
  - PARADISE framework [Walker et al., 2000]: subjective User Satisfaction is modelled as a weighted sum of normalised objective measures C_i:

    US = Σ_{i=1}^{n} w_i × N(C_i)
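A minimal sketch of the manually constructed reward mentioned above (−1 per turn, +10 for task success); the dialogue representation is a hypothetical dictionary.

```python
# Hand-crafted reward: -1 for each turn, +10 for task success
# (the dialogue representation is illustrative).

def reward(dialogue):
    turn_penalty = -1 * dialogue["num_turns"]
    success_bonus = 10 if dialogue["task_success"] else 0
    return turn_penalty + success_bonus

print(reward({"num_turns": 6, "task_success": True}))   # -> 4
```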

Example: reward functions for information presentation

TaskEase = −20.2 × dialogueLength + 11.8 × taskCompletion + 8.7 × multimodalScore

- Stepwise linear regression for feature selection (information gain analysis); a simplified regression sketch follows this list.
- Subjective target variable (= what you care about), e.g. Task Ease, User Satisfaction, Learning Gain, etc.
- Objective, system-runtime input features (= what you can control), e.g. number of turns.
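A sketch of fitting such a PARADISE-style reward model from logged dialogues with numpy; plain least squares stands in for the stepwise regression, and the feature values and ratings are invented.

```python
import numpy as np

# PARADISE-style reward model: regress a subjective target (e.g. Task Ease)
# on objective, system-runtime features. Plain least squares is used here as a
# simplification of stepwise regression; the data are invented.

# feature columns: dialogueLength, taskCompletion, multimodalScore
X = np.array([
    [10, 1, 0.8],
    [25, 0, 0.2],
    [ 7, 1, 0.9],
    [18, 1, 0.4],
], dtype=float)
y = np.array([5.5, 2.0, 6.0, 4.0])   # user-rated Task Ease per dialogue

X1 = np.hstack([X, np.ones((len(X), 1))])          # add an intercept column
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)   # fitted regression weights
print(weights)   # reward(d) = weights[:3] . features(d) + weights[3]
```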

Example: non-linear reward functions for multimodal presentation

[Plot omitted: user score as a function of the number of items presented, for multimodal presentation MM(x) and verbal presentation Speech(x), with a turning point at 14.8 items and an intersection point of the two curves.]

Figure: Evaluation functions relating the number of items presented in different modalities to multimodal score [Rieser and Lemon, 2008c].

Evaluating PARADISE models

- Intrinsic: Goodness of fit R² [Möller et al., 2007].
- Extrinsic: Prediction performance [Walker et al., 2000, Engelbrecht and Möller, 2007].
- Meta: Model stability across user populations [Rieser and Lemon, 2008a].


Policy training in simulation

- Simulated environment.
- RL algorithm (SARSA, Q-learning, etc.); a compact SARSA sketch follows this list.
- Log runtime features, simulated dialogue moves, etc.
- Visualise what was learned.
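As an illustration of the training loop, here is a compact SARSA sketch on a toy slot-filling task with a trivial simulated user and a noise-free channel; it shows the algorithm family only, not the REALL-DUDE setup or the SHARSHA variant used below.

```python
import random
from collections import defaultdict

# SARSA on a toy slot-filling dialogue: states are the number of filled slots,
# actions are dialogue moves, the simulated user always answers, and the reward
# is -1 per turn plus +10 when all slots are filled at presentation time.

ACTIONS = ["askSlot", "presentInfo"]
N_SLOTS = 3
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
Q = defaultdict(float)

def choose(state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                 # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit

def step(state, action):
    """Simulated environment: returns (next_state, reward, done)."""
    if action == "askSlot" and state < N_SLOTS:
        return state + 1, -1, False
    if action == "presentInfo":
        return state, -1 + (10 if state == N_SLOTS else 0), True
    return state, -1, False

for episode in range(2000):
    s, a, done = 0, choose(0), False
    while not done:
        s2, r, done = step(s, a)
        a2 = choose(s2)
        target = r if done else r + GAMMA * Q[(s2, a2)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # on-policy SARSA update
        s, a = s2, a2

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_SLOTS + 1)})
```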

Policy training using SHARSHA [Shapiro and Langley, 2002]

Training environment REALL-DUDE [Lemon et al., 2006b]

Policy testing in simulation

- Test with a different user simulation, otherwise it's “cheating” [Paek, 2006].
- Use an informative baseline.
- Use a t-test (or equivalent) to compare final reward, dialogue length, etc., for significant improvements (see the sketch after this list).
- Interpret what was learned.
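A sketch of such a significance test on, for example, the final reward per dialogue, using scipy; the reward values are invented placeholders.

```python
from scipy import stats

# Compare final rewards per simulated dialogue between the baseline policy and
# the learned policy with an independent-samples t-test (placeholder numbers).

baseline_rewards = [12.0, 8.5, 15.0, 9.0, 11.5, 10.0]
learned_rewards  = [16.0, 14.5, 18.0, 13.0, 17.5, 15.0]

t, p = stats.ttest_ind(learned_rewards, baseline_rewards)
print(f"t = {t:.2f}, p = {p:.4f}")   # claim an improvement only if p is below the chosen threshold
```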

Example: Baseline

- A baseline should allow meaningful comparisons.
- Our baseline: rule-based Supervised Learning (SL) on the WOZ data.
- Allows us to measure the relative improvement over the average wizard strategy (= human performance) present in the initial data.

Example: Simulated dialogue for multimodal information presentation

RL policy: greet (db: 438)
sim. User: prvAsked(artist=Nirvana)
RL policy: implConf(artist=Nirvana), AskASlot(album=?) (db: 26)
sim. User: prvAsked(album=MTV Unplugged)
RL policy: present[mm](artist=Nirvana, album=MTV Unplugged) (db: 14)
sim. User: click(song-title=On a Plain)
RL policy: present[verbal] (db: 1)
sim. User: yes-answer(yes)

Separation of speech acts (e.g. prvAsked) and the task model (e.g. artist=Nirvana), for both the user simulation and the learned strategy.

Policy testing with real users

- Integrate the learned policy into a working system, e.g. using DUDE [Lemon and Liu, 2006].
- Log system runtime features.
- Collect subjective measures using a questionnaire.
- Use statistical tests to show significant differences, e.g. the Wilcoxon signed-rank test for the subjective measures.
- Interpret the results, e.g. was there anything which influenced the ratings that you failed to model in the simulated environment?

Example Questionnaire, adapted from PARADISE

1. In this conversation, it was easy to find what I was searching for. (Task Ease)
2. The system worked the way I expected it to, in this conversation. (Expected Behaviour)
3. In this task, I thought the system had no problems understanding me. (NLU Performance)
4. In this task, the system was easy to understand. (TTS Performance)
5. In this task, I thought the system chose to present the search results at the right time. (Presentation Timing)
6. In this task, I thought the number of items displayed on the screen was right. (MM Presentation)
7. In this task, I thought the amount of information presented in each spoken output was right. (Verbal Presentation)
8. In this task, I found that searching for music distracted me from the driving simulation. (Cognitive Load)
9. Based on my experience in this conversation, I would like to use this system regularly. (Future Use)

Example: Results subjective ratings

measure                 SL Baseline      RL Strategy
Task Ease               4.78 (±1.84)     5.51 (±1.44)***
Expected Behaviour      4.79 (±1.68)     5.48 (±1.27)***
NLU Performance         4.92 (±1.61)     5.43 (±1.43)
TTS Performance         4.73 (±1.23)     5.18 (±1.23)
Presentation Timing     4.42 (±1.84)     5.36 (±1.46)***
MM Presentation         4.57 (±1.87)     5.32 (±1.62)***
Verbal Presentation     4.94 (±1.52)     5.55 (±1.38)***
Cognitive Load          5.06 (±1.19)     4.85 (±1.37)
Future Use              3.86 (±1.44)     4.68 (±1.39)***

Table: *** denotes statistical significance at p < .001

Example: Results objective measures

Measure            SL baseline         RL Strategy
av. turns          5.86 (±3.2)         5.07 (±2.9)***
av. speech items   1.29 (±.4)          1.2 (±.4)
av. MM items       52.2 (±68.5)        8.73 (±4.4)***
av. reward         -628.2 (±178.6)     37.62 (±60.7)***

Table: *** denotes a significant difference between SL and RL at p < .001

Transfer between real and simulated environments

- Are results obtained in simulation a valid estimate of real dialogues? [Lemon et al., 2006a, Rieser and Lemon, 2008c]
- Show that objective measures (e.g. dialogue length) are transferable (= not significantly different).
- Use the data from the user tests to retrain your simulations, e.g. the reward function [Rieser and Lemon, 2008a].
- Compare the initial and updated models (meta-evaluation).

Example: Do results transfer?

                     RL Strategy
meas./env.       SIM                REAL
av. turns        5.9 (±2.4)         5.07 (±2.9)
av. speech       1.1 (±.3)          1.2 (±.4)
av. MM           11.2 (±2.4)        8.73 (±4.4)
av. reward       44.06 (±51.5)      37.62 (±60.7)

Example: Re-training the reward function

Summary: Bootstrapping approach

1. Collect data in a WOZ experiment.
2. Construct a simulated environment using Supervised Learning techniques.
3. Train and evaluate dialogue policies in simulation.
4. Test with real users.
5. Show that the results of the simulated and real interactions are compatible.

Verena Rieser. Bootstrapping Reinforcement Learning-based Dialogue Strategies from Wizard-of-Oz data. Saarbruecken Dissertations in Computational Linguistics and Language Technology, Vol. 28 [Rieser, 2008]. http://homepages.inf.ed.ac.uk/vrieser/thesis.html

References

Ai, H. and Litman, D. (2006). Comparing real-real, simulated-simulated, and simulated-real spoken dialogue corpora. In Proc. of the AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems.
Ai, H., Tetreault, J., and Litman, D. (2007). Comparing user simulation models for dialog strategy learning. In Proc. of the North American Meeting of the Association of Computational Linguistics (NAACL).
Chung, G. (2004). Developing a flexible spoken dialog system using simulation. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).
Cuayáhuitl, H., Renals, S., Lemon, O., and Shimodaira, H. (2005). Human-computer dialogue simulation using hidden Markov models. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
Deng, Y., Mahajan, M., and Acero, A. (2003). Estimating speech recognition error rate without acoustic test data. In Proc. of the European Conference on Speech Communication and Technology (Eurospeech).
Eckert, W., Levin, E., and Pieraccini, R. (1997). User modeling for spoken dialogue system evaluation. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
Engelbrecht, K.-P. and Möller, S. (2007). Pragmatic usage of linear regression models for the predictions of user judgments. In Proc. of the 8th SIGdial Workshop on Discourse and Dialogue.
Filisko, E. and Seneff, S. (2006). Learning decision models in spoken dialogue systems via user simulation. In Proc. of the AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems.
Frampton, M. and Lemon, O. (2006). Learning more effective dialogue strategies using limited dialogue move features. In Proc. of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL).
Fraser, N. M. and Gilbert, G. N. (1991). Simulating speech systems. Computer Speech and Language, 5:81–99.
Georgila, K., Henderson, J., and Lemon, O. (2005). Learning user simulations for information state update dialogue systems. In Proc. of the International Conference of Spoken Language Processing (Interspeech/ICSLP).
Georgila, K., Henderson, J., and Lemon, O. (2006). User simulation for spoken dialogue systems: Learning and evaluation. In Proc. of the International Conference of Spoken Language Processing (Interspeech/ICSLP).
Henderson, J., Lemon, O., and Georgila, K. (2008). Hybrid reinforcement/supervised learning of dialogue policies from fixed datasets. Computational Linguistics (to appear).
Kruijff-Korbayová, I., Becker, T., Blaylock, N., Gerstenberger, C., Kaisser, M., Poller, P., Rieser, V., and Schehl, J. (2006). The SAMMIE corpus of multimodal dialogues with an MP3 player. In Proc. of the 5th International Conference on Language Resources and Evaluation (LREC).
Lemon, O., Georgila, K., and Henderson, J. (2006a). Evaluating effectiveness and portability of reinforcement learned dialogue strategies with real users: the TALK TownInfo evaluation. In Proc. of the IEEE/ACL Workshop on Spoken Language Technology (SLT).
Lemon, O. and Liu, X. (2006). DUDE: a dialogue and understanding development environment, mapping business process models to Information State Update dialogue systems. In Proc. of the Conference of the European Chapter of the ACL (EACL).
Lemon, O. and Liu, X. (2007). Dialogue policy learning for combinations of noise and user simulations: transfer results. In Proc. of the 8th SIGdial Workshop on Discourse and Dialogue, pages 55–58.
Lemon, O., Liu, X., Shapiro, D., and Tollander, C. (2006b). Hierarchical Reinforcement Learning of dialogue policies in a development environment for dialogue systems: REALL-DUDE. In Proc. of the 10th SEMdial Workshop on the Semantics and Pragmatics of Dialogues.
Levin, E. and Passonneau, R. (2006). A WOz variant with contrastive conditions. In Proc. of the Dialog-on-Dialog Workshop, Interspeech.
Levin, E., Pieraccini, R., and Eckert, W. (2000). A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1).
López-Cózar, R., la Torre, A. D., Segura, J. C., and Rubio, A. J. (2003). Assessment of dialogue systems by means of a new simulation technique. Speech Communication, 40(3):387–407.
Möller, S., Englert, R., Engelbrecht, K., Hafner, V., Jameson, A., Oulasvirta, A., Raake, A., and Reithinger, N. (2006). MeMo: Towards automatic usability evaluation of spoken dialogue services by user error simulations. In Proc. of the International Conference of Spoken Language Processing (Interspeech/ICSLP).
Möller, S., Smeele, P., Boland, H., and Krebber, J. (2007). Evaluating spoken dialogue systems according to de-facto standards: A case study. Computer Speech & Language, 21(1):26–53.
Paek, T. (2006). Reinforcement Learning for spoken dialogue systems: Comparing strengths and weaknesses for practical deployment. In Proc. of the Dialog-on-Dialog Workshop, Interspeech.
Pietquin, O. (2004). A Framework for Unsupervised Learning of Dialogue Strategies. PhD thesis, Faculté Polytechnique de Mons.
Pietquin, O. and Beaufort, R. (2005). Comparing ASR modeling methods for spoken dialogue simulation and optimal strategy learning. In Proc. of the International Conference of Spoken Language Processing (Interspeech/ICSLP).
Pietquin, O. and Dutoit, T. (2006a). Dynamic Bayesian networks for NLU simulation with application to dialog optimal strategy learning. In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Pietquin, O. and Dutoit, T. (2006b). A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech and Language Processing, 14(2):589–599.
Pietquin, O. and Renals, S. (2002). ASR system modeling for automatic evaluation and optimization of dialogue systems. In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Prommer, T., Holzapfel, H., and Waibel, A. (2006). Rapid simulation-driven Reinforcement Learning of multimodal dialog strategies in human-robot interaction. In Proc. of the International Conference of Spoken Language Processing (Interspeech/ICSLP).
Rieser, V. (2008). Bootstrapping Reinforcement Learning-based Dialogue Strategies from Wizard-of-Oz data. PhD thesis, Saarbruecken Dissertations in Computational Linguistics and Language Technology, Vol. 28.
Rieser, V. and Lemon, O. (2006a). Cluster-based user simulations for learning dialogue strategies. In Proc. of the 9th International Conference of Spoken Language Processing (Interspeech/ICSLP).
Rieser, V. and Lemon, O. (2006b). Using Machine Learning to explore human multimodal clarification strategies. In Proc. of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL).
Rieser, V. and Lemon, O. (2008a). Automatic learning and evaluation of user-centered objective functions for dialogue system optimisation. In Proc. of the 6th International Conference on Language Resources and Evaluation (LREC).
Rieser, V. and Lemon, O. (2008b). Does this list contain what you were searching for? Learning adaptive dialogue strategies for Interactive Question Answering. J. Natural Language Engineering, 15(1).
Rieser, V. and Lemon, O. (2008c). Learning effective multimodal dialogue strategies from Wizard-of-Oz data: Bootstrapping and evaluation. In Proc. of the 21st International Conference on Computational Linguistics and 46th Annual Meeting of the Association for Computational Linguistics (ACL/HLT).
Schatzmann, J., Georgila, K., and Young, S. (2005a). Quantitative evaluation of user simulation techniques for spoken dialogue systems. In Proc. of the 6th SIGdial Workshop on Discourse and Dialogue.
Schatzmann, J., Stuttle, M., Weilhammer, K., and Young, S. (2005b). Effects of the user model on simulation-based learning of dialogue strategies. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
Schatzmann, J., Thomson, B., Weilhammer, K., Ye, H., and Young, S. (2007a). Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Proc. of the North American Meeting of the Association of Computational Linguistics (NAACL).
Schatzmann, J., Thomson, B., and Young, S. (2007b). Error simulation for training statistical dialogue systems. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
Schatzmann, J., Weilhammer, K., Stuttle, M., and Young, S. (2006). A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review, 21(2):97–126.
Scheffler, K. and Young, S. (2001). Corpus-based dialogue simulation for automatic strategy learning and evaluation. In Proc. of the NAACL Workshop on Adaptation in Dialogue Systems.
Shapiro, D. and Langley, P. (2002). Separating skills from preference: Using learning to program by reward. In Proc. of the 19th International Conference on Machine Learning (ICML).
Singh, S., Litman, D., Kearns, M., and Walker, M. (2002). Optimizing dialogue management with Reinforcement Learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133.
Stuttle, M. N., Williams, J. D., and Young, S. (2004). A framework for dialogue data collection with a simulated ASR channel. In Proc. of the International Conference of Spoken Language Processing (Interspeech/ICSLP).
Walker, M. (2000). An application of Reinforcement Learning to dialogue strategy selection in a spoken dialogue system for email. Journal of Artificial Intelligence Research, 12:387–416.
Walker, M. (2005). Can we talk? Methods for evaluation and training of spoken dialogue systems. Language Resources and Evaluation, 39(1):65–75.
Walker, M., Kamm, C., and Litman, D. (2000). Towards developing general models of usability with PARADISE. Natural Language Engineering, 6(3).
Watanabe, T., Araki, M., and Doshita, S. (1998). Evaluating dialogue strategies under communication errors using computer-to-computer simulation. IEICE Transactions on Information and Systems, E81-D(9):1025–1033.
Williams, J. (2007). A method for evaluating and comparing user simulations: The Cramér-von Mises divergence. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
Williams, J. and Young, S. (2004). Using Wizard-of-Oz simulations to bootstrap Reinforcement-Learning-based dialog management systems. In Proc. of the 4th SIGdial Workshop on Discourse and Dialogue.
Williams, J. and Young, S. (2007). Partially Observable Markov Decision Processes for spoken dialog systems. Computer Speech and Language, 21(2):231–422.
