Decision Theoretic Behavior Composition
Nitin Yadav and Sebastian Sardina
RMIT University, Melbourne, Australia
In Proceedings of Autonomous Agents and Multi-Agent Systems (AAMAS), Taipei, Taiwan, 2011.

The Behavior Composition Problem
Task: A controller realizes a virtual target behavior T by coordinating the available behaviors B1, ..., Bn operating in a shared environment E (a toy encoding of this setting is sketched below).

[Figure: the garden example. A Controller realizes the Target Garden Bot by delegating its action requests to the available Multi Bot, Plucker Bot, and Cleaner Bot, all acting in the garden environment.]
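To make the setting concrete, the sketch below models behaviors and the environment as finite transition systems, as in classical behavior composition. The dictionary encoding and all state/action names (b0, c0, e0, ...) are illustrative assumptions, not the paper's notation.

    # A minimal sketch of the classical composition setting (illustrative
    # encoding; state and action names are hypothetical).

    # A behavior maps (state, action) to the set of possible successor
    # states; nondeterminism shows up as sets with more than one element.
    plucker_bot = {
        ("b0", "pluck"): {"b0"},        # Plucker Bot can always pluck
    }
    cleaner_bot = {
        ("c0", "clean"): {"c0", "c1"},  # cleaning may degrade the bot
        ("c1", "clean"): {"c1"},
    }

    # The environment constrains which actions are executable where.
    environment = {
        ("e0", "clean"): {"e0", "e1"},
        ("e1", "pluck"): {"e0"},
        ("e1", "water"): {"e1"},
    }

    def can_delegate(behavior, b_state, env, e_state, action):
        """An action can be delegated to a behavior only if both the
        behavior and the environment allow it in their current states."""
        return (b_state, action) in behavior and (e_state, action) in env

    print(can_delegate(cleaner_bot, "c0", environment, "e0", "clean"))  # True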
Motivations & Objectives

Classical behavior composition approaches:
• operate under strict uncertainty;
• deal only with exact solutions;
• lack an "optimality" notion for problems without an exact solution.

Our contribution: handle (unsolvable) problems admitting only non-exact solutions by:
1. quantifying the sources of uncertainty:
   • non-determinism in the environment;
   • non-determinism in the behaviors;
   • action requests in the target;
2. defining optimality notions based on the target's "expected realizability";
3. reducing the composition problem to an MDP.
Decision Theoretic Composition Problem

The classical setting is extended with (a fragment is sketched below):
• Stochastic transition evolutions in the available behaviors and the environment.
• A reward for each action request.
• A stochastic model of the target's action requests.

[Figure: the garden example with stochastic models. The environment (states e0–e3) and the Cleaner Bot (states a0, a1) evolve stochastically under actions water, pluck, clean, and empty (e.g., clean: 0.8 / 0.2, pluck: 0.25 / 0.75); the Target Bot (states t0–t3) issues requests with probabilities and rewards, e.g., ⟨pluck: 0.3, 1⟩, ⟨water: 0.7, 1⟩, ⟨clean, 1⟩, ⟨empty, 1⟩.]
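The stochastic models can be written down directly as probability-weighted transition tables, replacing the successor sets of the classical sketch above with distributions. The numbers below are illustrative, loosely read off the garden figure.

    import random

    # Stochastic transitions: (state, action) -> list of (successor, prob).
    cleaner_bot = {
        ("a0", "clean"): [("a0", 0.9), ("a1", 0.1)],
        ("a1", "clean"): [("a0", 0.2), ("a1", 0.8)],
    }
    environment = {
        ("e0", "clean"): [("e0", 0.8), ("e1", 0.2)],
        ("e1", "pluck"): [("e0", 0.25), ("e1", 0.75)],
    }

    def sample_successor(model, state, action):
        """Draw a successor according to the transition distribution."""
        succs = model[(state, action)]
        u, acc = random.random(), 0.0
        for nxt, prob in succs:
            acc += prob
            if u <= acc:
                return nxt
        return succs[-1][0]  # guard against floating-point round-off

    print(sample_successor(cleaner_bot, "a0", "clean"))
    print(sample_successor(environment, "e1", "pluck"))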
Controller Evaluation

Value of a controller: measures the degree of the target's expected realizability.

Reward gained on (a worked example follows):
• a successful action delegation: probability of the action request × reward for the action request;
• one target step (Ri): sum of the points gained for each legal target action request;
• k steps of the whole system: R1 + R2 + ... + Rk;
• infinite runs of the whole system: R1 + α·R2 + α²·R3 + ..., using a discount factor 0 ≤ α < 1.

Result: Every exact solution is an optimal controller.
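As a sanity check of these definitions, the snippet below computes a one-step expected reward and the α-discounted value of a run. The request distribution matches the target figure; the discount factor and the constant-reward run are assumed for illustration.

    # One target step: sum over legal requests of P(request) * reward.
    # From the target figure: pluck requested w.p. 0.3, water w.p. 0.7,
    # both with reward 1, so the step reward is 0.3*1 + 0.7*1 = 1.0.
    requests = [("pluck", 0.3, 1), ("water", 0.7, 1)]
    step_reward = sum(p * r for _, p, r in requests)

    # Discounted value R1 + a*R2 + a^2*R3 + ... with 0 <= a < 1,
    # truncated to 50 steps for illustration.
    alpha = 0.9                        # assumed discount factor
    step_rewards = [step_reward] * 50  # a run with constant step reward
    value = sum(alpha**k * r for k, r in enumerate(step_rewards))

    print(step_reward)  # 1.0
    print(value)        # approaches step_reward / (1 - alpha) = 10.0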
Solution via reduction to MDP

MDP encoding M_S,T = ⟨Q, A, p, r⟩ (a toy construction is sketched below):
• Q is the finite set of states encoding the state of the system, the state of the target, and the next requested action;
• A = {1, ..., n, u} is the set of available behavior indexes;
• p(q, i, q′) is the stochastic transition function encoding the possible next system state and requested action;
• r(q, i) is the reward allocated on a correct delegation.

[Figure: Encoded MDP (partial). States combine the system state, the target state, and the pending request, e.g., ⟨a0, b0, c0, e0, t1, pluck⟩; edges are labeled with products of component probabilities, e.g., cleaner: 0.8 × 0.1 × 0.3.]
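A direct way to realize this encoding is to form Q from tuples of behavior, environment, and target states plus the pending request, and to obtain p by multiplying the component probabilities, as the cleaner: 0.8 × 0.1 × 0.3 edge labels suggest. The sketch below is a minimal, illustrative construction built on the toy models above; the dictionary interfaces and the reward rule are assumptions, not the paper's formulation.

    # Toy component models with illustrative numbers (see sketches above).
    cleaner = {("a0", "clean"): [("a0", 0.9), ("a1", 0.1)]}
    env     = {("e0", "clean"): [("e0", 0.8), ("e1", 0.2)]}
    # Target: state -> list of (next state, requested action, prob, reward).
    target  = {"t0": [("t1", "pluck", 0.3, 1), ("t2", "water", 0.7, 1)]}

    def transitions(q, i):
        """Successor distribution p(q, i, .): delegate the pending request
        to behavior i, then combine the behavior's, the environment's, and
        the next request's probabilities by multiplication (cf. the
        'cleaner: 0.8 x 0.1 x 0.3' edge labels in the figure)."""
        b, e, t, act = q
        out = []
        for b2, pb in cleaner.get((b, act), []):
            for e2, pe in env.get((e, act), []):
                for t2, act2, pr, _rew in target.get(t, []):
                    out.append(((b2, e2, t2, act2), pb * pe * pr))
        return out

    def reward(q, i):
        """r(q, i): positive reward only if the delegation is legal, i.e.
        the delegated behavior can actually execute the pending request."""
        b, e, t, act = q
        return 1 if (b, act) in cleaner else 0

    q0 = ("a0", "e0", "t0", "clean")
    for q2, prob in transitions(q0, 1):
        print(q2, round(prob, 3))  # eight successors; probabilities sum to 1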
Results

• An optimal policy for M_S,T ≡ an optimal controller for T in S.
• The existence of an exact controller can be checked by computing the optimal policy for a horizon equal to |Q| + 1 (a generic sketch follows).
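The existence check suggests a straightforward procedure: run finite-horizon value iteration for |Q| + 1 steps and inspect the resulting values. The sketch below is generic finite-horizon value iteration over an explicit MDP; the function interface and the toy two-state instance are assumptions for illustration.

    def finite_horizon_values(Q, A, p, r, horizon):
        """Finite-horizon value iteration: after h sweeps, V[q] is the best
        total reward achievable from q in h steps. p(q, a) returns a list
        of (successor, probability) pairs; r(q, a) the immediate reward."""
        V = {q: 0.0 for q in Q}
        for _ in range(horizon):
            V = {
                q: max(
                    r(q, a) + sum(prob * V[q2] for q2, prob in p(q, a))
                    for a in A
                )
                for q in Q
            }
        return V

    # Toy 2-state MDP (hypothetical numbers, just to exercise the code).
    Q = ["q0", "q1"]
    A = [1, "u"]
    def p(q, a):
        return [("q1", 1.0)] if (q, a) == ("q0", 1) else [(q, 1.0)]
    def r(q, a):
        return 1 if (q, a) == ("q0", 1) else 0

    # Per the result above, a horizon of |Q| + 1 suffices for the check.
    print(finite_horizon_values(Q, A, p, r, horizon=len(Q) + 1))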
Future work

• Apply machine learning:
  – Reinforcement learning: when the model of the system is unknown;
  – Evolutionary computation: build the controller incrementally.
• Include extended constraints, e.g., action empty must be feasible after action pluck has been executed.
• Include preferences, e.g., behavior Plucker-bot uses less energy than the Multi-bot.