Decision Theoretic Behavior Composition
Nitin Yadav and Sebastian Sardina
RMIT University, Melbourne, Australia
In Proceedings of Autonomous Agents and Multi-Agent Systems (AAMAS), Taipei, Taiwan, 2011.

The Behavior Composition Problem
Task: A controller realizes a virtual target behavior T by coordinating the available behaviors B1, ..., Bn operating in a shared environment E (a toy encoding of this setting is sketched below).

[Figure: the garden example. A Controller realizes the Target Garden Bot by delegating its action requests to the available Multi Bot, Plucker Bot, and Cleaner Bot, all acting in the garden environment.]
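To make the setting concrete, the sketch below models behaviors and the environment as finite transition systems, as in classical behavior composition. The dictionary encoding and all state/action names (b0, c0, e0, ...) are illustrative assumptions, not the paper's notation.

    # A minimal sketch of the classical composition setting (illustrative
    # encoding; state and action names are hypothetical).

    # A behavior maps (state, action) to the set of possible successor
    # states; nondeterminism shows up as sets with more than one element.
    plucker_bot = {
        ("b0", "pluck"): {"b0"},        # Plucker Bot can always pluck
    }
    cleaner_bot = {
        ("c0", "clean"): {"c0", "c1"},  # cleaning may degrade the bot
        ("c1", "clean"): {"c1"},
    }

    # The environment constrains which actions are executable where.
    environment = {
        ("e0", "clean"): {"e0", "e1"},
        ("e1", "pluck"): {"e0"},
        ("e1", "water"): {"e1"},
    }

    def can_delegate(behavior, b_state, env, e_state, action):
        """An action can be delegated to a behavior only if both the
        behavior and the environment allow it in their current states."""
        return (b_state, action) in behavior and (e_state, action) in env

    print(can_delegate(cleaner_bot, "c0", environment, "e0", "clean"))  # True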
Motivations & Objectives

Classical behavior composition approaches:
• operate under strict uncertainty;
• deal only with exact solutions;
• lack an "optimality" notion for problems without an exact solution.

Our contribution: handle (unsolvable) problems admitting only non-exact solutions by:
1. quantifying the sources of uncertainty:
   • non-determinism in the environment;
   • non-determinism in the behaviors;
   • action requests in the target;
2. defining optimality notions based on the target's "expected realizability";
3. reducing the composition problem to an MDP.
Decision Theoretic Composition Problem

The classical setting is extended with (a fragment is sketched below):
• Stochastic transition evolutions in the available behaviors and the environment.
• A reward for each action request.
• A stochastic model of the target's action requests.

[Figure: the garden example with stochastic models. The environment (states e0–e3) and the Cleaner Bot (states a0, a1) evolve stochastically under actions water, pluck, clean, and empty (e.g., clean: 0.8 / 0.2, pluck: 0.25 / 0.75); the Target Bot (states t0–t3) issues requests with probabilities and rewards, e.g., ⟨pluck: 0.3, 1⟩, ⟨water: 0.7, 1⟩, ⟨clean, 1⟩, ⟨empty, 1⟩.]
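The stochastic models can be written down directly as probability-weighted transition tables, replacing the successor sets of the classical sketch above with distributions. The numbers below are illustrative, loosely read off the garden figure.

    import random

    # Stochastic transitions: (state, action) -> list of (successor, prob).
    cleaner_bot = {
        ("a0", "clean"): [("a0", 0.9), ("a1", 0.1)],
        ("a1", "clean"): [("a0", 0.2), ("a1", 0.8)],
    }
    environment = {
        ("e0", "clean"): [("e0", 0.8), ("e1", 0.2)],
        ("e1", "pluck"): [("e0", 0.25), ("e1", 0.75)],
    }

    def sample_successor(model, state, action):
        """Draw a successor according to the transition distribution."""
        succs = model[(state, action)]
        u, acc = random.random(), 0.0
        for nxt, prob in succs:
            acc += prob
            if u <= acc:
                return nxt
        return succs[-1][0]  # guard against floating-point round-off

    print(sample_successor(cleaner_bot, "a0", "clean"))
    print(sample_successor(environment, "e1", "pluck"))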
Controller Evaluation

Value of a controller: measures the degree of the target's expected realizability.

Reward gained on (a worked example follows):
• a successful action delegation: probability of the action request × reward for the action request;
• one target step (Ri): sum of the points gained for each legal target action request;
• k steps of the whole system: R1 + R2 + ... + Rk;
• infinite runs of the whole system: R1 + α·R2 + α²·R3 + ..., using a discount factor 0 ≤ α < 1.

Result: Every exact solution is an optimal controller.
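As a sanity check of these definitions, the snippet below computes a one-step expected reward and the α-discounted value of a run. The request distribution matches the target figure; the discount factor and the constant-reward run are assumed for illustration.

    # One target step: sum over legal requests of P(request) * reward.
    # From the target figure: pluck requested w.p. 0.3, water w.p. 0.7,
    # both with reward 1, so the step reward is 0.3*1 + 0.7*1 = 1.0.
    requests = [("pluck", 0.3, 1), ("water", 0.7, 1)]
    step_reward = sum(p * r for _, p, r in requests)

    # Discounted value R1 + a*R2 + a^2*R3 + ... with 0 <= a < 1,
    # truncated to 50 steps for illustration.
    alpha = 0.9                        # assumed discount factor
    step_rewards = [step_reward] * 50  # a run with constant step reward
    value = sum(alpha**k * r for k, r in enumerate(step_rewards))

    print(step_reward)  # 1.0
    print(value)        # approaches step_reward / (1 - alpha) = 10.0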
Solution via reduction to MDP

MDP encoding M_S,T = ⟨Q, A, p, r⟩ (a toy construction is sketched below):
• Q is the finite set of states encoding the state of the system, the state of the target, and the next requested action;
• A = {1, ..., n, u} is the set of available behavior indexes;
• p(q, i, q′) is the stochastic transition function encoding the possible next system state and requested action;
• r(q, i) is the reward allocated on a correct delegation.

[Figure: Encoded MDP (partial). States combine the system state, the target state, and the pending request, e.g., ⟨a0, b0, c0, e0, t1, pluck⟩; edges are labeled with products of component probabilities, e.g., cleaner: 0.8 × 0.1 × 0.3.]
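A direct way to realize this encoding is to form Q from tuples of behavior, environment, and target states plus the pending request, and to obtain p by multiplying the component probabilities, as the cleaner: 0.8 × 0.1 × 0.3 edge labels suggest. The sketch below is a minimal, illustrative construction built on the toy models above; the dictionary interfaces and the reward rule are assumptions, not the paper's formulation.

    # Toy component models with illustrative numbers (see sketches above).
    cleaner = {("a0", "clean"): [("a0", 0.9), ("a1", 0.1)]}
    env     = {("e0", "clean"): [("e0", 0.8), ("e1", 0.2)]}
    # Target: state -> list of (next state, requested action, prob, reward).
    target  = {"t0": [("t1", "pluck", 0.3, 1), ("t2", "water", 0.7, 1)]}

    def transitions(q, i):
        """Successor distribution p(q, i, .): delegate the pending request
        to behavior i, then combine the behavior's, the environment's, and
        the next request's probabilities by multiplication (cf. the
        'cleaner: 0.8 x 0.1 x 0.3' edge labels in the figure)."""
        b, e, t, act = q
        out = []
        for b2, pb in cleaner.get((b, act), []):
            for e2, pe in env.get((e, act), []):
                for t2, act2, pr, _rew in target.get(t, []):
                    out.append(((b2, e2, t2, act2), pb * pe * pr))
        return out

    def reward(q, i):
        """r(q, i): positive reward only if the delegation is legal, i.e.
        the delegated behavior can actually execute the pending request."""
        b, e, t, act = q
        return 1 if (b, act) in cleaner else 0

    q0 = ("a0", "e0", "t0", "clean")
    for q2, prob in transitions(q0, 1):
        print(q2, round(prob, 3))  # eight successors; probabilities sum to 1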
Results

• An optimal policy for M_S,T ≡ an optimal controller for T in S.
• The existence of an exact controller can be checked by computing the optimal policy for a horizon equal to |Q| + 1 (a generic sketch follows).
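The existence check suggests a straightforward procedure: run finite-horizon value iteration for |Q| + 1 steps and inspect the resulting values. The sketch below is generic finite-horizon value iteration over an explicit MDP; the function interface and the toy two-state instance are assumptions for illustration.

    def finite_horizon_values(Q, A, p, r, horizon):
        """Finite-horizon value iteration: after h sweeps, V[q] is the best
        total reward achievable from q in h steps. p(q, a) returns a list
        of (successor, probability) pairs; r(q, a) the immediate reward."""
        V = {q: 0.0 for q in Q}
        for _ in range(horizon):
            V = {
                q: max(
                    r(q, a) + sum(prob * V[q2] for q2, prob in p(q, a))
                    for a in A
                )
                for q in Q
            }
        return V

    # Toy 2-state MDP (hypothetical numbers, just to exercise the code).
    Q = ["q0", "q1"]
    A = [1, "u"]
    def p(q, a):
        return [("q1", 1.0)] if (q, a) == ("q0", 1) else [(q, 1.0)]
    def r(q, a):
        return 1 if (q, a) == ("q0", 1) else 0

    # Per the result above, a horizon of |Q| + 1 suffices for the check.
    print(finite_horizon_values(Q, A, p, r, horizon=len(Q) + 1))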
Future work

• Apply machine learning:
  – Reinforcement learning: when the model of the system is unknown;
  – Evolutionary computation: build the controller incrementally.
• Include extended constraints, e.g., action empty must be feasible after action pluck has been executed.
• Include preferences, e.g., behavior Plucker-bot uses less energy than the Multi-bot.