Submitted to: Österreichische Gesellschaft für Artificial Intelligence

Universiteit van Amsterdam

IAS technical report IAS-UVA-08-01

Exploration in POMDPs

Christos Dimitrakakis
Intelligent Systems Laboratory Amsterdam, University of Amsterdam, The Netherlands

In recent work, Bayesian methods for exploration in Markov decision processes (MDPs) and for solving known partially observable Markov decision processes (POMDPs) have been proposed. In this paper we review the similarities and differences between those two domains and propose methods to deal with them simultaneously. This enables us to attack the Bayes-optimal reinforcement learning problem in POMDPs.

Keywords: POMDPs, exploration, Bayesian, reinforcement learning, belief



Contents

1 Introduction
  1.1 Exploration in MDPs
2 Exploration in POMDPs
  2.1 Belief POMDPs
  2.2 The belief state
  2.3 Action selection
3 Current and future research

Intelligent Autonomous Systems
Informatics Institute, Faculty of Science
University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
The Netherlands
Tel (fax): +31 20 525 7461 (7490)
http://www.science.uva.nl/research/ias/

Corresponding author:
C. Dimitrakakis
tel: +31 20 525 7517
[email protected]
http://www.science.uva.nl/~dimitrak/

Copyright IAS, 2008

1 Introduction

Let us consider the problem of an agent acting in a discrete-time dynamic environment. The dynamics of the environment are such that the transition from the current state $s_t$ to the next, $s_{t+1}$, depends only on the current state and the action $a_t$ the agent is taking at the time. In addition, there exists a reward signal $r_t \in \mathbb{R}$, and the agent wishes to maximise the expected utility $E\, U_t(T)$, with

$$U_t(T) = \sum_{k=1}^{T-t} g(t+k)\, r_{t+k},$$

where $T$ is the horizon after which we are no longer interested in rewards and $g \geq 0$ is a discounting factor which allows us to express the relative value of rewards in the future. If the agent can observe the state, then the process of the agent's interaction with the environment can be formally defined as follows.

Definition 1 [Markov decision process] A Markov decision process $\mu$ (MDP) is defined as the tuple $\mu = (S, A, T, R)$ comprised of a set of states $S$, a set of actions $A$, a transition distribution $T$ conditioning the next state on the current state and action, $\mu(s'|s,a) = P(s_{t+1}=s' \mid s_t=s, a_t=a, \mu)$, and a reward distribution $R$ conditioned on states and actions, $\mu(r|s,a) = p(r_{t+1}=r \mid s_t=s, a_t=a)$, with $a \in A$, $s, s' \in S$, $r \in \mathbb{R}$.

Note that in this definition, and in the rest of the text, we are using the notational convention $z(x|y) \equiv P(x|y,z)$. The set of all MDPs shall be denoted by $M$.

In the simplest setting, which is usually referred to as dynamic programming [10, 3, 14], we assume that the model $\mu$ of the Markov decision process is known and attempt to discover a policy $\pi^*$ which is optimal for that MDP. A policy $\pi$ is defined as a distribution over actions conditioned on the state, i.e. $\pi_t(a|s) = P(a_t=a \mid s_t=s, \pi)$. The optimal policy given $\mu, T, g$ maximises the state value function

$$V^{\pi}_{t,T}(s) \equiv E(U_t(T) \mid \pi, s_t=s, \mu) \equiv \sum_{k=1}^{T-t} g(t+k)\, E(r_{t+k} \mid \pi, s_t=s) \qquad (1)$$

for every state $s$ in the MDP, i.e. $V^{\pi^*}_{t,T}(s) \equiv V^*_{t,T}(s) \geq V^{\pi}_{t,T}(s)$ for any $\pi, s$. The form of the value function determines our objective. When $T$ is finite we are only interested in what occurs in the environment until time $T$, and the problem falls in the category of finite-horizon problems. For the infinite-horizon case it is useful to set $g(t+k) = \gamma^{t+k}$, which for $\gamma \in [0,1)$ leads to $\lim_{T \to \infty} V^{\pi}_{t,T} = V^{\pi}$.

In the reinforcement learning framework, $\mu$ is not known and we must estimate the optimal policy $\pi^*$ by interacting with the environment. This framework is described in detail in [3, 14], which provide algorithms for approximate dynamic programming. These converge under certain conditions to the same solution as dynamic programming. However, such approaches mostly treat the problem as one of optimisation and stochastic approximation, while the uncertainty inherent in the problem (we do not know the "true state" of the world) is not directly considered. Typical convergence proofs for such algorithms contain sufficient conditions for approximating an optimal policy that require the agent to act in an exploratory manner in order to reduce uncertainty. Furthermore, although there exist both asymptotic and sample-complexity results, the question of how to behave optimally, so as to maximise (1) while learning, has not been addressed in this line of work.

In fact, the question of acting optimally under uncertainty appears even in the simplest possible reinforcement learning setting, the multi-armed bandit problem. There, there is only one state and the agent can take one of a finite set of actions, each of which has an unknown, but usually fixed, mean reward. Depending on the discount parameter $\gamma$ and the uncertainty about the values of different actions, the agent will choose between a profitable arm and an uncertain, but potentially more profitable, one. A set of provably good methods for optimal exploration in this setting is given by [1], with the only restriction being boundedness of the expected rewards. However, in the Bayesian subjectivist setting for sequential decision making [4], there has also


been substantial work towards optimal Bayesian methods for bandit problems, the most well-known of which is the Gittins index [6]. The attractiveness of the Bayesian approach is that, given an initial prior over all possible MDPs, we can create another model, whose state is composed of two sub-states: the state of our belief (a probability distribution over $M$) and the system state of the original MDP. This belief-augmented MDP can then be solved using standard dynamic programming methods in order to obtain the optimal action under uncertainty. We shall look at the estimation procedure that is involved in creating this MDP, and discuss the cases when this is intractable. In the cases where it is tractable, we shall examine methods for performing optimal action selection by solving the resulting augmented MDP. Because this can be an extremely large MDP, standard solution methods might not be applicable and we will have to resort to approximations.

The rest of the paper is organised as follows. First we shall present belief-augmented MDPs and then extend this formalism to partially observable MDPs. Secondly, we shall present how a belief state can be maintained in either case. Finally, we shall outline current methods for optimal action selection in these settings and propose directions for future research.
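To make the dynamic programming machinery concrete, the following is a minimal sketch of finite-horizon backward induction for a known, finite MDP, computing the value function of equation (1) with $g(t+k) = \gamma^{t+k}$. The tabular arrays and function names are illustrative assumptions, not anything prescribed in this report.

```python
import numpy as np

def backward_induction(P, R, gamma, T):
    """Finite-horizon backward induction for a known MDP.

    P[a, s, s2] are the transition probabilities mu(s2 | s, a) and R[a, s] is the
    expected reward E[r_{t+1} | s_t = s, a_t = a].  V[t, s] is the optimal value
    of equation (1) with g(t + k) = gamma ** (t + k); V[T] = 0 by convention.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros((T + 1, n_states))
    policy = np.zeros((T, n_states), dtype=int)
    for t in reversed(range(T)):
        # Q[a, s] = gamma^{t+1} E[r_{t+1} | s, a] + sum_{s'} mu(s'|s,a) V[t+1, s']
        Q = gamma ** (t + 1) * R + P @ V[t + 1]
        V[t] = Q.max(axis=0)
        policy[t] = Q.argmax(axis=0)
    return V, policy

# Example with an arbitrary two-state, two-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V, policy = backward_induction(P, R, gamma=0.95, T=10)
```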

1.1 Exploration in MDPs

When we are uncertain about which MDP we are acting in, we may maintain a belief over possible MDPs. If we augment the MDP's state with this belief, we can then solve the exploration problem via standard dynamic programming algorithms such as backwards induction or value iteration. We shall call such models Belief MDPs (BMDPs), analogously to the BAMDPs (Bayes-Adaptive MDPs) of [5]. (It is also possible to consider different forms of beliefs than standard Bayesian ones.) This is done by not only considering densities conditioned on the state-action pairs $(s_t, a_t)$, i.e. $p(r_{t+1}, s_{t+1} \mid s_t, a_t)$, but by also taking into account the belief $\xi_t \in B$, a probability space over possible MDPs; that is, we augment the state space from $S$ to $S \times B$ and consider the conditional density $p(r_{t+1}, s_{t+1}, \xi_{t+1} \mid s_t, a_t, \xi_t)$. More formally, we may give the following definition:

Definition 2 [Belief MDP] A Belief MDP $\nu$ (BMDP) is an MDP $\nu = (\Omega, A, T', R')$ where $\Omega = S \times B$, $B$ is the set of probability measures on $M$, and $T', R'$ are the transition and reward distributions conditioned jointly on the MDP state $s_t$, the belief state $\xi_t$, and the action $a_t$, such that the following factorisations are satisfied for all $\mu \in M$, $\xi_t \in B$:

$$p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_1, a_t, \mu) = \mu(s_{t+1} \mid s_t, a_t) \qquad (2)$$

$$p(s_{t+1}, \xi_{t+1} \mid a_t, s_t, \xi_t) = \int_M p(\xi_{t+1} \mid s_{t+1}, a_t, s_t, \mu, \xi_t)\, \mu(s_{t+1} \mid a_t, s_t)\, \xi_t(\mu)\, d\mu \qquad (3)$$

We shall use $M_B$ to denote the set of BMDPs. It should be obvious from (3) that $s_t, \xi_t$ jointly form a Markov state in this setting.

The form of the probability distribution over MDPs, $\xi_t(\mu)$, need not be particularly complex. In fact, if we consider discrete state and action spaces, then the belief over transition distributions can be represented with a simple Dirichlet prior for each state-action pair, as long as we consider the state-action-state transition distributions to be independent. The Dirichlet is a probability distribution over possible multinomial distributions, to which it is conjugate, and it is fully characterised by transition counts. (In this sense, it is analogous to the beta distribution, which is the conjugate family to Bernoulli distributions.) More specifically, suppose we have $k$ discrete events, drawn from an unknown multinomial distribution $q \equiv (q_1, \ldots, q_k)$. If $q \sim \mathrm{Dir}(\phi_1, \ldots, \phi_k)$, then the p.d.f. is $\phi(q) \propto \prod_{i=1}^k q_i^{\phi_i - 1}$, where $\phi_i$ is the number of times event $i$ has been observed. We shall omit details for the fully observable case and proceed directly to exploration in partially observable MDPs.
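As an illustration of such a belief, the sketch below maintains an independent Dirichlet posterior over $\mu(\cdot \mid s, a)$ for each state-action pair by accumulating transition counts; the class and method names are illustrative assumptions.

```python
import numpy as np

class DirichletTransitionBelief:
    """Independent Dirichlet prior Dir(phi) over mu(. | s, a) for each (s, a)."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # phi[s, a, s2] starts at the prior pseudo-count and accumulates observations
        self.phi = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        """Condition the belief on an observed transition (s, a) -> s_next."""
        self.phi[s, a, s_next] += 1.0

    def expected_transitions(self, s, a):
        """Posterior mean of mu(. | s, a), i.e. E[q] under Dir(phi)."""
        return self.phi[s, a] / self.phi[s, a].sum()

    def sample_mdp(self, rng=None):
        """Draw one plausible transition model from the current belief."""
        rng = rng or np.random.default_rng()
        return np.array([[rng.dirichlet(self.phi[s, a])
                          for a in range(self.phi.shape[1])]
                         for s in range(self.phi.shape[0])])

# Usage: update the counts as transitions are observed.
belief = DirichletTransitionBelief(n_states=3, n_actions=2)
belief.update(s=0, a=1, s_next=2)
print(belief.expected_transitions(0, 1))
```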

2 Exploration in POMDPs

A useful extension of the MDP model can be obtained by not allowing the agent to directly observe the state of the environment, but only an observation variable $o_t$ that is conditioned on the state. This more realistic assumption is formally defined as follows:

Definition 3 [Partially observable Markov decision process] A partially observable Markov decision process $\mu$ (POMDP) is defined as the tuple $\mu = (S, A, O, T, R)$ comprised of a set of states $S$, a set of actions $A$, a transition-observation distribution $T$ conditioned on the current state and action, $\mu(s_{t+1}=s', o_{t+1}=o \mid s_t=s, a_t=a)$, and a reward distribution $R$ conditioned on the state and action, $\mu(r_{t+1}=r \mid s_t=s, a_t=a)$, with $a \in A$, $s, s' \in S$, $o \in O$, $r \in \mathbb{R}$.

We shall denote the set of POMDPs as $M_P$. For POMDPs, it is often assumed that one of the two following factorisations holds:

$$\mu(s_{t+1}, o_{t+1} \mid s_t, a_t) = \mu(s_{t+1} \mid s_t, a_t)\, \mu(o_{t+1} \mid s_{t+1}) \qquad (4)$$

$$\mu(s_{t+1}, o_{t+1} \mid s_t, a_t) = \mu(s_{t+1} \mid s_t, a_t)\, \mu(o_{t+1} \mid s_t, a_t). \qquad (5)$$

The assumption that the observations depend only on a single state or a single state-action pair is a natural decomposition for many practical problems.

POMDPs are similar to BMDPs. In fact, BMDPs are equivalent to a special case of a POMDP in which the state is split into two parts: one fully observable, dynamic part and one unobservable, but stationary, part which models the unknown MDP. Typically, however, in POMDP applications the unobserved part of the state is dynamic.

The problem of acting optimally in POMDPs has two aspects. The first is state estimation, and the second is acting optimally given the estimated state. As far as the first part is concerned, given an initial state probability distribution, updating the belief amounts to simply maintaining a multinomial distribution over the states. However, the initial state distribution might not be known. In that case, we may assume an initial prior density over the multinomial state distribution. It is easy to see that this is simply a special case of an unknown state transition distribution, where we insert a special initial state which is only visited once. We shall, however, be concerned with the more general case of full exploration in POMDPs, where all state transition distributions are unknown.
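For concreteness, a finite POMDP satisfying factorisation (4) can be represented by tabular distributions. The following minimal container and its sampling step are an illustrative assumption of this sketch, not notation used in the report.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularPOMDP:
    """Finite POMDP under factorisation (4): mu(s', o | s, a) = mu(s'|s,a) mu(o|s')."""
    transitions: np.ndarray   # shape (A, S, S): mu(s' | s, a)
    observations: np.ndarray  # shape (S, O):    mu(o | s')
    rewards: np.ndarray       # shape (A, S):    expected reward for (s, a)

    def step(self, s, a, rng):
        """Sample (s_{t+1}, o_{t+1}, r_{t+1}) given the hidden state s and action a."""
        s_next = rng.choice(self.transitions.shape[2], p=self.transitions[a, s])
        o_next = rng.choice(self.observations.shape[1], p=self.observations[s_next])
        r_next = self.rewards[a, s]
        return s_next, o_next, r_next

rng = np.random.default_rng(0)
pomdp = TabularPOMDP(
    transitions=np.array([[[0.9, 0.1], [0.3, 0.7]],
                          [[0.5, 0.5], [0.1, 0.9]]]),
    observations=np.array([[0.8, 0.2], [0.25, 0.75]]),
    rewards=np.array([[1.0, 0.0], [0.0, 2.0]]),
)
print(pomdp.step(s=0, a=1, rng=rng))
```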

2.1 Belief POMDPs

It is possible to create an augmented MDP for POMDP models by endowing them with an additional belief state, in the same manner as for MDPs. However, now the belief state will be a joint probability distribution over $M_P$ and $S$. Nevertheless, each $(a_t, o_{t+1})$ pair that is observed leads to a unique subsequent belief state. More formally, a belief-augmented POMDP is defined as follows:


Definition 4 [Belief POMDP] A Belief POMDP $\nu$ (BPOMDP) is an MDP $\nu = (\Omega, A, O, T', R')$ where $\Omega = G \times B$, $G$ is the set of probability measures on $S$, $B$ is the set of probability measures on $M_P$, and $T', R'$ are the belief-state transition and reward distributions conditioned on the belief state $\xi_t$ and the action $a_t$, such that the following factorisations are satisfied for all $\mu \in M_P$, $\xi_t \in B$:

$$p(s_{t+1} \mid s_t, a_t, s_{t-1}, \ldots, \mu) = \mu(s_{t+1} \mid s_t, a_t) \qquad (6)$$

$$p(o_t \mid s_t, a_t, o_{t-1}, \ldots, \mu) = \mu(o_t \mid s_t, a_t) \qquad (7)$$

$$p(\xi_{t+1} \mid o_{t+1}, a_t, \xi_t) = \int_{M_P} p(\xi_{t+1} \mid \mu, o_{t+1}, a_t, \xi_t)\, \xi_{t+1}(\mu \mid o_{t+1}, a_t, \xi_t)\, d\mu \qquad (8)$$

We shall denote the set of BPOMDPs with $M_{BP}$. Again, (8) simply assures that the transitions in the belief-POMDP are well defined. The Markov state $\xi_t(\mu, s_t)$ now jointly specifies a distribution over POMDPs and states. (The formalism is very similar to that described in [11], with the exception that we do not include the actual POMDP state in the model.) As in the MDP case, in order to be able to evaluate policies and select actions optimally, we first need to construct the BPOMDP. This requires calculating the transitions from the current belief state to subsequent ones according to our possible future observations, as well as the probability of those observations. The next section goes into this in more detail.
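The joint belief over $M_P \times S$ is in general a complicated object. One simple finite approximation, which is an assumption of this sketch rather than something prescribed above, is a weighted set of candidate POMDPs, each carrying its own state distribution:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class JointBelief:
    """Finite approximation of the belief xi_t over M_P x S: a weighted set of
    candidate POMDPs plus, for each candidate mu_i, a state distribution xi_t(s | mu_i)."""
    models: list              # candidate models, e.g. TabularPOMDP instances
    weights: np.ndarray       # weights[i] = xi_t(mu_i), sums to one
    state_dists: np.ndarray   # state_dists[i, s] = xi_t(s | mu_i)

    def joint(self, i, s):
        """xi_t(mu_i, s) = xi_t(s | mu_i) * xi_t(mu_i)."""
        return self.weights[i] * self.state_dists[i, s]
```

Each observed pair $(a_t, o_{t+1})$ then maps such an object to a unique successor belief, as made explicit in the next section.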

2.2 The belief state

In order to simplify the exposition, in the following we shall assume that each POMDP has the same number of states. Then $\xi(s_t=s \mid \mu)$ describes the probability that we are in state $s$ at time $t$ given some belief $\xi$ and assuming we are in the POMDP $\mu$. Similarly, $\xi(s_t=s, \mu)$ is the joint probability given our belief. This joint distribution can be used as a state in an expanded MDP, which can be solved via backward induction, as will be seen later. In order to do this, we must start with an initial belief $\xi_0$ and calculate all possible subsequent beliefs. The belief at time $t+1$ depends only on the belief at time $t$ and the current set of observations $r_{t+1}, o_{t+1}, a_t$. Thus, the transition probability from $\xi_t$ to $\xi_{t+1}$ is just the probability of the observations according to our current belief, $\xi_t(r_{t+1}, o_{t+1} \mid a_t)$. This can be calculated by first noting that, given the model and the state, the probability of the observations no longer depends on the belief, i.e.

$$\xi_t(r_{t+1}, o_{t+1} \mid s_t, a_t, \mu) = \mu(r_{t+1}, o_{t+1} \mid a_t, s_t) = \mu(r_{t+1} \mid a_t, s_t)\, \mu(o_{t+1} \mid a_t, s_t). \qquad (9)$$

The probability of any particular observation can be obtained by integrating over all the possible models and states:

$$\xi_t(r_{t+1}, o_{t+1} \mid a_t) = \int_{M_P} \int_S \mu(r_{t+1}, o_{t+1} \mid a_t, s_t)\, \xi(\mu, s_t). \qquad (10)$$

Given that a particular observation is made from a specific belief state, we now need to calculate which belief state it would lead to. For this we need to compute the posterior belief over POMDPs and states. The belief over POMDPs is given by

$$\xi_{t+1}(\mu) \equiv \xi_t(\mu \mid r_{t+1}, o_{t+1}, a_t) \qquad (11)$$
$$= \frac{\xi_t(r_{t+1}, o_{t+1}, a_t \mid \mu)\, \xi_t(\mu)}{\xi_t(r_{t+1}, o_{t+1}, a_t)} \qquad (12)$$
$$= \frac{\xi_t(\mu)}{Z} \iint_S \mu(r_{t+1}, o_{t+1}, a_t \mid s_{t+1}, s_t)\, \xi_t(s_{t+1}, s_t \mid \mu)\, ds_{t+1}\, ds_t, \qquad (13)$$


where $Z = \xi_t(r_{t+1}, o_{t+1}, a_t)$ is a normalising constant. Note that $\xi_t(s_{t+1}, s_t \mid \mu) = \mu(s_{t+1} \mid s_t)\, \xi_t(s_t \mid \mu)$, where $\xi_t(s_t \mid \mu)$ is our belief about the state in the POMDP $\mu$. This can be updated using the following two steps. Firstly, the filtering step

$$\xi_{t+1}(s_t \mid \mu) \equiv \xi_t(s_t \mid r_{t+1}, o_{t+1}, a_t, \mu) \qquad (14)$$
$$= \frac{\mu(r_{t+1}, o_{t+1} \mid s_t, a_t)\, \xi_t(s_t \mid \mu)}{\xi_t(r_{t+1}, o_{t+1} \mid a_t, \mu)}, \qquad (15)$$

where we adjust our belief about the previous state of the POMDP based on the new observations. Then we must perform a prediction step

$$\xi_{t+1}(s_{t+1} \mid \mu) = \int_S p(s_{t+1} \mid s_t=s, \mu)\, \xi_{t+1}(s_t=s \mid \mu)\, ds, \qquad (16)$$

where we calculate the probability over the current states given our new belief concerning the previous states. These predictions can then be used to compute the possible next beliefs: since our current belief corresponds to a distribution over possible POMDPs, we use that distribution and, for each possible POMDP, determine how our beliefs would change as we acquire new observations. The main difficulty is maintaining the joint distribution over states and POMDPs. This will be further discussed in the final section.
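Under the finite joint-belief representation sketched above, the model update (10)-(13) and the per-model state updates reduce to sums over candidates and states. The following sketch performs one belief transition for a discrete observation, ignoring rewards for brevity and reusing the illustrative TabularPOMDP and JointBelief containers; under factorisation (4), the filtering (14)-(15) and prediction (16) steps collapse into a single conditioning on $s_{t+1}$.

```python
import numpy as np

def update_belief(belief, a, o):
    """One belief transition xi_t -> xi_{t+1} after taking action a and observing o.

    For a finite set of candidate models this computes the observation
    probability (10), the posterior over models (11)-(13), and the per-model
    state update corresponding to (14)-(16).  Returns the successor JointBelief
    and the branch probability xi_t(o | a).
    """
    new_weights = np.empty_like(belief.weights)
    new_state_dists = np.empty_like(belief.state_dists)

    for i, mu in enumerate(belief.models):
        # p(s', o | a, mu) = sum_s mu(s'|s,a) mu(o|s') xi_t(s|mu), factorisation (4)
        pred = belief.state_dists[i] @ mu.transitions[a]      # p(s' | a, mu)
        joint_so = pred * mu.observations[:, o]               # p(s', o | a, mu)
        evidence = joint_so.sum()                             # xi_t(o | a, mu)

        new_weights[i] = belief.weights[i] * evidence         # numerator of (12)
        new_state_dists[i] = joint_so / max(evidence, 1e-12)  # xi_{t+1}(s' | mu)

    branch_prob = new_weights.sum()                           # xi_t(o | a), cf. (10)
    new_weights /= max(branch_prob, 1e-12)                    # normalise, cf. (13)
    return JointBelief(belief.models, new_weights, new_state_dists), branch_prob
```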

2.3 Action selection

The second difficulty in the exploration task is action selection. For this, we need to select the action maximising

$$V^{\pi^*}_{t,T}(a_t, \xi_t) = \int_M \int_S \xi_t(\mu, s_t) \max_{\pi} E(U_t \mid \pi, s_t, a_t, \xi_t, \mu)\, ds_t\, d\mu.$$

This is far from a trivial operation. However, in finite-horizon problems we can perform a backwards induction procedure where we start from the optimal action at the last stage and calculate $V^{\pi^*}_{T,T}(a \mid \xi_T)$ for all possible belief states $\xi_T$ at that stage; we subsequently calculate $V^{\pi^*}_{T-1,T}(a \mid \xi_{T-1}), V^{\pi^*}_{T-2,T}(a \mid \xi_{T-2}), \ldots, V^{\pi^*}_{t,T}(a \mid \xi_t)$. Note that at the current time $t$ there is only a single belief $\xi_t$. Further, at each stage $n$ we can write

$$V^{\pi^*}_{n,T}(a_n \mid \xi_n) = \int_O \int_{-\infty}^{\infty} \xi_n(r_{n+1}, o_{n+1} \mid a_n) \left[ r_{n+1} + V^{\pi^*}_{n+1,T}\!\left(a^*_{n+1} \mid r_{n+1}, o_{n+1}, a_n, \xi_n\right) \right] dr_{n+1}\, do_{n+1},$$

where $V^{\pi^*}_{n+1,T}(a^*_{n+1} \mid r_{n+1}, o_{n+1}, a_n, \xi_n) \equiv V^{\pi^*}_{n+1,T}(a^*_{n+1} \mid \xi_{n+1})$ for one of the possible next-step beliefs. The implication is that we must first calculate all possible belief states starting from the current one, for all stages until $T$. The complexity of this operation is high: if at each stage there are $n$ possible observations, then the number of possible beliefs is of order $n^T$; if the reward, state, action, or observation spaces are continuous, we are presented with a potentially insurmountable problem.
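A minimal sketch of this backwards induction over the belief tree, under the same illustrative finite-belief representation (discrete observations, expected rewards, and no discounting are simplifying assumptions), makes the $n^T$ blow-up explicit:

```python
import numpy as np

def expected_reward(belief, a):
    """E[r_{n+1} | a, xi] = sum_i xi(mu_i) sum_s xi(s | mu_i) R_i(a, s)."""
    return sum(w * (mu.rewards[a] @ d)
               for w, mu, d in zip(belief.weights, belief.models, belief.state_dists))

def value(belief, horizon, n_actions, n_obs):
    """Optimal value-to-go V*(xi) by exhaustive expansion of the belief tree.

    The number of recursive calls grows as (n_actions * n_obs) ** horizon,
    mirroring the blow-up discussed above; this only makes the recursion explicit.
    """
    if horizon == 0:
        return 0.0
    best = -np.inf
    for a in range(n_actions):
        q = expected_reward(belief, a)
        for o in range(n_obs):
            next_belief, branch_prob = update_belief(belief, a, o)  # earlier sketch
            if branch_prob > 0:
                q += branch_prob * value(next_belief, horizon - 1, n_actions, n_obs)
        best = max(best, q)
    return best

def best_action(belief, horizon, n_actions, n_obs):
    """Select the action maximising the one-step lookahead of the recursion above."""
    def q(a):
        total = expected_reward(belief, a)
        for o in range(n_obs):
            nb, p = update_belief(belief, a, o)
            if p > 0:
                total += p * value(nb, horizon - 1, n_actions, n_obs)
        return total
    return max(range(n_actions), key=q)
```

In practice this exhaustive expansion is only feasible for very short horizons; the pruning, sampling, and belief-projection approximations discussed in the next section are motivated precisely by this cost.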

3 Current and future research

The BMDP as presented herein is essentially identical to the Bayes-adaptive MDP formalism used in [5], but the general idea has been around since [2]. Current research focuses on practical methods for decision making in these cases.

The work of [8] is one of the first modern works on POMDPs where uncertainty about the model is explicitly taken into account. In this setting, it is possible to directly query the


true value of the state parameter of the POMDP with a special query action. It is possible to generalise the 'query action' idea of [8] to the case where a query action is just a temporally extended [15] 'exploration' action. As was suggested for example in [?], it is possible to place upper bounds on the value of exploration by lower-bounding the regret incurred while taking exploratory movements. Methods for bounding the value of different actions will be particularly applicable to continuous state, action or observation spaces.

An extension of the Bayes-adaptive MDP framework to the POMDP case was also discussed in [11]. The additional problem in POMDPs is that tracking the joint belief over $S \times M_P$ is difficult. In most cases the effort concentrates on reducing the number of states that must be tracked. For standard POMDP problems, fixed-point approximations work well [13]. However, exploration problems create the need for a continuous refinement of the discretisation. Monte Carlo methods [7, 9] offer a potential solution. For uncertain POMDP problems, methods such as those used in [11] rely on procedures for compacting the belief space.

Bayesian approaches have not been limited to exploration problems. For example, [16] solve known POMDPs, where a version of the EM algorithm is used to find policies. One of the main ideas therein was to consider an infinite-horizon MDP as an infinite mixture of finite-horizon MDPs. This idea is also used by [7], which extends the procedure to a full Bayesian approach and continuous spaces. Furthermore, they introduce an efficient trans-dimensional Markov chain Monte Carlo procedure, which is shown to perform at a level comparable to that of the full Bayesian approach. Such methods might be applicable to the full exploration problem as well.

While solution methods for known POMDPs could be applied to BMDPs or BPOMDPs, most currently used methods perform off-line calculations that are geared towards solving a tracking problem, and thus they are not directly suitable for exploration. However, progress has recently been made towards applicable online methods. In particular, [12] offers a theoretical analysis of heuristic online POMDP planning algorithms which shows their near-optimality. It should be possible to apply such methods to uncertain POMDP problems as well.

Multiple-agent problems, especially decentralised POMDPs, are an interesting scenario. The question then becomes one of determining how best to explore the POMDP given the additional exploration value that multiple agents add. In particular, it would be of interest to see how the complexity of exploration reduces as the number of agents increases.

In summary, there are three main possibilities for future research. The first is to consider classes of problems with special structure; this might either allow a more efficient solution, or it might lead to an interesting secondary problem, which in turn could result in new theoretical analyses. The second is to consider approximate methods for POMDPs, using high-probability bounds on the value function for parts of the tree in order to perform pruning, Monte Carlo sampling to selectively expand parts of the tree, projection of the belief to a more compact representation, or other heuristics for simplifying computations; methods for solving POMDPs could then be applied to BMDPs. Finally, examining belief representations other than the full Bayesian ones might lead to a more manageable problem.

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235-256, 2002. A preliminary version appeared in Proc. of the 15th International Conference on Machine Learning.

[2] Richard Ernest Bellman. Dynamic Programming. Princeton University Press, 1957. Republished by Dover in 2004.


[3] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[4] Morris H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, 1970. Republished in 2004.

[5] Michael O'Gordon Duff. Optimal Learning: Computational Procedures for Bayes-adaptive Markov Decision Processes. PhD thesis, University of Massachusetts at Amherst, 2002.

[6] J. C. Gittins. Multi-armed Bandit Allocation Indices. John Wiley & Sons, New Jersey, US, 1989.

[7] Matthew Hoffman, Arnaud Doucet, Nando de Freitas, and Ajay Jasra. Bayesian policy learning with trans-dimensional MCMC. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

[8] R. Jaulmes, J. Pineau, and D. Precup. Active learning in partially observable Markov decision processes. In European Conference on Machine Learning, 2005.

[9] Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

[10] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New Jersey, US, 1994, 2005.

[11] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive POMDPs. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

[12] Stephane Ross, Joelle Pineau, and Brahim Chaib-draa. Theoretical analysis of heuristic search methods for online POMDPs. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

[13] M.T.J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195-220, 2005.

[14] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[15] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.

[16] Marc Toussaint, Stefan Harmeling, and Amos Storkey. Probabilistic inference for solving (PO)MDPs. Research report, University of Edinburgh, School of Informatics, 2006.


Acknowledgements

Thanks to Frans Oliehoek and Ronald Ortner for useful discussions and proofreading.

IAS reports

This report is in the series of IAS technical reports. The series editor is Bas Terwijn ([email protected]). Within this series the following titles appeared:

F.A. Oliehoek, N. Vlassis, and M.T.J. Spaan. Properties of the QBG-value function. Technical Report IAS-UVA-07-04, Informatics Institute, University of Amsterdam, The Netherlands, August 2007.

G. Pavlin, P. de Oude, M.G. Maris, J.R.J. Nunnink, and T. Hood. A Distributed Approach to Information Fusion Systems Based on Causal Probabilistic Models. Technical Report IAS-UVA-07-03, Informatics Institute, University of Amsterdam, The Netherlands, July 2007.

P.J. Withagen, F.C.A. Groen, and K. Schutte. Shadow detection using a physical basis. Technical Report IAS-UVA-07-02, Informatics Institute, University of Amsterdam, The Netherlands, February 2007.

All IAS technical reports are available for download at the ISLA website, http://www.science.uva.nl/research/isla/MetisReports.php.
