Decision Boundary Partitioning: Variable Resolution Model-Free Reinforcement Learning

Stuart I. Reynolds

School of Computer Science, The University of Birmingham, Birmingham, B15 2TT
[email protected]

July 21, 1999

Abstract

Reinforcement learning agents attempt to learn and construct a decision policy which maximises some reward signal. In turn, this policy is directly derived from long-term value estimates of state-action pairs. In environments with real-valued state-spaces, however, it is impossible to enumerate the value of every state-action pair, necessitating the use of a function approximator in order to infer state-action values from similar states. Typically, function approximators require many parameters for which suitable values may be difficult to determine a priori. Traditional systems of this kind are also then bound to the fixed limits imposed by the initial parameters, beyond which no further improvements are possible. This paper introduces a new method to adaptively increase the resolution of a discretised action-value function based upon which regions of the state-space are most important for the purposes of choosing an action. The method is motivated by similar work by Moore and Atkeson but improves upon the existing techniques insofar as it: i) is applicable to a wider class of learning tasks, ii) does not require transition or reward models to be constructed and so can also be used with a variety of model-free reinforcement learning algorithms, iii) continues to improve upon policies even after a feasible solution to the learning problem has been found.


1 Introduction

Reinforcement learning is the problem of learning to map situations to actions in order to maximise some reward signal. This mapping, referred to as a policy, exhaustively specifies an immediate course of action for every situation. Unlike most other forms of machine learning, the learner (an agent) isn't told the best action or even some sample of the best actions, but must evaluate and adapt its policy through trial and error via interactions with its environment (or a model of it). The agent perceives its environmental state, chooses some action and notes the effect: a new state and some immediate reward signal. However, we are seldom so fortunate as to be rewarded for everything we do. In many tasks, useful reward information is only received after some long sequence of actions, and the agent then has the difficult task of associating early states and actions with some distant delayed reward. This is the credit assignment problem [9].

In order to reflect the future consequences of taking an action in a state, the agent constructs an action-value function which expresses an estimate of the expected long-term future utility for taking the action under the agent's current policy. These estimates can then be used to improve the policy at any state simply by finding which action gives the highest value. For high-dimensional and real-valued state-spaces, however, keeping an exhaustive table of these values rapidly becomes impossible. The common solution is to use a function approximator to infer values of states (or state-action pairs) from similar states. Choosing suitable parameters for function approximators is often very difficult without detailed prior knowledge of the problem.

The remainder of this paper introduces a new method to adaptively increase the resolution of the function in areas of the state-space which are expected to be most important for the purposes of choosing actions. The intention is that this will remove the need to choose some of the initial design-time parameters (or make the choice of their values an arbitrary one). In addition, because the features being sought by the function approximator are independent of the size and dimensionality of the state-space, it is expected that this method will also allow reinforcement learning in a larger and more general class of problems than has previously been possible.

2 Markov Decision Processes

Formally, the reinforcement learning problems discussed here can be modelled as a Markov Decision Process (MDP) with the following constituent parts [21, 7]:

    

- a (possibly infinite) set of environment states, $S = \{ s \in \langle X_1, X_2, \ldots, X_n \rangle \}$,
- a finite discrete set of possible actions $A(s)$, $\forall s \in S$,
- a state transition probability function, $P : S \times A \times S \to [0, 1]$,
- a reward function, $R : S \times A \times S \to \mathbb{R}$, and
- a policy, $\pi : S \to A$.

The transition probability function, $P$, is defined by the agent's environment and specifies the probability of entering a new state $s'$ at the next time step given that the agent started in a prior state $s$ and took action $a$. This allows for environments with stochastic dynamics. Similarly, $R(s, a, s')$ may be a random variable taken from a fixed distribution. In these cases, where the environment's dynamics and rewards are uncertain, the agent has the task of maximising the expected future reward received. Broadly, reinforcement learning algorithms can be divided into two classes: model-based and model-free. Simply put, the latter do not attempt to construct models of the transition probability and reward functions, and are the kind used in the experiments presented later in this paper.
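As an illustration of how these components look to a model-free learner, here is a minimal interface sketch (my own framing for this summary, not code from the paper; the class and method names are assumptions):

```python
from abc import ABC, abstractmethod

class MDP(ABC):
    """A sampling view of the MDP <S, A(s), P, R>.

    A model-free learner never inspects P or R directly: it can only request
    the available actions and sample transitions, observing s' ~ P(.|s, a)
    and the accompanying reward R(s, a, s')."""

    @abstractmethod
    def actions(self, s):
        """Return the finite action set A(s) available in state s."""

    @abstractmethod
    def sample_transition(self, s, a):
        """Return (s_next, reward): a successor drawn from P(.|s, a)
        together with its (possibly random) reward."""
```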

3 Action-Values

In order to solve the credit assignment problem, the most common reinforcement learning algorithms to date attempt to estimate the long-term values of state-action pairs (referred to as Q-values hereinafter). For example, consider the simpler task of finding a Q-value under a fixed policy $\pi$ (denoted $Q^\pi$). The expected value of taking an action in a state may be defined to be the expected sum of future discounted returns:

$$ Q^\pi(s_t, a) = E_\pi\left\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \right\} \qquad (1) $$

where $E_\pi$ is an expectation given that $\pi$ is always followed, $r_t = R(s_t, a_t, s_{t+1})$ is the immediate reward received for taking the action suggested by the policy, and $\gamma$ is a discount factor ($0 \le \gamma \le 1$). For $\gamma < 1$, discounting may be thought of in two ways: i) a preference for more immediate rewards over those received more steps into the future or, ii) a mathematical trick to prevent the values of states diverging to infinity in cases where an agent is permitted to collect reward indefinitely. Note that setting $\gamma < 1$ also necessarily redefines the learning task. Such an agent has the task of maximising the expected discounted rewards it collects.
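As a small worked illustration of point ii) (my own aside, not in the original text): if every reward satisfies $|r_t| \le r_{\max}$, the discounted return is bounded by a geometric series,

$$ \left| \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \right| \;\le\; r_{\max} \sum_{k=0}^{\infty} \gamma^k \;=\; \frac{r_{\max}}{1-\gamma}, $$

so, for example, with $\gamma = 0.9$ and $r_{\max} = 1$ no return can exceed 10, whereas with $\gamma = 1$ the sum may grow without bound.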

Typically, one can expect different optimal policies for different values of $\gamma$, though in most tasks $\gamma$ is set to 1 or just less than one. Based upon the returns measure given in equation 1, dynamic programming can be used to iteratively find predictions of $Q^\pi$ based upon the action values in the expected immediate successor states [2][19, p. 70–76]:

$$
\begin{aligned}
Q^\pi(s, a) &= E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a \right\} \\
            &= E_\pi\left\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_t = s, a_t = a \right\} \\
            &= \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma\, E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_{t+1} = s' \right\} \right] \\
            &= \sum_{s'} P^a_{ss'} \left\{ R^a_{ss'} + \gamma\, Q^\pi(s', a') \right\} \qquad (2)
\end{aligned}
$$

where $a' = \pi(s')$.

That is, the long-term expected value for taking action $a$ in state $s$ is the expectation of the immediate reward received plus the discounted future rewards which can be gained from the expected successor states, given that policy $\pi$ is followed thereafter, or simply:

$$ Q^\pi(s_t, a) = E_\pi\left\{ r_{t+1} + \gamma\, Q^\pi(s_{t+1}, a') \right\} \qquad (3) $$

Starting with initial arbitrary estimates, repeatedly re-approximating every Q-value using equation 2 has been shown to converge upon correct estimates for $Q^\pi$. Of course, it is more interesting to learn the action-values for the optimal policy, $Q^*$. This is defined as the expected value for taking an action and doing the best thing thereafter:

$$ Q^*(s_t, a) = E\left\{ r_{t+1} + \gamma \max_{a' \in A(s_{t+1})} Q^*(s_{t+1}, a') \right\} \qquad (4) $$

Although it is possible to also find the exact values for this using dynamic programming, this requires that we know both $P$ and $R$ in advance. Instead, the model-free algorithm, 1-step Q-learning [21], iteratively learns the Q-function by modifying the Q-value of a state-action pair, $(s, a)$, towards a new estimate gained from the Q-value for the best action in the observed successor state $s'$:

$$ Q(s, a) := Q(s, a) + \alpha \underbrace{\Big[ \overbrace{r_{t+1} + \gamma \max_{a' \in A(s')} Q(s', a')}^{\text{target returns estimate}} - Q(s, a) \Big]}_{\text{temporal difference error}} \qquad (5) $$

This process, called temporal difference learning [17], relies upon making the re-approximations of $Q$ at $s$ using $Q$ at $s'$ with the frequency defined by $P^a_{ss'}$. This can be achieved in a very natural way without knowing $P$, simply by making the re-approximations after every observed transition in the environment. Learning occurs after, and on-line with, every experience. In equation 5, $\alpha$ is a learning rate parameter which should be slowly declined over time to ensure that $Q$ converges to $Q^*$ (an optimal action-value estimate under an optimal policy $\pi^*$). Strictly speaking, convergence also requires that state-action pairs are experienced infinitely often, and that the state-space is discrete, finite and has the Markov property (no additional information about the current state can be gained from the previously visited history of states). The entire purpose of constructing the Q-function in this way is to transform the original problem of finding an optimal policy into one of estimating future utilities. Once $Q^*$ is known, finding the optimal policy is trivial since one can simply choose the action with the highest value:

$$ \pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a) \qquad (6) $$
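To make the update concrete, the following is a minimal tabular sketch of the 1-step Q-learning rule of equation 5 and the greedy policy extraction of equation 6. It is illustrative only (not the author's implementation); the environment interface (`reset`, `step`, `actions`) and the epsilon-greedy behaviour policy are assumptions made for the example.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode of tabular 1-step Q-learning (equation 5).

    Q maps (state, action) pairs to value estimates, e.g. defaultdict(float).
    env is assumed to provide reset() -> s, step(a) -> (s', r, done) and a
    finite action set env.actions."""
    s = env.reset()
    done = False
    while not done:
        # behaviour policy: explore with probability epsilon, else act greedily
        if random.random() < epsilon:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        # target returns estimate and temporal difference error (equation 5)
        target = r + gamma * max(Q[(s_next, act)] for act in env.actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q

def greedy_policy(Q, actions):
    """Policy extraction (equation 6): choose the highest-valued action."""
    return lambda s: max(actions, key=lambda a: Q[(s, a)])
```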

4 Action-Value Function Approximation

One obvious limitation to learning in the above way is that exhaustively keeping a Q-value for every state-action pair quickly becomes impossible. As the number of dimensions of the state-action space grows linearly, so the number of possible states grows exponentially. This is known as Bellman's curse of dimensionality [3, 11, 10]. Similarly, in spaces which are very large or are real-valued in some dimensions it is impossible to explicitly enumerate every state. Memory requirements are not the only issue. Having even moderately large numbers of discrete states also requires that each individual state is visited, and visited many times, before value estimates approach convergence¹.

¹ This may not be strictly true for some model-based methods which attempt to generalise a state transition model from experience. However, in this case, large numbers of states invariably mean that the computational cost of generating the value function is intolerably high.
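To make this growth concrete, a small illustrative calculation (my own example, not from the paper): a uniform discretisation with $b$ bins per dimension over $d$ state dimensions needs $b^d$ cells per action.

```python
def uniform_grid_size(bins_per_dim: int, num_dims: int, num_actions: int) -> int:
    """Number of Q-values a uniform discretisation must store."""
    return (bins_per_dim ** num_dims) * num_actions

# Adding one dimension multiplies the table by bins_per_dim; doubling the
# resolution of a 6-dimensional space multiplies it by 2**6 = 64.
print(uniform_grid_size(10, 2, 3))   # 300
print(uniform_grid_size(10, 6, 3))   # 3000000
```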


A number of approaches to generalising value and Q functions have been tried previously with mixed success. By far the most common method is to perform uniform discretisation and assume that each region approximates a state in a discrete Markov process [21, 8, 16]. Many other methods exist, including coarse tile-coding [18, 21, 15, 14], memory-based methods [1, 14], neural networks with backpropagation [20, 5] and recurrent networks [13], to name just a few.

All the above methods have the requirement that the designer of the system needs to decide upon various parameters in advance. These include: appropriate scaling of dimensions, levels of generalisation (kernel and tile sizes), available resources (nodes in a network, number of tiles, density of data), and more. Not only may these be hard to determine a priori, but fixing these parameters at design-time imposes an arbitrary fixed limit on the eventual performance of the system. In real-valued systems with non-linear dynamics, it is often the case that it is necessary to represent value functions in high detail only in relatively few and disparate parts of the space. Uniformly discretising at the highest resolution yields high memory requirements and slow learning. At the lowest, the best possible performance of the system may be extremely poor.

Santamaria, Sutton and Ram [14] present a method of non-uniform pre-allocation of function approximator resources by applying a skewing function to its input features. Although this was shown to work well even where actions are chosen from some real-valued space, designing such a skewing function is no trivial matter and again requires considerable prior knowledge. The method presented here also employs non-uniform allocation of resources but does not require that the designer specify the distribution a priori. Also, unlike the methods discussed above, the initial parameters for the scaling of the inputs, the degree of generalisation and the available resources are far more arbitrary since they are all adapted as learning progresses.

5 Variable Resolution Model-Free Function Approximation

In order to adapt resolution on-line, the algorithm uses a variation of a kd-tree [6]. A kd-tree is a generalisation of a binary tree and, for the purposes here, a node in the tree represents a homogeneous region of the state-space, which may be real-valued in any or all of its dimensions. The root node represents the entire space, each branch splits the parent region into one of two discrete sub-spaces along a single dimension, and only the leaf nodes contain actual data about their particular small subspace (see figure 1).

Figure 1: A kd-tree partitioning of a two dimensional space. [Diagram showing the tree's regions, branches and leaf-node data.]

The decision of whether to further divide a region is based upon how important the region appears to be for the purposes of choosing actions.
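A minimal sketch of such a tree (illustrative only; the field and method names are my own, not the paper's):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KDNode:
    """A kd-tree node. Internal nodes split a single dimension at a
    threshold; only leaves hold data about their hyper-rectangle."""
    low: list                                   # lower corner of the region
    high: list                                  # upper corner of the region
    split_dim: Optional[int] = None             # None for a leaf node
    split_value: Optional[float] = None
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None
    data: dict = field(default_factory=dict)    # leaf data, e.g. Q-values

    def leaf_for(self, state):
        """Descend from this node to the leaf region containing `state`."""
        node = self
        while node.split_dim is not None:
            if state[node.split_dim] < node.split_value:
                node = node.left
            else:
                node = node.right
        return node
```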

Consider the following simple learning task: an agent should maximise the discounted reward, where:

$$ S = \{ s \mid 0^\circ \le s < 360^\circ \}, \qquad A = \{ L, R \} \;\text{(that is, ``go left'' and ``go right'')}, $$

$$ P^a_{ss'} = \begin{cases} 1, & \text{if } s' = s + 15^\circ \text{ and } a = R, \\ 1, & \text{if } s' = s - 15^\circ \text{ and } a = L, \\ 0, & \text{otherwise,} \end{cases} \qquad (7) $$

$$ R^a_{ss'} = \sin(s'), \qquad (8) $$

$$ \gamma = 0.9. $$

The world is circular such that $f(0^\circ) = f(360^\circ)$. Although this is a very simple problem, finding and representing good estimates of the optimal Q-function to any degree of accuracy may prove difficult for some classes of function approximator. For instance, it is non-differentiable and not easy to approximate by a polynomial. However, of particular interest in this and many other problems is the apparent simplicity of the optimal policy compared to the complexity of its Q-function:

Figure 2: The optimal Q-function for Sinworld. The decision boundaries are at $s = 90^\circ$ and $s = 270^\circ$. Note that the maxima and minima are one step away from the maxima and minima of the reward function: Q-functions only represent information about future reward. [Plot of $Q(s,L)$, $Q(s,R)$ and $\sin(s)$ against state $s$.]

$$ \pi^*(s) = \begin{cases} L, & \text{if } 90^\circ \le s < 270^\circ, \\ R, & \text{otherwise} \end{cases} \qquad (9) $$
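For reference, a minimal simulation of this task (an illustrative sketch following the definitions above, not the author's code; the class name and interface are assumptions):

```python
import math
import random

class Sinworld:
    """Circular Sinworld: states are angles in [0, 360), actions move 15
    degrees left or right, and the reward is the sine of the next state."""
    actions = ("L", "R")

    def __init__(self, offset=0.0):
        # `offset` models the random phase used in the experiments below;
        # the optimal policy of equation 9 corresponds to offset = 0.
        self.offset = offset
        self.s = random.uniform(0.0, 360.0)

    def reset(self):
        self.s = random.uniform(0.0, 360.0)
        return self.s

    def step(self, action):
        self.s = (self.s + (15.0 if action == "R" else -15.0)) % 360.0
        reward = math.sin(math.radians(self.s + self.offset))
        return self.s, reward, False    # the task never terminates

def optimal_policy(s):
    """Equation 9 (offset = 0): go left between 90 and 270 degrees."""
    return "L" if 90.0 <= s < 270.0 else "R"
```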

It is trivial to construct and learn a two-region action-value function which finds the optimal policy given only a few experiences. This, of course, relies upon knowing the decision boundaries (where $Q(s,L)$ and $Q(s,R)$ intersect) in advance. Decision boundaries are used to guide the partitioning process since it is here that one can expect to find improvements in policies at a higher resolution; in areas of uniform policy, there is no performance benefit for knowing that the policy is the same in twice as much detail. Whilst it is true that, in general, we cannot determine $\pi^*$ without first knowing $Q^*$, in practical cases it is often possible to find near-optimal or even optimal policies with very coarsely represented Q-functions. A good estimate of $\pi^*$ is found if, for every region, the best Q-value in a region is, with some minimum degree of confidence, significantly greater than the other Q-values in the same region. Similarly, there is little to be gained by knowing more about regions of space where there is a set of two or more near-equivalent best actions which are clearly better than the others. In this case the expected loss for not knowing which action is exactly best is small. To cover both cases, decision boundaries are defined to be those parts of a Q-function where values diverge after intersecting.

Figure 3: The optimal Q-function for Stepworld. The agent can move a distance of 1 left or right each time step. A reward of 1 is given if $6 \le s < 14$ and 0 otherwise. [Plot of $Q(s,L)$, $Q(s,R)$ and $R(s)$ against state $s$.]

Figure 3, for example, shows an example of a problem which contains equivalent actions over a large area of the space. The decision boundaries are at $s = 7$ and $s = 13$, where $Q(s,L)$ and $Q(s,R)$ diverge. Finally, it is important to note that in a large class of practical learning problems in real-valued state-spaces, the complexity of representing decision boundaries in this manner is largely independent of the size and dimensionality of the space. Because partitioning is not required where there is no decision boundary, the memory requirements are expected to increase only with the hyper-surface area of the decision boundaries. The exact nature of this growth in resources depends upon how the partitioning process is limited. This is discussed in the next section.

5.1 The Algorithm

The partitioning process considers every pair of adjacent regions in turn. The decision of whether to further divide the pair is based upon the following heuristic:


- do not consider splitting if the greedy actions² in both regions are the same,
- only consider splitting if all the Q-values for both regions are known with a reasonable degree of confidence,
- only split if, for either region, taking the recommended action of one region in the adjacent region is expected to be significantly worse than taking another, better, action in the adjacent region.

² A greedy action is the action recommended by the Q-function for a particular state, i.e. the one with the highest Q-value.

The first point ensures that partitioning only occurs at a decision boundary. The second is important insofar as the decision to split regions is based solely upon the Q-values of the regions. In practice it is very difficult to measure the confidence of Q-values since they may ultimately be defined by Q-values in currently unexplored areas of the state-action space, or in parts of the space which only appear useful at higher resolutions. For both of these reasons, the Q-function is non-stationary during learning, which itself causes problems for statistical confidence measures. The naive solution applied here is to require that all the actions in both regions under consideration must have been experienced (and so had their values re-estimated) some minimum number of times, $v_{\min}$, which is specified as a parameter of the algorithm. This also has the added advantage of ensuring that infrequently visited states are less likely to be considered for partitioning.

In the final point the assumption is made that the agent suffers some "significant loss" in discounted future rewards only if it cannot determine exactly where it is best to follow one policy over another. If the best action of one region, when taken in the adjacent region, is little better than any of the other actions in that adjacent region, then it is reasonable to assume that between the two regions the agent will not perform much better even if it could decide exactly where it is best to take each action in more detail. The "significant loss" threshold, $\Delta_{\min}$, is the second and final parameter for the decision boundary partitioning algorithm. Setting $\Delta_{\min} > 0$ ensures that the partitioning process is bounded since (at least for differentiable Q-functions), as the regions become smaller, so does the difference in Q-values between adjacent regions, which must eventually fall below $\Delta_{\min}$. In the exceptional case, where decision boundaries are caused by discontinuities, unbounded partitioning along the boundary is the right thing to do provided that there remains the expectation that the extra partitions reduce the loss that the agent will receive.

The fact that there is a boundary must mean that there is some better representation of the policy that can be achieved³. In both cases, a practical limit is also imposed by the amount of exploration available to the agent. The smaller a region becomes, the less likely it is to be visited. If all states and actions are experienced with equal frequency, then the ratio of visits to a region of a given volume to the total steps taken by the agent per unit of time is expected to decline exponentially with every partitioning.

³ This isn't true in the unlikely case that regions are already exactly separated at the boundary. But if this is the case, continued partitioning is still necessary to verify this.

The remainder of this section is devoted to a detailed description of the algorithm. To abstract from the implementation details of a kd-tree, the learner is assumed to have available the set $REGIONS$, where $reg_i \in REGIONS$, $reg_i = \langle Vol_i, Q_i, Vis_i \rangle$ and:

- $Vol_i$ is a description of the hyper-rectangle $reg_i$ covers,
- $Q_i : A \to \mathbb{R}$ is the action-value approximation for all states in $Vol_i$ (i.e. $Q(s, a) = Q_i(a)$, where $s \in Vol_i$),
- $Vis_i : A \to \mathbb{Z}$ records the number of times an action has been selected within $Vol_i$ since the region was created.

The choice of whether to split a region is made as follows:

1. Find the set of adjacent region pairs, $ADJ$, such that:
   $ADJ = \{ \langle reg_i, reg_j \rangle \mid reg_i, reg_j \in REGIONS \wedge neighbours(reg_i, reg_j) \}$
2. Let $SPLIT$ be the set of regions to divide (initially empty).
3. For each pair of adjacent regions, $\langle reg_i, reg_j \rangle \in ADJ$:
   i) Find the recommended action of each region and the expected loss given that, for some states in the region, it is better to take the best action in the adjacent region:
      $a_i = greedy(Q_i)$
      $a_j = greedy(Q_j)$
      $\Delta_i = | Q_i(a_i) - Q_i(a_j) |$
      $\Delta_j = | Q_j(a_j) - Q_j(a_i) |$
   ii) If $a_i \neq a_j$, and $\Delta_i \ge \Delta_{\min}$ or $\Delta_j \ge \Delta_{\min}$, and $\forall a \in A : Vis_i(a) \ge v_{\min} \wedge Vis_j(a) \ge v_{\min}$, then $SPLIT := SPLIT \cup \{ reg_i \} \cup \{ reg_j \}$.
4. Partition every region in $SPLIT$ at the midpoint of its longest dimension, maintaining the prior estimates for each Q-value in the new regions.
5. Mark each new region as unvisited: $Vis(a) := 0$ for all $a$.

A good strategy for dividing regions is to always divide along the longest dimension [10]. Some other strategies, such as dividing in the dimension of the regions' adjacent faces, work particularly poorly, causing long, thin regions to appear which are repeatedly split along their length.
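As an illustration of steps 3(i)-(ii) and 4, here is a minimal sketch of the split test and the bisection step (my own rendering of the pseudocode above; the Region fields follow the ⟨Vol, Q, Vis⟩ description, but the names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """A leaf region <Vol, Q, Vis>: a hyper-rectangle with one Q-value
    and one visit count per action (field names are illustrative)."""
    low: list
    high: list
    Q: dict = field(default_factory=dict)     # action -> value estimate
    Vis: dict = field(default_factory=dict)   # action -> visit count

def should_split(reg_i, reg_j, actions, delta_min, v_min):
    """Steps 3(i)-(ii): does this adjacent pair straddle a decision
    boundary whose expected loss is worth resolving further?"""
    a_i = max(actions, key=lambda a: reg_i.Q[a])          # greedy(Q_i)
    a_j = max(actions, key=lambda a: reg_j.Q[a])          # greedy(Q_j)
    if a_i == a_j:
        return False                                      # no boundary here
    delta_i = abs(reg_i.Q[a_i] - reg_i.Q[a_j])            # expected losses
    delta_j = abs(reg_j.Q[a_j] - reg_j.Q[a_i])
    if delta_i < delta_min and delta_j < delta_min:
        return False                                      # loss too small
    # confidence: every action tried at least v_min times in both regions
    return all(reg_i.Vis.get(a, 0) >= v_min and reg_j.Vis.get(a, 0) >= v_min
               for a in actions)

def split_region(reg):
    """Step 4: bisect at the midpoint of the longest dimension, copying
    the parent's Q-estimates into both children (visit counts reset)."""
    d = max(range(len(reg.low)), key=lambda k: reg.high[k] - reg.low[k])
    mid = (reg.low[d] + reg.high[d]) / 2.0
    left = Region(list(reg.low), list(reg.high), dict(reg.Q), {a: 0 for a in reg.Q})
    right = Region(list(reg.low), list(reg.high), dict(reg.Q), {a: 0 for a in reg.Q})
    left.high[d] = mid
    right.low[d] = mid
    return left, right
```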

6 Experiments

In this section the variable resolution algorithm is evaluated empirically in a number of learning tasks. In all the experiments the 1-step Q-learning algorithm is used. Although faster learning is expected with other algorithms (see [21, 15, 12]), Q-learning is employed here because of its simplicity.


6.1 Sinworld

In the Sinworld experiment (described above) the agent has the task of learning the policy which gets it to (and keeps it at) the peak of a sine curve in the shortest time. To prevent a lucky partitioning of the state-space which exactly divides the action-value function at the decision boundaries, a random offset for the reward function was chosen for each trial: $\sin'(s) = \sin(s + random)$. Additionally, at each step the agent was started at a random position and experienced a random action. This avoided the possibility of any unfair advantage a learner might have if using an exploration policy which relies upon previously learned values.

The performance measure used was the average of the rewards received over 30 repetitions of a 20 step episode under the currently recommended policy. Each episode started from a random state and the results are averaged over 100 trials. In all the experiments conducted here, learning was suspended during performance evaluation and the exploration-exploitation trade-off is ignored.

Figure 4 compares the variable resolution algorithm against a number of fixed resolution representations. Of particular interest is the rate of the learning. Although in the initial stages performance improves slowly, the variable resolution method approaches its convergent performance far more quickly than those representations which finish with similar performance (see table 1). Figure 6 shows the final partitioning after 10000 experiences. The highest resolution partitions are seen at the decision boundaries (where $Q(s,L)$ and $Q(s,R)$ intersect). At $s = 90^\circ$ partitioning has stopped as the expected loss in discounted reward for not knowing the area in greater detail is less than $\Delta_{\min}$. The decline in the partitioning rate as the boundaries are more precisely identified can be seen in figure 5. Of most interest in this experiment is that the eventual performance is equally as good as the fixed resolution methods, yet the designer had not specified what resolution was required.
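A sketch of this evaluation protocol (illustrative; the Sinworld class and optimal_policy helper from the earlier sketch are assumed, and the exact averaging is my reading of the description above):

```python
def evaluate(policy, make_env, episodes=30, steps=20):
    """Average total reward of the recommended policy over several short
    episodes, each starting from a random state, with learning suspended."""
    total = 0.0
    for _ in range(episodes):
        env = make_env()
        s = env.reset()
        for _ in range(steps):
            s, r, _ = env.step(policy(s))
            total += r
    return total / episodes

# e.g. evaluate(optimal_policy, Sinworld)
```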

6.2 Mountain Car Task

In the mountain car task the agent has the problem of driving an under-powered car to the top of a steep hill [15] (shown in figure 7)⁴. The actions available to the agent are to apply an acceleration, deceleration or neither (coasting) to the car's engine. However, even at full power, gravity provides a stronger force than the engine can counter.

⁴ This experiment reproduces the environment described in [19, p. 214].

Figure 4: Comparison of initial learning performance for the variable vs. fixed resolution representations in the Sinworld task. [Plot of performance against experiences for the variable representation and for 2, 8, 16, 32 and 128 state representations.]

States     2     4      8      16     32     64     128    Variable
Mean       6.13  14.73  16.29  16.58  16.72  16.71  16.58  16.75
Std. Dev.  9.13  4.07   3.36   3.31   3.26   3.35   4.08   3.25

Table 1: Performance after 10000 experiences in the Sinworld task. Only the 128 state representation appears not to have fully converged.

Δ_min                   0.5
v_min                   8
α                       0.1
γ                       0.9
Q_{t=0}                 10
Partition test freq.    Every episode
Initial regions         2

Table 2: Sinworld experiment parameters.

Figure 5: The partitioning rate in the Sinworld task. [Plot of performance and number of regions against experiences.]

10 Q(s,’L’) Q(s,’R’) R(s) 8

Value

6

4

2

0

-2 0

1

2

3

4

5

6

7

State, s

Figure 6: The nal partitioning after 10000 experiences in the Sinworld experiment. The highest resolution partitions are seen at the decision boundaries (where Q(s; L) and Q(s; R) intersect).


In order to reach the goal the agent must first reverse back up the hill, gaining sufficient height and momentum to propel itself over the far side. In order to reach the goal, the agent must first move away from it. Once the goal is reached, the episode terminates. In addition, the only feedback provided is a punishment of 1 at every time step, even when the goal is reached. The only useful information is the absence of future punishments from the terminal state. The value of any terminal state is defined to be zero since there is no possibility of future rewards. No discounting was employed ($\gamma = 1$) since one can expect the episode to eventually terminate given sufficient exploration. In this special case, the Q-values simply represent the negative of the expected number of steps to reach the goal.

Every episode started with the agent at a random position and velocity within the state-space and continued until the goal was reached. The exploration method used was $\epsilon$-greedy. This chooses the currently best action with probability $1 - \epsilon$ and takes a random action with probability $\epsilon$. In conjunction with initial optimistic estimates of Q-values, this provides good initial exploration, preferring unexplored actions as these have higher Q-values. The performance measure used was the number of steps taken to reach the goal from random starting positions using the currently recommended policy, averaged over 200 evaluations.

Figure 8 shows the Q-value of the recommended action for each state with the partitioning after 10000 learning episodes. The cliff represents a discontinuity in the Q-function. On the low side of the cliff the agent has just enough momentum to reach the goal. If the agent reverses for a single time step at this point it cannot reach the goal and must reverse back down the hill. It is here where there is a decision boundary and a large loss for not knowing exactly where there is a change in the best action. Figure 9 shows how this region of the state-space has been discretised to a high resolution. Regions where the best actions are easy to decide upon, or where there is a smaller loss for not knowing where the decision boundaries lie, are represented more coarsely.

Due to the large discontinuity, partitioning continued long after there appeared to be a significant performance benefit for doing so (shown in figure 10). This simply reflects the performance metric used, which measures the policy as a whole from random starting positions. Agents starting on or around this discontinuity still continue to gain performance improvements. One might argue that partitioning has simply occurred in the most frequently visited states due to the greedy-action-biased exploration policy, since the most preferable paths converge along the discontinuity. Manual inspection of the regions revealed this not to be the case. Large, frequently visited cells adjoining others of a different policy remained undivided.
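A sketch of the ε-greedy selection rule just described (illustrative only; with optimistic initial Q-values, the greedy branch naturally prefers untried actions):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """Choose a random action with probability epsilon, otherwise the
    greedy (highest-valued) action for state s."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```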

The same experiment was also conducted but with the ranges of the state features chosen to be 10 times larger than previously, giving a new state-space of 100 times the original volume. Starting positions for the learning and evaluation episodes were still chosen to be inside the original volume. It was ensured that some edges of the initial regions intersected the states actually experienced in the experiment. Unless a decision boundary is visible at the initial resolution, partitioning cannot start. These changes had little effect upon the amount of memory resources used or the convergent performance (see figure 11).

Figure 7: The Mountain Car Task. [Diagram of the hill with the goal at the top and gravity acting on the car.]

Δ_min                   10
v_min                   15
α                       0.15
γ                       1
ε                       0.3
Q_{t=0}                 0
Partition test freq.    Every 20 episodes
Initial regions         4

Table 3: Mountain Car experiment parameters.

7 Related Work

KD-trees have been applied to function approximation in reinforcement learning previously. Variable-Resolution Dynamic Programming (VRDP) by Moore [11] starts with some initial model and increases the resolution of the state-space along what it currently considers to be the optimal path, given the current model and partitioning.


Figure 8: A plot of the value of each region after 10000 episodes. The value is measured as $\max_a Q(s, a)$ as this indicates the estimated number of steps to the goal, given that the best actions are always taken. [Surface plot of steps-to-go, $-\max_a Q(s,a)$, over position and velocity.]

Figure 9: The final partitioning after 10000 episodes. Position and velocity are measured along the horizontal and vertical axes respectively. [Regions are shaded by their greedy action: accelerate, coast or decelerate.]


Figure 10: The growth in the number of states over time. [Plot of performance and number of regions against episodes.]

Figure 11: Results of the mountain car task with poorly chosen ranges for the state features. This had little effect upon the final performance or the number of regions used. [Plot of performance and number of regions against episodes.]


It does this by running "mental trajectories" through the state-space and then partitioning at and around those states visited in simulation. Dynamic programming is performed at the resolution of the newly partitioned state-space in order to determine values for each region. Moore showed that adaptively partitioning the state-space allows dynamic programming to be performed in problems for which fixed, uniform resolution solutions would be totally impractical due to the expense in memory and computation time. However, it has the disadvantage of requiring that the task consists of some particular starting state and contains a finite number of optimal policies. Since almost all visitable states are on an optimal path from some other state, it is easy to argue that, if VRDP was required to construct a policy map for general starting positions, the entire state-space would eventually become partitioned at the highest possible resolution. Similarly, in many problems there are large volumes of space with equivalent best actions and therefore many optimal policies. Again, it is expected that such a region of space would eventually become partitioned to the highest resolution.

In a later paper, the Parti-Game algorithm by Moore and Atkeson [10] attempts to learn an adaptive-resolution model by partitioning at obstacles between regions in the state-space. A local controller is used to aim at adjacent regions and partitioning occurs when it is deemed that the agent has become "stuck" while moving towards the target region. This method has had great success in high dimensional, real-valued problems and can (for reasons not explained here) quickly explore the state-space to find an initial locally optimal solution. However, it requires that:

- the tasks must be specified in terms of goal regions. This denies the use of a general reward function and also prevents its use in non-terminating tasks,
- the algorithm attempts no further improvements to the policy once a solution has been found,
- designing the local greedy controller in advance may not be trivial, though Moore and Atkeson suggest that it may be possible to learn this,
- a suitable "stuck detector" may not be available for every environment. The Sinworld task introduced above and many simple grid-worlds are examples of environments which do not contain transition boundaries.

The VRDP and Parti-game algorithms are also both examples of model-based reinforcement learning and require that a transition model is either given or learned on-line.

Such models are often difficult to construct or are misleading representations of the real underlying process. This is especially true in non-Markovian environments and in poor discrete approximations of continuous Markov processes. Perhaps the only successful adaptive resolution example in model-free learning is G-learning [4], which also employed a form of decision tree to generalise over input features. Splitting occurs based upon the statistical difference of groups of states (i.e. states which appear statistically similar are not split). However, this is currently limited to use with input features of binary strings.

8 Discussion and Conclusions

The experiments above give evidence that model-free reinforcement learning can indeed be used with adaptive resolution function approximation. It may be applied to a large class of learning problems with real-valued states. However, requiring high confidence in Q-values for regions, and the fact that there is generalisation over states but not actions, means its applications are currently limited to problems where agents are allowed only a few actions.

The work can be extended in several different directions but the main question still remains of how to choose the algorithm's parameters. This is of concern since a major objective was to remove some of the burden of parameter selection from a system designer. In the experiments conducted, $\Delta_{\min}$ was easily chosen by asking, "what is the worst loss of discounted reward due to over-coarseness we are prepared to tolerate?" A lower bound on $v_{\min}$ is more difficult to decide upon since it must ultimately be a function of the confidence measure used. If the agent is overly confident in the Q-values for a region and the successor states of that region are under-estimates of their true values, then over-partitioning is likely to occur. The late discovery of a single new highly useful state can have a global effect on the shape of the Q-function and thus the position of decision boundaries. Such an event is likely to occur if i) the state-action space is poorly explored, or ii) the partitioning of a region reveals it to be more useful at the higher resolution.

Finally, this paper was presented as a method for model-free learning since there are few such examples using adaptive resolution function representation. However, it is expected to work far better with algorithms which make use of a model, since it is possible to find correct estimates for Q-values (with respect to the current model and resolution), thus eliminating all variance in the returns estimate caused by the learning algorithm (equation 5). Although this appears to remove some problems of requiring high confidence in Q-values, it only shifts the problem to having high confidence in the model of the environment.

9 Acknowledgements

I am extremely grateful to Manfred Kerber for his extensive discussion of the ideas and proofreading of the drafts of this document, and also to Jeremy Wyatt, who is responsible for my interest in the field of reinforcement learning. This work was funded by the School of Computer Science at the University of Birmingham.

References

[1] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning. AI Review, 11:75–113, 1996.
[2] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
[3] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Proceedings of Neural Information Processing Systems, volume 7, page 15. Morgan Kaufmann, January 1995.
[4] David Chapman and Leslie Pack Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In John Mylopoulos and Ray Reiter, editors, Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (IJCAI-91), pages 726–731, San Mateo, CA, 1991. Morgan Kaufmann.
[5] Robert H. Crites. Large-Scale Dynamic Optimization Using Teams of Reinforcement Learning Agents. PhD thesis, (Computer Science) Graduate School of the University of Massachusetts, Amherst, September 1996.
[6] Jerome H. Friedman, Jon L. Bentley, and Raphael A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, September 1977.
[7] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[8] Maja J. Mataric. Interaction and Intelligent Behaviour. PhD thesis, MIT AI Lab, August 1994. AITR-1495.
[9] Marvin L. Minsky. Steps towards artificial intelligence. In E. A. Feigenbaum and J. Feldman, editors, Computers and Thought, pages 406–450. McGraw-Hill, 1963. Originally published in Proceedings of the Institute of Radio Engineers, January 1961, 49:8–30.
[10] A. W. Moore and C. G. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21, 1995.
[11] Andrew W. Moore. Variable resolution dynamic programming: Efficiently learning action maps on multivariate real-valued state-spaces. In L. Birnbaum and G. Collins, editors, Proceedings of the 8th International Conference on Machine Learning. Morgan Kaufmann, June 1991.
[12] Jing Peng and Ronald J. Williams. Incremental multi-step Q-learning. In W. W. Cohen and H. Hirsh, editors, Machine Learning: Proceedings of the 11th International Conference, pages 226–232, 1994.
[13] T. Sandholm and R. Crites. Multiagent reinforcement learning in the iterated prisoner's dilemma. In Biosystems, Special Issue on the Prisoner's Dilemma, volume 37, pages 147–166, 1995.
[14] Juan Carlos Santamaria, Richard Sutton, and Ashwin Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 1998. Also appeared as Technical Report UM-CS-1996-088, Department of Computer Science, University of Massachusetts, Amherst, MA 01003.
[15] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
[16] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Technical Report CMU-CS-97-193, Carnegie Mellon University CS, December 1997. Currently being reviewed for journal publication.
[17] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, 1984.
[18] Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems 8, pages 1038–1044, 1996.
[19] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 1998.
[20] G. J. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38:58–68, 1995.
[21] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.
