Ann Math Artif Intell (2010) 59:107–123 DOI 10.1007/s10472-010-9216-8

Ranking policies in discrete Markov decision processes

Peng Dai · Judy Goldsmith

Published online: 16 November 2010 © Springer Science+Business Media B.V. 2010

Abstract An optimal probabilistic-planning algorithm solves a problem, usually modeled by a Markov decision process, by finding an optimal policy. In this paper, we study the k best policies problem: finding the k best policies of a discrete Markov decision process. The k best policies, k > 1, cannot be found directly using dynamic programming. Naïvely, finding the k-th best policy can be Turing reduced to the optimal planning problem, but the number of problems queried in the naïve algorithm is exponential in k. We show empirically that solving the k best policies problem via this reduction requires unreasonable amounts of time even when k = 3. We then provide two new algorithms. The first is a complete algorithm, based on our theoretical contribution that the k-th best policy differs from the i-th best policy, for some i < k, on exactly one state. The second is an approximate algorithm that skips many less useful policies. We show that both algorithms have good scalability. We also show that the approximate algorithm runs much faster and finds interesting, high-quality policies.

Keywords Probabilistic planning · Markov decision process · Policy ranking

Mathematics Subject Classification (2010) 90C40

P. Dai (B)
Computer Science & Engineering, University of Washington, Seattle, WA 98195-2350, USA
e-mail: [email protected]
URL: http://www.cs.washington.edu/homes/daipeng

J. Goldsmith
Department of Computer Science, University of Kentucky, Lexington, KY 40506-0046, USA
e-mail: [email protected]
URL: http://www.cs.uky.edu/~goldsmit


1 Introduction

Markov Decision Processes (MDPs) [1] are a powerful and widely-used formulation for modeling probabilistic planning problems [2, 3]. For instance, NASA researchers use MDPs to model Mars rover decision-making problems [4, 5]. MDPs are also used to formulate military operations planning [6], coordinated multi-agent planning [7], agriculture [8], etc.

An optimal planner typically takes an MDP model of a problem and outputs an optimal plan. This is not always sufficient. In many cases, a planner is expected to generate more than one solution. Furthermore, in the modeling phase, not every aspect of nature can be easily factored into a problem representation. For a NASA rover, for example, there are many safety constraints that need to be satisfied [5]. An optimal plan might have a nonzero chance of landing in a high-risk state, while another plan carries far less risk, so the slightly sub-optimal plan is preferable. Similarly, there are many decision criteria: probability of reaching the goal, expected reward, expected risk, various preferences, etc. It is also possible that an optimal policy is only optimal with respect to one criterion, and that others arise (or become clear) later. By generating a set of policies a priori, a planner can then choose a policy from the list that has good enough utility with respect to all known criteria. If all criteria are known ahead of time, it is possible to solve a multi-criteria MDP problem [9].

In this paper, we look at the k best policies problem. Given an MDP model, the problem is to find the k best policies, ranked by the expected value of the initial state, tie-broken by the "closeness" to a better policy, followed by lexicographic order of the policies. The classical optimal planning problem is the special case of the k best policies problem where k = 1. The optimal planning problem can be solved by dynamic programming. Finding the k-th best policy can be brute-force reduced to exponentially many instances of the optimal planning problem. Our experiments show that solving the k best policies problem this way requires unreasonable time even when k = 3.

A very similar problem has been explored by Nielsen et al. [10–12]. Nielsen and Kristensen observed that the problem of finding optimal history-dependent policies (maps from the state space crossed with the time step to the action space) can be modeled as finding "a minimum weight hyperpath" in directed hypergraphs. A vertex in the hypergraph represents a state of the MDP at a particular time; the hypergraphs are, therefore, acyclic. They present an elegant and efficient algorithm for finding the k best time-dependent policies in a finite-horizon MDP with discrete actions and states. However, their algorithm cannot handle general MDPs that contain probabilistic cycles, so its usefulness is limited.

Our new solution to the k best policies problem follows from the property: the k-th best policy differs from a better policy on exactly one state. We propose an original algorithm for the k best policies problem that leverages this property. We demonstrate both theoretically and empirically that the new algorithm has low complexity and good scalability. We also introduce an approximate algorithm, which generates more useful policies. We show that the approximate algorithm runs much faster and finds interesting, high-quality policies.


2 Background

2.1 Markov decision processes

AI researchers often use MDPs to formulate probabilistic planning problems.

Definition 1 An MDP is defined as a tuple ⟨S, s0, G, A, T, C⟩, where

– S is a finite set of discrete states,
– s0 ∈ S is the start state or initial state,1
– G ⊆ S is the set of goal states,
– A is a finite set of all applicable actions,
– T : S × A × S → [0, 1] is the transition function describing the domain dynamics (T_a(s′|s) is the probability of transitioning from state s to state s′ on action a), and
– C : S × A → R is the cost of action transitions.

The agent executes its actions in discrete time steps called stages. At each stage, the system is in one distinct state s ∈ S. The agent can pick any action a from a set of applicable actions Ap(s) ⊆ A, incurring a cost C(s, a) > 0. The action takes the system to a new state s′ stochastically, with probability T_a(s′|s). The horizon of an MDP is the number of stages for which costs are accumulated. We focus our attention on a special class of MDPs called stochastic shortest path (SSP) problems. The horizon in such an MDP is indefinite, meaning that policy execution ends after finitely many steps with probability 1, and costs are accumulated with no discounting. An SSP has two distinguished components:

– s0 is the initial state, and
– G ⊆ S is the set of sink goal states; reaching any g ∈ G terminates execution.

The cost of the execution is the sum of all costs along the path from s0 to g. Despite its simplicity, SSP is a general representation. Any infinite-horizon, discounted-reward MDP can easily be converted to an undiscounted SSP [13].2 To solve the MDP we need to find an optimal policy (π∗ : S → A), a contingent execution plan that reaches a goal state with the minimum expected cost. We evaluate any policy π by its value function:

V_\pi(s) = C(s, \pi(s)) + \sum_{s' \in S} T_{\pi(s)}(s' \mid s)\, V_\pi(s').    (1)
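As an illustration of (1), the sketch below evaluates a fixed policy by successive approximation. The dictionary-based encoding (states, goals, T, C) and the function name are assumptions made for this example, not the paper's implementation, and the iteration converges only when the policy is proper.

```python
# A minimal sketch of policy evaluation, Eq. (1), by successive approximation.

def policy_evaluation(states, goals, T, C, policy, eps=1e-6):
    """Approximate V_pi for a proper policy: T[(s, a)] is a list of
    (successor, probability) pairs, C[(s, a)] is the action cost, and
    goal states are absorbing with value 0."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in goals:
                continue
            a = policy[s]
            backup = C[(s, a)] + sum(p * V[s2] for s2, p in T[(s, a)])
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < eps:  # no state value changed by more than eps
            return V
```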

1 Note that a probability distribution over several initial states can be achieved by adding a nominal state with one applicable action, so that the successor probabilities correspond to the desired distribution over possible initial states.

2 To construct an equivalent SSP for a discounted MDP with discount factor α, the high-level idea is to first create a goal state g. Then for any state s, add a transition from s to g with probability 1 − α. For any transition T_a(s′|s), construct a transition from s to s′ with probability α · T_a(s′|s). See [13, pp. 39–40] for the complete proof.


Any optimal policy must satisfy the following system of Bellman equations:

V^*(s) = 0 \quad \text{if } s \in G, \text{ else}
V^*(s) = \min_{a \in Ap(s)} \Big[ C(s, a) + \sum_{s' \in S} T_a(s' \mid s)\, V^*(s') \Big].    (2)

The corresponding optimal policy can be extracted from the value function:

\pi^*(s) = \operatorname{argmin}_{a \in Ap(s)} \Big[ C(s, a) + \sum_{s' \in S} T_a(s' \mid s)\, V^*(s') \Big].    (3)
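A minimal value-iteration sketch for (2) and (3), using the same illustrative encoding as the earlier sketch (Ap maps each state to its applicable actions). It assumes the goal is reachable from every state so that the iteration converges; it is not the authors' code.

```python
# Value iteration (Eq. 2) followed by greedy policy extraction (Eq. 3).

def value_iteration(states, goals, Ap, T, C, eps=1e-6):
    def q(s, a, V):
        return C[(s, a)] + sum(p * V[s2] for s2, p in T[(s, a)])

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in goals:
                continue
            best = min(q(s, a, V) for a in Ap[s])   # Bellman backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:                              # converged to within eps
            break
    policy = {s: min(Ap[s], key=lambda a: q(s, a, V))
              for s in states if s not in goals}
    return V, policy
```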

2.2 Dynamic programming

We wish to consider restricted MDPs in which some actions are fixed.

Definition 2 Let M = ⟨S, s0, G, A, T, C⟩. A sub-problem M′ with flexible state space S′ ⊆ S with respect to a policy π is an MDP M^π_{S′} = ⟨S, s0, G, A^π_{S′}, T, C⟩, where A^π_{S′} ⊆ A assigns a single action, π(s), to each s ∈ S − S′, and all available actions Ap(s) to each s ∈ S′.

Given M, S′, π, the sub-MDP M′, and a policy π′, we can consider π′ as a policy over M′ with two parts: the actions for s ∈ S − S′, on which π′ has no choices (since those states have exactly one action each in M′), and the policy π′_{M′} over M′, mapping s ∈ S′ to ∪_{s∈S′} Ap(s). We call π′_{M′} a sub-policy with respect to S′.

By Bellman's principle of optimality, an optimal policy π∗ satisfies the following necessary and sufficient condition: for any sub-problem with respect to π∗, the corresponding sub-policy is also optimal.

Many optimal MDP algorithms are based on dynamic programming. Its usefulness was first proved by a simple yet powerful algorithm called value iteration (VI) [1]. Value iteration first initializes the value function arbitrarily. The values are then updated iteratively using an operator called Bellman backup to create successively better approximations per state per iteration. Value iteration stops updating when the value function converges (one future backup can change a state value by at most ε, a pre-defined threshold).

Another algorithm, named policy iteration (PI) [14], starts from an arbitrary policy and iteratively improves the policy. Each iteration of PI consists of two sequential steps. The first step, policy evaluation, finds the value function of the current policy. Values are calculated either by solving the system of linear equations given by (1) for the current policy (in the original PI algorithm), or by iteratively updating the value function as in VI until it converges (modified policy iteration [15]). The second step, policy improvement, updates the current policy by choosing a greedy action for each state independently. The update uses the previously calculated value function and does a one-step lookahead. PI is guaranteed to terminate after finitely many iterations, when the policy improvement step does not change the policy. The policy at that point is optimal. In this paper, we consider sub-optimal policies.
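The following sketch of policy iteration reuses the policy_evaluation helper and the encoding assumed in the earlier sketches; the strict-improvement test is one simple way to guarantee termination and is our choice for this illustration, not a detail taken from the paper.

```python
# Policy iteration: evaluate the current policy, then improve it greedily,
# one state at a time, until no state's action changes.

def policy_iteration(states, goals, Ap, T, C, init_policy, eps=1e-6):
    policy = dict(init_policy)
    while True:
        V = policy_evaluation(states, goals, T, C, policy, eps)   # evaluation step
        changed = False
        for s in states:                                          # improvement step
            if s in goals:
                continue
            q = {a: C[(s, a)] + sum(p * V[s2] for s2, p in T[(s, a)])
                 for a in Ap[s]}
            best = min(q, key=q.get)
            if q[best] < q[policy[s]] - 1e-12:   # switch only on strict improvement
                policy[s] = best
                changed = True
        if not changed:                          # fixed point: policy is optimal
            return policy, V
```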


3 k Best policies problem

Classical dynamic programming successfully finds one optimal policy of an MDP in time polynomial in |S| and |A| [16, 17]. In this paper, we find the k best policies of an MDP. We first give the formal definition of the k best policies problem. Then we introduce the main theoretical contribution of the paper by proving a useful result about the k-th best policy.

Definition 3 A policy graph, Gπ = (V, E), for an MDP with the set of states S and policy π is a directed, connected graph with vertices V ⊆ S, where s0 ∈ V, and for any s ∈ S, s ∈ V iff s is reachable from s0 under policy π. Furthermore, for all s, s′ ∈ V, ⟨s, s′⟩ ∈ E (the arcs of the policy graph) if T_{π(s)}(s′|s) > 0.

We say s′ is a policy descendant of s with respect to π if there exists a path in Gπ from s to s′. Furthermore, let Policydesc(s, π) = {s′ ∈ V | ∃ a path from s to s′ in Gπ}.

We represent a policy as a vector of actions, and define the distance between policies and the lexicographic order on policies.

Definition 4 Given two policies π and π′, we can consider them as vectors of length |S| over alphabet A, and define the Hamming distance Ham(π, π′) to be the number of states on which π and π′ differ. Formally,

Ham(π, π′) = |{s ∈ S : π(s) ≠ π′(s)}|.

We also define the lexicographic order on policies: fixing an ordering on the states and on the actions, π precedes π′ lexicographically if, at the first state on which they differ, π chooses the earlier action.
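A short sketch of Policydesc(s0, π) and the Hamming distance of Definition 4, again under the illustrative encoding used above, where a policy is a dictionary mapping non-goal states to actions.

```python
# Reachability in the policy graph G_pi, plus the Hamming distance Ham(pi, pi2).

def policy_descendants(s0, policy, T, goals):
    """All states reachable from s0 under policy, i.e., Policydesc(s0, pi)."""
    reached, frontier = {s0}, [s0]
    while frontier:
        s = frontier.pop()
        if s in goals:                       # goals are sinks in the policy graph
            continue
        for s2, p in T[(s, policy[s])]:
            if p > 0 and s2 not in reached:
                reached.add(s2)
                frontier.append(s2)
    return reached

def hamming(pi, pi2):
    """Number of states on which the two policies choose different actions."""
    return sum(1 for s in pi if pi[s] != pi2[s])
```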

Next we define an ordering on policies that refines the ordering imposed by the initial value, Vπ(s0).

Definition 5 Given an MDP M and a dynamic list of p best policies generated so far, {π1, . . . , πp}, the next best policy is computed based on the following ordering ≺ on the remaining policies for M. For π, π′ ∉ {π1, . . . , πp}: π ≺ π′ if Vπ(s0) < Vπ′(s0); else if min_{j≤p} Ham(πj, π) < min_{j≤p} Ham(πj, π′); else if π precedes π′ lexicographically.

Theorem 1 Let π1, . . . , πk be the k best policies of an MDP M, k > 1. Then there is some m < k such that πk differs from πm on exactly one state.

The complete proof of Theorem 1 is provided in the Appendix.
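One way to realize the ordering of Definition 5 is as a sort key, sketched below. The `hamming` helper is from the earlier sketch, and `state_order` and `action_index` are assumed fixed orderings on states and actions used for the lexicographic tie-break; none of these names come from the paper.

```python
# Rank candidate policies by (initial-state value, Hamming distance to the
# closest already-ranked policy, lexicographic order), as in Definition 5.

def ranking_key(pi, value_at_s0, ranked, state_order, action_index):
    closest = min(hamming(pi, pj) for pj in ranked)         # min_{j<=p} Ham(pi_j, pi)
    lex = tuple(action_index[pi[s]] for s in state_order)   # lexicographic tie-break
    return (value_at_s0, closest, lex)

# Usage sketch, with candidates stored as (value, policy) pairs:
# candidates.sort(key=lambda c: ranking_key(c[1], c[0], ranked,
#                                           state_order, action_index))
```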

4 Algorithm

Consider πi, the i-th (i > 1) best policy of an MDP M. We can reduce the problem of finding πi, given {π1, . . . , πi−1}, to many optimal planning problems, each solved by dynamic programming. This yields an algorithm, presented as Algorithm 1. We call it the k best naïve algorithm (KBN), as it is a brute-force algorithm that does not use Theorem 1.

KBN is based on the following observation: the i-th best policy must differ from each of the i − 1 best policies on at least one state. Therefore, we enumerate the possible sets of state-action pairs the new policy must avoid, find an optimal policy for each thus-constrained MDP, and then take the best of those policies. For instance, given the best and second best policies, π1 and π2, to find π3, we say that either it differs from π1 on s0 and from π2 on s0, or from π1 on s0 and from π2 on s1, or . . .. In this case, we solve |S|^2 optimal planning problems, each with two fewer possible actions than the original MDP. To find the i-th best policy in this manner, we solve |S|^{i−1} of them. Each newly-computed policy is compared with the best policy computed so far, so that the number of comparisons is linear in the number of policies computed.

Algorithm 1 k best naïve (KBN)
1: Input: M (an MDP), k
2: find best policy π1 by VI
3: Π ← {π1}
4: for i ← 2 to k do
5:   πi ← best policy that differs from every policy π ∈ Π on at least one state
6:   Π ← Π ∪ {πi}
7: end for
8: return π1, . . . , πk
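A brute-force sketch of the step hidden in Line 5 of Algorithm 1, under the assumptions of the earlier sketches and reusing the value_iteration sketch: forbid one state-action pair per previously found policy, re-solve each constrained MDP, and keep the best result. This is an illustration of the reduction, not the authors' implementation.

```python
import itertools

def next_best_naive(states, goals, Ap, T, C, s0, found):
    """Best policy that differs from each policy in `found` on at least one state."""
    non_goal = [s for s in states if s not in goals]
    best_V, best_pi = None, None
    for combo in itertools.product(non_goal, repeat=len(found)):
        Ap_c = {s: list(acts) for s, acts in Ap.items()}
        feasible = True
        for pi_j, s_j in zip(found, combo):
            if pi_j[s_j] in Ap_c[s_j]:
                Ap_c[s_j].remove(pi_j[s_j])      # force a difference from pi_j at s_j
            if not Ap_c[s_j]:
                feasible = False                 # every action constrained away
                break
        if not feasible:
            continue
        V, pi = value_iteration(states, goals, Ap_c, T, C)
        if best_V is None or V[s0] < best_V[s0]:
            best_V, best_pi = V, pi
    return best_pi
```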


Supposing we use VI to solve those optimal planning problems, finding the i-th best policy with KBN has complexity |S|^{i−1} × O(VI), plus an algorithm to find the best of the new policies, which is linear in |S|^{i−1}. The overall time complexity of KBN is Σ_{i=1}^{k} (|S|^{i−1} × O(VI) + |S|^{i−1}), an exponential function of k. Some of these combinations of constraints may constrain away all actions for a particular state, so they do not yield a next-best policy. However, the next best policy must be among those computed, and will be the best such policy.

Algorithm 2 k best improved (KBI)
1: Input: M (an MDP), k
2: find best policy π1 by VI
3: P ← empty list
4: for i ← 2 to k do
5:   generate k − i + 1 distinct best policies, each of which differs from πi−1 on exactly one state and differs from {π1, . . . , πi−1}, and insert them into P in order, discarding duplicates
6:   πi ← the best policy in P
7:   delete πi from P
8: end for
9: return π1, . . . , πk

Using Theorem 1, we have a new algorithm, called k best improved (KBI). The KBI pseudo-code is shown in Algorithm 2. KBI takes as input an MDP, M, and a number, k. KBI begins by generating an optimal policy, by value iteration. As KBI runs, it maintains a set of candidate policies P, which is initially empty. To find the i-th best policy, it generates k − i + 1 distinct candidate policies from πi−1. These candidates (1) must not be duplicates of any policy in P, and (2) each differs from πi−1 on exactly one state. We then merge these k − i + 1 new policies into P and keep only the best k − i + 1 policies of P. We have the following theorem.

Theorem 2 The i-th best policy must be an element of P.

Proof From Theorem 1, the i-th (i ≤ k) best policy differs on exactly one state from one of π1, . . . , πi−1, say πj, where j < i. Therefore, it must have been generated when πj+1 was computed. Since it is the i-th best policy, it would have been among the (i − j)-th best of those policies that are one state different from πj, so it belongs to the k − j best policies added to P at stage j + 1. □

Thus, we find the i-th best policy by picking the best policy in P. We only need to keep a maximum of k − i + 1 policies in P before πi is chosen. There are (|A| − 1) × |S| policies that are exactly one state different from πi. Finding the best k − i of them has complexity |A| × |S| × O(policy evaluation), plus the complexity of keeping the list P in sorted order (O(k log k)).3 KBI computes these policies k − 1 times, so its complexity is (k − 1) × |A| × |S| × O(policy evaluation), a linear function of k. (Note that the sorting term is dominated by |A| × |S| × O(policy evaluation).) For each policy in the best-policy list, at most (|A| − 1) × |S| policies are one state away from it. If k is unknown, then the size of P is not restricted to k − i + 1, and it would require additional overhead to keep all such policies in sorted order. The size of P grows linearly in k.
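The candidate generation of Line 5 in Algorithm 2 can be sketched as follows, reusing policy_evaluation and the encoding assumed earlier. Here `keep` plays the role of k − i + 1, and the sketch assumes every candidate is a proper policy so that its evaluation converges; this is an illustration, not the paper's C implementation.

```python
def one_state_variants(pi_prev, Ap):
    """All policies at Hamming distance exactly 1 from pi_prev."""
    for s, a_cur in pi_prev.items():
        for a in Ap[s]:
            if a != a_cur:
                cand = dict(pi_prev)
                cand[s] = a
                yield cand

def kbi_candidates(states, goals, Ap, T, C, s0, pi_prev, already_found, keep):
    scored = []
    for cand in one_state_variants(pi_prev, Ap):
        if cand in already_found:                      # discard duplicates
            continue
        V = policy_evaluation(states, goals, T, C, cand)
        scored.append((V[s0], cand))
    scored.sort(key=lambda vc: vc[0])                  # best initial-state value first
    return scored[:keep]                               # the k - i + 1 best candidates
```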

3 In implementation, each policy is stored as a vector of actions. We can enumerate all vectors of Hamming distance 1 from πi−1, and discard those already in the sorted list P. For each remaining π, we use policy evaluation to find Vπ(s0), sort the policies by those values, and then lexicographically.


Table 1 Running time in seconds of KBN and KBI on nine problems with different k values

Domain       | States |S| | k=2 KBN   | k=2 KBI   | k=3 KBN   | k=3 KBI   | k=4 KBN   | k=4 KBI
DAP 1        | 625        | 6.07      | 4.13      | 2,939.52  | 8.41      | 10^6 (*)  | 12.58
Elevator 1   | 1,008      | 6.31      | 36.55     | 771.83    | 73.12     | 10^6 (*)  | 109.61
Racetrack 1  | 1,847      | 58.63     | 16.91     | 9,219.15  | 27.80     | 10^8 (*)  | 39.03
SAP 1        | 2,500      | 260.63    | 704.66    | 10^5 (*)  | 1,405.75  | 10^9 (*)  | 2,103.77
WF 1         | 2,500      | 1,180.88  | 2,276.61  | 10^6 (*)  | 4,566.13  | 10^9 (*)  | 6,856.76
WF 2         | 6,400      | 4,973.52  | 16,833.71 | 10^7 (*)  | 33,663.11 | 10^11 (*) | 50,492.53
SAP 2        | 10,000     | 9,249.85  | 14,848.27 | 10^7 (*)  | 29,353.52 | 10^11 (*) | 43,837.54
DAP 2        | 10,000     | 3,356.65  | 948.42    | 10^7 (*)  | 1,891.78  | 10^11 (*) | 2,835.19
Racetrack 2  | 21,371     | 20,005.94 | 3,484.91  | 10^8 (*)  | 6,951.93  | 10^12 (*) | 10,415.44

The running times of KBN for k > 2 are expectations, except for the three smallest problems when k = 3. KBI outperforms KBN on four of the nine problems when k = 2, and on all problems when k = 3, most by three orders of magnitude. When k = 4, the expected running time of KBN is too high for all of our benchmarks. Entries marked with (*) are expected running times.


5 Experiments

We address the following three questions in our experiments: (1) How does KBI compare with KBN on different problems and k values? (2) Does KBI scale well to large k values? (3) How different are the k best policies from the optimal policy?

We implemented KBN and KBI in C. All experiments were performed on a 2.2 GHz Dual-Core Intel(R) Core(TM)2 processor with 6 GB of memory running Linux 2.6 and gcc 4.4.1. We picked problems from five domains, namely Racetrack [18], Single-arm pendulum (SAP), Double-arm pendulum (DAP) [19], Wet-floor (WF) [20], and Elevator [21]. A threshold value of ε = 10^−6 was used. Runs of KBI were terminated if they did not produce a policy within five hours.4

5.1 Comparing KBI and KBN

We compared KBN and KBI on a suite of ten problems of various sizes. For the nine solvable problems, the running times of both algorithms when k = 2 are listed in Table 1. According to our analysis in Section 4, when k increases by 1, the running time of KBN increases by a factor of |S|. For k = 3, only the three small problems were tested on KBN. For the other problems, and for all problems when k = 4, we only included the expected running times, based on KBN's performance on the same problem with smaller k.

4 The executable and the benchmark problems are available at http://www.cs.washington.edu/homes/daipeng/kbest.html.


First, we see that KBI outperforms KBN on all problems when k = 3, by at least an order of magnitude and most by three orders of magnitude. KBI is slower than KBN on some problems when k = 2, probably because value iteration is more efficient for them than policy iteration, on which KBI is based.

Second, even for small k values the running times of KBN are prohibitively long. For example, for the Racetrack 2 problem, its expected running time is approximately ten years for k = 3 and over 20,000 years for k = 4. This shows that KBN is not suitable even for very small problems unless k is very small. KBI, on the other hand, solves both cases within three hours.

Third, we observe an interesting phenomenon. Any policy π usually contains many unreachable states, more formally, states that belong to the set S − Policydesc(s0, π). If one changes the action of one such state, a new policy π′ is generated. It is easy to see that Vπ′(s0) = Vπ(s0), since Policydesc(s0, π) = Policydesc(s0, π′) and π(s) = π′(s) for all s ∈ Policydesc(s0, π). We call π′ a trivial extension of π. Such trivially extended policies, although they have low initial state values, are less interesting, as they are essentially the same as the baseline policy. We tried k = 100 and found that, for problems outside the DAP domain, the k (> 1) best policies reported by KBI are all trivial extensions of their optimal policies.

To circumvent the trivially extended policies, we designed a new approximation algorithm. The approximation algorithm, which we call KBA, short for k best approximations, differs from Algorithm 2 only in Line 5: KBA further requires that the candidate policies differ from πi−1 on exactly one state in Policydesc(s0, πi−1). In this way, any trivial extensions of the policies in the current best list are excluded. There is no guarantee that the policies generated by KBA are the best in terms of initial state values, but in practice they are quite good.

We implemented the KBA algorithm and compared its running times with KBI on the same set of problems (Table 2). With KBA, we can quickly solve Elevator 2, which was unsolvable using KBI. This is because the sizes of its policy graphs are much smaller than |S|. KBA also outperforms KBI on eight of the nine other problems (the exception is DAP 2, where it performs about the same as KBI). The difference is as much as three orders of magnitude. For example, KBA finds the fourth best policy of the Racetrack 2 problem in 19 seconds, whereas KBI takes three hours and KBN has an expected running time of over 20,000 years.
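The restriction that KBA places on candidate generation can be expressed with the reachability helper sketched in Section 3. The test below is an illustrative helper, not from the paper: it recognizes a trivial extension, i.e., a candidate that agrees with π on every state of Policydesc(s0, π).

```python
def is_trivial_extension(pi, cand, s0, T, goals):
    """True if cand differs from pi only on states unreachable under pi,
    so that V_cand(s0) = V_pi(s0)."""
    reachable = policy_descendants(s0, pi, T, goals)
    return all(pi[s] == cand[s] for s in pi if s in reachable)
```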

Table 2 Running time in seconds of KBI and KBA on various problems with different k values

Domain       | States |S| | k=2 KBI   | k=2 KBA  | k=3 KBI   | k=3 KBA  | k=4 KBI   | k=4 KBA
DAP 1        | 625        | 4.13      | 3.23     | 8.41      | 6.60     | 12.58     | 9.97
Elevator 1   | 1,008      | 36.55     | 0.04     | 73.12     | 0.07     | 109.61    | 0.14
Racetrack 1  | 1,847      | 16.91     | 0.39     | 27.80     | 0.76     | 39.03     | 1.14
SAP 1        | 2,500      | 704.66    | 58.02    | 1,405.75  | 120.19   | 2,103.77  | 182.57
WF 1         | 2,500      | 2,276.61  | 54.90    | 4,566.13  | 122.12   | 6,856.76  | 191.07
WF 2         | 6,400      | 16,833.71 | 482.94   | 33,663.11 | 975.86   | 50,492.53 | 1,492.21
SAP 2        | 10,000     | 14,848.27 | 1,282.69 | 29,353.52 | 2,612.42 | 43,837.54 | 3,936.45
DAP 2        | 10,000     | 948.42    | 947.07   | 1,891.78  | 1,945.69 | 2,835.19  | 2,916.19
Elevator 2   | 14,976     | –         | 6.60     | –         | 12.85    | –         | 18.85
Racetrack 2  | 21,371     | 3,484.91  | 4.40     | 6,951.93  | 8.22     | 10,415.44 | 12.05

KBA solves one problem (Elevator 2) that is not solvable by KBI. It also outperforms KBI on eight of the remaining nine problems, most by three orders of magnitude.


Fig. 1 Initial state value (y-axis) of the k-th (k = 1, . . . , 100) best policies generated by KBA (x-axis) on Elevator1 and Elevator2 problems. KBA is able to generate sub-optimal policies that are of high quality

Next we investigated the quality of the policies generated by KBA. The results show that, except for the two Elevator problems, all the sub-optimal policies have the same initial state value as the corresponding optimal policy. For the small Elevator problem, the value of the 100-th policy is less than 10% higher than that of the optimal policy. For the large Elevator problem, the increase is less than 2%. The initial state values of the 100 relevant policies are shown in Fig. 1. This set of experiments shows that KBA is much more efficient than KBI, and is able to generate high-quality, non-trivially-extended policies.

5.2 The scalability of KBI and KBA

In this experiment we investigated whether the KBI and KBA algorithms scale to large k values.

Fig. 2 Running times in seconds (y-axis) of KBI when k = 2, . . . , 100 (x-axis) on four simple problems. The running times increase linearly in k for all problems


Fig. 3 Running times in seconds (y-axis) of KBA when k = 2, . . . , 100 (x-axis) on six hard problems. The running times increase linearly in k for all problems

We ran KBI on the four small problems and KBA on six of the hardest problems, for k = 100, and recorded the elapsed time at which each algorithm finished generating the i-th best policy (i = 2, . . . , k).5 Figures 2 and 3 clearly show that, for all problems, both algorithms spend time linear in k when calculating the k best policies.

5.3 How the k non-trivially-extended best policies differ from the optimal policy

We are also curious to know how the k non-trivially-extended best policies differ from the optimal policy. We analyzed the policies calculated in the previous experiment, and compared the total number of differing states, Ham(πk, π1) (k > 1), between the k-th best policy and the optimal policy, π1, for each problem. When the Ham(πk, π1) values are small for a problem, the k best policies are very similar to the optimal policy.

5 We chose four small problems for KBI, as it runs much more slowly than KBA.


Fig. 4 The total number of different states (y-axis) between the k-th non-trivially-extended best policy and the optimal policy, Ham(πk , π1 ), when k = 2, . . . , 100 (x-axis) on all ten problems. All k best policies are quite close to their π1 ’s

This shows that the optimal policy is loosely coupled, i.e., many good policies can be generated by one or a few small changes to the optimal policy. In other words, it is possible to change the actions for several states and have very little impact on the quality of the resulting policy. When the Ham(πk, π1) values are large, the optimal policy is more tightly coupled.


Or, if a sub-optimal action is chosen for a state, then in order to get a good sub-optimal plan, changes to other states are usually also required.

We plotted the Ham(πk, π1) values for all ten problems in Fig. 4. First, the Hamming distances tend to increase, though not monotonically, as k increases. This is expected, as more deviation from the optimal policy should yield a worse value. Second, all problems have relatively low Ham(πk, π1) values (< 10 for all k, on problems with state spaces ranging from 625 to 21,371 states). This shows that the top 100 non-trivially-extended best policies for these problems are quite close to the optimal policies. Third, for each domain, the larger a problem is, the more valid policies6 that problem contains, and therefore the more policies are close to the optimal one. This explains why, for the three large problems, WF2, SAP2, and DAP2, each of the 100 best policies differs from the optimal one on just one state. Fourth, the plots show strong domain-dependent patterns. The plots for the two problems from the Elevator and Racetrack domains are very similar. Larger problems do not necessarily produce smaller Hamming distances, especially because those are handcrafted domains, designed to make the problems more challenging for planners in a planning competition.

6 Conclusions

This paper extends the work introduced in [22]. We have greatly expanded the experimental section, including adding a new algorithm and its analysis, and provided new and complete proofs for all results, in particular Lemmas 2–4 and Theorem 1.

This work makes several contributions. First, we introduce the k best policies problem and argue for its importance. Second, we prove a strong and useful theorem: the k-th best policy differs from some m-th (m < k) best policy on exactly one state. This gives a feasible algorithm, whereas the brute-force algorithm for solving the k best policies problem (KBN), which does not use the theorem, runs in time exponential in k. Third, we propose a new algorithm, named k best improved (KBI), based on our theorem. We show that the time complexity of KBI is dominated by a computation whose time complexity is linear in k. Fourth, we propose an approximate algorithm, named k best approximate (KBA), which avoids changes at unreachable states and generates only non-trivially-extended policies. Fifth, we demonstrate that KBI outperforms KBN by at least an order of magnitude when k = 3. The KBN algorithm does not scale to larger k values, as its running time increases exponentially in k. We also show that KBA not only outperforms KBI on most problems by at least an order of magnitude, but also generates high-quality policies. The running times of KBI and KBA increase only linearly in k. This makes them suitable for problems for which we want a long list of best policies. Sixth, we observe that the k best non-trivially-extended policies for all the MDP problems are quite similar to the optimal policies, though some problems' optimal policies are more tightly coupled than others'.

This is just the beginning of work on k best policies. There is much to be done in improving the algorithms, and in looking at application-driven variants.

6 A valid policy is one that reaches a goal state with probability 1.


For example, deeper studies can be continued on the KBA algorithm. Which non-trivially-extended policies from KBN's best list does KBA skip? Can we improve KBA so that it skips only trivially extended ones? Finally, we can turn Theorem 1 around and consider the sensitivity of policy values to individual states, or state-action pairs.

Acknowledgements Dai was partially supported by Office of Naval Research grant N00014-06-10147. We thank Patrice Perny for suggesting this problem, for helpful suggestions, and for pointing out the previous work. We thank Mausam for helpful discussions on the problem. We thank the anonymous reviewers for suggestions that improved the paper.

Appendix

Lemma 2 Let M be an MDP, and π and π′ be two policies for M that differ only on two states s1 and s2. Suppose that Vπ(s0) ≤ Vπ′(s0). Consider the two policies π^1, π^2 obtained from π by replacing exactly one distinct action each from π(s), s ∈ {s1, s2}, with the corresponding π′(s); without loss of generality, suppose π^i(si) = π′(si). Then π^1 and π^2 cannot both have larger initial state values than π′ does.

Proof Suppose otherwise. Construct an MDP M′ in which, for all s ∉ {s1, s2}, the only available action is π(s) = π′(s), and the only actions available for the si's are those specified by π and π′. We will overload notation and use the same policy names, but limit consideration, for the moment, to M′. We consider the policy iteration (PI) algorithm. PI has two main features: (1) starting from an arbitrary policy, PI terminates on some optimal policy in finitely many steps, and (2) given the current policy, the policy improvement step picks the greedy action for each state independently.

Let us suppose we start policy iteration from π^1 and mimic one policy improvement step. We first look at state s1 and try to improve its action. Note that the improvement is independent of s2 given its value, Vπ^1(s2). This is exactly the same as considering a further simplified MDP M1, where M1 has the same actions as M′ except for s2, where it only has action π^1(s2) = π(s2). Since Vπ^1(s0) > Vπ(s0), and the two policies differ only on s1, we improve the action for s1 from π^1(s1) to π(s1). Next we look at state s2; for the same reason, we improve the action for s2 from π^1(s2) to π′(s2). This gives us π^2. If we once again apply PI, we will, by similar arguments, get π^1. We know that there cannot be cycles in PI, since otherwise PI would not converge in finite time. Thus, our initial assumption, that Vπ^i(s0) > Vπ′(s0) for both i, is contradicted. □

Lemma 3 Let M be an MDP, and π and π′ be two policies for M that differ only on m states s1, s2, . . . , sm, m > 1. Suppose that Vπ(s0) ≤ Vπ′(s0). Consider the 2^m − 2 distinct policies π^T, ∅ ≠ T ⊊ {s1, s2, . . . , sm}, that agree with π on all states not in T and agree with π′ on T. Then for at least one such T, Vπ^T(s0) ≤ Vπ′(s0).

Proof Suppose otherwise. First, as in the proof of Lemma 2, construct an MDP M′ in which, for all s ∉ {s1, s2, . . . , sm}, the only available action is π(s) = π′(s), and the only actions available for the si's are those specified by π and π′.

Ranking policies in discrete Markov decision processes

121

Note that in M′, π has the optimal initial state value, and no policy other than π and π′ has the optimal initial state value.

Then we draw an undirected graph on 2^m vertices, where each vertex corresponds to a policy of M′. Two policies (vertices) have an edge between them if and only if they differ on exactly one state; for example, π^{s1,s3} and π^{s1} are neighbors. We put all the vertices on a leveled structure, where π is the only vertex on the top (0-th) level and π′ is the only vertex on the bottom (m-th) level. Each policy (vertex) on a given level has the same number of actions that differ from π, and that number is its level number. There are a total of m + 1 levels (see Fig. 5). Note that each vertex π̄ on the i-th level has exactly m neighbors, where i of them are on the (i − 1)-st level and the other m − i are on the (i + 1)-st level.

Then we add directions to the edges. The directed edges reflect PI's preference between the two vertices (policies). Suppose we have two policies π̄ and π̄′ that differ on exactly one state s. A directed edge from π̄ to π̄′ means that if the current policy is π̄, then in the policy improvement step PI will switch the action of s from π̄(s) to π̄′(s). We know there cannot be a cycle between two policies. Suppose that our version of PI breaks ties by preferring a lexicographically earlier policy. Therefore each edge has exactly one direction.

Consider the vertices on level 1 (the level right below vertex π). We assume they all have higher initial state values than Vπ′(s0) (which is ≥ Vπ(s0)), so the edges between the level 1 vertices and π must all point up towards π. Similarly, the edges from the vertices on level m − 1 to π′ all point down. Figure 6 gives an illustration.

First we argue that Vπ′(s0) = Vπ(s0). By the assumption that all edges incident to π′ point toward π′, if we start PI on π′, it converges to π′. If Vπ(s0) < Vπ′(s0), that would mean that PI converged to a suboptimal policy. Furthermore, we see that, by this argument, there cannot be any policy π^T in this graph with Vπ^T(s0) < Vπ′(s0).

Fig. 5 Illustration of the structure of the (m + 1)-level graph constructed from the 2^m distinct policies π^T, T ⊆ {s1, s2, . . . , sm}. Each vertex in the graph represents a policy. Two policies are adjacent if and only if they differ on exactly one state
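Purely as an illustration of the construction shown in Fig. 5 (not part of the proof), the snippet below enumerates the 2^m interpolating policies π^T, grouped by level |T|; policies are dictionaries as in the earlier sketches.

```python
from itertools import combinations

def interpolating_policies(pi, pi_prime, diff_states):
    """Yield (level, pi_T) for every T subset of diff_states, where pi_T
    follows pi_prime on T and pi elsewhere; level = |T| as in Fig. 5."""
    for level in range(len(diff_states) + 1):
        for T in combinations(diff_states, level):
            pi_T = dict(pi)
            for s in T:
                pi_T[s] = pi_prime[s]
            yield level, pi_T
```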


Fig. 6 A directed graph version of Fig. 5, where an arrow on an edge reflects the preference of the policy improvement step between the two adjacent policies, independent of all other state-action choices the two policies have in common


We already proved in Lemma 2 that such a directed graph for m = 2 would imply that PI fails. We now assume that no such graph exists with level m < k and show that none exists with level m = k. Given a policy and its value, the policy improvement step on each single state is independent, so improving a policy is just like checking all its neighboring policies: if the corresponding edge points away from the policy (vertex), then PI will switch the action for that state; otherwise, it keeps the current action.

Let m = k. By definition, M′ has exactly 2^k policies. Based on the assumptions, π and π′ are the only two policies that have the optimal value. Policy iteration is correct if, starting from any policy, it terminates at π or π′. Without loss of generality, let us assume that, starting from at least one policy, PI converges to π. Suppose i (> 0) is the highest level (i being minimum) that contains a policy π̄ that converges to π in one step. Then for this particular policy (vertex), all its edges connecting with vertices on the (i + 1)-st level must point upward, otherwise PI would update the actions of at least one state to those of π′. This defines a subgraph that contains 2^{k−i} vertices: the 0-th level of the subgraph contains only π̄, with all k − i edges pointing towards it, and the bottom is π′, with all k − i edges pointing towards it. Furthermore, as we have shown above, the two policies on the 0-th and (k − i)-th levels must have the same initial state value, i.e., Vπ̄(s0) = Vπ′(s0). This contradicts the assumption that π and π′ are the only two policies that have the optimal value. □

Proof of Theorem 1 Let M be an MDP, and Π = {π1, . . . , πk−1} be the list of its k − 1 best policies. We claim that, for k > 1, there is some j < k and state s such that πj differs from πk exactly on s.

Suppose otherwise. Assume πi ∈ Π is the policy that differs on the fewest states, namely m > 1, from πk. From Lemma 3, we know that there exists at least one policy π̄ with Vπ̄(s0) = Vπk(s0), such that π̄ differs from πk on m − n (1 ≤ n < m) states and from πi on n states. We know π̄ cannot be in Π, since that would contradict


the assumption that πi is the policy in Π that differs from πk on the fewest states. But then πk cannot be the k-th best policy, because Vπ̄(s0) = Vπk(s0) and π̄ is (Hamming) closer to πi, which contradicts Definition 5. □

References

1. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957)
2. Boutilier, C., Dean, T., Hanks, S.: Decision-theoretic planning: structural assumptions and computational leverage. J. Artif. Intell. Res. 11, 1–94 (1999)
3. Bonet, B., Geffner, H.: Planning with incomplete information as heuristic search in belief space. In: Proceedings of ICAPS, pp. 52–61 (2000)
4. Bresina, J.L., Dearden, R., Meuleau, N., Ramkrishnan, S., Smith, D.E., Washington, R.: Planning under continuous time and resource uncertainty: a challenge for AI. In: Proceedings of UAI, pp. 77–84 (2002)
5. Bresina, J.L., Jónsson, A.K., Morris, P.H., Rajan, K.: Activity planning for the Mars exploration rovers. In: Proceedings of ICAPS, pp. 40–49 (2005)
6. Aberdeen, D., Thiébaux, S., Zhang, L.: Decision-theoretic military operations planning. In: Proceedings of ICAPS, pp. 402–412 (2004)
7. Musliner, D.J., Carciofini, J., Goldman, R.P., Durfee, E.H., Wu, J., Boddy, M.S.: Flexibly integrating deliberation and execution in decision-theoretic agents. In: Proceedings of ICAPS Workshop on Planning and Plan-Execution for Real-World Systems (2007)
8. Nielsen, L.R., Jorgensen, E., Kristensen, A.R., Ostergaard, S.: Optimal replacement policies for dairy cows based on daily yield measurements. J. Dairy Sci. 93(1), 75–92 (2010)
9. Perny, P., Weng, P.: On finding compromise solutions in multiobjective Markov decision processes. In: ECAI Multidisciplinary Workshop on Advances in Preference Handling (2010)
10. Nielsen, L.R., Kristensen, A.R.: Finding the k best policies in finite-horizon MDPs. Eur. J. Oper. Res. 175(2), 1164–1179 (2006)
11. Nielsen, L.R., Pretolani, D., Andersen, K.A.: Finding the k shortest hyperpaths using reoptimization. Oper. Res. Lett. 34(2), 155–164 (2006)
12. Nielsen, L.R., Andersen, K.A., Pretolani, D.: Finding the k shortest hyperpaths. Comput. Oper. Res. 32, 1477–1497 (2005)
13. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
14. Howard, R.: Dynamic Programming and Markov Processes. MIT Press, Cambridge (1960)
15. Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York (1994)
16. Littman, M.L., Dean, T., Kaelbling, L.P.: On the complexity of solving Markov decision problems. In: Proceedings of UAI, pp. 394–402 (1995)
17. Bonet, B.: On the speed of convergence of value iteration on stochastic shortest-path problems. Math. Oper. Res. 32(2), 365–373 (2007)
18. Barto, A., Bradtke, S., Singh, S.: Learning to act using real-time dynamic programming. Artif. Intell. 72, 81–138 (1995)
19. Wingate, D., Seppi, K.D.: Prioritization methods for accelerating MDP solvers. J. Mach. Learn. Res. 6, 851–881 (2005)
20. Bonet, B., Geffner, H.: Learning in depth-first search: a unified approach to heuristic search in deterministic and non-deterministic settings, and its applications to MDPs. In: Proceedings of ICAPS, pp. 142–151 (2006)
21. ICAPS-06: 5th International Planning Competition (2006). http://www.ldc.usb.ve/~bonet/ipc5/
22. Dai, P., Goldsmith, J.: Finding best k policies. In: Proceedings of ADT, pp. 144–155 (2009)
