Convex Synthesis of Optimal Policies for Markov Decision Processes with Sequentially-Observed Transitions

Mahmoud El Chamie and Behçet Açıkmeşe

The authors are with the Department of Aerospace Engineering, University of Texas at Austin, 210 E. 24th St., Austin, TX 78712 USA. Emails: [email protected] and [email protected]

Abstract— This paper extends finite state and action space Markov Decision Process (MDP) models by introducing a new type of measurement for the outcomes of actions. The new measurement allows the decision maker to sequentially observe the next-state transition associated with an action: the actions are ordered, and the outcome of the next action in the sequence is observed only if the current action is not chosen. We show that the sequentially-observed MDP (SO-MDP) shares some properties with a standard MDP: among history dependent policies, Markovian ones are still optimal. SO-MDP policies have the advantage of producing better rewards than standard optimal MDP policies due to the additional measurements. Computing these policies, on the other hand, is more complex, and we present a linear-programming-based synthesis of optimal decision policies for finite-horizon SO-MDPs. A simulation example with multiple autonomous agents is also provided to demonstrate the SO-MDP model and the proposed policy synthesis method.

I. INTRODUCTION

Markov Decision Processes (MDPs) have been used to formulate many decision-making problems for stochastic dynamical systems [1]–[8]. MDPs have been widely studied since the pioneering work of Bellman [9], which provided the foundation of dynamic programming, and the book of Howard [10], which popularized the study of decision processes. MDP models are applied in diverse fields including robotics [11], automatic control [12], flight control [13], economics [14], revenue management [15], queuing [16], and communication networks [8]. Standard MDP problems assume full-state feedback: at every time instant (epoch), the state of the system is known and deterministic rewards are collected for choosing an action. Action outcomes are characterized by a next-state transition. The next state in standard MDPs is stochastic, i.e., states at future time instants cannot be predicted in a deterministic manner.

There have been several extensions and generalizations of MDP models to accommodate the requirements and considerations of applications. Partially observed MDPs (POMDPs) take into account uncertainties in the full-state knowledge [17]. There can also be uncertainties in the state transition/reward models. Learning methods have been developed to handle such uncertainties, e.g., reinforcement learning [18], which adapts the decision policy of the MDP during the course of the process, thus increasing performance. Another extension in the MDP literature is the bandit problem [19], where the agents can observe the random reward of different actions and have to choose the actions that maximize the sum of rewards through a sequence of repeated experiments. Determination of an optimal stopping time has also been studied, to determine the optimal epoch for a particular action [20, Chapter 13]. In other applications, multi-objective cost functions or constraints are considered in the computation of the optimal MDP policies [21]. In most of the relevant literature, the extensions to the standard MDP models are obtained by relaxing some of the assumptions, e.g., observability of the current state, known rewards, transition probabilities, etc.

In this paper, we extend standard MDP problems by considering additional measurements of the action outcomes. Specifically, we assume that the action outcomes are observed in a prescribed sequence, and the next action's outcome is observed only if the current action in the sequence is not taken. We refer to this model with the extended set of information as the sequentially-observed MDP (SO-MDP) model. We show that among history dependent policies, Markovian policies are still optimal for SO-MDPs; the proof utilizes similar arguments as those for the standard MDP. The main advantage of SO-MDP policies is that they produce better rewards than standard optimal MDP policies by benefiting from the additional observations. Due to page limitations, some interesting examples where the SO-MDP outperforms the standard MDP (such as routing, university admission, and market investment) are given in [22]. The policy synthesis for SO-MDPs, on the other hand, is more complex because the synthesis problem is not convex in the decision variables. We show that, with an appropriate change of variables, a linear-programming-based method can be used to synthesize optimal SO-MDP decision-making policies. The introduced model and the synthesis results are also demonstrated on an autonomous-agents simulation example.

II. SEQUENTIALLY-OBSERVED MDP (SO-MDP)

In this section we compare decision policies of the standard MDP to those of the introduced SO-MDP model.

A. Stochastic System
The dynamic behavior of a one-dimensional stochastic system [1, p. 10] is modeled by an equation of the form

X_{t+1} = f_t(X_t, U_t, ω_t),  for t = 1, ..., N−1,   (1)

where t is the decision epoch of horizon N and X_t is the state; e.g., the state can take finitely many values, which is our assumption in this paper. U_t is the control that defines a probability distribution U := P[A] over the (finite) action set A, where an action A_t ∈ A is then chosen at instant t following this probability law. ω_t is a random variable that characterizes the stochastic nature of the process. In MDP problems, every control {U_t ; t ≥ 1} induces a discrete-time Markov process {X_t ; t ≥ 1}; thus MDPs are sometimes called controlled Markov chains.

Fig. 1. Standard MDP full-state feedback model.

B. Policy Description
A Markovian policy for a standard MDP is a sequence of controls {U_1, ..., U_{N−1}} under the assumption of full state feedback. At every decision epoch t, the system observes the state X_t and applies a control U_t(X_t) given by the policy. Figure 1 depicts the standard controlled-Markov-chain feedback structure of the standard MDP. Note that it can be shown that optimal policies for standard MDPs are Markovian and deterministic [14].

In the SO-MDP, we assume that actions are ordered in a prescribed sequence. When the system reaches an action a_k, the next random state X_{t+1} can be observed (O_{t,k} in Figure 2) before taking the action. The decision is then a probability distribution P[{Yes, No}] over acceptance/rejection of the observed transition. The yes/no decisions are taken at instances called phases. A phase starts with an observation of the transition outcome of an action and ends with a decision about whether or not to take this action. A Markovian policy for an SO-MDP is also a set of controls {U_1, ..., U_{N−1}}, but with a different input for the control functions. The SO-MDP has a feedback law that observes {X_t, O_{t,1}, ..., O_{t,m}} and applies a control U_t(X_t, O_{t,1}, ..., O_{t,m}) that is a probability distribution over each of the yes/no (acceptance/rejection) possibilities (see Figure 2). In an SO-MDP, the order of the actions a_1, ..., a_m can affect the optimal policy because they are observed and acted upon sequentially. The SO-MDP model can be further enriched by making this order a decision variable, which is not pursued in this paper.

Remark. The proposed model is different from the well-known "secretary problem" of the MDP literature [23], [24]. In the simplest form of the secretary problem, a fixed number of people are interviewed for a job in a sequential manner and, based on the (observed) rank of the candidates interviewed so far, a decision must be taken whether to accept or reject the latest interviewed candidate. The main difference with sequentially-observed MDPs is that in the secretary problem, observing a candidate changes the probability of future transitions (because of the correlation between the events). In our model, however, an observation is independent of the future environmental dynamics (i.e., observing a transition at a given phase does not change the transition probabilities for the following phases or epochs). Another fundamental difference is that our model does not necessarily have a stopping time, and the horizon can go to infinity, which is not possible for the secretary problem.

Fig. 2. Sequential MDP full-state feedback model.

C. Mathematical Formulation of SO-MDPs
1) States and Actions: Let S = {1, ..., n} be the set of states, with cardinality |S| = n. Define A_s = {a_1, ..., a_m} to be the set of actions available at state s (without loss of generality, the number of actions does not change with the state, i.e., |A_s| = m for every s ∈ S). We consider a discrete-time system where actions are taken at the decision epochs. Let X_t, A_t, and Z_t be, respectively, the random variables corresponding to the state, the action, and the phase at the t-th decision epoch.

2) Decision Rule and Policy: The history of the process is h_t = {s_1, y_1, s_2, y_2, ..., s_{t−1}, y_{t−1}, s_t}, where s_k and y_k are, respectively, realizations of the random variables X_k and A_k. Let H_t be the set of all possible realizations, i.e., h_t ∈ H_t. We define a randomized decision rule d_t at time t for the standard MDP to be the function d_t : H_t → P[A_{s_t}], where P[A_{s_t}] is a probability distribution over the set of possible actions A_{s_t}. This policy is history dependent, as the current and preceding states are all taken into account to decide on the current action. For a Markovian policy, H_t = S and the decision variables are Prob{A_t = a_k | X_t = i} for an action a_k ∈ A_i and any given state i. For an SO-MDP, we define a decision rule d_t at time t to be the function d_t : H_t × S × {1, ..., m} → P[{Yes, No}], where H_t is the set of all possible state-action histories, S is the set of possible observed states, and {1, ..., m} is the set of possible phases (i.e., the ordering of actions). For a Markovian policy, H_t = S and the decision variables are Prob{Yes | X_t = i, O_{t,k} = j, Z_t = k} for any given state i and an observed transition to state j at phase k. Let

π = (d_1, d_2, ..., d_{N−1})

be the policy for the SO-MDP, given that there are N−1 decision epochs. To show the explicit dependence of a probability on the policy, Prob_π{E} is used to denote the probability of an event E under a certain policy π (e.g., Prob_π{A_t = a | X_t = i} is the probability that action a is taken given that the system is at state i at time t and given the policy π).

3) Rewards: Given a state s ∈ S and an action a_k ∈ A, we define the reward r_t(s,k) ∈ R ⊆ ℝ. We define the expected reward for a given decision rule d_t at time t as

r̄_t(s) = E_{d_t}[r_t(s, A_t)] = Σ_{k=1}^{m} Prob_π{A_t = a_k | X_t = s} r_t(s,k),   (2)

where A_t is the (random) action taken due to the decision rule d_t, and r̄_t ∈ ℝ^n is the vector of expected rewards for each state. Given that there are N−1 decision epochs, there are N reward stages, and the final-stage reward is given by r_N(s) (or r̄_N, the vector whose elements are the final rewards at each state).

4) State Transitions: We define the transition probabilities as G_i(j,k,t) = Prob{O_{t,k} = j | X_t = i, Z_t = k}, where O_{t,k} is the observed transition for a given phase k. Let G_i(t) ∈ ℝ^{n×m} be the matrix whose j-th row and k-th column entry is G_i(j,k,t) (for simplicity, we drop the index t when there is no confusion and denote the transitions simply by G_i). Notice that G_i is independent of the policy π; it models the dynamic environment. Let G be the set containing these transition matrices. In standard MDPs, the transition probabilities of the environment are defined in a slightly different way, as Prob{X_{t+1} = j | X_t = i, A_t = a}. In SO-MDPs, contrary to standard MDPs, this latter expression is a function of the policy π, as shown later in the proof of Lemma 1.

5) Sequentially Observed Markov Decision Processes (SO-MDPs): Let γ ∈ [0,1] be the discount factor, which represents the importance of a current reward in comparison to future possible rewards. We consider γ = 1 throughout the paper, but the results remain applicable after a suitable scaling when γ < 1. A discrete SO-MDP is a 5-tuple (S, A_s, G, R, γ), where S is a finite set of states, A_s is a finite set of actions available at state s, G is the set that contains the transition probabilities as explained above, and R is the set of rewards at a given time epoch due to the current state and action.

6) Performance Metric: The expected discounted total reward is the performance metric, given as

v_N^π = E_{x_1}[ Σ_{t=1}^{N−1} r_t(X_t, A_t) + r_N(X_N) ],   (3)

where X_t is the state at decision epoch t, A_t is the action due to π at decision epoch t, and the expectation is conditioned on a probability distribution over the initial states (i.e., x_1 ∈ P[S], where x_1(i) = Prob{X_1 = i}). When x_1 is a point distribution (i.e., there exists a state s such that x_1(s) = 1), we denote the performance metric by v_N^π(s).

7) Optimal Policy: The optimal policy π* is defined as the policy (when it exists) that maximizes the performance measure, π* = argmax_π v_N^π, and v_N^* denotes the optimal value, i.e., v_N^* = max_π v_N^π. For the standard MDP, the backward induction algorithm [14, p. 92] gives the optimal policy as well as the optimal value. However, in our new model the optimization variables are different, and another algorithm for finding optimal policies is needed. In the following sections we provide such an algorithm for SO-MDPs.

D. Decision Variables for Markovian Policies
In standard MDPs, the decision variables are the probability distributions for each state i ∈ S, p_i(a,t) := Prob{A_t = a | X_t = i}, which define an action-selection policy. In SO-MDPs, the decision variables are whether to accept or reject a given transition at phase k, defined as follows:

P_i(j,k,t) := Prob{Accepting transition | Observed transition to state j, system in state i, phase k, epoch t} = Prob{Yes | O_{t,k} = j, X_t = i, Z_t = k}.

Since there are only m−1 decision phases, we assume P_i(j,k,t) = 1 if k = m. In standard MDPs, the decision d_t is defined by the decision variables p_i(a,t) ≥ 0 for all a ∈ A_i and decision epoch t, satisfying Σ_a p_i(a,t) = 1. In the SO-MDP, the decision d_t is defined by the independent matrix variables {P_1(t), ..., P_n(t)}, where P_i(t) ∈ ℝ^{n×m} is the matrix containing the probabilities P_i(j,k,t) ∈ [0,1] for all destination states j ∈ S, for k = 1, ..., m, and decision epoch t. For simplicity, we drop the index t when there is no confusion and denote the variables simply by P_i. Note that this decision rule has a Markovian property because it depends only on the current state. Next we define an intermediate variable q_i(a_k), which is the probability of choosing action a_k given that the previous actions a_1, ..., a_{k−1} were rejected, that is,

q_i(a_k) = Prob{Yes | X_t = i, Z_t = k} = Σ_{j∈S} Prob{Yes, O_{t,k} = j | X_t = i, Z_t = k} = Σ_{j∈S} G_i(j,k) P_i(j,k).

Then the probability of choosing action a_k, 1 ≤ k ≤ m, is the probability that the first k−1 actions are rejected (i.e., Π_{l=1}^{k−1}(1 − q_i(a_l))) and then the k-th action is accepted (i.e., q_i(a_k)):

p_i(a_k) = Prob{A_t = a_k | X_t = i}   (4)
         = ( Π_{l=1}^{k−1} (1 − q_i(a_l)) ) q_i(a_k),   (5)

where, by convention, Π_{l=1}^{k−1}(1 − q_i(a_l)) := 1 if k = 1. We observe that q_i(a_k) = 1 if k = m, i.e., the last action, if reached, is automatically accepted. The above relation shows that the decision variables of the standard MDP (p_i(a_k) for k = 1, ..., m and i = 1, ..., n) are non-convex functions of the decision variables of the SO-MDP (P_i for i = 1, ..., n).
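To make the map from the SO-MDP variables P_i to the induced action-selection probabilities p_i concrete, the following Python sketch (our own illustration, not part of the paper; function and variable names are ours, NumPy assumed) computes q_i(a_k) and p_i(a_k) from given matrices G_i and P_i according to Eqs. (4)-(5).

```python
import numpy as np

def action_probabilities(G_i, P_i):
    """Given the observation matrix G_i (n x m) and acceptance probabilities
    P_i (n x m) for one state i, return (q, p) where
    q[k] = q_i(a_k) = sum_j G_i(j,k) P_i(j,k) and
    p[k] = p_i(a_k) as in Eqs. (4)-(5)."""
    n, m = G_i.shape
    P_i = P_i.copy()
    P_i[:, m - 1] = 1.0            # the last action, if reached, is always accepted
    q = (G_i * P_i).sum(axis=0)    # q_i(a_k)
    p = np.empty(m)
    reach = 1.0                    # prod_{l<k} (1 - q_i(a_l)); equals 1 for k = 1
    for k in range(m):
        p[k] = reach * q[k]
        reach *= 1.0 - q[k]
    return q, p

# Small usage example with n = 3 states and m = 2 actions (numbers arbitrary).
G_i = np.array([[0.7, 0.2], [0.2, 0.5], [0.1, 0.3]])
P_i = np.array([[1.0, 1.0], [0.0, 1.0], [0.5, 1.0]])
q_i, p_i = action_probabilities(G_i, P_i)
print(q_i, p_i, p_i.sum())         # p_i sums to 1
```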

The transition probability from state i to state j is given by the probability of reaching phase k and accepting the transition to state j, i.e.,

M_t(j,i) = Prob{X_{t+1} = j | X_t = i} = Σ_{k=1}^{m} ( Π_{l=1}^{k−1} (1 − q_i(a_l)) ) G_i(j,k) P_i(j,k).   (6)

Also in this case, the transition is not linear in the decision variables of the SO-MDP. Let x_t(i) = Prob{X_t = i} be the probability of being at state i at time t, and let x_t ∈ ℝ^n be the vector of these probabilities. Then the system evolves according to the following recursive equation (which defines a Markov chain):

x_{t+1} = M_t x_t,   (7)

where M_t ∈ ℝ^{n×n} (or simply M) is the matrix with elements M_t(j,i) (or simply M(j,i)). It is important to note that the i-th column of M, denoted by m_i, is a function of the decision variables in the matrix P_i only (i.e., it is independent of the variables of the matrices P_s for s ≠ i).
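As an illustration of Eqs. (6)-(7) (again our own sketch; the helper names are ours), the code below assembles the induced transition matrix M_t column by column from the pairs {G_i, P_i} and can be used to propagate the state distribution.

```python
import numpy as np

def transition_matrix(G, P):
    """G, P: lists of n x m matrices, one pair (G_i, P_i) per state i.
    Returns M with M[j, i] = Prob{X_{t+1} = j | X_t = i} as in Eq. (6)."""
    n, m = G[0].shape
    M = np.zeros((n, n))
    for i in range(n):
        G_i, P_i = G[i], P[i].copy()
        P_i[:, m - 1] = 1.0                       # P_i(j, m) = 1 by convention
        q = (G_i * P_i).sum(axis=0)               # q_i(a_k)
        reach = 1.0                               # prod_{l<k} (1 - q_i(a_l))
        col = np.zeros(n)
        for k in range(m):
            col += reach * G_i[:, k] * P_i[:, k]  # accept a_k after rejecting a_1..a_{k-1}
            reach *= 1.0 - q[k]
        M[:, i] = col                             # i-th column depends on P_i only
    return M

# Usage: x_next = transition_matrix(G, P_t) @ x implements Eq. (7).
```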

III. HISTORY DEPENDENT AND MARKOVIAN POLICIES

In this section we show that, as in the standard MDP case [14], it is sufficient to consider only Markovian policies for SO-MDPs, because for any history dependent policy we can construct a Markovian policy that gives the same total expected reward. First note that the performance metric can be written as

v_N^π(s) = Σ_{t=1}^{N−1} Σ_{j∈S} Σ_{a∈A} r_t(j,a) Prob_π{X_t = j, A_t = a | X_1 = s} + Σ_{j∈S} Σ_{a∈A} r_N(j) Prob_π{X_N = j, A_N = a | X_1 = s}.   (8)

Therefore, it is sufficient to show that for any history dependent policy π there exists a Markovian policy π′ such that Prob_π{X_t = j, A_t = a | X_1 = s} = Prob_{π′}{X_t = j, A_t = a | X_1 = s} for all t. This is indeed the case, as the following lemma shows.

Lemma 1. Consider the SO-MDP with a history dependent policy π = (d_1, d_2, ...). Then, for each s ∈ S, there exists a Markovian policy π′ = (d′_1, d′_2, ...) satisfying, for t = 1, ..., N:

Prob_π{X_t = j, A_t = a | X_1 = s} = Prob_{π′}{X_t = j, A_t = a | X_1 = s}.   (9)

Proof. The proof is given in the Appendix.

Remark: Lemma 1 holds for the standard MDP as well [14]. For the SO-MDP model, the main difference lies in the construction of the equivalent Markovian policy given the history dependent one. The following proposition shows that we can focus, without loss of generality, on Markovian policies in what follows.

Proposition 1. Suppose π is a history dependent policy; then for each s ∈ S there exists a Markovian policy π′ for which the following holds:

v_N^π = v_N^{π′}   for N ≥ 1.

Proof. Since the terms on the right-hand side of equation (8) can be replaced by those of a Markovian policy via Lemma 1, we can establish the equivalence of the rewards/costs, i.e., v_N^π = v_N^{π′} for N ≥ 1.

IV. DYNAMIC PROGRAMMING (DP) APPROACH FOR SO-MDPS

In this section, we transform the one-dimensional stochastic MDP problem with dynamics (1) into an equivalent n-dimensional deterministic Dynamic Programming (DP) problem and use this approach to devise an efficient algorithm for finding optimal policies for SO-MDPs. First note that, using equation (8), the performance metric can be written as

v_N^π = Σ_{t=1}^{N} x_t^T r̄_t,

where x_t is the state probability vector propagated as in (7). We can now give the DP formulation. The discrete-time dynamical system describing the evolution of the state probability vector x_t is

x_{t+1} = f_t(x_t, P_1(t), ..., P_n(t)) = M_t x_t,  for t = 1, ..., N−1,   (10)

where M_t = M_t(P_1(t), ..., P_n(t)) is the transition matrix, which is a function of the optimization variables (note that the i-th column of M_t is a function of the P_i(t) matrix only). The above dynamics shows that the probability vector evolves deterministically. It is important to note that, given a policy π, the performance metric v_N^π defined in (3) for the one-dimensional stochastic system (1) is equal to the performance of the n-dimensional deterministic system (10). In other words, any closed-loop feedback law (policy) for the stochastic system defines a policy for the deterministic system with the same performance. The additive reward per stage is defined as g_N(x_N) = x_N^T r̄_N and g_t(x_t, P_1(t), ..., P_n(t)) = x_t^T r̄_t for t = 1, ..., N−1. Dynamic programming with a set of admissible controls C provides a method to calculate the optimal value v_N^* (and a closed-loop policy π*) by running Algorithm 1 [25, Proposition 1.3.1, p. 23].
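To illustrate the deterministic reformulation, Eq. (10) together with v_N^π = Σ_t x_t^T r̄_t, here is a sketch (our own illustration, with our own helper names) that evaluates a fixed Markovian SO-MDP policy by propagating the state distribution and accumulating the expected stage rewards. It does not perform any optimization; it only computes v_N^π for given P matrices.

```python
import numpy as np

def phase_weights(G_i, P_i):
    """Return W with W[j, k] = P_i(j,k) * prod_{l<k} (1 - q_i(a_l))
    (these are the X_i variables of Section IV-B) and the vector q_i."""
    n, m = G_i.shape
    P_i = P_i.copy(); P_i[:, m - 1] = 1.0
    q = (G_i * P_i).sum(axis=0)
    reach = np.concatenate(([1.0], np.cumprod(1.0 - q[:-1])))
    return P_i * reach, q

def evaluate_policy(x1, G, P_policy, r, r_N):
    """x1: initial distribution (n,); G: list of n x m observation matrices;
    P_policy[t]: list of n x m acceptance matrices for epoch t (t = 1..N-1);
    r[t][i, k] = r_t(i, a_k); r_N: terminal reward vector.
    Returns v_N^pi = sum_t x_t^T rbar_t + x_N^T r_N."""
    n = len(G)
    x = np.asarray(x1, dtype=float)
    total = 0.0
    for t, P_t in enumerate(P_policy):
        M = np.zeros((n, n))
        rbar = np.zeros(n)
        for i in range(n):
            W, q = phase_weights(G[i], P_t[i])
            M[:, i] = (G[i] * W).sum(axis=1)   # Eq. (6)
            p = (G[i] * W).sum(axis=0)         # p_i(a_k), Eq. (5)
            rbar[i] = p @ r[t][i]              # Eq. (2)
        total += x @ rbar
        x = M @ x                              # Eq. (7)/(10)
    return total + x @ np.asarray(r_N, dtype=float)
```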

Algorithm 1 Dynamic Programming
1: Start with J_N(x) = g_N(x).
2: for t = N−1, ..., 1:
       J_t(x) = max_{P_1(t),...,P_n(t) ∈ C} { g_t(x, P_1(t), ..., P_n(t)) + J_{t+1}( f_t(x, P_1(t), ..., P_n(t)) ) }.
3: Result: J_1(x) = v_N^*.

A. Backward Induction for the SO-MDP Model
This section presents the backward induction algorithm for solving the SO-MDP using the dynamic programming approach. The set of admissible controls at time t is given by C, defined as follows: 0 ≤ P_i(j,k,t) ≤ 1 for all i ∈ S, j ∈ S, a_k ∈ A_i.

Using the dynamic programming Algorithm 1, we can now give the following proposition.

Proposition 2. The term J_t(x) in the dynamic programming algorithm for the SO-MDP has the following closed-form solution:

J_t(x) = x^T V_t^*,   (11)

where V_t^* is a vector that satisfies the following recursion: V_N^* = r̄_N and, for t = N−1, ..., 1,

V_t^*(i) = max_{P_i(t)} { r̄_t(i) + m_{i,t}^T V_{t+1}^* }   for i = 1, ..., n,

where m_{i,t} is the i-th column of M_t.

Proof. The proof is given in the Appendix.

Notice that J_t(x) has a closed-form expression as a function of x, so it suffices to calculate V_t^* for t = N, ..., 1 in order to find the optimal value of the SO-MDP, given by v_N^* = J_1(x_1) = x_1^T V_1^*. The backward induction algorithm is given in Algorithm 2.

Algorithm 2 Backward Induction: Sequential MDP Optimal Policy
1: Definitions: For any state i ∈ S, we define V_t^π(i) = E_{x_t = e_i}[ Σ_{k=t}^{N−1} r_k(X_k, A_k) + r_N(X_N) ] and V_t^*(i) = max_π V_t^π(i), given that X_t = i.
2: Start with V_N^*(i) = r_N(i).
3: for t = N−1, ..., 1: given V_{t+1}^*, for i = 1, ..., n calculate the optimal value
       V_t^*(i) = max_{P_i ∈ C} { r̄_t(i) + Σ_{j∈S} M_t(j,i) V_{t+1}^*(j) }
4: and the optimal policy, given by
       P_i^*(t) = argmax_{P_i ∈ C} { r̄_t(i) + Σ_{j∈S} M_t(j,i) V_{t+1}^*(j) }.
Result: V_1^*(s_1) = v_N^*, where s_1 is the initial state.

Remark: We want to stress two points about the algorithm. First, the policy calculated by Algorithm 2 is optimal (it maximizes the total expected reward) because of line 3 in Algorithm 1 and Proposition 2. Second, r̄_t(s) and M_t(j,s) are both functions of the decision variables in P_s. In standard MDPs, these values are simply linear in the decision variables. However, in the SO-MDP they are non-convex in the decision variables, and further processing is needed for a numerically tractable implementation of the algorithm, which is discussed next.

B. A Numerically Tractable Implementation of Algorithm 2
In the internal loop of Algorithm 2, the value function at a given decision epoch t is given by the following equation:

V_t^*(i) = max_{P_i ∈ C} V_t(i),

where V_t(i) = r̄_t(i) + Σ_{j∈S} M_t(j,i) V_{t+1}^*(j). In this formulation, r̄_t(i) and M_t(j,i) are functions of the decision variable P_i(t), for a given state i and time epoch t. In particular, the explicit expressions can be deduced from Eq. (2), Eq. (5), and Eq. (6) as follows:

r̄_t(i) = Σ_{a∈A_i} p_i(a) r_t(i,a) = Σ_{k=1}^{m} ( Π_{l=1}^{k−1} (1 − q_i(a_l)) ) q_i(a_k) r_t(i, a_k),

M_t(j,i) = Σ_{k=1}^{m} ( Π_{l=1}^{k−1} (1 − q_i(a_l)) ) G_i(j,k) P_i(j,k),

where q_i(a_k) = Σ_j G_i(j,k) P_i(j,k). By substituting these expressions into V_t(i), we obtain

V_t(i) = r̄_t(i) + Σ_{j∈S} M_t(j,i) V_{t+1}^*(j)   (12)
       = Σ_{k=1}^{m} ( Π_{l=1}^{k−1} (1 − q_i(a_l)) ) ( Σ_{j=1}^{n} G_i(j,k) P_i(j,k) ) r_t(i,a_k) + Σ_{j=1}^{n} Σ_{k=1}^{m} ( Π_{l=1}^{k−1} (1 − q_i(a_l)) ) G_i(j,k) P_i(j,k) V_{t+1}^*(j)   (13)
       = Σ_{k=1}^{m} Σ_{j=1}^{n} ( r_t(i,a_k) + V_{t+1}^*(j) ) G_i(j,k) X_i(j,k)   (14)
       = Σ_{k=1}^{m} Σ_{j=1}^{n} H_i(j,k) X_i(j,k),   (15)

where

X_i(j,k) := P_i(j,k) Π_{l=1}^{k−1} (1 − q_i(a_l)),   (16)
H_i(j,k) := ( r_t(i,a_k) + V_{t+1}^*(j) ) G_i(j,k).

Note that H_i(j,k) is independent of the decision variables. For an efficient implementation of the algorithm, it remains to show what conditions X_i(j,k) should satisfy so that the mapping X_i(j,k) = Π_{l=1}^{k−1}(1 − q_i(a_l)) P_i(j,k) is invertible. Notice that if q_i(a_l) ≠ 1 for l = 1, ..., m−1, then the mapping is one-to-one, and we give the expression for P_i in terms of X_i shortly. If there exists l such that q_i(a_l) = 1, then the phases k > l_min are not reached, because an earlier action must necessarily be accepted, where l_min = min{l | q_i(a_l) = 1}. This means that V_t(i) is independent of P_i(j,k) when k > l_min (i.e., the optimal value is not affected by these variables), and without loss of generality we can take P_i(j,k) = 1 for j = 1, ..., n and k = l_min + 1, ..., m. We can now give the expression of P_i in terms of X_i via the following lemma:

Lemma 2. For a given state i, the following equation holds for X_i(j,k), j = 1, ..., n and k = 1, ..., m, in Eq. (16):

X_i(j,k) = ( 1 − Σ_{l=1}^{k−1} Σ_{s=1}^{n} G_i(s,l) X_i(s,l) ) P_i(j,k).   (17)

Proof. The proof is given in the Appendix.

It remains to derive the constraints on X_i(j,k) when P_i ∈ C. Since P_i(j,k) ∈ [0,1] for all j = 1, ..., n and k = 1, ..., m−1, we can derive the following conditions:

0 ≤ X_i(j,k) ≤ 1 − Σ_{l=1}^{k−1} Σ_{s=1}^{n} G_i(s,l) X_i(s,l),

and, since by definition P_i(j,m) = 1 for all j = 1, ..., n,

X_i(j,m) = 1 − Σ_{l=1}^{m−1} Σ_{s=1}^{n} G_i(s,l) X_i(s,l).

As a result, V_t^*(i) is the solution of the following linear program:

maximize_{X_i}  Σ_{k=1}^{m} Σ_{j=1}^{n} H_i(j,k) X_i(j,k)
subject to: for j = 1, ..., n and k = 1, ..., m−1,
    0 ≤ X_i(j,k) ≤ 1 − Σ_{l=1}^{k−1} Σ_{s=1}^{n} G_i(s,l) X_i(s,l),
    X_i(j,m) = 1 − Σ_{l=1}^{m−1} Σ_{s=1}^{n} G_i(s,l) X_i(s,l).

To write it in matrix form, let 1_n be the vector of all ones of dimension n, let J = 1_n 1_n^T, and let B be a constant m × m matrix defined by B(l,k) = 1 if k > l and B(l,k) = 0 otherwise. The linear program for finding the optimal value and policy at a given iteration of the internal for-loop of Algorithm 2 is then

maximize_{X_i}  Tr(H_i^T X_i)
subject to  X_i + J (G_i ∘ X_i) B ≤ 1_n 1_m^T,
            ( X_i + J (G_i ∘ X_i) B ) e_m = 1_n,   (18)
            X_i ≥ 0,

where the product G_i ∘ X_i is taken entrywise and e_m denotes the m-th standard unit vector, so that the matrix inequalities reproduce the scalar constraints above. Let z(k) = 1 − Σ_{l=1}^{k−1} Σ_{s=1}^{n} G_i(s,l) X_i^*(s,l) for k = 2, ..., m, and z(1) = 1. The following proposition summarizes our results.

Proposition 3. For a given decision epoch t and state i, the optimal value and optimal policy terms in Algorithm 2 are given by

V_t^*(i) = Tr(H_i^T X_i^*),

where X_i^* is the solution of the linear program (18), and, for j = 1, ..., n and k = 1, ..., m,

P_i^*(j,k,t) = X_i^*(j,k)/z(k)  if z(k) > 0,  and  P_i^*(j,k,t) = 1  otherwise.   (19)

Proof. The proof is based on the fact that the linear program over the decision variables X_i is equivalent to the original optimization over the P_i variables when considering the additional (redundant) constraints P_i(j,k) = 1 for j = 1, ..., n and k = l_min + 1, ..., m.
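The per-state LP (18) and the recovery rule (19) translate directly into code. The sketch below is our own illustration (the function names, the flattening convention, and the use of scipy.optimize.linprog are ours, not the paper's); it implements Algorithm 2 with the LP inner step, assuming m ≥ 2.

```python
import numpy as np
from scipy.optimize import linprog

def solve_state_lp(H, G):
    """Maximize sum_{j,k} H(j,k) X(j,k) subject to the constraints of LP (18),
    written in the scalar form derived from Lemma 2. Returns (value, X*)."""
    n, m = H.shape
    nv = n * m
    idx = lambda j, k: j * m + k
    A_ub, b_ub, A_eq, b_eq = [], [], [], []
    for k in range(m):
        for j in range(n):
            row = np.zeros(nv)
            row[idx(j, k)] = 1.0
            for l in range(k):                 # cumulative "already accepted" mass
                for s in range(n):
                    row[idx(s, l)] += G[s, l]
            if k < m - 1:
                A_ub.append(row); b_ub.append(1.0)
            else:
                A_eq.append(row); b_eq.append(1.0)
    res = linprog(c=-H.flatten(),
                  A_ub=np.array(A_ub) if A_ub else None,
                  b_ub=np.array(b_ub) if b_ub else None,
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * nv, method="highs")
    return -res.fun, res.x.reshape(n, m)

def backward_induction(G, r, r_N, N):
    """Algorithm 2 via the LP reformulation. G[i]: n x m observation matrix of
    state i; r[t][i, k] = r_t(i, a_k) for t = 0..N-2; r_N: terminal rewards.
    Returns (V, policy), where policy[t][i] is P_i*(t) recovered via Eq. (19)."""
    n, m = G[0].shape
    V = [None] * N
    V[N - 1] = np.asarray(r_N, dtype=float)
    policy = [None] * (N - 1)
    for t in range(N - 2, -1, -1):
        V[t] = np.zeros(n)
        policy[t] = []
        for i in range(n):
            H = (r[t][i][None, :] + V[t + 1][:, None]) * G[i]   # H_i(j,k)
            V[t][i], X = solve_state_lp(H, G[i])
            z = np.concatenate(([1.0], 1.0 - np.cumsum((G[i] * X).sum(axis=0))[:-1]))
            P = np.where(z > 1e-12, X / np.maximum(z, 1e-12), 1.0)  # Eq. (19)
            policy[t].append(P)
    return V, policy
```

Each inner LP has n·m variables, so the per-epoch cost is n small LPs rather than one non-convex problem, which is the point of the change of variables (16).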

V. SIMULATIONS

This section presents a simulation example to demonstrate the proposed policy synthesis method for SO-MDPs. We consider autonomous vehicles (agents) exploring a region F, which can be partitioned into n disjoint subregions (or bins) F_i for i = 1, ..., n such that F = ∪_i F_i [26], [27]. We can model the system as an MDP where the state of an agent is its bin location and the actions of a vehicle are defined by the possible transitions to neighboring bins. Each vehicle collects rewards while traversing the area where, due to the stochastic environment, transitions are stochastic (i.e., even if the vehicle's command is to move "right", the environment can send the vehicle "left" with some probability). For example, with probability 0.6 the given command leads to the desired bin, while with probability 0.4 the agent lands on another neighboring bin. We assume a region described by a 10 by 10 grid. Each vehicle has 5 possible actions: "up", "down", "left", "right", and "stay". When the vehicle is on the boundary, we set the probability of actions that cause transitions outside of the domain to zero. The total number of states is 100 with 5 actions, and the decision time horizon is N = 10. The reward vectors R_t for t = 1, ..., N−1 and R_N are chosen randomly with entries in the interval [0, 100]. Since any feasible policy for a standard MDP is also a feasible solution for the proposed sequential model (i.e., π_MDP ⊆ π_SMDP), the following holds:

v_N^{π*_MDP} ≤ v_N^{π*_SMDP}.

Figure 3 shows the difference in values due to the optimal policies of the standard MDP model and the proposed sequential MDP (i.e., v_N^{π*_SMDP} − v_N^{π*_MDP}). The figure shows that, depending on the initial state, the new model can provide a significant improvement by utilizing the additional information (observing the transitions before deciding on actions).

Fig. 3. The figure shows the difference in the utility (optimal value) between the sequential MDP strategy, which takes advantage of the observed transitions, and the standard MDP, which does not use this extra information. The difference in utility depends on the initial position of the agents: some bins yield a higher reward gain than others.
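For completeness, here is one possible way to construct the observation matrices G_i for this grid example. The disturbance model is an assumption on our part (the paper only states the 0.6/0.4 split): the commanded bin is reached with probability 0.6 and the remaining probability is split uniformly among the other admissible moves, while commands that would leave the domain are given zero probability, so the corresponding phase is never accepted.

```python
import numpy as np

def grid_observation_matrices(rows=10, cols=10, p_success=0.6):
    """Build n x m observation matrices G_i for a rows x cols grid
    (n = rows*cols states, m = 5 actions: up, down, left, right, stay)."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
    n, m = rows * cols, len(moves)
    G = []
    for rr in range(rows):
        for cc in range(cols):
            G_i = np.zeros((n, m))
            dests = [(rr + dr, cc + dc) for dr, dc in moves
                     if 0 <= rr + dr < rows and 0 <= cc + dc < cols]
            for k, (dr, dc) in enumerate(moves):
                tr, tc = rr + dr, cc + dc
                if not (0 <= tr < rows and 0 <= tc < cols):
                    continue               # command leaving the domain: zero column
                for ar, ac in dests:
                    j = ar * cols + ac
                    if (ar, ac) == (tr, tc):
                        G_i[j, k] += p_success
                    else:
                        G_i[j, k] += (1.0 - p_success) / (len(dests) - 1)
            G.append(G_i)
    return G

# These matrices, with random rewards in [0, 100], can be fed to an LP-based
# backward induction such as the one sketched in Section IV-B above.
```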

VI. CONCLUSION

This paper introduces a novel model for MDPs that incorporates into the policy design additional observations of the action transitions, made in a sequential manner. This model, referred to as the Sequentially Observed MDP (SO-MDP), achieves better expected total rewards than the optimal policies of the standard MDP models studied in the literature, due to the use of the additional information. We also propose an efficient linear-programming-based algorithm for the synthesis of optimal policies.

APPENDIX

PROOF OF LEMMA 1
Let π′ be a Markovian policy; then the decision variables at a decision epoch t are defined as the matrices {P_i(t), i = 1, ..., n}. The proof proceeds by constructing these decision variables such that the following equations hold for all t:

Prob_{π′}{X_t = j | X_{t−1} = i, A_{t−1} = a_k} = Prob_π{X_t = j | X_{t−1} = i, A_{t−1} = a_k, X_1 = s},   (20)

Prob_{π′}{A_t = a_k | X_t = i} = Prob_π{A_t = a_k | X_t = i, X_1 = s}.   (21)

Define the following Markovian policy d′_t, for a nonzero denominator, where the decision variables P_i(j,k,t) are chosen as a function of π as follows:

P_i(j,k,t) = Prob_π{X_{t+1} = j, A_t = a_k | X_t = i, X_1 = s} / ( G_i(j,k) Π_{l=1}^{k−1} (1 − q_i(a_l)) ).   (22)

Without loss of generality, we can choose P_i(j,k,t) = 1 when G_i(j,k) = 0 or when Π_{l=1}^{k−1}(1 − q_i(a_l)) = 0. Next we show that the numerator is always less than or equal to the denominator, so that P_i(j,k,t) is well defined. Let L = {σ ∈ H_t | X_1 = s, X_t = i}, α_σ = Prob_π(σ), let P_i^σ(j,k,t) be the history dependent decision variable, q_i^σ(a_k) = Σ_{s∈S} G_i(s,k) P_i^σ(s,k,t), and α_{σ,k} = α_σ Π_{l=1}^{k−1} (1 − q_i^σ(a_l)). It can be shown by induction on k that P_i(j,k,t) = Σ_{σ∈L} α_{σ,k} P_i^σ(j,k,t) / Σ_{σ∈L} α_{σ,k}; the proof is omitted due to page space constraints. Thus P_i(j,k,t) is a well-defined probability.

Let us show next that equation (20) holds:

Prob_{π′}{X_t = j | X_{t−1} = i, A_{t−1} = a_k}
= P_i(j,k,t−1) G_i(j,k) / Σ_{l=1}^{n} P_i(l,k,t−1) G_i(l,k)
= Prob_π{X_t = j, A_{t−1} = a_k | X_{t−1} = i, X_1 = s} / Σ_{l=1}^{n} Prob_π{X_t = l, A_{t−1} = a_k | X_{t−1} = i, X_1 = s}
= Prob_π{X_t = j, A_{t−1} = a_k | X_{t−1} = i, X_1 = s} / Prob_π{A_{t−1} = a_k | X_{t−1} = i, X_1 = s}
= Prob_π{X_t = j | X_{t−1} = i, A_{t−1} = a_k, X_1 = s}.

Let us show now equation (21):

Prob_{π′}{A_t = a_k | X_t = i} = p_i(a_k) = Π_{l=1}^{k−1} (1 − q_i(a_l)) q_i(a_k)
= Σ_{j=1}^{n} ( Π_{l=1}^{k−1} (1 − q_i(a_l)) ) P_i(j,k,t) G_i(j,k).

Now, by using Equation (22),

p_i(a_k) = Σ_{j=1}^{n} Prob_π{X_{t+1} = j, A_t = a_k | X_t = i, X_1 = s} = Prob_π{A_t = a_k | X_t = i, X_1 = s}.

Now that we have established equations (20) and (21), we can proceed by induction to complete the proof of the lemma, as done for standard MDPs in [14, Theorem 5.5.1]. Note that, in the standard case, establishing (20) is trivial, which is not the case for SO-MDPs. Clearly equation (9) holds for t = 1. Assume it holds also for t = 2, ..., u−1. Then

Prob_π{X_u = j | X_1 = s}
= Σ_{i=1}^{n} Σ_{k=1}^{m} Prob_π{X_{u−1} = i, A_{u−1} = a_k | X_1 = s} Prob_π{X_u = j | X_{u−1} = i, A_{u−1} = a_k, X_1 = s}
= Σ_{i=1}^{n} Σ_{k=1}^{m} Prob_{π′}{X_{u−1} = i, A_{u−1} = a_k | X_1 = s} Prob_{π′}{X_u = j | X_{u−1} = i, A_{u−1} = a_k}
= Prob_{π′}{X_u = j | X_1 = s},   (23)

where we used the induction hypothesis and (20). Therefore,

Prob_{π′}{X_u = j, A_u = a_k | X_1 = s}
= Prob_{π′}{A_u = a_k | X_u = j} Prob_{π′}{X_u = j | X_1 = s}
= Prob_π{A_u = a_k | X_u = j, X_1 = s} Prob_π{X_u = j | X_1 = s}
= Prob_π{X_u = j, A_u = a_k | X_1 = s},

where we used equations (21) and (23). This ends the proof.

PROOF OF PROPOSITION 2
We show this by induction. From the definition of g_N(·) the base case is satisfied, i.e., J_N(x) = x^T r̄_N = x^T V_N^*. Suppose the hypothesis is true for N−1, ..., t+1; we then show it is true for t. From the DP algorithm we can write

J_t(x) = max_{P_1(t),...,P_n(t) ∈ C} { x^T r̄_t + J_{t+1}(M_t x) }   (24)
       = max_{P_1(t),...,P_n(t) ∈ C} { x^T r̄_t + x^T M_t^T V_{t+1}^* }   (25)
       = max_{P_1(t),...,P_n(t) ∈ C} { Σ_i x_i ( r̄_t(i) + m_{i,t}^T V_{t+1}^* ) }   (26)
       = Σ_i x_i ( max_{P_i(t) ∈ C} { r̄_t(i) + m_{i,t}^T V_{t+1}^* } ),   (27)

where m_{i,t}^T indicates the transpose of the i-th column of M, which is a function of the decision variables of the P_i matrix only. The transition from (24) to (25) is due to the induction assumption, and the transition from (26) to (27) holds because x_i ≥ 0 for all i and the function is separable in the optimization variables. The maximization inside the parentheses is nothing but V_t^*(i); hence J_t(x) = Σ_i x_i V_t^*(i) = x^T V_t^*, and this ends the proof.

PROOF OF LEMMA 2
We prove this lemma by showing, by induction, that Π_{l=1}^{k−1} (1 − q_i(a_l)) = 1 − Σ_{l=1}^{k−1} Σ_{s=1}^{n} G_i(s,l) X_i(s,l). The claim is true for k = 2 by the definition of q_i(a_l). Suppose it is true up to k−2, and let us show it for k−1. We have

Π_{l=1}^{k−1} (1 − q_i(a_l)) = ( Π_{l=1}^{k−2} (1 − q_i(a_l)) ) (1 − q_i(a_{k−1}))
= Π_{l=1}^{k−2} (1 − q_i(a_l)) − Σ_{s} G_i(s,k−1) X_i(s,k−1)
= 1 − Σ_{l=1}^{k−1} Σ_{s=1}^{n} G_i(s,l) X_i(s,l),

where the second equality uses the definition (16) of X_i(s,k−1) together with q_i(a_{k−1}) = Σ_s G_i(s,k−1) P_i(s,k−1), and the last equality uses the induction hypothesis.

REFERENCES
[1] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1986.
[2] D. C. Parkes and S. Singh, "An MDP-based approach to Online Mechanism Design," in Proc. 17th Annual Conf. on Neural Information Processing Systems (NIPS'03), 2003.
[3] D. A. Dolgov and E. H. Durfee, "Resource allocation among agents with MDP-induced preferences," Journal of Artificial Intelligence Research (JAIR-06), vol. 27, pp. 505–549, December 2006.
[4] P. Doshi, R. Goodwin, R. Akkiraju, and K. Verma, "Dynamic workflow composition using Markov decision processes," in Web Services, 2004. Proceedings. IEEE International Conference on, July 2004, pp. 576–582.
[5] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. MIT Press, 1998.
[6] C. Szepesvári, "Algorithms for reinforcement learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.
[7] E. Feinberg and A. Shwartz, Handbook of Markov Decision Processes: Methods and Applications, ser. International Series in Operations Research & Management Science. Springer US, 2002.
[8] E. Altman, "Applications of Markov Decision Processes in Communication Networks: a Survey," INRIA, Research Report RR-3984, 2000.
[9] R. Bellman, Dynamic Programming, 1st ed. Princeton, NJ, USA: Princeton University Press, 1957.
[10] R. A. Howard, Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press, 1960.
[11] S. Feyzabadi and S. Carpin, "Risk-aware path planning using hierarchical constrained Markov decision processes," in Automation Science and Engineering (CASE), 2014 IEEE International Conference on, Aug 2014, pp. 297–303.
[12] L. Blackmore, M. Ono, A. Bektassov, and B. Williams, "A probabilistic particle-control approximation of chance-constrained stochastic predictive control," IEEE Transactions on Robotics, vol. 26, no. 3, pp. 502–517, June 2010.
[13] S. Balachandran and E. M. Atkins, A Constrained Markov Decision Process for Flight Safety Assessment and Management. American Institute of Aeronautics and Astronautics, 2015. [Online]. Available: http://dx.doi.org/10.2514/6.2015-0115
[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, ser. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons, 1994, a Wiley-Interscience publication.
[15] D. Zhang and W. L. Cooper, "Revenue management for parallel flights with customer-choice behavior," Operations Research, vol. 53, no. 3, pp. 415–431, 2005.
[16] P. Nain and K. Ross, "Optimal priority assignment with hard constraint," IEEE Transactions on Automatic Control, vol. 31, no. 10, pp. 883–888, Oct 1986.
[17] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artificial Intelligence, vol. 101, no. 1–2, pp. 99–134, 1998.
[18] L. P. Kaelbling, M. L. Littman, and A. P. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[19] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2–3, pp. 235–256, 2002.
[20] M. H. DeGroot, Optimal Statistical Decisions. Hoboken, NJ: John Wiley & Sons, 2004.
[21] E. Altman, Constrained Markov Decision Processes, ser. Stochastic Modeling Series. Taylor & Francis, 1999.
[22] M. El Chamie and B. Açıkmeşe, "Finite-Horizon Markov Decision Processes with Sequentially-Observed Transitions," ArXiv, Research Report 1507.01151, July 2015.
[23] T. S. Ferguson, "Who solved the secretary problem?" Statist. Sci., vol. 4, no. 3, pp. 282–289, 1989.
[24] M. Babaioff, N. Immorlica, D. Kempe, and R. Kleinberg, "A knapsack secretary problem with applications," in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007, vol. 4627, pp. 16–28.
[25] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, 3rd ed. Athena Scientific, 2005.
[26] B. Açıkmeşe and D. Bayard, "A Markov chain approach to probabilistic swarm guidance," in American Control Conference (ACC), 2012, June 2012, pp. 6300–6307.
[27] B. Açıkmeşe, N. Demir, and M. Harris, "Convex necessary and sufficient conditions for density safety constraints in Markov chain synthesis," Automatic Control, IEEE Transactions on, 2015.
