Convex Synthesis of Randomized Policies for Controlled Markov Chains with Density Safety Upper Bound Constraints

Mahmoud El Chamie, Yue Yu, and Behçet Açıkmeşe

The authors are with the Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, 210 E. 24th St., Austin, TX 78712 USA. Emails: [email protected], [email protected], [email protected].

Abstract— The main objective of this paper is to synthesize optimal decision-making policies for a finite-horizon Markov Decision Process (MDP) while satisfying a safety constraint that imposes an upper bound on the state probability density function (pdf) for all time steps. Since the safety constraint is imposed on the state pdf, we must control the underlying Markov Chain (MC) via the synthesis of the decision-making policy for the MDP. It is well known that for standard unconstrained MDP problems the optimal policies are deterministic and Markovian; however, that does not necessarily hold in the presence of safety constraints, and in many cases safe policies can only be randomized. The classical approach for constrained MDPs, based on state-action frequencies, yields decision policies that provide safety constraint satisfaction for stationary distributions (i.e., asymptotically), but it does not extend naturally to hard safety constraints on the transient behavior. This paper introduces a new synthesis method for randomized Markovian policies for finite-horizon MDPs, where safety constraint satisfaction is guaranteed for both the transient and the stationary distributions. The safe policies are designed for the worst-case initial state pdf, i.e., they aim to maximize the minimal reward over all feasible initial distributions. An efficient Linear Programming (LP) based synthesis algorithm is proposed, which optimizes over a convex set of feasible policies and ensures that the expected total reward is above a computable lower bound. A simulation example of a swarm of autonomous agents is presented to demonstrate the proposed solution algorithm.

I. INTRODUCTION

Markov Decision Processes (MDPs) have been used to formulate many decision-making problems in a variety of areas of science and engineering [1]–[3]. MDPs are useful in modeling decision-making problems for stochastic dynamical systems [4] where the dynamics cannot be fully captured by first-principles formulations. Measured data can be used to construct the state transition probabilities for MDP models; hence MDPs play a critical role in big-data analytics. Indeed, very popular machine learning methods such as reinforcement learning and its variants [5], [6] are built on the MDP framework. With the increased interest in Cyber-Physical Systems, there is even more interest in MDPs to facilitate rigorous construction of new hierarchical decision-making architectures, where the MDP framework can integrate physics-based models with data-driven models. In many applications [7], [8], MDP models are used to compute optimal decisions when future actions contribute to the overall system performance. Here we consider MDP-based sequential stochastic decision-making models [9], which are composed of a set of time epochs, actions, states, and immediate rewards/costs.


Actions transfer the system in a stochastic manner from one state to another, and rewards are collected based on the actions taken at the corresponding states. The objective is to synthesize the best decision (action selection) policies that maximize expected rewards (minimize costs) for the MDP.

This paper presents new results that aim to increase the fidelity of MDPs for decision-making by incorporating a general class of safety constraints, which impose upper bounds on the state probability density function (pdf) for all time epochs. Since the time evolution of the state pdf is determined by the underlying Markov Chain (MC) resulting from the decision policy, we can view the problem as an MDP policy synthesis problem to achieve a safe MC [10]–[12]. An important feature of such decision policies is that, in many cases, they must be randomized rather than deterministic. The existence of safety constraints on the underlying MC can eliminate the feasibility of deterministic actions, which are optimal for standard MDPs [9]. The classical approach in the presence of constraints is based on state-action frequencies [13], [14]. This approach provides safety constraint satisfaction for stationary distributions (i.e., asymptotically); however, it does not necessarily satisfy the hard constraints on the transient behavior imposed by the safety upper bound constraints studied in this paper.

In particular, we consider MDPs with a finite number of states and actions subject to hard safety constraints on the state pdf. We give an efficient Linear Programming (LP)-based policy synthesis algorithm, which is obtained by using the duality theory of convex optimization [15]. The algorithm optimizes over the convex set of all feasible policies and guarantees the expected total reward to be above a computable lower bound. To the best of our knowledge, this is the first result on safety-constrained MDP problems that gives an efficient algorithm for generating finite-horizon randomized policies for controlled MCs that satisfy hard transient and stationary safety constraints with reward/cost guarantees. Another advantage of the proposed solution is that it is independent of the initial state pdf; thus it can be solved offline and implemented in large-scale systems, e.g., multi-agent systems. Moreover, being able to ensure these safety constraints can provide advancements in machine learning methods [16]–[18], which explore solutions of complex MDPs by utilizing data.

The rest of the paper is organized as follows. Section II gives a brief literature review of constrained MDP problems. Section III formally defines the notation and the mathematical

concepts for MDPs. Section IV gives the general form of safety constraints addressed in this paper. Section V reviews the classical approach for constrained MDPs and shows that it can give policies that violate feasibility for the class of constraints studied in this paper. Section VI gives the dynamic programming approach and establishes the main technical results of the paper. Section VII provides simulations of the theoretical results applied to autonomous multi-agent systems. Finally, Section VIII concludes the paper.

II. RELATED PRIOR RESEARCH

In MDPs, constraints can be utilized to handle multiple design objectives, where decisions are computed to maximize rewards for one of the objectives while guaranteeing the value of the other objective to be within a desired range [19]. The constraints can also be imposed by the underlying physical environment, e.g., safety constraints imposed by a mission as in multi-agent autonomous systems [12], or constraints in telecommunication applications [14]. In these constrained MDPs, the calculation of optimal policies can be much more difficult, so the constraints are usually relaxed with the hope that the resulting decisions would still provide feasible solutions. However, in some applications, constraints can be critical and must be satisfied for all times [10]–[12], [20].

Previous research has focused on finding optimal infinite-horizon stationary policies for constrained MDPs. Due to the constraints, the optimal policies might no longer be deterministic and stationary [21]. Reference [22] gives an example of a transient multi-chain MDP with state constraints and shows that the optimal policy is not necessarily stationary. It is well known that, in the presence of constraints, randomization in the actions can be necessary for obtaining optimal policies [23], [24]. For stationary policies to be optimal, specific assumptions on the underlying Markov chain, such as regularity, are often used [13]. Optimal stationary policies for these specific models can be obtained using algorithms based on Linear Programming (LP) or Lagrange multipliers [25]–[28]. LP algorithms have also been developed for approximate dynamic programming [29] in applications where exact methods do not scale (e.g., due to the curse of dimensionality). More recently, MDPs with constraints have been applied to path planning for robotics [30] and to flight control problems [31]. From an algorithmic point of view, reference [32] provides an efficient algorithm for solving large-scale constrained MDP problems. Constraints also allow a direct relationship with chance-constrained optimal control [33], e.g., chance-constrained motion planning [34].

As most related works on policy synthesis for constrained MDP problems (e.g., the classical approach leveraging state-action frequencies) are developed for stationary policies, the resulting policies do not take into account the transient behavior of the induced underlying MC. Thus these policies can lead to violation of the safety upper bound constraints during transients. In this paper, we aim to develop decision policies for MDPs that consider both the transient behavior of the system and the stationary regime. In

particular, we consider decision policies for the worst-case initial condition that provide safety for the entire duration of the MDP process.

III. PRELIMINARIES AND NOTATION

For completeness, we define the notation used in the paper, which is fairly standard in the MC and MDP literature [9].

A. States and Actions

Let S = {1, . . . , n} be the set of states, with finite cardinality |S| = n, and let A = {1, . . . , p} be the set of actions, which can be state dependent, i.e., A_s is the set of actions available at state s. Let X_t and A_t be the random variables corresponding to the state and action at the t-th time epoch. The vector e_s is the vector of all zeros except for the s-th element, which is equal to 1; 1 is the vector of all ones; and ⊙ denotes the element-wise (Hadamard) product.

B. Decision Rule and Policy

We define a decision rule D_t at time epoch t to be a randomized function D_t : S → A that assigns to every state s ∈ S a random variable A_t = D_t(s) ∈ A with a probability distribution on P(A) given by q_{D_t(s)}(a) = Prob[A_t = a | X_t = s] for any action a ∈ A. For simplicity, we drop the index from q_{D_t(s)}(a) when there is no confusion and simply write q_t(s, a); let also Q_t ∈ R^{n×p} be the decision matrix whose entry in the s-th row and a-th column is q_t(s, a). Let π = (D_1, D_2, . . . , D_{N−1}) be the decision-making policy, given that there are N − 1 decision epochs. Note that this decision rule has a Markovian property because it depends only on the current state. For unconstrained MDP problems, deterministic Markovian policies are known to be optimal [9]. This paper considers only Markovian policies; the study of history-dependent policies [9] is a subject of future research. For a given policy π, the process {X_t, t = 1, 2, . . . } is a discrete-time MC whose state space is S and whose transition probability matrix M_t (not necessarily time-homogeneous, i.e., it can change with time t) is defined by the policy π. This dynamical system is referred to as the one-dimensional stochastic system in the rest of the paper.

C. Rewards

Given a state s ∈ S and action a ∈ A, we define the reward r_t(s, a) ∈ R to be any real number, let R_t ∈ R^{n×p} be the matrix with elements r_t(s, a), and let R be the set of these values for all t. We define the expected reward for a given decision rule D_t at time t as r̄_t(s) = E[r_t(s, A_t)] = Σ_{a∈A} q_t(s, a) r_t(s, a), and the vector r̄_t ∈ R^n collects the expected rewards of all states. Note that r̄_t is a linear function of the decision matrix Q_t, i.e.,

r̄_t(Q_t) = (R_t ⊙ Q_t) 1.   (1)

This linearity property will be used for the convex synthesis of policies and decision rules later in the paper. Since there are N − 1 decision epochs, there are N reward stages, and the final-stage reward is given by r_N(s) (or r̄_N, the vector whose entries are the final rewards for each state).

D. State Transitions

We now define the transition probabilities given the current state and current action as p_t(i|j, k) = Prob[X_{t+1} = i | X_t = j, A_t = k]. Let G_{t,k} ∈ R^{n×n} be the matrix with elements G_{t,k}(i, j) = p_t(i|j, k), and let G be the set of all transition probabilities. For a given policy π, the elements of the induced MC transition matrix M_t ∈ R^{n×n} are

M_t(i, j) = Prob[X_{t+1} = i | X_t = j] = Σ_{a∈A} q_t(j, a) p_t(i|j, a).   (2)

Let x_t(i) = Prob[X_t = i] be the probability of being at state i at time t, and let x_t ∈ R^n be the vector with these elements. The probability vector x_t is then the state probability density function (pdf). The state pdf vector evolves according to the recursion x_{t+1} = M_t x_t, where M_t is a linear function of the decision matrix Q_t,

M_t(Q_t) = Σ_{k=1}^p G_{t,k} ⊙ (1 (Q_t e_k)^T).   (3)

E. Markov Decision Processes (MDPs)

Let γ ∈ (0, 1] be the discount factor, which represents the importance of a current reward in comparison to future possible rewards. A discrete MDP is a 5-tuple (S, A, G, R, γ), where S is a finite set of states, A is a finite set of available actions, G is the set that contains the transition probabilities given the current state and current action, and R is the set of rewards.

F. Performance Metric

For one policy to be better than another, we need a performance metric. We use the expected discounted total reward,

v_N^π = E_{x_1} [ Σ_{t=1}^{N−1} γ^{t−1} r_t(X_t, A_t) + γ^{N−1} r_N(X_N) ],   (4)

where the expectation is conditioned on knowing the initial state pdf (i.e., x_1 ∈ P(S) where x_1(i) = Prob[X_1 = i]). For example, if the agent is in state s at t = 1, then x_1 = e_s. It is worth noting that, in the above expression, both X_t and A_t are random variables.
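To make these definitions concrete, the following minimal sketch builds the induced chain M_t(Q_t) from (3), the expected reward vector from (1), and evaluates the performance metric (4) by propagating the state pdf. The array layout (G stored as a p × n × n array, with the columns of M_t indexed by the current state) and all function names are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def induced_chain(G, Q):
    """Eq. (3): M(Q) = sum_k G[k] * (1 (Q e_k)^T); M is column-stochastic.
    G: (p, n, n) with G[k, i, j] = Prob[next = i | current = j, action = k].
    Q: (n, p) decision matrix whose rows are distributions over actions."""
    p, n, _ = G.shape
    M = np.zeros((n, n))
    for k in range(p):
        M += G[k] * Q[:, k][np.newaxis, :]   # scale column j of G[k] by q(j, k)
    return M

def expected_reward(R, Q):
    """Eq. (1): rbar(Q) = (R ⊙ Q) 1."""
    return (R * Q).sum(axis=1)

def total_expected_reward(G, R_list, r_final, Q_list, x1, gamma):
    """Eq. (4)/(15): v = sum_t gamma^{t-1} x_t^T rbar_t + gamma^{N-1} x_N^T r_N."""
    x, v = x1.copy(), 0.0
    for t, Q in enumerate(Q_list):                # t = 0, ..., N-2
        v += gamma**t * x @ expected_reward(R_list[t], Q)
        x = induced_chain(G, Q) @ x               # x_{t+1} = M_t(Q_t) x_t
    return v + gamma**len(Q_list) * x @ r_final   # terminal stage
```

Because the state pdf evolves deterministically under a fixed policy, the discounted reward can be evaluated exactly by this forward propagation without any Monte Carlo simulation.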

IV. SAFETY CONSTRAINED MDP PROBLEM

The optimal policy π* is the policy that maximizes the performance measure, π* = argmax_π v_N^π, and v_N^* = max_π v_N^π is the optimal value. Note that this maximization is unconstrained, and the optimization variables are q_t(s, a) for any s ∈ S and a ∈ A. (Since v_N^π is continuous in the decision variables, which belong to a closed and bounded set, the max is always attained and the argmax is well defined.) The backward induction algorithm [9, p. 92], based on dynamic programming, provides the optimal policy in the absence of constraints on the state pdf vector x_t for t = 1, . . . , N. Next we introduce safety constraints as follows:

x_t ≤ d, for t = 1, . . . , N,   (5)

where ≤ denotes the element-wise inequality, and d ∈ [0, 1]^n is a vector giving upper bounds on the system state pdf. These hard safety constraints lead to correlations between the decision rules at different states of the induced MC, and the backward induction algorithm can then no longer be used to find optimal policies; even finding a feasible policy can be very challenging. We refer to this problem as the Safety Constrained MDP (SC-MDP). The optimal policy synthesis problem for the SC-MDP can then be written as

maximize_{Q_1, ..., Q_{N−1}}   v_N^π
subject to   x_t ≤ d, for t = 1, . . . , N,
             Q_t 1 = 1, for t = 1, . . . , N − 1,
             Q_t ≥ 0, for t = 1, . . . , N − 1,   (6)

where Q_t ∈ R^{n×p} is the decision matrix. The last two sets of constraints guarantee that the variables define probability distributions. Without the first set of constraints, the problem is the well-studied standard unconstrained MDP problem, in which the decision variables in different rows of Q_t are independent and uncorrelated. With the added first set of constraints, the rows of the Q_t's become correlated, and backward induction, which leverages the independence of the rows of Q_t, cannot be applied directly. Also, note that these constraints are non-convex, because x_t = M_{t−1} · · · M_2 M_1 x_1, where, using equation (2), each matrix M_u is a linear function of the variables Q_u as given by (3).
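For reference, when the first constraint set of (6) is dropped, the problem reduces to the standard finite-horizon MDP solvable by backward induction [9, p. 92]. The sketch below shows that baseline (the function name and array layout are our own, following the conventions of the earlier sketch); its output is a deterministic policy, which in general violates the safety constraint (5) — precisely the gap addressed in the remainder of the paper.

```python
import numpy as np

def backward_induction_unconstrained(G, R_list, r_final, gamma):
    """Standard finite-horizon backward induction, ignoring the density constraint (5).
    G: (p, n, n) action-conditioned transition matrices, G[k, i, j] = P(i | j, k).
    Returns deterministic decision matrices Q_t (one-hot rows) and the value vector U_1."""
    p, n, _ = G.shape
    U = r_final.copy()                      # U_N = r_N
    policies = []
    for R in reversed(R_list):              # t = N-1, ..., 1
        # value of action k in state j: r_t(j, k) + gamma * sum_i P(i | j, k) U(i)
        action_values = R + gamma * np.stack([G[k].T @ U for k in range(p)], axis=1)
        best = action_values.argmax(axis=1)
        Q = np.zeros((n, p))
        Q[np.arange(n), best] = 1.0         # deterministic (one-hot) decision rule
        policies.append(Q)
        U = action_values.max(axis=1)       # U_t(j) = max_k action_values(j, k)
    policies.reverse()
    return policies, U
```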

V. CLASSICAL APPROACH FOR CONSTRAINED MDPs

This section details the classical approach to constraints in an MDP problem, and shows that the type of constraint used in the classical approach does not generalize to hard constraints on the transient state pdf of the form studied in this paper. Thus the corresponding policies obtained from the classical approach can violate the safety upper bound constraints (5). The classical approach is based on state-action frequencies and addresses infinite-horizon MDP problems. (For the finite horizon, the linear program can be adapted by including time as an additional state variable.) It formulates a linear program for the MDP [9, p. 299]:

maximize_{y(s,a), s∈S, a∈A}   Σ_{s∈S} Σ_{a∈A} r(s, a) y(s, a)   (7)

subject to, for i = 1, . . . , n,

Σ_{a∈A} y(i, a) − γ Σ_{s∈S} Σ_{a∈A} p(i|s, a) y(s, a) = α(i),   (8)

Σ_{a∈A} y(i, a) ≤ d(i) Σ_{s∈S} Σ_{a∈A} y(s, a),   (9)

y(i, a) ≥ 0, ∀ a ∈ A,   (10)

where α(i) is the initial state pdf x_{t=1}(i), and γ is the discount factor for the infinite horizon (for convergence of the LP, γ must be strictly less than 1). The variable y(i, a) is interpreted as the frequency of performing action a in state i, i.e.,

y(i, a) = Σ_{t=1}^∞ γ^{t−1} Prob[X_t = i, A_t = a].   (11)

The state pdf x_t is indirectly related to the variables y(i, a). In fact, y(i, a) can be viewed as a "cost" that is a function of the overall process of the state pdf x_t, t ≥ 1. To write this function explicitly, we define y(i) to be

y(i) = Σ_{a} y(i, a) / ( Σ_{s} Σ_{a} y(s, a) ).   (12)

Thus, by adding the constraints on y(i) in the LP, equation (9), and using equations (11) and (12), we find that y(i) must satisfy

y(i) = ( Σ_{t=1}^∞ γ^{t−1} Prob[X_t = i] ) / ( Σ_{t=1}^∞ γ^{t−1} ) = (1 − γ) Σ_{t=1}^∞ γ^{t−1} x_t(i) ≤ d(i).   (13)

Equation (13) provides important insight relating y(i) to the transient state pdf x_t(i). Let y*(s, a) be the solution of the LP. The classical constrained-MDP approach provides the following policy: the probability of choosing action k at state i is

q(i, k) = y*(i, k) / Σ_{a} y*(i, a).   (14)

Note that for a given policy for which (5) is satisfied, the second set of constraints in the LP, equation (9), is also satisfied. However, the reverse is not true: satisfying y(i) ≤ d(i) does not guarantee that x_t(i) ≤ d(i) for all t. In fact, since lim_{γ→1} y(i) = x_∞(i), where x_∞(i) is the stationary distribution of the MC, for N → ∞ and γ → 1 in the performance metric (4) the hard constraints are satisfied asymptotically. Even in this case, however, the classical-approach policy gives no guarantee on the transient behavior. Thus the classical approach cannot be used for the constraints given by (5).
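For comparison purposes, a sketch of the classical state-action-frequency LP (7)–(10) and the policy extraction (14) is given below, written with cvxpy (any LP solver would do). Variable and function names are ours; the fallback to a uniform rule for states with zero visitation frequency is also our own convention.

```python
import numpy as np
import cvxpy as cp

def classical_frequency_lp(G, R, d, alpha, gamma):
    """Classical state-action frequency LP (7)-(10) and policy extraction (14).
    G: (p, n, n) with G[k, i, j] = P(i | j, k); R: (n, p) rewards; alpha: initial pdf."""
    p, n, _ = G.shape
    y = cp.Variable((n, p), nonneg=True)                       # y(i, a); constraint (10)
    total = cp.sum(y)
    constraints = []
    for i in range(n):
        inflow = sum(G[k][i, :] @ y[:, k] for k in range(p))   # sum_{s,a} p(i|s,a) y(s,a)
        constraints.append(cp.sum(y[i, :]) - gamma * inflow == alpha[i])   # (8)
        constraints.append(cp.sum(y[i, :]) <= d[i] * total)                # (9)
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(R, y))), constraints).solve()  # (7)
    y_star = np.maximum(y.value, 0.0)                          # clean tiny negatives
    row = y_star.sum(axis=1, keepdims=True)
    # policy (14); uniform fallback for states with zero visitation frequency
    Q = np.where(row > 0, y_star / np.maximum(row, 1e-12), 1.0 / p)
    return Q
```

As discussed above, the policy returned here only bounds the discounted state-visitation frequency (13), not the transient pdf itself.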

As a numerical example, Figure 1 presents a case where the optimal stationary policy of the classical approach does not satisfy the hard constraints on the states' pdfs x_t for all time instances t = 1, 2, . . . .

Fig. 1. The dynamics of the probability of a state i (upper plot) and a state j (lower plot) for different values of the discount factor γ; the detailed description of the dynamical system is given in the Simulations section. The safety upper bound can be violated regardless of the value of γ: as γ approaches 1, the stationary distribution satisfies the safety constraints asymptotically, but at least one state still violates the constraints during the transient.

In fact, finding optimal policies with the hard constraints turns out to be a non-convex problem, as discussed further in the paper, and thus it cannot be solved directly by linear programming. The approach in this paper uses approximations in a dynamic programming algorithm to provide a feasible policy (a policy that guarantees satisfaction of the hard safety constraints) with some guarantees on the performance.

VI. DYNAMIC PROGRAMMING (DP) APPROACH TO MARKOVIAN POLICY SYNTHESIS

In this section, we use Dynamic Programming (DP) to provide feasible policies for the SC-MDP. First note that the performance metric can be written as follows:

v_N^π = E_{x_1} [ Σ_{t=1}^{N−1} γ^{t−1} r_t(X_t, D_t(X_t)) + γ^{N−1} r_N(X_N) ]
      = Σ_{t=1}^{N−1} γ^{t−1} E_{x_1}[r_t(X_t, D_t(X_t))] + γ^{N−1} E_{x_1}[r_N(X_N)]
      = Σ_{t=1}^{N} γ^{t−1} x_t^T r̄_t,   (15)

where the last equality uses the fact that the pdf of the random variable X_t is x_t. The discrete-time dynamical system describing the evolution of the state pdf x_t can then be given by

x_{t+1} = M_t(Q_t) x_t,   (16)

where M_t(Q_t) is a column-stochastic matrix, which is linear in the optimization variables Q_t as (3) shows. The dynamics (16) show that, even though the one-dimensional system (the discrete-time MC) is stochastic, the state pdf evolves deterministically. A policy Π = (D_1, . . . , D_{N−1}) for system (16) consists of a sequence of functions that map "states" x_t to controls Q_t = D_t(x_t) such that D_t(x_t) ∈ C(x_t), where C(x_t) is the set of admissible controls; the "states" are put in quotation marks because they are different from the states of the one-dimensional stochastic system. Since the performance metric (15) can be written as a function of the probability vector x_t, we can now work directly with the n-dimensional system (16) by defining the additive reward per stage as

g_t(x_t, Q_t) = x_t^T r̄_t, for t = 1, . . . , N − 1,   and   g_N(x_N) = x_N^T r̄_N.

Without the safety constraints x_t ≤ d, C(x_t) is independent of x_t and all admissible controls belong to the same convex set C for all states. The two systems, the n-dimensional deterministic system and the one-dimensional stochastic system, are in that case equivalent; thus any optimal policy Π* for the n-dimensional system defines the optimal policy (feedback law) π* for the Markov decision process. On the contrary, in the presence of the safety constraints, the optimal policies for the n-dimensional system are a function of the current state pdf vector, i.e., Q_t* = Q_t(x_t), information not available to the one-dimensional system. Nevertheless, using a worst-case analysis (i.e., over the worst-case probability vector x_t), we can construct an admissible policy π̂ for the one-dimensional case that satisfies the hard constraints and gives some performance guarantees.

The DP algorithm calculates the optimal value v_N* (and policy Π*) for the n-dimensional system as follows [35, Proposition 1.3.1, p. 23]:

Algorithm 1 Dynamic Programming (DP)
1: Start with J_N(x) = g_N(x).
2: For t = N − 1, . . . , 1,
      J_t(x) = max_{Q_t ∈ C(x)} { g_t(x, Q_t) + γ J_{t+1}(M_t(Q_t) x) }.
3: Result: J_1(x) = v_N*.

Remark. There are several difficulties in applying the DP Algorithm 1. Note that, in the expression J_{t+1}(M_t(Q_t)x), Q_t is an optimization variable. For a given Q_t and x, we can efficiently compute the value of J_{t+1}; but since Q_t itself is an optimization variable, the solution of the optimization problem in line 2 of Algorithm 1 can be very hard. In some special cases, for example when J_t(x) can be expressed analytically in closed form, the solution complexity can be reduced significantly, as in the unconstrained MDP problem. In general, however, the SC-MDP decision problem is non-convex and hence hard to solve.

A. DP Synthesis for Feasible Policies of SC-MDPs

When the safety constraints are present, J_t(x) does not have a closed-form solution, and hence finding an optimal (or even a feasible) solution is challenging. This section presents a new algorithm, Algorithm 2, to compute a feasible solution of the SC-MDP with a lower-bound guarantee on the expected reward.

Algorithm 2 Backward Induction: SC-MDP
1: Definitions: Q_t for t = 1, . . . , N − 1 are the optimization variables describing the decision policy. Let X = {x ∈ R^n : 0 ≤ x ≤ d, 1^T x = 1} and C = ∩_{x∈X} C(x) with
      C(x) = { Q ∈ R^{n×p} : Q 1 = 1, Q ≥ 0, M(Q) x ≤ d },
   where M(Q) is the transition matrix, linear in Q.
2: Set Û_N = r̄_N.
3: For t = N − 1, . . . , 1, given Û_{t+1}, compute the policy
      Q̂_t = argmax_{Q_t ∈ C} min_{x ∈ X} x^T ( r̄_t(Q_t) + γ M_t(Q_t)^T Û_{t+1} ),   (17)
   and the vector of expected rewards
      Û_t = r̄_t(Q̂_t) + γ M_t(Q̂_t)^T Û_{t+1}.
4: Result: v_N* ≥ x_1^T Û_1.

Theorem 1. Algorithm 2 provides a feasible policy for the SC-MDP problem (6) and guarantees the expected total reward to be at least R# = x_1^T Û_1, i.e., v_N* ≥ R#.

Proof. The proof is based on applying the DP Algorithm 1. Letting J_N(x) = x^T r̄_N, we show by induction that J_t(x) ≥ x^T Û_t. The claim holds for t = N. Now suppose it holds for t + 1; we prove it for t. We have from Algorithm 1 that

J_t(x) = max_{Q_t ∈ C(x)} { g_t(x, Q_t) + γ J_{t+1}(M_t(Q_t) x) }
       = max_{Q_t ∈ C(x)} { x^T r̄_t + γ J_{t+1}(M_t x) }
       ≥ max_{Q_t ∈ C(x)} { x^T r̄_t + γ x^T M_t^T Û_{t+1} }
       ≥ max_{Q_t ∈ C} { x^T ( r̄_t + γ M_t^T Û_{t+1} ) }
       ≥ x^T Û_t,

where the third line applies the induction hypothesis and the last line follows from the definition of Q̂_t. Since J_1(x_1) = v_N* (line 3 in Algorithm 1), this ends the proof.

Therefore, by calculating Û_t for t = N − 1, . . . , 1 via Algorithm 2, we obtain a policy with a guarantee on the total expected reward, namely v_N^π̂ ≥ R#, where π̂ = (Q̂_1, Q̂_2, . . . , Q̂_{N−1}).

B. LP Formulation of Q̂_t

The calculation of Q̂_t for t = 1, . . . , N − 1 in equation (17) is the main challenge in the application of Algorithm 2. This section describes a linear programming approach to the computation of Q̂_t in every iteration of Step 3 of Algorithm 2. From the algorithm, Q̂_t is given by

Q̂_t = argmax_{Q_t ∈ C} min_{x ∈ X} x^T U_t(Q_t),   (18)

where U_t(Q_t) = r̄_t(Q_t) + γ M_t(Q_t)^T Û_{t+1}, with Û_N = r̄_N and Û_t = U_t(Q̂_t) for t = N − 1, . . . , 1, and where M_t(Q_t) and r̄_t(Q_t) are given by equations (3) and (1), respectively:

M_t(Q_t) = Σ_{k=1}^p G_{t,k} ⊙ (1 (Q_t e_k)^T)   and   r̄_t(Q_t) = (R_t ⊙ Q_t) 1.   (19)

Theorem 2. The max-min problem given by (18) with (19) can be solved by the following equivalent linear programming problem (given t, d, G_{t,k} for k = 1, . . . , p, R_t, γ, and Û_{t+1}):

maximize_{Q, y, z, r̄, M, S, U, s, K}   − d^T y + z
subject to
      M = Σ_{k=1}^p G_{t,k} ⊙ (1 (Q e_k)^T),
      r̄ = (R_t ⊙ Q) 1,
      U = r̄ + γ M^T Û_{t+1},
      − y + z 1 ≤ U,
      K = M + S + s 1^T,
      s + d ≥ K d,
      Q 1 = 1, Q ≥ 0, y ≥ 0, S ≥ 0, K ≥ 0.   (20)

Proof. The proof uses the duality theory of linear programming [15], which implies that the following primal and dual problems attain the same optimal cost:

PRIMAL:  minimize b^T x  subject to  A^T x ≥ c,  x ≥ 0;
DUAL:    maximize c^T y  subject to  A y ≤ b,  y ≥ 0.

Since the set X is defined by 0 ≤ x ≤ d and x^T 1 = 1, the inner min in (18) can be written as a minimization problem with primal data b = U_t, A = [−I_n  1  −1], and c^T = [−d^T  1  −1]. The dual of this program is

maximize_{y, z}   − d^T y + z
subject to   − y + z 1 ≤ U_t(Q_t),   y ≥ 0,   z unconstrained.

Next, considering the argmax in (18), it remains to show that the set C can be represented by linear inequalities, so that (18) can be written as a single maximization LP. This is indeed the case by [12, Theorem 1], which states that

M ∈ M   ⟺   ∃ S ≥ 0, K ≥ 0, s such that K = M + S + s 1^T and s + d ≥ K d,

where M = ∩_{x∈X} M(x) and M(x) = {M ∈ R^{n×n} : 1^T M = 1^T, M ≥ 0, M x ≤ d}. As M in (19) is a linear function of the decision variable Q, the set C is equivalently described by C = {Q ∈ R^{n×p} : Q 1 = 1, Q ≥ 0, M(Q) ∈ M}, which shows that C can be described by linear inequalities. Combining this with the dual program above, we conclude that Q̂_t can be obtained via the linear program given in the theorem, which concludes the proof.
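To show how Algorithm 2 and the LP (20) fit together, the following cvxpy sketch implements one backward pass. It is a minimal transcription under our own conventions: the function names and the array layout (G as a p × n × n array, columns of M indexed by the current state) are ours, and the intermediate variables M, r̄, U of (20) are expressed directly as affine functions of Q rather than introduced as separate LP variables, which is equivalent.

```python
import numpy as np
import cvxpy as cp

def sc_mdp_step(G, R_t, d, gamma, U_next):
    """One iteration of Step 3 of Algorithm 2: solve the LP (20) and return (Q_hat_t, U_hat_t)."""
    p, n, _ = G.shape
    Q = cp.Variable((n, p), nonneg=True)
    y = cp.Variable(n, nonneg=True)
    z = cp.Variable()
    S = cp.Variable((n, n), nonneg=True)
    K = cp.Variable((n, n), nonneg=True)
    s = cp.Variable(n)

    M = sum(G[k] @ cp.diag(Q[:, k]) for k in range(p))   # eq. (3): column j of G_k scaled by q(j,k)
    rbar = cp.sum(cp.multiply(R_t, Q), axis=1)           # eq. (1)
    U = rbar + gamma * (M.T @ U_next)

    constraints = [
        -y + z * np.ones(n) <= U,                                  # dual of the inner min over X
        K == M + S + cp.reshape(s, (n, 1)) @ np.ones((1, n)),      # K = M + S + s 1^T
        s + d >= K @ d,                                            # condition from [12, Theorem 1]
        Q @ np.ones(p) == np.ones(n),                              # rows of Q are distributions
    ]
    cp.Problem(cp.Maximize(-d @ y + z), constraints).solve()

    Q_hat = np.clip(Q.value, 0.0, None)                  # numerical cleanup
    Q_hat /= Q_hat.sum(axis=1, keepdims=True)
    M_hat = sum(G[k] * Q_hat[:, k][np.newaxis, :] for k in range(p))
    U_hat = (R_t * Q_hat).sum(axis=1) + gamma * (M_hat.T @ U_next)
    return Q_hat, U_hat

def synthesize_sc_mdp_policy(G, R_list, r_final, d, gamma):
    """Algorithm 2: backward induction over t = N-1, ..., 1."""
    U = r_final.copy()                                   # U_hat_N = r_N
    policies = []
    for R_t in reversed(R_list):
        Q_hat, U = sc_mdp_step(G, R_t, d, gamma, U)
        policies.append(Q_hat)
    policies.reverse()
    return policies, U          # decision matrices and U_hat_1 (lower bound R# = x1 @ U_hat_1)
```

Note that the synthesized policy does not depend on the initial pdf x_1; the initial pdf enters only through the lower bound R# = x_1^T Û_1.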

VII. SIMULATIONS

This section simulates an example to demonstrate the performance of the proposed methodology for SC-MDPs, as compared to algorithms from the literature, on a vehicle swarm coordination problem [12], [36]. In this application, autonomous vehicles (agents) explore a region that can be partitioned into n disjoint subregions (or bins). We model the system as an MDP where the state of an agent is its bin location (n states) and the actions of a vehicle are the possible transitions to neighboring bins. Each vehicle collects rewards while traversing the area and, due to the stochastic environment, transitions are stochastic (i.e., even if the vehicle's command is to go "North", the environment can send the vehicle with a small probability to "East"). The safety constraints developed in this paper, which bound the probability of the system being in certain states, can be interpreted as follows: if a large number of vehicles is used, each performing an identical MDP, then by the law of large numbers the density of vehicles evolves according to Markov chain dynamics. Since the physical environment (capacity/size of bins) can impose constraints on the number of vehicles in a given bin, the safety constraints on the density can be imposed for conflict avoidance.

For simplicity, we consider the operational region to be a 3-by-3 grid, so there are 9 states. Each vehicle has 5 possible actions: "North", "South", "East", "West", and "Stay"; see Figure 2. When a vehicle is in a boundary bin and takes an action to go off the boundary, the environment makes it stay in the same bin for the next time epoch. Also note that the action "Stay" is deterministic: if a vehicle decides to stay in a certain bin, the environment will not send it anywhere else.

Fig. 2. Illustration of the 3 × 3 grid describing the nine MDP states and the five actions (North, South, West, East, and Stay).
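A sketch of how such a grid MDP could be assembled for the helpers above is given below. The 0.1 "slip" probability, the row-major bin ordering, and the function name are our own illustrative assumptions; the paper does not specify the exact perturbation probabilities of the environment.

```python
import numpy as np

def build_grid_mdp(rows=3, cols=3, slip=0.1):
    """3x3 grid MDP: states are bins (row-major), actions are North, South, East, West, Stay.
    With probability `slip`, a move action is diverted to one of the other move directions;
    off-grid moves keep the vehicle in its current bin; 'Stay' is deterministic."""
    n = rows * cols
    moves = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1), 4: (0, 0)}  # N, S, E, W, Stay
    def target(j, k):
        r, c = divmod(j, cols)
        dr, dc = moves[k]
        r2, c2 = r + dr, c + dc
        return j if not (0 <= r2 < rows and 0 <= c2 < cols) else r2 * cols + c2
    G = np.zeros((5, n, n))                 # G[k, i, j] = P(next = i | current = j, action = k)
    for j in range(n):
        G[4, target(j, 4), j] = 1.0         # 'Stay' is deterministic
        for k in range(4):
            others = [m for m in range(4) if m != k]
            G[k, target(j, k), j] += 1.0 - slip
            for m in others:                # environment disturbance
                G[k, target(j, m), j] += slip / len(others)
    return G
```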

The reward R_t(i, a) is assumed to be independent of the action a and the time t, and changes only with the state i; thus we can write R_t as a vector for each t = 1, . . . , N, defined as

R_t = [10  1  1  3  3  1  1  5  1]^T,   (21)

where R_t(i) is the reward collected at bin (state) i. The density (safety) upper bounds for the different bins are given by

d = [0.6  1  1  0.05  0.05  1  1  1  1]^T,   (22)

where any bin i should satisfy x_t(i) ≤ d(i) for t = 1, 2, . . . .

If the density constraints are relaxed, the unconstrained MDP solution (which is known to give deterministic policies) sends all the vehicles to the bin with the highest reward (in this case, bin 1). The classical approach, on the other hand, may cause a transient violation of the constraints even though the density later converges to a feasible fixed point. With the proposed policy, however, the constraints are satisfied at all times, and the solution also gives guarantees on the expected total reward. Note that the linear program (20) generates the policies independently of the initial distribution; therefore, even if the latter is unknown (which is usually the case in autonomous swarms), the generated policy satisfies the constraints.

We now consider that all the vehicles are initially in bin 6, i.e.,

x_1 = [0  0  0  0  0  1  0  0  0]^T,   (23)

which is a feasible starting density. Figure 3 shows that, in the scenario considered in this simulation with a discount factor γ ≈ 1, the unconstrained MDP policy sends the vehicles to a single bin (bin 1), regardless of the constraints there, since it gives the highest reward. The policy computed with the classical approach, on the other hand, makes the density converge to a feasible point asymptotically. However, both of these policies cause violations in bins 4 and 5 in order to collect a higher reward, and the classical approach does not prevent this transient violation. In contrast, the policy generated by Algorithm 2 distributes the swarm in such a way that the constraints are satisfied at every time epoch.

Fig. 3. Density of autonomous vehicles in bins 1, 4, and 5 over time, showing how the other policies can violate the constraints. Under the unconstrained policy, the density converges to an infeasible distribution. The classical-approach policy makes the density converge to a feasible fixed point but causes a transient violation. This problem is resolved by the SC-MDP policy synthesized by Algorithm 2.

To further assess the efficiency of the algorithm, we study the rewards associated with the proposed policy. In Figure 4, we compare the reward of the SC-MDP policy (Algorithm 2) with the infeasible policies of the unconstrained MDP and of the classical approach for dealing with constraints. The lower bound shown is the one derived in Theorem 1, which provides optimality guarantees for the LP-generated policy.

Fig. 4. Total expected reward v_N^π as a function of the horizon N. The unconstrained MDP curve is the total expected reward of the optimal MDP policy without density constraints. The constrained MDP (classical approach) curve corresponds to the optimal policy obtained with the classical approach to density constraints. Due to the (transient) violation of the constraints, neither of these is feasible. The SC-MDP feasible policy curve is the reward of a policy from the feasible set computed by the linear program (20). The lower bound R# derived in Theorem 1 is also shown.
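For illustration only, the hypothetical helpers sketched in the previous sections can be combined into an end-to-end run of this scenario; the horizon N = 30, γ = 0.999, and the tolerance in the checks are our own choices.

```python
import numpy as np

# Illustrative end-to-end run using the helper sketches above (all names are ours).
G = build_grid_mdp()                                         # 9 states, 5 actions
N, gamma = 30, 0.999
R = np.repeat(np.array([10., 1, 1, 3, 3, 1, 1, 5, 1])[:, None], 5, axis=1)  # eq. (21), action-independent
d = np.array([0.6, 1, 1, 0.05, 0.05, 1, 1, 1, 1])            # eq. (22)
x1 = np.zeros(9); x1[5] = 1.0                                # eq. (23): all agents start in bin 6

policies, U1 = synthesize_sc_mdp_policy(G, [R] * (N - 1), R[:, 0], d, gamma)
print("reward lower bound R# =", x1 @ U1)                    # Theorem 1

# Propagate the swarm density and verify the hard constraint (5) at every epoch.
x = x1.copy()
for Q in policies:
    assert np.all(x <= d + 1e-6)
    x = induced_chain(G, Q) @ x
assert np.all(x <= d + 1e-6)
```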

VIII. CONCLUSION

In this paper, we have studied finite-state, finite-horizon MDP problems with hard constraints on the probability of being in a state of the induced Markov chain. It is shown that neither the policies produced by unconstrained MDP algorithms nor those generated by the classical approach are feasible (i.e., they do not satisfy constraints of the form introduced in this paper). We provide an efficient algorithm based on linear programming and duality theory that generates feasible Markovian policies which not only satisfy the constraints but also come with theoretical guarantees on the expected reward. These safe policies define a probability distribution over possible actions and require that agents randomize their actions depending on the state. For future work, it would be interesting to provide tight bounds on the rewards of the feasible policies as compared with the optimal (non-convex)

safe policies. We would also like to extend the proposed policy to the infinite-horizon case using an algorithm similar to the "value iteration" of standard MDP problems.

REFERENCES

[1] D. C. Parkes and S. Singh, "An MDP-based approach to online mechanism design," in Proc. 17th Annual Conf. on Neural Information Processing Systems (NIPS'03), 2003.
[2] D. A. Dolgov and E. H. Durfee, "Resource allocation among agents with MDP-induced preferences," Journal of Artificial Intelligence Research, vol. 27, pp. 505–549, December 2006.
[3] P. Doshi, R. Goodwin, R. Akkiraju, and K. Verma, "Dynamic workflow composition using Markov decision processes," in IEEE Int. Conference on Web Services 2004, July 2004, pp. 576–582.
[4] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1986.
[5] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. MIT Press, 1998.
[6] C. Szepesvári, "Algorithms for reinforcement learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.
[7] E. Feinberg and A. Shwartz, Handbook of Markov Decision Processes: Methods and Applications, ser. International Series in Operations Research & Management Science. Springer US, 2002.
[8] E. Altman, "Applications of Markov decision processes in communication networks: a survey," INRIA, Research Report RR-3984, 2000.
[9] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, ser. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons, 1994.
[10] A. Arapostathis, R. Kumar, and S. Tangirala, "Controlled Markov chains with safety upper bound," IEEE Transactions on Automatic Control, vol. 48, no. 7, pp. 1230–1234, 2003.
[11] S. Hsu, A. Arapostathis, and R. Kumar, "On optimal control of Markov chains with safety constraint," in Proc. of American Control Conference, pp. 4516–4521, 2006.
[12] B. Acikmese, N. Demir, and M. Harris, "Convex necessary and sufficient conditions for density safety constraints in Markov chain synthesis," IEEE Transactions on Automatic Control, vol. 60, no. 10, pp. 2813–2818, Oct 2015.
[13] E. Altman and A. Shwartz, "Markov decision problems and state-action frequencies," SIAM J. Control Optim., vol. 29, no. 4, pp. 786–809, Jul. 1991.
[14] E. Altman, Constrained Markov Decision Processes, ser. Stochastic Modeling Series. Taylor & Francis, 1999.
[15] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
[16] C. A. Rothkopf and D. H. Ballard, "Modular inverse reinforcement learning for visuomotor behavior," Biological Cybernetics, vol. 107, no. 4, pp. 477–490, 2013.
[17] P. Geibel, "Reinforcement learning for MDPs with constraints," in Machine Learning: ECML 2006, ser. Lecture Notes in Computer Science, J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, Eds. Springer Berlin Heidelberg, 2006, vol. 4212, pp. 646–653.
[18] M. El Chamie and G. Neglia, "Newton's method for constrained norm minimization and its application to weighted graph problems," in American Control Conference (ACC), 2014, June 2014, pp. 2983–2988.
[19] D. Dolgov and E. Durfee, "Stationary deterministic policies for constrained MDPs with multiple rewards, costs, and discount factors," in Proceedings of the 19th International Joint Conference on Artificial Intelligence, ser. IJCAI'05. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005, pp. 1326–1331.
[20] N. Demir, B. Açıkmeşe, and M. Harris, "Convex optimization formulation of density upper bound constraints in Markov chain synthesis," American Control Conference, pp. 483–488, 2014.
[21] E. A. Feinberg and A. Shwartz, "Constrained Markov decision models with weighted discounted rewards," Mathematics of Operations Research, vol. 20, no. 2, pp. 302–320, 1995.
[22] M. Haviv, "On constrained Markov decision processes," Operations Research Letters, vol. 19, no. 1, pp. 25–28, 1996.
[23] P. Nain and K. Ross, "Optimal priority assignment with hard constraint," IEEE Transactions on Automatic Control, vol. 31, no. 10, pp. 883–888, Oct 1986.
[24] E. B. Frid, "On optimal strategies in control problems with constraints," Theory of Probability & Its Applications, vol. 17, no. 1, pp. 188–192, 1972.
[25] A. S. Manne, "Linear programming and sequential decisions," Management Science, vol. 6, no. 3, pp. 259–267, 1960.
[26] E. Altman and F. Spieksma, "The linear program approach in multi-chain Markov decision processes revisited," Zeitschrift für Operations Research, vol. 42, no. 2, pp. 169–188, 1995.
[27] A. Hordijk and L. C. M. Kallenberg, "Constrained undiscounted stochastic dynamic programming," Mathematics of Operations Research, vol. 9, no. 2, pp. 276–289, 1984.
[28] F. J. Beutler and K. W. Ross, "Time-average optimal constrained semi-Markov decision processes," Advances in Applied Probability, vol. 18, no. 2, pp. 341–359, 1986.
[29] D. P. de Farias and B. Van Roy, "The linear programming approach to approximate dynamic programming," Oper. Res., vol. 51, no. 6, pp. 850–865, Nov. 2003.
[30] S. Feyzabadi and S. Carpin, "Risk-aware path planning using hierarchical constrained Markov decision processes," in Automation Science and Engineering (CASE), 2014 IEEE International Conference on, Aug 2014, pp. 297–303.
[31] S. Balachandran and E. M. Atkins, A Constrained Markov Decision Process for Flight Safety Assessment and Management. American Institute of Aeronautics and Astronautics, 2015. [Online]. Available: http://dx.doi.org/10.2514/6.2015-0115
[32] C. Caramanis, N. Dimitrov, and D. Morton, "Efficient algorithms for budget-constrained Markov decision processes," IEEE Transactions on Automatic Control, vol. 59, no. 10, pp. 2813–2817, Oct 2014.
[33] M. Ono, Y. Kuwata, and J. Balaram, "Mixed-strategy chance constrained optimal control," in American Control Conference (ACC), 2013, June 2013, pp. 4666–4673.
[34] L. Blackmore, M. Ono, A. Bektassov, and B. Williams, "A probabilistic particle-control approximation of chance-constrained stochastic predictive control," IEEE Transactions on Robotics, vol. 26, no. 3, pp. 502–517, June 2010.
[35] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, 3rd ed. Athena Scientific, 2005.
[36] B. Acikmese and D. Bayard, "A Markov chain approach to probabilistic swarm guidance," in American Control Conference (ACC), 2012, June 2012, pp. 6300–6307.
