
Safe Markov Chains for ON/OFF Density Control with Observed Transitions

Nazlı Demirer, Mahmoud El Chamie, and Behçet Açıkmeşe

Abstract—This paper presents a convex optimization approach to controlling the density distribution of autonomous mobile agents (single or multiple) in a stochastic environment with two control modes: ON and OFF. The main new characteristic distinguishing this model from standard Markov decision models is the existence of the ON control mode and its observed actions. During the ON mode, the instantaneous outcome of one of the ON-mode actions is measured, and a decision is made as to whether to take this action based on this new observation. If the action is not taken, the OFF mode is activated, where a transition occurs according to a predetermined set of transition probabilities, without any additional observations. In this decision-making model, an agent acts autonomously according to an ON/OFF decision policy, and the discrete probability distribution of the agent's state evolves according to a discrete-time Markov chain that is a linear function of the stochastic environment and the ON/OFF decision policy. The policy synthesis is formulated as a convex optimization problem in which safety and convergence constraints are imposed on the resulting Markov matrix.

Fig. 1. 2D air balloon illustration with actions $a_i$. $G_i$ is the discrete probability density distribution for the x-position of the balloon resulting from taking action $a_i$ from its current position. In this example, the balloon observes the outcome of an action by changing its altitude based on the action, and the OFF mode is an MDP [5] running over all actions; hence $G_{off}$ is a function of the $G_i$'s, i.e., $G_{off} = f(G_1, G_2, \ldots)$.

I. INTRODUCTION

This paper presents a convex-optimization-based approach for the synthesis of randomized decision policies to control the density of mobile agents that switch between two modes: ON and OFF. Each mode consists of a (possibly overlapping) finite set of actions; that is, there exists a set of actions for the ON mode and another set for the OFF mode, which may have a non-empty intersection. At each time step, the instantaneous outcome, i.e., the transition, for a single action chosen from the set of ON-mode actions is measured (observed). Based on this observation, a decision is made on whether to accept or reject this transition. Both decisions, i.e., the selection of an action to observe and the acceptance/rejection of the corresponding transition, are made according to a randomized decision policy. If the proposed transition is rejected, the OFF mode is activated, where the transition to the next state is made according to a predetermined set of transition probabilities, without any additional observations. In this setting, the probability distribution of the agent's state evolves according to the resulting, underlying, finite-state and discrete-time Markov chain (MC). The proposed Markov decision model and formulation are applicable to systems with single or multiple agents; i.e., the density distribution can be interpreted as the temporal probability distribution of the state of a single agent, or as the state probability distribution over multiple agents. Among the many possible applications of the theoretical framework presented, we illustrate the results with a swarm control example.

The authors are with the Department of Aeronautics and Astronautics, University of Washington, Seattle, WA 98195. Emails: {ndemirer, melchami, behcet}@uw.edu

Our previous research developed methods for density control policy synthesis without the notion of agent modes or actions [1], [2], [3]; the idea was then extended to density control of ON/OFF agents in [4], where we considered only the case of a single action for the ON mode and a deterministic action for the OFF mode. In this paper, we generalize this model to the case of an ON mode with multiple actions and an OFF mode with stochastic transitions, through a new Markov decision model with additional measurements for state transitions. It is noteworthy that the OFF mode captures multiple interesting scenarios: (i) there is only a single action for the OFF mode, whose outcome is not observable; (ii) there are multiple actions for the OFF mode without outcome observations and a standard decision policy running over these actions, which results in an effective transition matrix for the underlying Markov chain; (iii) transitions can be observed for all available actions, and there is a standard decision policy that can run over the actions without requiring transition observations, which is used as the default policy when the observed action is rejected. Note that in the third interpretation, the sets of actions for the ON mode and the OFF mode are identical, and an observed action rejected at some instant may still be chosen when the default OFF-mode decision policy is executed after the rejection. When implementing this method, the decision policies are computed offline by solving an LMI problem and given to the agent, assuming that at each time step the current state can be observed, the one-step outcome of a single ON-mode action can be measured, and the transition corresponding to the selected action can be accepted


or rejected. In this sense, the ON and OFF modes can be seen as higher-level actions, under which there are lower-level motion actions. The proposed model has potential applications for autonomous systems operating in complex dynamic environments, such as flow fields, where both historical and observation-based data can be utilized. These applications include density control of blimp-type UAVs [6], wind-energy-based UAVs such as gliders [7], underwater vehicles [8], etc. An interesting application is controlling air balloons, referred to as Montgolfière balloons, in uncertain wind fields [9] for scientific measurements on Earth and other planets [10]. Here, the probabilistic wind velocity field information, i.e., the probability density functions for the wind speed and direction, changes as a function of the balloon's altitude. The balloons can change their altitudes in order to choose the velocity field that they ride with, i.e., the horizontal motion induced by the wind. Previous research proposed controlled Markov process models for this example [9]. In our approach, the action in this Markov process model is the choice of altitude (see Figure 1). We can then synthesize a default Markov Decision Process (MDP) policy to distribute the balloons based on this prior wind field knowledge (transition probabilities as a function of altitude), which defines the OFF mode. We can also design an ON policy, for which we measure the instantaneous velocity observed at different altitudes. We can accept or reject the selected altitude (action) for the ON mode based on the instantaneous velocity observed at that altitude. If it is rejected based on the observed outcome, we go to the altitude suggested by the OFF-mode policy. Note that the sets of actions for the ON and OFF modes are the same in this example.

II. RELATED RESEARCH

The proposed Markov model is applicable to decision-making for both single- and multi-agent systems in stochastic environments.
Our particular interest is motivated by the problem of guiding a multi-agent system, which has been a recent subject of research in Markov Decision Processes (MDPs) [11], [12], [13]. In general, the decision model presented here is complementary to MDP models [5] with finite numbers of states and actions, with the addition of a new set of observations in the decision process. Furthermore, rather than having a reward function for each action and transition [14], the mission objectives here are embedded within the underlying Markov chain. Most of the work in the area of MDPs and multi-agent systems focuses on the decentralized control of Markov decision processes (Dec-MDPs) and decentralized partially observable Markov decision processes (Dec-POMDPs), e.g., [15], where each agent has partial or incomplete observations of its state. These problems are generally very difficult to solve and become intractable for large-scale problems [16]. Such problems with incomplete observations are quite interesting, but not directly comparable to the problem considered in this paper, where we utilize additional measurements for state transitions. Using a Markov chain for density control of agent systems is a relatively new idea. [17] considers a similar

problem in the task allocation framework, where the probabilities of switching between tasks are designed to achieve the maximum redistribution rate, without any additional constraints. A different Markov chain based method is proposed in [18], using a probabilistic disablement approach found in [19]. [20] uses a biased random walk, which causes vehicle positions to evolve toward a probability density. Other approaches to multi-agent coordination include: using nearest-neighbor information to establish consensus [21], centroidal Voronoi diagrams [22], and PDE-based methods [23]. An interesting application introduced in the previous section, which can also be explored in the proposed ON/OFF framework, is the control of balloons in stochastic wind fields for atmospheric science observations [24]. Earlier work converted this motion planning problem into a more standard Markov decision model [10], [9]. In summary, the main distinction of our paper from the references listed above is the existence of the ON control mode and its observed actions. This allows us to devise new methods to control the density distribution of autonomous agents via a new Markov decision model with measurements on the state transitions. Measurements for the ON mode can be obtained by deploying additional sensors to extend the agents' sensing capabilities. Hence, the key contributions of this paper are: (i) formulation of a new Markov chain synthesis problem through a new Markov decision model, with additional measurements for the state transitions, where a policy is designed to ensure the desired safety and convergence properties for the underlying Markov chain; (ii) convexification of the synthesis problem; (iii) application of the model to density control of a swarm of autonomous mobile agents.
The rest of the paper is organized as follows: Section III introduces the density control problem for ON/OFF agents, presents the main result of the paper, provides the algorithm for implementation, and makes connections to MDPs; Section IV describes the ergodicity, transition, and safety constraints and formulates the convex LMI problem with these constraints; Section V gives an illustrative example that uses the Markov matrix with the safety constraints; Section VI concludes the paper.

Notation

The following is a partial list of the notation used: $\mathbf{0}$ ($\mathbf{1}$) is the vector/matrix of zeros (ones); $I$ is the identity matrix; $e_i$ is a vector with its $i$th entry equal to $+1$ and all other entries zero; $x[i] = e_i^T x$ for $x \in \mathbb{R}^n$, and $A[i,j] = e_i^T A e_j$ for $A \in \mathbb{R}^{n \times m}$; $Q = Q^T \succ (\succeq)\, 0$ means that $Q$ is a symmetric positive (semi)definite matrix; $R > (\geq)\, H$ means that $R[i,j] > (\geq)\, H[i,j]$ for all $i,j$; $x \in \mathcal{P}^n$ is a probability vector, i.e., $x \geq 0$ and $\mathbf{1}^T x = 1$; a matrix $M \in \mathcal{P}^{m \times m}$ is a Markov matrix if $M \geq 0$ and $\mathbf{1}^T M = \mathbf{1}^T$; $\mathbb{P}$ denotes the probability of a random variable; $\mathbb{R}^n$ is the $n$-dimensional real vector space; $\mathbb{N}$ is the set of nonnegative integers, i.e., $\mathbb{N} = \{0, 1, 2, \ldots\}$; $\mathbb{N}_n^+ = \{1, 2, \ldots, n\}$; $\odot$ represents the Hadamard (element-wise) product; $\mathrm{i}(A)$ is the indicator matrix for any matrix $A$, whose entries are given by $\mathrm{i}(A)[i,j] = 1$ if $A[i,j] \neq 0$ and $\mathrm{i}(A)[i,j] = 0$ otherwise; $\eta \sim U(0,1)$ denotes a random variable sampled from the uniform distribution on the interval $[0,1]$.


III. MARKOV CHAIN MODEL FOR DENSITY CONTROL OF ON/OFF AGENTS

In this section, we introduce the probabilistic density control problem for ON/OFF agents and then present the main result of the paper. Our main objective is: to synthesize decision-making policies for ON/OFF agents that make statistically independent decisions resulting in a desired behavior of the overall agent density distribution while satisfying safety constraints on the density. The density control problem is formulated as a Markov chain synthesis problem. To this end, we define a finite set of states $S = \{s_1, \ldots, s_n\}$; that is, there is a discrete state space with cardinality $n$, and $s_j$ is referred to as the "$j$th state". We consider a discrete-time system where $s(t) \in S$ is the state of the agent at time epoch $t$, i.e., $s(t) = s_i$ is the event that the state is the $i$th state at time $t$. Then the probability density distribution $x(t) \in \mathcal{P}^n$ is defined as:

$$x[i](t) = \mathbb{P}\{s(t) = s_i\}, \quad i \in \mathbb{N}_n^+, \; t \in \mathbb{N}, \tag{1}$$

where $t$ is the discrete time index. Hence $x[i](t)$ is the probability that a mobile agent is in the $i$th state at time $t$. In the rest of the section, we present the formulation for the time evolution of the density $x(t)$ as the following Markov chain, defined over the state space $S$:

$$x(t+1) = M x(t), \tag{2}$$

where $M$ is the transition matrix, i.e., $M[i,j] = \mathbb{P}\{s(t+1) = s_i \,|\, s(t) = s_j\}$. The transition matrix $M$ is a function of the stochastic environment and the ON/OFF policy, as will be explained next. Note that we have two modes of operation, $\sigma(t) \in \{\sigma_{on}, \sigma_{off}\}$. In the ON mode, the "next step" outcomes of the actions are observable, while a Markov chain, $G_{off}$, is propagated when the OFF mode is chosen. We have multiple actions in the ON mode whose transitions can be observed, i.e.,

$$\sigma(t) = \sigma_{on} \implies a(t) \in A_{on} = \{a_1, \ldots, a_m\}, \tag{3}$$

where $a(t)$ is the action taken in the ON mode. We define the following events to properly define the stochastic environment and the decision policy, for $t \in \mathbb{N}$:

- $y(t+1) = s_l$: observing a transition to state $s_l$;
- $v(t) = a_k$: observing the outcome of taking action $a_k \in A_{on}$;
- $a(t) = a_k$: accepting to execute action $a_k$.

Even though $y(t+1)$ has a time index $t+1$, this observation occurs at time $t$. In particular, the observation of a transition is different from the actual transition taking place. For example, $y(t+1) = s_l$ is the event that the stochastic environment would have caused a transition to the $l$th state at time $t+1$ if the ON-mode action were to be accepted at time $t$ (i.e., observing one step ahead into the future), whereas $s(t+1) = s_l$ is the event that the transition to the $l$th state has actually occurred. The event "$v(t) = a_k$" is used to define the probability of choosing an action whose outcome will be observed. In this paper, the environment transition matrices and the decision policy are assumed to be time-invariant (i.e., the processes are stationary), hence the corresponding Markov chain transition

matrix given in (2) is also time-invariant. Our objective is to synthesize a decision policy for an agent to accept or reject the transition observed for the ON-mode action at each time epoch, i.e., to decide whether it should be ON or OFF, such that the resulting Markov chain satisfies the desired transition and safety constraints while guiding the density distribution to a desired final distribution.

The stochastic environment is defined by the transition matrices $G_k, G_{off} \in \mathcal{P}^{n \times n}$, $k \in \mathbb{N}_m^+$, where $G_k[i,j]$ gives the probability of observing a transition from the $j$th state to the $i$th state, given that the $k$th action is selected to be observed, i.e.,

$$G_k[i,j] = \mathbb{P}\{y(t+1) = s_i \,|\, s(t) = s_j, \, v(t) = a_k\}, \tag{4}$$

where $i, j \in \mathbb{N}_n^+$, $k \in \mathbb{N}_m^+$, and similarly $G_{off}[i,j]$ defines the corresponding transition probabilities for the action in the OFF mode (e.g., $G_{off} = I$ when being OFF means no motion, as in the previous section). The model for ON/OFF decision-making has the following assumptions (see Fig. 2):

- The agent measures its own state at time instant $t$.
- The agent chooses a single action for the ON mode, say $a_k$, whose outcome will be observed, i.e., $v(t) = a_k$.
- The agent accepts or rejects the observed action.
- If the action is accepted, it is taken, i.e., $a(t) = a_k$, and the transition occurs according to $G_k$.
- If the action is rejected, the agent chooses the OFF mode, $\sigma(t) = \sigma_{off}$, and the transition occurs according to $G_{off}$.


Fig. 2. Implementation of the decision policy: from the current state $s_j$, the agent selects an action to observe (with probabilities $\alpha_k[j]$), observes the proposed transition, and either accepts it (ON mode, with probability $Q_k[i,j]$) or rejects it, in which case the transition occurs according to the OFF mode.

Remark 1. $G_{off}$ can have two interpretations: (i) the environment transition matrix when there is only one action, whose outcome is not observable; (ii) the effective transition matrix when there are multiple actions with a prescribed decision policy. For the latter case, $G_{off}$ is a Markov matrix defining the underlying Markov chain of an MDP with an existing policy running over the actions, which is discussed in more detail in the next subsection.

In this model, when an action is accepted, the resulting transition may not be certain, i.e., there exists a probability distribution over the states, which is updated with the current observation. In order to capture this uncertainty, we define the following variable:

$$R_{k,i}[l,j] = \mathbb{P}\{s(t+1) = s_i \,|\, y(t+1) = s_l, \, a(t) = a_k, \, v(t) = a_k, \, s(t) = s_j\}, \tag{5}$$


which defines a probability distribution over the states given that a transition is observed. Note that certain (deterministic) transitions can be captured by setting $R_{k,i}[l,j] = \delta_{il}$, which means $R_{k,i} = e_i \mathbf{1}^T$, $\forall k \in \mathbb{N}_m^+$.

We consider two sets of decision variables (to be designed offline):

$$\alpha_k[j] = \mathbb{P}\{v(t) = a_k \,|\, s(t) = s_j\}, \tag{6}$$

$$Q_k[i,j] = \mathbb{P}\{a(t) = a_k \,|\, s(t) = s_j, \, v(t) = a_k, \, y(t+1) = s_i\}, \tag{7}$$

with $k \in \mathbb{N}_m^+$, $i, j \in \mathbb{N}_n^+$. Namely, $\alpha_k[j]$ is the probability of choosing action $a_k \in A_{on}$ at state $s_j$, and $Q_k[i,j]$ is the probability of accepting an achievable transition $j \to i$ observed as an outcome of taking action $a_k$. Clearly, the $Q_k$ matrices must be non-negative, with $Q_k \in [0,1]^{n \times n}$, $k \in \mathbb{N}_m^+$. Also, the non-negative action variables $\alpha_k$ should satisfy the inequality $\sum_{k=1}^{m} \alpha_k[j] \leq 1$, $j \in \mathbb{N}_n^+$. It turns out in our model (as will become clear later) that we can combine these two variables via the change of variables

$$P_k := Q_k \,\mathrm{diag}(\alpha_k) = Q_k \odot (\mathbf{1} \alpha_k^T), \quad k \in \mathbb{N}_m^+. \tag{8}$$

The following property holds for the $P_k$ matrices:

$$P_k \leq \mathbf{1} \alpha_k^T, \quad k \in \mathbb{N}_m^+. \tag{9}$$

This inequality can be proven by contradiction: suppose there exist $i$ and $j$ such that $P_k[i,j] > \alpha_k[j]$; then $Q_k[i,j] = P_k[i,j]/\alpha_k[j] > 1$, which is a contradiction because $Q_k[i,j] \in [0,1]$.

We can now give, as Algorithm 1, the ON/OFF decision-making policy for the general case. For this algorithm, we define a variable $\phi_j \in \mathbb{R}^{m+2}$ for each state $j$, where $\phi_j[1] = 0$, $\phi_j[r] = \sum_{l=1}^{r-1} \alpha_l[j]$ for $r = 2, \ldots, m+1$, and $\phi_j[m+2] = 1$.

Algorithm 1: ON/OFF Decision-Making Policy – General case
Inputs: $\{\alpha_k, Q_k : k \in \mathbb{N}_m^+\}$ (designed offline), $S$, $t_{max}$
1: for $t \leftarrow 1$ to $t_{max}$ do
2:   Determine the current state $s(t) \in S$ (assume $s(t) = s_j$);
3:   Generate random numbers $\mu(t) \sim U(0,1)$ and $\eta(t) \sim U(0,1)$;
4:   if $\mu(t) \in [\phi_j[k], \phi_j[k+1])$ for some $k \in \mathbb{N}_m^+$ then
5:     $v(t) = a_k$;
6:     Observe the next achievable transition for $a_k$: $y(t+1)$ (suppose $y(t+1) = s_i$);
7:     if $\eta(t) \in [0, Q_k[i,j]]$ then
8:       The agent switches to the ON mode, $a(t) = a_k$ and $s(t+1) = s_i$;
9:     end
10:  else
11:    The agent switches to the OFF mode, and $s(t+1)$ transitions according to $G_{off}$;
12:  end
13: end

The following theorem presents the key result in converting the ON/OFF decision policy design problem into a Markov chain synthesis problem.
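The decision loop of Algorithm 1 can be sketched in code. The sketch below is an illustrative simulation, not the paper's implementation: it assumes certain transitions ($R_{k,i} = e_i \mathbf{1}^T$), and the names `on_off_step`, `alphas`, `Qs`, `Gs`, and `Goff` are hypothetical.

```python
import numpy as np

def on_off_step(j, alphas, Qs, Gs, Goff, rng):
    """One step of the ON/OFF decision policy from state j.

    alphas: (m, n) action-selection probabilities, column sums <= 1
    Qs:     (m, n, n) acceptance probabilities Q_k[i, j]
    Gs:     (m, n, n) column-stochastic environment matrices G_k
    Goff:   (n, n) column-stochastic OFF-mode matrix
    Transitions are assumed certain here (R_{k,i} = e_i 1^T).
    """
    m, n = alphas.shape
    mu, eta = rng.random(), rng.random()
    # Select which action's outcome to observe via the cumulative sums phi_j.
    phi = np.concatenate(([0.0], np.cumsum(alphas[:, j]), [1.0]))
    k = np.searchsorted(phi, mu, side="right") - 1
    if k < m:                                # an action was selected to observe
        i = rng.choice(n, p=Gs[k][:, j])     # observe proposed transition j -> i
        if eta <= Qs[k][i, j]:               # accept: ON mode, take the transition
            return i
    # rejected (or no action observed): OFF mode
    return rng.choice(n, p=Goff[:, j])
```

With `alphas` summing to less than one in some column, the leftover interval of `mu` falls through to the OFF branch, matching the "no action observed" case allowed by (15).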

Theorem 1. Consider a system of single or multiple mode-switching ON/OFF agents moving in a stochastic environment defined by a finite number of states $S$, with the transition probabilities given by $G_k$ as in (4) for the $k$th action $a_k \in A_{on}$ in the ON mode and by $G_{off}$ for the OFF mode. Suppose that each agent executes the ON/OFF decision-making Algorithm 1 with the matrices $Q_k$ as in (7) and the vectors $\alpha_k$ as in (6). Then the state p.d.f. $x(t)$, defined in (1), evolves based on the Markov chain (2) with the Markov matrix $M \in \mathcal{P}^{n \times n}$ given by

$$M = \sum_{k=1}^{m} \sum_{i=1}^{n} e_i \mathbf{1}^T (R_{k,i} \odot G_k \odot P_k) + G_{off} \odot \left( \mathbf{1} \left( \mathbf{1}^T - \mathbf{1}^T \sum_{k=1}^{m} \sum_{i=1}^{n} e_i \mathbf{1}^T (R_{k,i} \odot G_k \odot P_k) \right) \right), \tag{10}$$

where $P_k$, $k \in \mathbb{N}_m^+$, are given by (8) and satisfy

$$\sum_{k=1}^{m} \max_{i \in \mathbb{N}_n^+} P_k[i,j] \leq 1, \quad j \in \mathbb{N}_n^+. \tag{11}$$

The proof of the theorem is given in the appendix. The model in (10) captures an important case when the transitions corresponding to all actions are certain, i.e., $R_{k,i} = e_i \mathbf{1}^T$, $\forall k \in \mathbb{N}_m^+$ (see Corollary 1).

Corollary 1. When $R_{k,i} = e_i \mathbf{1}^T$, $\forall k \in \mathbb{N}_m^+$, the model given in (10) is equivalent to:

$$M = \sum_{k=1}^{m} G_k \odot P_k + G_{off} \odot \left( \mathbf{1} \left( \mathbf{1}^T - \mathbf{1}^T \sum_{k=1}^{m} G_k \odot P_k \right) \right). \tag{12}$$
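As a numerical sanity check, the construction in (12) can be assembled directly; the matrices below are hypothetical toy data chosen to satisfy (11), not quantities from the paper.

```python
import numpy as np

def markov_matrix(Gs, Ps, Goff):
    """Assemble M per (12): the ON part sum_k G_k ∘ P_k, plus Goff with each
    column scaled by the per-state rejection probability."""
    on_part = sum(G * P for G, P in zip(Gs, Ps))   # '*' is the Hadamard product
    reject = 1.0 - on_part.sum(axis=0)             # 1^T - 1^T sum_k G_k ∘ P_k
    return on_part + Goff * reject[None, :]        # Goff ∘ (1 (...)): scale columns

# toy 3-state example with two ON-mode actions (hypothetical G_k, P_k)
G1 = np.array([[.5, .3, .2], [.3, .4, .3], [.2, .3, .5]])
G2 = np.array([[.1, .6, .3], [.6, .2, .3], [.3, .2, .4]])
P1 = 0.3 * np.ones((3, 3))                         # constant acceptance, and
P2 = 0.4 * np.ones((3, 3))                         # (11) holds: 0.3 + 0.4 <= 1
M = markov_matrix([G1, G2], [P1, P2], np.eye(3))   # Goff = I: "no motion"
```

Column stochasticity of the resulting $M$ follows because the rejected probability mass in each column is reassigned through the (column-stochastic) $G_{off}$.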

Another useful case captured by (10) is when there is a single action in the ON mode and the OFF mode has deterministic outcomes, which was first presented in [4].

Corollary 2. If $R_{k,i} = e_i \mathbf{1}^T$, $\forall k \in \mathbb{N}_m^+$, $A_{on} = \{a_1\}$, and $G_{off} = I$, the model given in (10) is equivalent to:

$$M = G \odot K + \mathrm{diag}\big(\mathbf{1}^T - \mathbf{1}^T (G \odot K)\big), \tag{13}$$

with $G = G_1$ and $K = P_1$.

Theorem 1 shows that $M$ is a linear function of $\{P_k : k \in \mathbb{N}_m^+\}$. This linearity property will be used for a convex synthesis of the $P_k$ so that the matrix $M$ satisfies favorable properties such as convergence and safety, as we show later in the paper. The algorithm design parameters $\alpha_k$ and $Q_k$ can then be extracted from the $P_k$, as we show next. Thus the design of $\{P_k : k \in \mathbb{N}_m^+\}$ is an intermediary step in setting the parameters of Algorithm 1.

A. Extraction of $\alpha_k$ and $Q_k$ from $P_k$

Once the $P_k$ matrices are computed (via solving (28), given in the next section), $\alpha_k$ and $Q_k$ can be parametrized in multiple ways. The choice must preserve the following conditions on $\alpha_k$ and $Q_k$ (since their entries are probabilities of events):

$$0 \leq Q_k \leq \mathbf{1}\mathbf{1}^T, \quad 0 \leq \alpha_k \leq \mathbf{1}, \quad k \in \mathbb{N}_m^+, \quad \sum_{k=1}^{m} \alpha_k \leq \mathbf{1}. \tag{14}$$

Our default parameterization is:

$$\alpha_k[j] = \max_{i \in \mathbb{N}_n^+} P_k[i,j], \quad k \in \mathbb{N}_m^+, \; j \in \mathbb{N}_n^+. \tag{15}$$


Hence the last inequality in (14) is satisfied (due to (11)), and we can choose $Q_k$ as

$$Q_k = P_k \,\mathrm{diag}(\alpha_k)^{-1}, \quad k \in \mathbb{N}_m^+. \tag{16}$$

Here, $Q_k$ is obtained by dividing each entry in a column of $P_k$ by the maximum element in that column, hence $0 \leq Q_k \leq \mathbf{1}\mathbf{1}^T$. For the same reason, note that any choice of $\alpha_k$ greater than or equal to the choice given in (15) would also result in a feasible $Q_k$. Also observe that this particular choice of $\alpha_k$ allows "no action observed" cases, since it leads to $\sum_{k=1}^{m} \alpha_k \leq \mathbf{1}$ (the sum does not have to be one). We can normalize the $\alpha_k$'s such that an action is always observed, without changing the resulting $M$, as follows: form a matrix $\Gamma$ with the $\alpha_k$'s computed via (15) as its columns; then compute a new set of $\alpha_k$'s by using the following expression, and $Q_k$'s by using (16):

$$\Lambda := [\alpha_1 \; \ldots \; \alpha_m] = \mathrm{diag}(\Gamma \mathbf{1})^{-1} \Gamma. \tag{17}$$
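The extraction in (15)–(17) is straightforward to prototype; `extract_policy` below is a hypothetical helper operating on toy $P_k$ matrices, with the $\alpha_k$ stored as rows for convenience.

```python
import numpy as np

def extract_policy(Ps, eps=1e-12):
    """Extract alpha_k and Q_k from designed P_k via (15)-(17).
    Ps: list of (n, n) nonnegative matrices satisfying (11)."""
    Gamma = np.stack([P.max(axis=0) for P in Ps])   # (15): alpha_k[j] = max_i P_k[i, j]
    Lam = Gamma / Gamma.sum(axis=0, keepdims=True)  # (17): normalize so sum_k alpha_k = 1
    # (16): Q_k = P_k diag(alpha_k)^{-1}, guarding all-zero columns with eps
    Qs = [P / np.maximum(a, eps)[None, :] for P, a in zip(Ps, Lam)]
    return Lam, Qs

# hypothetical designed P_k for a 2-state, 2-action example; (11) holds
P1 = np.array([[.2, .1], [.4, .3]])
P2 = np.array([[.3, .2], [.1, .5]])
Lam, Qs = extract_policy([P1, P2])
```

Because (11) makes the unnormalized column sums at most one, the normalized $\alpha_k$ are at least as large as the column maxima, which is what keeps every $Q_k$ entry in $[0, 1]$.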

Note that $\sum_{k=1}^{m} \alpha_k = \Lambda \mathbf{1} = \mathrm{diag}(\Gamma \mathbf{1})^{-1} \Gamma \mathbf{1} = \mathbf{1}$. Since this second choice of $\alpha_k$ always produces values that are at least as large as the first choice in (15), we ensure that $0 \leq Q_k \leq \mathbf{1}\mathbf{1}^T$.

B. Connections to Markov Decision Processes

Any designed matrices $\{P_k, k \in \mathbb{N}_m^+\}$ for the mode-switching agent model given in this paper define a Markov chain for the system (2) whose transition matrix $M$ is determined by Theorem 1. Hence, the model can be considered a controlled Markov chain model, and inherently there is a connection to Markov Decision Processes (MDPs). Notwithstanding this, the main objective of the mode-switching agent is to shape the transition matrix $M$ to achieve a density distribution with some favorable properties, such as safety or a desired stationary distribution, as we discuss later in the paper. MDPs, on the other hand, select policies that optimize a reward-based function. While we do not explicitly define a reward function, an optimized decision-making policy can be implicitly embedded into the system through the OFF mode. Since MDP models in general do not observe the outcome of actions before a transition, the OFF mode can correspond to a Markov chain resulting from an MDP with a standard decision policy $\pi$. Hence, the transition matrix $G_{off} = M_{MDP}^{\pi}$ would correspond to an MDP with a reward-optimized policy $\pi$. This default MDP policy may not satisfy some of the desired steady-state distribution or safety constraints, for which we utilize the additional observations in the ON mode, via the synthesized decision policies, to achieve these design specifications.

Remark 2. Even though the safety constraints in (22) are linear in the probability density distribution vector $x(t)$, these constraints cannot be captured by the classical framework for constraints used in standard MDPs [25], as shown in our previous work [13].
To elaborate further on the connection with MDPs, we give in Figure 3 a simple motion planning example showing that the mode-switching agent model can implicitly incorporate a reward-optimizing MDP policy. The objective of this simplified example is to send an agent from a source state (the green bin) to a destination state (the red bin) using

Fig. 3. An MDP optimal policy for deterministic motion planning where the objective is to go from the green bin to the red bin. Black bins are obstacles.

the least number of transitions (i.e., moving along a shortest path). The arrows in the figure show an optimal MDP policy for going from any state to the destination state along a shortest path. This optimal policy can be obtained using the classical backward induction (dynamic programming) algorithm. Furthermore, in this example, the environment is non-stochastic, and thus the transitions due to the actions of the optimal policy are assumed deterministic. Note that if many agents start from the same green bin, they will all follow the same path. Thus, if the bins are subject to varying capacity constraints (i.e., each bin has a maximum capacity for the number of agents that can be present in that bin at any time instant), then the optimal MDP policy violates the constraints. The mode-switching agent model given in this paper can be used to handle such a situation, as we argue next. Let the OFF mode be the transition matrix corresponding to the optimal MDP policy given in Figure 3. Since the environmental transitions are deterministic, so are the observations. Thus, $\{\alpha_k, k \in \mathbb{N}_m^+\}$ boils down to the probability of choosing an action deviating from the optimal policy to satisfy the capacity constraints. We can impose such capacity constraints directly on the resulting transition matrix of the system, after substituting the relevant quantities of the problem into equation (2), to obtain a "randomized" policy that deviates from the optimal policy in order to satisfy the imposed constraints. These constraints are a special type of the safety constraints discussed in more detail later in the paper. It is worth noting that, in this example, both the ON and OFF modes have the same set of actions, where the OFF mode executes a standard MDP policy (e.g., choosing actions along the shortest path) while the ON mode selects an action to observe in order to deviate from the OFF-mode policy.

Remark 3. It is reasonable to consider a cost for an observation that causes a deviation from an optimal policy. However, assigning costs to observations and comparing the effect of deviation on the overall performance of the resulting MDP is an ongoing research direction and is not pursued further in this paper.
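For a deterministic grid, the backward induction mentioned above reduces to a breadth-first search from the destination. The sketch below uses an assumed $3 \times 3$ layout with one obstacle, not the exact grid of Figure 3, and the names `shortest_path_policy`, `free_cells`, and `goal` are hypothetical.

```python
from collections import deque

def shortest_path_policy(free_cells, goal):
    """Backward induction (BFS from the goal) on a deterministic grid MDP.

    free_cells: set of (row, col) passable cells; goal: target cell.
    Returns (policy, dist): policy maps each cell to the next cell along
    a shortest path to the goal; dist gives the path length.
    """
    dist, policy = {goal: 0}, {}
    q = deque([goal])
    while q:
        c = q.popleft()
        for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (c[0] + d[0], c[1] + d[1])
            if nb in free_cells and nb not in dist:
                dist[nb] = dist[c] + 1
                policy[nb] = c        # optimal action: step toward the goal
                q.append(nb)
    return policy, dist

# 3x3 grid with one obstacle at (1, 1)
cells = {(r, c) for r in range(3) for c in range(3)} - {(1, 1)}
policy, dist = shortest_path_policy(cells, goal=(2, 2))
```

As the text notes, every agent following this policy from the same source traces the same path, which is exactly why capacity constraints require randomizing away from it.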

IV. CONVEX SYNTHESIS OF SAFE MARKOV CHAIN FOR ON/OFF AGENTS

We can describe the decision policy design problem for ON/OFF agents as follows: design $Q_k, \alpha_k$, $k \in \mathbb{N}_m^+$, such that the density, as defined in (1), satisfies the following


constraints for $t \in \mathbb{N}$:

$$\text{Transition:} \quad \mathbb{P}\{s(t+1) = s_i \,|\, s(t) = s_j\} = 0 \;\; \text{if } A_a[j,i] = 0, \tag{18}$$

$$\text{Safety:} \quad L x(t) \leq q, \quad \forall \, x(t) \leq p, \tag{19}$$

$$\text{Convergence:} \quad \lim_{t \to \infty} x(t) = v, \quad \forall \, x(0) \in \mathcal{P}^n, \tag{20}$$

where $v$ is a desired discrete probability distribution, $A_a$ is the adjacency matrix defining the allowable transitions between states over two consecutive time steps, and $L$, $q$, and $p$ are given matrices and vectors specifying the safety constraints. In this section, we provide brief discussions of these constraints and express them as equivalent convex constraints on the Markov matrix $M$. Since the matrix $M$ depends linearly on the ON/OFF decision policy matrices $P_k$, as in (12), these will be convex constraints on the $P_k$'s, which are our design variables. Once the $P_k$'s are computed, we can choose the $\alpha_k$'s and $Q_k$'s by using (15) or (17), together with (16).

a) Transition Constraints: It is useful to impose constraints on the physically realizable state transitions by specifying some entries of the matrix $M$ as zeros. In the case of ON/OFF density control, these constraints are automatically ensured by the stochastic transition matrices: if a state transition is simply not possible naturally, then the corresponding entry in the matrix $G_k$, $\forall k$, is zero, which implies that it is also zero in the matrix $M$, i.e., $M[i,j] = 0$ if $G_k[i,j] = 0$, $\forall k$. If additional transitions also need to be eliminated for mission-specific reasons, we can impose additional constraints with the following equation:

$$(\mathbf{1}\mathbf{1}^T - A_a^T) \odot M = 0, \tag{21}$$

where $A_a$ is the adjacency matrix, i.e., $A_a[i,j] = 1$ if the transition $i \to j$ is allowable and $A_a[i,j] = 0$ otherwise. Note that this constraint is linear, hence convex, in $M$.
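The masking in (21) is easy to check numerically; the helper and the adjacency matrix below are hypothetical toy data, assuming a 3-state chain where only moves between adjacent states are allowed.

```python
import numpy as np

def satisfies_transition_constraint(M, Aa, tol=1e-12):
    """Check (21): (1 1^T - Aa^T) ∘ M = 0, i.e., M[i, j] = 0 whenever
    the transition j -> i is not allowed (Aa[j, i] = 0)."""
    mask = np.ones_like(Aa) - Aa.T        # 1 exactly on the forbidden entries
    return np.abs(mask * M).max() <= tol  # '*' is the Hadamard product

Aa = np.array([[1., 1., 0.],
               [1., 1., 1.],
               [0., 1., 1.]])             # allow only self-loops and neighbors
M_ok  = np.array([[.5, .3, 0.], [.5, .4, .6], [0., .3, .4]])
M_bad = np.array([[.5, .3, .1], [.4, .4, .5], [.1, .3, .4]])
```

In a synthesis code this same mask would simply be added as the linear equality constraint on the decision variable $M$.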

b) Safety Constraints: The safety constraints are the hardest to capture. Since the state is multiplied by the Markov matrix at each time step, bounding the state for all time requires infinitely many constraints. In order to handle these constraints, we use the result presented in [26], [3], which uses the duality theory of convex optimization to express them as finitely many linear constraints. We consider the general form of linear density safety constraints given in (19), which covers several types of safety constraints through different selections of $L$, $p$, and $q$. Two examples of these constraints are: (i) density upper bound constraints; (ii) density rate constraints. The density upper bound constraint ensures that the density of each state stays below a prescribed value, that is,

$$x(t) \leq d, \quad t \in \mathbb{N}^+, \tag{22}$$

where $0 \leq d \leq \mathbf{1}$ defines the density upper bound for each state, and it is assumed that $x(0) \leq d$. Note that this form can be obtained by letting $L = M$, $q = d$, and $p = d$ in the general form given in (19). The density rate constraint is used to limit the rate of change of the density of each state of the Markov chain:

$$-f \leq x(t+1) - x(t) \leq f, \quad t \in \mathbb{N}, \tag{23}$$

where $f \geq 0$ bounds the flow rate. Note that this is equivalent to the case where

$$L = \begin{bmatrix} M - I \\ I - M \end{bmatrix}, \quad q = \begin{bmatrix} f \\ f \end{bmatrix},$$

and $p = d$ in (19). For more examples of the linear safety constraints captured by the general form, and for the proof of Lemma 1, see [3]. The safety constraints given in (19) are ensured by the following lemma, which gives necessary and sufficient conditions for safety as linear inequalities on $M$.

Lemma 1. [26], [3] Consider the Markov chain given by (2). Then,

$$L x(t) \leq q, \quad \forall \, x(t) \leq p, \tag{24}$$

if and only if there exist $S \in \mathbb{R}^{n \times n}$ and $y \in \mathbb{R}^n$ such that

$$S \geq 0, \quad L + S + y\mathbf{1}^T \geq 0, \quad y + q \geq \left( L + S + y\mathbf{1}^T \right) p. \tag{25}$$

c) Formulation of Convergence/Coverage Constraints: As part of the mission objectives for multi-agent systems, the agent distribution $x(t)$ is required to converge to a desired distribution $v \in \mathcal{P}^n$, as given in (20). Since $M$ is a Markov matrix, hence column stochastic, a necessary condition for the desired convergence is that the desired distribution $v \in \mathcal{P}^n$ is an eigenvector of $M$:

$$M v = v. \tag{26}$$

Notice that $Mv = v$ is a simple linear equality, and hence a convex constraint, but it is not sufficient for convergence. In order to obtain a convex equivalent of the ergodicity constraint ($\lim_{t \to \infty} x(t) = v$, $\forall \, x(0) \in \mathcal{P}^n$), we use the result presented in [27], which proposes a necessary and sufficient condition for ergodicity for reversible Markov chains with $v > 0$:

$$-\lambda I \preceq H^{-1} M H - h h^T \preceq \lambda I, \tag{27}$$

where $h = (v_1^{1/2}, \ldots, v_n^{1/2})$ and $H = \mathrm{diag}(h)$. Note that $\lambda$, the convergence rate, can be minimized for the fastest convergence within a convex problem formulation.

d) Formulation of the synthesis as an optimization problem: So far, we have obtained linear equivalent conditions on the Markov matrix for the transition, safety, and ON/OFF constraints. Hence, for the convex optimization problem, we define the set of feasible Markov matrices that satisfy the safety, transition, convergence, and ON/OFF constraints:

$$\mathcal{M}_F = \{ M \in \mathcal{P}^{n \times n} : M \text{ satisfies } (10), (11), (21), (25), (26), (27) \}.$$

Note that the inequality in (11) can be written using linear inequalities, hence it is a convex constraint. To see this, define

$$Z_j := [P_1(:,j) \;\; P_2(:,j) \;\; \ldots \;\; P_m(:,j)], \quad j \in \mathbb{N}_n^+,$$

where $P_k(:,j)$ is the $j$th column of $P_k$. Then we can replace the inequality (11) by the following inequalities:

$$Z_j \leq \mathbf{1} \beta_j^T, \quad \mathbf{1}^T \beta_j \leq 1, \quad j \in \mathbb{N}_n^+,$$

where the $\beta_j$'s are $m \times 1$ slack variables. We can formulate the Markov matrix synthesis as a minimization problem on the variables $M$ and $P_k$ with

7

the desired constraints. One example cost function is 1T (1 − diag(M )) which aims to minimize overall action, i.e., M ' I: Aa , L, q, p, v, λ, Gk , Rk,i

Given:

1T (1 − diag(M )) such that

min

M,Pk

(28)

M ∈ MF . Remark 4. The above problem is a Linear Matrix Inequality (LMI) optimization problem which is generally solved via Interior-point methods which have polynomial-time complexity [28]. V. N UMERICAL E XAMPLE This section presents an illustrative numerical example for the density control problem for autonomous agents with ON/OFF control modes. We consider a swarm of mobile agents that are distributed over the configuration space (see Figure 4) that is partitioned to 8 subregions, which are referred to as bins. In this configuration, probabilistic density distribution is given as x[i](t) := P{r(t) ∈ Ri } where r(t) is the position vector of an agent at time step t. For this setting, safety upper bound constraint is used to limit the expected number of agents in each bin. Two sets of simulations are performed by using the ON/OFF policy synthesized by solving the LMI problem in (28), both with the density upper bound constraints and m = 5 actions for the ON mode: (i) Total Ns1 = 3000 simulations with same safe initial condition, i.e., same x(0) with different realizations; (ii) Total Ns2 = 3000 simulations with randomly generated 3000 safe initial conditions. For all cases, OFF case corresponds to “no action”, i.e., Goff = I and the actions in the ON mode are fully observable. For the actions in the ON mode, column stochastic Gk matrices are selected such that they have different steadystate final distributions and do not satisfy safety and transition constraints. Other parameters for the simulations are set as follows: Na = 3000, λ = 0.975,  1 1  1  0 Aa= 0 0  0 0

1 1 1 1 1 0 0 0

1 1 1 0 0 0 0 0

0 1 0 1 1 0 1 0

0 1 0 1 1 0 1 0

0 0 0 0 0 1 1 1

0 0 0 1 1 1 1 1

       0 0.5 0.005 1  0   0.02  0.15 0        0.5 0.005  1  0        0  0   0.04  0.12  ,x(0) = ,v =  ,d =  0  0   0.05  0.12  0   0.34   1  1         0   0.2   0.4  1 1 0 0.34 1

where Aa is the adjacency matrix of the bin connections, Na is the total number of agents, x(0) is the initial distribution of agents, v is the desired final distribution as in equation (20), d is the safety upper bound (as in equation (24) with L = I and p = q = d), and λ is the convergence rate of the system. Since the bins at the corners behave like accumulation points in the initial and final distributions, the safety upper bound constraints are not imposed for these bins, i.e., the corresponding entries of the d vector are set to 1. With the given parameters, the optimization problem in (28) is solved using YALMIP and SDPT3 [29], [30]. In many applications, the environmental transition matrices Gk satisfy the transition constraints, i.e., Gk[i, j] = 0 when Aa[j, i] = 0. However, this example considers some Gk matrices that do not have this property; that is, we do not allow some motions even when they can be induced by the environment. Such scenarios can arise in the balloon motion control example given in the introduction, where the environmental transition matrices may not satisfy the desired constraints. Though some altitudes may induce high velocities, we may choose not to ride such fast winds, for example, to avoid damaging the structural integrity of the balloon (there may be a maximum speed limit for structural safety purposes). Another important point is that the desired distribution may not be achievable with the given natural transition matrices: if a desired transition is not physically possible and the desired behavior requires this transition, the problem will be infeasible.

Simulation results are presented in Figure 5. The mean density x̄ and the 3σ confidence bounds are shown for the case with density upper bound d. The average density for the case without constraints is obtained by evolving the density according to equation (2). The density goes above the desired upper bound for bins 2, 4 and 5 when the constraint is not imposed. By using the ON/OFF control policy, we are able to ensure that the density does not exceed the prescribed upper limit, at the reasonable cost of a reduced convergence rate. A similar example where the ON mode has a single action is considered in our earlier work [4], where we modify the final distribution and ensure that the density does not exceed the safety upper bound by using a binary ON/OFF policy. For the second set of simulations, the point-wise maximum values of the density at each time step over all 3000 simulations are plotted. This demonstrates the claim in Lemma 1: for all safe initial conditions, i.e., x(0) ≤ d, the density is guaranteed to satisfy the safety constraints for all time, i.e., x(t) ≤ d, t ∈ N+.

Fig. 4. Snapshots of simulation: configuration space with bin numbers
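The Monte Carlo procedure described above can be sketched as follows; the transition matrix, horizon, and population sizes below are hypothetical stand-ins (a small 3-state chain, not the synthesized 8-bin policy), used only to illustrate how the empirical mean density and 3σ bounds are computed from agent realizations.

```python
import numpy as np

# Hypothetical 3-state column-stochastic chain standing in for the
# synthesized Markov matrix M; M[i, j] = P(state j -> state i).
M = np.array([[0.90, 0.10, 0.00],
              [0.10, 0.80, 0.20],
              [0.00, 0.10, 0.80]])
x0 = np.array([0.6, 0.4, 0.0])       # initial density (assumed safe, x0 <= d)
n, Na, T, Ns = 3, 1000, 50, 200      # states, agents, horizon, simulation runs

rng = np.random.default_rng(0)
cum = np.cumsum(M, axis=0)           # per-column CDFs for sampling transitions
dens = np.zeros((Ns, T + 1, n))      # empirical bin densities per run and step
for s in range(Ns):
    states = rng.choice(n, size=Na, p=x0)
    dens[s, 0] = np.bincount(states, minlength=n) / Na
    for t in range(T):
        # each agent jumps according to the column of M for its current state
        u = rng.random(Na)
        states = np.minimum((u[None, :] > cum[:, states]).sum(axis=0), n - 1)
        dens[s, t + 1] = np.bincount(states, minlength=n) / Na

mean = dens.mean(axis=0)             # mean density per time step and state
upper = mean + 3 * dens.std(axis=0)  # 3-sigma upper confidence bound
```

Plotting `mean` and `upper` per bin over time reproduces the style of Figure 5, and taking the point-wise maximum of `dens` over the run axis corresponds to the second set of simulations.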
The snapshots of the overall distribution taken at the beginning and at the end of a simulation from the first set are shown in Figure 4.

[Figure 5 consists of eight panels, one per bin, each plotting the density of that bin over 200 time steps: the mean density x̄ with constraints, the x̄ ± 3σ bounds, the point-wise maximum density, the mean density without constraints, the safety upper bound, and the desired density.]

Fig. 5. Time history of the density of each bin. With the density upper bound constraints the density is guaranteed to stay in the tube between the red dashed lines with 99.7% confidence.

VI. CONCLUSION

In this paper, we develop a probabilistic density control policy for autonomous mobile agents with two modes: ON and OFF. When the agent is in the ON mode, it can observe the one-step outcome of a single action chosen from the actions

for the ON mode and can decide whether or not to take this action. If it does not take the action, it switches to the OFF mode. The density distribution of agents in the system evolves according to a Markov chain that, as shown in this paper, is a linear function of the stochastic environment and the decision policy. We formulate a convex optimization problem, which can be solved reliably via interior-point methods, to synthesize the decision policy that ensures the desired safety, transition and convergence properties for the underlying Markov chain. The given constraints on the density are equivalently expressed as constraints on the Markov chain. The resulting density control model is illustrated with a numerical example on autonomous mobile agents. As a future direction, it would be interesting to combine the observed transition model used in this paper with a Markov decision process, to benefit from the complex and precise optimization goals achieved by MDPs.

Acknowledgements: This research was supported partially by the Defense Advanced Research Projects Agency (DARPA) under Grant No. D14AP00084, the National Science Foundation (NSF) under Grant No. CNS-1624328, and the Office of Naval Research (ONR) under Grants No. N00014-16-1-2318 and N00014-15-IP-00052.

REFERENCES

[1] B. Açıkmeşe and D. S. Bayard, "A Markov chain approach to probabilistic swarm guidance," American Control Conference, Montreal, Canada, pp. 6300–6307, 2012.
[2] B. Açıkmeşe and D. S. Bayard, "Markov chain approach to probabilistic guidance for swarms of autonomous agents," Asian Journal of Control, vol. 17, no. 4, pp. 1105–1124, 2015.
[3] N. Demir, U. Eren, and B. Açıkmeşe, "Decentralized probabilistic density control of autonomous swarms with safety constraints," Autonomous Robots, vol. 39, no. 4, pp. 537–554, 2015.
[4] N. Demir and B. Açıkmeşe, "Probabilistic density control for swarm of decentralized on-off agents with safety constraints," American Control Conference (ACC), 2015, pp. 5238–5244, 2015.
[5] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[6] H. Kawano, "Three dimensional obstacle avoidance of autonomous blimp flying in unknown disturbance," in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pp. 123–130.
[7] W. H. Al-Sabban, L. F. Gonzalez, and R. N. Smith, "Wind-energy based path planning for unmanned aerial vehicles using Markov decision processes," in 2013 IEEE International Conference on Robotics and Automation, 2013, pp. 784–789.
[8] W. H. Al-Sabban, L. F. Gonzalez, and R. N. Smith, "Extending persistent monitoring by combining ocean models and Markov decision processes," in 2012 Oceans, 2012, pp. 1–10.
[9] M. T. Wolf, L. Blackmore, Y. Kuwata, N. Fathpour, A. Elfes, and C. Newman, "Probabilistic motion planning of balloons in strong, uncertain wind fields," in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 1123–1129.
[10] Y. Kuwata, L. Blackmore, M. Wolf, N. Fathpour, C. Newman, and A. Elfes, "Decomposition algorithm for global reachability analysis on a time-varying graph with an application to planetary exploration," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2009, pp. 3955–3960.
[11] O. Sigaud and O. Buffet, Markov Decision Processes in Artificial Intelligence. Wiley-IEEE Press, 2010.
[12] R. Zhang, Y. Yu, M. El Chamie, B. Açıkmeşe, and D. Ballard, "Decision-making policies for heterogeneous autonomous multi-agent systems with safety constraints," in 25th International Joint Conference on Artificial Intelligence (IJCAI-16), July 2016.
[13] M. El Chamie, Y. Yu, and B. Açıkmeşe, "Convex synthesis of randomized policies for controlled Markov processes with hard safety constraints," in 2016 American Control Conference (ACC), July 2016.
[14] M. El Chamie and B. Açıkmeşe, "Convex synthesis of optimal policies for Markov decision processes with sequentially-observed transitions," in 2016 American Control Conference (ACC), July 2016.
[15] C. Amato, G. Chowdhary, A. Geramifard, N. K. Üre, and M. J. Kochenderfer, "Decentralized control of partially observable Markov decision processes," in 52nd IEEE Conference on Decision and Control. IEEE, 2013, pp. 2398–2405.
[16] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, "The complexity of decentralized control of Markov decision processes," Mathematics of Operations Research, vol. 27, no. 4, pp. 819–840, 2002.
[17] S. Berman, A. Halasz, M. A. Hsieh, and V. Kumar, "Optimized stochastic policies for task allocation in swarms of robots," IEEE Trans. on Robotics, vol. 25, no. 4, pp. 927–937, 2009.
[18] I. Chattopadhyay and A. Ray, "Supervised self-organization of homogeneous swarms using ergodic projections of Markov chains," IEEE Trans. on Systems, Man, and Cybernetics, vol. 39, no. 6, pp. 1505–1515, 2009.
[19] M. Lawford and W. Wonham, "Supervisory control of probabilistic discrete event systems," Proc. 36th Midwest Symp. Circuits Syst., pp. 327–331, 1993.
[20] A. R. Mesquita, J. P. Hespanha, and K. Åström, "Optimotaxis: A stochastic multi-agent on site optimization procedure," Hybrid Systems: Computation and Control, Lecture Notes in Computer Science, no. 4981, pp. 358–371, 2008.
[21] A. Jadbabaie, G. J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Trans. on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[22] J. Cortés, S. Martínez, T. Karatas, and F. Bullo, "Coverage control for mobile sensing networks," IEEE Transactions on Robotics and Automation, vol. 20, no. 2, pp. 243–255, 2004.
[23] V. Krishnan and S. Martínez, "Distributed control for spatial self-organization of multi-agent swarms," ArXiv e-prints, May 2017.
[24] A. Elfes, K. Reh, P. Beauchamp, N. Fathpour, L. Blackmore, C. Newman, Y. Kuwata, M. Wolf, and C. Assad, "Implications of wind-assisted

aerial navigation for Titan mission planning and science exploration," in Aerospace Conference, 2010 IEEE. IEEE, 2010, pp. 1–7.
[25] E. Altman, Constrained Markov Decision Processes, ser. Stochastic Modeling Series. Taylor & Francis, 1999.
[26] B. Açıkmeşe, N. Demir, and M. Harris, "Convex necessary and sufficient conditions for density safety constraints in Markov chain synthesis," IEEE Trans. on Automatic Control, vol. 60, no. 10, pp. 2813–2818, 2015.
[27] S. Boyd, P. Diaconis, P. Parrilo, and L. Xiao, "Fastest mixing Markov chain on graphs with symmetries," SIAM Journal on Optimization, vol. 20, no. 2, pp. 792–819, 2009.
[28] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan, Linear Matrix Inequalities in System and Control Theory. SIAM, 1994.
[29] J. Löfberg, "YALMIP: A toolbox for modeling and optimization in MATLAB," in Proceedings of the CACSD Conference, 2004.
[30] R. H. Tütüncü, K. C. Toh, and M. J. Todd, "Solving semidefinite-quadratic-linear programs using SDPT3," Mathematical Programming, vol. 95, no. 2, pp. 189–217, 2003.
[31] K. L. Chung, A Course in Probability Theory. Academic Press, 2001.

APPENDIX

Proof of Theorem 1. The probability of making a transition from the j-th state to the i-th state can be written as the following sum, since being ON and being OFF are mutually exclusive events:

M[i, j] = P{s(t+1) = si | s(t) = sj}
        = P{σ(t) = σon, s(t+1) = si | s(t) = sj}      (:= T1[i, j])
        + P{σ(t) = σoff, s(t+1) = si | s(t) = sj}.    (:= T2[i, j])      (29)

In Algorithm 1, since executing the ON mode implies that an action from Aon must be taken, the first term can be written as a summation over all actions of the ON mode:

T1[i, j] = Σ_{k=1}^{m} P{s(t+1) = si, a(t) = ak | s(t) = sj}
         = Σ_{k=1}^{m} P{s(t+1) = si, a(t) = ak, v(t) = ak | s(t) = sj}.

The second equality follows from the fact that v(t) always precedes a(t), i.e., P(v(t) = ak | a(t) = ak) = 1. Applying Bayes' rule [31] to the term inside the sum, we obtain

T1[i, j] = Σ_{k=1}^{m} P{s(t+1) = si, a(t) = ak | v(t) = ak, s(t) = sj} P{v(t) = ak | s(t) = sj},

where the last factor is αk[j]. Since observing transitions to distinct states are mutually exclusive events,

T1[i, j] = Σ_{k=1}^{m} [ Σ_{l=1}^{n} P{s(t+1) = si, a(t) = ak, y(t+1) = sl | v(t) = ak, s(t) = sj} ] αk[j].

Applying Bayes' rule again, we obtain

T1[i, j] = Σ_{k=1}^{m} Σ_{l=1}^{n} P{s(t+1) = si, a(t) = ak | y(t+1) = sl, v(t) = ak, s(t) = sj} × P{y(t+1) = sl | v(t) = ak, s(t) = sj} αk[j]
         = Σ_{k=1}^{m} Σ_{l=1}^{n} P{s(t+1) = si | y(t+1) = sl, a(t) = ak, v(t) = ak, s(t) = sj} × P{a(t) = ak | y(t+1) = sl, v(t) = ak, s(t) = sj} × Gk[l, j] αk[j],

where P{y(t+1) = sl | v(t) = ak, s(t) = sj} = Gk[l, j] and P{a(t) = ak | y(t+1) = sl, v(t) = ak, s(t) = sj} = Qk[l, j]. Then,

T1[i, j] = Σ_{k=1}^{m} [ Σ_{l=1}^{n} Rk,i[l, j] Qk[l, j] Gk[l, j] ] αk[j]
         = Σ_{k=1}^{m} [ 1^T (Rk,i ⊙ Qk ⊙ Gk) ej ] αk[j],

where ⊙ denotes the element-wise product. In row-vector form,

ei^T T1 = Σ_{k=1}^{m} ( 1^T (Rk,i ⊙ Qk ⊙ Gk) ) ⊙ αk^T
        = Σ_{k=1}^{m} 1^T ( Rk,i ⊙ (Qk ⊙ 1αk^T) ⊙ Gk )
        = Σ_{k=1}^{m} 1^T (Rk,i ⊙ Pk ⊙ Gk).

Then T1 can be expressed in matrix form as follows:

T1 = Σ_{i=1}^{n} Σ_{k=1}^{m} ei 1^T (Rk,i ⊙ Pk ⊙ Gk).      (30)

Now, consider the second term in (29):

T2[i, j] = P{σ(t) = σoff, s(t+1) = si | s(t) = sj}
         = P{s(t+1) = si | s(t) = sj, σ(t) = σoff} × P{σ(t) = σoff | s(t) = sj}
         = Goff[i, j] ( 1 − P{σ(t) = σon | s(t) = sj} ).

Here, given the current state, the probability of being ON is the sum of the probabilities of taking each action of the ON mode, i.e.,

P{σ(t) = σon | s(t) = sj} = Σ_{l=1}^{n} P{σ(t) = σon, s(t+1) = sl | s(t) = sj} = Σ_{l=1}^{n} T1[l, j].

Hence, T2 can be written in matrix form as follows:

T2 = Goff ⊙ ( 1 (1^T − 1^T T1) )
   = Goff ⊙ ( 1 (1^T − 1^T Σ_{i=1}^{n} Σ_{k=1}^{m} ei 1^T (Rk,i ⊙ Pk ⊙ Gk)) ).      (31)

Now, combining the expressions obtained for T1 and T2 as M = T1 + T2 yields (10).

Finally, we show that M ∈ Pn×n. For nonnegativity, we consider the two terms of M separately. As both terms correspond to probabilistic quantities, they must both be nonnegative. The nonnegativity of the first term is clear since Gk ≥ 0, Pk ≥ 0 and Rk,i ≥ 0 for k ∈ N+_m, i ∈ N+_n. For the nonnegativity of the second term, we need to show that

1^T Σ_{i=1}^{n} Σ_{k=1}^{m} ei 1^T (Rk,i ⊙ Pk ⊙ Gk) ≤ 1^T,      (32)

which is equivalent to

Σ_{i=1}^{n} Σ_{k=1}^{m} 1^T (Rk,i ⊙ Pk ⊙ Gk) ≤ 1^T.      (33)

Here, Σ_{i=1}^{n} Σ_{k=1}^{m} 1^T (Rk,i ⊙ Pk ⊙ Gk) can be written in index notation as follows: for j ∈ N+_n,

Σ_{i=1}^{n} Σ_{k=1}^{m} 1^T (Rk,i ⊙ Pk ⊙ Gk) ej = Σ_{l=1}^{n} Σ_{i=1}^{n} Σ_{k=1}^{m} Rk,i[l, j] Pk[l, j] Gk[l, j].

We also have Σ_{k=1}^{m} Pk[l, j] ≤ 1 for any l and j, since

Σ_{k=1}^{m} Pk ≤ Σ_{k=1}^{m} 1αk^T ≤ 11^T,

and Σ_{l=1}^{n} Gk[l, j] = 1 for any j and k since Gk is column stochastic, and Σ_{i=1}^{n} Rk,i[l, j] = 1. Hence,

Σ_{l=1}^{n} Σ_{i=1}^{n} Σ_{k=1}^{m} Rk,i[l, j] Pk[l, j] Gk[l, j]
= Σ_{l=1}^{n} Σ_{k=1}^{m} ( Σ_{i=1}^{n} Rk,i[l, j] ) Pk[l, j] Gk[l, j]
= Σ_{l=1}^{n} Σ_{k=1}^{m} Pk[l, j] Gk[l, j]
≤ Σ_{k=1}^{m} max_{l∈N+_n} Pk[l, j] Σ_{l=1}^{n} Gk[l, j]
= Σ_{k=1}^{m} max_{l∈N+_n} Pk[l, j] ≤ 1.

The last inequality follows from (11), which then implies that

Σ_{i=1}^{n} Σ_{k=1}^{m} 1^T (Rk,i ⊙ Pk ⊙ Gk) ej ≤ 1,  j ∈ N+_n,

and hence (32) holds. This concludes that M ≥ 0.

Next, let H := Σ_{i=1}^{n} Σ_{k=1}^{m} ei 1^T (Rk,i ⊙ Pk ⊙ Gk) and φ := H^T 1. Then,

1^T M = 1^T H + 1^T ( Goff ⊙ (1 (1^T − 1^T H)) )
      = φ^T + (1^T − φ^T) ⊙ (1^T Goff)
      = φ^T + 1^T − φ^T
      = 1^T,

where the second equality holds because the column sums of Goff ⊙ (1ψ^T) are ψ[j] Σ_i Goff[i, j], and the third because Goff is column stochastic, i.e., 1^T Goff = 1^T. Hence M ∈ Pn×n.
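The construction M = T1 + T2 can be checked numerically; the sketch below uses a simplifying assumption not made in the proof above, namely that an accepted observed transition is taken exactly (Rk,i[l, j] = 1 if i = l), which reduces (30) to T1 = Σk Pk ⊙ Gk. All matrices are randomly generated placeholders, not the paper's example data.

```python
import numpy as np

# Sanity check of M = T1 + T2 from Theorem 1 under the simplifying
# assumption R_{k,i}[l, j] = 1{i = l}, so that T1 = sum_k Pk . Gk
# with Pk = Qk . (1 alpha_k^T) (element-wise products).
rng = np.random.default_rng(1)
n, m = 5, 3                                   # states, ON-mode actions

def col_stochastic(a):
    # normalize each column to sum to 1
    return a / a.sum(axis=0, keepdims=True)

G = [col_stochastic(rng.random((n, n))) for _ in range(m)]  # environment
Goff = col_stochastic(rng.random((n, n)))                   # OFF-mode chain
Q = [rng.random((n, n)) for _ in range(m)]                  # accept probabilities
alpha = rng.random((m, n))
alpha *= 0.9 / alpha.sum(axis=0, keepdims=True)  # sum_k alpha_k[j] <= 1

P = [Q[k] * alpha[k] for k in range(m)]       # Pk[l, j] = Qk[l, j] alpha_k[j]
T1 = sum(P[k] * G[k] for k in range(m))       # ON-mode term, (30) simplified
T2 = Goff * (1.0 - T1.sum(axis=0))            # Goff . (1 (1^T - 1^T T1)), (31)
M = T1 + T2

assert (M >= 0).all()                         # nonnegativity
assert np.allclose(M.sum(axis=0), 1.0)        # column stochasticity
```

The column sums equal 1 for every random draw, mirroring the last step of the proof; nonnegativity here relies on Σk αk[j] ≤ 1, which enforces (11) since Pk[l, j] ≤ αk[j].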
