
Performance Guarantee of an Approximate Dynamic Programming Policy for Robotic Surveillance

Myoungkuk Park, Krishnamoorthy Kalyanam, Member, IEEE, Swaroop Darbha, Senior Member, IEEE, Pramod P. Khargonekar, Fellow, IEEE, Meir Pachter, Fellow, IEEE, and Phillip R. Chandler

Abstract—This paper is focused on the development and analysis of sub-optimal decision algorithms for a collection of robots that assist a remotely located operator in perimeter surveillance. The operator is tasked with the classification of incursions across the perimeter. Whenever there is an incursion into the perimeter, an Unattended Ground Sensor (UGS) in the vicinity signals an alert. A robot services the alert by visiting the alert location, collecting information, e.g., photo and video imagery, and transmitting it to the operator. The accuracy of the operator’s classification depends on the volume and freshness of the information gathered and provided by the robots at locations where incursions occur. There are two competing objectives for a robot: it needs to spend adequate time at an alert location to collect evidence for aiding the operator in accurate classification, but it also needs to service other alerts as soon as possible, so that the evidence collected is relevant. The decision problem is to determine the optimal amount of time a robot must spend servicing an alert. The incursions are stochastic and their statistics are assumed to be known. This problem can be posed as a Markov Decision Problem. However, even for two robots and five UGS locations, the number of states is of the order of billions, rendering exact dynamic programming methods intractable. Approximate Dynamic Programming (ADP) via Linear Programming (LP) provides a way to approximate the value function and derive sub-optimal strategies. The novel feature of this paper is the derivation of a tractable lower bound via LP and the construction of a sub-optimal policy whose performance improves upon the lower bound. An illustrative perimeter surveillance example corroborates the results derived in this paper.

This paper was presented in part at ASME’s 5th Annual Dynamic Systems and Control Conference, Ft. Lauderdale, FL, Oct 2012; this paper also expands on a preliminary version presented at the IFAC Conference on Research, Education and Development of Unmanned Aerial Vehicles, Compiegne, France, November 2013. M. Park and S. Darbha are with the Department of Mechanical Engineering, Texas A&M University, College Station, TX, 77840, USA, e-mail: [email protected], [email protected]. K. Krishnamoorthy is with the InfoSciTex Corporation, Dayton, OH 45431, USA, e-mail: [email protected]. P. P. Khargonekar is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA, e-mail: [email protected]. M. Pachter is with the Department of Electrical Engineering, Air Force Institute of Technology, Dayton, OH 45433, USA, e-mail: [email protected]. P. R. Chandler (Retd.) was with the Autonomous Control Branch, Air Force Research Laboratory, Dayton, OH 45433, USA.

Note to Practitioners – In practice, one often encounters the curse of dimensionality in the application of dynamic programming to determine optimal policies for controlled Markov chains. This is true, in particular, for dynamic scheduling problems involving multiple robots/servers and queues of tasks that arrive in a stochastic fashion. The computation of the value function, critical to the determination of optimal policies, is nearly impractical.

Hence, one must settle for sub-optimal policies. Two natural questions arise: (1) How does one construct a sub-optimal policy? (2) How “good” is the constructed sub-optimal policy? A common strategy to tackle the first problem is to approximate the value function and construct a sub-optimal policy that is greedy with respect to the approximate value function. Typically, an approximate value function is constructed via a choice of basis functions. The question of how to choose the basis functions systematically for any problem is a difficult one; usually, the structure of the problem at hand is exploited in the construction of basis functions. The same approach is taken here: the state space is partitioned based on the reward structure, and the optimal cost-to-go or value function is approximated by a constant over each partition. The second question is related to the first in the sense that one needs to construct bounds for the performance of a sub-optimal policy. In this article, we construct upper and lower bounds for the value function (optimal performance) and use the lower bound as an approximate value function. Furthermore, we show that the resulting sub-optimal policy comes with a performance guarantee, in that it improves on the lower bound it was derived from. The literature is replete with techniques for computing upper bounds; however, there is little work on lower bounds, which are also required for bounding the sub-optimality of the policy. One encounters a prohibitively large number of constraints when computing an upper bound, and has to deal with disjunctive linear inequalities when computing a lower bound. The problem structure is exploited here to circumvent these difficulties. The upper and lower bounds to the value function developed in this paper could also be used to refine the partitions by identifying the partition with the largest difference between the upper and lower bounds; such a partition could be refined further using the structure of the problem. For practitioners, this could be a useful set of tools for generating sub-optimal policies for any controlled Markov chain with a reward function that is amenable to state aggregation.

Index Terms—Stochastic Control, Approximate Dynamic Programming, Linear Programming, Robotic Surveillance.

I. INTRODUCTION

This paper is motivated by a robotic perimeter surveillance problem. As shown in Fig. 1, one or more robots (identified as a black pentagon in the figure) and Unattended Ground Sensors (or simply UGSs, identified by red and green circles in Fig. 1) placed along the perimeter assist a remotely located human operator in the task of classifying incursions across the perimeter as either a nuisance or a threat. Incursions are


stochastic and have both a spatial and temporal component; we assume that the statistics of the incursion processes are known.

Fig. 1: A schematic of the perimeter patrol problem - 15 UGSs and 1 robot

In order to aid the robot-operator team in the timely classification of incursions, UGSs are installed at locations along the perimeter where incursions occur; these locations will be referred to simply as stations. At a station, the UGS flags an incursion, signals an alert and communicates the alert immediately to the robots. The red color associated with a station in the figure indicates that an alert has been signaled at that location and has not yet been serviced. Subsequently, a robot services the alert by visiting the UGS station where it was raised and transmitting images, video, or other sensory information to the operator using its on-board camera and other sensing devices. The operator performs the role of a classifier based on the information supplied by the robots. The classification accuracy depends both on the volume and the freshness of the information supplied. For accurate classification, the robot should provide as much video or other evidence about the incursion to the operator as possible. Subject to certain limits, we assume that the volume of information increases with the amount of time spent by the robot at an alert location. For timely and accurate classification of incursions, the delay time, defined as the time between an alert signal and the time a robot attends to the alert, should be minimized. Thus, there are two competing needs: a robot needs to spend more time at an alert location, and it also needs to service other alerts as quickly as possible. A natural question that arises is the following: how long should a robot spend servicing an alert before moving on to service other alerts? In this paper, we discretize the problem spatially and temporally and recast the optimization problem as follows: should the robot spend the next time interval at the alert location, in terms of maximizing the expected discounted payoff? The payoff considered herein is an increasing function

of the time spent at the alert site (dwell time) and a decreasing function of the delay in servicing alerts. This problem is naturally posed as a Markov Decision Problem (MDP). However, the number of states runs into billions even for a modest size problem. For example, if one considers two robots and eight alert locations, with a maximum allowable delay time of 30 units, the number of states exceeds 30 billion! Hence, solving Bellman’s equation to compute the optimal payoff (value function) is computationally intractable. For this reason, we consider a Linear Programming (LP) based approximate dynamic programming solution strategy [1]. This approach provides an upper bound on the optimal value function, and an estimate of the quality of the resulting sub-optimal policy, e.g., see [2], [3]. The main contributions of this paper are: (i) In Theorem 1, we present both upper and lower bounds to the value function as the component-wise minima of all feasible solutions to generalizations of Bellman’s inequalities. These inequalities require the specification of disjoint sets of the state space, and the bounds take a constant value for all the states in the specified disjoint sets. The bounds from Theorem 1 help evaluate the quality of any sub-optimal policy. (ii) In Theorem 2, we exploit the structure of the perimeter surveillance problem and reduce the computation of the upper and lower bounds to the determination of the optimal solution of a linear program in fewer variables. (iii) In Theorem 3, we present a sub-optimal policy that uses the lower bound as an approximate value function and show that its performance is at least as good as the lower bound. We also show that, given a class of lower bounds that take a constant value for all the states in specified disjoint sets, the lower bound computed here dominates every other lower bound in the class; in this sense, it is a non-trivial lower bound. We use the perimeter surveillance application as a vehicle to test the effectiveness of the presented methods and present corroborating numerical results.

A. Relationship to existing Literature

Perimeter surveillance problems arise in a variety of practical applications and have recently received significant attention in the literature; for example, see [4], [5], [6], [7]. The results described in this article and our prior work in [8], [9], [10], [11], [12] differ from the literature in addressing the need to balance the information gained by the robots with the “quality of service” requirement of attending to alerts signaled at the UGS locations in a timely manner. This paper builds on a preliminary conference paper [13] in three ways: we provide a performance guarantee for the sub-optimal policy; we consider the case of two robots in surveillance instead of a single robot; and, as the novel feature of this work, we provide guarantees on bounds for optimal performance and on the performance of a sub-optimal policy while exploiting the structure of the problem in terms of partial ordering inequalities satisfied by the value function. In particular, Theorem 2 circumvents the need for the use of


results developed in [14] for factored MDPs. The fundamental difference in the mechanism of reduction of Bellman inequalities between the approach considered here and that of [14] lies in exploiting the structure of the evolution equations of the states. In contrast to [14], the mechanism is independent of the transition probabilities. In terms of our previous work, only [9] deals with two robots; however, it focuses on the computation of the optimal policy, while the focus of the present article is to develop sub-optimal policies with bounds. The computational complexity involved in determining the optimal policy scales poorly with the size of the perimeter patrol problem and hence, one must consider sub-optimal policies even for problems of modest size. The use of LP techniques for solving Dynamic Programming (DP) problems was introduced in [15], [16]. The use of state aggregation, i.e., partitioning of the state space, and the construction of sub-optimal policies using approximate value functions is discussed in [17]. A state aggregation based Approximate Dynamic Programming (ADP) method has been used for operating theater planning [18] and for Partially Observable MDPs (POMDPs) [19]. The LP based approach to approximate dynamic programming is discussed in [1], [2], [3]. A feature based approximation of the value function and the resulting LP formulation for a line scheduling problem is detailed in [20]. The results in this paper differ from the existing literature in two ways: (1) the restricted or constrained LPs that we obtain are computationally tractable and hence, there is no need for column generation or random sampling techniques [2], and (2) this work presents a way to construct upper as well as lower bounds using LPs, a marked departure from other work in this area, other than the authors’ prior work [21]. Although the MDP considered herein is a factored MDP [14], the structure of the problem allows us to readily identify the binding constraints within a smaller set and therefore avoid the cumbersome procedures of [14] in computing the upper bound. The results in this paper also differ from the earlier work of the authors [12]: in this work, we provide a performance guarantee on the sub-optimal policy constructed via the lower bound. The rest of the paper is organized as follows: in Section II, we present the mathematical formulation for the class of surveillance problems considered herein. In Section III, we present the main results after laying out the necessary mathematical machinery. For all lemmas and theorems, we leave out the proofs from the main text for clarity and instead collect them in the Appendix. In Section IV, we consider an example of the perimeter surveillance problem and corroborate the proposed LP methods via numerical and graphical results.

II. PROBLEM FORMULATION

We discretize the perimeter patrol problem spatially and temporally; nodes on the perimeter partition it uniformly. The distance between adjacent nodes on the perimeter is of unit length and the time taken by a robot to traverse between two adjacent nodes is a unit of time. Let the real vectors xr(t), u(t) denote the states of the nr robots in the collection and

their control actions respectively at time t. Let a real vector xs(t) denote the states associated with the ns UGS locations. Further, let d(t) ∈ {0, 1}^ns denote the vector of disturbances (incursions) occurring at the ns UGS locations. We intentionally leave out a precise definition of the states xr and xs, to allow for greater generality and to accommodate application needs as they arise later in the article. As an example, one may include in the definition of xr the location of a UAV, its direction of travel around the perimeter, and the amount of time it has spent (or dwelled) servicing an alert at a UGS location, while xs may contain the delays associated with the alerts/alarms at the UGS stations. The control actions of the robots at time t are captured by the vector u(t); a sample control action indicates whether a robot should dwell at its current location, continue in the same direction, or reverse its direction of travel. The disturbance d(t) can take any of L possible values, namely d1, d2, . . . , dL, with corresponding probabilities p1, . . . , pL; these probabilities are assumed to be known a priori. The number of possible values the disturbance d(t) can take depends on the model of the incursion processes; for example, if at most one incursion is allowed at any time across the ns stations, then L = ns + 1; if, on the other hand, incursions can occur simultaneously at one or more stations, then L = 2^ns. Let the evolution of the states xr and xs be governed by the state transition equations:

xr(t + 1) = fr(xr(t), u(t)),   (1)
xs(t + 1) = fs(xr(t), xs(t), u(t), d(t)),   (2)

where fr and fs are suitably defined vector fields. For the sake of notational convenience, let the state of the system be x(t) := (xr(t), xs(t)). The evolution equations (1) and (2) can be combined as:

x(t + 1) = f(x(t), u(t), d(t)),   (3)

for the augmented vector field f. Additionally, there may be constraints on the state and control input, of the form:

g(x(t), u(t)) ≤ 0,   ∀t ≥ 0,   (4)

which model the allowable control actions of the robots. For example, a reasonable constraint might be the following: the state of a UGS can only be altered by the action of a robot that has spent a pre-specified amount of time in its neighborhood. Let Sr, Ss represent the sets of all possible discrete states of the robots and stations respectively. Let S = Sr × Ss, the Cartesian product of the sets Sr and Ss, denote the set of all possible states of the system. Let r(x, u) denote the one-step payoff/reward associated with the state x and the control input u. Let U(x) denote the set of control actions associated with the state x. Let U be the set of all possible control actions. We focus our attention on stationary policies π ∈ Π, where π maps S into U, i.e., u = uπ(x) ∈ U(x). Consider the stochastic optimization problem: for a specified discount factor λ ∈ [0, 1), find a stationary policy π such


that the following objective is maximized:

V∗(x0) := max_{π∈Π} E[ ∑_{t=0}^{∞} λ^t r(x(t), uπ(x(t))) | x(0) = x0 ],   (5)

where Π is the set of all possible stationary policies. We make the following standard assumptions about finite state and control spaces:
• Assumption 1: The sets of allowed control actions for each robot are identical and finite, and are represented by Ur. The vector u may be expressed as u = (u1, u2, . . . , unr) ∈ Ur^nr =: U.
• Assumption 2: Since the problem has been discretized, the perimeter is of finite length, and since the disturbances and control decisions are finite, the sets Sr and Ss are also finite. Hence, the state space S of the system is finite.
Let V∗ denote the vector of values V∗(x0), where x0 ∈ S. It is well-known that V∗ satisfies Bellman’s equation [22]:

V∗(s) = max_u { r(s, u) + λ ∑_{z∈S} p(s, u, z) V∗(z) },   ∀s ∈ S,   (6)

where p(s, u, z) is the probability of transitioning from state s to z under the influence of control action u. However, computational tractability depends on |S|, |U| and L. For a modest size problem involving 2 robots and 8 stations, the value of |S| can be upwards of 180 billion! For this reason, conventional techniques for solving Bellman’s equation, such as value and policy iteration, are unsuitable. In the next section, we introduce a generalization of the Bellman inequalities that enables the computation of upper and lower bounds to V∗ via LPs.

III. MAIN RESULTS

Bellman’s equation is difficult to solve when the number of states is large. For this reason, one approximates the value function, V∗, for two purposes: (1) to provide bounds for the value function and (2) to construct a sub-optimal policy using the approximate value function. The bounds on the value function can sometimes be used to establish a sub-optimality bound on the performance of a policy. The main results of this paper deal exactly with these issues. Indeed, Theorem 1 provides methods to construct upper and lower bounds for the value function using an LP and a disjunctive LP respectively. Theorem 2 simplifies the formulation of the LP and the disjunctive LP in order to compute the bounds in a computationally tractable manner, by exploiting structure in the problem. Theorem 3 provides a performance guarantee for a sub-optimal policy constructed by using the lower bound as an approximate value function. We first set up the mathematical tools necessary to state our results. The main results are presented in the form of theorems, with the proofs provided in the Appendix.

Typically, the one-step reward function associated with robotic surveillance problems depends only on a few state variables, implying that many states of the corresponding MDP share the same reward, e.g., see the example problem reward function (31). All such states can be conveniently aggregated into a partition for the purposes of computing an approximate value function and bounds. Our idea of approximation involves partitioning the state space and so, we define a general partitioning scheme as follows:

General Partitioning Scheme: Let R ≥ 1. We will refer to the set GP = {P1, P2, . . . , PR} as a general partitioning scheme of order R if (i) P1, . . . , PR are disjoint subsets of S whose union is S, and (ii) for all i and for any two x, y ∈ Pi, U(x) = U(y), i.e., any two states in the same partition have the same set of allowable control actions. We will call the sets P1, . . . , PR general partitions, or simply partitions. We consider a class of MDPs, which includes the perimeter surveillance problem, that have the following structure:

Problem Structure (PS): There exists a partial ordering of the states in S such that a state x succeeding z, written x ⪰ z, implies (i) r(x, u) ≤ r(z, u) for every u, i.e., for the same control action u, the one-step reward at state x is no more than the one-step reward at state z, and (ii) f(x, u, d) ⪰ f(z, u, d) for every u, d, i.e., for the same control action u and disturbance d, the states to which x, z transition also retain the same ordering. For example, see the partial ordering scheme (29) for the perimeter patrol example problem. Note that the partial ordering should not be confused with dominance; for example, if a vector V1 dominates V2, i.e., V1 ≥ V2, then every component of V1 is no less than the corresponding component of V2.
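For concreteness, here is a minimal sketch (not from the paper) of what the two PS conditions ask of a model; succeeds(), reward() and f() are problem-specific stand-ins, and the toy demo at the end uses a scalar "delay" state purely for illustration.

    # Check PS for one comparable pair (x, z) with x "succeeding" z:
    # (i) rewards at x are no larger, and (ii) successors preserve the order.
    def check_ps_pair(x, z, succeeds, reward, f, actions, disturbances):
        assert succeeds(x, z)
        ok_reward = all(reward(x, u) <= reward(z, u) for u in actions)
        ok_dynamics = all(succeeds(f(x, u, d), f(z, u, d))
                          for u in actions for d in disturbances)
        return ok_reward and ok_dynamics

    # Toy demo: a larger delay "succeeds" a smaller delay, reward decreases with
    # delay, dwelling (u = 1) reduces the delay, a disturbance d can re-raise it.
    print(check_ps_pair(3, 1,
                        succeeds=lambda a, b: a >= b,
                        reward=lambda s, u: -s,
                        f=lambda s, u, d: max(s - u, d),
                        actions=[0, 1], disturbances=[0, 1]))   # True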

A. Notation

We will use bold and upper case letters, such as V, for vectors lying in ℜ^|S|. We will use x, z, s, along with their subscripts and other decorations such as x̃, to denote states. We will use V(x) to denote the component of V corresponding to a state x. We will use d and u, along with subscripts, to denote disturbances and actions respectively. We reserve the use of c for the initial probability distribution of the states or a positive scalar multiple thereof. Except for x, z, s, d, u, we use bold lower case letters, along with their subscripts, for vectors lying in ℜ^R; e.g., we will denote by w(i) the ith component of an R-dimensional vector w. We reserve r(u), rπ for the vectors of one-step rewards associated with an action u and a policy π respectively; as before, r(x, u) and r(x, π(x)) are their respective components corresponding to a state x. Similarly, P(u) and Pπ represent the transition probability matrices corresponding to the action u and the policy π respectively. We will denote by f̄(x, u, d) the index of the partition to which the state x transitions under the influence of control action u and disturbance d; in other words, if k = f̄(x, u, d), then f(x, u, d) ∈ Pk. We will use 1 to denote a vector whose components are all 1. The consequences of the problem structure (PS) are outlined in the following result:


Lemma 1. Let x, z correspond to two different initial states of the system described by (3) that satisfy PS. Let the corresponding trajectories, subject to the same sequence of inputs u(t) and disturbances d(t), be respectively x(t) and z(t). If x ⪰ z, then (a) x(t) ⪰ z(t) for all t ≥ 0, (b) V∗(x) ≤ V∗(z) and (c) V∗(f(x(t), u(t), d(t))) ≤ V∗(f(z(t), u(t), d(t))), t ≥ 0.

The solution to Bellman’s equation can be expressed as the following LP [15], which we refer to as the Exact LP (ELP). Let c ≥ 0 represent the discrete probability distribution of the initial states.

V∗ = argmin c · V,   subject to
V(s) ≥ r(s, u) + λ ∑_{z∈S} p(s, u, z) V(z),   ∀s ∈ S, ∀u,   (7)
V(x) ≥ V(z),   ∀z ⪰ x, x, z ∈ S.   (8)
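As a toy-scale illustration (not from the paper; all numbers are made up), the following sketch computes V∗ for a 3-state, 2-action MDP both by value iteration on (6) and by solving the ELP (7) with an off-the-shelf LP solver, and checks that the two agree; the ordering constraints (8) are omitted here since V∗ satisfies them automatically.

    import numpy as np
    from scipy.optimize import linprog

    lam = 0.9
    r = np.array([[1.0, 0.5], [0.2, 0.8], [0.0, 0.1]])            # r[s, u]
    P = np.array([                                                 # P[u][s, z]
        [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
        [[0.5, 0.4, 0.1], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]],
    ])
    nS, nU = r.shape

    # Value iteration on Bellman's equation (6).
    V = np.zeros(nS)
    for _ in range(2000):
        Q = r + lam * np.einsum('usz,z->su', P, V)                 # Q[s, u]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-12:
            break
        V = V_new

    # ELP (7): minimize c.V subject to V >= r(.,u) + lam*P(u)V for every u,
    # i.e., (lam*P(u) - I) V <= -r(.,u).
    c = np.ones(nS) / nS                                           # any c > 0 works
    A_ub = np.vstack([lam * P[u] - np.eye(nS) for u in range(nU)])
    b_ub = np.concatenate([-r[:, u] for u in range(nU)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
    print(np.allclose(res.x, V, atol=1e-5))                        # True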

The constraints (7) are referred to, in general, as the Bellman inequalities. Note that (8) implies that whenever the states z, x are comparable, as in z ⪰ x, then V(x) ≥ V(z). This constraint can be added to the ELP without loss of generality, since V∗ satisfies it. Adding this constraint is a key requirement (as will be shown later) towards establishing the main results in this paper. From [1], any feasible V to the Bellman inequalities upper bounds V∗. However, the computation of non-trivial lower bounds to V∗ has received scant attention in the literature. While the performance of any sub-optimal policy is a lower bound on the value function, the computation of the sub-optimal performance requires the solution of linear equations in |S| variables and is difficult. Here, by performance, we mean the expected infinite horizon discounted reward that is gained by implementing the sub-optimal policy. One must exploit the structure of the problem in order to make headway into the computation of even approximate solutions and bounds. One can generalize the Bellman inequalities for the purpose of computing upper and lower bounds to V∗. Towards that end, we define two Generalized Bellman Inequalities (GBIs) given by:

GBI-1: ∀i, u,
V(s) ≥ r(s, u) + λ ∑_{z∈S} p(s, u, z) V(z),   ∀s ∈ S,   (9)
V(s) ≥ V(z),   ∀z ⪰ s, s, z ∈ S,   (10)
V(s) = V(z),   ∀s, z ∈ Pi.   (11)

GBI-2: ∀i, u,
min_{x∈Pi} V(x) ≥ min_{x∈Pi} [ r(x, u) + λ ∑_{z∈S} p(x, u, z) V(z) ],   (12)
V(x) ≥ V(z),   ∀z ⪰ x, x, z ∈ S.   (13)

We also observe that if the partitions Pi contain exactly one element for each i, then GBI-1 and GBI-2 reduce to the constraints of the ELP; in this sense, they are a generalization of the Bellman inequalities. We will denote the feasible sets of GBI-1 and GBI-2 respectively by FUB and FLB. To compactly express the main results in this paper, we introduce the following additional definitions:

Projection: Given a general partitioning scheme GP of order R, we define a projection map ψ : ℜ^|S| → ℜ^R so that ŵ is a projection of V with respect to GP, i.e., ŵ = ψ(V), if ŵ(i) = min_{x∈Pi} V(x) for all i.

Lifting: Given a general partitioning scheme GP of order R, we define a lifting operator φ : ℜ^R → ℜ^|S| so that V = φ(w) if, for all x ∈ Pi and for all i, V(x) = w(i).

Floor: Given a general partitioning scheme GP of order R, we define ⌊V⌋ to be the floor of V with respect to GP if and only if, for all x ∈ Pi and for all i, ⌊V⌋(x) = min_{y∈Pi} V(y), or more compactly, ⌊V⌋ = φ(ψ(V)). It is clear that V ≥ ⌊V⌋ for any V.

Using these definitions, one may simplify the representation of GBI-1 and GBI-2 as follows:

GBI-1: V ≥ r(u) + λP(u)V, ∀u;   V(s) ≥ V(z), ∀z ⪰ s, s, z ∈ S;   V(s) = V(z), ∀s, z ∈ Pi, ∀i.
GBI-2: ⌊V⌋ ≥ ⌊r(u) + λP(u)V⌋, ∀u;   V(s) ≥ V(z), ∀z ⪰ s, s, z ∈ S.

The first main result of this paper deals with upper and lower bounding the value function V∗ through the optimal solutions of regular and disjunctive LPs respectively.

Theorem 1. Let GP = {P1, . . . , PR} be a general partitioning scheme of order R. Then the following hold:
(a) FUB, FLB are non-empty and lower bounded, i.e., for some real α > −∞, V ∈ FUB ∪ FLB ⇒ V ≥ α1.
(b) V1, V2 ∈ FUB ⇒ min{V1, V2} ∈ FUB. Hence, the component-wise minimum of all feasible solutions, V̄ := min{V : V ∈ FUB}, is well defined and V̄ ∈ FUB; by construction, V ∈ FUB ⇒ V ≥ V̄.
(c) Similarly, V̲ := min{V : V ∈ FLB} is well defined and V̲ ∈ FLB; moreover, V ∈ FLB ⇒ V ≥ V̲.
(d) Let c > 0. Then
V̄ = argmin_{V∈FUB} c · V.
Similarly,
V̲ = argmin_{V∈FLB} c · V.
(e) V̲ = ⌊V̲⌋.
(f) The value function V∗ is bounded as follows: V̄ ≥ V∗ ≥ V̲.
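As a small aside, the projection, lifting and floor maps are straightforward to realize; the sketch below (made-up partition and numbers, not from the paper) is only illustrative.

    import numpy as np

    # Hypothetical partition of 6 states into R = 3 blocks.
    parts = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]

    def project(V, parts):           # psi: w(i) = min over P_i of V
        return np.array([V[p].min() for p in parts])

    def lift(w, parts, n_states):    # phi: constant value w(i) on each P_i
        V = np.empty(n_states)
        for i, p in enumerate(parts):
            V[p] = w[i]
        return V

    def floor_op(V, parts):          # floor(V) = phi(psi(V)) <= V component-wise
        return lift(project(V, parts), parts, len(V))

    V = np.array([3.0, 2.5, 1.0, 1.2, 0.8, 4.0])
    print(project(V, parts))                     # [2.5  0.8  4. ]
    print(np.all(V >= floor_op(V, parts)))       # True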


Parts (a) through (c) describe the properties of the sets FUB and FLB and assert the existence of a solution in each set that is dominated by every other solution. Hence, these solutions are optimal for the respective mathematical programs considered in part (d). Since the solutions V̄ ∈ FUB and V̲ ∈ FLB are independent of c > 0, the optimal solutions of the mathematical programs in (d) are independent of the choice of c [23]; in this sense, state aggregation and partitioning reduces one difficulty associated with the choice of the cost vector c that plagues the basis function approach to Approximate Dynamic Programming (ADP) [24]. In other words, the bounds given for the value function V∗ in part (f) of Theorem 1 depend only on the partition GP and the description of the MDP. While parts (b) and (c) define the bounds for the value function V∗, it is simpler to compute them as the optimal solutions of the mathematical programs in part (d). Part (e) of Theorem 1 allows us to express the lower bounding mathematical program in part (d) using the components of the projection ψ(V̲). The equality constraint (11) allows one to simplify the upper bounding program in part (d) using the components of the projection ψ(V̄). However, the computation is not as straightforward or computationally tractable if one observes that FUB is the feasible set of linear inequalities that are O(|S| × |U|) in number, and that FLB is the feasible set of disjunctive linear inequalities. We will exploit the structure of the problem to simplify these inequalities.

Our focus now is on simplifying the inequalities by exploiting PS and by carefully partitioning the state space. Since V̲ is constant across all states in a partition, V̲ = φ(w) for some w ∈ ℜ^R. Since w ∈ ℜ^R completely determines the lower bound V̲, it is desirable to find an LP (disjunctive or otherwise) involving fewer variables that can compute w. Towards that end, let p(s, u, k) denote the probability of transitioning from state s to some state in Pk under the action u; hence p(x, u, k) = ∑_{s∈Pk} p(x, u, s). Let c(i) := ∑_{s∈Pi} c(s) be the ith component of the vector c. We also need to simplify the partial ordering constraints (8). Notice that if x, z ∈ Pi and are comparable as x ⪰ z, the constraint V(x) ≤ V(z) is readily satisfied, as V(x) = V(z). Hence, we only have to focus on partial ordering constraints where x, z belong to different partitions. To avoid this complication, we consider partitioning schemes where the partitions automatically obey the partial ordering constraints. Towards this end, we define an ordering scheme for the partitions.

Partial Order of Partitions: Given a general partitioning scheme GP = {P1, . . . , PR}, we define Pi ⪰ Pj if and only if
1) for every x ∈ Pi, there is a z ∈ Pj such that x ⪰ z; moreover, there is no s ∈ Pj such that s ⪰ x, and
2) for every z ∈ Pj, there is an x ∈ Pi such that x ⪰ z; moreover, there is no s ∈ Pi such that z ⪰ s, and
3) for every u, max_{x∈Pi} r(x, u) ≤ min_{s∈Pj} r(s, u).

Consistent Partitioning: A general partitioning scheme GP = {P1, . . . , PR} of order R is consistent if and only if x, z ∈ S and x ⪰ z implies one of the following: (i) there exist distinct partitions Pi, Pj such that x ∈ Pi, z ∈ Pj and Pi ⪰ Pj, or

(ii) there exists a partition Pi such that x, z ∈ Pi.
We then refer to P1, . . . , PR as consistent partitions. We note that the perimeter patrol problem (as will be seen later) allows for the existence of a consistent partitioning scheme. The following result provides a method to compute the bounds V̲ and V̄ using fewer variables in the disjunctive LP:

Lemma 2. Let c > 0 and GP be a consistent partitioning scheme of order R. Then,
(a) V̲ = φ(w̲) if and only if w̲ is the optimal solution of the following disjunctive LP, i.e.,
w̲ := argmin c · w,   subject to
w ≥ ψ(r(u) + λP(u)φ(w)),   ∀u,   (14)
w(j) ≥ w(i),   ∀Pi ⪰ Pj.   (15)
(b) V̄ = φ(w̄) if and only if
w̄ := argmin c · w,   subject to
φ(w) ≥ r(u) + λP(u)φ(w),   ∀u,   (16)
w(i) ≥ w(j),   ∀Pj ⪰ Pi.   (17)
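A sketch of how the upper-bound LP (16)-(17) of Lemma 2(b) could be assembled with an off-the-shelf solver (all data below are made up; the ordering pairs are illustrative only, whereas in the paper they come from a consistent partitioning under PS):

    import numpy as np
    from scipy.optimize import linprog

    lam = 0.9
    nS, nU, R = 6, 2, 3
    rng = np.random.default_rng(0)
    r = rng.uniform(0.0, 1.0, size=(nS, nU))                # r[s, u]
    P = rng.dirichlet(np.ones(nS), size=(nU, nS))           # P[u][s, :], row-stochastic
    part_of = np.array([0, 0, 1, 1, 1, 2])                  # state -> partition index
    E = np.zeros((nS, R)); E[np.arange(nS), part_of] = 1.0  # lifting matrix: phi(w) = E w
    order = [(2, 1), (1, 0)]                                 # pairs (i, j) meant as P_i "succeeds" P_j

    A_ub, b_ub = [], []
    for u in range(nU):
        # (16): E w >= r(.,u) + lam*P(u)*E w  <=>  (lam*P(u)E - E) w <= -r(.,u)
        A_ub.append(lam * P[u] @ E - E)
        b_ub.append(-r[:, u])
    for (i, j) in order:
        # order constraint: w(j) >= w(i)  <=>  w(i) - w(j) <= 0
        row = np.zeros(R); row[i], row[j] = 1.0, -1.0
        A_ub.append(row[None, :]); b_ub.append(np.zeros(1))

    c_bar = np.ones(R)                                       # aggregated c > 0
    res = linprog(c_bar, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * R)
    w_ub = res.x                                             # candidate upper bound: V_ub = E @ w_ub
    print(w_ub)

Theorem 2 below shows that, when the partitions have small maximal and minimal sets, most of the |S| × |U| Bellman-type rows assembled above are redundant and can be dropped.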

More often than not, the partitions admit subsets which contain maximal and minimal elements with respect to the partial order. Such sets can be used to reduce the number of constraints in (14), (15), (16) and (17). We define maximal and minimal subsets of a partition as follows:

Maximal and Minimal Sets:
• A set Pmin,i ⊂ Pi is a minimal set of Pi if, given z ∈ Pmin,i, the following are satisfied: (i) there is no x ∈ Pi such that z ⪰ x, and (ii) there exists s ∈ Pi such that s ⪰ z.
• Similarly, a set Pmax,i ⊂ Pi is a maximal set of Pi if, given any z ∈ Pmax,i, the following are satisfied: (i) there is no x ∈ Pi such that x ⪰ z, and (ii) there exists an s ∈ Pi such that z ⪰ s.

Correspondingly, we define an extremal partition as follows:

Extremal Partition: A consistent partition Pi is an extremal partition if one can find states s̄i, s̲i ∈ Pi such that s̄i ⪰ x ⪰ s̲i for every x ∈ Pi. In other words, the sets Pmax,i = {s̄i} and Pmin,i = {s̲i} are both singleton sets.

The next result simplifies the constraints of Lemma 2 by considering maximal and minimal sets for each partition. For some systems, such as the perimeter surveillance problem, one can construct extremal partitions. The smaller the size of the maximal and minimal sets, the greater the simplification of the constraints, as shown in the following result. In addition, if the partitions are all extremal partitions, the disjunctive LP for lower bounding reduces to an LP of lower dimension!

Definition: The function ψmax : ℜ^|S| → ℜ^R is such that if w = ψmax(V), then w(i) = min_{x∈Pmax,i} V(x). Similarly, ψmin : ℜ^|S| → ℜ^R is such that if w = ψmin(V), then w(i) = max_{x∈Pmin,i} V(x).

Theorem 2. Let c > 0 and GP be a consistent partitioning scheme of order R. For each i, let Pmax,i and Pmin,i be the maximal and minimal sets of Pi respectively. Then,
(a) V̲ = φ(w̲) if and only if w̲ is the optimal solution of the following Disjunctive Bounding LP (DBLP), i.e.,
w̲ := argmin c · w,   subject to
w ≥ ψmax(r(u) + λP(u)φ(w)),   ∀u,   (18)
w(j) ≥ w(i),   ∀Pi ⪰ Pj.   (19)
Similarly, the upper bound V̄ = φ(w̄), where
w̄ := argmin c · w,   subject to
w ≥ ψmin(r(u) + λP(u)φ(w)),   ∀u,   (20)
w(j) ≥ w(i),   ∀Pi ⪰ Pj.   (21)
(b) If every Pi is an extremal partition, the lower bound V̲ = φ(w̲), where w̲ is the optimal solution of the following LP (LBLP):
w̲ := argmin c · w,   subject to
w(i) ≥ r(s̄i, u) + λ ∑_{k=1}^{R} p(s̄i, u, k) w(k),   ∀i, u,
w(j) ≥ w(i),   ∀Pi ⪰ Pj.   (22)
Similarly, the upper bound V̄ = φ(w̄), where w̄ is the optimal solution of the following LP (UBLP):
w̄ := argmin c · w,   subject to
w(i) ≥ r(s̲i, u) + λ ∑_{k=1}^{R} p(s̲i, u, k) w(k),   ∀i, u,
w(j) ≥ w(i),   ∀Pi ⪰ Pj.   (23)

The mechanism of reducing the number of constraints via partitioning does not depend on the transition probabilities having a certain structure. In this sense, this mechanism of constraint reduction is in stark contrast to the approach of [14] via factored MDPs, as they must satisfy specific state transition probability conditions; moreover, their approach is not readily applicable to the disjunctive LP provided in Theorem 2. The simplification we achieve can be quite drastic when the problem admits extremal partitions, as is the case with the perimeter patrol problem.

B. A sub-optimal policy with guaranteed performance

This subsection deals with the construction of a sub-optimal policy using the lower bound V̲ = φ(w̲) as an approximate value function. The key contribution here is that we show that the sub-optimal policy comes with a provable performance guarantee. The standard choice for a sub-optimal policy that employs an approximate value function is the policy that is greedy with respect to the approximation. For the lower bound, this would be given by: ∀x ∈ S,

π(x) = argmax_u { r(x, u) + λ ∑_{z∈S} p(x, u, z) V̲(z) }.

But the above policy could be undesirable from a computational perspective, since it requires computing the argmax{·} over all states x ∈ S. Hence, for a given general partitioning strategy GP, a desirable subclass of policies Πs ⊂ Π is the following:

Definition: A policy π ∈ Πs if, for every i, x, z ∈ Pi ⇒ uπ(x) = uπ(z).

From here on, for the sake of notational convenience, we will write π(x) for uπ(x), or even π(i) for π(x), ∀x ∈ Pi. Essentially, policies in Πs are desirable from the viewpoint of implementation. Since the sub-optimal action corresponding to states in the same partition is the same, one can implement the policy by storing the partition-action pairs and using a membership function that determines the partition to which a state belongs. So, using the lower bound, a sub-optimal greedy policy π ∈ Πs would be given by: ∀x ∈ Pi,

π(x) = argmax_u { ψ(r(u) + λP(u)φ(w̲))(i) }.   (24)

We are interested in a sub-optimal policy that comes with a performance guarantee. In particular, we wish to have Vπ ≥ V̲. Unfortunately, the policy π given by (24) may not have that guarantee. This is so because some of the partial ordering constraints (19) can be binding at the optimal solution w̲ of the DBLP. For this reason, we have to modify the policy in the following manner. Let B be the set of partitions such that Pi ∈ B implies that there is some u for which the following equation holds:

w̲(i) = ψ(r(u) + λP(u)φ(w̲))(i).

Clearly, if Pj ∉ B, then w̲(j) > ψ(r(u) + λP(u)φ(w̲))(j) for every u.

Lemma 3. B ≠ ∅; moreover, corresponding to every Pj ∉ B, there is a Pi ∈ B such that Pi ⪰ Pj and w̲(i) = w̲(j).

Since for each Pj ∉ B there exists a partition Pi ∈ B such that Pi ⪰ Pj and w̲(j) = w̲(i), we call such a partition Pi ∈ B a conjugate partition of Pj ∉ B; we will also refer to its index by i = n∗(j). The modified (greedy) sub-optimal policy, πsub, is given by:
• If Pi ∈ B, then for every x ∈ Pi, we set
πsub(x) := π(i) = argmax_u { ψ(r(u) + λP(u)φ(w̲))(i) }.   (25)
• If Pi ∉ B, then for every x ∈ Pi, πsub(x) := π(i) = π(n∗(i)).

For each policy π ∈ Πs, there is a corresponding sub-optimal performance function, Vπ, that can be computed by solving the |S| simultaneous equations:

Vπ = rπ + λPπ Vπ.   (26)
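For a toy-sized Pπ and rπ (made-up numbers), (26) is just a linear solve; the sketch below is only illustrative, since at the state counts of interest forming and solving this system is exactly what the paper avoids.

    import numpy as np

    lam = 0.9
    P_pi = np.array([[0.6, 0.3, 0.1],
                     [0.2, 0.7, 0.1],
                     [0.1, 0.2, 0.7]])   # row-stochastic transition matrix under pi
    r_pi = np.array([1.0, 0.4, 0.0])     # one-step rewards under pi
    # (26): V_pi = r_pi + lam * P_pi V_pi  <=>  (I - lam * P_pi) V_pi = r_pi
    V_pi = np.linalg.solve(np.eye(3) - lam * P_pi, r_pi)
    print(V_pi)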


Note that one can show that Vπ satisfies the partial ordering constraints, i.e., Vπ(x) ≥ Vπ(z) if z ⪰ x, in a fashion similar to part (b) of Lemma 1. For a policy π ∈ Πs, the corresponding lower-bounding disjunctive LP, referred to as the Policy-specific DBLP (or simply PDBLP(π)), is given by:

wπ := argmin c · w,   subject to
w ≥ ψmax(rπ + λPπ φ(w)),   (27)
w(j) ≥ w(i),   ∀Pi ⪰ Pj.   (28)

The modified sub-optimal policy has been chosen so that it comes with the performance guarantee that we sought. This is captured in the following result.

Theorem 3. For every π ∈ Πs,
V∗ ≥ Vπsub ≥ V̲ = φ(w̲) = φ(wπsub) ≥ φ(wπ).
In other words, the performance of the sub-optimal policy πsub is guaranteed to be no less than the lower bound V̲; moreover, V̲ equals the greatest lower bound that can be computed using PDBLP(π) over all policies π ∈ Πs.

Suppose the following condition holds: ∀u, ∀i, the one-step reward r(x, u) = ri(u) for any state x ∈ Pi. Then, a myopic or greedy policy, πg, that maximizes the immediate (one-step) reward is given by:

πg(x) = argmax_u ri(u),   ∀x ∈ Pi, ∀i.

Here, the control action is the same across all the states in the partition. Therefore, πg ∈ Πs, and so a lower bound on its performance Vπg can be obtained by solving PDBLP(πg). This lower bound is dominated by V̲, i.e., V̲ = φ(wπsub) ≥ φ(wπg). So, it is reasonable to expect that Vπsub ≥ Vπg, i.e., the sub-optimal policy we recommend will outperform a myopic policy.

C. Symmetry considerations and simplification

Symmetry in a system can be exploited by aggregation, as it induces equivalence classes of states. If Ei is an equivalence class, it follows that Ei is a subset of a consistent partition. Symmetry also implies that V∗(x) = V∗(z) for all x, z ∈ Ei. These constraints may easily be accommodated in the UBLP by setting x ⪰ z for all x, z ∈ Ei, and in the LBLP by requiring Ei to be contained wholly in a consistent partition. Moreover, if x, z ∈ Ei, we have f̄(x, u, dl) = f̄(z, u, dl) for every u and dl. Hence, the inequality constraints corresponding to states in an equivalence class, of the form:

w(i) ≥ ri(u) + λ ∑_{l=1}^{L} pl w(f̄(x, u, dl)),   ∀i, u,

can be replaced by a single inequality constraint where x is a representative state of the equivalence class. Similarly, a disjunctive constraint of the form:

w(i) ≥ min_{s∈Pmax,i} [ ri(u) + λ ∑_{l=1}^{L} pl w(f̄(s, u, dl)) ],

simplifies in the following way: the minimization on the right hand side of the inequality need only be carried out among representative states of the equivalence classes contained in the set Pmax,i.

In the following section, we illustrate a perimeter patrol example problem and showcase numerical and graphical results that corroborate the main results of the paper.

IV. PERIMETER PATROL EXAMPLE PROBLEM

Consider a perimeter to be monitored with the aid of nr = 2 identical robots. Let N = 8 nodes discretize the perimeter uniformly, with some of the nodes corresponding to the locations of UGSs. Let the set of nodes be labeled N := {0, 1, . . . , N − 1} and the set of UGS locations be symmetrically located at Ω ⊂ N (Ω = {0, 2, 4, 6}). If a UGS detects an incursion, an alert is signaled at the location and communicated instantaneously to the robots. Let ai(t) denote the action of the ith robot at time t, so that the control input u(t) = (a1(t), a2(t), . . . , anr(t)). The set of allowable actions for a robot is {1, 0, −1}, with ai(t) = 0 if it dwells at its current location, and 1 or −1 if it moves in the counter-clockwise (CCW) or clockwise (CW) direction respectively. The maximum number of allowable values of u is 3^nr. The disturbance input (incursion) at the jth station is denoted by dj(t) ∈ {0, 1}, with dj(t) = 1 if there is an incursion at time t and 0 otherwise. The disturbance input d(t) is an ns-tuple with its jth component being dj(t), j ∈ Ω. Let δ(·) denote the Kronecker delta function and σ(·) = 1 − δ(·). We will assume that each station has an independent alert queue, so that the maximum number of allowable values of the disturbance input d(t) is 2^ns, and the arrival of incursions at each queue is a Bernoulli process. Let the probability of no incursion occurring at a station in a time unit be pα = e^−α, where α is the arrival rate of the alerts. Then the probability that a given set of k stations signals an alert at a time step is pα^(ns−k) (1 − pα)^k. We consider the following additional restriction on the motion of the robots: a robot can only dwell at a UGS location; hence, the set of allowable actions at a non-UGS location for the ith robot is {−1, 1}. Let li(t), Ti(t) respectively denote the current location of the ith robot and the time it has spent at its current location. Let ls(t) and lr(t) denote the relative distance from the 1st robot to the nearest station and to the 2nd robot respectively, in the CCW direction, at time t. It is intuitive that states of the robots with the same ls and lr values are related by cyclic symmetry if the delays at the stations are also correspondingly cyclically permuted; in such a case, all states that can be transformed into each other by a cyclic permutation can be aggregated into a partition. The state of the robots is given by xr(t) = (ls(t), lr(t), T1(t), T2(t)). Let τj(t) denote the time elapsed since an alert that is yet to be serviced was signaled at the jth station. The state xs(t) is the ns-tuple of time delays, with the jth component being the time delay τj(t). We define two states x, z to satisfy the partial order x ⪰ z if

xr = zr and xs ≥ zs.   (29)
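A one-line check of the partial order (29) (illustrative sketch; the state layout below is an assumption for the demo):

    import numpy as np

    # x "succeeds" z iff the robot states agree and every station delay in x
    # is at least the corresponding delay in z.
    def succeeds(x, z):
        (xr, xs), (zr, zs) = x, z
        return np.array_equal(xr, zr) and np.all(np.asarray(xs) >= np.asarray(zs))

    x = ((0, 4, 0, 0), (3, 0, 2, 1))    # ((ls, lr, T1, T2), (tau_0, tau_2, tau_4, tau_6))
    z = ((0, 4, 0, 0), (1, 0, 2, 0))
    print(succeeds(x, z))                # True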


The governing equations may be expressed as:

ls(t + 1) = (ls(t) + a1(t)) mod N/ns,
lr(t + 1) = (lr(t) + a2(t)) mod N,
Ti(t + 1) = (Ti(t) + 1) δ(ai(t)),   i = 1, 2,
τj(t + 1) = h(τj(t), l1(t), a1(t), l2(t), a2(t), dj(t)),   ∀j ∈ Ω,   (30)

where

h(τj, l1, a1, l2, a2, dj) := max{ (τj + 1) σ(τj) [1 − max_{i=1,2} {δ(li − j) δ(ai)}], dj }.

The one-step reward function is given by

r(x, u) = ∑_{k=1}^{K} ψr(Tk) − β ||xs||∞,   (31)

where the information gain function ψr(·) is monotonically increasing in the time spent by the robot at a UGS location. This function is based on modeling the operator as a classifier (for details see [10]). The second term penalizes the tardy response of the robots in servicing alerts. For simplicity, we assume that Tmax is specified; for t ≥ 0, Ti(t) ∈ {0, 1, · · · , Tmax}, i = 1, 2, and τj(t) ∈ {0, 1, · · · , Dmax}. The other parameters were chosen to be: pα = e^(−2/60), weighing factor β = 0.01 and discount factor λ = 0.9. The corresponding MDP with ns = 4, N = 8, Tmax = 5, Dmax = 15 has |S| = 1,395,456 states, where:

|S| = (N²/ns)(Dmax + 1)^ns + 2 N Tmax (Dmax + 1)^(ns−1) + (ns − 1) Tmax² (Dmax + 1)^(ns−2),   (32)

with the three terms corresponding to neither robot dwelling, one robot dwelling, and both robots dwelling, respectively. As we can see, this problem scales rapidly with the number of nodes and Dmax. To reduce the size of the problem, states with the same xr, worst service delay τ(xs(t)) := max_{j∈Ω} τj(t), and alert status A(xs(t)) := (δ(τ1(t)), . . . , δ(τns(t))) are aggregated into the same partition. If x, z ∈ Pi, then r(x, u) = r(z, u), ∀u. The resulting partitions are extremal partitions; the number of partitions for this instance of the problem is M = 15,546, where:

M = (N²/ns)(Dmax(2^ns − 1) + 1) + 2 Tmax N (Dmax(2^(ns−1) − 1) + 1) + Tmax² (ns − 1)(Dmax(2^(ns−2) − 1) + 1),   (33)

with the three terms again corresponding to neither robot dwelling, one robot dwelling, and both robots dwelling. Part (b) of Theorem 2 is relevant for this example problem, as the chosen partitions can be shown to be extremal partitions. Indeed, consider a partition Pi. Since independent alert queues for all stations admit the possibility of τ(i) being the delay at every alert location where there is an un-serviced alert, we readily see that this situation corresponds to s̄i ∈ Pmax,i.

Fig. 2: Value function and bounds (all states)
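A quick arithmetic check of the counts (32)-(33) as written above (sketch, not from the paper):

    def num_states(N, ns, Tmax, Dmax):
        return ((N**2 // ns) * (Dmax + 1)**ns
                + 2 * N * Tmax * (Dmax + 1)**(ns - 1)
                + (ns - 1) * Tmax**2 * (Dmax + 1)**(ns - 2))

    def num_partitions(N, ns, Tmax, Dmax):
        return ((N**2 // ns) * (Dmax * (2**ns - 1) + 1)
                + 2 * Tmax * N * (Dmax * (2**(ns - 1) - 1) + 1)
                + Tmax**2 * (ns - 1) * (Dmax * (2**(ns - 2) - 1) + 1))

    print(num_states(8, 4, 5, 15), num_partitions(8, 4, 5, 15))      # 1395456 15546
    print(num_states(16, 8, 5, 15), num_partitions(16, 8, 5, 15))    # 183324639232 592942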

Hence, the lower bound can be computed using the LBLP in part (b) of Theorem 2. Fig. 2 and Fig. 3 show the value function and its bounds. The optimal value function, V∗, is computed using the value iteration method. The upper bound, Vub, and lower bound, Vlb, are computed as solutions to the UBLP and LBLP respectively. The sub-optimal performance function, Vπsub, is also shown in Fig. 3. Since the state space is huge, we show a representative subset of states in the plots. We pick states that belong to partitions with ls = 0, lr = 4, T1 = 0, T2 = 0, A = (0, 1, 1, 1), and all possible maximum delays τ. In Fig. 3, the dotted lines represent the separation between different partitions. For this data set, the partitions chosen differ only in the value of τ, as shown on the x-axis. From Fig. 3, we see that the value function and the sub-optimal performance function are both bounded by the upper and lower bounds. We can also see that the sub-optimal performance function is very close to the value function. In this example, the percentage deviation between the upper and lower bounds is 57.5%, and the deviation between V∗ and Vsub is 4.2%. As mentioned earlier, the size of the problem scales rapidly with the number of nodes N. If we were to choose 16 nodes to discretize the perimeter, i.e., N = {0, 1, . . . , 15}, and if there are 8 stations, Ω = {0, 2, 4, 6, 8, 10, 12, 14}, which are symmetrically located, then the total number of states is ≈ 183 billion (32)! By adopting the same partitioning scheme as in the previous example, the number of partitions is M = 592,942 (33). The exact computation of the value function (or the sub-optimal performance corresponding to any sub-optimal policy) for this instance of the problem is not feasible owing to the exceedingly large number of states. However, using the proposed methods, we computed the upper and lower bounds for the value function, as shown in Fig. 4. In this example, the percentage error between the upper and lower bounds is 69.7%. Theorem 3 assures us that if we were to choose a policy that


is greedy with respect to the lower bound, we are guaranteed a sub-optimal performance that exceeds the lower bound. To further analyze the performance of the 2nd example in the absence of V∗, we resort to Monte-Carlo simulations.

Fig. 3: Value function and bounds (Sampled Partitions): ls = 0, lr = 4, T1 = 0, T2 = 0, A = (0, 1, 1, 1)

Fig. 4: Upper and Lower bounds for 2 robots and 8 stations

Fig. 5: Performance of the sub-optimal policy - Service delay

A. Simulation Results for the two robot problem

We performed Monte-Carlo simulations in order to test the effectiveness of the sub-optimal policy, πsub, via the following quantities of practical interest: (a) average dwell time, (b) average delay in servicing an alert, (c) worst-case service delay, and (d) mean information gained. The Monte-Carlo simulation run time was set to 60000 time units. We ran two simulations with two different alert rates, α = 2/60 and α = 6/60 respectively. Although the sub-optimal policy was designed for the first case, applying it to a different alert rate demonstrates the robustness of the policy. We simulated the response associated with the sub-optimal policy using the same randomly generated sequence of incursions at the same UGS locations. We collected the data from the Monte-Carlo simulations and the results are shown in Fig. 5 and Table I.
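A sketch of the kind of empirical, truncated discounted-payoff estimate used for the simulation study (and for Vemp below); policy(), step() and reward() are problem-specific stand-ins, and the smoke test at the end uses trivial dummy dynamics:

    import numpy as np

    def empirical_value(x0, policy, step, reward, lam=0.9, horizon=50, runs=200, seed=0):
        """Average discounted reward of `runs` rollouts of length `horizon` from x0."""
        rng = np.random.default_rng(seed)
        totals = []
        for _ in range(runs):
            x, total = x0, 0.0
            for t in range(horizon):
                u = policy(x)
                total += (lam ** t) * reward(x, u)
                x = step(x, u, rng)        # samples the disturbance d(t) internally
            totals.append(total)
        return float(np.mean(totals))

    # smoke test: single absorbing state with unit reward -> about (1 - lam^50)/(1 - lam)
    print(empirical_value(0, policy=lambda x: 0,
                          step=lambda x, u, rng: x,
                          reward=lambda x, u: 1.0))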

For the Monte-Carlo simulation, let na, nd and nτ respectively denote the total number of alerts, the number of alerts which were serviced with a dwell time d, and the number of alerts which were serviced after a time delay τ. The dwell time (or loitering time), i.e., the time spent by a robot servicing an alert, can range from 1 to 5 time steps. The upper plot of Fig. 5 shows the fraction of total alerts serviced, nd/na, as a function of the dwell (or loitering) time for the two different alert rates. The lower plot of Fig. 5 shows the fraction of total alerts serviced, nτ/na, as a function of the time delay associated with servicing an alert. As one can see, the sub-optimal policy enables the robots to service more than 90% of the alerts within a delay of 5 units of time. Furthermore, the percentage of alerts serviced where the maximum delay is more than 10 units of time is quite small. This can also be inferred from Table I, which shows the mean dwell time, mean service delay, and worst-case delay for the two different alert rates.

TABLE I: Simulation results

Alert rate (α)       2/60      6/60
Mean dwell time      3.2278    1.4321
Mean delay time      3.9172    5.6472
Worst delay time     17        29


Fig. 6: Plot of Empirical Performance Function and Bounds

As noted earlier, neither the value function V∗ nor the sub-optimal performance function Vπsub can be computed for the two robot example. We would still like to know how the performance function compares to the upper and lower bounds that we have computed. Towards this end, we define Vemp(x) to be the empirical performance function (an approximation computed by simulation) associated with the policy πsub. For demonstration purposes, we choose 15 partitions such that all partitions have the same ls = 0, lr = 2, T1 = T2 = 0, A = (1, 0, 0, 1, 0, 1, 0, 1), but each partition has a different worst delay time τ, ranging from 1 to 15. Because there are still too many states in a partition, we pick the maximal and minimal states in each partition as the initial states. For each initial state, we run the Monte-Carlo simulation for 50 time units. Since λ = 0.9, the discounted pay-off for t ≥ 50 is very small. For each initial state, 200 Monte-Carlo simulations with different stochastic disturbance input instances are performed. We take the average value of the discounted payoff from each initial state as the empirical value function of that state. Fig. 6 shows the computed Vemp(x) along with the bounds. Here, the x-axis represents τ, so each separation line is a boundary between partitions. As we can see, Vemp(x) is sandwiched between the computed upper and lower bounds. If we increase the number of Monte-Carlo simulations and the simulation duration for each initial state, Vemp(x) → Vπsub(x). Hence, the simulation results support our proposed methodology for large scale MDPs that exhibit the problem structure PS.

V. CONCLUSIONS

In this article, we considered a class of stochastic dynamic programs that were motivated by the perimeter patrol problem. Even for modest sized instances of this problem, the number of states associated with the stochastic dynamic program runs into the order of billions. We provided a method to construct lower and upper bounds for the value function associated with the dynamic programs using LP techniques. More importantly, we provide provable performance guarantees for a

sub-optimal policy derived from the lower bound. The size of the LPs (or disjunctive LPs) that one has to solve depends on the level of simplification/reduction that can be achieved with the constraint set. In contrast to existing methods that rely on random sampling techniques, we have established a rigorous method of constraint reduction that relies only on the problem structure. We offer a multi-level reduction scheme, with the most efficient scheme (both bounds computed via LPs of size equal to the number of partitions) available for MDPs that allow extremal partitions. This paper does not consider some related important problems, and these could be topics for future research. Two such issues are: (1) adaptive refinement of partitions: this is important when the number of partitions needs to be bounded from the point of view of computational tractability, and (2) decentralized MDPs: this topic is a difficult yet important one and deserves attention in a separate article. The basic issues boil down to the information available to each robot to make independent decisions and the computational and communication constraints on each robot.

ACKNOWLEDGMENTS

The material presented in this paper was based on work supported in part by the Air Force Office of Scientific Research (AFOSR) award no. FA9550-10-1-0392, and National Science Foundation (NSF) award no. ECCS-1015066.

APPENDIX
PROOFS OF LEMMAS AND THEOREMS

Lemma 1. Let x, z correspond to two different initial states of the system described by (3) that satisfy PS. Let the corresponding trajectories, subject to the same sequence of inputs u(t) and disturbances d(t), be respectively x(t) and z(t). If x ⪰ z, then (a) x(t) ⪰ z(t) for all t ≥ 0, (b) V∗(x) ≤ V∗(z) and (c) V∗(f(x(t), u(t), d(t))) ≤ V∗(f(z(t), u(t), d(t))), t ≥ 0.

Proof: The proof of (a) is by induction. At t = 0, it is readily true from the hypothesis. It suffices to show that if x(t) ⪰ z(t), then x(t + 1) ⪰ z(t + 1); however, this readily follows from the evolution equation (3) and part (ii) of PS. For the same sequence of inputs u(t), d(t), from part (i) of PS and part (a) of this lemma, we can infer that r(x(t), u(t)) ≤ r(z(t), u(t)) for t ≥ 0. Hence, for the same sequence of inputs u(t), d(t), the total discounted reward associated with the initial state x is no more than that associated with the initial state z. Taking the expectation over all the disturbances and maximizing over all the control actions, one readily obtains V∗(x) ≤ V∗(z). From part (a), we have: f(x(t), u(t), d(t)) = x(t + 1) ⪰ z(t + 1) = f(z(t), u(t), d(t)). It follows that V∗(f(x(t), u(t), d(t))) ≤ V∗(f(z(t), u(t), d(t))), t ≥ 0.


Theorem 1. Let GP = {P1, . . . , PR} be a general partitioning scheme of order R. Then the following hold:
(a) FUB, FLB are non-empty and lower bounded, i.e., for some real α > −∞, V ∈ FUB ∪ FLB ⇒ V ≥ α1.
(b) V1, V2 ∈ FUB ⇒ min{V1, V2} ∈ FUB. Hence, the component-wise minimum of all feasible solutions, V̄ := min{V : V ∈ FUB}, is well defined and V̄ ∈ FUB; by construction, V ∈ FUB ⇒ V ≥ V̄.
(c) Similarly, V̲ := min{V : V ∈ FLB} is well defined and V̲ ∈ FLB; moreover, V ∈ FLB ⇒ V ≥ V̲.
(d) Let c > 0. Then V̄ = argmin_{V ∈ FUB} c · V and, similarly, V̲ = argmin_{V ∈ FLB} c · V.
(e) V̲ = ⌊V̲⌋.
(f) The value function V∗ is bounded as follows: V̄ ≥ V∗ ≥ V̲.

Proof: Parts (b) and (c) follow from the non-negativity of the probabilities and are straightforward. Part (d) follows directly from (a), (b) and (c). Hence, it suffices to prove parts (a), (e) and (f).
(a) Let
β := max_{x,u} r(x, u) / (1 − λ).
It is easy to verify that V = β1 ∈ FUB. Similarly, if we set
γ := max_{i,u} min_{x∈Pi} r(x, u) / (1 − λ),
it is easy to verify that γ1 ∈ FLB; hence both sets are non-empty. Let V be any feasible solution to GBI-i. Define v := min_x V(x), rmin := min_{x,u} r(x, u), and α := rmin/(1 − λ). If V satisfies GBI-1, then for all u,
V ≥ r(u) + λP(u)V ≥ rmin 1 + λP(u)(v1) = rmin 1 + λv 1
⇒ v ≥ rmin + λv ⇒ v ≥ rmin/(1 − λ) = α.
So it is clear that V ∈ FUB ⇒ V ≥ v1 ≥ α1. If V ∈ FLB, then for all u,
⌊V⌋ ≥ ⌊r(u) + λP(u)V⌋ ≥ ⌊rmin 1 + λP(u)(v1)⌋ = ⌊rmin 1 + λv 1⌋ = (rmin + λv)1
⇒ v ≥ rmin + λv ⇒ v ≥ α.
Again, V ≥ α1.
(e) Let V ∈ FLB. Then, from (12) and (13), for all u,
⌊V⌋ ≥ ⌊r(u) + λP(u)V⌋ and ⌊⌊V⌋⌋ = ⌊V⌋ ≥ ⌊r(u) + λP(u)⌊V⌋⌋.
If z ⪰ x, then V(x) ≥ V(z) and hence V(x) ≥ ⌊V⌋(z), implying ⌊V⌋(x) ≥ ⌊V⌋(z). Hence, ⌊V⌋ satisfies constraints (12) and (13), so that ⌊V⌋ ∈ FLB. In particular, ⌊V̲⌋ ∈ FLB and, by part (c), ⌊V̲⌋ ≥ V̲; but V̲ ≥ ⌊V̲⌋; hence, V̲ = ⌊V̲⌋.
(f) From [1], V̄ ≥ V∗. We also observe that ⌊V∗⌋ ∈ FLB and hence, by part (c), ⌊V∗⌋ ≥ V̲. But V∗ ≥ ⌊V∗⌋ and hence V∗ ≥ V̲.
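Parts (b) and (d) can be exercised directly on a small example with a generic LP solver. The sketch below is illustrative only: the three-state, two-action MDP data are made up, and FUB is taken to be the set of vectors satisfying the Bellman inequalities V ≥ r(u) + λP(u)V for every u, as in the GBI-1 step of the proof (the aggregation constraints are omitted here, so this is the full-state-space LP whose growth with the number of states motivates Lemma 2 and Theorem 2 below).

```python
# Illustrative sketch (hypothetical data, not from the paper): computes the
# component-wise least element of the Bellman-inequality set
#   { V : V >= r(u) + lambda * P(u) V  for every action u }
# for a toy 3-state, 2-action MDP by solving  min 1.V  over that set, and
# cross-checks the result against value iteration.
import numpy as np
from scipy.optimize import linprog

lam = 0.9                                   # discount factor
P = np.array([[[0.7, 0.2, 0.1],             # P[u][x][z] = p(x, u, z)
               [0.1, 0.8, 0.1],
               [0.2, 0.2, 0.6]],
              [[0.3, 0.3, 0.4],
               [0.5, 0.4, 0.1],
               [0.1, 0.1, 0.8]]])
r = np.array([[1.0, 0.5, 0.2],              # r[u][x] = r(x, u)
              [0.4, 1.2, 0.1]])
nU, nX, _ = P.shape

# Stack the constraints -(I - lam*P(u)) V <= -r(u) for every action u.
A_ub = np.vstack([-(np.eye(nX) - lam * P[u]) for u in range(nU)])
b_ub = np.concatenate([-r[u] for u in range(nU)])

res = linprog(c=np.ones(nX), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nX)
V_lp = res.x                                # least feasible point (cf. parts (b), (d))

# Cross-check against value iteration: for c > 0 this exact LP recovers V*.
V = np.zeros(nX)
for _ in range(2000):
    V = np.max(r + lam * (P @ V), axis=0)
print(np.round(V_lp, 4), np.round(V, 4))
```

The number of variables and constraints in this LP grows with the size of the state space; the partition-based characterizations that follow replace it with problems whose size is governed by the order R of the partitioning scheme.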

Lemma 2. Let c > 0 and GP be a consistent partitioning scheme of order R. Then,
(a) V̲ = φ(w̲) if and only if w̲ is the optimal solution of the following disjunctive LP, i.e.,
w̲ := argmin_w c · w, subject to
w ≥ ψ(r(u) + λP(u)φ(w)), ∀u, (34)
w(j) ≥ w(i), ∀Pi ⪰ Pj. (35)
(b) Similarly, V̄ = φ(w̄) if and only if
w̄ := argmin_w c · w, subject to
φ(w) ≥ r(u) + λP(u)φ(w), ∀u, (36)
w(i) ≥ w(j), ∀Pj ⪰ Pi. (37)

Proof: (a) It suffices to show that (i) φ(w̲) ∈ FLB and (ii) ψ(V̲) satisfies the constraints (34) and (35); the former implies that φ(w̲) ≥ V̲ by part (c) of Theorem 1, and the latter implies that ψ(V̲) ≥ w̲, which in turn implies, from part (e) of Theorem 1 and the definition of the floor, that V̲ = ⌊V̲⌋ = φ(ψ(V̲)) ≥ φ(w̲). Consequently, V̲ = φ(w̲).
We first show that φ(w̲) ∈ FLB. It can readily be seen that constraint (12) is satisfied by φ(w̲). If z ⪰ x, then, by consistent partitioning, either (i) x and z belong to the same partition, in which case constraint (13) is satisfied, or (ii) x and z belong to different partitions, say x ∈ Pi and z ∈ Pj. By consistent partitioning, Pj ⪰ Pi; but if Pj ⪰ Pi, then w̲(i) ≥ w̲(j); consequently, φ(w̲)(x) ≥ φ(w̲)(z). Hence, φ(w̲) satisfies constraint (13) and φ(w̲) ∈ FLB.
We now show that ψ(V̲) satisfies constraints (34) and (35). Let ŵ := ψ(V̲), i.e., ŵ is the projection of V̲ with respect to the given partitioning scheme. By part (e) of Theorem 1, V̲ = ⌊V̲⌋, so that V̲ = φ(ŵ). Since V̲ satisfies (12), we have, for all i and u,
ŵ(i) ≥ ψ(r(u) + λP(u)φ(ŵ))(i),
which implies that ŵ satisfies (34). If Pi ⪰ Pj, let x ∈ Pj be a state at which ŵ(j) = V̲(x). By the definition of the partial order on partitions, there exists z ∈ Pi such that z ⪰ x; consequently, ŵ(j) = V̲(x) ≥ V̲(z) ≥ ŵ(i). Hence, ψ(V̲) satisfies constraints (34) and (35).
(b) As in part (a), it suffices to show that φ(w̄) ∈ FUB and that ψ(V̄) satisfies constraints (36) and (37). Since φ(w̄) satisfies (36), V = φ(w̄) satisfies the Bellman inequality (9). By definition, V = φ(w̄) also satisfies the equality constraint (11). As in the proof of part (a), if z ⪰ x, either x and z belong to the same partition, in which case constraint (10) is satisfied, or x ∈ Pi and z ∈ Pj with Pj ⪰ Pi. Since Pj ⪰ Pi implies V(x) = w̄(i) ≥ w̄(j) = V(z), it is clear that φ(w̄) satisfies constraint (10); hence, φ(w̄) ∈ FUB and so φ(w̄) ≥ V̄.
Let ŵ = ψ(V̄). Since V̄ satisfies the equality constraint (11), we have φ(ŵ) = V̄. Hence, V̄ satisfying (9) implies that φ(ŵ) satisfies (36). Let x ∈ Pj. Then, by the partial ordering of partitions, Pj ⪰ Pi implies that there is a z ∈ Pi such that x ⪰ z, and hence ŵ(i) = V̄(z) ≥ V̄(x) = ŵ(j); hence, ψ(V̄) also satisfies (37). By the arguments showing part (b) of Theorem 1, any feasible solution of the inequalities (36) and (37), in particular ψ(V̄), must upper bound the optimal solution, i.e., ψ(V̄) ≥ w̄. Consequently, V̄ = φ(ψ(V̄)) ≥ φ(w̄). Earlier, we showed that φ(w̄) ≥ V̄; hence, V̄ = φ(w̄).
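To make the disjunctive character of the lower-bound constraint (34) explicit, the following worked expansion is included. It is illustrative only and assumes, consistent with the proofs above, that ψ(g)(i) is the minimum of g over the states of Pi and that Pi contains just two states x1, x2:

```latex
% Illustrative expansion of constraint (34) for a two-state partition P_i = {x_1, x_2}.
% A lower bound by a minimum is equivalent to a disjunction of ordinary linear inequalities.
\[
  w(i) \;\ge\; \min_{x \in P_i}\Bigl[\, r(x,u) + \lambda \sum_{k=1}^{R} p(x,u,k)\, w(k) \Bigr]
  \quad\Longleftrightarrow\quad
  \bigvee_{m \in \{1,2\}} \Bigl[\, w(i) \;\ge\; r(x_m,u) + \lambda \sum_{k=1}^{R} p(x_m,u,k)\, w(k) \Bigr].
\]
```

Theorem 2 below shows that, for consistent partitions, only the terms corresponding to the maximal (respectively, minimal) states of each partition need to be retained, and that the disjunctions disappear altogether when the partitions are extremal.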


Theorem 2. Let c > 0 and GP be a consistent partitioning scheme of order R. For each i, let Pmax,i and Pmin,i be the maximal and minimal sets of Pi respectively. Then,
(a) V̲ = φ(w̲) if and only if w̲ is the optimal solution of the following Disjunctive Bounding LP (DBLP), i.e.,
w̲ := argmin_w c · w, subject to
w ≥ ψmax(r(u) + λP(u)φ(w)), ∀u, (38)
w(j) ≥ w(i), ∀Pi ⪰ Pj. (39)
Similarly, the upper bound V̄ = φ(w̄), where w̄ is the optimal solution of
w̄ := argmin_w c · w, subject to
w ≥ ψmin(r(u) + λP(u)φ(w)), ∀u, (40)
w(j) ≥ w(i), ∀Pi ⪰ Pj. (41)
(b) If every Pi is an extremal partition, i.e., Pmax,i = {s̄i} and Pmin,i = {s̲i}, the lower bound V̲ = φ(w̲), where w̲ is the optimal solution of the following LP (LBLP):
w̲ := argmin_w c · w, subject to
w(i) ≥ r(s̄i, u) + λ Σ_{k=1}^{R} p(s̄i, u, k) w(k), ∀u, (42)
w(j) ≥ w(i), ∀Pi ⪰ Pj.
Similarly, the upper bound V̄ = φ(w̄), where w̄ is the optimal solution of the following LP (UBLP):
w̄ := argmin_w c · w, subject to
w(i) ≥ r(s̲i, u) + λ Σ_{k=1}^{R} p(s̲i, u, k) w(k), ∀u, (43)
w(j) ≥ w(i), ∀Pi ⪰ Pj.

Proof: (a) Let w1 and w2 be the optimal solutions of the disjunctive LPs in part (a) of Lemma 2 and of Theorem 2, respectively. Since Pmax,i ⊂ Pi, for every u,
w2(i) ≥ ψmax(r(u) + λP(u)φ(w2))(i) ≥ ψ(r(u) + λP(u)φ(w2))(i).
Since w2(i) ≥ w2(j) for all Pj ⪰ Pi, it follows that w2 satisfies constraints (34) and (35); hence, w2 ≥ w1 by a reasoning analogous to parts (a) and (c) of Theorem 1.
In order to show that w1 satisfies constraints (38) and (39), consider a state s ∈ Pi such that s ∉ Pmax,i. Then, by the definition of the maximal set, there exists a z ∈ Pmax,i such that z ⪰ s; consequently, from PS, r(z, u) ≤ r(s, u) for every u and f(z, u, d) ⪰ f(s, u, d) for every u, d. Also,
r(s, u) + λ Σ_{k=1}^{R} p(s, u, k) w1(k) = r(s, u) + λ Σ_{l=1}^{L} p_l w1(f(s, u, d_l)).
Since f(z, u, d) ⪰ f(s, u, d) for every u, d, consistent partitioning implies that either (a) f(z, u, d) and f(s, u, d) belong to the same partition, or (b) if f(z, u, d) ∈ Pi′ and f(s, u, d) ∈ Pj′, then Pi′ ⪰ Pj′. Consequently, w1(f(s, u, d_l)) ≥ w1(f(z, u, d_l)), as w1 satisfies the partial order constraint (35). Together with r(s, u) ≥ r(z, u), we get:
r(s, u) + λ Σ_{l=1}^{L} p_l w1(f(s, u, d_l)) ≥ r(z, u) + λ Σ_{l=1}^{L} p_l w1(f(z, u, d_l)) = r(z, u) + λ Σ_{k=1}^{R} p(z, u, k) w1(k).
The above inequality implies that
min_{x∈Pi} [ r(x, u) + λ Σ_{k=1}^{R} p(x, u, k) w1(k) ] = min_{x∈Pmax,i} [ r(x, u) + λ Σ_{k=1}^{R} p(x, u, k) w1(k) ],
thereby implying that w1 satisfies the constraints (38) and (39); hence, w1 is feasible for the disjunctive LP in part (a) of Theorem 2. So, analogous to the proof of parts (a) and (c) of Theorem 1, w1 ≥ w2. Hence, w2 = w1 and the optimal solutions of the disjunctive LPs in part (a) of Lemma 2 and of Theorem 2 are identical. An analogous proof holds for the upper bound as well.
(b) The simpler LPs follow when the partitions are extremal, i.e., Pmax,i = {s̄i} and Pmin,i = {s̲i}.
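The LBLP/UBLP of part (b) are ordinary linear programs in R variables and are therefore straightforward to assemble. The sketch below is illustrative only: the 4-state, 2-action MDP data are random, and the partition map, the representative "maximal"/"minimal" states and the partition partial order (`part`, `s_max`, `s_min`, `order` and the helpers `agg_row`, `bound_lp` are hypothetical names, not from the paper) are chosen by hand; in the paper they follow from the problem structure, and the bracketing of V∗ is guaranteed only under those assumptions.

```python
# Illustrative sketch (hypothetical data): builds the aggregated LBLP / UBLP of
# Theorem 2(b) for a toy 4-state, 2-action MDP with two partitions P_1 = {0, 1},
# P_2 = {2, 3}.  With arbitrary data the bracketing phi(w_lb) <= V* <= phi(w_ub)
# is NOT guaranteed; it holds under the paper's assumptions (PS, consistent
# extremal partitions).  Only the LP structure follows (42)-(43).
import numpy as np
from scipy.optimize import linprog

lam = 0.9
nX, nU, R = 4, 2, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nX), size=(nU, nX))      # P[u][x][:] = p(x, u, .)
r = rng.uniform(0.0, 1.0, size=(nU, nX))           # r[u][x] = r(x, u)

part = np.array([0, 0, 1, 1])                      # partition index of each state
s_max = [1, 3]                                     # assumed maximal state of each P_i
s_min = [0, 2]                                     # assumed minimal state of each P_i
order = [(0, 1)]                                   # (i, j) meaning P_i >= P_j, so w(j) >= w(i)

def agg_row(x, u):
    """Aggregate transition row: q[k] = sum of p(x, u, z) over z in P_k."""
    q = np.zeros(R)
    for z in range(nX):
        q[part[z]] += P[u, x, z]
    return q

def bound_lp(reps):
    """min 1.w  s.t.  w(i) >= r(s_i, u) + lam * q.w  and the partial-order constraints."""
    A, b = [], []
    for i, s in enumerate(reps):
        for u in range(nU):
            A.append(-np.eye(R)[i] + lam * agg_row(s, u))   # -(w(i) - lam q.w) <= -r
            b.append(-r[u, s])
    for (i, j) in order:                                    # w(j) >= w(i)
        row = np.zeros(R); row[i], row[j] = 1.0, -1.0
        A.append(row); b.append(0.0)
    res = linprog(np.ones(R), A_ub=np.array(A), b_ub=np.array(b), bounds=[(None, None)] * R)
    return res.x

w_lb = bound_lp(s_max)          # LBLP, constraint (42)
w_ub = bound_lp(s_min)          # UBLP, constraint (43)

V = np.zeros(nX)                # exact value function, for comparison
for _ in range(3000):
    V = np.max(r + lam * (P @ V), axis=0)
print("phi(w_lb):", np.round(w_lb[part], 3))
print("V*       :", np.round(V, 3))
print("phi(w_ub):", np.round(w_ub[part], 3))
```

Both LPs have R variables and R·|U| Bellman-type constraints plus the partial-order constraints, which is the "most efficient scheme" referred to in the concluding discussion.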


Lemma 3. B ≠ ∅; moreover, corresponding to every Pj ∉ B, there is a Pi ∈ B such that Pi ⪰ Pj and w̲(i) = w̲(j).

Proof: The set B keeps track of the distinct values of the components of w̲. Clearly, it contains at least one element, the extreme case corresponding to all the components being the same. Hence, B ≠ ∅. We observe that w̲ continues to be the optimal solution of DBLP even if the Bellman constraints (38) corresponding to Pj ∉ B are dropped, i.e.,
w̲ = argmin_w c · w, subject to
w(i) ≥ ψ(r(u) + λP(u)φ(w))(i), ∀i : Pi ∈ B,
w(j) ≥ w(i), ∀Pi ⪰ Pj.
We claim that for every Pj ∉ B, there is a Pi ∈ B such that Pi ⪰ Pj and w̲(i) = w̲(j). If no such Pi ∈ B exists, then either there is no Pi ⪰ Pj, or there is a Pi ⪰ Pj but the corresponding w̲(j) > w̲(i). In the former case, there is no constraint of the form w(j) ≥ w(i) for any Pi ∈ B; in the latter case, one can drop the partial ordering constraints corresponding to w(j) from the set (39) without affecting the optimal solution, because they are not binding at the optimum. So we are left only with constraints of the form
w(j) ≥ w(l), ∀Pl ⪰ Pj with Pl, Pj ∉ B.

This implies that the component w(j) is unconstrained (not bounded from below) and can be reduced indefinitely; in particular, the same holds for the j-th component of the optimal solution, w̲(j). Since c > 0, it follows that the optimal value c · w̲ cannot be bounded from below; this, however, contradicts the boundedness result in part (a) of Theorem 1.

Theorem 3. For every π ∈ Πs,
V∗ ≥ Vπsub ≥ V̲ = φ(w̲) = φ(wπsub) ≥ φ(wπ).
In other words, the performance of the sub-optimal policy πsub is guaranteed to be no less than the lower bound V̲; moreover, V̲ equals the greatest lower bound that can be computed using PDBLP(π) over all policies π ∈ Πs.

Proof: The inequality V∗ ≥ Vπsub is obvious by virtue of the sub-optimality of the policy πsub. The inequality Vπsub ≥ φ(wπsub) follows from parts (a), (c), (d) and (f) of Theorem 1 by restricting the set of control actions at every state x to πsub(x). It is clear that the optimal solution w̲ of DBLP satisfies the constraints of PDBLP(π) for all π ∈ Πs, and hence w̲ ≥ wπ; in particular, w̲ ≥ wπsub. It therefore suffices to show that w̲ ≤ wπsub, as that would imply V̲ = φ(w̲) = φ(wπsub). Corresponding to Pi ∈ B and by the definition (25) of the policy πsub, we have:
w̲(i) = ψ(rπsub + λPπsub φ(w̲))(i),
wπsub(i) ≥ ψ(rπsub + λPπsub φ(wπsub))(i).
Let e = w̲ − wπsub, and let
x = argmin_{x∈Pi} [ r(x, πsub(x)) + λ Σ_{k=1}^{R} p(x, πsub(x), k) wπsub(k) ].
Then we have:
e(i) ≤ r(x, πsub(x)) + λ Σ_{k=1}^{R} p(x, πsub(x), k) w̲(k) − r(x, πsub(x)) − λ Σ_{k=1}^{R} p(x, πsub(x), k) wπsub(k).
Let ‖e‖ = max_i |e(i)|. Then:
e(i) ≤ λ Σ_{k=1}^{R} p(x, πsub(x), k) e(k) ≤ λ‖e‖.
Hence, ∀i : Pi ∈ B, e(i) ≤ λ‖e‖. For a partition Pj ∉ B, there exists, by Lemma 3, a conjugate index Pi ∈ B such that Pi ⪰ Pj and w̲(i) = w̲(j). However, if Pi ⪰ Pj, then wπsub(j) ≥ wπsub(i). Hence (since w̲(j) = w̲(i) and wπsub(j) ≥ wπsub(i)),
e(j) ≤ e(i) ≤ λ‖e‖ ⇒ max_k e(k) ≤ λ‖e‖.
Since e ≥ 0, ‖e‖ = max_k e(k). This implies that ‖e‖ ≤ λ‖e‖ ⇒ (1 − λ)‖e‖ ≤ 0 ⇒ ‖e‖ ≤ 0 ⇒ ‖e‖ = 0. So w̲ = wπsub. Moreover, since w̲ ≥ wπ for all π ∈ Πs, V̲ = φ(w̲) is the greatest lower bound that can be computed using PDBLP(π) over all policies π ∈ Πs.
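In practice, the guarantee of Theorem 3 is used by extracting a stationary policy from the piecewise-constant approximation φ(w̲) and evaluating it. The sketch below is illustrative only: it uses the generic one-step greedy construction with respect to φ(w) on hypothetical data (the paper's πsub in (25) is built from the DBLP solution and the set B, so this is an approximation of that construction, not the paper's algorithm), and `greedy_policy_and_value` is a made-up helper.

```python
# Illustrative sketch (hypothetical): given an approximate value function of the
# aggregated form phi(w)(x) = w[part[x]], extract the one-step greedy policy and
# evaluate it exactly.  The data below mirror the earlier toy example.
import numpy as np

def greedy_policy_and_value(P, r, lam, phi_w):
    """P[u,x,z] = p(x,u,z), r[u,x] = r(x,u); returns (pi, V_pi)."""
    Q = r + lam * (P @ phi_w)                 # one-step lookahead values Q[u, x]
    pi = np.argmax(Q, axis=0)                 # greedy action at each state
    nX = P.shape[1]
    P_pi = P[pi, np.arange(nX)]               # P_pi[x, :] = p(x, pi(x), .)
    r_pi = r[pi, np.arange(nX)]
    V_pi = np.linalg.solve(np.eye(nX) - lam * P_pi, r_pi)   # exact policy value
    return pi, V_pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)                     # same toy MDP as the LBLP sketch
    P = rng.dirichlet(np.ones(4), size=(2, 4))
    r = rng.uniform(0.0, 1.0, size=(2, 4))
    part = np.array([0, 0, 1, 1])
    w = np.array([1.0, 2.0])                           # placeholder for the DBLP/LBLP solution
    pi, V_pi = greedy_policy_and_value(P, r, 0.9, w[part])
    print(pi, np.round(V_pi, 3))
```

Under the paper's assumptions, the evaluated policy value dominates the lower bound φ(w̲) component-wise, which is exactly the performance guarantee established above.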

REFERENCES

[1] P. J. Schweitzer and A. Seidmann, “Generalized polynomial approximations in Markovian decision processes,” Journal of Mathematical Analysis and Applications, vol. 110, no. 2, pp. 568–582, 1985.
[2] M. Trick and S. Zin, “Spline approximation to value functions: A linear programming approach,” Macroeconomic Dynamics, vol. 1, pp. 255–277, 1997.
[3] B. Van Roy, “Performance loss bounds for approximate value iteration with state aggregation,” Mathematics of Operations Research, vol. 31, no. 2, pp. 234–244, 2006.
[4] D. Kingston, R. Beard, and R. Holt, “Decentralized perimeter surveillance using a team of UAVs,” IEEE Transactions on Robotics, vol. 24, no. 6, pp. 1394–1404, 2008.
[5] D. A. Paley, L. Techy, and C. A. Woolsey, “Coordinated perimeter patrol with minimum-time alert response,” in Proc. Guidance, Navigation and Control Conf., no. AIAA 2009-6210, Chicago, IL, 2009.
[6] J. Marier, C. Besse, and B. Chaib-draa, “Solving the continuous time multiagent patrol problem,” in IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 941–946.
[7] A. Marino, L. Parker, G. Antonelli, F. Caccavale, and S. Chiaverini, “A fault-tolerant modular control approach to multi-robot perimeter patrol,” in Proc. IEEE International Conf. on Robotics and Biomimetics (ROBIO), 2009, pp. 735–740.
[8] S. Darbha, K. Krishnamoorthy, M. Pachter, and P. Chandler, “State aggregation based linear programming approach to approximate dynamic programming,” in Proc. IEEE Conf. Decision and Control, 2010, pp. 935–941.
[9] K. Krishnamoorthy, M. Pachter, P. Chandler, D. Casbeer, and S. Darbha, “UAV perimeter patrol operations optimization using efficient dynamic programming,” in Proc. American Control Conf., 2011, pp. 462–467.
[10] K. Krishnamoorthy, M. Pachter, P. Chandler, and S. Darbha, “Optimization of perimeter patrol operations using UAVs,” AIAA J. Guidance, Control and Dynamics, vol. 35, no. 2, pp. 434–441, 2012.
[11] K. Krishnamoorthy, M. Pachter, S. Darbha, and P. Chandler, “Approximate dynamic programming with state aggregation applied to UAV perimeter patrol,” Internat. J. of Robust and Nonlinear Control, vol. 21, no. 12, pp. 1396–1409, 2011.
[12] K. Krishnamoorthy, S. Darbha, M. Park, P. Chandler, M. Pachter, and D. Casbeer, “Lower bounding linear program for the perimeter patrol optimization problem,” AIAA J. Guidance, Control and Dynamics, vol. 37, no. 2, pp. 558–565, March 2014.


[13] M. Park, S. Darbha, K. Krishnamoorthy, P. P. Khargonekar, M. Pachter, and P. Chandler, “Sub-optimal stationary policies for a class of stochastic optimization problems arising in robotic surveillance applications,” in Proceedings of the 5th Annual Dynamic Systems and Control Conference, no. DSCC2012-8610, ASME, Oct 2012.
[14] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman, “Efficient solution algorithms for factored MDPs,” Journal of Artificial Intelligence Research, vol. 19, pp. 399–468, 2003.
[15] A. Manne, “Linear programming and sequential decisions,” Management Science, vol. 6, no. 3, pp. 259–267, 1960.
[16] F. d’Epenoux, “A probabilistic production and inventory problem,” Management Science, vol. 10, no. 1, pp. 98–108, 1963.
[17] E. Porteus, “Bounds and transformations for discounted finite Markov decision chains,” Operations Research, vol. 23, no. 4, pp. 761–784, 1975.
[18] Y. Liu, C. Chu, and K. Wang, “Aggregated state dynamic programming for operating theater planning,” in IEEE Conference on Automation Science and Engineering, Toronto, ON, Canada, 2010, pp. 1013–1018.
[19] Y. Virin, G. Shani, S. E. Shimony, and R. I. Brafman, “Scaling up: Solving POMDPs through value based clustering,” in Proceedings of the National Conference on Artificial Intelligence, Vancouver, British Columbia: Association for the Advancement of Artificial Intelligence (AAAI), July 2007, pp. 1290–1295.
[20] J.-Y. Choi and S. Reveliotis, “Relative value function approximation for the capacitated re-entrant line scheduling problem,” IEEE Transactions on Automation Science and Engineering, vol. 2, no. 3, pp. 285–299, 2005.
[21] K. Krishnamoorthy, M. Park, M. Pachter, P. Chandler, and S. Darbha, “Bounding procedure for stochastic dynamic programs with application to the perimeter patrol problem,” in Proc. American Control Conf., Montreal, QC, Canada, June 2012, pp. 5874–5882.
[22] R. Howard, Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press, 1960.
[23] M. Park, K. Krishnamoorthy, M. Pachter, S. Darbha, and P. Chandler, “State partitioning based linear program for stochastic dynamic programs: an invariance property,” Operations Research Letters, vol. 40, no. 6, pp. 487–491, Nov. 2012.
[24] D. P. De Farias and B. Van Roy, “The linear programming approach to approximate dynamic programming,” Operations Research, vol. 51, no. 6, pp. 850–865, 2003.

Myoungkuk Park received the B.S. degree in Mechanical Engineering from Kyunghee University in 2002 and the M.S. degree in Mechanical Engineering from Korea University in 2004. He is currently a graduate student pursuing a Ph.D. in the Mechanical Engineering department at Texas A&M University. His research interests are control systems, with applications to robotics and UAVs, and large-scale stochastic dynamic programs.

Dr. Swaroop Darbha received his Bachelor of Technology from the Indian Institute of Technology, Madras in 1989, and M.S. and Ph.D. degrees from the University of California in 1992 and 1994 respectively. He was a post-doctoral researcher at the California PATH program from 1995 to 1996. He has been on the faculty of Mechanical Engineering at Texas A&M University since 1997, where he is currently a professor. His current research interests lie in the development of diagnostic systems for air brakes in trucks and in the development of planning, control and resource allocation algorithms for collections of Unmanned Aerial Vehicles.

Dr. Krishnamoorthy Kalyanam received the B.Tech. degree in Mechanical Engineering from the Indian Institute of Technology, Madras in 2000, and the M.S. and Ph.D. degrees in Mechanical Engineering from the University of California at Los Angeles in 2003 and 2005, respectively. In October 2005, he joined GE Global Research in Bangalore, India as a Research Engineer, where he worked on train optimal control and wind farm layout optimization. In July 2009, he moved to the U.S. Air Force Research Laboratory (AFRL) as a National Research Council sponsored Research Associate. Since then, he has been working on autonomous systems and cooperative control of unmanned air vehicles. He is currently with the Infoscitex Corporation and is an on-site research scientist/contractor at AFRL.

Dr. Pramod Khargonekar has worked at the Universities of Florida, Minnesota, and Michigan. He was Chairman of the EECS Department and held the Shannon Chair at Michigan. He was Dean of Engineering and holds the Eckis Professorship at Florida. He is a recipient of the NSF Presidential Young Investigator Award, the American Automatic Control Council’s Eckman Award, the IEEE Baker Prize Award, the IEEE CSS Axelby Best Paper Award, the Hugo Schuck ACC Best Paper Award, and the Distinguished Alumnus and Distinguished Service Awards from IIT Bombay. He is a Fellow of the IEEE and a Web of Science Highly Cited Researcher.

Dr. Meir Pachter is a Professor of Electrical Engineering at the Air Force Institute of Technology, Wright-Patterson AFB. Dr. Pachter received the BS and MS degrees in Aerospace Engineering in 1967 and 1969 respectively, and the Ph.D. degree in Applied Mathematics in 1975, all from the Israel Institute of Technology. Dr. Pachter held research and teaching positions at the Israel Institute of Technology, the Council for Scientific and Industrial Research in South Africa, Virginia Polytechnic Institute, Harvard University and Integrated Systems, Inc. Dr. Pachter is interested in the application of mathematics to the solution of engineering and scientific problems. His current areas of interest include military operations optimization, dynamic games, cooperative control, estimation and optimization, statistical signal processing, adaptive optics, inertial navigation, and GPS navigation. For his work on adaptive and reconfigurable flight control, he received AFRL’s prestigious General Foulois award in 1994. Dr. Pachter is a Fellow of the IEEE.

Mr. Phil Chandler is the tech advisor for the Autonomous Control branch of the Air Force Research Lab (AFRL), Wright-Patterson Air Force Base. He received his B.S. and M.S. degrees from Wright State University. He was the principal architect of the Self Repairing Flight Control System advanced development program and recipient of AFRL’s prestigious General Foulois award in 1994. He is currently researching autonomous control algorithms for unmanned combat aerial vehicles.
