Simulation-based optimization of Markov decision processes and Multi-armed bandits: An empirical process theory approach

Rahul Jain and Pravin Varaiya

Abstract— We generalize the PAC learning framework for Markov decision processes developed in [18]. We consider the reward function to depend on both the state and the action. Both the state and the action spaces can potentially be countably infinite. We obtain an estimate of the value function of a Markov decision process, which assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the empirical average to converge uniformly to the expected reward over a class of policies, in terms of the VC or pseudo-dimension of the policy class. We then propose a framework for obtaining an ε-optimal policy from simulation and provide the sample complexity of this approach.

Index Terms— Markov decision processes, Markov games, empirical process theory, PAC learning, value function estimation, uniform rate of convergence, simulation-based optimization.

The first author is with the IBM TJ Watson Research Center, Hawthorne, NY, and can be reached at [email protected]. The second author is with the EECS department at the University of California, Berkeley, and can be reached at [email protected]. This paper is a generalization and extension of results published in [18], and was presented at the Conference on Stochastic Processes and Applications (SPA), Urbana-Champaign, IL, August 2007.

I. INTRODUCTION

It is well known that solving Markov decision processes using dynamic programming is computationally intractable. Thus various approximate dynamic programming techniques, as well as simulation-based techniques, have been developed. We propose an empirical process theory approach to simulation-based optimization of Markov decision processes.

Empirical process theory (EPT) [12] studies the uniform behavior of a class G of measurable functions in the law-of-large-numbers (as well as the central-limit-theorem [3]) regime. In particular, EPT studies the conditions under which

$$\Pr\Big\{\sup_{g\in G}\Big|\frac{1}{n}\sum_{i=1}^{n} g(X_i) - E_P[g(X)]\Big| > \epsilon\Big\} \to 0 \qquad (1)$$

and the rate of convergence. Convergence results in EPT typically use concentration-of-measure inequalities such as those of Chernoff and Hoeffding. The rate of convergence of an empirical average to the expected value depends on the exponent in the upper bound of such inequalities. Led by Talagrand, there has been a lot of effort to improve this exponent [15], [7].

The goal of this paper is to extend the reach of this rich and rapidly developing theory to Markov decision processes

and Multi-armed bandit problems, and to use this framework to solve the optimal policy search problem. Consider an MDP with a set of policies Π. The value function assigns to each π ∈ Π its expected discounted reward V(π). We estimate V from independent samples of the discounted reward by the empirical mean $\hat V(\pi)$. We obtain the number of samples n(ε, δ) (the sample complexity) needed so that

$$\Pr\Big\{\sup_{\pi\in\Pi}|\hat V(\pi)-V(\pi)|>\epsilon\Big\} < \delta. \qquad (2)$$
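As a point of orientation (this warm-up calculation is ours and is not part of the paper's argument): if Π were finite and the untruncated discounted reward could be sampled exactly, each sample would lie in [0, R/(1−γ)], and Hoeffding's inequality with a union bound over Π already gives

$$\Pr\Big\{\sup_{\pi\in\Pi}|\hat V(\pi)-V(\pi)|>\epsilon\Big\} \;\le\; 2|\Pi|\exp\Big(-\frac{2n\epsilon^2(1-\gamma)^2}{R^2}\Big),$$

so $n(\epsilon,\delta) = \frac{R^2}{2\epsilon^2(1-\gamma)^2}\log\frac{2|\Pi|}{\delta}$ samples per policy suffice. The machinery developed below replaces the $\log|\Pi|$ term by covering numbers (controlled by the P-dimension), so that the statement extends to infinite policy classes.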

Our approach is broadly inspired by [4], [16], [17] and influenced by [6]. Thus, we would like to reduce the problem in equation (2) to understanding the geometry of Π in terms of its covering number. (The covering number, when finite, is the minimal number of elements needed to approximate any element of Π to a given accuracy.) We first relate the covering numbers of the space of stationary stochastic policies and the space of Markov chains that they induce. We relate these to the space of simulation functions that simulate the policies when the policy space is convex. These results together yield the rate of convergence of the empirical estimate to the expected value for discounted-reward MDPs. What makes the problem non-trivial is that obtaining the empirical discounted reward from simulation involves an iteration of simulation functions. The geometry of the space of iterated simulation functions is much more complex than that of the original space.

The problem of uniform convergence of the empirical average to the value function for discounted MDPs was studied in [5], [10] in a machine learning context. While [5] considered only finite state and action spaces, [10] obtains conditions for uniform convergence in terms of the simulation model, i.e., in terms of geometric characteristics (such as covering numbers or the P-dimension) of the simulation function space rather than of the more natural policy space. Large-deviations results for finite state and action spaces, for the empirical state-action frequencies and general reward functions, were obtained in [1], [8]. A different approach, more akin to importance sampling, is explored in [11].

While the problem of uniform estimation of the value function for discounted and average-reward partially observed MDPs is of interest in itself, we also present a framework for simulation-based optimal policy search. Simulation-based estimates such as those proposed in this paper have been used in a gradient-based method for

finding Nash equilibrium policies in a pursuit-evasion game problem [14], though the theoretical understanding is far from complete. Other simulation-based methods to find approximations of the optimal policy include [9], [2], [19].

This paper extends the results in [18] from the case where the reward function depends only on the state to the case where it depends on both the state and the action. It then presents how this framework can be used to find approximately optimal policies. We finally conclude with a simulation-based optimization framework for Multi-armed bandits.

II. PRELIMINARIES

Consider an MDP M, with state space X and action space A, transition probability function $P_a(x,x')$, initial state distribution $\bar\lambda$, reward function r(x, a) (which depends on both state and action) with values in [0, R], and discount factor 0 < γ < 1. The value function for a policy π is the expected discounted reward

$$V(\pi) = E\Big[\sum_{t=1}^{\infty} \gamma^t r(x_t, a_t)\Big],$$

in which $(x_t, a_t)$ is the state-action pair at the t-th step under policy π. Let $\Pi_0$ denote the space of all stationary stochastic policies $\{\pi(x,a) : a \in A,\ x \in X,\ \sum_a \pi(x,a) = 1\}$ and let $\Pi \subseteq \Pi_0$ be the subset of policies of interest. The MDP M under a fixed stationary policy π induces a Markov chain with transition probability function $P_\pi(x,x') = \sum_a P_a(x,x')\pi(x,a)$. The initial distribution on the Markov chains is λ, and we identify $P_\pi$ with the Markov chain. Denote $P := \{P_\pi : \pi \in \Pi\}$.

Let X be an arbitrary set and λ a probability measure on X. Given a set F of real-valued functions on X and a metric ρ on R, let $d_{\rho(\lambda)}$ be the pseudo-metric on F with respect to the measure λ,

$$d_{\rho(\lambda)}(f,g) = \int \rho(f(x), g(x))\,\lambda(dx).$$

A subset G ⊆ F is an ε-net for F if for every f ∈ F there exists g ∈ G with $d_{\rho(\lambda)}(f,g) < \epsilon$. The size of the minimal ε-net is the covering number, denoted $N(\epsilon, F, d_{\rho(\lambda)})$. The ε-capacity of F under the metric ρ is $C(\epsilon, F, \rho) = \sup_\lambda N(\epsilon, F, d_{\rho(\lambda)})$. Essentially, the ε-net can be seen as a subset of functions that can ε-approximate any function in F. The covering number is a measure of the richness of the function class: the richer it is, the more approximating functions we need for a given approximation accuracy ε. The capacity makes this independent of the underlying measure λ on X. (See [6] for an elegant treatment of covering numbers.)

Let F be a set of real-valued functions from X to [0, 1]. Say that F P-shatters $\{x_1, \cdots, x_n\}$ if there exists a witness vector $c = (c_1, \cdots, c_n)$ such that the set $\{(\eta(f(x_1)-c_1), \cdots, \eta(f(x_n)-c_n)) : f \in F\}$ has cardinality $2^n$, where η(·) is the sign function. The largest such n is P-dim(F). This is a generalization of the VC-dimension, and for {0,1}-valued functions the two definitions are equivalent.
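For later reference (this bound is not restated elsewhere in the paper; the exact constants follow the standard covering-number result of Haussler [4], see also [17]): for a class F of functions taking values in [0, R] with P-dim(F) = d, the ε-capacity under the $L_1$ metric satisfies, up to those constants,

$$C(\epsilon, F, d_{L_1}) \;\le\; 2\Big(\frac{2eR}{\epsilon}\log\frac{2eR}{\epsilon}\Big)^{d}.$$

It is bounds of this form that produce the $\big(\tfrac{2eR}{\alpha}\log\tfrac{2eR}{\alpha}\big)^d$-type factors in the sample complexity results below.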

Other combinatorial dimensions, such as the fat-shattering dimension introduced in [?], yield both upper and lower bounds on the covering numbers, but in this paper we will use the P-dim. Results using the fat-shattering dimension can be established similarly.

III. THE SIMULATION MODEL

We estimate the value V(π) of a policy π ∈ Π from independent samples of the discounted rewards. The samples are generated by a simulation 'engine' $(q_\pi, h)$. This is a deterministic function to which we feed two noise sequences $\nu = (\nu_1, \nu_2, \cdots)$ and $\omega = (\omega_1, \omega_2, \cdots)$ (with $\nu_i$ and $\omega_i$ i.i.d. and uniform on Ω = [0, 1]) and different initial states and actions $(x_0^1, a_0^1), \cdots, (x_0^n, a_0^n)$ (with $(x_0^i, a_0^i)$ i.i.d. with distribution λ). (Note that we have put an initial distribution on the initial actions, but these play no role: the action at t = 1, $a_1$, is determined by the policy, which is a function of $x_0$.) The engine then generates a state and action sequence, with the state sequence distributed as the Markov chain corresponding to π, $P_\pi(x,y) = \sum_a P_a(x,y)\pi(x,a)$. The estimate of V(π) is the average of the total discounted rewards starting from the different initial states. Because simulation cannot be performed indefinitely, we truncate the simulation at some time T after which the contribution to the total discounted reward falls below ε/2 for the required estimation error bound ε; T is the ε/2-horizon time. The function $h : X \times A \times \Omega \to X$ gives the next state x' given that the current state is x, the action taken is a, and the noise is $\omega_i$. Many simulation functions are possible. We will work with the following simple simulation model, for X = N and $A = [1:N_A]$, with $N_A$ finite.

Definition 1 (Simple simulation model): The simple simulation model $(q_\pi, h)$ for a given MDP with policy π ∈ Π is given by

$$q_\pi(x,\nu) = \inf\{b \in A : \nu \in [Q_{\pi,x}(b-1), Q_{\pi,x}(b))\}, \quad \text{where } Q_{\pi,x}(b) := \sum_{b' \le b} \pi(x,b'),$$

and

$$h(x,a,\omega) = \inf\{y \in X : \omega \in [F_{a,x}(y-1), F_{a,x}(y))\},$$

in which $F_{a,x}(y) := \sum_{y' \le y} P_a(x,y')$ is the c.d.f. corresponding to the transition probability function $P_a(x,y)$.

This is the simplest method of simulation: for example, to simulate a probability distribution on a discrete state space, we partition the unit interval so that the first subinterval has length equal to the mass on the first state, the second subinterval has length equal to the mass on the second state, and so on. It is a surprising fact that there are other simulation functions h' that generate the same distribution but have a much larger complexity than h. The state and action sequence $\{(x_t, a_t)\}_{t=0}^{\infty}$ for policy π is obtained by

$$a_{t+1} = q_\pi(x_t, \nu_{t+1}), \qquad x_{t+1} = f_\pi(x_t, \nu_{t+1}, \omega_{t+1}) = h(x_t, q_\pi(x_t, \nu_{t+1}), \omega_{t+1}),$$

where $\nu_t$ and $\omega_t \in \Omega$ are noises. The initial state-action pair $(x_0, a_0)$ is drawn according to the given initial state-action distribution λ (with marginal distribution $\bar\lambda$ over X). Denote $z_t = (x_t, a_t) \in Z := X \times A$ and $\xi_t = (\nu_t, \omega_t) \in \Omega_2 := \Omega \times \Omega$; then

$$s_\pi(z_t, \xi_t) = z_{t+1} = (q_\pi(x_t, \nu_{t+1}),\ f_\pi(x_t, \nu_{t+1}, \omega_{t+1})).$$

The function $s_\pi : Z \times \Omega_2 \to Z$ is called the simulation system for the policy π. We denote $Q = \{q_\pi : \pi \in \Pi\}$, $F = \{f_\pi : \pi \in \Pi\}$, and the set of all simulation systems induced by Π by $S = \{s_\pi : \pi \in \Pi\} = Q \times F$. Let µ denote the Lebesgue measure on $\Omega_2$. Then,

$$\Pr\{z_{t+1} = z' \mid z_t = z, \pi\} = \mu\{\xi : s_\pi(z,\xi) = z'\}.$$

Unless specified otherwise, S will denote the set of simulation functions for the policy space Π under the simple simulation model. The question now is how the complexity of the space Q compares with that of the policy space Π. We connect the two when Π is convex.

Lemma 1: Suppose Π is convex with P-dimension d. Let Q be the corresponding space of simple simulation functions that simulate Π. Then P-dim(Q) = d. Moreover, the algebraic dimension of Q is also d.

The proof essentially follows the argument in the proof of Lemma 4.2 in [18] and will not be repeated here. (We refer the reader to [18] for a preliminary discussion of the P-dimension, ε-covering numbers and ε-capacity.)

IV. SAMPLE COMPLEXITY FOR DISCOUNTED-REWARD MDPS

Consider an MDP M with countably infinite state space X = N, finite action space $A = [1:N_A]$ (for some $N_A \in \mathbb N$), transition probability function $P_a(x,x')$, initial state distribution $\bar\lambda$, reward function r(x, a), and discount factor γ < 1. The value function is the total discounted reward for a policy π in some set of stationary policies Π.

Let θ be the left-shift operator on $\Omega^\infty$, $\theta(\omega_1, \omega_2, \cdots) = (\omega_2, \omega_3, \cdots)$. We redefine Q to be the set of measurable functions from $W := X \times A \times \Omega^\infty \times \Omega^\infty$ onto itself which simulate Π, the set of given stationary policies, under the simple simulation model to generate the action sequence. The action-simulation function $q_\pi(x, a, \nu, \omega)$ may depend only on the current state x (though we define a more general function for notational simplicity) and on $\nu_1$, the first component of the sequence $\nu = (\nu_1, \nu_2, \cdots)$ (with $\nu_i$ i.i.d. and uniform on [0,1]). Similarly, we redefine F to be the set of measurable functions from W onto itself which simulate Π under the simple simulation model to generate the state sequence. Thus, the state-simulation functions $f_\pi(x, a, \nu, \omega)$ may depend only on the current state x (but again we define a more general function) and on $\xi_1$, the first component of the sequence $\xi = (\xi_1, \xi_2, \cdots)$ (with $\xi_i = (\nu_i, \omega_i)$, each i.i.d. and uniform).

As before, $S = Q \times F$ is the set of simulation systems that simulate the policies in Π and generate the state-action sequence; thus, the results and discussion of the previous section hold. For a policy π, our simulation system is

$$(z_{t+1}, \theta\xi) = s_\pi(z_t, \xi),$$

in which $z_{t+1}$ is the next state-action pair starting from the current state-action pair $z_t$, and the simulator also outputs the shifted noise sequence θξ. This definition of the simulation function is introduced to facilitate the iteration of simulation functions. Let $S^2 := \{s \circ s : W \to W,\ s \in S\}$ and let $S^t$ be its generalization to t iterations. Let µ be a probability measure on $\Omega_2^\infty$ and λ the initial distribution on Z. Denote the product measure on W by P = λ × µ, and on $W^n$ by $P^n$. Define the two pseudo-metrics on S:

$$\rho_P(s_1, s_2) = \sum_z \lambda(z)\,\mu\{\xi : s_1(z,\xi) \ne s_2(z,\xi)\}$$

and

$$d_{L_1(P)}(s_1, s_2) := \sum_z \lambda(z) \int |s_1(z,\xi) - s_2(z,\xi)|\,d\mu(\xi).$$

The $\rho_P$ and $d_{L_1(P)}$ pseudo-metrics for the functions in Q and F are defined similarly.
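To make the preceding definitions concrete, here is a minimal sketch (ours, with illustrative toy numbers; not the paper's code) of the simple simulation model of Section III and the truncated empirical value estimate $\hat V_n^T(\pi)$ used below. The inverse-c.d.f. sampling in `inv_cdf` plays the role of $q_\pi$ and $h$, and `T` is chosen so that the discarded tail of the discounted reward is at most ε/2.

```python
import numpy as np

# Toy instance (illustrative numbers; not from the paper).
R, gamma, eps = 1.0, 0.9, 0.1
n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, x, :] = P_a(x, .)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[x, :] = pi(x, .)
r = rng.uniform(0.0, R, size=(n_states, n_actions))               # reward r(x, a) in [0, R]
lam = np.full(n_states, 1.0 / n_states)                           # initial state distribution

# eps/2-horizon time T: choose T with gamma^(T+1) * R / (1 - gamma) <= eps / 2.
T = int(np.ceil(np.log(2.0 * R / (eps * (1.0 - gamma))) / np.log(1.0 / gamma)))

def inv_cdf(p, u):
    """Simple simulation model: the smallest index whose cumulative mass exceeds u."""
    return min(int(np.searchsorted(np.cumsum(p), u, side="right")), len(p) - 1)

def q(x, nu):          # action-simulation function q_pi(x, nu)
    return inv_cdf(pi[x], nu)

def h(x, a, omega):    # state-simulation function h(x, a, omega)
    return inv_cdf(P[a, x], omega)

def truncated_discounted_reward(x0, xi):
    """One rollout of length T + 1 driven by the noise sequence xi = ((nu_t, omega_t))_t."""
    total, x = 0.0, x0
    for t, (nu, omega) in enumerate(xi):
        a = q(x, nu)                        # next action from the policy's c.d.f.
        total += (gamma ** t) * r[x, a]
        x = h(x, a, omega)                  # next state from the transition c.d.f.
    return total

def V_hat(n):
    """Empirical estimate of V(pi): average over n independent truncated rollouts."""
    runs = []
    for _ in range(n):
        x0 = inv_cdf(lam, rng.uniform())    # initial state drawn from lam
        xi = rng.uniform(size=(T + 1, 2))   # i.i.d. uniform noise pairs (nu, omega)
        runs.append(truncated_discounted_reward(x0, xi))
    return float(np.mean(runs))

print(T, V_hat(2000))
```

The indexing of rewards and actions is simplified relative to the paper's notation; the point is only the structure: i.i.d. initial pairs, i.i.d. noise sequences, and averaging of truncated discounted rewards.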

We now present a key technical result, which relates the covering number of the iterated functions $S^t$ under the ρ pseudo-metric to the covering number of Q under the $L_1$ pseudo-metric.

Lemma 2: Let λ be the initial distribution on Z and let $\lambda_s$ be the (one-step) distribution given by $\lambda_s(y) = \sum_z \lambda(z)\,\mu\{\xi : s(z,\xi) = (y, \theta\xi)\}$ for s ∈ S. Suppose that

$$K := \max\Big\{\sup_{s\in S,\, y\in Z} \frac{\lambda_s(y)}{\lambda(y)},\ 1\Big\} < \infty. \qquad (3)$$

Then, $N(\epsilon, S^t, \rho_P) \le N(\epsilon/K^t, Q, d_{L_1(P)})$.

The proof of Lemma 2 is given in the appendix. The condition of the lemma essentially means that, under the distribution λ, the change in the probability mass on any state-action pair under any policy after one transition is uniformly bounded.

The estimation procedure is as follows. Obtain n initial state-action pairs $z_0^{(1)}, \cdots, z_0^{(n)}$ drawn i.i.d. according to λ, and n noise trajectories $\xi^{(1)}, \cdots, \xi^{(n)} \in \Omega_2^\infty$ (Ω = [0,1]) drawn according to µ, the product measure on $\Omega_2^\infty$ of uniform probability measures on $\Omega_2$. Denote the samples by $\Xi_n = \{(z_0^1, \xi^1), \cdots, (z_0^n, \xi^n)\}$, drawn with measure $P^n$. This is our first main result.

Theorem 1: Let (Z, Γ, λ) be the measurable state-action space and r(x, a) the real-valued reward function with values in [0, R]. Let $\Pi \subseteq \Pi_0$, the space of stationary stochastic policies, and let S be the space of simple simulation systems of Π. Suppose that P-dim(Q) ≤ d and that the initial state-action distribution λ is such that $K := \max\{\sup_{s\in S,\, z\in Z} \lambda_s(z)/\lambda(z),\ 1\} < \infty$. Let $\hat V_n^T(\pi)$ be the estimate of V(π) obtained by averaging the reward from n samples, each T steps long. Then, for ε, δ > 0,

$$P^n\Big\{\sup_{\pi\in\Pi} |\hat V_n^T(\pi) - V(\pi)| > \epsilon\Big\} < \delta$$

if

$$n \ge n_0(\epsilon,\delta) := \frac{32R^2}{\alpha^2}\Big(\log\frac{4}{\delta} + 2d\Big(\log\frac{32eR}{\alpha} + T\log K\Big)\Big). \qquad (4)$$

Here T is the ε/2-horizon time and $\alpha = \epsilon/(2(T+1))$.

Proof: Fix a policy π. Let $s_\pi$ be the corresponding simple simulation system that yields the state-action sequence given a noise sequence ξ. Define the function $R_t^\pi(z_0, \xi) := r \circ s_\pi \circ \cdots \circ s_\pi(z_0, \xi)$, with $s_\pi$ composed t times. Let $\mathcal R_t := \{R_t^\pi : Z \times \Omega_2^\infty \to [0,R],\ \pi \in \Pi\}$. Let V(π) be the expected discounted reward and $V^T(\pi)$ the expected discounted reward truncated up to T steps. Let $\hat V_n^T(\pi) = \frac{1}{n}\sum_{i=1}^{n}\big[\sum_{t=0}^{T}\gamma^t R_t^\pi(z_0^i, \xi^i)\big]$. Then,

$$|V(\pi) - \hat V_n^T(\pi)| \le |V(\pi) - V^T(\pi)| + |V^T(\pi) - \hat V_n^T(\pi)| \le |V^T(\pi) - \hat V_n^T(\pi)| + \frac{\epsilon}{2} \le \sum_{t=0}^{T}\Big|\frac{1}{n}\sum_{i=1}^{n}\big[R_t^\pi(z_0^i,\xi^i) - E(R_t^\pi)\big]\Big| + \frac{\epsilon}{2}.$$

The expectation is with respect to the product measure P = λ × µ. We show that, with high probability, each term in the sum over t is bounded by $\alpha = \epsilon/(2(T+1))$. Note that (after abusing the definition of the reward function a bit) we have

$$\int |r(s_1^t(z,\xi)) - r(s_2^t(z,\xi))|\,d\mu(\xi)\,d\lambda(z) \le R\cdot \sum_z \lambda(z)\,\mu\{\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi)\},$$

which, as in Lemma 2, implies that

$$d_{L_1(P)}(r\circ s_1^t,\ r\circ s_2^t) \le R\cdot K^t\, d_{L_1(P)}(q_1, q_2).$$

From Theorem 5.7 in [17] (also Theorem 3 in [4]), Lemma 2, and the inequality above, we get

$$P^n\Big[\sup_{R_t\in\mathcal R_t}\Big|\frac{1}{n}\sum_{i=1}^{n} R_t(z_0^i,\xi^i) - E(R_t)\Big| > \alpha\Big] \le 4\,C(\alpha/16,\mathcal R_t, d_{L_1})\exp\big(-n\alpha^2/32R^2\big) \le 4\Big(\frac{32eRK^T}{\alpha}\log\frac{32eRK^T}{\alpha}\Big)^{d}\exp\big(-n\alpha^2/32R^2\big).$$

This implies that the estimation error is bounded by α with probability at least 1 − δ if the number of samples is

$$n \ge \frac{32R^2}{\alpha^2}\Big(\log\frac{4}{\delta} + 2d\Big(\log\frac{32eR}{\alpha} + T\log K\Big)\Big).$$

V. PARTIALLY OBSERVABLE MDPS WITH GENERAL POLICIES

We now consider partially observed discounted-reward MDPs with general policies (non-stationary, with memory). The setup is as before, except that the policy depends on observations y ∈ Y, governed by the (conditional) probability τ(y|x) of observing y ∈ Y when the state is x ∈ X. Let $h_t$ denote the history $(a_0, y_0, a_1, y_1, \cdots, a_t, y_t)$ of observations and actions before time t. The results of Section IV extend when the policies are non-stationary; however, there are many subtleties regarding the domains and ranges of the simulation functions and the measures, and some details are different.

Let $H_t = \{h_t = (a_0, y_0, a_1, y_1, \cdots, a_t, y_t) : a_s \in A,\ y_s \in Y,\ 0 \le s \le t\}$ (we introduce $a_0$ for notational convenience, though the next action and state might depend only on $y_0$). Let Π be the set of policies $\pi = (\pi_1, \pi_2, \cdots)$, with $\pi_{t+1} : H_t \times A \to [0,1]$ a probability measure on A conditioned on $h_t \in H_t$. Let $\Pi_t$ denote the set of all policies $\pi_t$ at time t with π ∈ Π. We can simulate a policy π in the following manner:

$$a_{t+1} = q_{\pi,t+1}(h_t, \nu_{t+1}),$$
$$x_{t+1} = f_{\pi,t+1}(x_t, h_t, \nu_{t+1}, \omega_{t+1}) := h(x_t, a_{t+1}, \omega_{t+1}),$$
$$y_{t+1} = g_{\pi,t+1}(x_t, h_t, \nu_{t+1}, \omega_{t+1}, \zeta_{t+1}) := g(x_{t+1}, \zeta_{t+1}),$$

where $\nu_t, \omega_t, \zeta_t$ are i.i.d. uniform on [0,1], and $q_{\pi,t}$, h and g are simulation functions under the simple simulation model (which simulate $\pi_{t+1}$, $P_a$ and τ, respectively). The initial state-action pair $(x_0, a_0)$ is drawn according to the given initial state-action distribution λ. Denote $z_t = (x_t, h_t) \in Z_t := X \times H_t$ and $\xi_t = (\nu_t, \omega_t, \zeta_t) \in \Omega_3 := \Omega \times \Omega \times \Omega$, and $\xi = (\xi_1, \xi_2, \cdots)$; then

$$s_{\pi,t+1}(z_t, \xi_t) = (z_{t+1}, \theta\xi) = \big(q_{\pi,t+1}(h_t,\nu_{t+1}),\ f_{\pi,t+1}(x_t,h_t,\nu_{t+1},\omega_{t+1}),\ g_{\pi,t+1}(x_t,h_t,\nu_{t+1},\omega_{t+1},\zeta_{t+1}),\ h_t,\ \theta\xi\big),$$

where θ is a left-shift operator. The function sequence $s_\pi = (s_{\pi,1}, s_{\pi,2}, \cdots)$ will be called the simulation system for a general policy π. The function sequences $q_\pi = (q_{\pi,1}, q_{\pi,2}, \cdots)$, $f_\pi = (f_{\pi,1}, f_{\pi,2}, \cdots)$ and $g_\pi = (g_{\pi,1}, g_{\pi,2}, \cdots)$ will be called the action, state and observation simulation functions corresponding to policy π. We denote $Q = \{q_\pi : \pi \in \Pi\}$, $F = \{f_\pi : \pi \in \Pi\}$, $G = \{g_\pi : \pi \in \Pi\}$, and the set of all simulation functions induced by Π by $S = \{s_\pi : \pi \in \Pi\}$. Note that we can define $s_{\pi,t}\circ\cdots\circ s_{\pi,1}$; we shall denote it by $s_\pi^t$.

We first connect the P-dimensions of $\Pi_t$ and $Q_t := \{q_{\pi,t} : \pi \in \Pi\}$ ($S_t$ and $F_t$ are defined similarly).

Lemma 3: Suppose $\Pi_t$ is convex and P-dim($\Pi_t$) = d. Then P-dim($Q_t$) = d.

The proof follows that of Lemma 1 and is omitted.

Let µ be a probability measure on $\Omega_3^\infty$ and $\lambda_t$ a measure on $Z_{t-1}$. Denote the product measure on $W_{t-1} = \Omega_3^\infty \times Z_{t-1}$ by $P_t = \lambda_t \times \mu$, and on $W_{t-1}^n$ by $P_t^n$. Define the two pseudo-metrics on $S_t$:

$$\rho_t(s_{1t}, s_{2t}) = \sum_{z\in Z_{t-1}} \lambda_t(z)\,\mu\{\xi : s_{1t}(z,\xi) \ne s_{2t}(z,\xi)\}$$

and

$$d_{L_1(P_t)}(s_{1t}, s_{2t}) := \sum_{z\in Z_{t-1}} \lambda_t(z)\int |s_{1t}(z,\xi) - s_{2t}(z,\xi)|\,d\mu(\xi).$$

The $\rho_t$ and $d_{L_1(P_t)}$ pseudo-metrics for the function spaces $Q_t$ and $F_t$ are defined similarly. Define, for s ∈ S and $z \in Z_t$,

$$\lambda_{s^t}(z) := \sum_{z'\in Z_0} \lambda(z')\,\mu\{\xi : s^t(z',\xi) = (z, \theta^t\xi)\} \qquad (5)$$

to be a probability measure on $Z_t$. We now state the extension of the technical lemma needed for the main theorem of this section.

Lemma 4: Let λ be a probability measure on $Z_0$ and $\lambda_{s^t}$ the probability measure on $Z_t$ defined above. Suppose that P-dim($Q_t$) ≤ d for all t ≥ 1, and that there exist probability measures $\lambda_t$ on $Z_t$ such that $K := \max\{\sup_t \sup_{s^t\in S^t,\, z\in Z_t} \lambda_{s^t}(z)/\lambda_{t+1}(z),\ 1\} < \infty$. Then, for 1 ≤ t ≤ T,

$$N(\epsilon, S^t, \rho_1) \le N\Big(\frac{\epsilon}{Kt}, Q_t, \rho_t\Big)\cdots N\Big(\frac{\epsilon}{Kt}, Q_1, \rho_1\Big) \le \Big(\frac{2eKt}{\epsilon}\log\frac{2eKt}{\epsilon}\Big)^{dt}.$$

The proof can be found in the appendix. We now obtain our sample complexity result.

Theorem 2: Let (Z, Γ, λ) be the measurable state-action space, Y the observation space, $P_a(x,x')$ the state transition function, and τ(y|x) the conditional probability measure that determines the observations. Let r(x, a) be the real-valued reward function bounded in [0, R]. Let Π be the set of stochastic policies (non-stationary and with memory in general) and S the set of simple simulation systems that simulate π ∈ Π. Suppose that P-dim($Q_t$) ≤ d for all t ≥ 1, and let the probability measures λ, µ (on $Z_0$ and $\Omega_3^\infty$, respectively) and $\lambda_{t+1}$, a probability measure on $Z_t$ (with $\lambda_1 = \lambda$), be such that $K := \max\{\sup_t \sup_{s^t\in S^t,\, z\in Z_t} \lambda_{s^t}(z)/\lambda_{t+1}(z),\ 1\}$ is finite, where $\lambda_{s^t}$ is as defined above. Let $\hat V_n^T(\pi)$ be the estimate of V(π) obtained from n samples with T time steps. Then, given any ε, δ > 0, with probability at least 1 − δ,

$$\sup_{\pi\in\Pi} |\hat V_n^T(\pi) - V(\pi)| < \epsilon$$

for

$$n \ge \frac{32R^2}{\alpha^2}\Big(\log\frac{4}{\delta} + 2dT\Big(\log\frac{32eR}{\alpha} + \log KT\Big)\Big),$$

where T is the ε/2-horizon time and $\alpha = \epsilon/(2(T+1))$.

The above results can be extended to Multi-armed Markov bandits with discounted rewards. Due to space constraints, such results will be omitted.

VI. SIMULATION-BASED OPTIMIZATION

We propose a simulation-based optimization framework based on the empirical process theory for Markov decision processes developed above and in [18]. We are given an MDP M with a convex and compact policy space Π. Let Q be the set of simple action-simulation functions that simulate Π. Let $\hat Q_\varepsilon$ be an $\varepsilon$-net for Q under the $d_{L_1}$ metric, and denote the set of policies in Π corresponding to this $\varepsilon$-net by $\hat\Pi_\varepsilon$. We know from Lemma 1 that if the P-dimension of Π is some finite d, then the P-dimension of Q is also d, which implies that the $\varepsilon$-net $\hat\Pi_\varepsilon$ is finite, with cardinality bounded by $n_1(\varepsilon, N_A) = 2(x\log x)^d$, where $x = 2eN_A/\varepsilon$ and $N_A$ is the cardinality of the action set A.

We pick each $\pi \in \hat\Pi_\varepsilon$ and simulate n sample paths for T time steps, where T is the $\epsilon/2$-horizon time for a given $\epsilon > 0$. Thus, we obtain estimates $\hat V_n^T(\pi)$. We pick

$$\hat\pi^* \in \arg\sup_{\pi\in\hat\Pi_\varepsilon} \hat V_n^T(\pi).$$

The optimal policy is, of course,

$$\pi^* \in \arg\sup_{\pi\in\Pi} V(\pi).$$

We also define

$$\tilde\pi \in \arg\sup_{\pi\in\Pi} \hat V_n^T(\pi).$$
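As a concrete (and deliberately tiny) illustration of this search procedure — this sketch is ours; the one-parameter policy family, the grid used as the net, and all numbers are assumptions made for the example, not constructions from the paper — one can enumerate a finite net of policies, estimate each value by truncated simulation, and return the empirical maximizer $\hat\pi^*$:

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers).
rng = np.random.default_rng(1)
gamma, T, n_runs = 0.9, 60, 500
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, x, :] = P_a(x, .)
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[0.2, 1.0], [0.8, 0.1]])    # r(x, a) in [0, 1]

def rollout_value(p):
    """Truncated discounted reward of the policy 'pick action 0 w.p. p in every state'."""
    x, total = 0, 0.0
    for t in range(T + 1):
        a = 0 if rng.uniform() < p else 1
        total += (gamma ** t) * r[x, a]
        x = 0 if rng.uniform() < P[a, x, 0] else 1
    return total

def V_hat(p, n):
    """Empirical value estimate from n independent truncated rollouts."""
    return float(np.mean([rollout_value(p) for _ in range(n)]))

# A finite net over the one-parameter policy family, and the empirical maximizer over it.
policy_net = np.linspace(0.0, 1.0, 21)
estimates = np.array([V_hat(p, n_runs) for p in policy_net])
p_hat_star = policy_net[int(np.argmax(estimates))]
print(p_hat_star, float(estimates.max()))
```

The analysis below makes this precise: it quantifies how fine the net must be and how many runs per policy are needed so that the returned policy is ε-optimal with probability at least 1 − δ.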

Define the regret of a policy as $\varrho(\pi) := V(\pi^*) - V(\pi)$. Now, from the results of Section IV, we know that for given $\epsilon, \delta > 0$ and $n \ge n_0(\epsilon/3, \delta)$, with probability at least 1 − δ,

$$V(\pi) \le \hat V_n^T(\pi) + \epsilon/3, \quad \forall\,\pi\in\Pi, \qquad (6)$$

and in particular for $\pi = \pi^*$. Also,

$$\hat V_n^T(\pi) \le V(\pi) + \epsilon/3, \quad \forall\,\pi\in\Pi, \qquad (7)$$

and in particular for $\pi = \hat\pi^*$. Consider any $\pi_1, \pi_2 \in \Pi$ with simulation functions $q_1, q_2 \in Q$ and corresponding $s_1, s_2 \in S$. Then, from the proof of Theorem 1, we know that

$$d_{L_1}(\hat V_n^T(\pi_1), \hat V_n^T(\pi_2)) \le R\cdot\sum_{t=0}^{T} \gamma^t d_{L_1}(s_1^t, s_2^t) \le R\cdot\sum_{t=0}^{T} \gamma^t K^t d_{L_1}(q_1, q_2).$$

Then, if $d_{L_1}(q_1, q_2) \le \varepsilon$, we have

$$d_{L_1}(\hat V_n^T(\pi_1), \hat V_n^T(\pi_2)) \le R\varepsilon\,\frac{1-(\gamma K)^{T+1}}{1-\gamma K} \le \epsilon/3$$

for γK < 1 and $\varepsilon \le \frac{\epsilon(1-\gamma K)}{3R(1-(\gamma K)^{T+1})}$. Now, note that if we take $\pi_1 = \pi^*$, then by the definition of an $\varepsilon$-net there exists a $\pi_2 = \hat\pi \in \hat\Pi_\varepsilon$ such that $d_{L_1}(q_1, q_2) \le \varepsilon$. Thus,

$$\hat V_n^T(\pi^*) \le \hat V_n^T(\hat\pi) + \epsilon/3 \le \hat V_n^T(\hat\pi^*) + \epsilon/3. \qquad (8)$$

From equations (6), (7) and (8), we get that $V(\pi^*) \le V(\hat\pi^*) + \epsilon$, i.e., $\hat\pi^*$ is an $\epsilon$-optimal policy. Formally, we have shown the following.

Theorem 3: Given an MDP with countably infinite state space and a finite action space (of cardinality $N_A$), with a convex, compact policy space Π with P-dim(Π) = d, an $\epsilon, \delta > 0$, and γ < 1/K, if we estimate the value function for each policy in a given $\varepsilon$-net for Q (with $\varepsilon < \frac{\epsilon(1-\gamma K)}{3R(1-(\gamma K)^{T+1})}$) by doing $n_0(\epsilon/3, \delta)$ simulation runs, each for T time steps, then the obtained policy $\hat\pi^*$ is $\epsilon$-optimal, in the sense that $P^n\{\varrho(\hat\pi^*) > \epsilon\} < \delta$. Moreover, the sample complexity is given by

$$n_0(\epsilon/3, \delta)\cdot n_1(\varepsilon, N_A) \sim O\Big(\frac{1}{\epsilon^3}\log\frac{1}{\epsilon},\ \frac{1}{\epsilon^2}\log\frac{1}{\delta}\Big),$$

a polynomial in $1/\epsilon$.

VII. APPLICATION TO MULTI-ARMED BANDITS

Consider a Multi-armed bandit M with (finitely many) $N_A$ machines, each machine being a Markov process on the countably infinite state space X. When the arm of machine a is pulled, the system moves from the current state x to a state x' ∈ X according to some state transition function $P_a(x, x')$. Note that our framework is quite general: not only do we allow restless machines, we also allow the states of the machines to be dependent. We will assume that the reward received is some bounded and known function r(x, a) ∈ [0, R]. The observation at time t is $y_t = r(x_t, a_t)$. Let $h_{t+1} = (a_0, y_0, \cdots, a_t, y_t)$ denote the history known at time t. A policy then is $\pi = (\pi_1, \pi_2, \cdots)$, where $\pi_t(a, h_t)$ is the probability of pulling arm a given the current history $h_t$. We will assume we are given a policy space Π. Let

$$V(\pi) = E\Big[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t)\Big]$$

denote the expected total discounted reward under policy π, where 0 < γ < 1 is the discount factor. We will assume that Π is convex and compact, and that P-dim($\Pi_t$) ≤ d, finite, for each t.

It is well known that picking arms according to the Gittins index yields the optimal 'exploration versus exploitation trade-off' and maximizes V(π). However, computing the Gittins index is computationally complex, and the problem is known to be intractable in some cases. Thus, we extend the simulation-based framework we have developed for Markov decision processes to find an $\epsilon$-optimal policy for such problems.

Our simulation system for a Multi-armed bandit M is the following. Let ν, ω denote uniform i.i.d. noise sequences from $\Omega^\infty$ as before. Then, given a policy $\pi = (\pi_1, \pi_2, \cdots)$, the action at time t + 1 is $a_{t+1} = q_{\pi,t}(h_t, \nu_{t+1})$ and the state at time t + 1 is $x_{t+1} = f_{\pi,t}(x_t, h_t, \nu_{t+1}, \omega_{t+1}) = h(x_t, a_{t+1}, \omega_{t+1})$; a sketch of this simulation loop is given below.
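A minimal sketch of this simulation loop (ours; the softmax history-dependent policy, the state space size, and all numbers are illustrative assumptions, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, T = 0.9, 50
n_states, n_arms = 4, 2
P = rng.dirichlet(np.ones(n_states), size=(n_arms, n_states))  # P[a, x, :] = P_a(x, .)
r = rng.uniform(size=(n_states, n_arms))                       # reward r(x, a) in [0, 1]

def sample(p, u):
    """Inverse-c.d.f. sampling, as in the simple simulation model."""
    return min(int(np.searchsorted(np.cumsum(p), u, side="right")), len(p) - 1)

def softmax_policy(history, temperature=0.3):
    """A history-dependent policy pi_t(. | h_t): softmax over per-arm average observed reward."""
    sums, counts = np.zeros(n_arms), np.full(n_arms, 1e-9)
    for a, y in history:
        sums[a] += y
        counts[a] += 1.0
    scores = (sums / counts) / temperature
    w = np.exp(scores - scores.max())
    return w / w.sum()

def rollout(policy):
    """One truncated discounted-reward rollout of the bandit under a history-dependent policy."""
    x, history, total = 0, [], 0.0
    for t in range(T + 1):
        a = sample(policy(history), rng.uniform())   # a_{t+1} = q_{pi,t}(h_t, nu_{t+1})
        y = r[x, a]                                  # observation y_t = r(x_t, a_t)
        total += (gamma ** t) * y
        history.append((a, y))
        x = sample(P[a, x], rng.uniform())           # x_{t+1} = h(x_t, a_{t+1}, omega_{t+1})
    return total

print(float(np.mean([rollout(softmax_policy) for _ in range(200)])))
```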

Denote $z_t = (x_t, h_t) \in Z_t := X \times H_t$ and $\xi_t = (\nu_t, \omega_t) \in \Omega_2$. Thus, $(z_{t+1}, \theta\xi) = s_{\pi,t+1}(z_t, \xi)$ summarizes the simulation system for π, where, as before, θ is the left-shift operator. We denote $\Pi_t, H_t, Q_t, F_t, S_t$ and Π, Q, F, S as before. It follows straightforwardly from Lemma 3 that if $\Pi_t$ is convex and P-dim($\Pi_t$) = d < ∞, then P-dim($Q_t$) = d also. Let µ be a probability measure on $\Omega_2^\infty$ and $\lambda_t$ a probability measure on $Z_t$. Denote the product measure by $P_t = \lambda_t \times \mu$. Defining the pseudo-metrics $\rho_t$ and $d_{L_1}$ on $S_t$ and $Q_t$ as before, we can establish that Lemma 4 holds for the Multi-armed bandit problem also. We can then establish the following result.

Theorem 4: Consider the Multi-armed bandit M with $P_a(x,x')$ the state transition function for arm a. Let r(x, a) be the real-valued reward function bounded in [0, R], with discount factor 0 < γ < 1. Let Π be the set of stochastic policies (non-stationary and with memory in general) and S the set of simulation systems that simulate policies π ∈ Π under the simple simulation model. Suppose that P-dim($Q_t$) = d < ∞, and let the probability measures λ, µ (on $Z_0$ and $\Omega_2^\infty$, respectively) and $\lambda_{t+1}$, a probability measure on $Z_t$ ($\lambda_1 = \lambda$), be such that $K := \max\{\sup_t \sup_{s^t\in S^t,\, z\in Z_t} \lambda_{s^t}(z)/\lambda_{t+1}(z),\ 1\}$ is finite, where $\lambda_{s^t}$ is as defined in equation (5). Let $\hat V_n^T(\pi)$ be the estimate of V(π) obtained from n samples with T time steps. Then, given any ε, δ > 0, with probability at least 1 − δ,

$$\sup_{\pi\in\Pi} |\hat V_n^T(\pi) - V(\pi)| < \epsilon$$

for $n \ge n_0(\epsilon, \delta)$ with

$$n_0(\epsilon, \delta) := \frac{32R^2}{\alpha^2}\Big(\log\frac{4}{\delta} + 2dT\Big(\log\frac{32eR}{\alpha} + \log KT\Big)\Big),$$

where T is the ε/2-horizon time and $\alpha = \epsilon/(2(T+1))$.

Theorem 5: Given a Multi-armed bandit M with countably infinite state space X and a finite action space A (of cardinality $N_A$), with a convex, compact policy space Π with P-dim(Π) = d < ∞, an $\epsilon, \delta > 0$, and γ < 1/K, if we estimate the value function for each policy in a given $\varepsilon$-net for Q (with $\varepsilon < \frac{\epsilon(1-\gamma K)}{3R(1-(\gamma K)^{T+1})}$) by doing $n_0(\epsilon/3, \delta)$ simulation runs, each for T time steps, then the obtained policy $\hat\pi^*$ is $\epsilon$-optimal, in the sense that

$$P^n\{\varrho(\hat\pi^*) > \epsilon\} < \delta.$$

Moreover, the sample complexity is given by

$$n_0(\epsilon/3, \delta)\cdot n_1(\varepsilon, N_A) \sim O\Big(\frac{1}{\epsilon^3}\log\frac{1}{\epsilon},\ \frac{1}{\epsilon^2}\log\frac{1}{\delta}\Big),$$

a polynomial in $1/\epsilon$.

VIII. CONCLUSIONS

This paper considers simulation-based value function estimation methods for Markov decision processes (MDPs). Uniform sample complexity results are presented for the discounted-reward case. The combinatorial complexity of

the space of simulation functions under the proposed simple simulation model is shown to be the same as that of the underlying space of induced Markov chains when the latter is convex. Using ergodicity and weak mixing leads to similar uniform sample complexity results for the average-reward case, when a reference Markov chain exists. Extensions of the results are obtained when the MDP is partially observable with general policies. Remarkably, the sample complexity results have the same order for both completely and partially observed MDPs when stationary and memoryless policies are used. Sample complexity results for discounted-reward Markov games can be deduced easily as well.

The results can be seen as an extension of the theory of PAC (probably approximately correct) learning for partially observable Markov decision processes (POMDPs) and games. PAC theory is related to the system identification problem. One of the key contributions of this paper is the observation that how we simulate an MDP matters for obtaining uniform estimates. This is a new (and surprising) observation. Thus, the results of this paper can also be seen as first steps towards developing an empirical process theory for Markov decision processes. Such a theory would go a long way towards establishing a theoretical foundation for computer simulation of complex engineering systems.

We have used Hoeffding's inequality to obtain the rate of convergence for discounted-reward MDPs and the McDiarmid-Azuma inequality for average-reward MDPs, though more sophisticated and tighter inequalities of Talagrand [13], [?] can be used as well. This would yield better results and is part of future work.

APPENDIX

Proof of Lemma 2.

Proof: Consider any $s_1 = (q_1, f_1), s_2 = (q_2, f_2) \in S$ and $z \in Z$. Then,

$$\begin{aligned}
&\mu\{\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi)\}\\
&= \mu\{\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi)\} + \mu\{\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) \ne s_2^{t-1}(z,\xi)\}\\
&= \mu\{\cup_y(\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi) = (y,\theta^{t-1}\xi))\}\\
&\quad + \mu\{\cup_y(\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) \ne s_2^{t-1}(z,\xi),\ s_1^{t-1}(z,\xi) = (y,\theta^{t-1}\xi))\}\\
&\le \mu\{\cup_y(\xi : s_1(y,\theta^{t-1}\xi) \ne s_2(y,\theta^{t-1}\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi) = (y,\theta^{t-1}\xi))\}\\
&\quad + \mu\{\cup_y(\xi : s_1^{t-1}(z,\xi) \ne s_2^{t-1}(z,\xi),\ s_1^{t-1}(z,\xi) = (y,\theta^{t-1}\xi))\}\\
&= \sum_y \mu\{\xi : s_1(y,\theta^{t-1}\xi) \ne s_2(y,\theta^{t-1}\xi) \mid s_1^{t-1}(z,\xi) = (y,\theta^{t-1}\xi)\}\,\mu\{\xi : s_1^{t-1}(z,\xi) = (y,\theta^{t-1}\xi)\}\\
&\quad + \mu\{\xi : s_1^{t-1}(z,\xi) \ne s_2^{t-1}(z,\xi)\}.
\end{aligned}$$

It is easy to argue that $\lambda_{s_1^{t-1}}(y) \le K^{t-1}\lambda(y)$, where $\lambda_{s_1^{t-1}}(y) = \sum_z \lambda(z)\,\mu\{\xi : s_1^{t-1}(z,\xi) = (y,\theta^{t-1}\xi)\}$. Thus, multiplying both the RHS and LHS of the above sequence of inequalities by λ(z) and summing over z, and observing that $\mu\{\xi : s_1(y,\theta^{t-1}\xi) \ne s_2(y,\theta^{t-1}\xi)\} = \mu\{\xi' : s_1(y,\xi') \ne s_2(y,\xi')\}$, we get that the first part of the RHS is

$$\sum_y \lambda_{s_1^{t-1}}(y)\,\mu\{\xi' : s_1(y,\xi') \ne s_2(y,\xi')\} \le K^{t-1}\cdot\sum_y \lambda(y)\,\mu\{\xi' : s_1(y,\xi') \ne s_2(y,\xi')\}.$$

This implies that

$$\rho_P(s_1^t, s_2^t) \le K^{t-1}\rho_P(s_1,s_2) + \rho_P(s_1^{t-1}, s_2^{t-1}) \le (K^{t-1} + K^{t-2} + \cdots + 1)\rho_P(s_1,s_2) \le K^t\rho_P(s_1,s_2),$$

where the second inequality is obtained by induction. Now,

$$\sum_z \lambda(z)\,\mu\{\xi : s_1(z,\xi) \ne s_2(z,\xi)\} \le \sum_z \lambda(z)\,\mu\{\xi : q_1(z,\xi) \ne q_2(z,\xi)\} \le \sum_z \lambda(z)\int |q_1(z,\xi) - q_2(z,\xi)|\,d\mu(\xi),$$

where the first inequality follows because the event $\{q_1 = q_2,\ f_1 \ne f_2\}$ has zero probability, and the second inequality follows because µ is a probability measure and $|q_1 - q_2| \ge 1$ when $q_1 \ne q_2$. Thus $\rho_P(s_1^t, s_2^t) \le K^t\cdot d_{L_1}(q_1, q_2)$, which proves the required assertion.

Proof of Lemma 4.

The proof of Lemma 4 is similar, but the details are somewhat more involved.

Proof: Consider any $s_1^t, s_2^t \in S^t$ and $z \in Z$. Then,

$$\begin{aligned}
&\mu\{\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi)\}\\
&= \mu\{\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi)\} + \mu\{\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) \ne s_2^{t-1}(z,\xi)\}\\
&= \mu\{\cup_{z'\in Z_{t-1}}(\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi) = (z',\theta^{t-1}\xi))\}\\
&\quad + \mu\{\cup_{z'\in Z_{t-1}}(\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) \ne s_2^{t-1}(z,\xi),\ s_1^{t-1}(z,\xi) = (z',\theta^{t-1}\xi))\}\\
&\le \mu\{\cup_{z'\in Z_{t-1}}(\xi : s_1^t(z,\xi) \ne s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi) = (z',\theta^{t-1}\xi))\}\\
&\quad + \mu\{\cup_{z'\in Z_{t-1}}(\xi : s_1^{t-1}(z,\xi) \ne s_2^{t-1}(z,\xi),\ s_1^{t-1}(z,\xi) = (z',\theta^{t-1}\xi))\}\\
&\le \sum_{z'\in Z_{t-1}} \mu\{\xi : s_{1t}(z',\theta^{t-1}\xi) \ne s_{2t}(z',\theta^{t-1}\xi) \mid s_1^{t-1}(z,\xi) = (z',\theta^{t-1}\xi)\}\,\mu\{\xi : s_1^{t-1}(z,\xi) = (z',\theta^{t-1}\xi)\}\\
&\quad + \mu\{\xi : s_1^{t-1}(z,\xi) \ne s_2^{t-1}(z,\xi)\}.
\end{aligned}$$

Multiplying both the RHS and LHS of the above sequence of inequalities by λ(z) and summing over z, and observing again that $\mu\{\xi : s_{1t}(z',\theta^{t-1}\xi) \ne s_{2t}(z',\theta^{t-1}\xi)\} = \mu\{\xi' : s_{1t}(z',\xi') \ne s_{2t}(z',\xi')\}$, we get that the first part of the RHS is

$$\sum_{z'\in Z_{t-1}} \lambda_{s_1^{t-1}}(z')\,\mu\{\xi' : s_{1t}(z',\xi') \ne s_{2t}(z',\xi')\} \le K\cdot\sum_{z'\in Z_{t-1}} \lambda_t(z')\,\mu\{\xi' : s_{1t}(z',\xi') \ne s_{2t}(z',\xi')\}.$$

This by induction implies

$$\rho_1(s_1^t, s_2^t) \le K\big(\rho_t(s_{1t}, s_{2t}) + \cdots + \rho_1(s_{11}, s_{21})\big),$$

which implies the first inequality. For the second inequality, note that the ρ pseudo-metric and the $L_1$ pseudo-metric are related thus:

$$\sum_z \lambda_t(z)\,\mu\{\xi : s_{1t}(z,\xi) \ne s_{2t}(z,\xi)\} \le \sum_z \lambda_t(z)\,\mu\{\xi : q_{1t}(z,\xi) \ne q_{2t}(z,\xi)\} \le \sum_z \lambda_t(z)\int |q_{1t}(z,\xi) - q_{2t}(z,\xi)|\,d\mu(\xi),$$

where the first inequality follows because only the events $\{q_{1t} \ne q_{2t},\ f_{1t} = f_{2t},\ g_{1t} = g_{2t}\}$ and $\{q_{1t} \ne q_{2t},\ f_{1t} \ne f_{2t}\}$ have non-zero probability and non-zero $L_1$ distance; both of these events are contained in $\{q_{1t} \ne q_{2t}\}$. The second inequality follows because µ is a probability measure and $|q_{1t} - q_{2t}| \ge 1$ when $q_{1t} \ne q_{2t}$. Thus $\rho_t(s_{1t}, s_{2t}) \le K\cdot d_{L_1}(q_{1t}, q_{2t})$, which proves the required assertion.

REFERENCES

[1] E. Altman and O. Zeitouni, "Rate of convergence of empirical measures and costs in controlled Markov chains and transient optimality", Math. of Operations Research, 19(4):955-974, 1994.
[2] J. Baxter and P. L. Bartlett, "Infinite-horizon policy-gradient estimation", J. of A.I. Research, 15:319-350, 2001.
[3] R. Dudley, Uniform Central Limit Theorems, Cambridge University Press, 1999.
[4] D. Haussler, "Decision theoretic generalizations of the PAC model for neural nets and other learning applications", Information and Computation, 100(1):78-150, 1992.
[5] M. Kearns, Y. Mansour and A. Y. Ng, "Approximate planning in large POMDPs via reusable trajectories", Proc. Neural Information Processing Systems Conf., 1999.
[6] A. N. Kolmogorov and V. M. Tihomirov, "ε-entropy and ε-capacity of sets in functional spaces", American Math. Soc. Translation Series 2, 17:277-364, 1961.
[7] M. Ledoux, The Concentration of Measure Phenomenon, Mathematical Surveys and Monographs, Volume 89, American Mathematical Society, 2001.
[8] S. Mannor and J. N. Tsitsiklis, "On the empirical state-action frequencies in Markov decision processes under general policies", to appear, Math. of Operations Research, 2005.
[9] P. Marbach and J. N. Tsitsiklis, "Approximate gradient methods in policy-space optimization of Markov reward processes", J. Discrete Event Dynamical Systems, 13:111-148, 2003.
[10] A. Y. Ng and M. I. Jordan, "Pegasus: A policy search method for large MDPs and POMDPs", Proc. UAI, 2000.
[11] L. Peshkin and S. Mukherjee, "Bounds on sample size for policy evaluation in Markov environments", Lecture Notes in Computer Science, 2111:616-630, 2001.
[12] D. Pollard, Empirical Process Theory and Applications, Institute of Mathematical Statistics, Hayward, CA, 1990.
[13] P-M. Samson, "Concentration of measure inequalities for Markov chains and Φ-mixing processes", The Annals of Probability, 28(1):416-461, 2000.
[14] D. H. Shim, H. J. Kim and S. Sastry, "Decentralized nonlinear model predictive control of multiple flying robots in dynamic environments", Proc. IEEE Conf. on Decision and Control, 2003.
[15] M. Talagrand, "Concentration of measure and isoperimetric inequalities in product spaces", Pub. Math. de l'I.H.E.S., 81:73-205, 1995.
[16] V. Vapnik and A. Chervonenkis, "Necessary and sufficient conditions for the uniform convergence of means to their expectations", Theory of Probability and its Applications, 26(3):532-553, 1981.
[17] M. Vidyasagar, Learning and Generalization: With Applications to Neural Networks, Second edition, Springer-Verlag, 2003.
[18] R. Jain and P. Varaiya, "Simulation-based uniform value function estimates of Markov decision processes", SIAM J. Control and Optimization, 45(5):1633-1656, 2006.
[19] P. L. Bartlett and A. Tewari, "Sample complexity of policy search with known dynamics", in Advances in Neural Information Processing Systems 19, MIT Press, 2007.
