Online Supplement: Improving UCT Planning via Approximate Homomorphisms

Nan Jiang$^1$, Satinder Singh$^1$, and Richard Lewis$^2$

$^1$ Computer Science and Engineering, University of Michigan
$^2$ Department of Psychology, University of Michigan
Theoretical Analysis
The main idea of our algorithm is to build empirical local layered MDPs from the trajectories sampled by UCT and then to find approximate homomorphisms in them. This construction is lossy in two ways: (1) the empirical MDP differs from the original MDP; (2) the abstraction is built from the empirical MDP and hence carries approximation error with respect to it. We show how to combine the losses from both sources. First, we define the notion of loss in value along with some notation useful for the analysis. (A minimal sketch of the empirical construction is given after Theorem 1.)

Notation & Objective of Analysis: For the current state of interest, let the true local layered MDP with depth $d_{\max}$ be $M$. In the first step that introduces value loss, UCT samples $n$ trajectories in $M$ and builds an empirical MDP $\hat{M}$. In a second step that also introduces value loss, an approximate homomorphism $h$ maps $\hat{M}$ to an abstract MDP $\hat{M}_h$, constructed by applying Algorithm 1 in the main paper to $\hat{M}$ with parameter $(\epsilon_T^0/2, \epsilon_R^0/2)$ (hence the approximation error of the constructed abstraction is at most $(\epsilon_T^0, \epsilon_R^0)$).$^1$ Our analytical objective is to bound the loss of the abstraction, i.e., to bound $\|V_M^{\pi^*} - V_M^{\pi_h^*}\|_\infty$, where $V_M^\pi$ is the expected value function of policy $\pi$ evaluated in MDP $M$, $\pi^*$ is the optimal policy in $M$, and $\pi_h^*$ is the optimal policy in $\hat{M}_h$ lifted to the true local MDP $M$.

$^1$ We use $P$, $\hat{P}$ and $\hat{P}_h$ to distinguish the transition probabilities of $M$, $\hat{M}$ and $\hat{M}_h$, and similarly for the reward functions.

Theorem 1 (Main Result). For all $\eta_T, \eta_R > 0$ and $0 < \delta < 1$,

$$\|V_M^{\pi^*} - V_M^{\pi_h^*}\|_\infty \le \frac{2(\epsilon_R^0 + \eta_R)}{1 - \gamma} + \frac{\gamma (R_{\max} - R_{\min})(\epsilon_T^0 + \eta_T)}{(1 - \gamma)^2} \qquad (1)$$

holds with probability at least $1 - \delta$ if $n > \max\big\{\exp^{(d_{\max})}\big(\log\big(a (KB)^{d_{\max}} / \delta\big) / b\big),\ N\big\}$, where $\exp^{(k)}$ denotes the $k$-fold iterated exponential (matching the iterated logarithm $\log^{(k)}$ appearing in Lemma 2), and

1. $K$ is the number of actions available in each state,
2. $B$ is the maximal number of possible next states from a state-action pair,
3. $p \stackrel{\text{def}}{=} \min_{(s,a,s_1,d):\, P(s,a,s_1,d) > 0} P(s,a,s_1,d) / 2$,
4. $[R_{\min}, R_{\max}]$ is the range of rewards in $M$, $\hat{M}$ and $\hat{M}_h$,
5. $N, c$ are positive constants that do not depend on the choice of $\eta_T, \eta_R$,
6. $a \stackrel{\text{def}}{=} \max\{3 d_{\max}, 6B\}$,
7. $b \stackrel{\text{def}}{=} \min\big\{2cp^2,\ 2c\eta_R^2/(R_{\max} - R_{\min})^2,\ 2c\eta_T^2/B^2\big\}$.
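As referenced above, here is a minimal Python sketch of the first lossy step, the construction of the empirical local layered MDP $\hat{M}$ from sampled trajectories. It illustrates only the counting involved and is not the paper's implementation: the trajectory format (a list of $(s, a, r, s_1)$ tuples indexed by depth) and all names such as build_empirical_mdp are assumptions for illustration.

from collections import defaultdict

def build_empirical_mdp(trajectories):
    """Estimate P_hat, R_hat and the visit counts n_{s,a,d} from trajectories."""
    n_sad = defaultdict(int)      # visit counts n_{s,a,d}
    n_sas1d = defaultdict(int)    # transition counts n_{s,a,s1,d}
    r_sum = defaultdict(float)    # accumulated reward at (s,a,d)

    for traj in trajectories:
        for d, (s, a, r, s1) in enumerate(traj):
            n_sad[(s, a, d)] += 1
            n_sas1d[(s, a, s1, d)] += 1
            r_sum[(s, a, d)] += r

    # Empirical transition probabilities P_hat and empirical rewards R_hat.
    P_hat = {(s, a, s1, d): cnt / n_sad[(s, a, d)]
             for (s, a, s1, d), cnt in n_sas1d.items()}
    R_hat = {sad: r_sum[sad] / cnt for sad, cnt in n_sad.items()}
    return P_hat, R_hat, n_sad

Lemma 2 below quantifies how close $\hat{P}$ and $\hat{R}$ computed this way are to $P$ and $R$ as a function of the number of trajectories $n$.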
We prove Theorem 1 using the following three lemmas.

Lemma 2. The probability that

$$\max_{s,a,d} \sum_{s_1} \big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| \le \eta_T \quad \text{and} \quad \max_{s,a,d} \big|\hat{R}(s,a,d) - R(s,a,d)\big| \le \eta_R \qquad (2)$$

holds is at least

$$1 - (KB)^{d_{\max}} \Big( d_{\max} \exp\big(-2p^2 c \log^{(d_{\max})}(n)\big) + 2 \exp\big(-2 c \log^{(d_{\max})}(n)\, \eta_R^2 / (R_{\max} - R_{\min})^2\big) + 2B \exp\big(-2 \eta_T^2 c \log^{(d_{\max})}(n) / B^2\big) \Big). \qquad (3)$$
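Throughout, $\log^{(k)}$ denotes the $k$-fold iterated logarithm: $\log^{(1)}(n) = \log n$ and $\log^{(k+1)}(n) = \log\big(\log^{(k)}(n)\big)$, as used in the proof of Lemma 5 below. As a purely illustrative calculation of how weak this guarantee becomes with depth, with natural logarithms and $n = 10^9$,

$$\log^{(1)}(10^9) \approx 20.7, \qquad \log^{(2)}(10^9) \approx 3.0, \qquad \log^{(3)}(10^9) \approx 1.1,$$

so the guaranteed visit counts, and with them the concentration terms in Eq. (3), degrade rapidly as $d_{\max}$ grows.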
Lemma 3. Let the approximation parameter of $h : \hat{M} \mapsto \hat{M}_h$ be $(\epsilon_T, \epsilon_R)$. If Eq. (2) holds, then $\epsilon_T \le \epsilon_T^0 + \eta_T$ and $\epsilon_R \le \epsilon_R^0 + \eta_R$.

Lemma 4 (Ravindran and Barto, 2004 [1]).

$$\|V_M^{\pi^*} - V_M^{\pi_h^*}\|_\infty \le \frac{2 \epsilon_R}{1 - \gamma} + \frac{\gamma (R_{\max} - R_{\min})\, \epsilon_T}{(1 - \gamma)^2}.$$
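To preview how the three lemmas combine into Theorem 1: the choice of $n$ in the theorem makes the probability in Eq. (3) at least $1 - \delta$, so Eq. (2) holds with probability at least $1 - \delta$; on that event, Lemma 3 gives $\epsilon_T \le \epsilon_T^0 + \eta_T$ and $\epsilon_R \le \epsilon_R^0 + \eta_R$, and substituting these bounds into Lemma 4 yields

$$\|V_M^{\pi^*} - V_M^{\pi_h^*}\|_\infty \le \frac{2(\epsilon_R^0 + \eta_R)}{1 - \gamma} + \frac{\gamma (R_{\max} - R_{\min})(\epsilon_T^0 + \eta_T)}{(1 - \gamma)^2},$$

which is exactly Eq. (1).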
The key idea in the proof of Lemma 2 is to consider each $(s,a,d)$ and bound the probability that $\hat{R}(s,a,d)$ and $\hat{P}(s,a,\cdot,d)$ are inaccurate, and then bound the probability that an inaccurate estimate occurs at any $(s,a,d)$ by a union bound. To obtain the former result, we first need to lower-bound the number of times an $(s,a,d)$ tuple is visited, which is given in the following lemma.

Lemma 5. There exist $c > 0$ and $N > 0$ such that when $n > N$,

$$P\big\{n_{s,a,d} \ge c \log^{(d+1)}(n)\big\} \ge \big(1 - \exp(-2p^2 c \log^{(d_{\max})}(n))\big)^d.$$

Proof. (By induction on $d$.) At $d = 0$, the root, the state is visited exactly $n$ times. According to Theorem 3 in [2], there exists $\rho > 0$ such that $n_{s,a,d} \ge \rho \log(n_{s,d})$. Therefore $n_{s,a,0} \ge \rho \log(n_{s,0}) > c \log(n)$ with probability 1 as long as $c < \rho$.

Now consider an arbitrary $0 < d < d_{\max}$. Let the state-action pair at the previous level that leads to $s$ be $(s', a', d-1)$. By the induction hypothesis,

$$P\big\{n_{s',a',d-1} \ge c \log^{(d)}(n)\big\} \ge \big(1 - \exp(-2p^2 c \log^{(d_{\max})}(n))\big)^{d-1}. \qquad (4)$$

What we need to bound is $P\{n_{s,a,d} \ge c \log^{(d+1)}(n)\}$, which can be decomposed as follows:

$$P\big\{n_{s,a,d} \ge c \log^{(d+1)}(n)\big\} \ge P\big\{n_{s,a,d} \ge c \log^{(d+1)}(n),\ n_{s',a',d-1} \ge c \log^{(d)}(n)\big\} = P\big\{n_{s,a,d} \ge c \log^{(d+1)}(n) \,\big|\, n_{s',a',d-1} \ge c \log^{(d)}(n)\big\} \cdot P\big\{n_{s',a',d-1} \ge c \log^{(d)}(n)\big\}.$$
The second term has already been bounded in Eq. (4). We bound the first term in two steps. First, we show that

$$P\big\{n_{s,a,d} \ge c \log^{(d+1)}(n) \,\big|\, n_{s',a',d-1} \ge c \log^{(d)}(n)\big\} \ge P\big\{n_{s,d}/n_{s',a',d-1} \ge p \,\big|\, n_{s',a',d-1} \ge c \log^{(d)}(n)\big\}. \qquad (5)$$

This is because when $n_{s,d}/n_{s',a',d-1} \ge p$ holds,

$$n_{s,a,d} \ge \rho \log(n_{s,d}) \ge \rho \log\big(p\, c \log^{(d)}(n)\big) = \rho \log^{(d+1)}(n) + \rho \log(pc).$$

Note that the second term is a constant; thus for any $0 < c < \rho$, as long as $\log^{(d_{\max})}(n) > \log(pc)/(c - \rho)$ (solving this inequality yields $N$, which does not depend on $\eta_T$ and $\eta_R$), we have $n_{s,a,d} \ge c \log^{(d+1)}(n)$. This shows that the event on the right side of Eq. (5) is a sub-event of the event on the left side, so the inequality holds.

Second, we bound the right side of Eq. (5). For any fixed $n_{s',a',d-1}$, the ratio $n_{s',a',s,d-1}/n_{s',a',d-1}$ is the average of Bernoulli random variables with expected value $P(s',a',s,d-1)$. By the Hoeffding bound,

$$P\big\{n_{s,d}/n_{s',a',d-1} \ge p\big\} \ge P\big\{n_{s',a',s,d-1}/n_{s',a',d-1} \ge p\big\} = P\big\{n_{s',a',s,d-1}/n_{s',a',d-1} - P(s',a',s,d-1) \ge p - P(s',a',s,d-1)\big\} \ge 1 - \exp\big(-2(P(s',a',s,d-1) - p)^2\, n_{s',a',d-1}\big).$$

By the definition of $p$ in Theorem 1, we always have $P(s',a',s,d-1) - p \ge p$. With $n_{s',a',d-1} \ge c \log^{(d)}(n)$, we have

$$P\big\{n_{s,d}/n_{s',a',d-1} \ge p\big\} \ge 1 - \exp\big(-2p^2 c \log^{(d)}(n)\big).$$

Hence

$$P\big\{n_{s,a,d} \ge c \log^{(d+1)}(n)\big\} \ge P\big\{n_{s,a,d} \ge c \log^{(d+1)}(n) \,\big|\, n_{s',a',d-1} \ge c \log^{(d)}(n)\big\} \cdot P\big\{n_{s',a',d-1} \ge c \log^{(d)}(n)\big\} \ge P\big\{n_{s,d}/n_{s',a',d-1} \ge p \,\big|\, n_{s',a',d-1} \ge c \log^{(d)}(n)\big\} \cdot P\big\{n_{s',a',d-1} \ge c \log^{(d)}(n)\big\} \ge \big(1 - \exp(-2p^2 c \log^{(d)}(n))\big)\big(1 - \exp(-2p^2 c \log^{(d_{\max})}(n))\big)^{d-1} \ge \big(1 - \exp(-2p^2 c \log^{(d_{\max})}(n))\big)^d.$$

So the lemma follows. $\square$

Proof of Lemma 2. Consider a state-action-depth tuple $(s,a,d)$. For the empirical reward and transition probabilities to be accurate at $(s,a,d)$, we first require that $(s,a,d)$ is visited sufficiently often. Relaxing Lemma 5 gives a universal bound on $n_{s,a,d}$ that is independent of $d$: there exist constants $c, N' > 0$ (those specified in Lemma 5) such that for all $(s,a,d)$ and all $n > N'$,

$$P\big\{n_{s,a,d} \ge c \log^{(d_{\max})}(n)\big\} \ge \big(1 - \exp(-2p^2 c \log^{(d_{\max})}(n))\big)^{d_{\max}}.$$

Now we can bound the probabilities that the reward and transition estimates are inaccurate separately. The empirical reward $\hat{R}(s,a,d)$ is the average of at least $c \log^{(d_{\max})}(n)$ i.i.d. samples of a random variable that lies in $[R_{\min}, R_{\max}]$ with expected value $R(s,a,d)$, hence by the Hoeffding bound

$$P\big\{|\hat{R}(s,a,d) - R(s,a,d)| > \eta_R \,\big|\, n_{s,a,d} \ge c \log^{(d_{\max})}(n)\big\} \le 2 \exp\big(-2 c \log^{(d_{\max})}(n)\, \eta_R^2 / (R_{\max} - R_{\min})^2\big).$$

Similarly for the transition probabilities,

$$P\Big\{\sum_{s_1} \big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| > \eta_T \,\Big|\, n_{s,a,d} \ge c \log^{(d_{\max})}(n)\Big\} \le P\Big\{\bigcup_{s_1} \Big[\big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| > \eta_T/B\Big] \,\Big|\, n_{s,a,d} \ge c \log^{(d_{\max})}(n)\Big\} \le \sum_{s_1} P\Big\{\big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| > \eta_T/B \,\Big|\, n_{s,a,d} \ge c \log^{(d_{\max})}(n)\Big\}.$$
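The first inequality above is a pigeonhole step: since at most $B$ of the summands are nonzero, the total deviation can exceed $\eta_T$ only if some single summand exceeds $\eta_T/B$, i.e.,

$$\Big\{\sum_{s_1} \big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| > \eta_T\Big\} \subseteq \bigcup_{s_1} \Big\{\big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| > \eta_T/B\Big\};$$

the second inequality is the union bound.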
Consider a particular possible next state $s_1$:

$$P\big\{|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)| > \eta_T/B \,\big|\, n_{s,a,d} \ge c \log^{(d_{\max})}(n)\big\} \le 2 \exp\big(-2 \eta_T^2 c \log^{(d_{\max})}(n) / B^2\big).$$

As $(s,a)$ has at most $B$ possible next states,

$$P\Big\{\sum_{s_1} \big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| > \eta_T \,\Big|\, n_{s,a,d} \ge c \log^{(d_{\max})}(n)\Big\} \le 2B \exp\big(-2 \eta_T^2 c \log^{(d_{\max})}(n) / B^2\big).$$
Observing that the empirical reward and the empirical transition distribution are independent of each other once $n_{s,a,d}$ is fixed, we have the following result: for all $(s,a,d)$,

$$P\Big\{\big|\hat{R}(s,a,d) - R(s,a,d)\big| \le \eta_R,\ \sum_{s_1} \big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| \le \eta_T\Big\}$$

$$\ge P\Big\{\big|\hat{R}(s,a,d) - R(s,a,d)\big| \le \eta_R,\ \sum_{s_1} \big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| \le \eta_T,\ n_{s,a,d} \ge c \log^{(d_{\max})}(n)\Big\}$$

$$= P\big\{n_{s,a,d} \ge c \log^{(d_{\max})}(n)\big\} \cdot P\big\{|\hat{R}(s,a,d) - R(s,a,d)| \le \eta_R \,\big|\, n_{s,a,d} \ge c \log^{(d_{\max})}(n)\big\} \cdot P\Big\{\sum_{s_1} \big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| \le \eta_T \,\Big|\, n_{s,a,d} \ge c \log^{(d_{\max})}(n)\Big\}$$

$$\ge \big(1 - \exp(-2p^2 c \log^{(d_{\max})}(n))\big)^{d_{\max}} \big(1 - 2\exp(-2 c \log^{(d_{\max})}(n)\, \eta_R^2/(R_{\max} - R_{\min})^2)\big) \big(1 - 2B \exp(-2\eta_T^2 c \log^{(d_{\max})}(n)/B^2)\big). \qquad (6)$$
The final step is to bound the probability that the estimates are accurate everywhere. With the union bound,

$$P\big\{\hat{M} \text{ is not } (\eta_T, \eta_R)\text{-accurate}\big\} = P\Big\{\bigcup_{(s,a,d)} \hat{M} \text{ is not } (\eta_T, \eta_R)\text{-accurate at } (s,a,d)\Big\} \le \sum_{(s,a,d)} P\big\{\hat{M} \text{ is not } (\eta_T, \eta_R)\text{-accurate at } (s,a,d)\big\} \le \#(s,a,d) \cdot \Big(1 - P\Big\{\big|\hat{R}(s,a,d) - R(s,a,d)\big| \le \eta_R,\ \sum_{s_1} \big|\hat{P}(s,a,s_1,d) - P(s,a,s_1,d)\big| \le \eta_T\Big\}\Big).$$

The bound in Lemma 2 is then obtained by plugging in Eq. (6), noticing that $\#(s,a,d) \le (KB)^{d_{\max}}$, and simplifying via the first-order approximation (which is strictly smaller). Finally, to go from Lemma 2 to our main result, we only have to require that each term inside the outermost parentheses of Eq. (3) is at most $\delta/(3(KB)^{d_{\max}})$, and find an $n$ that satisfies all three conditions. $\square$
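To make this last step explicit: the first-order approximation applies $(1 - x)^{d_{\max}} \ge 1 - d_{\max} x$ and $\prod_i (1 - x_i) \ge 1 - \sum_i x_i$ to Eq. (6), which turns the product into the three terms inside the parentheses of Eq. (3). Requiring each term to be at most $\delta/(3(KB)^{d_{\max}})$ and solving for $n$, e.g. for the first term,

$$d_{\max} \exp\big(-2p^2 c \log^{(d_{\max})}(n)\big) \le \frac{\delta}{3(KB)^{d_{\max}}} \iff \log^{(d_{\max})}(n) \ge \frac{1}{2p^2 c} \log \frac{3 d_{\max} (KB)^{d_{\max}}}{\delta},$$

and similarly for the reward term (with $3 d_{\max}$ replaced by $6$) and the transition term (with $3 d_{\max}$ replaced by $6B$). Since $a = \max\{3 d_{\max}, 6B\} \ge 6$ dominates all three numerators and $b$ is the minimum of the three coefficients $2cp^2$, $2c\eta_R^2/(R_{\max} - R_{\min})^2$ and $2c\eta_T^2/B^2$, all three conditions are implied by $\log^{(d_{\max})}(n) \ge \log\big(a(KB)^{d_{\max}}/\delta\big)/b$, i.e., by $n \ge \exp^{(d_{\max})}\big(\log\big(a(KB)^{d_{\max}}/\delta\big)/b\big)$. Together with the $N$ from Lemma 5, this is the sample-size requirement in Theorem 1.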
Proof of Lemma 3.

$$\epsilon_T \stackrel{\text{def}}{=} \max_{s,a,d} \sum_x \Big| \hat{P}_h(h(s), a, x, d) - \sum_{s_1 : h(s_1) = x} P(s, a, s_1, d) \Big|$$

$$\le \max_{s,a,d} \sum_x \Big| \hat{P}_h(h(s), a, x, d) - \sum_{s_1 : h(s_1) = x} \hat{P}(s, a, s_1, d) \Big| + \max_{s,a,d} \sum_x \Big| \sum_{s_1 : h(s_1) = x} \hat{P}(s, a, s_1, d) - \sum_{s_1 : h(s_1) = x} P(s, a, s_1, d) \Big|$$

$$\le \epsilon_T^0 + \max_{s,a,d} \sum_x \Big| \sum_{s_1 : h(s_1) = x} \big( \hat{P}(s, a, s_1, d) - P(s, a, s_1, d) \big) \Big|$$

$$\le \epsilon_T^0 + \max_{s,a,d} \sum_x \sum_{s_1 : h(s_1) = x} \big| \hat{P}(s, a, s_1, d) - P(s, a, s_1, d) \big|$$

$$= \epsilon_T^0 + \max_{s,a,d} \sum_{s_1} \big| \hat{P}(s, a, s_1, d) - P(s, a, s_1, d) \big| \le \epsilon_T^0 + \eta_T.$$
The proof of $\epsilon_R \le \epsilon_R^0 + \eta_R$ is very similar and hence omitted. $\square$
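For completeness, and assuming the analogous definition of the reward parameter, the omitted chain is a single triangle-inequality step:

$$\epsilon_R \stackrel{\text{def}}{=} \max_{s,a,d} \big| \hat{R}_h(h(s), a, d) - R(s, a, d) \big| \le \max_{s,a,d} \big| \hat{R}_h(h(s), a, d) - \hat{R}(s, a, d) \big| + \max_{s,a,d} \big| \hat{R}(s, a, d) - R(s, a, d) \big| \le \epsilon_R^0 + \eta_R.$$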
References

[1] Balaraman Ravindran and Andrew Barto. Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. In Proceedings of the 5th International Conference on Knowledge-Based Computer Systems, 2004.

[2] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML), pages 282–293, 2006.