Generative Adversarial Imitation Learning

Chun-Yao Kang (National Taiwan University)

Aug 14, 2017


Outline

1. Introduction
   - Imitation Learning
   - Inverse Reinforcement Learning
2. Generative Adversarial Imitation Learning
3. Experiments
4. Conclusion




Imitation Learning

Learn from expert demonstrations
- Only the expert's trajectories are given; the reward function is not available.

Examples: autonomous driving, robot control.

Why imitation learning?
- The reward is hard to define by hand in some tasks.
- Hand-crafted rewards can lead to unwanted behavior.


Two Approaches to Imitation Learning

Behavior cloning
- Supervised learning over state-action pairs from expert trajectories (see the sketch after this list).
- Requires large amounts of data.

Inverse Reinforcement Learning (IRL)
- Finds a cost (reward) function under which the expert trajectories score better than all others.
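The slide describes behavior cloning only at a high level; the following is a minimal sketch of what it amounts to, assuming PyTorch, a continuous action space, and hypothetical tensors `expert_states` and `expert_actions` holding the expert's state-action pairs. For discrete actions, a classifier trained with cross-entropy would replace the MSE regression.

```python
import torch
import torch.nn as nn

def behavior_cloning(policy_net, expert_states, expert_actions,
                     n_epochs=100, lr=1e-3):
    """Fit the policy network to expert state-action pairs by plain regression."""
    opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(n_epochs):
        opt.zero_grad()
        # Predict the expert's action from the expert's state and regress onto it.
        loss = loss_fn(policy_net(expert_states), expert_actions)
        loss.backward()
        opt.step()
    return policy_net
```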


Notations

- c(s, a): cost of taking action a in state s (plays the same role as a reward function, but is minimized rather than maximized).
- Eπ[c(s, a)]: expected cumulative cost under policy π.
- πE: expert policy.
- τ: trajectory samples.
- H(π) ≜ Eπ[−log π(a|s)]: causal entropy of policy π (a Monte Carlo estimate is sketched below).
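As a concrete reading of the last definition, the causal entropy can be estimated by averaging −log π(a_t|s_t) over state-action pairs sampled from π. A minimal sketch, assuming NumPy and a hypothetical array `log_probs` of the policy's log-probabilities along sampled trajectories:

```python
import numpy as np

def causal_entropy_estimate(log_probs):
    # H(pi) is approximately the mean of -log pi(a_t | s_t) over samples drawn from pi.
    return float(np.mean(-np.asarray(log_probs)))

# A policy that is uniform over 4 actions has entropy log 4 ≈ 1.386 per step.
print(causal_entropy_estimate(np.full(1000, np.log(0.25))))
```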


Inverse Reinforcement Learning

Maximum causal entropy IRL:

    c̃ = IRL(πE) = arg max_{c∈C} ( min_{π∈Π} −H(π) + Eπ[c(s, a)] ) − EπE[c(s, a)]

It looks for a cost function c̃ ∈ C that assigns low cost to the expert policy and high cost to other policies.

The expert policy is then recovered by running RL under the cost function c̃:

    π̃ = RL(c̃) = arg min_{π∈Π} −H(π) + Eπ[c̃(s, a)]

Challenges:
- IRL followed by RL recovers the expert policy exactly, but it is computationally expensive: each update of the cost function requires solving an RL problem in the inner loop (see the schematic sketch after this list).
- Restricting C to a smaller class (linear or convex) gives algorithms that scale to large state and action spaces, but may lead to poor imitation.
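To make the cost of the nested optimization concrete, here is a schematic sketch (not the algorithm of any particular IRL paper): every outer update of the cost parameters requires solving a full entropy-regularized RL problem in the inner loop. `cost_model`, `solve_rl`, and `expected_cost` are hypothetical placeholders supplied by the caller.

```python
def irl_then_rl(expert_trajs, cost_model, solve_rl, expected_cost,
                n_outer_iters=50, step_size=1e-2):
    """Schematic IRL: alternate a full RL solve with a cost-function update."""
    for _ in range(n_outer_iters):
        # Inner loop (expensive): entropy-regularized RL under the current cost,
        # i.e. pi = argmin_pi  -H(pi) + E_pi[c(s, a)].
        policy = solve_rl(cost_model)

        # Outer loop: ascend  E_pi[c] - E_piE[c]  in the cost parameters,
        # pushing cost up on the learner and down on the expert.
        grad = (expected_cost(policy, cost_model, grad=True)
                - expected_cost(expert_trajs, cost_model, grad=True))
        cost_model.params = cost_model.params + step_size * grad
    return solve_rl(cost_model)
```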




Generative Adversarial Imitation Learning

GAIL objective:

    min_π max_D  Eπ[log D(s, a)] + EπE[log(1 − D(s, a))] − λ H(π)

GAN objective:

    min_G max_D  Ex∼pdata(x)[log D(x)] + Ez∼pz(z)[log(1 − D(G(z)))]

The discriminator D distinguishes between the distribution of data generated by G (π in GAIL) and the true data distribution (πE in GAIL).
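A minimal sketch of the discriminator side of this objective, assuming PyTorch and state-action pairs concatenated as the discriminator input; the architecture and the small epsilon for numerical stability are assumptions, not taken from the slides. The sign convention follows the slide: D is pushed toward 1 on policy samples and toward 0 on expert samples.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a) -> probability that the pair came from the learner's policy."""
    def __init__(self, obs_dim, act_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def discriminator_loss(D, s_pi, a_pi, s_exp, a_exp, eps=1e-8):
    # max_D  E_pi[log D] + E_piE[log(1 - D)]  ==  minimize the negation below.
    loss_pi = -torch.log(D(s_pi, a_pi) + eps).mean()
    loss_exp = -torch.log(1.0 - D(s_exp, a_exp) + eps).mean()
    return loss_pi + loss_exp
```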


Algorithm

Train the policy network πθ and the discriminator network Dw.

Input: expert trajectories τE, initial parameters θ0 and w0.

In each iteration:
- Sample trajectories τ ∼ πθ.
- Update w with a gradient-ascent step on the discriminator objective.
- Update θ with TRPO (a more stable rule for optimizing the policy than plain gradient descent); a schematic loop follows below.
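A schematic version of this loop, assuming PyTorch, the `Discriminator` and `discriminator_loss` from the earlier sketch, and hypothetical helpers `sample_trajectories` and `trpo_step` standing in for trajectory collection and the TRPO policy update (which is not reproduced here).

```python
import torch

def train_gail(policy, D, expert_batch, sample_trajectories, trpo_step,
               n_iters=1000, d_lr=3e-4):
    d_opt = torch.optim.Adam(D.parameters(), lr=d_lr)
    s_exp, a_exp = expert_batch
    for _ in range(n_iters):
        # 1. Sample trajectories tau ~ pi_theta with the current policy.
        s_pi, a_pi = sample_trajectories(policy)

        # 2. Discriminator update: one gradient step on the GAIL objective.
        d_opt.zero_grad()
        discriminator_loss(D, s_pi, a_pi, s_exp, a_exp).backward()
        d_opt.step()

        # 3. Policy update: TRPO step with log D(s, a) as the surrogate cost
        #    (lower discriminator output = more expert-like = lower cost).
        with torch.no_grad():
            cost = torch.log(D(s_pi, a_pi) + 1e-8).squeeze(-1)
        trpo_step(policy, s_pi, a_pi, cost)
    return policy
```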




Experiments

Environments: 9 physics-based control tasks
- Classic RL control tasks (cartpole, etc.)
- Difficult high-dimensional tasks in MuJoCo (humanoid, etc.)

Expert trajectories
- Generated by running TRPO on the true cost functions defined by OpenAI Gym.
- About 50 state-action pairs in each trajectory.

Baselines
- Behavior cloning
- Feature expectation matching (linear cost function class)
- Game-theoretic apprenticeship learning (convex cost function class)

Network structure
- Two hidden layers of 100 units each, with tanh activations (see the sketch below).
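For concreteness, the stated architecture corresponds to a small MLP like the sketch below (PyTorch assumed; the output head, e.g. the mean of a Gaussian policy for continuous control, is an assumption not given on the slide).

```python
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=100):
    # Two hidden layers of 100 units with tanh activations, as in the slide.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )
```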


Results

(Figure with quantitative results not preserved in this text extraction.)


Results (contd.)

(Figure with additional results not preserved in this text extraction.)




Conclusion

- GAIL learns a policy directly from a small number of expert demonstrations, using a framework similar to GAN.
- GAIL outperforms existing imitation learning algorithms and recovers expert-level performance even in complex, high-dimensional environments.


Reference

- Ho, Jonathan, and Stefano Ermon. "Generative Adversarial Imitation Learning." Advances in Neural Information Processing Systems, 2016.
- Goodfellow, Ian, et al. "Generative Adversarial Nets." Advances in Neural Information Processing Systems, 2014.
- Hung-Yi Lee. "Introduction of Generative Adversarial Network." https://www.slideshare.net/tw_dsconf/ss-78795326
