Generative Adversarial Imitation Learning

Chun-Yao Kang (National Taiwan University)

Aug 14, 2017


Outline

1. Introduction
   - Imitation Learning
   - Inverse Reinforcement Learning
2. Generative Adversarial Imitation Learning
3. Experiments
4. Conclusion




Imitation Learning

Learn from expert demonstrations
- Only the expert's trajectories are given; the reward function is not available.

Examples: autonomous driving, robot control.

Why imitation learning?
- The reward is hard to define by hand in some tasks.
- Hand-crafted rewards can lead to unwanted behavior.


Two Approaches to Imitation Learning

Behavior cloning
- Supervised learning over state-action pairs from expert trajectories (see the sketch after this list).
- Requires large amounts of data.

Inverse Reinforcement Learning (IRL)
- Finds a cost (reward) function under which the expert trajectories score better than all others.
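The slide describes behavior cloning only at a high level; the following is a minimal sketch of what it amounts to, assuming PyTorch, a continuous action space, and hypothetical tensors `expert_states` and `expert_actions` holding the expert's state-action pairs. For discrete actions, a classifier trained with cross-entropy would replace the MSE regression.

```python
import torch
import torch.nn as nn

def behavior_cloning(policy_net, expert_states, expert_actions,
                     n_epochs=100, lr=1e-3):
    """Fit the policy network to expert state-action pairs by plain regression."""
    opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(n_epochs):
        opt.zero_grad()
        # Predict the expert's action from the expert's state and regress onto it.
        loss = loss_fn(policy_net(expert_states), expert_actions)
        loss.backward()
        opt.step()
    return policy_net
```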


Notations

- c(s, a): cost of taking action a in state s (plays the same role as a reward function, but is minimized rather than maximized).
- Eπ[c(s, a)]: expected cumulative cost under policy π.
- πE: expert policy.
- τ: trajectory samples.
- H(π) ≜ Eπ[−log π(a|s)]: causal entropy of policy π (a Monte Carlo estimate is sketched below).
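As a concrete reading of the last definition, the causal entropy can be estimated by averaging −log π(a_t|s_t) over state-action pairs sampled from π. A minimal sketch, assuming NumPy and a hypothetical array `log_probs` of the policy's log-probabilities along sampled trajectories:

```python
import numpy as np

def causal_entropy_estimate(log_probs):
    # H(pi) is approximately the mean of -log pi(a_t | s_t) over samples drawn from pi.
    return float(np.mean(-np.asarray(log_probs)))

# A policy that is uniform over 4 actions has entropy log 4 ≈ 1.386 per step.
print(causal_entropy_estimate(np.full(1000, np.log(0.25))))
```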


Inverse Reinforcement Learning

Maximum causal entropy IRL:

    c̃ = IRL(πE) = arg max_{c∈C} ( min_{π∈Π} −H(π) + Eπ[c(s, a)] ) − EπE[c(s, a)]

It looks for a cost function c̃ ∈ C that assigns low cost to the expert policy and high cost to other policies.

The expert policy is then recovered by running RL under the cost function c̃:

    π̃ = RL(c̃) = arg min_{π∈Π} −H(π) + Eπ[c̃(s, a)]

Challenges:
- IRL followed by RL recovers the expert policy exactly, but it is computationally expensive: each update of the cost function requires solving an RL problem in the inner loop (see the schematic sketch after this list).
- Restricting C to a smaller class (linear or convex) gives algorithms that scale to large state and action spaces, but may lead to poor imitation.
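To make the cost of the nested optimization concrete, here is a schematic sketch (not the algorithm of any particular IRL paper): every outer update of the cost parameters requires solving a full entropy-regularized RL problem in the inner loop. `cost_model`, `solve_rl`, and `expected_cost` are hypothetical placeholders supplied by the caller.

```python
def irl_then_rl(expert_trajs, cost_model, solve_rl, expected_cost,
                n_outer_iters=50, step_size=1e-2):
    """Schematic IRL: alternate a full RL solve with a cost-function update."""
    for _ in range(n_outer_iters):
        # Inner loop (expensive): entropy-regularized RL under the current cost,
        # i.e. pi = argmin_pi  -H(pi) + E_pi[c(s, a)].
        policy = solve_rl(cost_model)

        # Outer loop: ascend  E_pi[c] - E_piE[c]  in the cost parameters,
        # pushing cost up on the learner and down on the expert.
        grad = (expected_cost(policy, cost_model, grad=True)
                - expected_cost(expert_trajs, cost_model, grad=True))
        cost_model.params = cost_model.params + step_size * grad
    return solve_rl(cost_model)
```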




Generative Adversarial Imitation Learning

GAIL objective:

    min_π max_D  Eπ[log D(s, a)] + EπE[log(1 − D(s, a))] − λ H(π)

GAN objective:

    min_G max_D  Ex∼pdata(x)[log D(x)] + Ez∼pz(z)[log(1 − D(G(z)))]

The discriminator D distinguishes between the distribution of data generated by G (π in GAIL) and the true data distribution (πE in GAIL).
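A minimal sketch of the discriminator side of this objective, assuming PyTorch and state-action pairs concatenated as the discriminator input; the architecture and the small epsilon for numerical stability are assumptions, not taken from the slides. The sign convention follows the slide: D is pushed toward 1 on policy samples and toward 0 on expert samples.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a) -> probability that the pair came from the learner's policy."""
    def __init__(self, obs_dim, act_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def discriminator_loss(D, s_pi, a_pi, s_exp, a_exp, eps=1e-8):
    # max_D  E_pi[log D] + E_piE[log(1 - D)]  ==  minimize the negation below.
    loss_pi = -torch.log(D(s_pi, a_pi) + eps).mean()
    loss_exp = -torch.log(1.0 - D(s_exp, a_exp) + eps).mean()
    return loss_pi + loss_exp
```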


Algorithm

Train the policy network πθ and the discriminator network Dw.

Input: expert trajectories τE, initial parameters θ0 and w0.

In each iteration:
- Sample trajectories τ ∼ πθ.
- Update w with a gradient-ascent step on the discriminator objective.
- Update θ with TRPO (a more stable rule for optimizing the policy than plain gradient descent); a schematic loop follows below.
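A schematic version of this loop, assuming PyTorch, the `Discriminator` and `discriminator_loss` from the earlier sketch, and hypothetical helpers `sample_trajectories` and `trpo_step` standing in for trajectory collection and the TRPO policy update (which is not reproduced here).

```python
import torch

def train_gail(policy, D, expert_batch, sample_trajectories, trpo_step,
               n_iters=1000, d_lr=3e-4):
    d_opt = torch.optim.Adam(D.parameters(), lr=d_lr)
    s_exp, a_exp = expert_batch
    for _ in range(n_iters):
        # 1. Sample trajectories tau ~ pi_theta with the current policy.
        s_pi, a_pi = sample_trajectories(policy)

        # 2. Discriminator update: one gradient step on the GAIL objective.
        d_opt.zero_grad()
        discriminator_loss(D, s_pi, a_pi, s_exp, a_exp).backward()
        d_opt.step()

        # 3. Policy update: TRPO step with log D(s, a) as the surrogate cost
        #    (lower discriminator output = more expert-like = lower cost).
        with torch.no_grad():
            cost = torch.log(D(s_pi, a_pi) + 1e-8).squeeze(-1)
        trpo_step(policy, s_pi, a_pi, cost)
    return policy
```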




Experiments

Environments: 9 physics-based control tasks
- Classic RL control tasks (cartpole, etc.)
- Difficult high-dimensional tasks in MuJoCo (humanoid, etc.)

Expert trajectories
- Generated by running TRPO on the true cost functions defined by OpenAI Gym.
- About 50 state-action pairs in each trajectory.

Baselines
- Behavior cloning
- Feature expectation matching (linear cost function class)
- Game-theoretic apprenticeship learning (convex cost function class)

Network structure
- Two hidden layers of 100 units each, with tanh activations (see the sketch below).
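For concreteness, the stated architecture corresponds to a small MLP like the sketch below (PyTorch assumed; the output head, e.g. the mean of a Gaussian policy for continuous control, is an assumption not given on the slide).

```python
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=100):
    # Two hidden layers of 100 units with tanh activations, as in the slide.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )
```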


Results

(Figure with quantitative results not preserved in this text extraction.)


Results (contd.)

(Figure with additional results not preserved in this text extraction.)




Conclusion

- GAIL learns a policy directly from a small number of expert demonstrations, using a framework similar to GAN.
- GAIL outperforms existing imitation learning algorithms and recovers expert-level performance even in complex, high-dimensional environments.


Reference

- Ho, Jonathan, and Stefano Ermon. "Generative Adversarial Imitation Learning." Advances in Neural Information Processing Systems, 2016.
- Goodfellow, Ian, et al. "Generative Adversarial Nets." Advances in Neural Information Processing Systems, 2014.
- Hung-Yi Lee. "Introduction of Generative Adversarial Network." https://www.slideshare.net/tw_dsconf/ss-78795326
