Relaxation Schemes for Min Max Generalization in Deterministic Batch Mode Reinforcement Learning

Raphael Fonteneau, Damien Ernst, Bernard Boigelot, Quentin Louveaux
University of Liège, Belgium

Abstract

We study the min max optimization problem introduced in [1] for computing policies in batch mode reinforcement learning in a deterministic setting. This problem is NP-hard. We focus on the two-stage case, for which we provide two relaxation schemes. The first relaxation scheme drops some constraints in order to obtain a problem that is solvable in polynomial time. The second, based on a Lagrangian relaxation in which all constraints are dualized, leads to a conic quadratic programming problem. Both relaxation schemes are shown to provide better results than those given in [1].
Introduction

● Discrete-time optimal control problems arise in many fields (engineering, finance, medicine, artificial intelligence, etc.). Batch mode reinforcement learning (RL) is a powerful tool for solving such problems when the only information available on the system is contained in a batch collection of trajectories of the system.

● Batch mode RL algorithms are challenged when dealing with large or continuous spaces. In such cases, the main approach is to combine dynamic programming with function approximators, which can often lead to hazardous generalization.

● To overcome this difficulty, [1] proposes a min max-type strategy for generalizing in deterministic, Lipschitz continuous environments with continuous state spaces, finite action spaces and a finite time horizon.

● In this work, we investigate more deeply the min max optimization problem introduced in [1]. In particular, we propose two relaxation schemes that both provide better results than those given in [1]. Proofs of the results are given in [2].

● Focus on the two-stage problem.
The T-stage min max generalization optimization problem

Formalization

● Deterministic discrete-time system, finite optimization horizon T:
  x_{t+1} = f(x_t, u_t),  t = 0, 1, …, T − 1

● Continuous normed state space, finite action space:
  x_t ∈ X,  u_t ∈ U,  |U| < ∞

● Reward function:
  r_t = ρ(x_t, u_t)

● T-stage return of a sequence of actions (u_0, …, u_{T−1}):
  J(u_0, …, u_{T−1}) = Σ_{t=0}^{T−1} ρ(x_t, u_t)
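As a concrete illustration of the rollout defining the T-stage return, here is a minimal sketch with a hypothetical one-dimensional system; the dynamics f, the reward ρ and all numeric values below are illustrative placeholders, not the benchmark used in the experiments.

```python
# Sketch: rolling out a deterministic system and computing the T-stage return.
# The dynamics f, reward rho and all numbers are illustrative placeholders.

def f(x, u):
    # hypothetical deterministic dynamics x_{t+1} = f(x_t, u_t)
    return 0.9 * x + u

def rho(x, u):
    # hypothetical reward r_t = rho(x_t, u_t)
    return -abs(x) - 0.1 * abs(u)

def t_stage_return(x0, actions):
    """Return sum_{t=0}^{T-1} rho(x_t, u_t) for a given action sequence."""
    x, ret = x0, 0.0
    for u in actions:
        ret += rho(x, u)
        x = f(x, u)
    return ret

print(t_stage_return(1.0, [-1.0, 0.5]))  # two-stage return of this toy system
```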
Assumption: the batch mode setting

The system dynamics f and the reward function ρ are unknown. For each action u ∈ U, a sample of transitions is known:
  F_u = {(x_u^l, r_u^l, y_u^l)}_{l=1}^{n_u},
where y_u^l = f(x_u^l, u) and r_u^l = ρ(x_u^l, u).

Assumption: Lipschitz continuity

There exist constants L_f and L_ρ such that, for all x, x′ ∈ X and all u ∈ U,
  ‖f(x, u) − f(x′, u)‖ ≤ L_f ‖x − x′‖,  |ρ(x, u) − ρ(x′, u)| ≤ L_ρ ‖x − x′‖.

Problem statement

Given x_0, the samples {F_u}_{u ∈ U} and the constants L_f and L_ρ, what is the worst possible return that can be obtained for a specific sequence of actions? Once this problem is solved, the min max approach to generalization aims at identifying a sequence of actions which maximizes its worst possible return.

Two relaxation schemes

● Trust-region: some constraints of the min max problem are dropped so as to obtain a relaxed problem that is solvable in polynomial time.

● Lagrangian relaxation: all constraints are dualized, which leads to a conic quadratic programming problem.

Comparing the bounds

Both relaxation schemes provide bounds on the worst possible return that improve on those given in [1].
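For reference, the CGRL solution of [1] computes a closed-form Lipschitz lower bound on the worst possible return directly from the sample. The sketch below assumes the two-stage bound has the form r^{l0} − L_ρ(1 + L_f)·‖x_0 − x^{l0}‖ + r^{l1} − L_ρ·‖y^{l0} − x^{l1}‖, maximized over pairs of transitions; the one-dimensional data and the Lipschitz constants are illustrative placeholders.

```python
# Sketch of a CGRL-style Lipschitz lower bound on the worst-case two-stage
# return (assumed bound form; data and Lipschitz constants are illustrative).
# Each F[u] is a list of transitions (x, r, y) with y = f(x, u), r = rho(x, u).

L_F, L_RHO = 1.0, 1.0  # assumed Lipschitz constants

def lower_bound_two_stage(x0, F, u0, u1):
    """Maximize, over pairs of transitions (one for each action),
    r0 - L_RHO*(1 + L_F)*|x0 - xa| + r1 - L_RHO*|ya - xb|."""
    best = float("-inf")
    for (xa, ra, ya) in F[u0]:
        for (xb, rb, yb) in F[u1]:
            b = (ra - L_RHO * (1.0 + L_F) * abs(x0 - xa)
                 + rb - L_RHO * abs(ya - xb))
            best = max(best, b)
    return best

# Illustrative one-dimensional sample: two actions, two transitions each.
F = {
    0: [(0.0, 1.0, 0.5), (1.0, 0.0, 1.5)],
    1: [(0.5, 2.0, 1.0), (2.0, -1.0, 2.5)],
}
print(lower_bound_two_stage(0.0, F, 0, 1))  # best pair gives 3.0
```

The double loop makes the O(n_{u0}·n_{u1}) cost of this baseline explicit; the relaxation schemes of the poster replace this enumeration by solving relaxed optimization problems that yield tighter bounds.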
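As background for the Lagrangian relaxation scheme, the toy example below illustrates weak duality, the property that makes dualizing constraints yield a valid bound; it is deliberately unrelated to the poster's actual conic quadratic program. We minimize x² subject to x ≥ 1 (optimum 1 at x = 1) and evaluate the dual function, whose inner minimum is available in closed form.

```python
# Toy illustration of Lagrangian relaxation (weak duality), not the poster's
# conic program: minimize x^2 subject to x >= 1; primal optimum is 1 at x = 1.
# Dualizing the constraint gives L(x, lam) = x^2 + lam*(1 - x).

def dual(lam):
    """Dual function q(lam) = min_x L(x, lam); the minimizer is x = lam/2."""
    x = lam / 2.0
    return x**2 + lam * (1.0 - x)  # equals lam - lam^2/4

# Every dual value lower-bounds the primal optimum; lam = 2 attains it exactly.
print([dual(lam) for lam in (0.0, 1.0, 2.0, 3.0)])  # [0.0, 0.75, 1.0, 0.75]
```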
Experimental results

Comparison of the two relaxation schemes with the solution (called CGRL) proposed in [1].

[Figure: distribution of the returns of control policies at the end of the sampling process. Panels: regular grid; uniform sampling (average over 100 runs).]
References

[1] R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Towards min max generalization in reinforcement learning. In Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers, Communications in Computer and Information Science (CCIS), vol. 129, pp. 61-77, J. Filipe, A. Fred and B. Sharp (eds.), Springer, Heidelberg, 2011.

[2] R. Fonteneau, D. Ernst, B. Boigelot and Q. Louveaux. Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes. Submitted.
Acknowledgements Raphael Fonteneau is a Postdoctoral Fellow of the FRS-FNRS. This paper presents research results of the Belgian Network DYSCO and the PASCAL2 European Network of Excellence. The authors also thank Yurii Nesterov for pointing out the idea of using Lagrangian relaxation.