Min Max Generalization for Two-stage Deterministic Batch Mode Reinforcement Learning: Relaxation Schemes
Abstract
We study the min max optimization problem introduced in [22] for computing policies for batch mode reinforcement learning in a deterministic setting. First, we show that this problem is NP-hard. In the two-stage case, we provide two relaxation schemes. The first relaxation scheme works by dropping some constraints in order to obtain a problem that is solvable in polynomial time. The second relaxation scheme, based on a Lagrangian relaxation where all constraints are dualized, leads to a conic quadratic programming problem. We also theoretically prove and empirically illustrate that both relaxation schemes provide better results than those given in [22].
Reinforcement Learning, Min Max Generalization, Nonconvex Optimization, Computational Complexity
60J05 Discrete-time Markov processes on general state spaces
1 Introduction
Research in Reinforcement Learning (RL) [48] aims at designing computational agents able to learn by themselves how to interact with their environment to maximize a numerical reward signal. The techniques developed in this field have appealed to researchers trying to solve sequential decision making problems in many fields such as Finance [26], Medicine [34, 35] or Engineering [42]. Since the end of the nineties, several researchers have focused on the resolution of a subproblem of RL: computing a high-performance policy when the only information available on the environment is contained in a batch collection of trajectories of the agent [10, 17, 28, 38, 42, 19]. This subfield of RL is known as “batch mode RL”.
Batch mode RL (BMRL) algorithms are challenged when dealing with large or continuous state spaces. Indeed, in such cases they have to generalize the information contained in a generally sparse sample of trajectories. The dominant approach for generalizing this information is to combine BMRL algorithms with function approximators [6, 28, 17, 11]. Usually, these approximators generalize the information contained in the sample to areas poorly covered by the sample by implicitly assuming that the properties of the system in those areas are similar to the properties of the system in the nearby areas well covered by the sample. This in turn often leads to low performance guarantees on the inferred policy when large state space areas are poorly covered by the sample. This can be explained by the fact that when computing the performance guarantees of these policies, one needs to take into account that they may actually drive the system into the poorly visited areas to which the generalization strategy associates a favorable environment behavior, while the environment may actually be particularly adversarial in those areas. This is corroborated by theoretical results which show that the performance guarantees of the policies inferred by these algorithms degrade with the sample dispersion where, loosely speaking, the dispersion can be seen as the radius of the largest non-visited state space area.
To overcome this problem, [22] propose a min max-type strategy for generalizing in deterministic, Lipschitz continuous environments with continuous state spaces, finite action spaces, and finite time horizon. The min max approach works by determining a sequence of actions that maximizes the worst return that could possibly be obtained considering any system compatible with the sample of trajectories, and a weak prior knowledge given in the form of upper bounds on the Lipschitz constants related to the environment (dynamics, reward function). However, they show that finding an exact solution of the min max problem is far from trivial, even after reformulating the problem so as to avoid the search in the space of all compatible functions. To circumvent these difficulties, they propose to replace, inside this min max problem, the search for the worst environment given a sequence of actions by an expression that lower-bounds the worst possible return, which leads to their so-called CGRL algorithm (the acronym stands for “Cautious approach to Generalization in Reinforcement Learning”). This lower bound is derived from their previous work [20, 21] and has a tightness that depends on the sample dispersion. However, in some configurations where areas of the state space are not well covered by the sample of trajectories, the CGRL bound turns out to be very conservative.
In this paper, we propose to further investigate the min max generalization optimization problem that was initially proposed in [22]. We first show that this optimization problem is NP-hard. We then focus on the two-stage case, which is still NP-hard. Since it seems hopeless to exactly solve the problem, we propose two relaxation schemes that preserve the nature of the min max generalization problem by targeting policies leading to high performance guarantees. The first relaxation scheme works by dropping some constraints in order to obtain a problem that is solvable in polynomial time. This results in a well-known configuration called the trust-region subproblem [13]. The second relaxation scheme, based on a Lagrangian relaxation where all constraints are dualized, can be solved using conic quadratic programming in polynomial time. We prove that both relaxation schemes always provide bounds that are greater than or equal to the CGRL bound. We also show that these bounds are tight in the sense that they converge towards the actual return when the sample dispersion converges towards zero, and that the sequences of actions that maximize these bounds converge towards optimal ones.
The paper is organized as follows:

in Section 2, we give a short summary of the literature related to this work,

Section 3 formalizes the min max generalization problem in a Lipschitz continuous, deterministic BMRL context,

in Section 4, we focus on the particular two-stage case, for which we prove that it can be decoupled into two independent problems corresponding respectively to the first stage and the second stage (Theorem 4.1):

the first stage problem leads to a trivial optimization problem that can be solved in closed form (Corollary 4.1),


we analyze in Section 5.4 the asymptotic behavior of the relaxation schemes as a function of the sample dispersion:

we show that the bounds provided by the relaxation schemes converge towards the actual return when the sample dispersion decreases towards zero (Theorem 5.4.1),

we show that the sequences of actions maximizing such bounds converge towards optimal sequences of actions when the sample dispersion decreases towards zero (Theorem 5.4.2),


Section 6 illustrates the relaxation schemes on an academic benchmark,

Section 7 concludes.
We provide in Figure 1 an illustration of the roadmap of the main results of this paper.
2 Related Work
Several works have already been built upon min max paradigms for computing policies in an RL setting. In stochastic frameworks, min max approaches are often successful for deriving robust solutions with respect to uncertainties in the (parametric) representation of the probability distributions associated with the environment [16]. In the context where several agents interact with each other in the same environment, min max approaches appear to be efficient strategies for designing policies that maximize one agent’s reward given the worst adversarial behavior of the other agents [29, 43]. They have also received some attention for solving partially observable Markov decision processes [30, 27].
The min max approach towards generalization, originally introduced in [22], implicitly relies on a methodology for computing lower bounds on the worst possible return (considering any compatible environment) in a deterministic setting with a mostly unknown actual environment. In this respect, it is related to other approaches that aim at computing performance guarantees on the returns of inferred policies [33, 41, 39].
Other fields of research have proposed min max-type strategies for computing control policies. This includes Robust Control theory [24] with $H_\infty$ methods [2], but also Model Predictive Control (MPC) theory (where usually the environment is supposed to be fully known [12, 18]), for which min max approaches have been used to determine an optimal sequence of actions with respect to the “worst case” disturbance sequence occurring [44, 4]. Finally, there is a broad stream of works in the field of Stochastic Programming [7] that have addressed the problem of safely planning under uncertainties, mainly known as “robust stochastic programming” or “risk-averse stochastic programming” [15, 45, 46, 36]. In this field, the two-stage case has also been particularly well studied [23, 14].
3 Problem Formalization
We first formalize the BMRL setting in Section 3.1, and we state the min max generalization problem in Section 3.2.
3.1 Batch Mode Reinforcement Learning
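We consider a deterministic discrete-time system whose dynamics over $T$ stages is described by a time-invariant equation
$$x_{t+1} = f(x_t, u_t), \qquad t = 0, \ldots, T-1,$$
where for all $t$, the state $x_t$ is an element of the state space $\mathcal{X} \subset \mathbb{R}^d$, where $\mathbb{R}^d$ denotes the $d$-dimensional Euclidean space, and $u_t$ is an element of the finite (discrete) action space $\mathcal{U} = \{u^{(1)}, \ldots, u^{(m)}\}$ that we abusively identify with $\{1, \ldots, m\}$. $T \in \mathbb{N} \setminus \{0\}$ is referred to as the (finite) optimization horizon. An instantaneous reward
$$r_t = \rho(x_t, u_t) \in \mathbb{R}$$
is associated with the action $u_t$ taken while being in state $x_t$. For a given initial state $x_0 \in \mathcal{X}$ and for every sequence of actions $(u_0, \ldots, u_{T-1}) \in \mathcal{U}^T$, the cumulated reward over $T$ stages (also named $T$-stage return) is defined as follows:
Definition ($T$-stage Return).
$$J_T^{(u_0, \ldots, u_{T-1})} = \sum_{t=0}^{T-1} \rho(x_t, u_t),$$
where
$$x_{t+1} = f(x_t, u_t), \qquad \forall t \in \{0, \ldots, T-1\}.$$
An optimal sequence of actions $(u_0^*, \ldots, u_{T-1}^*)$ is a sequence that leads to the maximization of the $T$-stage return:
Definition (Optimal $T$-stage Return).
$$J_T^* = \max_{(u_0, \ldots, u_{T-1}) \in \mathcal{U}^T} J_T^{(u_0, \ldots, u_{T-1})}.$$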
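The return defined above is simply a sum of rewards along the trajectory generated by the chosen actions. The following is a minimal sketch of that computation, assuming the dynamics and the reward function are available as Python callables (which is precisely what is not the case in the batch mode setting). The names `stage_return`, `toy_f` and `toy_rho` are hypothetical and serve only as an illustration.

```python
def stage_return(f, rho, x0, actions):
    """Cumulated reward obtained from x0 when applying `actions` in order."""
    x, total = x0, 0.0
    for u in actions:
        total += rho(x, u)   # reward collected in the current state
        x = f(x, u)          # deterministic transition to the next state
    return total

# Hypothetical toy system: scalar state, action 1 moves right, action 0 moves left.
def toy_f(x, u):
    return x + (1.0 if u == 1 else -1.0)

# Hypothetical reward: higher when the state is closer to the origin.
def toy_rho(x, u):
    return -abs(x)

two_stage_return = stage_return(toy_f, toy_rho, 2.0, [0, 0])
```

Here the two-stage return from the initial state 2.0 under the actions (0, 0) adds the rewards collected at the states 2.0 and 1.0.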
We further make the following assumptions that characterize the batch mode setting:

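The system dynamics $f$ and the reward function $\rho$ are unknown;

For each action $u \in \mathcal{U}$, a set of $n^{(u)} \in \mathbb{N}$ one-step system transitions
$$\mathcal{F}^{(u)} = \left\{ \left( x^{(u),k}, r^{(u),k}, y^{(u),k} \right) \right\}_{k=1}^{n^{(u)}}$$
is known where each one-step transition is such that:
$$y^{(u),k} = f\left(x^{(u),k}, u\right) \quad \text{and} \quad r^{(u),k} = \rho\left(x^{(u),k}, u\right);$$

We assume that every set $\mathcal{F}^{(u)}$ contains at least one element:
$$\forall u \in \mathcal{U}, \qquad n^{(u)} > 0.$$
In the following, we denote by $\mathcal{F}$ the collection of all system transitions:
$$\mathcal{F} = \mathcal{F}^{(1)} \cup \ldots \cup \mathcal{F}^{(m)}.$$
Under those assumptions, batch mode reinforcement learning (BMRL) techniques propose to infer from the sample of one-step system transitions $\mathcal{F}$ a high-performance sequence of actions, i.e. a sequence of actions $(\tilde{u}_0, \ldots, \tilde{u}_{T-1}) \in \mathcal{U}^T$ such that $J_T^{(\tilde{u}_0, \ldots, \tilde{u}_{T-1})}$ is as close as possible to $J_T^*$.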
3.2 Min max Generalization under Lipschitz Continuity Assumptions
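In this section, we state the min max generalization problem that we study in this paper. The formalization was originally proposed in [22].
We first assume that the system dynamics $f$ and the reward function $\rho$ are Lipschitz continuous. There exist finite constants $L_f, L_\rho \in \mathbb{R}$ such that:
$$\forall (x, x') \in \mathcal{X}^2, \ \forall u \in \mathcal{U}, \qquad \| f(x,u) - f(x',u) \| \le L_f \| x - x' \|, \qquad | \rho(x,u) - \rho(x',u) | \le L_\rho \| x - x' \|,$$
where $\| \cdot \|$ denotes the Euclidean norm over the space $\mathcal{X}$. We also assume that two constants $L_f$ and $L_\rho$ satisfying the above-written inequalities are known.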
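Since the sampled transitions are generated by the true dynamics and reward function, any valid pair of Lipschitz constants must be compatible with every pair of transitions that share the same action. The sketch below checks this necessary condition on a sample; the function name `is_lipschitz_compatible` and the tuple layout `(x, r, y)` are assumptions made for illustration only.

```python
import math

def is_lipschitz_compatible(transitions, L_f, L_rho, tol=1e-12):
    """Check pairwise Lipschitz inequalities on the transitions of ONE action.

    `transitions` is a list of (x, r, y) with x and y given as tuples of floats.
    """
    for i, (xi, ri, yi) in enumerate(transitions):
        for xj, rj, yj in transitions[i + 1:]:
            dist = math.dist(xi, xj)
            # Reward values must not differ by more than L_rho times the distance.
            if abs(ri - rj) > L_rho * dist + tol:
                return False
            # Successor states must not differ by more than L_f times the distance.
            if math.dist(yi, yj) > L_f * dist + tol:
                return False
    return True

# Hypothetical one-dimensional sample for a single action.
sample = [((0.0,), 0.0, (1.0,)), ((1.0,), 0.5, (2.0,))]
```

A failed check indicates that the supplied constants are too small for the data; a passed check does not prove that they are valid upper bounds for the unknown system.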
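For a given sequence of actions, one can define the worst possible return that can be obtained by any system whose dynamics $f'$ and reward function $\rho'$ would satisfy the Lipschitz inequalities and that would coincide with the values of the functions $f$ and $\rho$ given by the sample of system transitions $\mathcal{F}$. As shown in [22], this worst possible return can be computed by solving a finite-dimensional optimization problem over $\mathcal{X}^{T-1} \times \mathbb{R}^T$. Intuitively, solving such an optimization problem amounts to determining a most pessimistic trajectory of the system that is still compliant with the sample of data and the Lipschitz continuity assumptions.
More specifically, for a given sequence of actions $(u_0, \ldots, u_{T-1}) \in \mathcal{U}^T$, some given constants $L_f$ and $L_\rho$, a given initial state $x_0 \in \mathcal{X}$ and a given sample of transitions $\mathcal{F}$, this optimization problem writes:
$$(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1})): \qquad \min_{\substack{\hat{\mathbf{r}}_0, \ldots, \hat{\mathbf{r}}_{T-1} \in \mathbb{R} \\ \hat{\mathbf{x}}_0, \ldots, \hat{\mathbf{x}}_{T-1} \in \mathcal{X}}} \ \sum_{t=0}^{T-1} \hat{\mathbf{r}}_t$$
subject to
$$\begin{aligned}
& \left| \hat{\mathbf{r}}_t - r^{(u_t),k_t} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_t - x^{(u_t),k_t} \right\|^2, && \forall t \in \{0, \ldots, T-1\}, \ \forall k_t \in \{1, \ldots, n^{(u_t)}\}, \\
& \left\| \hat{\mathbf{x}}_{t+1} - y^{(u_t),k_t} \right\|^2 \le L_f^2 \left\| \hat{\mathbf{x}}_t - x^{(u_t),k_t} \right\|^2, && \forall t \in \{0, \ldots, T-2\}, \ \forall k_t \in \{1, \ldots, n^{(u_t)}\}, \\
& \left| \hat{\mathbf{r}}_t - \hat{\mathbf{r}}_{t'} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_t - \hat{\mathbf{x}}_{t'} \right\|^2, && \forall t, t' \in \{0, \ldots, T-1\} \text{ such that } u_t = u_{t'}, \\
& \left\| \hat{\mathbf{x}}_{t+1} - \hat{\mathbf{x}}_{t'+1} \right\|^2 \le L_f^2 \left\| \hat{\mathbf{x}}_t - \hat{\mathbf{x}}_{t'} \right\|^2, && \forall t, t' \in \{0, \ldots, T-2\} \text{ such that } u_t = u_{t'}, \\
& \hat{\mathbf{x}}_0 = x_0.
\end{aligned}$$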
Note that, throughout the paper, optimization variables will be written in bold.
The min max approach to generalization aims at identifying which sequence of actions maximizes its worst possible return, that is, which sequence of actions leads to the highest value of $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$.
We focus in this paper on the design of resolution schemes for solving the program $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$. These schemes can afterwards be used for solving the min max problem through exhaustive search over the set of all sequences of actions.
Later in this paper, we will also analyze the computational complexity of this min max generalization problem. When carrying out this analysis, we will assume that all the data of the problem (i.e., $T, \mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}$) are given in the form of rational numbers.
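The exhaustive search over action sequences mentioned above can be sketched as follows, assuming some routine that returns a lower bound on the worst possible return of a given sequence is available. The function `best_sequence` and the placeholder `toy_bound` are hypothetical; in particular `toy_bound` does not implement any of the relaxation schemes of this paper.

```python
from itertools import product

def best_sequence(num_actions, horizon, lower_bound):
    """Return the action sequence whose lower bound is largest.

    Enumerates all num_actions ** horizon sequences, so this is only
    practical for small action sets and short horizons.
    """
    candidates = product(range(num_actions), repeat=horizon)
    return max(candidates, key=lower_bound)

# Placeholder bound used only to exercise the search (NOT a real relaxation):
# it favours sequences that repeat action 1.
def toy_bound(actions):
    return -sum((u - 1) ** 2 for u in actions)

chosen = best_sequence(3, 2, toy_bound)
```

The number of candidate sequences grows exponentially with the horizon, which is one reason why the cost of evaluating each individual bound matters.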
4 The Twostage Case
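In this section, we restrict ourselves to the case where the time horizon contains only two steps, i.e. $T = 2$, which is an important particular case of $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$. Many works in optimal sequential decision making have considered the two-stage case [23, 14], which relates to many applications, such as for instance medical applications where one wants to infer “safe” clinical decision rules from batch collections of clinical data [1, 31, 32, 49].
In Section 4.1, we show that this problem can be decoupled into two subproblems. While the first subproblem is straightforward to solve, we prove in Section 4.2 that the second one is NP-hard, which proves that the two-stage problem as well as the generalized $T$-stage problem are also NP-hard.
Given a two-stage sequence of actions $(u_0, u_1) \in \mathcal{U}^2$, the two-stage version of the problem writes as follows:
$$(\mathcal{P}_2(\mathcal{F}, L_f, L_\rho, x_0, u_0, u_1)): \qquad \min_{\substack{\hat{\mathbf{r}}_0, \hat{\mathbf{r}}_1 \in \mathbb{R} \\ \hat{\mathbf{x}}_0, \hat{\mathbf{x}}_1 \in \mathcal{X}}} \ \hat{\mathbf{r}}_0 + \hat{\mathbf{r}}_1$$
subject to
$$\begin{aligned}
& \left| \hat{\mathbf{r}}_0 - r^{(u_0),k_0} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_0 - x^{(u_0),k_0} \right\|^2, \quad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}, && (1) \\
& \left| \hat{\mathbf{r}}_1 - r^{(u_1),k_1} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_1 - x^{(u_1),k_1} \right\|^2, \quad \forall k_1 \in \{1, \ldots, n^{(u_1)}\}, && (2) \\
& \left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| \hat{\mathbf{x}}_0 - x^{(u_0),k_0} \right\|^2, \quad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}, && (3) \\
& \left| \hat{\mathbf{r}}_0 - \hat{\mathbf{r}}_1 \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_0 - \hat{\mathbf{x}}_1 \right\|^2 \quad \text{if } u_0 = u_1, && (4) \\
& \hat{\mathbf{x}}_0 = x_0. && (5)
\end{aligned}$$
For the sake of simplicity, we will often drop the arguments in the definition of the optimization problem and refer to $(\mathcal{P}_2(\mathcal{F}, L_f, L_\rho, x_0, u_0, u_1))$ as $(\mathcal{P}_2)$. We denote by $B_2^*(\mathcal{F}, L_f, L_\rho, x_0, u_0, u_1)$ the lower bound associated with an optimal solution of $(\mathcal{P}_2)$:
Definition (Optimal Value $B_2^*$). Let $(u_0, u_1) \in \mathcal{U}^2$, and let $(\hat{r}_0^*, \hat{r}_1^*, \hat{x}_0^*, \hat{x}_1^*)$ be an optimal solution to $(\mathcal{P}_2)$. Then,
$$B_2^*(\mathcal{F}, L_f, L_\rho, x_0, u_0, u_1) = \hat{r}_0^* + \hat{r}_1^*.$$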
4.1 Decoupling Stages
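Let $(\mathcal{P}'_2(\mathcal{F}, L_\rho, x_0, u_0))$ and $(\mathcal{P}''_2(\mathcal{F}, L_f, L_\rho, x_0, u_0, u_1))$ be the two following subproblems:
$$(\mathcal{P}'_2): \qquad \min_{\hat{\mathbf{r}}_0 \in \mathbb{R}} \ \hat{\mathbf{r}}_0 \qquad \text{subject to} \qquad \left| \hat{\mathbf{r}}_0 - r^{(u_0),k_0} \right|^2 \le L_\rho^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2, \quad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}, \qquad (6)$$
$$(\mathcal{P}''_2): \qquad \min_{\hat{\mathbf{r}}_1 \in \mathbb{R}, \ \hat{\mathbf{x}}_1 \in \mathcal{X}} \ \hat{\mathbf{r}}_1 \qquad \text{subject to}$$
$$\begin{aligned}
& \left| \hat{\mathbf{r}}_1 - r^{(u_1),k_1} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_1 - x^{(u_1),k_1} \right\|^2, \quad \forall k_1 \in \{1, \ldots, n^{(u_1)}\}, && (7) \\
& \left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2, \quad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}. && (8)
\end{aligned}$$
We show in this section that an optimal solution to $(\mathcal{P}_2)$ can be obtained by solving the two subproblems $(\mathcal{P}'_2)$ and $(\mathcal{P}''_2)$ corresponding to the first stage and the second stage. Indeed, one can see that the stages $t = 0$ and $t = 1$ are theoretically coupled by constraint (4), except in the case where the two actions $u_0$ and $u_1$ are different, for which $(\mathcal{P}_2)$ is trivially decoupled. We prove in the following that, even in the case $u_0 = u_1$, optimal solutions to the two decoupled problems $(\mathcal{P}'_2)$ and $(\mathcal{P}''_2)$ also satisfy constraint (4). Additionally, we provide the solution of $(\mathcal{P}'_2)$.
Theorem 4.1. Let $(u_0, u_1) \in \mathcal{U}^2$. If $\hat{r}_0^*$ is an optimal solution to $(\mathcal{P}'_2)$ and $(\hat{r}_1^*, \hat{x}_1^*)$ is an optimal solution to $(\mathcal{P}''_2)$, then $(\hat{r}_0^*, \hat{r}_1^*, x_0, \hat{x}_1^*)$ is an optimal solution to $(\mathcal{P}_2)$.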

First case: $u_0 \neq u_1$
The constraint (4) drops and the theorem is trivial.

Second case: $u_0 = u_1$
The rationale of the proof is the following. We first relax constraint (4), and consider the two problems $(\mathcal{P}'_2)$ and $(\mathcal{P}''_2)$. Then, we show that optimal solutions of $(\mathcal{P}'_2)$ and $(\mathcal{P}''_2)$ also satisfy constraint (4).
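About $(\mathcal{P}'_2)$
The problem $(\mathcal{P}'_2)$ consists in the minimization of $\hat{\mathbf{r}}_0$ under the intersection of interval constraints. It is therefore straightforward to solve. In particular the optimal solution $\hat{r}_0^*$ lies at the lower value of one of the intervals. Therefore there exists $i \in \{1, \ldots, n^{(u_0)}\}$ such that
$$\hat{r}_0^* = r^{(u_0),i} - L_\rho \left\| x_0 - x^{(u_0),i} \right\|. \qquad (9)$$
Furthermore $\hat{r}_0^*$ must belong to all intervals. We therefore have that
$$\hat{r}_0^* \ge r^{(u_0),k} - L_\rho \left\| x_0 - x^{(u_0),k} \right\|, \quad \forall k \in \{1, \ldots, n^{(u_0)}\}. \qquad (10)$$
In other words,
$$\hat{r}_0^* = \max_{k \in \{1, \ldots, n^{(u_0)}\}} \ r^{(u_0),k} - L_\rho \left\| x_0 - x^{(u_0),k} \right\|.$$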
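About $(\mathcal{P}''_2)$
Again we observe that it is the minimization of $\hat{\mathbf{r}}_1$ under the intersection of interval constraints as well. The sizes of the intervals are however not fixed but determined by the variable $\hat{\mathbf{x}}_1$. If we denote the optimal solution of $(\mathcal{P}''_2)$ by $\hat{r}_1^*$ and $\hat{x}_1^*$, we know that $\hat{r}_1^*$ also lies at the lower value of one of the intervals. Hence there exists $j \in \{1, \ldots, n^{(u_1)}\}$ such that
$$\hat{r}_1^* = r^{(u_1),j} - L_\rho \left\| \hat{x}_1^* - x^{(u_1),j} \right\|. \qquad (11)$$
Furthermore $\hat{r}_1^*$ must belong to all intervals. We therefore have that
$$\hat{r}_1^* \ge r^{(u_1),k} - L_\rho \left\| \hat{x}_1^* - x^{(u_1),k} \right\|, \quad \forall k \in \{1, \ldots, n^{(u_1)}\}. \qquad (12)$$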
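We now discuss two cases depending on the sign of $\hat{r}_0^* - \hat{r}_1^*$.
If $\hat{r}_0^* - \hat{r}_1^* \ge 0$
Using (9) and (12) with index $i$ (which is legitimate since $u_0 = u_1$), we have
$$\hat{r}_0^* - \hat{r}_1^* \le L_\rho \left( \left\| \hat{x}_1^* - x^{(u_0),i} \right\| - \left\| x_0 - x^{(u_0),i} \right\| \right). \qquad (13)$$
Since $\hat{r}_0^* - \hat{r}_1^* \ge 0$, we therefore have
$$\left| \hat{r}_0^* - \hat{r}_1^* \right| \le L_\rho \left( \left\| \hat{x}_1^* - x^{(u_0),i} \right\| - \left\| x_0 - x^{(u_0),i} \right\| \right). \qquad (14)$$
Using the triangle inequality we can write
$$\left\| \hat{x}_1^* - x^{(u_0),i} \right\| \le \left\| \hat{x}_1^* - x_0 \right\| + \left\| x_0 - x^{(u_0),i} \right\|. \qquad (15)$$
Replacing (15) in (14) we obtain
$$\left| \hat{r}_0^* - \hat{r}_1^* \right| \le L_\rho \left\| x_0 - \hat{x}_1^* \right\|,$$
which shows that $\hat{r}_0^*$ and $\hat{r}_1^*$ satisfy constraint (4).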
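If $\hat{r}_0^* - \hat{r}_1^* < 0$
Using (11) and (10) with index $j$, we have
$$\hat{r}_1^* - \hat{r}_0^* \le L_\rho \left( \left\| x_0 - x^{(u_1),j} \right\| - \left\| \hat{x}_1^* - x^{(u_1),j} \right\| \right),$$
and since $\hat{r}_0^* - \hat{r}_1^* < 0$,
$$\left| \hat{r}_0^* - \hat{r}_1^* \right| \le L_\rho \left( \left\| x_0 - x^{(u_1),j} \right\| - \left\| \hat{x}_1^* - x^{(u_1),j} \right\| \right). \qquad (16)$$
Using the triangle inequality we can write
$$\left\| x_0 - x^{(u_1),j} \right\| \le \left\| x_0 - \hat{x}_1^* \right\| + \left\| \hat{x}_1^* - x^{(u_1),j} \right\|, \qquad (17)$$
so that $\left| \hat{r}_0^* - \hat{r}_1^* \right| \le L_\rho \left\| x_0 - \hat{x}_1^* \right\|$, which again shows that $\hat{r}_0^*$ and $\hat{r}_1^*$ satisfy constraint (4).
In both cases $\hat{r}_0^* - \hat{r}_1^* \ge 0$ and $\hat{r}_0^* - \hat{r}_1^* < 0$, we have shown that constraint (4) is satisfied.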
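In the remainder of the paper, we focus on the two subproblems $(\mathcal{P}'_2)$ and $(\mathcal{P}''_2)$ rather than on $(\mathcal{P}_2)$. From the proof of Theorem 4.1 given above, we can directly obtain the solution of $(\mathcal{P}'_2)$:
Corollary 4.1. The solution of the problem $(\mathcal{P}'_2)$ is
$$\hat{r}_0^* = \max_{k \in \{1, \ldots, n^{(u_0)}\}} \ r^{(u_0),k} - L_\rho \left\| x_0 - x^{(u_0),k} \right\|.$$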
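4.2 Complexity of $(\mathcal{P}''_2)$
The problem $(\mathcal{P}'_2)$ being solved, we now focus in this section on the resolution of $(\mathcal{P}''_2)$. In particular, we show that it is NP-hard, even in the particular case where there is only one element in the sample $\mathcal{F}^{(u_1)}$. In this particular case, the problem $(\mathcal{P}''_2)$ amounts to maximizing the distance $\left\| \hat{\mathbf{x}}_1 - x^{(u_1),1} \right\|$ under an intersection of balls, as we show in the following lemma.
If the cardinality of $\mathcal{F}^{(u_1)}$ is equal to $1$:
$$\mathcal{F}^{(u_1)} = \left\{ \left( x^{(u_1),1}, r^{(u_1),1}, y^{(u_1),1} \right) \right\},$$
then the optimal solution to $(\mathcal{P}''_2)$ satisfies
$$\hat{r}_1^* = r^{(u_1),1} - L_\rho \left\| \hat{x}_1^* - x^{(u_1),1} \right\|,$$
where $\hat{x}_1^*$ maximizes $\left\| \hat{\mathbf{x}}_1 - x^{(u_1),1} \right\|$ subject to
$$\left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2, \quad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}.$$
The unique constraint concerning $\hat{\mathbf{r}}_1$ is an interval. Therefore $\hat{r}_1^*$ takes the value of the lower bound of the interval. In order to obtain the lowest such value, the right-hand side of (7) must be maximized under the other constraints.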
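In one dimension the lemma can be evaluated exactly, because each ball is an interval and the point of their intersection farthest from a reference point is one of its two endpoints. The sketch below illustrates this special case only; it assumes that each ball is centered at a sampled successor state with radius equal to the dynamics Lipschitz constant times the distance between the initial state and the sampled state, and all function names are hypothetical. The hardness result of this section concerns the general multidimensional case, which this sketch does not address.

```python
def farthest_point_1d(x_ref, balls):
    """Point of the intersection of 1-D balls farthest from x_ref (None if empty).

    `balls` is a list of (center, radius) pairs, i.e. intervals on the real line.
    """
    lo = max(c - r for c, r in balls)
    hi = min(c + r for c, r in balls)
    if lo > hi:
        return None
    return lo if abs(lo - x_ref) >= abs(hi - x_ref) else hi

def second_stage_bound_1d(x0, sample_u0, single_u1, L_f, L_rho):
    """Second-stage value when the sample of the second action has one element."""
    x1, r1, _y1 = single_u1
    # Assumed ball constraints: |x - y_k| <= L_f * |x0 - x_k| for each first-action sample.
    balls = [(y, L_f * abs(x0 - x)) for x, _r, y in sample_u0]
    x_hat = farthest_point_1d(x1, balls)
    if x_hat is None:
        raise ValueError("sample incompatible with the given L_f")
    return r1 - L_rho * abs(x_hat - x1)

# Hypothetical scalar data: the balls are [1, 3] and [-0.5, 1.5].
value = second_stage_bound_1d(
    0.0, [(1.0, 0.0, 2.0), (-1.0, 0.0, 0.5)], (0.0, 1.0, 0.0), L_f=1.0, L_rho=2.0
)
```

Here the feasible set is the interval [1, 1.5], its farthest point from the sampled state 0 is 1.5, and the resulting value is 1 - 2 * 1.5.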
Note that if the cardinality of $\mathcal{F}^{(u_0)}$ is also equal to $1$, then $(\mathcal{P}''_2)$ can be solved exactly, as we will later show in Corollary 3. But, in the general case where $n^{(u_0)} > 1$, this problem of maximizing a distance under a set of ball constraints is NP-hard, as we now prove. To do so, we introduce the MNBC (for “Max Norm with Ball Constraints”) decision problem:
[MNBC Decision Problem] Given , the MNBC problem is to determine whether there exists such that
and
MNBC is NP-hard.
To prove it, we will do a reduction from the $\{0,1\}$ programming feasibility problem [40]. More precisely, we consider in this proof the $\{-1,1\}$ programming feasibility problem, which is equivalent. The problem is, given an integer matrix $A$ and an integer vector $b$, to find whether there exists $x \in \{-1,1\}^d$ that satisfies $Ax \le b$. This problem is known to be NP-hard and we now provide a polynomial reduction to MNBC.
The dimension is kept the same in both problems. The first step is to define a set of constraints for MNBC such that the only potential feasible solutions are exactly We define
and
For , we define
with and for all and .
Similarly for , we define
with and for all and .
Claim
It is readily verified that any belongs to the above sets.
Consider that belongs to the above sets. Consider an index . Using the constraints defining the sets, we can in particular write
that we can write algebraically
(18)  
(19)  
(20) 
By computing and , we obtain and respectively. This implies that
and the equality is obtained if and only if we have that for all which proves the claim.
It remains to prove that we can encode any linear inequality through a ball constraint. Consider an inequality of the type We assume that and that is even and therefore that there exists no such that We want to show that there exists and such that
(21) 
Let be the intersection point of the hyperplane and the line . Let be defined as follows:
We claim that choosing and allows us to obtain (21). To prove it, we need to show that belongs to the ball if and only if it satisfies the constraint . Let . There are two cases to consider:

Suppose first that .
Since is the closest point to that satisfies , it also implies that any point such that is such that proving that:

Suppose now that and in particular that with (see Figure 2).
Let be the intersection point of the hyperplane and the line . Since form a right triangle with the right angle in and since