# CAQL: Continuous Action Q-Learning

## Abstract

Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which, despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically.

## 1 Introduction

Reinforcement learning (RL) has shown success in a variety of domains such as games (Mnih et al., 2013) and recommender systems (RSs) (Gauci et al., 2018). When the action space is finite, value-based algorithms such as Q-learning (Watkins and Dayan, 1992), which implicitly find a policy by learning the optimal value function, are often very efficient because action optimization can be done by exhaustive enumeration. By contrast, in problems with continuous action spaces (e.g., robotics (Peters and Schaal, 2006)), policy-based algorithms, such as policy gradient (PG) (Sutton et al., 2000; Silver et al., 2014) or cross-entropy policy search (CEPS) (Mannor et al., 2003; Kalashnikov et al., 2018), which directly learn a return-maximizing policy, have proven more practical. Recently, methods such as ensemble critic (Fujimoto et al., 2018) and entropy regularization (Haarnoja et al., 2018) have been developed to improve the performance of policy-based RL algorithms.

Policy-based approaches require a reasonable choice of policy parameterization. In some continuous control problems, Gaussian distributions over actions conditioned on some state representation are used. However, in applications such as RSs, where actions often take the form of high-dimensional item-feature vectors, policies cannot typically be modeled by common action distributions. Furthermore, the admissible action set in RL is constrained in practice, for example, when actions must lie within a specific range for safety (Chow et al., 2018). In RSs, the admissible actions are often random functions of the state (Boutilier et al., 2018). In such cases, it is non-trivial to define policy parameterizations that handle such factors. On the other hand, value-based algorithms are well-suited to these settings, providing a potential advantage over policy methods. Moreover, at least with linear function approximation (Melo and Ribeiro, 2007), under reasonable assumptions, Q-learning converges to optimality, while such optimality guarantees for non-convex policy-based methods are generally limited (Fazel et al., 2018). Empirical results also suggest that value-based methods are more data-efficient and less sensitive to hyper-parameters (Quillen et al., 2018). Of course, with large action spaces, exhaustive action enumeration in value-based algorithms can be expensive; one solution is to represent actions with continuous features (Dulac-Arnold et al., 2015).

The main challenge in applying value-based algorithms to continuous-action domains is selecting optimal actions (both at training and inference time). Previous work in this direction falls into three broad categories. The first solves the inner maximization of the (optimal) Bellman residual loss using global nonlinear optimizers, such as the cross-entropy method (CEM) for QT-Opt (Kalashnikov et al., 2018), gradient ascent (GA) for actor-expert (Lim et al., 2018), and action discretization (Uther and Veloso, 1998; Smart and Kaelbling, 2000; Lazaric et al., 2008). However, these approaches do not guarantee optimality. The second approach restricts the Q-function parameterization so that the optimization problem is tractable. For instance, one can discretize the state and action spaces and use a tabular Q-function representation. However, due to the curse of dimensionality, discretizations must generally be coarse, often resulting in unstable control. Millán et al. (2002) circumvent this issue by averaging discrete actions weighted by their Q-values. Wire-fitting (Gaskett et al., 1999; Baird and Klopf, 1993) approximates Q-values piecewise-linearly over a discrete set of points, chosen to ensure the maximum action is one of the extreme points. The normalized advantage function (NAF) (Gu et al., 2016) constructs the state-action advantage function to be quadratic, hence analytically solvable. Parameterizing the Q-function with an input-convex neural network (Amos et al., 2017) ensures it is concave. These restricted functional forms, however, may degrade performance if the domain does not conform to the imposed structure. The third category replaces optimal Q-values with a “soft” counterpart (Haarnoja et al., 2018): an entropy regularizer ensures that both the optimal Q-function and policy have closed-form solutions. However, the sub-optimality gap of this soft policy scales with the interval and dimensionality of the action space (Neu et al., 2017).

Motivated by the shortcomings of prior approaches, we propose Continuous Action Q-learning (CAQL), a Q-learning framework for continuous actions in which the Q-function is modeled by a generic feed-forward neural network.

## 2 Preliminaries

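We consider an infinite-horizon, discounted Markov decision process (Puterman, 2014) with states $X$, (continuous) action space $A$, reward function $R$, transition kernel $P$, initial state distribution $\beta$ and discount factor $\gamma \in [0, 1)$, all having the usual meaning. A (stationary, Markovian) policy $\pi$ specifies a distribution $\pi(\cdot \mid x)$ over actions to be taken at state $x$. Let $\Pi$ be the set of such policies. The expected cumulative return of $\pi \in \Pi$ is $J(\pi) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid P, R, x_0 \sim \beta, \pi\right]$. An optimal policy $\pi^*$ satisfies $\pi^* \in \arg\max_{\pi \in \Pi} J(\pi)$. The Bellman operator $F[Q](x, a) = R(x, a) + \gamma\, \mathbb{E}_{x' \sim P(\cdot \mid x, a)}\left[\max_{a' \in A} Q(x', a')\right]$ over state-action value function $Q$ has unique fixed point $Q^*(x, a)$ (Puterman, 2014), which is the optimal Q-function $Q^*(x, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \mid x_0 = x, a_0 = a, \pi^*\right]$. An optimal (deterministic) policy $\pi^*$ can be extracted from $Q^*$:

$$\pi^*(a \mid x) = \mathbf{1}\{a = a^*(x)\}, \quad \text{where } a^*(x) \in \arg\max_{a \in A} Q^*(x, a).$$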

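For large or continuous state/action spaces, the optimal Q-function can be approximated, e.g., using a deep neural network (DNN) as in DQN (Mnih et al., 2013). In DQN, the value function $Q_\theta$ is updated using the value label $r + \gamma \max_{a'} Q_{\theta^{\text{target}}}(x', a')$, where $Q_{\theta^{\text{target}}}$ is a target Q-function. Instead of training these weights jointly, $\theta^{\text{target}}$ is updated in a separate iterative fashion using the previous $\theta$ for a fixed number of training steps, or by averaging $\theta^{\text{target}} \leftarrow \tau \theta + (1 - \tau)\theta^{\text{target}}$ for some small momentum weight $\tau \in [0, 1]$ (Mnih et al., 2016). DQN is off-policy: the target is valid no matter how the experience was generated (as long as it is sufficiently exploratory). Typically, the loss is minimized over mini-batches $B$ of past data $(x, a, r, x')$ sampled from a large experience replay buffer (Lin and Mitchell, 1992). One common loss function for training $Q_\theta$ is mean squared Bellman error: $\min_\theta \sum_{i=1}^{|B|} \left(Q_\theta(x_i, a_i) - r_i - \gamma \max_{a'} Q_{\theta^{\text{target}}}(x'_i, a')\right)^2$. Under this loss, RL can be viewed as $\ell_2$-regression of $Q_\theta(\cdot, \cdot)$ w.r.t. target labels $r + \gamma \max_{a'} Q_{\theta^{\text{target}}}(x', a')$. We augment DQN, using double Q-learning for more stable training (Hasselt et al., 2016), whose loss is:

$$\min_\theta \sum_{i=1}^{|B|} \left( r_i + \gamma\, Q_{\theta^{\text{target}}}\!\left(x'_i, \arg\max_{a'} Q_\theta(x'_i, a')\right) - Q_\theta(x_i, a_i) \right)^2. \tag{1}$$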

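A hinge loss can also be used in Q-learning, and has connections to the linear programming (LP) formulation of the MDP (Puterman, 2014). The optimal $Q$-network weights can be specified as: $\min_\theta \frac{1}{|B|} \sum_{i=1}^{|B|} Q_\theta(x_i, a_i) + \lambda \left( r_i + \gamma \max_{a' \in A} Q_\theta(x'_i, a') - Q_\theta(x_i, a_i) \right)_+$, where $\lambda > 0$ is a tunable penalty w.r.t. constraint: $r + \gamma \max_{a' \in A} Q_\theta(x', a') \le Q_\theta(x, a)$, $\forall (x, a, r, x') \in B$. To stabilize training, we replace the $Q$-network of the inner maximization with the target $Q$-network and the optimal Q-value with the double-Q label, giving (see Appendix A for details):

$$\min_\theta \frac{1}{|B|} \sum_{i=1}^{|B|} Q_\theta(x_i, a_i) + \lambda \left( r_i + \gamma\, Q_{\theta^{\text{target}}}\!\left(x'_i, \arg\max_{a'} Q_\theta(x'_i, a')\right) - Q_\theta(x_i, a_i) \right)_+. \tag{2}$$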
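As a minimal sketch of how the $\ell_2$ and hinge objectives differ, the snippet below evaluates both on a toy batch using the double-Q label. The names `double_q_label`, `l2_loss`, `hinge_loss`, and the finite `candidates` set standing in for a max-Q optimizer are illustrative assumptions, as is the use of a batch average in the $\ell_2$ loss.

```python
def double_q_label(r, gamma, next_x, q_online, q_target, candidates):
    # Double-Q label: pick the action with the online network,
    # then evaluate it with the target network.
    a_star = max(candidates, key=lambda a: q_online(next_x, a))
    return r + gamma * q_target(next_x, a_star)


def l2_loss(batch, gamma, q_online, q_target, candidates):
    # Mean squared Bellman error with the double-Q label (batch average is an assumption).
    total = 0.0
    for x, a, r, next_x in batch:
        label = double_q_label(r, gamma, next_x, q_online, q_target, candidates)
        total += (label - q_online(x, a)) ** 2
    return total / len(batch)


def hinge_loss(batch, gamma, lam, q_online, q_target, candidates):
    # Q-value plus a hinge penalty on positive Bellman residuals.
    total = 0.0
    for x, a, r, next_x in batch:
        label = double_q_label(r, gamma, next_x, q_online, q_target, candidates)
        q = q_online(x, a)
        total += q + lam * max(0.0, label - q)
    return total / len(batch)


# Toy Q-functions over scalar states and actions (purely illustrative).
q_online = lambda x, a: -(a - x) ** 2
q_target = lambda x, a: -(a - x) ** 2 + 1.0
batch = [(0.0, 0.0, 1.0, 1.0)]  # (x, a, r, x')
candidates = [-1.0, 0.0, 1.0]

l2_value = l2_loss(batch, 0.5, q_online, q_target, candidates)
hinge_value = hinge_loss(batch, 0.5, 2.0, q_online, q_target, candidates)
```

Note that the hinge loss only penalizes samples whose label exceeds the current Q-value, which is the property that dual filtering later exploits.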

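In this work, we assume the $Q$-function approximation $Q_\theta$ to be a feed-forward network. Specifically, let $Q_\theta$ be a $K$-layer feed-forward NN with state-action input $(x, a)$ (where $a$ lies in a $d$-dimensional real vector space) and hidden layers arranged according to the equations:

$$z_1 = (x, a), \quad \hat z_j = W_{j-1} z_{j-1} + b_{j-1}, \ j = 2, \dots, K, \quad z_j = h(\hat z_j), \ j = 2, \dots, K-1, \quad Q_\theta(x, a) := c^\top \hat z_K, \tag{3}$$

where $(W_j, b_j)$ are the multiplicative and bias weights, $c$ is the output weight of the $Q$-network, $\theta = \left(c, \{(W_j, b_j)\}_{j=1}^{K-1}\right)$ are the weights of the $Q$-network, $\hat z_j$ denotes pre-activation values at layer $j$, and $h(\cdot)$ is the (component-wise) activation function. For simplicity, in the following analysis, we restrict our attention to the case when the activation functions are ReLUs. We also assume that the action space $A$ is a $d$-dimensional $\ell_\infty$-ball $B_\infty(\bar a, \Delta)$ with some radius $\Delta > 0$ and center $\bar a$. Therefore, at any arbitrary state $x \in X$ the max-Q problem can be re-written as $q^*_x := \max_{a \in A} Q_\theta(x, a) = \max \left\{ c^\top \hat z_K : z_1 = (x, a),\ a \in B_\infty(\bar a, \Delta),\ \text{eqs. (3)} \right\}$. While the above formulation is intuitive, the nonlinear equality constraints in the neural network formulation (3) make this problem non-convex and NP-hard (Katz et al., 2017).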
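For concreteness, the forward pass of such a network can be sketched in a few lines of plain Python. This assumes the layer structure described above (ReLU on every hidden layer, a linear read-out of the last pre-activation); the function name `relu_q_network` and the toy weights are hypothetical.

```python
def relu_q_network(x, a, layers, c):
    """Evaluate a feed-forward ReLU Q-network on a state-action pair.

    layers: list of (W, b) pairs; W is a list of rows, b a list of biases.
    c: output weights applied to the last layer's pre-activation values.
    """
    z = list(x) + list(a)  # network input is the concatenated state-action vector
    for idx, (W, b) in enumerate(layers):
        pre = [bias + sum(w * v for w, v in zip(row, z)) for row, bias in zip(W, b)]
        is_last = idx == len(layers) - 1
        # ReLU on hidden layers only; the final layer stays linear.
        z = pre if is_last else [max(0.0, p) for p in pre]
    return sum(ci * zi for ci, zi in zip(c, z))


# Toy network: 1-d state, 1-d action, two hidden units, scalar output.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
]
q_value = relu_q_network([1.0], [2.0], toy_layers, [2.0])
```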

## 3 Continuous Action Q-Learning Algorithm

Policy-based methods (Silver et al., 2014; Fujimoto et al., 2018; Haarnoja et al., 2018) have been widely used to handle continuous actions in RL. However, they suffer from several well-known difficulties, e.g., (i) modeling high-dimensional action distributions, (ii) handling action constraints, and (iii) data-inefficiency. Motivated by earlier work on value-based RL methods, such as QT-Opt (Kalashnikov et al., 2018) and actor-expert (Lim et al., 2018), we propose Continuous Action Q-learning (CAQL), a general framework for continuous-action value-based RL, in which the Q-function is parameterized by a NN (Eq. 3). One novelty of CAQL is the formulation of the “max-Q” problem, i.e., the inner maximization in (1) and (2), as a mixed-integer program (MIP).

The benefit of the MIP formulation is that it guarantees that we find the optimal action (and its true bootstrapped Q-value) when computing target labels (and at inference time). We show empirically that this can induce better performance, especially when the Q-network has sufficient representation power. Moreover, since MIP can readily model linear and combinatorial constraints, it offers considerable flexibility when incorporating complex action constraints in RL. That said, finding the optimal Q-label (e.g., with MIP) is computationally intensive. To alleviate this, we develop several approximation methods to systematically reduce the computational demands of the inner maximization. In Sec. 3.2, we introduce the action function to approximate the $\arg\max$-policy at inference time, and in Sec. 4 we propose three techniques, dynamic tolerance, dual filtering, and clustering, to speed up max-Q computation during training.

### 3.1 Plug-N-Play Max-Q Optimizers

In this section, we illustrate how the max-Q problem, with the Q-function represented by a ReLU network, can be formulated as a MIP, which can be solved using off-the-shelf optimization packages (e.g., SCIP (Gleixner et al., 2018), CPLEX (CPLEX, 2019), Gurobi (Gurobi, 2019)). In addition, we detail how approximate optimizers, specifically, gradient ascent (GA) and the cross-entropy method (CEM), can trade optimality for speed in max-Q computation within CAQL. For ease of exposition, we focus on Q-functions parameterized by a feedforward ReLU network. Extending our methodology (including the MIP formulation) to convolutional networks (with ReLU activation and max pooling) is straightforward (see Anderson et al. (2019)). While GA and CEM can handle generic activation functions beyond ReLU, our MIP requires additional approximations for those that are not piecewise linear.

#### Mixed-Integer Programming (MIP)

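A trained feed-forward ReLU network can be modeled as a MIP by formulating the nonlinear activation function at each neuron with binary constraints. Specifically, for a ReLU with pre-activation function of form $z = \max\{0, w^\top x + b\}$, where $x \in [\ell, u]$ is a $d$-dimensional bounded input, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, and $\ell, u \in \mathbb{R}^d$ are the weights, bias and lower-upper bounds respectively, consider the following set with a binary variable $\zeta$ indicating whether the ReLU is active or not:

$$R(w, b, \ell, u) = \left\{ (x, z, \zeta) \;\middle|\; \begin{array}{l} z \ge w^\top x + b,\quad z \ge 0,\quad z \le w^\top x + b - M^-(1 - \zeta),\quad z \le M^+ \zeta, \\ (x, z, \zeta) \in [\ell, u] \times \mathbb{R} \times \{0, 1\} \end{array} \right\}.$$

In this formulation, both $M^+ = \max_{x \in [\ell, u]} w^\top x + b$ and $M^- = \min_{x \in [\ell, u]} w^\top x + b$ can be computed in linear time in $d$. We assume $M^+ > 0$ and $M^- < 0$, otherwise the function can be replaced by $z = 0$ or $z = w^\top x + b$. These constraints ensure that $z$ is the output of the ReLU: If $\zeta = 0$, then they are reduced to $z = 0 \ge w^\top x + b$, and if $\zeta = 1$, then they become $z = w^\top x + b \ge 0$.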
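The following sketch illustrates a standard big-M encoding of a single ReLU, assuming the four inequalities $z \ge w^\top x + b$, $z \ge 0$, $z \le w^\top x + b - M^-(1-\zeta)$ and $z \le M^+\zeta$. The helper names `big_m_bounds` and `satisfies_relu_set` are hypothetical; the code only checks feasibility of a given point and does not call a MIP solver.

```python
def big_m_bounds(w, b, lo, hi):
    # One pass over the d coordinates: each term w_i * x_i is maximized
    # (resp. minimized) at one of the two interval endpoints.
    m_plus = b + sum(wi * (h if wi > 0 else l) for wi, l, h in zip(w, lo, hi))
    m_minus = b + sum(wi * (l if wi > 0 else h) for wi, l, h in zip(w, lo, hi))
    return m_plus, m_minus


def satisfies_relu_set(w, b, lo, hi, x, z, zeta, tol=1e-9):
    # Check the four big-M inequalities for a candidate (x, z, zeta).
    m_plus, m_minus = big_m_bounds(w, b, lo, hi)
    pre = b + sum(wi * xi for wi, xi in zip(w, x))
    return (
        z >= pre - tol
        and z >= -tol
        and z <= pre - m_minus * (1 - zeta) + tol
        and z <= m_plus * zeta + tol
    )


w, b = [1.0, -2.0], 0.5
lo, hi = [-1.0, -1.0], [1.0, 1.0]
bounds = big_m_bounds(w, b, lo, hi)
```

With these toy weights, an active neuron (pre-activation 1.5) is feasible only with $\zeta = 1$ and $z = 1.5$, while an inactive one (pre-activation $-2.5$) is feasible only with $\zeta = 0$ and $z = 0$.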

This can be extended to the ReLU network in (3) by chaining copies of intermediate ReLU formulations. More precisely, if the ReLU Q-network has $m_j$ neurons in layer $j \in \{2, \dots, K-1\}$, for any given state $x \in X$, the max-Q problem can be reformulated as the following MIP:

$$\begin{aligned} q^*_x = \max \quad & c^\top \hat z_K \\ \text{s.t.} \quad & z_1 = (x, a), \quad a \in B_\infty(\bar a, \Delta), \\ & (z_{j-1}, z_{j,i}, \zeta_{j,i}) \in R(W_{j-1,i}, b_{j-1,i}, \ell_{j-1}, u_{j-1}), \quad j \in \{2, \dots, K-1\},\ i \in \{1, \dots, m_j\}, \\ & \hat z_K = W_{K-1} z_{K-1} + b_{K-1}, \end{aligned} \tag{4}$$

where $\ell_1, u_1$ are the (action) input-bound vectors, whose action components are $\bar a - \Delta$ and $\bar a + \Delta$ and whose state components are fixed at $x$. Since the output layer of the ReLU NN is linear, the MIP objective is linear as well. Here, $W_{j-1,i}$ (the $i$-th row of $W_{j-1}$) and $b_{j-1,i}$ are the weights and bias of neuron $i$ in layer $j$. Furthermore, $\ell_j, u_j$ are interval bounds for the outputs of the neurons in layer $j$ for $j \ge 2$, and computing them can be done via interval arithmetic or other propagation methods (Weng et al., 2018) from the initial action space bounds (see Appendix C for details). As detailed by Anderson et al. (2019), this can be further tightened with additional constraints, and its implementation can be found in the tf.opt package described therein. As long as these bounds are redundant, having these additional box constraints will not affect optimality. We emphasize that the MIP returns provably global optima, unlike GA and CEM. Even when interrupted with stopping conditions such as a time limit, MIP often produces high-quality solutions in practice.
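The interval bounds mentioned above can be obtained with plain interval arithmetic. The sketch below propagates a box of inputs through a small ReLU network and records pre-activation bounds for every layer; `interval_affine` and `propagate_bounds` are illustrative names, and tighter propagation schemes exist.

```python
def interval_affine(W, b, lo, hi):
    # Bounds of W z + b over the box lo <= z <= hi, computed row by row.
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        out_lo.append(bias + sum(w * (l if w >= 0 else h) for w, l, h in zip(row, lo, hi)))
        out_hi.append(bias + sum(w * (h if w >= 0 else l) for w, l, h in zip(row, lo, hi)))
    return out_lo, out_hi


def propagate_bounds(layers, lo, hi):
    # Returns the pre-activation bounds of every layer, in order.
    bounds = []
    for W, b in layers:
        pre_lo, pre_hi = interval_affine(W, b, lo, hi)
        bounds.append((pre_lo, pre_hi))
        # ReLU is monotone, so it maps interval endpoints to endpoints.
        lo = [max(0.0, v) for v in pre_lo]
        hi = [max(0.0, v) for v in pre_hi]
    return bounds


net = [
    ([[1.0, -1.0], [2.0, 0.0]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.5]),
]
layer_bounds = propagate_bounds(net, [-1.0, -1.0], [1.0, 1.0])
```

Bounds computed this way are valid but can be loose; they only need to be sound for the big-M constraints to remain correct.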

In theory, this MIP formulation can be solved in time exponential in the number of ReLUs and polynomial in the input size (e.g., by naively solving an LP for each binary variable assignment). In practice, however, a modern MIP solver combines many different techniques to significantly speed up this process, such as branch-and-bound, cutting planes, preprocessing techniques, and primal heuristics (Linderoth and Savelsbergh, 1999). Versions of this MIP model have been used in neural network verification (Cheng et al., 2017; Lomuscio and Maganti, 2017; Bunel et al., 2018; Dutta et al., 2018; Fischetti and Jo, 2018; Anderson et al., 2019; Tjeng et al., 2019) and analysis (Serra et al., 2018; Kumar et al., 2019), but its application to RL is novel. While Say et al. (2017) also proposed a MIP formulation to solve the planning problem with a non-linear state transition dynamics model learned with a NN, it is different from ours, which solves the max-Q problem.

#### Gradient Ascent

GA (Nocedal and Wright, 2006) is a simple first-order optimization method for finding the (local) optimum of a differentiable objective function, such as a neural network Q-function. At any state $x \in X$, given a “seed” action $a_0$, the optimal action $\arg\max_a Q_\theta(x, a)$ is computed iteratively by $a_{n+1} \leftarrow a_n + \eta \nabla_a Q_\theta(x, a_n)$, where $\eta > 0$ is a step size (either a tunable parameter or computed using back-tracking line search (Nocedal and Yuan, 1998)). This process repeats until convergence, $|Q_\theta(x, a_{n+1}) - Q_\theta(x, a_n)| < \epsilon$, or a maximum iteration count is reached.
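A minimal sketch of this loop is given below for a toy concave objective. The projection onto the action box after each step is an added assumption for handling the $\ell_\infty$-ball constraint, and the function name `gradient_ascent` and the analytic gradient `grad_q` are illustrative.

```python
def gradient_ascent(q, grad_q, a0, center, radius, step=0.1, tol=1e-6, max_iters=200):
    a = list(a0)
    value = q(a)
    for _ in range(max_iters):
        g = grad_q(a)
        # Gradient step followed by clipping to the box [center - radius, center + radius].
        a_next = [
            min(max(ai + step * gi, c - radius), c + radius)
            for ai, gi, c in zip(a, g, center)
        ]
        value_next = q(a_next)
        converged = abs(value_next - value) < tol
        a, value = a_next, value_next
        if converged:
            break
    return a, value


# Toy objective whose unconstrained maximizer (2, -0.3) lies outside the box [-1, 1]^2.
toy_q = lambda a: -(a[0] - 2.0) ** 2 - (a[1] + 0.3) ** 2
toy_grad = lambda a: [-2.0 * (a[0] - 2.0), -2.0 * (a[1] + 0.3)]
ga_action, ga_value = gradient_ascent(toy_q, toy_grad, [0.0, 0.0], [0.0, 0.0], 1.0)
```

On this toy problem the first coordinate is clipped at the box boundary while the second converges to the interior optimum; on a non-concave Q-network the result depends on the seed action.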

#### Cross-Entropy Method

CEM (Rubinstein, 1999) is a derivative-free optimization algorithm. At any given state $x \in X$, it samples a batch of $N$ actions $\{a_i\}_{i=1}^{N}$ from $A$ using a fixed distribution (e.g., a Gaussian) and ranks the corresponding Q-values $\{Q_\theta(x, a_i)\}_{i=1}^{N}$. Using the top $k < N$ actions, it then updates the sampling distribution, e.g., using the sample mean and covariance to update the Gaussian. This is repeated until convergence or a maximum iteration count is reached.
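The sample-rank-refit loop can be sketched as follows. This simplified variant uses a diagonal Gaussian (per-coordinate mean and standard deviation rather than a full covariance) and clips samples to the action box; both choices, as well as the name `cem_maximize` and its default parameters, are assumptions made for illustration.

```python
import random


def cem_maximize(q, center, radius, n_samples=64, n_elite=8, iters=30, seed=0):
    rng = random.Random(seed)
    mean = list(center)
    std = [radius] * len(center)
    for _ in range(iters):
        samples = []
        for _unused in range(n_samples):
            # Sample from the current Gaussian and clip to the action box.
            action = [
                min(max(rng.gauss(m, s), c - radius), c + radius)
                for m, s, c in zip(mean, std, center)
            ]
            samples.append(action)
        # Keep the top-ranked actions and refit the sampling distribution.
        elite = sorted(samples, key=q, reverse=True)[:n_elite]
        columns = list(zip(*elite))
        mean = [sum(col) / n_elite for col in columns]
        std = [
            (sum((v - m) ** 2 for v in col) / n_elite) ** 0.5 + 1e-6
            for col, m in zip(columns, mean)
        ]
    return mean, q(mean)


toy_objective = lambda a: -(a[0] - 2.0) ** 2 - (a[1] + 0.3) ** 2
cem_action, cem_value = cem_maximize(toy_objective, [0.0, 0.0], 1.0)
```

Because the elite set is refit from clipped samples, the returned mean always lies inside the action box.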

### 3.2 Action Function

In traditional Q-learning, the policy $\pi^*$ is “implemented” by acting greedily w.r.t. the learned Q-function: $\pi^*(x) = \arg\max_{a} Q_\theta(x, a)$.

## 4 Accelerating Max-Q Computation

In this section, we propose three methods to speed up the computationally-expensive max-Q solution during training: (i) dynamic tolerance, (ii) dual filtering, and (iii) clustering.

#### Dynamic Tolerance

Tolerance plays a critical role in the stopping condition of nonlinear optimizers. Intuitively, in the early phase of CAQL, when the Q-function estimate has high Bellman error, it may be wasteful to compute a highly accurate max-Q label when a crude estimate can already guide the gradient of CAQL to minimize the Bellman residual. We can speed up the max-Q solver by dynamically adjusting its tolerance $\epsilon$ based on (a) the TD-error, which measures the estimation error of the optimal Q-function, and (b) the training step $t > 0$, which ensures the bias of the gradient (induced by the sub-optimality of the max-Q solver) vanishes asymptotically so that CAQL converges to a stationary point. While relating tolerance with the Bellman residual is intuitive, it is impossible to calculate that without knowing the max-Q label. To resolve this circular dependency, notice that the action function $\pi_w$ approximates the optimal policy, i.e., $\pi_w(x) \approx \arg\max_a Q_\theta(x, a)$. We therefore replace the optimal policy with the action function in the Bellman residual and propose the dynamic tolerance: $\epsilon_t := \frac{1}{|B|} \sum_{i=1}^{|B|} \left| r_i + \gamma Q_{\theta^{\text{target}}_t}(x'_i, \pi_{w_t}(x'_i)) - Q_{\theta_t}(x_i, a_i) \right| \cdot k_1 \cdot k_2^{t}$, where $k_1 > 0$ and $k_2 \in [0, 1)$ are tunable parameters. Under standard assumptions, CAQL with dynamic tolerance converges a.s. to a stationary point (Thm. 1, (Carden, 2014)).
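A tolerance schedule of this kind can be sketched as below, assuming the tolerance is the batch-averaged absolute TD-error (computed with the action function in place of the exact max-Q action) scaled by a constant and an exponentially decaying factor in the training step. The function name `dynamic_tolerance` and the parameter names `k1`, `k2` are hypothetical.

```python
def dynamic_tolerance(td_errors, step, k1=1.0, k2=0.999):
    # td_errors: per-sample TD-errors computed with the action function's action.
    mean_abs_td = sum(abs(e) for e in td_errors) / len(td_errors)
    # A large Bellman error permits a loose tolerance; the k2**step factor
    # shrinks the tolerance as training proceeds.
    return mean_abs_td * k1 * (k2 ** step)


early_tol = dynamic_tolerance([1.0, -3.0], step=0, k1=0.5, k2=0.9)
later_tol = dynamic_tolerance([1.0, -3.0], step=2, k1=0.5, k2=0.9)
```

The returned value would then be passed to the max-Q optimizer as its stopping threshold (or as the optimality gap for a MIP solver).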

#### Dual Filtering

The main motivation of dual filtering is to reduce the number of max-Q problems at each CAQL training step. For illustration, consider the formulation of hinge Q-learning in (2). Denote by $q^*_{x', \theta^{\text{target}}}$ the max-Q label w.r.t. the target Q-network and next state $x'$. The structure of the hinge penalty means the TD-error corresponding to sample $(x, a, x', r)$ is inactive whenever $q^*_{x', \theta^{\text{target}}} \le (Q_\theta(x, a) - r)/\gamma$; this data can be discarded. In dual filtering, we efficiently estimate an upper bound on $q^*_{x', \theta^{\text{target}}}$ using some convex relaxation to determine which data can be discarded before max-Q optimization. Specifically, recall that the main source of non-convexity in (3) comes from the equality constraint of the ReLU activation function at each NN layer. Similar to the MIP formulation, assume we have component-wise bounds $(l_j, u_j)$, $j = 2, \dots, K-1$ on the neurons, such that $l_j \le \hat z_j \le u_j$. The ReLU equality constraint $H(l, u) := \{(h, k) \in \mathbb{R}^2 : h \in [l, u],\ k = [h]_+\}$ can be relaxed using a convex outer-approximation (Wong and Kolter, 2017): $\tilde H(l, u) := \{(h, k) \in \mathbb{R}^2 : k \ge h,\ k \ge 0,\ -uh + (u - l)k \le -ul\}$. We use this approximation to define the relaxed NN equations, which replace the nonlinear equality constraints in (3) with the convex set $\tilde H(l, u)$. We denote the optimal Q-value w.r.t. the relaxed NN as $\tilde q^*_{x'}$, which is by definition an upper bound on $q^*_{x'}$. Hence, the condition $\tilde q^*_{x', \theta^{\text{target}}} \le (Q_\theta(x, a) - r)/\gamma$ is a conservative certificate for checking whether the data $(x, a, x', r)$ is inactive. For further speed-up, we estimate $\tilde q^*_{x'}$ with its dual upper bound $\tilde q_{x'}$, which is obtained from a recursively defined “dual” network (see Appendix C for derivations), and replace the above certificate with an even more conservative one: $\tilde q_{x', \theta^{\text{target}}} \le (Q_\theta(x, a) - r)/\gamma$.

Although dual filtering is derived for hinge Q-learning, it also applies to the $\ell_2$-loss counterpart by replacing the optimal value $q^*_{x'}$ with its dual upper-bound estimate $\tilde q_{x'}$ whenever the verification condition holds (i.e., the TD error is negative). Since the dual estimate is greater than the primal, the modified loss function will be a lower bound of the original in (1), i.e., $|r + \gamma \tilde q_{x'} - Q_\theta(x, a)| \le |r + \gamma q^*_{x'} - Q_\theta(x, a)|$ whenever $r + \gamma \tilde q_{x'} - Q_\theta(x, a) \le 0$, which can stabilize training by reducing over-estimation error.

One can utilize the inactive samples in the action function ($\pi_w$) learning problem by replacing the max-Q label $q^*_{x'}$ with its dual approximation $\tilde q_{x'}$. Since $\tilde q_{x'} \ge q^*_{x'}$, this replacement will not affect optimality.
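The filtering step itself reduces to a simple comparison once an upper bound on the max-Q value is available. The sketch below assumes each sample carries its current Q-value, reward, and some valid upper bound on the next-state max-Q value (from the dual network or any other sound relaxation); the function name `dual_filter` and the tuple layout are hypothetical.

```python
def dual_filter(samples, gamma):
    """Split sample indices into those needing an exact max-Q solve and inactive ones.

    samples: list of (q_xa, reward, upper_bound) tuples, where upper_bound is a
    valid upper bound on the max-Q value at the next state.
    """
    needs_max_q, inactive = [], []
    for idx, (q_xa, reward, upper_bound) in enumerate(samples):
        # If even the upper bound cannot make the hinge term positive,
        # the exact max-Q value cannot either.
        if upper_bound <= (q_xa - reward) / gamma:
            inactive.append(idx)
        else:
            needs_max_q.append(idx)
    return needs_max_q, inactive


to_solve, skipped = dual_filter([(5.0, 1.0, 3.0), (2.0, 1.0, 3.0)], gamma=0.9)
```

The certificate is conservative: a loose upper bound never discards an active sample, it only leaves more samples for the exact solver.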

#### Clustering

To reduce the number of max-Q solves further still, we apply online state aggregation (Meyerson, 2001), which picks a number of centroids from the batch of next states $B'$ as the centers of $p$-metric balls with radius $b > 0$, such that the union of these balls forms a minimum covering of $B'$. Specifically, at training step $t$, denote by $C_t(b) \subseteq B'$ the set of next-state centroids. For each next state $c' \in C_t(b)$, we compute the max-Q value $q^*_{c'} = \max_{a'} Q_{\theta^{\text{target}}}(c', a')$, where $a^*_{c'}$ is the corresponding optimal action. For all remaining next states $x' \in B' \setminus C_t(b)$, we approximate their max-Q values via first-order Taylor series expansion $\hat q_{x'} := q^*_{c'} + \langle \nabla_{x'} q^*_{x'} |_{x' = c'}, (x' - c') \rangle$ in which $c'$ is the closest centroid to $x'$, i.e., $c' \in \arg\min_{c' \in C_t(b)} \|x' - c'\|_p$. By the envelope theorem for arbitrary choice sets (Milgrom and Segal, 2002), the gradient $\nabla_{x'} q^*_{x'}$ is equal to $\nabla_{x'} Q_{\theta^{\text{target}}}(x', a') |_{a' = a^*_{x'}}$. In this approach the cluster radius $b > 0$ controls the number of max-Q computations, which trades complexity for accuracy in Bellman residual estimation. This parameter can either be tuned or adjusted dynamically (similar to dynamic tolerance), e.g., $b_t = c_1 \cdot c_2^{t}$ with hyperparameters $c_1 > 0$ and $c_2 \in [0, 1)$. Analogously, with this exponentially-decaying cluster radius schedule we can argue that the bias of the CAQL gradient (induced by max-Q estimation error due to clustering) vanishes asymptotically, and the corresponding Q-function converges to a stationary point. To combine clustering with dual filtering, we define $B'_{\text{df}} \subseteq B'$ as the batch of next states that are inconclusive after dual filtering, i.e., $B'_{\text{df}} = \{x' \in B' : \tilde q_{x', \theta^{\text{target}}} > (Q_\theta(x, a) - r)/\gamma\}$. Then instead of applying clustering to $B'$ we apply this method onto the refined batch $B'_{\text{df}}$.
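A rough sketch of the two ingredients, centroid selection and the first-order approximation, is shown below. The greedy single-pass covering is a simple heuristic and not necessarily a minimum covering, the Euclidean metric is an assumption, and the names `greedy_centroids` and `taylor_max_q` are illustrative.

```python
import math


def greedy_centroids(points, radius):
    # Single pass: a point becomes a new centroid if no existing centroid covers it.
    centroids = []
    for p in points:
        if all(math.dist(p, c) > radius for c in centroids):
            centroids.append(p)
    return centroids


def taylor_max_q(x, centroid, q_star, grad):
    # First-order expansion of the max-Q value around the centroid, where grad is
    # the gradient of Q w.r.t. the state, evaluated at the centroid's optimal action.
    return q_star + sum(g * (xi - ci) for g, xi, ci in zip(grad, x, centroid))


next_states = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0)]
centers = greedy_centroids(next_states, radius=0.2)
approx_value = taylor_max_q((0.1, 0.0), (0.0, 0.0), q_star=2.0, grad=(1.0, -1.0))
```

Here only two of the four next states would require an exact max-Q solve; the other two reuse their centroid's solution.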

Dynamic tolerance not only speeds up training, but also improves CAQL’s performance (see Tables 4 and 5); thus, we recommend using it by default. Dual filtering and clustering both trade off training speed with performance. These are practical options—with tunable parameters—that allow practitioners to explore their utility in specific domains.

## 5 Experiments on MuJoCo Benchmarks

To illustrate the effectiveness of CAQL, we (i) compare several CAQL variants with several state-of-the-art RL methods on multiple domains, and (ii) assess the trade-off between max-Q computation speed and policy quality via ablation analysis.

#### Comparison with Baseline RL Algorithms

We compare CAQL with four baseline methods: DDPG (Silver et al., 2014), TD3 (Fujimoto et al., 2018), and SAC (Haarnoja et al., 2018), three popular policy-based deep RL algorithms, and NAF (Gu et al., 2016), a value-based method using an action-quadratic Q-function. We train CAQL using three different max-Q optimizers, MIP, GA, and CEM. Note that the CAQL-CEM variant is similar to QT-Opt (Kalashnikov et al., 2018) and CAQL-GA reflects some aspects of actor-expert (Lim et al., 2018). These CAQL variants allow assessment of the degree to which policy quality is impacted by Q-learning with optimal Bellman residual (using MIP) rather than an approximation (using GA or CEM), at the cost of steeper computation. To match the implementations of the baselines, we use the $\ell_2$ loss when training CAQL. Further ablation analysis on CAQL with $\ell_2$ loss vs. hinge loss is provided in Appendix E.

We evaluate CAQL on one classical control benchmark (Pendulum) and five MuJoCo benchmarks (Hopper, Walker2D, HalfCheetah, Ant, Humanoid).

For the more difficult MuJoCo environments (i.e., Ant, HalfCheetah, Humanoid), the number of training steps is set to , while for simpler ones (i.e., Pendulum, Hopper, Walker2D), it is set to . Policy performance is evaluated every training iterations, using a policy with no exploration. Each measurement is an average return over episodes, each generated using a separate random seed. To smooth learning curves, data points are averaged over a sliding window of size . Similar to the setting of Lim et al. (2018), CAQL measurements are based on trajectories that are generated by the learned action function instead of the optimal action w.r.t. the Q-function.

Table 1 and Figure 1 show the average return of CAQL and the baselines under the best hyperparameter configurations. CAQL significantly outperforms NAF on most benchmarks, as well as DDPG, TD3, and SAC on of benchmarks. Of all the CAQL policies, those trained using MIP are among the best performers in low-dimensional benchmarks (e.g., Pendulum and Hopper). This verifies our conjecture about CAQL: Q-learning with optimal Bellman residual (using MIP) performs better than using approximation (using GA, CEM) when the Q-function has sufficient representation power (which is more likely in low-dimensional tasks). Moreover, CAQL-MIP policies have slightly lower variance than those trained with GA and CEM on most benchmarks. Table 2 and Figure 2 show summary statistics of the returns of CAQL and the baselines on all 320 configurations () and illustrate the sensitivity to hyperparameters of each method. CAQL is least sensitive in 11 of 14 tasks, and policies trained using MIP optimization, specifically, are best in 6 of 14 tasks. This corroborates the hypothesis that value-based methods are generally more robust to hyperparameters than their policy-based counterparts. Table 9 in Appendix E.1 compares the speed (in terms of average elapsed time) of various max-Q solvers (MIP, GA, and CEM), with MIP clearly the most computationally intensive.

We note that CAQL-MIP suffers from performance degradation in several high-dimensional environments with large action ranges (e.g., Ant [-0.25, 0.25] and Humanoid [-0.25, 0.25]). In these experiments, its performance is even worse than that of CAQL-GA or CAQL-CEM. We speculate that this is due to the fact that the small ReLU NN () doesn’t have enough representation power to accurately model the Q-functions in more complex tasks, and therefore optimizing for the true max-Q value using an inaccurate function approximation impedes learning.

We also test CAQL using the standard MuJoCo 1000-step episode length, using gradient ascent as the optimizer, and a Q-function parameterized with a feedforward ReLU network for Hopper and with for the remaining benchmarks. CAQL-GA is trained using dynamic tolerance and an action function but without dual filtering or clustering. Figure 6 in Appendix E shows that CAQL-GA performs better than, or similar to, the best of the baseline methods, except on Hopper [-0.25, 0.25], where SAC performed best but suffers from very high performance variance.

#### Ablation Analysis

We now study the effects of using dynamic tolerance, dual filtering, and clustering on CAQL via two ablation analyses. For simplicity, we experiment on standard benchmarks (with full action ranges), and primarily test CAQL-GA using an $\ell_2$ loss. Default values on tolerance and maximum iteration are 1e-6 and 200, respectively.

Table 3 shows how reducing the number of max-Q problems using dual filtering and clustering affects performance of CAQL. Dual filtering (DF) manages to reduce the number of max-Q problems (from to across different benchmarks), while maintaining similar performance with the unfiltered CAQL-GA. On top of dual filtering we apply clustering (C) to the set of inconclusive next states , in which the degree of approximation is controlled by the cluster radius. With a small cluster radius (e.g., ), clustering further reduces max-Q solves without significantly impacting training performance (and in some cases it actually improves performance), though further increasing the radius would significantly degrade performance. To illustrate the full trade-off of max-Q reduction versus policy quality, we also include the Dual method, which eliminates all max-Q computation with the dual approximation. Table 4 shows how dynamic tolerance influences the quality of CAQL policies. Compared with the standard algorithm, with a large tolerance () GA achieves a notable speed up (with only step per max-Q optimization) in training but incurs a loss in performance. GA with dynamic tolerance attains the best of both worlds: it significantly reduces inner-maximization steps (from to across different problems and initial settings), while achieving good performance.

Additionally, Table 5 shows the results of CAQL-MIP with dynamic tolerance (i.e., optimality gap). This method significantly reduces both median and variance of the MIP elapsed time, while having better performance. Dynamic tolerance eliminates the high latency in MIP observed in the early phase of training (see Figure 3).

Env. [Action range] | MIP + Tol(1e-4) | MIP + DTol(1,1e-4) |
---|---|---|
HalfCheetah [-0.5, 0.5] | 718.6 ± 199.9 (Med(): 263.5, SD(): 88.269) | 764.5 ± 132.9 (Med(): 118.5, SD(): 75.616) |
Ant [-0.1, 0.1] | 402.3 ± 27.4 (Med(): 80.7, SD(): 100.945) | 404.9 ± 27.7 (Med(): 40.3, SD(): 24.090) |
Ant [-0.25, 0.25] | 413.1 ± 60.0 (Med(): 87.6, SD(): 160.921) | 424.9 ± 60.9 (Med(): 62.0, SD(): 27.646) |
Humanoid [-0.1, 0.1] | 405.7 ± 112.5 (Med(): 145.7, SD(): 27.381) | 475.0 ± 173.4 (Med(): 29.1, SD(): 10.508) |
Humanoid [-0.25, 0.25] | 460.2 ± 143.2 (Med(): 71.2, SD(): 45.763) | 410.1 ± 174.4 (Med(): 39.7, SD(): 11.088) |

## 6 Conclusions and Future Work

We proposed Continuous Action Q-learning (CAQL), a general framework for handling continuous actions in value-based RL, in which the Q-function is parameterized by a neural network. While generic nonlinear optimizers can be naturally integrated with CAQL, we illustrated how the inner maximization of Q-learning can be formulated as mixed-integer programming when the Q-function is parameterized with a ReLU network. CAQL (with action function learning) is a general Q-learning framework that includes many existing value-based methods such as QT-Opt and actor-expert. Using several benchmarks with varying degrees of action constraint, we showed that the policy learned by CAQL-MIP generally outperforms those learned by CAQL-GA and CAQL-CEM; and CAQL is competitive with several state-of-the-art policy-based RL algorithms, and often outperforms them (and is more robust) in heavily-constrained environments. Future work includes: extending CAQL to the full batch learning setting, in which the optimal Q-function is trained using only offline data; speeding up the MIP computation of the max-Q problem to make CAQL more scalable; and applying CAQL to real-world RL problems.

## Appendix A Hinge Q-learning

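Consider an MDP with states $X$, actions $A$, transition probability function $P$, discount factor $\gamma \in [0, 1)$, reward function $R$, and initial state distribution $\beta$. We want to find an optimal $Q$-function by solving the following optimization problem:

$$\min_{Q} \sum_{x \in X,\, a \in A} p(x, a)\, Q(x, a) \quad \text{s.t.} \quad Q(x, a) \ge R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a) \max_{a' \in A} Q(x', a'), \quad \forall x \in X,\ a \in A. \tag{5}$$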

The formulation is based on the LP formulation of MDP (see Puterman (2014) for more details). Here the distribution $p(x, a)$ is given by the data-generating distribution of the replay buffer $B$. (We assume that the replay buffer is large enough such that it consists of experience from almost all state-action pairs.) It is well-known that one can transform the above constrained optimization problem into an unconstrained one by applying a penalty-based approach (to the constraints). For simplicity, here we stick with a single constant penalty parameter $\lambda \ge 0$ (instead of going for a state-action Lagrange multiplier and maximizing that), and a hinge penalty function $(\cdot)_+$. With a given penalty hyper-parameter $\lambda \ge 0$ (that can be separately optimized), we propose finding the optimal $Q$-function by solving the following optimization problem:

$$\min_{Q} \mathbb{E}_{(x, a) \sim p} \left[ Q(x, a) + \lambda \left( R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a) \max_{a' \in A} Q(x', a') - Q(x, a) \right)_+ \right]. \tag{6}$$

Furthermore, recall that in many off-policy and offline RL algorithms (such as DQN), samples in the form of $\{(x_i, a_i, r_i, x'_i)\}_{i=1}^{N}$ are independently drawn from the replay buffer, and instead of optimizing the original objective function, one goes for its unbiased sample average approximation (SAA). However, viewing from the objective function of problem (6), finding an unbiased SAA for this problem might be challenging, due to the non-linearity of the hinge penalty function $(\cdot)_+$. Therefore, alternatively we turn to study the following unconstrained optimization problem:

$$\min_{Q} \mathbb{E}_{(x, a) \sim p,\ x' \sim P(\cdot \mid x, a)} \left[ Q(x, a) + \lambda \left( R(x, a) + \gamma \max_{a' \in A} Q(x', a') - Q(x, a) \right)_+ \right]. \tag{7}$$

Using Jensen’s inequality for convex functions, one can see that the objective function in (7) is an upper bound of that in (6). Equality holds when the transition function is deterministic. (This is similar to the argument of the PCL algorithm.) Jensen’s inequality therefore justifies optimization problem (7) as a valid upper-bound surrogate for problem (6).
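The direction of this inequality can be checked numerically on a toy next-state distribution. In the sketch below, `residuals` are hypothetical per-next-state Bellman residuals and `probs` their transition probabilities; the hinge of the expectation corresponds to the penalty term in (6) and the expectation of the hinge to the one in (7).

```python
def hinge(v):
    return max(0.0, v)


def hinge_of_mean(residuals, probs):
    # Penalty with the expectation over next states taken inside the hinge.
    return hinge(sum(p * v for p, v in zip(probs, residuals)))


def mean_of_hinge(residuals, probs):
    # Penalty with the expectation taken outside the hinge (sample-friendly form).
    return sum(p * hinge(v) for p, v in zip(probs, residuals))


stochastic_inner = hinge_of_mean([-1.0, 0.5], [0.5, 0.5])
stochastic_outer = mean_of_hinge([-1.0, 0.5], [0.5, 0.5])
deterministic_inner = hinge_of_mean([0.5], [1.0])
deterministic_outer = mean_of_hinge([0.5], [1.0])
```

With two equally likely next states the outer form is strictly larger, while with a single (deterministic) next state the two coincide.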

Recall that $p(x, a)$ is the data-generation distribution of the replay buffer $B$. The unbiased SAA of problem (7) is therefore given by

$$\min_{Q} \frac{1}{N} \sum_{i=1}^{N} Q(x_i, a_i) + \lambda \left( r_i + \gamma \max_{a' \in A} Q(x'_i, a') - Q(x_i, a_i) \right)_+, \tag{8}$$

where $\{(x_i, a_i, r_i, x'_i)\}_{i=1}^{N}$ are the $N$ samples drawn independently from the replay buffer. In the following, we will find the optimal $Q$ function by solving this SAA problem. In general when the state and action spaces are large/uncountable, instead of solving the $Q$-function exactly (as in the tabular case), we turn to approximate the $Q$-function with its parametrized form $Q_\theta$, and optimize the set of real weights $\theta$ (instead of $Q$) in problem (8).

## Appendix B Continuous Action Q-learning Algorithm

## Appendix C Details of Dual Filtering

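Recall that the Q-function NN has a nonlinear activation function, which can be viewed as a nonlinear equality constraint, according to the formulation in (3). To tackle this constraint, Wong and Kolter (2017) proposed a convex relaxation of the ReLU non-linearity. Specifically, first, they assume that for given $x' \in X$ and $a' \in B_\infty(\bar a, \Delta)$ such that $z_1 = (x', a')$, there exists a collection of component-wise bounds $(l_j, u_j)$, $j = 2, \dots, K-1$ such that $l_j \le \hat z_j \le u_j$. As long as the bounds are redundant, adding these constraints into the primal problem does not affect the optimal value. Second, the ReLU non-linear equality constraint is relaxed using a convex outer-approximation. In particular, for a scalar input $h$ within the real interval $[l, u]$, the exact ReLU non-linearity acting on $h$ is captured by the set

$$H(l, u) := \{(h, k) \in \mathbb{R}^2 \mid h \in [l, u],\ k = [h]_+\}.$$

Its convex outer-approximation is given by:

$$\tilde H(l, u) := \{(h, k) \in \mathbb{R}^2 \mid k \ge h,\ k \ge 0,\ -uh + (u - l)k \le -ul\}. \tag{9}$$

Analogously to (3), define the relaxed NN equations as:

$$z_1 = (x', a'), \quad a' \in B_\infty(\bar a, \Delta), \quad \hat z_j = W_{j-1} z_{j-1} + b_{j-1}, \quad j = 2, \dots, K, \tag{10a}$$

$$(\hat z_j, z_j) \in \tilde H(l_j, u_j), \quad j = 2, \dots, K-1. \tag{10b}$$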
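As a quick sanity check of the ReLU relaxation, the sketch below tests membership in the standard triangle outer-approximation for a scalar pre-activation $h \in [l, u]$ with $l < 0 < u$, i.e., the region bounded by $k \ge h$, $k \ge 0$, and the chord through $(l, 0)$ and $(u, u)$. The function name `in_relu_relaxation` is hypothetical.

```python
def in_relu_relaxation(h, k, l, u, tol=1e-9):
    # Triangle relaxation: above both ReLU pieces, below the chord
    # joining (l, 0) and (u, u).
    return k >= h - tol and k >= -tol and -u * h + (u - l) * k <= -u * l + tol


l_bound, u_bound = -1.0, 2.0
# Every exact ReLU point with h in [l, u] lies inside the relaxation.
exact_points_ok = all(
    in_relu_relaxation(h, max(0.0, h), l_bound, u_bound)
    for h in [-1.0, -0.5, 0.0, 1.0, 2.0]
)
# The relaxation also admits points that are not on the ReLU graph.
slack_point_ok = in_relu_relaxation(0.0, 0.5, l_bound, u_bound)
outside_point_ok = in_relu_relaxation(0.0, 0.7, l_bound, u_bound)
```

Because the relaxed set contains every exact ReLU point, maximizing the Q-value over it can only increase the optimum, which is why the relaxed value serves as an upper bound.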