Exploration versus exploitation in reinforcement learning: a stochastic control approach
Abstract
We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best tradeoff between exploration of a black box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a resurrection of the classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear–quadratic (LQ) case and deduce that the optimal control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, exploitation and exploration are reflected, respectively, by the mean and variance of the Gaussian distribution. We also find that a more random environment contains more learning opportunities, in the sense that less exploration is needed, other things being equal. As the weight of exploration decays to zero, we prove the convergence of the solution of the entropy-regularized LQ problem to that of the classical LQ problem. Finally, we characterize the cost of exploration, which is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate in the LQ case.
Key words. Reinforcement learning, exploration, exploitation, entropy regularization, stochastic control, linear–quadratic, Gaussian.
First draft: March 2018
This draft: December 2018
1 Introduction
Reinforcement learning (RL) is currently one of the most active and fast-developing subareas in machine learning. It has been successfully applied in recent years to solve large-scale, complex real-world decision-making problems, including playing perfect-information board games such as Go (AlphaGo/AlphaGo Zero, Silver et al. (2016), Silver et al. (2017)), achieving human-level performance on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et al. (2016)). An RL agent does not prespecify a structural model or a family of models, but instead gradually learns the best (or near-best) strategies based on trial and error, through interactions with the random (black box) environment and incorporation of the responses of these interactions, in order to improve the overall performance. This is a “kill two birds with one stone” case: the agent’s actions (controls) serve both as a means to explore (learn) and a way to exploit (optimize). Since exploration is inherently costly in terms of both resources and time, a natural and crucial question in RL is to address the dichotomy between exploration of uncharted territory and exploitation of existing knowledge. Such a question exists both in the stateless RL settings represented by the multi-armed bandit problem, and in the more general multi-state RL settings with delayed reinforcement (Sutton and Barto (2018), Kaelbling et al. (1996)). More specifically, the agent must balance between greedily exploiting what has been learned so far to choose actions that yield higher near-term rewards, and continually exploring the environment to acquire more information to potentially achieve long-term benefits. Extensive studies have been conducted to find optimal strategies for trading off exploitation and exploration.^4

^4 For the multi-armed bandit problem, well-known strategies include the Gittins index approach (Gittins (1974)), Thompson sampling (Thompson (1933)), and the upper confidence bound algorithm (Auer et al. (2002)), with sound theoretical optimality established (Russo and Van Roy (2013, 2014)). For general RL problems, various efficient exploration methods have been proposed that have been proved to induce low sample complexity (see, e.g., Brafman and Tennenholtz (2002), Strehl and Littman (2008), Strehl et al. (2009)), among other advantages.
However, most of the contributions to balancing exploitation and exploration do not include exploration formally as a part of the optimization objective; the attention has mainly focused on solving the classical optimization problem of maximizing the accumulated rewards, while exploration is typically treated separately as an ad hoc, exogenously chosen component, rather than being endogenously derived as a part of the solution to the overall RL problem. The recently proposed discrete-time entropy-regularized (also termed “entropy-augmented” or “softmax”) RL formulation, on the other hand, explicitly incorporates exploration into the optimization objective, with a tradeoff weight imposed on the entropy of the exploration strategy (Ziebart et al. (2008), Nachum et al. (2017a), Fox et al. (2015)). An exploratory distribution with a greater entropy signifies a higher level of exploration, and is hence more favorable on the exploration front. The extreme case of a Dirac measure having minimal entropy implies no exploration, reducing to the case of classical optimization with complete knowledge about the underlying model. Recent works have been devoted to designing various algorithms to solve the entropy-regularized RL problem, with numerical experiments demonstrating remarkable robustness and multimodal policy learning (Haarnoja et al. (2017), Haarnoja et al. (2018)).
In this paper, we study the tradeoff between exploration and exploitation for RL in a continuous-time setting with both continuous control (action) and state (feature) spaces.^5 Such a continuous-time formulation is especially appealing if the agent can interact with the environment at ultra-high frequency, aided by modern computing resources.^6 Meanwhile, once cast in continuous time, an in-depth and comprehensive study of the RL problem becomes possible and leads to elegant and insightful results, thanks in no small measure to the tools of stochastic calculus and differential equations.

^5 The terms “feature” and “action” are typically used in the RL literature, whose counterparts in the control literature are “state” and “control”, respectively. Since this paper uses the control approach to study RL, we will use these terms interchangeably whenever there is no confusion.

^6 A notable example is high-frequency stock trading.
Our first main contribution is to propose an entropy-regularized reward function involving the differential entropy for probability distributions over the continuous action space, and to motivate and devise an “exploratory formulation” for the state dynamics that captures repetitive learning under exploration in the continuous-time limit. Existing theoretical works on exploration mainly concentrate on the analysis at the algorithmic level, e.g., proving convergence of the proposed exploration algorithms to the solutions of the classical optimization problems (see, e.g., Singh et al. (2000), Jaakkola et al. (1994)), yet they rarely examine how learning algorithms significantly change the underlying dynamics. Indeed, exploration not only substantially enriches the space of control strategies (from that of Dirac measures to that of all possible probability distributions) but also, as a result, enormously expands the reachable space of states. This, in turn, changes the underlying state transitions and system dynamics.
We show that our exploratory formulation can account for the effects of learning in both the rewards received and the state transitions observed from the interactions with the environment. It thus unearths the important characteristics of learning at a more refined and in-depth level, beyond the theoretical analysis of mere learning algorithms. Intriguingly, this formulation of the state dynamics coincides with the relaxed control framework in classical control theory (El Karoui et al. (1987), Kurtz and Stockbridge (1998, 2001)), which was motivated by entirely different reasons. Relaxed controls were introduced mainly to deal with the theoretical question of whether an optimal control exists. The approach is essentially a randomization device to convexify the universe of control strategies. To the best of our knowledge, our paper is the first to bring back the fundamental ideas and formulation of relaxed control, guided by a practical motivation: exploration and learning.
We then carry out a complete analysis of the continuous-time entropy-regularized RL problem, assuming that the original dynamic system is linear in both control and state and the original reward function is quadratic in the two. This type of linear–quadratic (LQ) problem has occupied the center stage of research in classical control theory for its elegant solutions and its ability to approximate more general nonlinear problems. One of the most important conceptual contributions of this paper is to show that the optimal control distribution for balancing exploitation and exploration is Gaussian. As is well known, a pure exploitation optimal distribution is Dirac, and a pure exploration optimal distribution is uniform. Our results reveal that the Gaussian is the right choice if one seeks a balance between those two extremes. Moreover, we find that the mean of this optimal distribution is a feedback of the current state independent of the intended exploration level, whereas the variance is a linear function of the entropy regularization weight (also called the “temperature parameter”) irrespective of the current state. This result highlights a separation between exploitation and exploration: the former is reflected in the mean and the latter in the variance of the final optimal actions.
There is yet another intriguing result. All other things being equal, the more volatile the original dynamic system is, the smaller the variance of the optimal Gaussian distribution is. Conceptually, this implies that an environment reacting more aggressively to an action in fact contains more learning opportunities and hence is less costly for learning.
Another contribution of the paper is to establish the connection between the solvability of the exploratory LQ problem and that of the classical LQ problem. We prove that as the exploration weight in the former decays to zero, the optimal Gaussian control distribution and its value function converge respectively to the optimal Dirac measure and the value function of the latter, a desirable result for practical learning purposes.
Finally, we observe that, beyond LQ problems, the Gaussian distribution remains, formally, optimal for a much larger class of control problems, namely, problems with drift and volatility linear in control, and reward functions linear or quadratic in control. Such a family of problems can be seen as the local linear–quadratic approximation to more general stochastic control problems whose state dynamics are linearized in the control variables and whose reward functions are approximated locally by quadratic functions of the controls (Todorov and Li (2005), Li and Todorov (2007)). Note that although such an iterative LQ approximation generally has different parameters at different local state–action pairs, our result on the optimality of the Gaussian distribution under the exploratory LQ framework still holds at any local point, and therefore justifies, from a stochastic control perspective, why the Gaussian distribution is commonly used in practice for exploration (Haarnoja et al. (2017), Haarnoja et al. (2018), Nachum et al. (2017b)), beyond its simplicity for sampling.
The paper is organized as follows. In section 2, we motivate and propose the relaxed stochastic control problem involving an exploratory state dynamics and an entropy-regularized reward function. We then present the Hamilton–Jacobi–Bellman (HJB) equation and the optimal control distribution for general entropy-regularized stochastic control problems in section 3. In section 4, we study the special LQ problem in both the state-independent and state-dependent reward cases, corresponding respectively to the multi-armed bandit problem and the general RL problem in discrete time. We discuss the connections between the exploratory LQ problem and the classical LQ problem in section 5, establish the equivalence of the solvability of the two and the convergence result for vanishing exploration, and characterize the cost of engaging exploration. Section 6 concludes. Some technical proofs are relegated to appendices.
2 An Entropy-Regularized Stochastic Control Problem
We introduce an entropy-regularized stochastic control problem and provide its motivation in the context of RL.
Consider a filtered probability space (Ω, F, {F_t}_{t≥0}, ℙ) in which we define an {F_t}-adapted Brownian motion W = {W_t, t ≥ 0}. An “action space” U is given, representing the constraint on an agent’s decisions (“controls” or “actions”). An admissible control u = {u_t, t ≥ 0} is an {F_t}-adapted measurable process taking values in U. Denote by A^cl the set of all admissible controls.
The classical stochastic control problem is to control the state (or “feature”) dynamics
dx^u_t = b(x^u_t, u_t) dt + σ(x^u_t, u_t) dW_t, t > 0; x^u_0 = x,  (1)
so as to achieve the maximum expected total discounted reward represented by the value function
V^cl(x) := sup_{u ∈ A^cl} E[ ∫_0^∞ e^{−ρt} r(x^u_t, u_t) dt | x^u_0 = x ],  (2)

where r is the reward function and ρ > 0 is the discount rate.
In the classical setting, where the model is fully known (namely, when the functions b and σ are fully specified) and dynamic programming is applicable, the optimal control can be derived and represented as a deterministic mapping from the current state to the action space, u*_t = u*(x*_t), where u*(·) is the optimal feedback policy (or “law”). This feedback policy is derived at t = 0 and will be carried out through t ∈ (0, ∞).
In contrast, under the RL setting, when the underlying model is not known and therefore dynamic learning is needed, the agent employs exploration to interact with and learn the unknown environment through trial and error. This exploration can be modelled by a distribution of controls π = {π_t, t ≥ 0} over the control space U from which each “trial” is sampled. We can therefore extend the notion of controls to distributions.^7 The agent executes a control distribution π for N rounds over the same time horizon, while at each round, a classical control is sampled from the distribution π. The reward of such a policy becomes accurate enough when N is large. This procedure, known as policy evaluation, is considered a fundamental element of most RL algorithms in practice (see, e.g., Sutton and Barto (2018)). Hence, for evaluating such a policy distribution in our continuous-time setting it is necessary to consider the limiting situation as N → ∞.

^7 A classical control u = {u_t, t ≥ 0} can be regarded as a Dirac distribution (or “measure”) π = {π_t, t ≥ 0} where π_t(·) = δ_{u_t}(·). In a similar fashion, a feedback policy u_t = u(x_t) can be embedded as a Dirac measure π_t(·) = δ_{u(x_t)}(·), parameterized by the current state x_t.
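The policy-evaluation step described above can be sketched numerically: average the rewards of N controls sampled independently from a fixed distribution, and observe the estimate stabilize as N grows. The quadratic reward r(u) = −(u − 1)² and the Gaussian policy below are illustrative assumptions, not objects from the paper:

```python
import random


def evaluate_policy(reward, sample_control, n_rounds, seed=0):
    """Monte Carlo policy evaluation: average the rewards of n_rounds
    controls sampled independently from the control distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        total += reward(sample_control(rng))
    return total / n_rounds


# Toy reward r(u) = -(u - 1)^2 and Gaussian policy N(1, 0.5^2);
# the exact mean reward is -Var(u) = -0.25.
reward = lambda u: -(u - 1.0) ** 2
policy = lambda rng: rng.gauss(1.0, 0.5)

small = evaluate_policy(reward, policy, n_rounds=100)
large = evaluate_policy(reward, policy, n_rounds=200_000)
# The estimate with more rounds should be closer to the exact value -0.25.
```

By the law of large numbers, the estimate with 200,000 rounds is far closer to the exact mean reward than the one with 100 rounds, mirroring the role of large N in the policy-evaluation argument above.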
In order to quickly capture the essential idea, let us first examine the special case when the reward depends only on the control, namely, r(x, u) ≡ r(u). One then considers N identical independent copies of the control problem in the following way: at round i, i = 1, …, N, a control u^i is sampled under the (possibly random) control distribution π, and executed for its corresponding copy of the control problem (1)–(2). Then, at each fixed time t, from the law of large numbers and under certain mild technical conditions, it follows that the average reward over [t, t + Δt], with Δt small enough, should satisfy, as N → ∞,

(1/N) Σ_{i=1}^N (1/Δt) ∫_t^{t+Δt} e^{−ρs} r(u^i_s) ds → (1/Δt) E[ ∫_t^{t+Δt} e^{−ρs} ∫_U r(u) π_s(u) du ds ].
For a general reward r which depends on the state, we first need to describe how exploration might alter the state dynamics (1), by defining appropriately its “exploratory” version. For this, we look at the effect of repetitive learning under a given control distribution π for N rounds. Let W^i, i = 1, …, N, be independent sample paths of the Brownian motion W, and X^i, i = 1, …, N, be the copies of the state process respectively under the controls u^i, i = 1, …, N, each sampled from π. Then the increments of these state process copies are, for i = 1, …, N,
X^i_{t+Δt} − X^i_t ≈ b(X^i_t, u^i_t) Δt + σ(X^i_t, u^i_t) (W^i_{t+Δt} − W^i_t), t ≥ 0.  (3)
Then, each such process X^i, i = 1, …, N, can be viewed as an independent sample from the exploratory state dynamics X^π. The superscript π of X^π indicates that each X^i is generated according to the classical dynamics (3), with the corresponding u^i_t sampled independently under the policy π.
It then follows from (3) and the law of large numbers that, as N → ∞,

(1/N) Σ_{i=1}^N (X^i_{t+Δt} − X^i_t) ≈ (1/N) Σ_{i=1}^N b(X^i_t, u^i_t) Δt → E[ ∫_U b(X^π_t, u) π_t(u) du ] Δt.  (4)
In the above, we have implicitly applied the (reasonable) assumption that both u^i_t and X^i_t are independent of the increments W^i_{t+Δt} − W^i_t of the Brownian motion sample paths, which are identically distributed over i = 1, …, N.
Similarly, as N → ∞,

(1/N) Σ_{i=1}^N (X^i_{t+Δt} − X^i_t)² ≈ (1/N) Σ_{i=1}^N σ²(X^i_t, u^i_t) Δt → E[ ∫_U σ²(X^π_t, u) π_t(u) du ] Δt.  (5)
As we see, not only the first moment but also the second moment of the state increments are affected by repetitive learning under the given policy π.
Finally, as the individual state X^i_t is an independent sample from X^π_t, we have that b(X^i_t, u^i_t) and σ²(X^i_t, u^i_t), i = 1, …, N, are independent samples from ∫_U b(X^π_t, u) π_t(u) du and ∫_U σ²(X^π_t, u) π_t(u) du, respectively. As a result, the law of large numbers gives that, as N → ∞,

(1/N) Σ_{i=1}^N b(X^i_t, u^i_t) → E[ ∫_U b(X^π_t, u) π_t(u) du ] and (1/N) Σ_{i=1}^N σ²(X^i_t, u^i_t) → E[ ∫_U σ²(X^π_t, u) π_t(u) du ].
This interpretation, together with (4) and (5), motivates us to propose the exploratory version of the state dynamics
dX^π_t = b̃(X^π_t, π_t) dt + σ̃(X^π_t, π_t) dW_t, t > 0; X^π_0 = x,  (6)
where the coefficients b̃(·, ·) and σ̃(·, ·) are defined as

b̃(y, π) := ∫_U b(y, u) π(u) du  (7)

and

σ̃(y, π) := sqrt( ∫_U σ²(y, u) π(u) du ),  (8)

for y ∈ ℝ and a probability density π on U.
We will call (6) the exploratory formulation of the controlled state dynamics, and b̃ and σ̃ in (7) and (8), respectively, the exploratory drift and the exploratory volatility.^8

^8 The exploratory formulation (6), inspired by repetitive learning, is consistent with the notion of relaxed control in the control literature (see, e.g., El Karoui et al. (1987); Kurtz and Stockbridge (1998, 2001); Fleming and Nisio (1984)). Indeed, let f be a bounded and twice continuously differentiable function, and define the infinitesimal generator associated to the classical controlled process (1) as

L^u[f](x) := b(x, u) f′(x) + (1/2) σ²(x, u) f″(x), u ∈ U.

In the classical relaxed control framework, the controlled dynamics are replaced by the six-tuple (Ω, F, {F_t}_{t≥0}, ℙ, X, π), such that X_0 = x and

f(X_t) − ∫_0^t ∫_U L^u[f](X_s) π_s(u) du ds, t ≥ 0, is a ℙ-martingale.  (9)

It is easy to verify that our proposed exploratory formulation (6) agrees with the above martingale formulation. However, even though the mathematical formulations are equivalent, the motivations of the two are entirely different. Relaxed control was introduced mainly to deal with the existence of optimal controls, whereas the exploratory formulation here is motivated by learning and exploration in RL.
In a similar fashion, as N → ∞,

(1/N) Σ_{i=1}^N r(X^i_t, u^i_t) → E[ ∫_U r(X^π_t, u) π_t(u) du ].  (10)
Hence the reward function r in (2) needs to be modified to its exploratory version

r̃(y, π) := ∫_U r(y, u) π(u) du.  (11)
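As a sanity check on the exploratory coefficients (7)–(8), the following minimal Python sketch freezes the state x, samples N controls from a Gaussian policy, generates Euler increments of the classical dynamics (1), and compares their empirical mean and variance rates with the closed-form exploratory coefficients. The linear coefficients b(x, u) = Ax + Bu and σ(x, u) = Cx + Du anticipate the LQ setting of Section 4; all numerical values are illustrative assumptions:

```python
import math
import random

# With a frozen state x and a Gaussian policy pi = N(mu, s2), the Euler
# increments averaged over N independent rounds should have
#   mean rate      ->  A*x + B*mu                              (exploratory drift)
#   variance rate  ->  C^2 x^2 + 2 C D x mu + D^2 (mu^2 + s2)  (exploratory vol^2)
A, B, C, D = 0.3, 1.0, 0.5, 0.8      # illustrative model coefficients
x, mu, s2 = 2.0, -1.0, 0.25          # frozen state and policy parameters
dt, n_rounds = 1e-3, 400_000
rng = random.Random(1)

incs = []
for _ in range(n_rounds):
    u = rng.gauss(mu, math.sqrt(s2))        # control sampled from the policy
    dW = rng.gauss(0.0, math.sqrt(dt))      # independent Brownian increment
    incs.append((A * x + B * u) * dt + (C * x + D * u) * dW)

mean_rate = sum(incs) / n_rounds / dt
m = sum(incs) / n_rounds
var_rate = sum((z - m) ** 2 for z in incs) / n_rounds / dt

b_tilde = A * x + B * mu                                             # -0.4 here
s_tilde2 = C**2 * x**2 + 2 * C * D * x * mu + D**2 * (mu**2 + s2)    # 0.2 here
```

Up to Monte Carlo error, `mean_rate` matches `b_tilde` and `var_rate` matches `s_tilde2`, illustrating the limits (4) and (5).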
If, on the other hand, the model is fully known, exploration would not be needed and the control distributions would all degenerate into Dirac measures, and we would be in the realm of classical stochastic control. Thus, in the RL context, we need to add a “regularization term” to account for model uncertainty and to encourage exploration. We use Shannon’s differential entropy to measure the degree of exploration:

H(π) := − ∫_U π(u) ln π(u) du,

where π is a control distribution.
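For a Gaussian density with variance σ², the differential entropy has the well-known closed form (1/2) ln(2πeσ²), a fact used repeatedly in the LQ analysis below. The following sketch (not from the paper; the parameters are illustrative) checks the closed form against direct numerical integration of −∫ π ln π:

```python
import math


def gaussian_density(u, mu, s2):
    """Density of N(mu, s2) at u."""
    return math.exp(-(u - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)


def differential_entropy(density, lo, hi, n=200_000):
    """Approximate H(pi) = -integral of pi(u) ln pi(u) du over [lo, hi]
    with a midpoint rule (the integrand is negligible where pi is tiny)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        u = lo + (i + 0.5) * h
        p = density(u)
        if p > 1e-300:
            total -= p * math.log(p) * h
    return total


mu, s2 = 0.5, 0.7
num = differential_entropy(lambda u: gaussian_density(u, mu, s2), mu - 12, mu + 12)
closed = 0.5 * math.log(2 * math.pi * math.e * s2)   # classical closed form
```

The numerical value agrees with the closed form to high accuracy, since the Gaussian tails outside the truncated domain are negligible.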
We therefore introduce the following entropy-regularized value function

V(x) := sup_{π ∈ A(x)} E[ ∫_0^∞ e^{−ρt} ( r̃(X^π_t, π_t) − λ ∫_U π_t(u) ln π_t(u) du ) dt | X^π_0 = x ],  (12)

where λ > 0 is an exogenous exploration weight parameter capturing the tradeoff between exploitation (the original reward function) and exploration (the entropy), and A(x) is the set of the admissible control distributions (which may in general depend on x).
It remains to specify A(x). Denote by B(U) the Borel σ-algebra on U and by P(U) the set of probability measures on U that are absolutely continuous with respect to the Lebesgue measure (identified with their density functions). The admissible set A(x) contains all measure-valued processes π = {π_t, t ≥ 0} satisfying:
i) for each t ≥ 0, π_t ∈ P(U) a.s.;

ii) for each A ∈ B(U), the process {π_t(A), t ≥ 0} is {F_t}-progressively measurable;

iii) the stochastic differential equation (SDE) (6) has a unique solution X^π = {X^π_t, t ≥ 0} if π is applied;

iv) the expectation on the right-hand side of (12) is finite.
3 HJB Equation and Optimal Control Distributions
We present the general procedure for solving the optimization problem (12).
To this end, applying the classical Bellman principle of optimality, we have, for s > 0,

V(x) = sup_{π ∈ A(x)} E[ ∫_0^s e^{−ρt} ( r̃(X^π_t, π_t) − λ ∫_U π_t(u) ln π_t(u) du ) dt + e^{−ρs} V(X^π_s) | X^π_0 = x ].
Proceeding with standard arguments, we deduce that V satisfies the following Hamilton–Jacobi–Bellman (HJB) equation
ρ v(x) = max_{π ∈ P(U)} ∫_U ( r(x, u) + b(x, u) v′(x) + (1/2) σ²(x, u) v″(x) − λ ln π(u) ) π(u) du,  (13)

or

ρ v(x) = max_{π ∈ P(U)} { ∫_U ( r(x, u) + b(x, u) v′(x) + (1/2) σ²(x, u) v″(x) ) π(u) du + λ H(π) },  (14)

where, with a slight abuse of notation, we have denoted its generic solution by v. Note that π ∈ P(U) if and only if

∫_U π(u) du = 1 and π(u) ≥ 0 a.e. on U.  (15)
Solving the (constrained) maximization problem on the right-hand side of (14) then yields the “feedback-type” optimizer

π*(u; x) = exp( (1/λ) ( r(x, u) + b(x, u) v′(x) + (1/2) σ²(x, u) v″(x) ) ) / ∫_U exp( (1/λ) ( r(x, u′) + b(x, u′) v′(x) + (1/2) σ²(x, u′) v″(x) ) ) du′.  (16)
This leads to an optimal measure-valued process

π*_t = π*(·; X*_t), t ≥ 0,  (17)

where X*_t, t ≥ 0, solves (6) when the feedback control distribution π*(·; ·) is applied, and assuming that π* = {π*_t, t ≥ 0} ∈ A(x).
The formula (17) elicits a qualitative understanding of optimal exploration. We further investigate this in the next section.
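Numerically, the feedback optimizer (16) can be sketched by normalizing exp(q(u)/λ) on a grid, where q(u) collects the terms multiplying π(u) in the HJB maximization. When q is quadratic in u (as in the LQ case of the next section), the normalized density is Gaussian. The coefficients below are illustrative assumptions:

```python
import math


def gibbs_density(q, lam, lo, hi, n=4000):
    """Normalize exp(q(u)/lam) on a midpoint grid over [lo, hi]; returns the
    grid and the density values. A numerical sketch of the optimizer (16)."""
    h = (hi - lo) / n
    us = [lo + (i + 0.5) * h for i in range(n)]
    w = [math.exp(q(u) / lam) for u in us]
    Z = sum(w) * h
    return us, [v / Z for v in w]


# Quadratic exponent q(u) = -(Ncoef/2) u^2 - Q u (illustrative constants);
# the resulting density should be Gaussian with mean -Q/Ncoef and
# variance lam/Ncoef.
Ncoef, Q, lam = 2.0, 1.0, 0.4
us, dens = gibbs_density(lambda u: -(Ncoef / 2) * u**2 - Q * u, lam, -6.0, 6.0)
h = us[1] - us[0]
mean = sum(u * p for u, p in zip(us, dens)) * h
var = sum((u - mean) ** 2 * p for u, p in zip(us, dens)) * h
# Expect mean close to -Q/Ncoef = -0.5 and var close to lam/Ncoef = 0.2.
```

Note how the entropy weight λ enters only the variance of the resulting density: a larger λ flattens exp(q/λ) and spreads the exploration, anticipating the separation of exploitation (mean) and exploration (variance) derived in Section 4.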
4 The Linear–Quadratic Case
We focus on the family of entropy-regularized problems with linear state dynamics and quadratic rewards, in which^9

^9 We assume both the state and the control are scalar-valued for notational simplicity. There is no essential difficulty with these being vector-valued.
b(x, u) = Ax + Bu and σ(x, u) = Cx + Du, x, u ∈ ℝ,  (18)

where A, B, C, D are given constants with D ≠ 0, and

r(x, u) = −( (M/2) x² + R x u + (N/2) u² + P x + Q u ), x, u ∈ ℝ,  (19)

where M ≥ 0, N > 0.
In the classical control literature, this type of linear–quadratic (LQ) control problem is one of the most important in that it admits elegant and simple solutions and, furthermore, more complex, nonlinear problems can be approximated by LQ problems. As is standard with LQ control, we assume that the control is unconstrained, namely, U = ℝ.
Fix an initial state x. For each measure-valued control π ∈ A(x), we define its mean and variance processes μ_t and σ²_t, namely,

μ_t := ∫_ℝ u π_t(u) du, σ²_t := ∫_ℝ u² π_t(u) du − μ²_t.
Then (6) can be rewritten as

dX^π_t = (A X^π_t + B μ_t) dt + sqrt( C² (X^π_t)² + 2 C D X^π_t μ_t + D² (μ²_t + σ²_t) ) dW_t, t > 0; X^π_0 = x.  (20)
Further, define the value function of the entropy-regularized exploratory LQ problem,

V(x) := sup_{π ∈ A(x)} E[ ∫_0^∞ e^{−ρt} ( r̃(X^π_t, π_t) − λ ∫_ℝ π_t(u) ln π_t(u) du ) dt | X^π_0 = x ].  (21)
We will be working with the following assumption for the entropyregularized LQ problem.
Assumption 1
The discount rate satisfies
Here, a ≫ b means that a is sufficiently larger than b. This assumption requires a sufficiently large discount rate or, implicitly, a sufficiently short planning horizon. Such an assumption is standard in infinite-horizon problems with running rewards.
We are now ready to define the admissible control set for the entropy-regularized LQ problem as follows: π = {π_t, t ≥ 0} ∈ A(x), if

i) for each t ≥ 0, π_t ∈ P(ℝ) a.s.;

ii) for each A ∈ B(ℝ), the process {π_t(A), t ≥ 0} is {F_t}-progressively measurable;

iii) for each t ≥ 0, E[ ∫_0^t (μ²_s + σ²_s) ds ] < ∞;

iv) there exists …, such that …,

with {X^π_t, t ≥ 0} solving (20).
Under the above conditions, it is immediate that for any π ∈ A(x), both the drift and volatility terms of (20) satisfy a global Lipschitz condition in the state variable; hence the SDE (20) admits a unique strong solution X^π = {X^π_t, t ≥ 0}.
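The strong solution of (20) can be approximated by the Euler–Maruyama scheme. The following sketch (illustrative coefficients; the policy is frozen at a fixed Gaussian, so μ_t and σ²_t are constants) simulates (20) and, for C = 0, compares the empirical mean of X_T against the ODE m′ = Am + Bμ satisfied by the exact mean:

```python
import math
import random


def simulate_exploratory_lq(x0, mu, s2, A, B, C, D, T=1.0, n_steps=200, seed=0):
    """Euler-Maruyama approximation of the exploratory LQ dynamics (20) with
    the policy frozen at N(mu, s2): drift A*X + B*mu and squared volatility
    C^2 X^2 + 2 C D X mu + D^2 (mu^2 + s2)."""
    rng = random.Random(seed)
    dt = T / n_steps
    X = x0
    for _ in range(n_steps):
        vol2 = C**2 * X**2 + 2 * C * D * X * mu + D**2 * (mu**2 + s2)
        X += (A * X + B * mu) * dt \
            + math.sqrt(max(vol2, 0.0)) * rng.gauss(0.0, math.sqrt(dt))
    return X


# Illustrative parameters; with C = 0 the exact mean of X_T solves m' = A m + B mu.
A, B, mu, s2, x0, T = 0.2, 1.0, -0.5, 0.3, 1.0, 1.0
paths = [simulate_exploratory_lq(x0, mu, s2, A, B, 0.0, 1.0, T=T, seed=s)
         for s in range(2000)]
emp_mean = sum(paths) / len(paths)
exact_mean = math.exp(A * T) * x0 + (B * mu / A) * (math.exp(A * T) - 1)
```

Averaging over independent paths, the empirical mean of X_T agrees with the solution of the mean ODE up to Monte Carlo and discretization error.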
In the following two subsections, we derive explicit solutions for the case of state-independent reward and that of general state-dependent reward, respectively.
4.1 The case of state-independent reward
We start with the technically less challenging case when M = R = P = 0, namely, the reward is state-independent. In this case, the system dynamics become irrelevant. However, the problem is still interesting in its own right as it corresponds to the state-independent RL problem, known as the continuous-armed bandit problem in the continuous-time setting (see, e.g., Mandelbaum (1987); Kaspi and Mandelbaum (1998)).
Following the derivation in the previous section, the optimal control distribution in feedback form, (16), reduces to

π*(u; x) = N( u | (B v′(x) + C D x v″(x) − Q) / (N − D² v″(x)), λ / (N − D² v″(x)) ).  (22)

So the optimal control distribution appears to be Gaussian. More specifically, at any present state x, the agent should embark on exploration according to the Gaussian distribution with mean and variance given, respectively, by (B v′(x) + C D x v″(x) − Q)/(N − D² v″(x)) and λ/(N − D² v″(x)). We note that the expression (22) requires that N − D² v″(x) > 0, x ∈ ℝ, a condition that will be justified and discussed later on.
Remark 2
If we examine the derivation of (22) more closely, we easily see that the optimality of the Gaussian distribution still holds so long as the state dynamics are linear in the control and the reward is quadratic in the control, whereas their dependence on the state can be generally nonlinear.
Substituting (22) back into (13), the HJB equation becomes, after straightforward calculations,

ρ v(x) = ( B v′(x) + C D x v″(x) − Q )² / ( 2 (N − D² v″(x)) ) + (λ/2) ln( 2πλ / (N − D² v″(x)) ) + A x v′(x) + (C² x² / 2) v″(x).  (23)
In general, this nonlinear equation has multiple smooth solutions, even among quadratic polynomials that satisfy N − D² v″(x) > 0. One such solution is a constant, given by

v(x) ≡ (1/ρ) ( Q²/(2N) + (λ/2) ln( 2πλ/N ) ),  (24)
with the corresponding optimal control distribution (22) being

π*(u; x) = N( u | −Q/N, λ/N ).

Note that the classical LQ problem with the state-independent reward function r(u) = −( (N/2) u² + Q u ) clearly has the optimal control u* = −Q/N, which is the mean of the optimal Gaussian control distribution π*. The following result establishes that the constant (24) is indeed the value function V.
Henceforth, we denote, for notational convenience, by N(· | μ, σ²) the density function of a Gaussian random variable with mean μ and variance σ².
Theorem 3
If λ > 0, then the value function in (21) is given by

V(x) = (1/ρ) ( Q²/(2N) + (λ/2) ln( 2πλ/N ) ), x ∈ ℝ,

and the optimal control distribution is Gaussian: π*_t(u) = N( u | −Q/N, λ/N ), t ≥ 0. The associated optimal state process {X*_t, t ≥ 0} solves the SDE

dX*_t = (A X*_t − B Q/N) dt + sqrt( C² (X*_t)² − (2 C D Q/N) X*_t + D² ( Q²/N² + λ/N ) ) dW_t, X*_0 = x,  (25)
which can be explicitly expressed as follows:
i) If C ≠ 0, then
(26) 
where the function
and the process , , is the unique pathwise solution to the random ordinary differential equation
ii) If C = 0 (and D ≠ 0), then

X*_t = e^{At} x − (B Q/N) ∫_0^t e^{A(t−s)} ds + D sqrt( Q²/N² + λ/N ) ∫_0^t e^{A(t−s)} dW_s.  (27)
Proof. See Appendix A.
The above solution suggests that when the reward is independent of the state, so is the optimal control distribution π*. This is intuitive as the objective (12) does not explicitly distinguish between states.^10

^10 A similar observation can be made for the (state-independent) pure entropy maximization formulation, where the goal is to solve

sup_{π ∈ A(x)} E[ − ∫_0^∞ e^{−ρt} ∫_U π_t(u) ln π_t(u) du dt ].  (28)

This problem becomes relevant when λ → ∞ in the entropy-regularized objective (21), corresponding to the extreme case when the least informative (or highest-entropy) distribution is favored for pure exploration without considering exploitation (i.e., without maximizing any rewards). To solve problem (28), we can pointwise maximize the integrand there, leading to the state-independent optimization problem

max_{π ∈ P(U)} − ∫_U π(u) ln π(u) du.  (29)

If the action space is a finite interval, say U = [a, b], then it is straightforward that the optimal control distribution is, for all t ≥ 0, the uniform distribution on [a, b]. This is in accordance with the traditional static setting where the uniform distribution achieves maximum entropy on any finite interval (see, e.g., Shannon (2001)).
A remarkable feature of the derived optimal distribution is that the mean coincides with the optimal control of the original non-exploratory LQ problem, whereas the variance is determined by the parameter λ. In the context of the continuous-armed bandit problem, the mean is concentrated on the current incumbent of the best arm and the variance is determined by the temperature parameter λ. The more weight put on the level of exploration, the more spread out the exploration is around the current best arm. This type of exploration/exploitation strategy is clearly intuitive and, in turn, gives guidance on how to actually choose the parameter λ in practice: it essentially sets the variance of the exploration the agent wishes to engage in.
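This interpretation can be checked numerically. Assuming the state-independent LQ reward r(u) = −(N/2)u² − Qu (the specification of this subsection), a brute-force search over Gaussian candidates N(m, s²) that maximizes the per-period objective E_π[r] + λ H(π) should return the mean −Q/N and variance λ/N identified above. A sketch with illustrative parameter values:

```python
import math

# Among Gaussian candidates N(m, s2), maximize the per-period objective
#   J(m, s2) = E[-(Ncoef/2) u^2 - Q u] + lam * H(N(m, s2))
#            = -(Ncoef/2)(s2 + m^2) - Q m + (lam/2) ln(2 pi e s2)
# over a grid; the maximizer should approach m = -Q/Ncoef, s2 = lam/Ncoef.
Ncoef, Q, lam = 2.0, 1.0, 0.4        # illustrative reward and temperature


def J(m, s2):
    return (-(Ncoef / 2) * (s2 + m**2) - Q * m
            + (lam / 2) * math.log(2 * math.pi * math.e * s2))


grid_m = [i / 100 for i in range(-200, 201)]     # m in [-2, 2]
grid_s2 = [i / 100 for i in range(1, 101)]       # s2 in (0, 1]
best_m, best_s2 = max(((m, s2) for m in grid_m for s2 in grid_s2),
                      key=lambda p: J(*p))
# Expect best_m close to -Q/Ncoef = -0.5 and best_s2 close to lam/Ncoef = 0.2.
```

The grid maximizer reproduces the separation discussed above: the mean depends only on the reward coefficients (exploitation), while the variance scales linearly with the temperature λ (exploration).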
However, we shall see in the next subsection that when the reward depends on the state, the HJB equation does not admit a constant solution. Consequently, the optimal control distribution must be of a feedback form, depending on where the state is at any given time.
4.2 The case of state-dependent reward

We now consider the case of a general state-dependent reward, i.e., the reward function (19) with general coefficients.
We start with the following lemma that will be used for the verification arguments.
Lemma 4
For each π ∈ A(x), x ∈ ℝ, the unique solution {X^π_t, t ≥ 0} to the SDE (20) satisfies