# Exploration versus exploitation in reinforcement learning: a stochastic control approach

Haoran Wang (Department of Mathematics, The University of Texas at Austin, Austin, TX 78712, USA. Email: hwang@math.utexas.edu), Thaleia Zariphopoulou (Department of Mathematics and IROM, The University of Texas at Austin, Austin, USA, and the Oxford-Man Institute, University of Oxford, Oxford, UK. Email: zariphop@math.utexas.edu) and Xun Yu Zhou (Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA. Email: xz2574@columbia.edu)
###### Abstract

We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a resurrection of the classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear–quadratic (LQ) case and deduce that the optimal control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, the exploitation and exploration are reflected respectively by the mean and variance of the Gaussian distribution. We also find that a more random environment contains more learning opportunities in the sense that less exploration is needed, other things being equal. As the weight of exploration decays to zero, we prove the convergence of the solution to the entropy-regularized LQ problem to that of the classical LQ problem. Finally, we characterize the cost of exploration, which is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate in the LQ case.

Key words. Reinforcement learning, exploration, exploitation, entropy regularization, stochastic control, linear–quadratic, Gaussian.

First draft: March 2018

This draft: December 2018

## 1 Introduction

Reinforcement learning (RL) is currently one of the most active and fastest developing subareas of machine learning. In recent years it has been successfully applied to solve large-scale, real-world, complex decision-making problems, including playing perfect-information board games such as Go (AlphaGo/AlphaGo Zero, Silver et al. (2016), Silver et al. (2017)), achieving human-level performance in video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et al. (2016)). An RL agent does not pre-specify a structural model or a family of models, but instead gradually learns the best (or near-best) strategies based on trial and error, through interactions with the random (black box) environment and incorporation of the responses of these interactions, in order to improve the overall performance. This is a case of "killing two birds with one stone": the agent's actions (controls) serve both as a means to explore (learn) and as a way to exploit (optimize). Since exploration is inherently costly in terms of both resources and time, a natural and crucial question in RL is how to address the dichotomy between exploration of uncharted territory and exploitation of existing knowledge. Such a question exists both in the stateless RL settings represented by the multi-armed bandit problem, and in the more general multi-state RL settings with delayed reinforcement (Sutton and Barto (2018), Kaelbling et al. (1996)). More specifically, the agent must balance greedily exploiting what has been learned so far to choose actions that yield near-term higher rewards, against continually exploring the environment to acquire more information and potentially achieve long-term benefits.
Extensive studies have been conducted to find optimal strategies for trading off exploitation and exploration. For the multi-armed bandit problem, well-known strategies include the Gittins-index approach (Gittins (1974)), Thompson sampling (Thompson (1933)), and the upper confidence bound algorithm (Auer et al. (2002)), with sound theoretical optimality established (Russo and Van Roy (2013, 2014)). For general RL problems, various efficient exploration methods have been proposed and proved to induce low sample complexity (see, e.g., Brafman and Tennenholtz (2002), Strehl and Littman (2008), Strehl et al. (2009)), among other advantages.

However, most of the contributions to balancing exploitation and exploration do not include exploration formally as part of the optimization objective; attention has mainly focused on solving the classical optimization problem of maximizing the accumulated rewards, while exploration is typically treated separately as an ad hoc, exogenously chosen component, rather than being endogenously derived as part of the solution to the overall RL problem. The recently proposed discrete-time entropy-regularized (also termed "entropy-augmented" or "softmax") RL formulation, on the other hand, explicitly incorporates exploration into the optimization objective, with a trade-off weight imposed on the entropy of the exploration strategy (Ziebart et al. (2008), Nachum et al. (2017a), Fox et al. (2015)). An exploratory distribution with greater entropy signifies a higher level of exploration, and is hence more favorable on the exploration front. The extreme case of a Dirac measure, having minimal entropy, implies no exploration, reducing to the case of classical optimization with complete knowledge about the underlying model. Recent works have been devoted to designing various algorithms to solve the entropy-regularized RL problem, where numerical experiments demonstrate remarkable robustness and multi-modal policy learning (Haarnoja et al. (2017), Haarnoja et al. (2018)).

In this paper, we study the trade-off between exploration and exploitation for RL in a continuous-time setting with both continuous control (action) and state (feature) spaces. (The terms "feature" and "action" are typically used in the RL literature; their counterparts in the control literature are "state" and "control", respectively. Since this paper uses the control approach to study RL, we will use these terms interchangeably whenever there is no confusion.) Such a continuous-time formulation is especially appealing if the agent can interact with the environment at ultra-high frequency, aided by modern computing resources; a notable example is high-frequency stock trading. Meanwhile, once cast in continuous time, an in-depth and comprehensive study of the RL problem also becomes possible, leading to elegant and insightful results, thanks in no small measure to the tools of stochastic calculus and differential equations.

Our first main contribution is to propose an entropy-regularized reward function involving the differential entropy of probability distributions over the continuous action space, and to motivate and devise an "exploratory formulation" for the state dynamics that captures repetitive learning under exploration in the continuous-time limit. Existing theoretical works on exploration mainly concentrate on analysis at the algorithmic level, e.g., proving convergence of the proposed exploration algorithms to the solutions of the classical optimization problems (see, e.g., Singh et al. (2000), Jaakkola et al. (1994)), yet they rarely examine how the learning algorithms significantly change the underlying dynamics. Indeed, exploration not only substantially enriches the space of control strategies (from that of Dirac measures to that of all possible probability distributions) but also, as a result, enormously expands the reachable space of states. This, in turn, changes the underlying state transitions and system dynamics.

We show that our exploratory formulation can account for the effects of learning in both the rewards received and the state transitions observed from the interactions with the environment. It thus unearths the important characteristics of learning at a more refined and in-depth level, beyond the theoretical analysis of mere learning algorithms. Intriguingly, this formulation of the state dynamics coincides with the relaxed control framework in classical control theory (El Karoui et al. (1987), Kurtz and Stockbridge (1998, 2001)), which was motivated by entirely different reasons. Relaxed controls were introduced mainly to deal with the theoretical question of whether an optimal control exists; the approach is essentially a randomization device to convexify the universe of control strategies. To the best of our knowledge, our paper is the first to bring back the fundamental ideas and formulation of relaxed control, guided by a practical motivation: exploration and learning.

We then carry out a complete analysis of the continuous-time entropy-regularized RL problem, assuming that the original dynamic system is linear in both control and state and the original reward function is quadratic in the two. This type of linear–quadratic (LQ) problems has occupied the center stage for research in classical control theory for its elegant solutions and its ability to approximate more general nonlinear problems. One of the most important, conceptual contributions of this paper is to show that the optimal control distribution for balancing exploitation and exploration is Gaussian. As is well known, a pure exploitation optimal distribution is Dirac, and a pure exploration optimal distribution is uniform. Our results reveal that Gaussian is the right choice if one seeks a balance between those two extremes. Moreover, we find that the mean of this optimal distribution is a feedback of the current state independent of the intended exploration level, whereas the variance is a linear function of the entropy regulating weight (also called the “temperature parameter”) irrespective of the current state. This result highlights a separation between exploitation and exploration: the former is reflected in the mean and the latter in the variance of the final optimal actions.

There is yet another intriguing result. All other things being equal, the more volatile the original dynamic system is, the smaller the variance of the optimal Gaussian distribution is. Conceptually, this implies that an environment reacting more aggressively to an action in fact contains more learning opportunities and hence is less costly for learning.

Another contribution of the paper is to establish the connection between the solvability of the exploratory LQ problem and that of the classical LQ problem. We prove that as the exploration weight in the former decays to zero, the optimal Gaussian control distribution and its value function converge respectively to the optimal Dirac measure and the value function of the latter, a desirable result for practical learning purposes.

Finally, we observe that, beyond LQ problems, the Gaussian distribution remains, formally, optimal for a much larger class of control problems, namely, problems with drift and volatility linear in the control, and reward functions linear or quadratic in the control. Such a family of problems can be seen as the local linear–quadratic approximation to more general stochastic control problems, whose state dynamics are linearized in the control variables and whose reward functions are approximated locally by functions quadratic in the controls (Todorov and Li (2005), Li and Todorov (2007)). Note that although such an iterative LQ approximation generally has different parameters at different local state–action pairs, our result on the optimality of the Gaussian distribution under the exploratory LQ framework still holds at any local point, and therefore justifies, from a stochastic control perspective, why the Gaussian distribution is commonly used in practice for exploration (Haarnoja et al. (2017), Haarnoja et al. (2018), Nachum et al. (2017b)), beyond its simplicity for sampling.

The paper is organized as follows. In section 2, we motivate and propose the relaxed stochastic control problem involving an exploratory state dynamics and an entropy-regularized reward function. We then present the Hamilton-Jacobi-Bellman (HJB) equation and the optimal control distribution for general entropy-regularized stochastic control problems in section 3. In section 4, we study the special LQ problem in both the state-independent and state-dependent reward cases, corresponding respectively to the multi-armed bandit problem and the general RL problem in discrete time. We discuss the connections between the exploratory LQ problem and the classical LQ problem in section 5, establish the equivalence of the solvability of the two and the convergence result for vanishing exploration, and characterize the cost of engaging in exploration. Finally, section 6 concludes. Some technical proofs are relegated to appendices.

## 2 An Entropy-Regularized Stochastic Control Problem

We introduce an entropy-regularized stochastic control problem and provide its motivation in the context of RL.

Consider a filtered probability space $(\Omega, \mathcal F, \{\mathcal F_t\}_{t\ge 0}, P)$ in which we define an $\{\mathcal F_t\}$-adapted Brownian motion $W = \{W_t, t\ge 0\}$. An "action space" $U$ is given, representing the constraint on an agent's decisions ("controls" or "actions"). An admissible control $u = \{u_t, t\ge 0\}$ is an $\{\mathcal F_t\}$-adapted measurable process taking values in $U$. Denote by $\mathcal A$ the set of all admissible controls.

The classical stochastic control problem is to control the state (or “feature”) dynamics

$$\mathrm{d}x^u_t = b\big(x^u_t, u_t\big)\,\mathrm{d}t + \sigma\big(x^u_t, u_t\big)\,\mathrm{d}W_t,\quad t>0;\qquad x^u_0 = x, \tag{1}$$

so as to achieve the maximum expected total discounted reward represented by the value function

$$w(x) := \sup_{u\in\mathcal A} \mathbb{E}\Big[\int_0^\infty e^{-\rho t}\, r\big(x^u_t, u_t\big)\,\mathrm{d}t \,\Big|\, x^u_0 = x\Big], \tag{2}$$

where $r$ is the reward function and $\rho > 0$ is the discount rate.

In the classical setting, where the model is fully known (namely, when the functions $b$, $\sigma$ and $r$ are fully specified) and dynamic programming is applicable, the optimal control can be derived and represented as a deterministic mapping from the current state to the action space, $u^*_t = \boldsymbol{u}^*(x^*_t)$, where $\boldsymbol{u}^*(\cdot)$ is the optimal feedback policy (or "law"). This feedback policy is derived at $t = 0$ and will be carried out through $[0,\infty)$.

In contrast, under the RL setting, when the underlying model is not known and therefore dynamic learning is needed, the agent employs exploration to interact with and learn about the unknown environment through trial and error. This exploration can be modelled by a distribution of controls $\pi = \{\pi_t(\cdot), t\ge 0\}$ over the control space $U$ from which each "trial" is sampled. We can therefore extend the notion of controls to distributions. (A classical control $u = \{u_t, t\ge 0\}$ can be regarded as a Dirac distribution (or "measure") $\pi = \{\pi_t(\cdot), t\ge 0\}$ where $\pi_t(\cdot) = \delta_{u_t}(\cdot)$. In a similar fashion, a feedback policy can be embedded as a Dirac measure parameterized by the current state.) The agent executes a control distribution $\pi$ for $N$ rounds over the same time horizon, while at each round, a classical control is sampled from the distribution $\pi$. The reward of such a policy becomes accurate enough when $N$ is large. This procedure, known as policy evaluation, is considered a fundamental element of most RL algorithms in practice (see, e.g., Sutton and Barto (2018)). Hence, for evaluating such a policy distribution in our continuous-time setting, it is necessary to consider the limiting situation as $N\to\infty$.

In order to quickly capture the essential idea, let us first examine the special case when the reward depends only on the control, namely, $r(x,u) = r(u)$. One then considers $N$ identical, independent copies of the control problem in the following way: at round $i$, $i = 1,\dots,N$, a control $u^i$ is sampled under the (possibly random) control distribution $\pi$, and executed for its corresponding copy of the control problem (1)–(2). Then, at each fixed time $t$, it follows from the law of large numbers, under certain mild technical conditions, that the average reward over $[t, t+\Delta t]$, with $\Delta t$ small enough, should satisfy

$$\frac{1}{N}\sum_{i=1}^N e^{-\rho t}\, r\big(u^i_t\big)\,\Delta t \;\xrightarrow{\text{a.s.}}\; \mathbb{E}\Big[e^{-\rho t}\int_U r(u)\,\pi_t(u)\,\mathrm{d}u\,\Delta t\Big],\quad \text{as } N\to\infty.$$
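This law-of-large-numbers approximation is easy to test numerically. The following minimal Python sketch uses a hypothetical reward $r(u) = -u^2$ and a standard Gaussian action distribution (neither taken from the paper, and the discount factor is dropped since $t$ is fixed); the exact mean reward is then $\mathbb{E}[r(U)] = -(\text{mean}^2 + \text{variance}) = -1$.

```python
import random

def sample_average_reward(r, sampler, n, seed=0):
    """Average reward over n actions u^i sampled independently from pi."""
    rng = random.Random(seed)
    return sum(r(sampler(rng)) for _ in range(n)) / n

# Hypothetical ingredients: reward r(u) = -u^2, action distribution N(0, 1).
r = lambda u: -u * u
gauss = lambda rng: rng.gauss(0.0, 1.0)

small = sample_average_reward(r, gauss, 100)        # noisy estimate
large = sample_average_reward(r, gauss, 200_000)    # close to E[r(U)] = -1
```

As `n` grows, the sample average concentrates around the integral $\int_U r(u)\pi(u)\,\mathrm{d}u$, which is exactly the policy-evaluation limit invoked above.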

For a general reward $r$ which depends on the state, we first need to describe how exploration might alter the state dynamics (1) by defining appropriately its "exploratory" version. For this, we look at the effect of repetitive learning under a given control distribution $\pi$ for $N$ rounds. Let $W^i_t$, $i = 1,\dots,N$, be independent sample paths of the Brownian motion $W_t$, and $x^i_t$, $i = 1,\dots,N$, be the copies of the state process respectively under the controls $u^i$, $i = 1,\dots,N$, each sampled from $\pi$. Then the increments of these state process copies are, for $i = 1,\dots,N$,

$$\Delta x^i_t \equiv x^i_{t+\Delta t} - x^i_t \approx b\big(x^i_t, u^i_t\big)\,\Delta t + \sigma\big(x^i_t, u^i_t\big)\big(W^i_{t+\Delta t} - W^i_t\big),\quad t\ge 0. \tag{3}$$

Then, each such process $x^i$, $i = 1,\dots,N$, can be viewed as an independent sample from the exploratory state dynamics $X^\pi$. The superscript $\pi$ indicates that each $x^i$ is generated according to the classical dynamics (3), with the corresponding $u^i_t$ sampled independently under the policy $\pi$.

It then follows from (3) and the law of large numbers that, as $N\to\infty$,

$$\frac{1}{N}\sum_{i=1}^N \Delta x^i_t \;\approx\; \frac{1}{N}\sum_{i=1}^N b\big(x^i_t, u^i_t\big)\,\Delta t + \frac{1}{N}\sum_{i=1}^N \sigma\big(x^i_t, u^i_t\big)\big(W^i_{t+\Delta t} - W^i_t\big)$$
$$\xrightarrow{\text{a.s.}}\; \mathbb{E}\Big[\int_U b\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u\Big]\,\Delta t + \mathbb{E}\Big[\int_U \sigma\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u\Big]\,\mathbb{E}\big[W_{t+\Delta t} - W_t\big] = \mathbb{E}\Big[\int_U b\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u\Big]\,\Delta t. \tag{4}$$

In the above, we have implicitly applied the (reasonable) assumption that both $x^i_t$ and $u^i_t$ are independent of the increments of the Brownian motion sample paths, which are identically distributed over $[t, t+\Delta t]$.

Similarly, as $N\to\infty$,

$$\frac{1}{N}\sum_{i=1}^N \big(\Delta x^i_t\big)^2 \;\approx\; \frac{1}{N}\sum_{i=1}^N \sigma^2\big(x^i_t, u^i_t\big)\,\Delta t \;\xrightarrow{\text{a.s.}}\; \mathbb{E}\Big[\int_U \sigma^2\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u\Big]\,\Delta t. \tag{5}$$

As we see, not only the drift but also the volatility of the state process is affected by repetitive learning under the given policy $\pi$.

Finally, as each individual state $x^i_t$ is an independent sample from $X^\pi_t$, the quantities $\Delta x^i_t$ and $(\Delta x^i_t)^2$, $i = 1,\dots,N$, are independent samples from $\Delta X^\pi_t$ and $(\Delta X^\pi_t)^2$, respectively. As a result, the law of large numbers gives that, as $N\to\infty$,

$$\frac{1}{N}\sum_{i=1}^N \Delta x^i_t \;\xrightarrow{\text{a.s.}}\; \mathbb{E}\big[\Delta X^\pi_t\big] \quad\text{and}\quad \frac{1}{N}\sum_{i=1}^N \big(\Delta x^i_t\big)^2 \;\xrightarrow{\text{a.s.}}\; \mathbb{E}\big[\big(\Delta X^\pi_t\big)^2\big].$$

This interpretation, together with (4) and (5), motivates us to propose the exploratory version of the state dynamics

$$\mathrm{d}X^\pi_t = \tilde b\big(X^\pi_t, \pi_t\big)\,\mathrm{d}t + \tilde\sigma\big(X^\pi_t, \pi_t\big)\,\mathrm{d}W_t,\quad t>0;\qquad X^\pi_0 = x, \tag{6}$$

where the coefficients $\tilde b$ and $\tilde\sigma$ are defined as

$$\tilde b\big(X^\pi_t, \pi_t\big) := \int_U b\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u, \tag{7}$$

and

$$\tilde\sigma\big(X^\pi_t, \pi_t\big) := \sqrt{\int_U \sigma^2\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u}. \tag{8}$$

We will call (6) the exploratory formulation of the controlled state dynamics, and $\tilde b$ and $\tilde\sigma$ in (7) and (8), respectively, the exploratory drift and the exploratory volatility. The exploratory formulation (6), inspired by repetitive learning, is consistent with the notion of relaxed control in the control literature (see, e.g., El Karoui et al. (1987); Kurtz and Stockbridge (1998, 2001); Fleming and Nisio (1984)). Indeed, let $f$ be a bounded and twice continuously differentiable function, and define the infinitesimal generator associated to the classical controlled process (1) as $\mathcal L^u f(x) := \frac12 \sigma^2(x,u) f''(x) + b(x,u) f'(x)$. In the classical relaxed control framework, the controlled dynamics are replaced by the six-tuple $(\Omega, \mathcal F, \{\mathcal F_t\}_{t\ge 0}, P, X, \pi)$, such that $X_0 = x$ and

$$f(X_t) - \int_0^t\!\!\int_U \mathcal L^u f(X_s)\,\pi_s(u)\,\mathrm{d}u\,\mathrm{d}s,\quad t\ge 0,\ \text{is a } P\text{-martingale}. \tag{9}$$

It is easy to verify that our proposed exploratory formulation (6) agrees with the above martingale formulation. However, even though the mathematical formulations are equivalent, the motivations of the two are entirely different. Relaxed control was introduced mainly to deal with the existence of optimal controls, whereas the exploratory formulation here is motivated by learning and exploration in RL.
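The definitions (7)–(8) can be checked numerically. The sketch below (all parameters hypothetical, chosen only for illustration) integrates $\tilde b$ and $\tilde\sigma$ by the trapezoid rule for the LQ-type coefficients $b(x,u) = Ax + Bu$ and $\sigma(x,u) = Cx + Du$ at a fixed state $x$, with $\pi$ a Gaussian density $\mathcal N(\mu, s^2)$, and compares against the closed forms $Ax + B\mu$ and $\sqrt{(Cx + D\mu)^2 + D^2 s^2}$ implied by those definitions.

```python
import math

def exploratory_coeffs(b, sigma, pdf, lo=-10.0, hi=10.0, n=20001):
    """Numerically integrate (7)-(8): tilde_b = int b(u) pi(u) du and
    tilde_sigma = sqrt(int sigma(u)^2 pi(u) du), via the trapezoid rule."""
    h = (hi - lo) / (n - 1)
    us = [lo + i * h for i in range(n)]
    w = [h] * n
    w[0] = w[-1] = h / 2  # trapezoid end-point weights
    tb = sum(wi * b(u) * pdf(u) for wi, u in zip(w, us))
    ts2 = sum(wi * sigma(u) ** 2 * pdf(u) for wi, u in zip(w, us))
    return tb, math.sqrt(ts2)

# Hypothetical parameters: state x, LQ coefficients, Gaussian policy N(mu, s2).
A, B, C, D, x, mu, s2 = 0.5, 1.0, 0.3, 0.7, 2.0, -0.4, 0.25
pdf = lambda u: math.exp(-(u - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
tb, ts = exploratory_coeffs(lambda u: A * x + B * u,
                            lambda u: C * x + D * u, pdf)

tb_exact = A * x + B * mu
ts_exact = math.sqrt((C * x + D * mu) ** 2 + D ** 2 * s2)
```

The agreement of `tb`/`ts` with the closed forms is the scalar version of the drift and volatility that reappear later in the exploratory LQ dynamics (20).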

In a similar fashion,

$$\frac{1}{N}\sum_{i=1}^N e^{-\rho t}\, r\big(x^i_t, u^i_t\big)\,\Delta t \;\xrightarrow{\text{a.s.}}\; \mathbb{E}\Big[e^{-\rho t}\int_U r\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u\,\Delta t\Big],\quad \text{as } N\to\infty. \tag{10}$$

Hence the reward function in (2) needs to be modified to its exploratory version

$$\tilde r\big(X^\pi_t, \pi_t\big) := \int_U r\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u. \tag{11}$$

If, on the other hand, the model is fully known, exploration would not be needed and the control distributions would all degenerate into Dirac measures; we would then be in the realm of classical stochastic control. Thus, in the RL context, we need to add a "regularization term" to account for model uncertainty and to encourage exploration. We use Shannon's differential entropy to measure the degree of exploration:

$$H(\pi_t) := -\int_U \pi_t(u)\ln\pi_t(u)\,\mathrm{d}u,$$

where $\pi_t$ is a control distribution.
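As a quick numerical sanity check on this entropy measure, the sketch below (parameters hypothetical) computes $H$ for a Gaussian density by the trapezoid rule and compares it with the known closed form $\frac12\ln(2\pi e\, s^2)$; a uniform density on $[a,b]$, by comparison, has entropy $\ln(b-a)$.

```python
import math

def differential_entropy(pdf, lo, hi, n=20001):
    """H(pi) = -int pi(u) ln pi(u) du on [lo, hi], trapezoid rule; the
    integrand is treated as 0 where the density (numerically) vanishes."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        u = lo + i * h
        p = pdf(u)
        if p > 0.0:
            total += (h if 0 < i < n - 1 else h / 2) * (-p * math.log(p))
    return total

# Gaussian N(0, s2): entropy grows with the variance s2, i.e. with the
# amount of exploration the density represents.
s2 = 0.7
gauss = lambda u: math.exp(-u * u / (2 * s2)) / math.sqrt(2 * math.pi * s2)
H_gauss = differential_entropy(gauss, -12.0, 12.0)
H_exact = 0.5 * math.log(2 * math.pi * math.e * s2)
```

The closed form recovered here is the same entropy expression that enters the LQ value functions (23)–(24) below.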

We therefore introduce the following entropy-regularized value function

$$V(x) := \sup_{\pi\in\mathcal A(x)} \mathbb{E}\Big[\int_0^\infty e^{-\rho t}\big(\tilde r\big(X^\pi_t, \pi_t\big) + \lambda H(\pi_t)\big)\,\mathrm{d}t \,\Big|\, X^\pi_0 = x\Big], \tag{12}$$

where $\lambda > 0$ is an exogenous exploration weight parameter capturing the trade-off between exploitation (the original reward function) and exploration (the entropy), and $\mathcal A(x)$ is the set of admissible control distributions (which may in general depend on $x$).

It remains to specify $\mathcal A(x)$. Denote by $\mathcal B(U)$ the Borel $\sigma$-algebra on $U$ and by $\mathcal P(U)$ the set of probability measures on $U$ that are absolutely continuous with respect to the Lebesgue measure. The admissible set $\mathcal A(x)$ contains all measure-valued processes $\pi = \{\pi_t, t\ge 0\}$ satisfying:

i) for each $t\ge 0$, $\pi_t \in \mathcal P(U)$ a.s.;

ii) for each $E \in \mathcal B(U)$, the process $\big\{\int_E \pi_t(u)\,\mathrm{d}u,\ t\ge 0\big\}$ is $\{\mathcal F_t\}$-progressively measurable;

iii) the stochastic differential equation (SDE) (6) has a unique solution $X^\pi = \{X^\pi_t, t\ge 0\}$ if $\pi$ is applied;

iv) the expectation on the right hand side of (12) is finite.

## 3 HJB Equation and Optimal Control Distributions

We present the general procedure for solving the optimization problem (12).

To this end, applying the classical Bellman’s principle of optimality, we have

$$V(x) = \sup_{\pi\in\mathcal A(x)} \mathbb{E}\Big[\int_0^s e^{-\rho t}\big(\tilde r\big(X^\pi_t, \pi_t\big) + \lambda H(\pi_t)\big)\,\mathrm{d}t + e^{-\rho s}V\big(X^\pi_s\big) \,\Big|\, X^\pi_0 = x\Big],\qquad s > 0.$$

Proceeding with standard arguments, we deduce that $V$ satisfies the following Hamilton-Jacobi-Bellman (HJB) equation

$$\rho V(x) = \max_{\pi\in\mathcal P(U)}\Big(\tilde r(x,\pi) - \lambda\int_U \pi(u)\ln\pi(u)\,\mathrm{d}u + \frac12\tilde\sigma^2(x,\pi)V''(x) + \tilde b(x,\pi)V'(x)\Big),\qquad x\in\mathbb{R}, \tag{13}$$

or

$$\rho V(x) = \max_{\pi\in\mathcal P(U)} \int_U \Big(r(x,u) - \lambda\ln\pi(u) + \frac12\sigma^2(x,u)V''(x) + b(x,u)V'(x)\Big)\,\pi(u)\,\mathrm{d}u, \tag{14}$$

where, with a slight abuse of notation, we have denoted its generic solution by $V$. Note that $\pi \in \mathcal P(U)$ if and only if

$$\int_U \pi(u)\,\mathrm{d}u = 1 \quad\text{and}\quad \pi(u) \ge 0 \ \text{ a.e. on } U. \tag{15}$$

Solving the (constrained) maximization problem on the right hand side of (14) then yields the “feedback-type” optimizer

$$\pi^*(u;x) = \frac{\exp\Big(\frac{1}{\lambda}\big(r(x,u) + \frac12\sigma^2(x,u)V''(x) + b(x,u)V'(x)\big)\Big)}{\int_U \exp\Big(\frac{1}{\lambda}\big(r(x,u) + \frac12\sigma^2(x,u)V''(x) + b(x,u)V'(x)\big)\Big)\,\mathrm{d}u}. \tag{16}$$

This leads to an optimal measure-valued process

$$\pi^*_t = \pi^*\big(u; X^*_t\big) = \frac{\exp\Big(\frac{1}{\lambda}\big(r(X^*_t,u) + \frac12\sigma^2(X^*_t,u)V''(X^*_t) + b(X^*_t,u)V'(X^*_t)\big)\Big)}{\int_U \exp\Big(\frac{1}{\lambda}\big(r(X^*_t,u) + \frac12\sigma^2(X^*_t,u)V''(X^*_t) + b(X^*_t,u)V'(X^*_t)\big)\Big)\,\mathrm{d}u}, \tag{17}$$

where $X^*_t$, $t\ge 0$, solves (6) when the feedback control distribution $\pi^*(\cdot\,;\cdot)$ in (16) is applied, and assuming that $\pi^* \in \mathcal A(x)$.

The formula (17) elicits a qualitative understanding of optimal exploration. We investigate this further in the next section.
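The Gibbs form (16) can be made concrete on a grid. In the sketch below, the entire exponent $r + \frac12\sigma^2 V'' + bV'$ is collapsed into a single hypothetical concave quadratic $q(u) = -(u-1)^2$ (not taken from the paper), and we normalize $\exp(q(u)/\lambda)$ numerically for two temperatures $\lambda$: a large $\lambda$ spreads the distribution out (more exploration), while a small $\lambda$ concentrates it near the maximizer of $q$ (more exploitation).

```python
import math

def gibbs_density(q, lam, us):
    """Discretized version of (16): pi*(u) proportional to exp(q(u)/lam),
    normalized on the uniformly spaced grid us."""
    h = us[1] - us[0]
    ws = [math.exp(q(u) / lam) for u in us]
    z = sum(ws) * h
    return [w / z for w in ws]

# Hypothetical exponent with argmax at u = 1.
q = lambda u: -((u - 1.0) ** 2)
us = [-6.0 + 0.001 * i for i in range(12001)]
hot = gibbs_density(q, 2.0, us)    # high temperature: spread out
cold = gibbs_density(q, 0.05, us)  # low temperature: peaked near u = 1

mean_cold = sum(p * u for p, u in zip(cold, us)) * (us[1] - us[0])
```

Since $q$ is quadratic, both densities are (discretized) Gaussians, previewing the LQ result of the next section; as $\lambda\downarrow 0$ the density approaches the Dirac measure at the classical optimizer.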

## 4 The Linear–Quadratic Case

We focus on the family of entropy-regularized problems with linear state dynamics and quadratic rewards, in which (we assume both the state and the control are scalar-valued for notational simplicity; there is no essential difficulty with these being vector-valued)

$$b(x,u) = Ax + Bu \quad\text{and}\quad \sigma(x,u) = Cx + Du,\qquad x, u \in \mathbb{R}, \tag{18}$$

where $A$, $B$, $C$, $D$ are given constants with $D \ne 0$, and

$$r(x,u) = -\Big(\frac{M}{2}x^2 + Rxu + \frac{N}{2}u^2 + Px + Qu\Big),\qquad x, u \in \mathbb{R}, \tag{19}$$

where $M \ge 0$, $N > 0$, and $R, P, Q \in \mathbb{R}$.

In the classical control literature, this type of linear–quadratic (LQ) control problem is among the most important, in that it admits elegant and simple solutions and, furthermore, more complex nonlinear problems can be approximated by LQ problems. As is standard with LQ control, we assume that the control is unconstrained, namely, $U = \mathbb{R}$.

Fix an initial state $x$. For each measure-valued control $\pi \in \mathcal A(x)$, we define its mean and variance processes $\{\mu_t, t\ge 0\}$ and $\{\sigma^2_t, t\ge 0\}$, namely,

$$\mu_t := \int_{\mathbb{R}} u\,\pi_t(u)\,\mathrm{d}u \quad\text{and}\quad \sigma^2_t := \int_{\mathbb{R}} u^2\,\pi_t(u)\,\mathrm{d}u - \mu_t^2.$$

Then (6) can be rewritten as

$$\mathrm{d}X^\pi_t = \big(AX^\pi_t + B\mu_t\big)\,\mathrm{d}t + \sqrt{C^2\big(X^\pi_t\big)^2 + 2CDX^\pi_t\mu_t + D^2\big(\mu_t^2 + \sigma^2_t\big)}\;\mathrm{d}W_t$$
$$= \big(AX^\pi_t + B\mu_t\big)\,\mathrm{d}t + \sqrt{\big(CX^\pi_t + D\mu_t\big)^2 + D^2\sigma^2_t}\;\mathrm{d}W_t,\quad t>0;\qquad X^\pi_0 = x. \tag{20}$$

Further, define

$$L\big(X^\pi_t, \pi_t\big) := \int_{\mathbb{R}} r\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u - \lambda\int_{\mathbb{R}} \pi_t(u)\ln\pi_t(u)\,\mathrm{d}u.$$

We will be working with the following assumption for the entropy-regularized LQ problem.

###### Assumption 1

The discount rate $\rho$ is sufficiently large relative to the model coefficients $A$, $B$, $C$ and $D$.

This assumption requires a sufficiently large discount rate or, implicitly, a sufficiently short planning horizon. Such an assumption is standard in infinite horizon problems with running rewards.

We are now ready to define the admissible control set for the entropy-regularized LQ problem as follows: $\pi \in \mathcal A(x)$ if

• for each $t\ge 0$, $\pi_t \in \mathcal P(\mathbb{R})$ a.s.;

• for each Borel set $E \subseteq \mathbb{R}$, the process $\big\{\int_E \pi_t(u)\,\mathrm{d}u,\ t\ge 0\big\}$ is $\{\mathcal F_t\}$-progressively measurable;

• for each $t\ge 0$, $\mu_t^2 + \sigma^2_t < \infty$ a.s.;

• there exists $\delta > 0$ such that

$$\liminf_{T\to\infty} e^{-\delta T}\int_0^T \mathbb{E}\big[\mu_t^2 + \sigma^2_t\big]\,\mathrm{d}t = 0;$$
• with $X^\pi$ solving (20), $\displaystyle\liminf_{T\to\infty} e^{-\rho T}\,\mathbb{E}\big[\big(X^\pi_T\big)^2\big] = 0$.

Under the above conditions, it is immediate that for any $\pi \in \mathcal A(x)$, both the drift and volatility terms of (20) satisfy a global Lipschitz condition in the state variable; hence the SDE (20) admits a unique strong solution $X^\pi = \{X^\pi_t, t\ge 0\}$.

Next, we solve the entropy-regularized stochastic LQ problem

$$V(x) = \sup_{\pi\in\mathcal A(x)} \mathbb{E}\bigg[\int_0^\infty e^{-\rho t}\Big(\int_{\mathbb{R}} r\big(X^\pi_t, u\big)\,\pi_t(u)\,\mathrm{d}u - \lambda\int_{\mathbb{R}} \pi_t(u)\ln\pi_t(u)\,\mathrm{d}u\Big)\,\mathrm{d}t \,\bigg|\, X^\pi_0 = x\bigg], \tag{21}$$

with $r$ as in (19) and $X^\pi$ as in (20).

In the following two subsections, we derive explicit solutions for the cases of state-independent and state-dependent rewards, respectively.

### 4.1 The case of state-independent reward

We start with the technically less challenging case in which $M = R = P = 0$, namely, the reward is state-independent. In this case, the system dynamics become irrelevant. However, the problem is still interesting in its own right, as it corresponds to the state-independent RL problem, known as the continuous-armed bandit problem in the continuous-time setting (see, e.g., Mandelbaum (1987); Kaspi and Mandelbaum (1998)).

Following the derivation in the previous section, the optimal control distribution in feedback form, (16), reduces to

$$\pi^*(u;x) = \frac{\exp\Big(\frac{1}{\lambda}\big(-\big(\frac{N}{2}u^2 + Qu\big) + \frac12(Cx+Du)^2 v''(x) + (Ax+Bu)v'(x)\big)\Big)}{\int_{\mathbb{R}} \exp\Big(\frac{1}{\lambda}\big(-\big(\frac{N}{2}u^2 + Qu\big) + \frac12(Cx+Du)^2 v''(x) + (Ax+Bu)v'(x)\big)\Big)\,\mathrm{d}u}$$
$$= \frac{\exp\Big(-\Big(u - \frac{CDx\,v''(x) + Bv'(x) - Q}{N - D^2 v''(x)}\Big)^2 \Big/ \frac{2\lambda}{N - D^2 v''(x)}\Big)}{\int_{\mathbb{R}} \exp\Big(-\Big(u - \frac{CDx\,v''(x) + Bv'(x) - Q}{N - D^2 v''(x)}\Big)^2 \Big/ \frac{2\lambda}{N - D^2 v''(x)}\Big)\,\mathrm{d}u}. \tag{22}$$

So the optimal control distribution appears to be Gaussian. More specifically, at any present state $x$, the agent should engage in exploration according to the Gaussian distribution with mean and variance given, respectively, by $\frac{CDx\,v''(x) + Bv'(x) - Q}{N - D^2 v''(x)}$ and $\frac{\lambda}{N - D^2 v''(x)}$. We note that the expression (22) requires $N - D^2 v''(x) > 0$, $x\in\mathbb{R}$, a condition that will be justified and discussed later on.

###### Remark 2

If we examine the derivation of (22) more closely, we easily see that the optimality of the Gaussian distribution still holds so long as the state dynamics are linear in the control and the reward is quadratic in the control, whereas their dependence on the state can be generally nonlinear.

Substituting (22) back into (13), the HJB equation becomes, after straightforward calculations,

$$\rho v(x) = \frac{\big(CDx\,v''(x) + Bv'(x) - Q\big)^2}{2\big(N - D^2 v''(x)\big)} + \frac{\lambda}{2}\Big(\ln\frac{2\pi e\lambda}{N - D^2 v''(x)} - 1\Big) + \frac12 C^2 x^2 v''(x) + Axv'(x). \tag{23}$$

In general, this nonlinear equation has multiple smooth solutions, even among quadratic polynomials that satisfy $N - D^2 v''(x) > 0$. One such solution is a constant, given by

$$v(x) \equiv v = \frac{Q^2}{2\rho N} + \frac{\lambda}{2\rho}\Big(\ln\frac{2\pi e\lambda}{N} - 1\Big), \tag{24}$$

with the corresponding optimal control distribution (22) being

$$\pi^*(u;x) = \frac{e^{-(u + Q/N)^2/(2\lambda/N)}}{\int_{\mathbb{R}} e^{-(u + Q/N)^2/(2\lambda/N)}\,\mathrm{d}u}.$$

Note that the classical LQ problem with the state-independent reward function $r(u) = -\big(\frac{N}{2}u^2 + Qu\big)$ clearly has the optimal control $u^* = -\frac{Q}{N}$, which is the mean of the optimal Gaussian control distribution above. The following result establishes that the constant $v$ in (24) is indeed the value function $V$.
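The constant solution (24) can be checked numerically against the stationarity relation $\rho v = \tilde r + \lambda H$ evaluated at the Gaussian optimizer: with $v'' = 0$, the HJB (23) says $\rho v$ equals the expected reward under $\pi^* = \mathcal N(-Q/N, \lambda/N)$ plus $\lambda$ times its entropy. The sketch below (all parameter values hypothetical) estimates the expected reward by Monte Carlo and compares.

```python
import math
import random

# Hypothetical parameters for the state-independent case r(u) = -(N/2 u^2 + Q u).
N_, Q, lam, rho = 2.0, 1.5, 0.8, 0.5

# Candidate value (24) and the optimal Gaussian pi* = N(-Q/N, lam/N).
V = Q ** 2 / (2 * rho * N_) + (lam / (2 * rho)) * (
    math.log(2 * math.pi * math.e * lam / N_) - 1)
m, s2 = -Q / N_, lam / N_

# Entropy of pi* in closed form, expected reward by Monte Carlo.
H = 0.5 * math.log(2 * math.pi * math.e * s2)
rng = random.Random(1)
n = 200_000
Er = sum(-(N_ / 2 * u * u + Q * u)
         for u in (rng.gauss(m, math.sqrt(s2)) for _ in range(n))) / n
```

The exact expected reward here is $\frac{Q^2}{2N} - \frac{\lambda}{2}$, and the identity $\rho V = \mathbb{E}[r(U)] + \lambda H$ holds, consistent with (23)–(24).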

Henceforth, for notational convenience, we denote by $\mathcal N\big(\cdot\,\big|\,\mu, \sigma^2\big)$ the density function of a Gaussian random variable with mean $\mu$ and variance $\sigma^2$.

###### Theorem 3

Suppose $M = R = P = 0$. Then the value function in (21) is given by

$$V(x) = \frac{Q^2}{2\rho N} + \frac{\lambda}{2\rho}\Big(\ln\frac{2\pi e\lambda}{N} - 1\Big),\qquad x\in\mathbb{R},$$

and the optimal control distribution is Gaussian: $\pi^*_t = \mathcal N\big(\cdot\,\big|\, -\frac{Q}{N}, \frac{\lambda}{N}\big)$, $t\ge 0$. The associated optimal state process $\{X^*_t, t\ge 0\}$ solves the SDE

$$\mathrm{d}X^*_t = \Big(AX^*_t - \frac{BQ}{N}\Big)\,\mathrm{d}t + \sqrt{\Big(CX^*_t - \frac{DQ}{N}\Big)^2 + \frac{\lambda D^2}{N}}\;\mathrm{d}W_t,\qquad X^*_0 = x, \tag{25}$$

which can be explicitly expressed as follows:

i) If $C \ne 0$, then

$$X^*_t = F(W_t, Y_t),\qquad t\ge 0, \tag{26}$$

where the function

$$F(z,y) = \sqrt{\frac{\lambda}{N}}\,\Big|\frac{D}{C}\Big|\,\sinh\!\Big(|C|z + \sinh^{-1}\Big(\sqrt{\frac{N}{\lambda}}\,\Big|\frac{C}{D}\Big|\Big(y - \frac{DQ}{CN}\Big)\Big)\Big) + \frac{DQ}{CN},$$

and the process $\{Y_t, t\ge 0\}$ is the unique pathwise solution to the random ordinary differential equation

$$\frac{\mathrm{d}Y_t}{\mathrm{d}t} = \frac{AF(W_t, Y_t) - \frac{BQ}{N} - \frac12\,\frac{\partial^2 F}{\partial z^2}(z,y)\big|_{z=W_t,\,y=Y_t}}{\frac{\partial F}{\partial y}(z,y)\big|_{z=W_t,\,y=Y_t}},\qquad Y_0 = x.$$

ii) If $C = 0$ and $A \ne 0$, then

$$X^*_t = xe^{At} - \frac{BQ}{AN}\big(e^{At} - 1\big) + \frac{|D|}{N}\sqrt{Q^2 + \lambda N}\int_0^t e^{A(t-s)}\,\mathrm{d}W_s,\qquad t\ge 0. \tag{27}$$

Proof. See Appendix A.
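The SDE (25) is straightforward to simulate. The sketch below (parameter values hypothetical, with $C = 0$ as in case ii) runs an Euler–Maruyama discretization over many paths and compares the Monte Carlo average of $X^*_T$ with the mean ODE $\mathrm{d}m/\mathrm{d}t = Am - BQ/N$ implied by taking expectations in (25), integrated on the same grid.

```python
import math
import random

def simulate_mean(A, B, C, D, N_, Q, lam, x0, T, dt, npaths, seed=3):
    """Euler-Maruyama for (25):
    dX = (A X - B Q/N) dt + sqrt((C X - D Q/N)^2 + lam D^2/N) dW,
    returning the Monte Carlo average of X_T over npaths paths."""
    rng = random.Random(seed)
    steps = int(round(T / dt))
    total = 0.0
    for _ in range(npaths):
        x = x0
        for _ in range(steps):
            vol = math.sqrt((C * x - D * Q / N_) ** 2 + lam * D * D / N_)
            x += (A * x - B * Q / N_) * dt + vol * math.sqrt(dt) * rng.gauss(0, 1)
        total += x
    return total / npaths

# Hypothetical parameters with C = 0 (case ii).
A, B, C, D, N_, Q, lam, x0, T = -1.0, 1.0, 0.0, 0.5, 2.0, 1.5, 0.8, 2.0, 1.0
mc_mean = simulate_mean(A, B, C, D, N_, Q, lam, x0, T, dt=0.01, npaths=4000)

# Deterministic mean ODE dm/dt = A m - B Q/N on the same Euler grid.
m = x0
for _ in range(100):
    m += (A * m - B * Q / N_) * 0.01
```

Note that only the mean is checked here; the stochastic integral in (27) contributes zero in expectation, so the comparison isolates the drift part of the explicit solution.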

The above solution suggests that when the reward is independent of the state, so is the optimal control distribution $\pi^*$. This is intuitive, as the objective (12) does not explicitly distinguish between states. A similar observation can be made for the (state-independent) pure entropy maximization formulation, where the goal is to solve

$$\sup_{\pi\in\mathcal A(x)} \mathbb{E}\Big[\int_0^\infty e^{-\rho t}\,\lambda H(\pi_t)\,\mathrm{d}t\Big]. \tag{28}$$

This problem becomes relevant when $\lambda\to\infty$ in the entropy-regularized objective (21), corresponding to the extreme case in which the least informative (or highest-entropy) distribution is favored for pure exploration, without considering exploitation (i.e., without maximizing any rewards). To solve problem (28), we can maximize the integrand pointwise, leading to the state-independent optimization problem

$$\max_{\pi\in\mathcal P(U)} \Big(-\int_U \pi(u)\ln\pi(u)\,\mathrm{d}u\Big). \tag{29}$$

If the action space $U$ is a finite interval, say $[a, b]$, then it is straightforward that the optimal control distribution is, for all $t$, the uniform distribution on $[a, b]$. This is in accordance with the traditional static setting, where the uniform distribution achieves maximum entropy on any finite interval (see, e.g., Shannon (2001)).

A remarkable feature of the derived optimal distribution is that its mean coincides with the optimal control of the original, non-exploratory LQ problem, whereas its variance is determined by the temperature parameter $\lambda$. In the context of the continuous-armed bandit problem, the mean is concentrated on the current incumbent of the best arm and the variance is determined by $\lambda$. The more weight put on the level of exploration, the more spread out the exploration around the current best arm. This type of exploration/exploitation strategy is clearly intuitive and, in turn, gives guidance on how to actually choose $\lambda$ in practice: up to a scaling by $N$, it is nothing else than the variance of the exploration the agent wishes to engage in.

However, we shall see in the next section that when the reward depends on the state, the HJB equation does not admit a constant solution. Consequently, the optimal control distribution must be of a feedback form, depending on where the state is at any given time.

### 4.2 The case of state-dependent reward

We now consider the case of a general state-dependent reward, i.e.,

$$r(x,u) = -\Big(\frac{M}{2}x^2 + Rxu + \frac{N}{2}u^2 + Px + Qu\Big),\qquad x, u \in \mathbb{R}.$$

We start with the following lemma that will be used for the verification arguments.

###### Lemma 4

For each , , the unique solution ,