# Online Learning in Kernelized Markov Decision Processes

###### Abstract

We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process priors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are either sampled from Gaussian process priors (fully Bayesian setting) or are members of the associated Reproducing Kernel Hilbert Spaces of functions induced by symmetric psd kernels (frequentist setting), we show that the algorithms enjoy sublinear regret bounds. The bounds are in terms of explicit structural parameters of the kernels, namely a novel generalization of the information gain metric from kernelized bandits, and highlight the influence of transition and reward function structure on the learning performance. Our results are applicable to multi-dimensional state and action spaces with composite kernel structures, and generalize results from the literature on kernelized bandits and on the adaptive control of parametric linear dynamical systems with quadratic costs.

Electrical Communication Engineering,

Indian Institute of Science,

Bangalore 560012, India

Aditya Gopalan aditya@iisc.ac.in

Electrical Communication Engineering,

Indian Institute of Science,

Bangalore 560012, India

## 1 Introduction

The reinforcement learning (RL) paradigm involves an agent acting in an unknown environment, receiving reward signals, and simultaneously influencing the evolution of the environment’s state. The goal in RL problems is to learn optimal behavior (policies) by repeated interaction with the environment, which is usually modelled by a Markov Decision Process (MDP). Performance is typically measured by the amount of interaction, in terms of episodes or rounds, needed to arrive at an optimal (or near-optimal) policy, also known as the sample complexity of RL (Strehl et al., 2009). The sample complexity objective encourages efficient exploration across states and actions, but, at the same time, is indifferent to the reward earned during the learning phase.

In contrast, the goal in online RL is to learn while accumulating high rewards, or equivalently to keep the learner’s regret (the gap between its net reward and the optimal net reward) as low as possible along the way. This is preferable in settings where experimentation time is at a premium and/or the reward earned in each round is of direct value, e.g., recommender systems (in which rewards correspond to clickthrough events and ultimately translate to revenue), dynamic pricing, automated trading, and, more generally, the control of dynamically evolving systems with instantaneous costs. It is well-known that the regret objective encourages more aggressive exploitation and more conservative exploration than would typically be needed for optimizing sample complexity.

A primary challenge in RL is to learn efficiently, with only a meagre reward signal as feedback, across complex (very large or infinite) state and action spaces. In the most general tabula rasa case of uncertainty about the MDP, it is known that the learner must explore each state-action transition before developing a reasonably clear understanding of the environment, rendering learning in reasonably small time impossible. Real-world domains, however, possess more structure: transition and reward behavior often varies smoothly over states and actions, making it possible to generalize via inductive inference. Observing a state transition or reward is now informative of many other related or similar transitions or rewards. Scaling RL to large, complex, real-world domains thus requires exploiting regularity structure in the environment, and this is typically accomplished through the use of a parametric model.

While the principle of exploiting regularity structure has been extensively developed for classical, model-free RL in the form of function approximation techniques (Van Roy, 1998), it has received far less attention in the online RL setup. Notable work in this regard includes online learning for parametric and nonparametric multi-armed bandits or single-state MDPs (Agrawal and Goyal, 2013; Abbasi-Yadkori et al., 2011; Srinivas et al., 2009; Gopalan et al., 2014), and, more recently, regret minimization in the parametric MDP setting (Osband and Van Roy, 2014a; Gopalan and Mannor, 2015; Agrawal and Jia, 2017).

This paper takes a step in developing theory and algorithms for online RL in environments with smooth transition and reward structure. We specifically consider the episodic online learning problem in the nonparametric, kernelizable MDP setting, i.e., of minimizing regret (relative to an optimal finite-horizon policy) in MDPs with continuous state and action spaces, whose transition and reward functions exhibit smoothness over states and actions compatible with the structure of a reproducing kernel. We develop variants of the well-known UCRL and posterior sampling algorithms for MDPs with continuous state and action spaces, and show that they enjoy sublinear, finite-time regret bounds when the mean transition and reward functions are assumed to either a) be sampled from Gaussian processes with symmetric psd kernels, or b) belong to the associated Reproducing Kernel Hilbert Space (RKHS) of functions.

Our results bound the regret of the algorithms in terms of a novel generalization of the information gain of the state transition and reward function kernels, from the memoryless kernel bandit setting (Srinivas et al., 2009) to the state-based kernel MDP setting, and help shed light on how the choice of kernel model influences regret performance. We also leverage recent concentration of measure results for RKHS-valued martingales, developed originally for the kernelized bandit setting (Chowdhury and Gopalan, 2017b), to prove the results in the paper. To the best of our knowledge, these are the first concrete regret bounds for RL in the kernelizable setting, explicitly showing the dependence of regret on kernel structure.

Our results represent a generalization of several streams of work. We generalize online learning in the kernelized bandit setting (Srinivas et al., 2009; Valko et al., 2013; Chowdhury and Gopalan, 2017b) to kernelized MDPs, and tabula rasa online learning approaches for MDPs such as UCRL (Jaksch et al., 2010) and PSRL (Osband et al., 2013) to kernelized (structured) MDPs. Lastly, this work also generalizes online RL for the well-known Linear Quadratic Regulator (LQR) problem (Abbasi-Yadkori and Szepesvári, 2011, 2015; Ibrahimi et al., 2012; Abeille and Lazaric, 2017) to its nonlinear, nonparametric, infinite-dimensional, kernelizable counterpart.

## 2 Problem Statement

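We consider the problem of learning to optimize reward in an unknown finite-horizon MDP, $M_\star = \{\mathcal{S}, \mathcal{A}, R_\star, P_\star, H, \rho\}$, over repeated episodes of interaction. Here, $\mathcal{S}$ represents the state space, $\mathcal{A}$ the action space, $H$ the episode length, $R_\star(s,a)$ the reward distribution over $\mathbb{R}$, $P_\star(s,a)$ the transition distribution over $\mathcal{S}$, and $\rho$ the initial state distribution over $\mathcal{S}$. At each period $h = 1, 2, \ldots, H$ within an episode, an agent observes a state $s_h \in \mathcal{S}$, takes an action $a_h \in \mathcal{A}$, observes a reward $r_h \sim R_\star(s_h, a_h)$, and causes the MDP to transition to a next state $s_{h+1} \sim P_\star(s_h, a_h)$. We assume that the agent, while not possessing knowledge of the reward and transition distributions $R_\star, P_\star$ of the unknown MDP $M_\star$, knows $\mathcal{S}$, $\mathcal{A}$, and $H$.

A policy $\pi$ is defined to be a mapping from state $s \in \mathcal{S}$ and period $1 \le h \le H$ to an action $a \in \mathcal{A}$. For an MDP $M$ and policy $\pi$, define the finite horizon, undiscounted, value function for every state $s \in \mathcal{S}$ and every period $1 \le h \le H$ as $V^{M}_{\pi,h}(s) := \mathbb{E}_{M,\pi}\big[\sum_{j=h}^{H} \bar{R}_M(s_j, a_j) \mid s_h = s\big]$, where the subscript $\pi$ indicates the application of the learning policy $\pi$, i.e., $a_j = \pi(s_j, j)$, and the subscript $M$ explicitly references the MDP environment $M$, i.e., $s_{j+1} \sim P_M(s_j, a_j)$, for all $j = h, \ldots, H$.

We use $\bar{R}_M(s,a) = \mathbb{E}[r \mid r \sim R_M(s,a)]$ to denote the mean of the reward distribution $R_M(s,a)$ that corresponds to playing action $a$ at state $s$ in the MDP $M$. We can view a sample $r$ from the reward distribution $R_M(s,a)$ as $r = \bar{R}_M(s,a) + \varepsilon_R$, where $\varepsilon_R$ denotes a sample of zero-mean, real-valued additive noise. Similarly, the transition distribution $P_M(s,a)$ can also be decomposed as a mean value $\bar{P}_M(s,a)$ plus a zero-mean additive noise $\varepsilon_P$, so that the next state $s' = \bar{P}_M(s,a) + \varepsilon_P$ lies in $\mathcal{S}$ (Osband and Van Roy (2014a) argue that this assumption is not restrictive for most practical settings). A policy $\pi_M$ is said to be optimal for the MDP $M$ if $V^{M}_{\pi_M,h}(s) = \max_{\pi} V^{M}_{\pi,h}(s)$ for all $s \in \mathcal{S}$ and $h = 1, \ldots, H$.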
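The additive-noise view of rewards and transitions described above can be illustrated with a short simulation. The sketch below is only a toy under stated assumptions: the state and action are scalars, the noise is Gaussian, and `mean_reward`, `mean_transition`, `policy`, and all numeric constants are hypothetical choices made for illustration, not quantities from the paper.

```python
import numpy as np

def run_episode(policy, mean_reward, mean_transition, s0, H,
                noise_std_r, noise_std_p, rng):
    """Roll out one episode of length H.

    Each reward is the mean reward plus zero-mean noise, and each next
    state is the mean transition plus zero-mean noise.
    Returns a list of (state, action, reward, next_state) tuples.
    """
    s = s0
    trajectory = []
    for h in range(1, H + 1):
        a = policy(s, h)                       # action chosen from state and period
        r = mean_reward(s, a) + noise_std_r * rng.standard_normal()
        s_next = mean_transition(s, a) + noise_std_p * rng.standard_normal()
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory

# Hypothetical smooth mean functions and a simple stationary policy.
mean_reward = lambda s, a: -(s ** 2) - 0.1 * a ** 2
mean_transition = lambda s, a: 0.9 * s + 0.5 * a
policy = lambda s, h: -s

rng = np.random.default_rng(0)
traj = run_episode(policy, mean_reward, mean_transition, s0=1.0, H=5,
                   noise_std_r=0.1, noise_std_p=0.05, rng=rng)
episode_return = sum(r for _, _, r, _ in traj)
```

The learner only ever sees the tuples in `traj`; the mean functions themselves stay hidden, which is what makes generalization across nearby state-action pairs necessary.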

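For an MDP $M$, a distribution $\varphi$ over $\mathcal{S}$, and every period $1 \le h \le H$, define the one-step future value function as the expected value of the optimal policy $\pi_M$, with the next state distributed according to $\varphi$, i.e., $U^{M}_{h}(\varphi) := \mathbb{E}_{s' \sim \varphi}\big[V^{M}_{\pi_M,h+1}(s')\big]$. We assume the following regularity condition on the future value function of any MDP in our uncertainty class, also made by Osband and Van Roy (2014b).

Assumption. For any two single-step transition distributions $\varphi_1, \varphi_2$ over $\mathcal{S}$, and every period $1 \le h \le H$,

$$\big|U^{M}_{h}(\varphi_1) - U^{M}_{h}(\varphi_2)\big| \le L_{M,h}\, \big\|\bar{\varphi}_1 - \bar{\varphi}_2\big\|_2, \tag{1}$$

where $\bar{\varphi} := \mathbb{E}_{s' \sim \varphi}[s']$ denotes the mean of the distribution $\varphi$. In other words, the one-step future value functions for each period $h$ are Lipschitz continuous with respect to the $\|\cdot\|_2$-norm of the mean, with Lipschitz constant $L_{M,h}$, and $L_M := \max_{1 \le h \le H} L_{M,h}$ is defined to be the global Lipschitz constant for MDP $M$. (Assumption (1) is essentially equivalent to assuming knowledge of the centered state transition noise distributions, since it implies that any two transition distributions with the same means are identical.) Also, assume that there is a known constant $L$ such that $L_{M_\star} \le L$.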

Regret. At the beginning of each episode $l$, an RL algorithm chooses a policy $\pi_l$ depending upon the observed state-action-reward sequence up to episode $l-1$, denoted by the history $\mathcal{H}_{l-1}$, and executes it for the entire duration of the episode. In other words, at each period $h$ of the $l$-th episode, the learning algorithm chooses action $a_{l,h} = \pi_l(s_{l,h}, h)$, receives reward $r_{l,h}$ and observes the next state $s_{l,h+1}$. The goal of an episodic online RL algorithm is to maximize its cumulative reward across episodes, or, equivalently, minimize its cumulative regret: the loss incurred in terms of the value function due to not knowing the optimal policy $\pi_\star$ of the unknown MDP $M_\star$ beforehand and instead using the policy $\pi_l$ for each episode $l$, $l = 1, 2, \ldots$. The cumulative (expected) regret of an RL algorithm up to time (i.e., period) horizon $T = \tau H$ is defined as

$$\mathrm{Regret}(T) = \sum_{l=1}^{\tau} \Big[ V^{M_\star}_{\pi_\star,1}(s_{l,1}) - V^{M_\star}_{\pi_l,1}(s_{l,1}) \Big].$$

## 3 Algorithms

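Representing uncertainty. The algorithms we design represent uncertainty in the reward and transition distributions $R_\star, P_\star$ by maintaining Gaussian process (GP) priors over the mean reward function $\bar{R}_\star$ and the mean transition function $\bar{P}_\star$ of the unknown MDP $M_\star$, both viewed as functions on the state-action space $\mathcal{Z} := \mathcal{S} \times \mathcal{A}$. A Gaussian process over $\mathcal{Z}$, denoted by $GP_{\mathcal{Z}}(\mu(\cdot), k(\cdot,\cdot))$, is a collection of random variables $(f(z))_{z \in \mathcal{Z}}$, one for each $z \in \mathcal{Z}$, such that every finite sub-collection of random variables $(f(z_i))_{i=1}^{m}$ is jointly Gaussian with mean $\mathbb{E}[f(z_i)] = \mu(z_i)$ and covariance $\mathbb{E}[(f(z_i) - \mu(z_i))(f(z_j) - \mu(z_j))] = k(z_i, z_j)$, $1 \le i, j \le m$, $m \in \mathbb{N}$. We use $GP_{\mathcal{Z}}(0, k_R)$ and $GP_{\mathcal{Z}}(0, k_P)$ as the initial prior distributions over $\bar{R}_\star$ and $\bar{P}_\star$, with positive semi-definite covariance (kernel) functions $k_R$ and $k_P$ respectively. We also assume that the noise variables $\varepsilon_{R,l,h}$ and $\varepsilon_{P,l,h}$ are drawn independently, across $l$ and $h$, from $\mathcal{N}(0, \lambda_R)$ and $\mathcal{N}(0, \lambda_P)$ respectively, with $\lambda_R, \lambda_P \ge 0$. Then, by standard properties of GPs (Rasmussen and Williams, 2006), conditioned on the history $\mathcal{H}_l$, the posterior distribution over $\bar{R}_\star$ is also a Gaussian process, $GP_{\mathcal{Z}}(\mu_{R,l}, k_{R,l})$, with mean and kernel functions

$$\mu_{R,l}(z) = k_{R,l}(z)^\top (K_{R,l} + \lambda_R I)^{-1} R_l, \tag{2}$$

$$k_{R,l}(z, z') = k_R(z, z') - k_{R,l}(z)^\top (K_{R,l} + \lambda_R I)^{-1} k_{R,l}(z'), \tag{3}$$

$$\sigma^2_{R,l}(z) = k_{R,l}(z, z). \tag{4}$$

Here, $R_l := [r_{1,1}, \ldots, r_{l,H}]^\top$ is the vector of rewards observed at $\mathcal{Z}_l := \{z_{1,1}, \ldots, z_{l,H}\}$, the set of state-action pairs $z_{j,h} := (s_{j,h}, a_{j,h})$ visited up to the end of episode $l$, $K_{R,l} := [k_R(u, v)]_{u,v \in \mathcal{Z}_l}$ is the (positive semi-definite) kernel matrix (corresponding to $\mathcal{Z}_l$), and $k_{R,l}(z) := [k_R(z_{1,1}, z), \ldots, k_R(z_{l,H}, z)]^\top$. Thus, at the end of episode $l$, conditioned on the history $\mathcal{H}_l$, the posterior distribution over $\bar{R}_\star(z)$ is updated and maintained as $\mathcal{N}(\mu_{R,l}(z), \sigma^2_{R,l}(z))$ for every $z \in \mathcal{Z}$. Similarly at the end of episode $l$, the posterior distribution over $\bar{P}_\star(z)$ is $\mathcal{N}(\mu_{P,l}(z), \sigma^2_{P,l}(z))$ for every $z \in \mathcal{Z}$, with

$$\mu_{P,l}(z) = k_{P,l}(z)^\top (K_{P,l} + \lambda_P I)^{-1} S_l, \tag{5}$$

$$k_{P,l}(z, z') = k_P(z, z') - k_{P,l}(z)^\top (K_{P,l} + \lambda_P I)^{-1} k_{P,l}(z'), \tag{6}$$

$$\sigma^2_{P,l}(z) = k_{P,l}(z, z), \tag{7}$$

where $S_l := [s_{1,2}, \ldots, s_{l,H+1}]^\top$ is the vector of states transitioned to at $\mathcal{Z}_l$, $K_{P,l} := [k_P(u, v)]_{u,v \in \mathcal{Z}_l}$ is the corresponding kernel matrix and $k_{P,l}(z) := [k_P(z_{1,1}, z), \ldots, k_P(z_{l,H}, z)]^\top$. This representation not only permits generalization via inductive inference across continuum state and action spaces, but also allows for tractable updates.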
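The closed-form GP posterior update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a Squared Exponential kernel with a hypothetical `lengthscale`, treats each state-action pair as a row of `Z_obs`, and the names `se_kernel`, `gp_posterior`, and `noise_var` are introduced here only for the example.

```python
import numpy as np

def se_kernel(X, Y, lengthscale=1.0):
    """Squared Exponential kernel matrix between the rows of X and the rows of Y."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)

def gp_posterior(Z_obs, y_obs, Z_query, noise_var, kernel=se_kernel):
    """Posterior mean and variance of a zero-mean GP at the query points.

    Z_obs:   (n, d) observed state-action pairs
    y_obs:   (n,)   observed rewards (or next states)
    Z_query: (q, d) points at which to evaluate the posterior
    """
    A = kernel(Z_obs, Z_obs) + noise_var * np.eye(len(Z_obs))
    k_q = kernel(Z_obs, Z_query)                      # shape (n, q)
    mean = k_q.T @ np.linalg.solve(A, y_obs)          # k(z)^T (K + noise I)^{-1} y
    reduction = np.sum(k_q * np.linalg.solve(A, k_q), axis=0)
    var = np.diag(kernel(Z_query, Z_query)) - reduction
    return mean, np.maximum(var, 0.0)                 # clip tiny negative round-off

# Usage: three observed (state, action) pairs with scalar rewards.
Z_obs = np.array([[0.0, 0.0], [0.5, -0.5], [1.0, 0.2]])
y_obs = np.array([1.0, 0.3, -0.4])
Z_query = np.array([[0.25, -0.25], [3.0, 3.0]])
post_mean, post_var = gp_posterior(Z_obs, y_obs, Z_query, noise_var=0.01)
```

Query points close to the data get a small posterior variance, while points far from all observations revert to the prior (mean near zero, variance near one for this kernel). This is the sense in which one observation is informative about similar state-action pairs.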

We now present our online algorithms GP-UCRL and -PSRL for kernelized MDPs.

##### GP-UCRL Algorithm.

GP-UCRL (Algorithm 1) is an optimistic algorithm based on the Upper Confidence Bound principle, which adapts the confidence sets of UCRL2 (Jaksch et al., 2010) to exploit the kernel structure. GP-UCRL, at the start of every episode $l$, constructs two confidence sets $\mathcal{C}_{R,l}$ and $\mathcal{C}_{P,l}$ respectively (the exact forms of the confidence sets appear in the relevant theoretical results later), using the parameters of corresponding posteriors and appropriately chosen confidence widths $\beta_{R,l}, \beta_{P,l}$. Then it builds the set $\mathcal{M}_l$ of all plausible MDPs $M$ with the global Lipschitz constant of future value functions (as defined in (1)) $L_M \le L$, mean reward function $\bar{R}_M \in \mathcal{C}_{R,l}$ and mean transition function $\bar{P}_M \in \mathcal{C}_{P,l}$. Finally, it chooses the optimistic policy $\pi_l$ in the set $\mathcal{M}_l$, satisfying $V^{M_l}_{\pi_l,h}(s) = \max_{\pi} \max_{M \in \mathcal{M}_l} V^{M}_{\pi,h}(s)$ for all $s \in \mathcal{S}$ and $h = 1, \ldots, H$, and executes it for the entire episode. The pseudo-code is given in Algorithm 1.

Optimizing for an optimistic policy is not computationally tractable in general, even though planning for the optimal policy is possible for a given MDP. A popular approach to overcome this difficulty is to sample a random MDP at every episode and solve for its optimal policy, called Posterior sampling (Osband and Van Roy, 2016).

##### -PSRL Algorithm.

-PSRL, in its most general form, starts with a prior distribution over MDPs (for an MDP, this is a prior over reward distributions and transition dynamics) and at the start of episode $l$ samples an MDP $M_l$ from the posterior distribution (sampling can be done using MCMC methods even if the posterior does not admit any closed form). For example, when the prior is specified by the GPs and the observation model is Gaussian, the posterior admits a closed form given by the GP posteriors discussed above, and we denote the corresponding algorithm as GP-PSRL. Hence at the start of episode $l$, GP-PSRL samples an MDP $M_l$, with mean reward function $\bar{R}_{M_l}$ and mean transition function $\bar{P}_{M_l}$ drawn from the corresponding GP posteriors. Then it chooses the optimal policy $\pi_l$ of $M_l$, satisfying $V^{M_l}_{\pi_l,h}(s) = \max_{\pi} V^{M_l}_{\pi,h}(s)$ for all $s \in \mathcal{S}$ and $h = 1, \ldots, H$, and executes it for the entire episode. The pseudo-code is given in the supplementary material.
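The sampling step of the GP-based posterior sampling scheme can be sketched on a finite grid of state-action pairs. The sketch covers only the draw of a mean reward function from the GP posterior; the planning step that turns a sampled MDP into a policy is omitted. The grid, the Squared Exponential kernel, and the names `se_kernel` and `sample_gp_posterior` are assumptions made for this example, and restricting the sample to a grid is a simplification rather than part of the algorithm as stated.

```python
import numpy as np

def se_kernel(X, Y, lengthscale=1.0):
    """Squared Exponential kernel matrix between the rows of X and the rows of Y."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)

def sample_gp_posterior(Z_obs, y_obs, Z_grid, noise_var, rng):
    """Draw one function sample from the GP posterior, evaluated on Z_grid."""
    A = se_kernel(Z_obs, Z_obs) + noise_var * np.eye(len(Z_obs))
    k_g = se_kernel(Z_obs, Z_grid)
    mean = k_g.T @ np.linalg.solve(A, y_obs)
    cov = se_kernel(Z_grid, Z_grid) - k_g.T @ np.linalg.solve(A, k_g)
    cov = 0.5 * (cov + cov.T)                       # symmetrize against round-off
    # An eigendecomposition gives a square root even when cov is numerically singular.
    eigvals, eigvecs = np.linalg.eigh(cov)
    root = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return mean + root @ rng.standard_normal(len(Z_grid))

# Usage: a 5 x 5 grid over scalar states and actions, with two observations.
states = np.linspace(-1.0, 1.0, 5)
actions = np.linspace(-1.0, 1.0, 5)
Z_grid = np.array([(s, a) for s in states for a in actions])
Z_obs = np.array([[0.0, 0.0], [0.5, -0.5]])
y_obs = np.array([1.5, 0.2])
rng = np.random.default_rng(0)
sampled_reward = sample_gp_posterior(Z_obs, y_obs, Z_grid, noise_var=1e-6, rng=rng)
```

Each call produces a different plausible reward function. Samples agree closely at observed pairs and vary elsewhere, and this variation is what drives exploration in posterior sampling.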

Discussion. A fundamental issue in model-based RL is that planning for the optimal policy may be computationally intractable even in a given MDP and it is common practice in the literature to assume access to an approximate MDP planner like extended value iteration (Jaksch et al., 2010). The design of such approximate planners for continuous state and action spaces is a subject of active research, and our focus in this work is on the statistical efficiency of the online learning problem.

## 4 Main Results

We start by recalling the following result, which will be a key tool to derive our regret bounds. Although it has been used previously in an implicit form to derive regret bounds for GP-based bandit algorithms (Srinivas et al., 2009; Krause and Ong, 2011; Bogunovic et al., 2016; Chowdhury and Gopalan, 2017b), we could not find an explicit presentation of the form below.

###### Lemma 1

Let be a symmetric positive semi-definite kernel function. For any , , and , let , where and . Then,

with being the kernel matrix computed at .

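The right hand side term in the conclusion of Lemma 1 is called the maximum information gain $\gamma_T$. It is well-known that for a compact and convex subset of $\mathbb{R}^d$, for the Squared Exponential and Matérn kernels (with smoothness parameter $\nu > 1$), $\gamma_T$ is $O((\ln T)^{d+1})$ and $O(T^{d(d+1)/(2\nu + d(d+1))} \ln T)$ respectively (Srinivas et al., 2009), so that it grows only poly-logarithmically in the time $T$ for the Squared Exponential kernel and sublinearly in $T$ for the Matérn kernel.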

The proof of the lemma appears in the supplementary material, and relies on the fact that if is drawn from , then the posterior distribution of is Gaussian with variance conditioned on the observations , where each with noise sequence is iid .
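The link between sequential posterior variances and the log-determinant of the kernel matrix, which underlies information gain arguments of this kind, can be checked numerically. The sketch below verifies the standard identity $\frac{1}{2}\sum_{t=1}^{T} \ln\big(1 + \lambda^{-1}\sigma^2_{t-1}(x_t)\big) = \frac{1}{2}\ln\det\big(I + \lambda^{-1}K_T\big)$ for a zero-mean GP prior with noise variance $\lambda$; it is offered as an illustration and is not claimed to be the exact statement of Lemma 1. The kernel choice and the function names are assumptions for the example.

```python
import numpy as np

def se_kernel(X, Y, lengthscale=1.0):
    """Squared Exponential kernel matrix between the rows of X and the rows of Y."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)

def sequential_variances(X, noise_var):
    """Posterior variance at x_t given x_1..x_{t-1}, for t = 1..T."""
    T = len(X)
    out = np.empty(T)
    for t in range(T):
        prior = se_kernel(X[t:t + 1], X[t:t + 1])[0, 0]
        if t == 0:
            out[t] = prior                      # nothing observed yet
        else:
            A = se_kernel(X[:t], X[:t]) + noise_var * np.eye(t)
            k = se_kernel(X[:t], X[t:t + 1])[:, 0]
            out[t] = prior - k @ np.linalg.solve(A, k)
    return out

def information_gain(X, noise_var):
    """0.5 * log det(I + K / noise_var) for the points in X."""
    K = se_kernel(X, X)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise_var)
    return 0.5 * logdet

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 2))
lam = 0.1
lhs = 0.5 * np.sum(np.log1p(sequential_variances(X, lam) / lam))
rhs = information_gain(X, lam)
```

The two quantities `lhs` and `rhs` agree up to floating-point error. Maximizing the right-hand side over all point sets of size $T$ gives a quantity that depends only on the kernel and the domain, which is why regret bounds can be stated in terms of kernel structure alone.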

Composite kernels. In our kernelized MDP setting, the kernel matrix in Lemma 1 is over state-action pairs, and hence, we consider composite kernels on the product space $\mathcal{S} \times \mathcal{A}$. We can use either a product kernel $k := k_1 \otimes k_2$ with $(k_1 \otimes k_2)\big((s,a),(s',a')\big) := k_1(s,s')\,k_2(a,a')$, or an additive kernel $k := k_1 \oplus k_2$ with $(k_1 \oplus k_2)\big((s,a),(s',a')\big) := k_1(s,s') + k_2(a,a')$. (Many widely used kernel functions are already in the product form. For example, the product of two SE kernels (or Matérn kernels with smoothness parameter $\nu$) is an SE kernel (or a Matérn kernel with the same smoothness parameter $\nu$).) Krause and Ong (2011) show the following for product kernels:

(8) |

if the kernel function has rank at most . Therefore, if ’s for the individual kernels are logarithmic in , then the same is true for the composite kernel. For example, for the product of a dimensional linear kernel and a dimensional SE kernel is .
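Product and additive kernels over state-action pairs are straightforward to build from a state kernel and an action kernel. The following sketch pairs a Squared Exponential kernel on states with a linear kernel on actions; which kernel acts on which component, the dimensions, and the helper names are hypothetical choices for illustration.

```python
import numpy as np

def se_kernel(X, Y, lengthscale=1.0):
    """Squared Exponential kernel matrix between the rows of X and the rows of Y."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)

def linear_kernel(X, Y):
    """Linear kernel; its rank is at most the input dimension."""
    return X @ Y.T

def product_kernel(k_state, k_action, S1, A1, S2, A2):
    """k((s,a),(s',a')) = k_state(s,s') * k_action(a,a')."""
    return k_state(S1, S2) * k_action(A1, A2)

def additive_kernel(k_state, k_action, S1, A1, S2, A2):
    """k((s,a),(s',a')) = k_state(s,s') + k_action(a,a')."""
    return k_state(S1, S2) + k_action(A1, A2)

# Usage: six state-action pairs with 2-dimensional states and scalar actions.
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(6, 2))
A = rng.uniform(-1.0, 1.0, size=(6, 1))
K_prod = product_kernel(se_kernel, linear_kernel, S, A, S, A)
K_add = additive_kernel(se_kernel, linear_kernel, S, A, S, A)
```

Both composite matrices are symmetric positive semi-definite: sums of psd kernels are psd, and elementwise products of psd matrices are psd by the Schur product theorem, so either can be used wherever a kernel matrix over state-action pairs is needed.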

With these tools in hand, we are now in a position to state the regret bounds for our algorithms. However, attaining sub-linear regret is impossible in general for arbitrary mean reward function and mean transition function , so some regularity assumptions are needed.

##### Regret Bound of GP-UCRL in the Bayesian Setup.

We assume that themselves are sampled from respectively, the same GP priors used by the GP-UCRL algorithm, and that the noise in state transitions and rewards are distributed as and , respectively. Thus, in this case the algorithm has exact knowledge of the data generating process (the ‘fully Bayesian setup’). Also, in order to achieve non-trivial regret for continuous state/action MDPs, we need the following smoothness assumptions similar to those made by Srinivas et al. (2009) on the kernels. We assume that and are compact state and action spaces, and that the kernels and satisfy high probability bounds on the derivatives of GP sample paths and , respectively, as follows: For some positive constants , for every ,

(9) |

where is the dimension of the product space (this smoothness assumption holds for stationary kernels that are four times differentiable, such as SE and Matérn kernels with $\nu > 2$). In this setup, at every episode and for any fixed , GP-UCRL constructs discretization sets and of state space and action space , with respective size and , such that for all and for all , and , where denotes the closest point in to and denotes the closest point in to . Then, GP-UCRL sets , and constructs the confidence sets and as the following:

(10) | |||||

(11) |

where denotes the state-action pair . The following result shows the sublinear regret guarantee of GP-UCRL under these assumptions.

###### Theorem 1 (Bayesian regret bound for GP-UCRL)

Let and be compact, and let . Let be symmetric, positive-semidefinite kernels satisfying bounded variance: for all (this assumption holds for stationary kernels, such as SE and Matérn). Let be samples from respectively and the kernels and satisfy (9). Let the noise variables be iid, zero-mean, Gaussian with variance , respectively. Also, let be a known upper bound over the global Lipschitz constant for the future value function, and be the diameter of .
For any , GP-UCRL with confidence sets constructed as in (10), (11), enjoys, with probability at least , the regret bound (here we use the shorthand and )

where is a constant depending on the properties of such that (this is a mild assumption (Bogunovic et al., 2016) on the kernel , since is Gaussian and thus has exponential tails).

Remark. Since for all and as discussed earlier, and grow only poly-logarithmically with for commonly used kernels and for their compositions, the regret of GP-UCRL with such kernels grows sub-linearly with . Further, the bound depends linearly on the term , which can be viewed as the ‘scaled diameter’ measuring the connectedness of the MDP space. A similar observation was made by Osband and Van Roy (2014b), although in the different setting of factored MDPs.

##### Regret Bound of GP-UCRL in the Agnostic Setup.

We now consider an agnostic setting, where we assume that the mean reward and mean transition functions have small norm in the Reproducing Kernel Hilbert Space (RKHS) of functions , with kernels respectively. An RKHS, denoted by $\mathcal{H}_k$, is completely specified by its kernel function $k(\cdot,\cdot)$ and vice-versa, with an inner product $\langle \cdot, \cdot \rangle_k$ obeying the reproducing property: $f(z) = \langle f, k(z, \cdot) \rangle_k$ for all $f \in \mathcal{H}_k$. The RKHS norm $\|f\|_k = \sqrt{\langle f, f \rangle_k}$ is a measure of smoothness of $f$ with respect to the kernel function $k$. We assume known bounds on the RKHS norms of the mean reward and mean transition functions: and . Further, we assume that the noise variables and are predictably -sub-Gaussian and -sub-Gaussian for fixed constants and respectively (see Abbasi-Yadkori et al. (2011), Durand et al. (2017) for details on sub-Gaussian predictable models), i.e., ,

(12) |

where is the generated -algebra.

Note that we still run the same GP-UCRL algorithm as before, but whose prior and noise models are now misspecified. However, GP-UCRL assumes the knowledge of sub-Gaussianity parameters , kernel functions and upper bounds of RKHS norms . Then, for any , it sets and , and constructs confidence sets and as follows:

(13) | |||||

(14) |

Our next result shows that GP-UCRL, with this choice of confidence sets, attains sublinear regret even in this agnostic setting.

###### Theorem 2 (Frequentist regret bound for GP-UCRL)

Let , be compact, be defined as in Theorem 1 and be symmetric psd kernels with bounded variance. Let , be members of the RKHS of real-valued functions on with kernel and respectively, with corresponding RKHS norms bounded by and . Further, let the noise variables be predictably and -sub-Gaussian (12), respectively. For any , GP-UCRL, with confidence sets (13), (14), enjoys, with probability at least , the regret bound

##### Regret Bounds of -PSRL in the kernelized setup.

With these (high probability) regret bounds of GP-UCRL in hand, we can obtain a bound on the Bayes regret, defined as the expected regret under the prior distribution , of -PSRL using techniques similar to those of Russo and Van Roy (2014); Osband and Van Roy (2016).

###### Theorem 3 (Bayes regret of PSRL)

Let the assumptions of Theorem 2 hold, with . Let be the (known) distribution of and be the global Lipschitz constant for the future value function. Then, the Bayes regret of -PSRL satisfies

Moreover, if is specified by GP priors as in Theorem 1, then the Bayes regret of GP-PSRL satisfies

where and .

Remark. -PSRL obtains a regret bound of , where measures the MDP connectedness. This is directly comparable to the bound derived by Osband and Van Roy (2014a) (see Corollary therein), where they consider a bounded function class assumption over the mean reward and transition functions and additive sub-Gaussian noise. In fact, we see that our definition of maximum information gain is comparable to the Kolmogorov and Eluder dimensions defined there, and all three are measures of complexity of the corresponding function classes. Though their results hold for more general function classes, we emphasize that our bounds cannot be deduced from theirs and thus require a separate analysis.

To the best of our knowledge, Theorem 1 and Theorem 3 are the first Bayesian regret bounds for UCRL and PSRL respectively, whereas Theorem 2 is the first frequentist regret bound of UCRL, in the kernel MDP setting. We see that both algorithms achieve similar regret bounds in terms of dependencies on time, MDP connectedness and maximum information gain. However, GP-UCRL has stronger probabilistic guarantees than -PSRL since its bounds hold with high probability for any MDP and not just in expectation over the draw from the prior distribution.

## 5 Multi-dimensional State Spaces

In this section we extend our results to the case when . Here the transition dynamics takes the form , where the mean transition function takes values in , and is a noise vector. Now, similar to Berkenkamp et al. (2017), we model a scalar-valued function , where , in order to lift the standard scalar-valued GP output model to multiple dimensions and express state transitions. In this case, our algorithms use as the prior over , with a kernel defined over , and also assume multi-variate Gaussian noise . We can define and