Breaking Bellman’s Curse of Dimensionality: Efficient Kernel Gradient Temporal Difference

Alec Koppel
Department of Electrical and Systems Engineering
University of Pennsylvania
Philadelphia, PA 19104, USA

Garrett Warnell
Computational and Information Sciences Directorate
U.S. Army Research Laboratory
Adelphi, MD 20783, USA

Ethan Stump
Computational and Information Sciences Directorate
U.S. Army Research Laboratory
Adelphi, MD 20783, USA

Peter Stone
Department of Computer Science
University of Texas at Austin
2317 Speedway, Austin, TX 78712

Alejandro Ribeiro
Department of Electrical and Systems Engineering
University of Pennsylvania
Philadelphia, PA 19104, USA

We consider policy evaluation in infinite-horizon discounted Markov decision problems (MDPs) with infinite spaces. We reformulate this task as a compositional stochastic program with a function-valued decision variable that belongs to a reproducing kernel Hilbert space (RKHS). We approach this problem via a new functional generalization of stochastic quasi-gradient methods operating in tandem with stochastic sparse subspace projections. The result is an extension of gradient temporal difference learning that yields nonlinearly parameterized value function estimates of the solution to the Bellman evaluation equation. Our main contribution is a memory-efficient non-parametric stochastic method guaranteed to converge exactly to the Bellman fixed point with probability 1 under attenuating step-sizes. Further, with constant step-sizes, we obtain mean convergence to a neighborhood and show that the value function estimates have finite complexity. In the Mountain Car domain, we observe faster convergence to lower Bellman error solutions than existing approaches with a fraction of the required memory.


1 Policy Evaluation in Markov Decision Processes

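We consider an autonomous agent acting in an environment defined by a Markov decision process (MDP) (Sutton and Barto, 1998) with continuous spaces, a setting increasingly relevant to emerging technologies such as robotics (Kober et al., 2013), power systems (Scott et al., 2014), and others. An MDP is a quintuple $(\mathcal{X}, \mathcal{A}, \mathbb{P}, r, \gamma)$, where $\mathbb{P}$ is the action-dependent transition probability of the process: when the agent starts in state $\mathbf{x}_t \in \mathcal{X}$ at time $t$ and takes an action $\mathbf{a}_t \in \mathcal{A}$, the next state $\mathbf{y}_t \in \mathcal{X}$ is distributed according to $\mathbf{y}_t \sim \mathbb{P}(\cdot \mid \mathbf{x}_t, \mathbf{a}_t)$. After the agent transitions to a particular $\mathbf{y}_t$, the MDP provides it an instantaneous reward $r(\mathbf{x}_t, \mathbf{a}_t, \mathbf{y}_t)$, where the reward function is a map $r: \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}$.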

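We focus on the problem of policy evaluation: control decisions $\mathbf{a}_t$ are chosen according to a fixed stationary stochastic policy $\pi: \mathcal{X} \to \rho(\mathcal{A})$, where $\rho(\mathcal{A})$ denotes the set of probability distributions over $\mathcal{A}$. Policy evaluation underlies methods that seek optimal policies through repeated evaluation and improvement (Lagoudakis and Parr, 2003). (In MDPs more generally, we choose actions $\{\mathbf{a}_t\}_{t=0}^{\infty}$ to maximize the reward accumulation starting from state $\mathbf{x}$; for fixed $\pi$, this simplifies to (1).) In policy evaluation, we seek to compute the value of a policy when starting in state $\mathbf{x}$, quantified by the discounted expected sum of rewards, or value function $V^\pi(\mathbf{x})$:

$$V^\pi(\mathbf{x}) = \mathbb{E}_{\mathbf{y}}\Big[\sum_{t=0}^{\infty} \gamma^t r(\mathbf{x}_t, \mathbf{a}_t, \mathbf{y}_t) \,\Big|\, \mathbf{x}_0 = \mathbf{x}, \{\mathbf{a}_t = \pi(\mathbf{x}_t)\}_{t=0}^{\infty}\Big]. \tag{1}$$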

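For a single trajectory through the state space $\mathcal{X}$, $\mathbf{y}_t = \mathbf{x}_{t+1}$. The value function (1) is parameterized by a discount factor $\gamma \in (0,1)$, which determines the agent’s farsightedness. Decomposing the summand in (1) into its first and subsequent terms, and using both the stationarity of the transition probability and the Markov property, yields the Bellman evaluation equation (Bellman, 1957):

$$V^\pi(\mathbf{x}) = \int_{\mathcal{X}} \big[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V^\pi(\mathbf{y})\big]\, \mathbb{P}(d\mathbf{y} \mid \mathbf{x}, \pi(\mathbf{x})) \quad \text{for all } \mathbf{x} \in \mathcal{X}. \tag{2}$$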

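The right-hand side of (2) defines a Bellman evaluation operator $\mathscr{B}^\pi: \mathcal{B}(\mathcal{X}) \to \mathcal{B}(\mathcal{X})$ over $\mathcal{B}(\mathcal{X})$, the space of bounded continuous value functions $V: \mathcal{X} \to \mathbb{R}$:

$$(\mathscr{B}^\pi V)(\mathbf{x}) = \int_{\mathcal{X}} \big[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y})\big]\, \mathbb{P}(d\mathbf{y} \mid \mathbf{x}, \pi(\mathbf{x})). \tag{3}$$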

Bertsekas and Shreve (1978, Proposition 4.2(b)) establish that the fixed point of (3) is $V^\pi$, i.e., $(\mathscr{B}^\pi V^\pi)(\mathbf{x}) = V^\pi(\mathbf{x})$. As a stepping stone to finding optimal policies in infinite MDPs, we seek here to find the fixed point of (3). Specifically, the goal of this work is stable value function estimation in infinite MDPs, with nonlinear parameterizations that are allowed to be infinite, but are nonetheless memory-efficient.

Challenges To solve (3), fixed point methods, i.e., value iteration ($V_{k+1} = \mathscr{B}^\pi V_k$), have been proposed (Bertsekas and Shreve, 1978), but they only apply when the value function can be represented by a vector whose length is defined by the number of states and the state space is small enough that the expectation in $\mathscr{B}^\pi$ can be computed. (The integral in (2) defines a conditional expectation: $V^\pi(\mathbf{x}) = \mathbb{E}_{\mathbf{y}}[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V^\pi(\mathbf{y}) \mid \mathbf{x}, \pi(\mathbf{x})]$.) For large spaces, stochastic approximations of value iteration, i.e., temporal difference (TD) learning (Sutton, 1988), have been utilized to circumvent this intractable expectation. Incremental methods (least-squares TD) provide an alternative when $V$ has a finite linear parameterization (Bradtke and Barto, 1996), but their extensions to infinite representations require infinite memory (Powell and Ma, 2011) or elude stability (Xu et al., 2005).
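For intuition, the finite-state case in which value iteration does apply can be sketched in a few lines. The example below is illustrative only and not taken from the paper: the three-state chain, the policy-induced transition matrix `P`, the reward table `r`, and the discount `gamma` are all hypothetical. It repeatedly applies the Bellman evaluation operator until the iterates stop changing.

```python
import numpy as np

def bellman_operator(V, P, r, gamma):
    """(B V)(x) = sum_y P[x, y] * (r[x, y] + gamma * V[y]) for a fixed policy."""
    return (P * (r + gamma * V[None, :])).sum(axis=1)

def evaluate_policy(P, r, gamma, tol=1e-10, max_iter=10_000):
    """Value iteration for policy evaluation: V_{k+1} = B V_k."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        V_next = bellman_operator(V, P, r, gamma)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V

# Hypothetical 3-state chain under a fixed policy; state 2 is an absorbing goal.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = -np.ones((3, 3))
r[:, 2] = 0.0            # reward is -1 unless the next state is the goal
gamma = 0.9
V = evaluate_policy(P, r, gamma)
```

The loop stores one number per state and sums over every next state, which is exactly what becomes infeasible once the state space is continuous.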

Solving the fixed point problem defined by (3) requires surmounting the fact that this expression is defined for each $\mathbf{x} \in \mathcal{X}$, which for continuous $\mathcal{X}$ has infinitely many unknowns. This phenomenon is one example of Bellman’s curse of dimensionality (Bellman, 1957), and it is frequently sidestepped by parameterizing the value function using a finite linear (Tsitsiklis and Van Roy, 1997; Melo et al., 2008) or nonlinear (Bhatnagar et al., 2009) basis expansion. Such methods have paved the way for the recent success of neural networks in value function-based approaches to MDPs (Mnih et al., 2013), but combining TD learning with different parameterizations may cause divergence (Baird, 1995; Tsitsiklis and Van Roy, 1997): in general, the representation must be tied to the stochastic update (Jong and Stone, 2007) to ensure both the parameterization and the stochastic process are stable.

Contributions Our main result is a memory-efficient, non-parametric, stochastic method that converges to the Bellman fixed point almost surely when it belongs to a reproducing kernel Hilbert space (RKHS). Our approach is to reformulate (2) as a compositional stochastic program (Section 2), a topic studied in operations research (Shapiro et al., 2014) and probability (Korostelev, 1984; Konda and Tsitsiklis, 2004). These problems motivate stochastic quasi-gradient (SQG) methods which use two time-scale stochastic approximation to mitigate the fact that the objective’s stochastic gradient is biased with respect to its average (Ermoliev, 1983). Here, we use SQG for policy evaluation in infinite MDPs (finite MDPs addressed in (Bhatnagar et al., 2009; Sutton et al., 2009)).

In (2), the decision variable is a continuous function, which we address by hypothesizing the Bellman fixed point belongs to a RKHS (Kimeldorf and Wahba, 1971; Slavakis et al., 2013). However, a function in a RKHS has comparable complexity to the number of training samples processed, which could be infinite (an issue ignored in many kernel methods for MDPs (Ormoneit and Sen, 2002; Xu et al., 2005; Taylor and Parr, 2009; Powell and Ma, 2011; Grünewälder et al., 2012; Farahmand et al., 2016; Dai et al., 2016)). We will tackle this memory bottleneck by requiring memory efficiency in both the function sample path and in its limit.

To find a memory-efficient sample path in the function space, we generalize SQG to RKHSs (Section 3), and combine this generalization with greedily-constructed sparse subspace projections (Section 3.1). These subspaces are constructed via matching pursuit (Pati et al., 1993; Lever et al., 2016), a procedure motivated by the facts that (a) kernel matrices induced by arbitrary data streams likely violate requirements for convex-relaxation-based sparsity (Candes, 2008), and (b) parsimony is more important than exact recovery since SQG iterates are not the target signal but rather points along the convergence path to the Bellman fixed point. Rather than unsupervised forgetting (Engel et al., 2003), we tie the projection-induced error to stochastic descent (Koppel et al., 2016), which keeps only those dictionary points needed for convergence (Section 4).

As a result, we conduct functional SQG descent via sparse projections of the SQG. This maintains a moderate-complexity sample path exactly towards $V^*$, the solution of the regularized problem (8), which may be made arbitrarily close to the Bellman fixed point by decreasing the regularizer. By generalizing the relationship between SQG and supermartingales in (Wang et al., 2017) to Hilbert spaces, we establish that the sparse projected SQG sequence converges almost surely to the Bellman fixed point with decreasing learning rates, and converges in mean while maintaining finite complexity when constant learning rates are used (Section 4).

2 Policy Evaluation as Compositional Stochastic Programming

We turn to reformulating the functional fixed point problem (3) defined by Bellman’s equation so that it may be identified with a nested stochastic program. We note that the resulting domain of this problem is intractable, and address this by hypothesizing that the Bellman fixed point belongs to a RKHS, which, in turn, requires the introduction of regularization.

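We proceed with reformulating (3): subtract the value function that satisfies the fixed point relation from both sides, and then pull it inside the expectation:

$$0 = \mathbb{E}_{\mathbf{y}}\big[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x}) \mid \mathbf{x}, \pi(\mathbf{x})\big]. \tag{4}$$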

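Value functions satisfying (4) are equivalent to those which satisfy the quadratic expression $\tfrac{1}{2}\big(\mathbb{E}_{\mathbf{y}}[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x}) \mid \mathbf{x}, \pi(\mathbf{x})]\big)^2 = 0$, which is null for all $\mathbf{x} \in \mathcal{X}$. Solving this expression for every $\mathbf{x}$ may be achieved by considering it in an initialization-independent manner. That is, integrating out $\mathbf{x}$, the starting point of the trajectory defining the value function (1), as well as the policy $\pi(\mathbf{x})$, yields the compositional stochastic program:

$$V^\pi = \arg\min_{V \in \mathcal{B}(\mathcal{X})} \mathbb{E}_{\mathbf{x}, \pi(\mathbf{x})}\Big\{\tfrac{1}{2}\big(\mathbb{E}_{\mathbf{y}}[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x}) \mid \mathbf{x}, \pi(\mathbf{x})]\big)^2\Big\}, \tag{5}$$

whose solutions coincide exactly with the fixed points of (3).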

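(5) defines a functional optimization problem which is intractable when we search over all bounded continuous functions $\mathcal{B}(\mathcal{X})$. However, when we restrict $\mathcal{B}(\mathcal{X})$ to a Hilbert space $\mathcal{H}$ equipped with a unique reproducing kernel, i.e., an inner product-like map $\kappa: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$$\text{(i)}\ \ \langle V, \kappa(\mathbf{x}, \cdot)\rangle_{\mathcal{H}} = V(\mathbf{x}) \ \text{ for all } \mathbf{x} \in \mathcal{X}, \qquad \text{(ii)}\ \ \mathcal{H} = \overline{\mathrm{span}\{\kappa(\mathbf{x}, \cdot)\}} \ \text{ for all } \mathbf{x} \in \mathcal{X}, \tag{6}$$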

we may apply the Representer Theorem to transform the functional problem (5) into a parametric one (Kimeldorf and Wahba, 1971; Schölkopf et al., 2001; Norkin and Keyzer, 2009). In an RKHS, the optimal function of (5) then takes the form

$$V(\mathbf{x}) = \sum_{n=1}^{N} w_n \kappa(\mathbf{x}_n, \mathbf{x}), \tag{7}$$

where $\mathbf{x}_n$ is a realization of the random variable $\mathbf{x}$. Thus, $V$ is an expansion of kernel evaluations only at training samples. We refer to the upper summand index $N$ in (7) in the kernel expansion of $V$ as the model order, which here coincides with the training sample size. Common kernel choices are polynomials and radial basis (Gaussian) functions, i.e., $\kappa(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + b)^c$ and $\kappa(\mathbf{x}, \mathbf{x}') = \exp\{-\|\mathbf{x} - \mathbf{x}'\|_2^2 / 2c^2\}$, respectively. In (6), property (i) is called the reproducing property, which follows from the Riesz Representation Theorem (Wheeden et al., 1977). Replacing $V$ by $\kappa(\mathbf{x}', \cdot)$ in (6)(i) yields the expression $\langle \kappa(\mathbf{x}', \cdot), \kappa(\mathbf{x}, \cdot)\rangle_{\mathcal{H}} = \kappa(\mathbf{x}, \mathbf{x}')$, the origin of the term “reproducing kernel.” Moreover, property (6)(ii) states that functions $V \in \mathcal{H}$ admit a basis expansion in terms of kernel evaluations (7). Function spaces of this type are referred to as reproducing kernel Hilbert spaces (RKHSs). If the kernel is universal (Micchelli et al., 2006), e.g., a Gaussian, a continuous function over a compact set may be approximated uniformly by one in an RKHS.
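The kernel expansion described above can be made concrete with a short NumPy sketch. The function names, the parameter names `b` and `c`, and the random samples and weights are assumptions made for illustration; only the general polynomial and Gaussian kernel forms come from the text.

```python
import numpy as np

def polynomial_kernel(X, Y, b=1.0, c=2):
    """Gram matrix with entries (x^T y + b)^c for rows x of X and y of Y."""
    return (np.atleast_2d(X) @ np.atleast_2d(Y).T + b) ** c

def gaussian_kernel(X, Y, c=1.0):
    """Gram matrix with entries exp(-||x - y||^2 / (2 c^2))."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * c**2))

def evaluate_expansion(x, samples, weights, kernel=gaussian_kernel):
    """V(x) = sum_n weights[n] * kernel(samples[n], x); model order = len(weights)."""
    return (kernel(np.atleast_2d(x), samples) @ weights).item()

rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(5, 2))  # N = 5 hypothetical training states
weights = rng.normal(size=5)                   # hypothetical coefficients w_n
value = evaluate_expansion(np.zeros(2), samples, weights)
```

Storing the function means storing `samples` and `weights`, so the memory cost is proportional to the model order.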

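Subsequently, we seek to solve (5) with the restriction that $V \in \mathcal{H}$, and independent and identically distributed samples $(\mathbf{x}_t, \pi(\mathbf{x}_t), \mathbf{y}_t)$ from the triple $(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y})$ are sequentially available, yielding

$$V^* = \arg\min_{V \in \mathcal{H}} \mathbb{E}_{\mathbf{x}, \pi(\mathbf{x})}\Big\{\tfrac{1}{2}\big(\mathbb{E}_{\mathbf{y}}[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x}) \mid \mathbf{x}, \pi(\mathbf{x})]\big)^2\Big\} + \frac{\lambda}{2}\|V\|_{\mathcal{H}}^2. \tag{8}$$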

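Hereafter, define $L(V) := \mathbb{E}_{\mathbf{x}, \pi(\mathbf{x})}\big\{\tfrac{1}{2}\big(\mathbb{E}_{\mathbf{y}}[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x}) \mid \mathbf{x}, \pi(\mathbf{x})]\big)^2\big\}$ and $J(V) := L(V) + (\lambda/2)\|V\|_{\mathcal{H}}^2$. The regularization term $(\lambda/2)\|V\|_{\mathcal{H}}^2$ in (8) is needed to apply the Representer Theorem (7) (Schölkopf et al., 2001). Thus, policy evaluation in infinite MDPs (8) is both a specialization of compositional stochastic programming (Wang et al., 2017) to an objective defined by dynamic programming, and a generalization to the case where the decision variable is not vector-valued but is instead a function.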

3 Functional Stochastic Quasi-Gradient Method

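To apply functional SQG to (8), we differentiate the compositional objective $L(V)$, which is of the form $L(V) = \mathbb{E}_{\mathbf{x}, \pi(\mathbf{x})}\big[f\big(\mathbb{E}_{\mathbf{y}}[g(V; \mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) \mid \mathbf{x}, \pi(\mathbf{x})]\big)\big]$, with $f(u) = u^2/2$ and $g(V; \mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) = r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x})$, and then consider its stochastic estimate. Consider the Fréchet derivative of $L(V)$:

$$\begin{aligned} \nabla_V L(V) &= \mathbb{E}_{\mathbf{x}, \pi(\mathbf{x})}\Big\{\nabla_V \tfrac{1}{2}\big(\mathbb{E}_{\mathbf{y}}[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x}) \mid \mathbf{x}, \pi(\mathbf{x})]\big)^2\Big\} \\ &= \mathbb{E}_{\mathbf{x}, \pi(\mathbf{x})}\Big\{\mathbb{E}_{\mathbf{y}}[\gamma\kappa(\mathbf{y}, \cdot) - \kappa(\mathbf{x}, \cdot) \mid \mathbf{x}, \pi(\mathbf{x})]\ \mathbb{E}_{\mathbf{y}}[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x}) \mid \mathbf{x}, \pi(\mathbf{x})]\Big\}. \end{aligned} \tag{9}$$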

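On the first line, we pull the differential operator inside the expectation, and on the second line we make use of the chain rule and the reproducing property of the kernel (6)(i). For future reference, we define the expression $\bar{\delta} := \mathbb{E}_{\mathbf{y}}[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x}) \mid \mathbf{x}, \pi(\mathbf{x})]$ as the average temporal difference (Sutton, 1988). To perform stochastic descent in the function space $\mathcal{H}$, we need a stochastic approximate of (9) evaluated at a state-action-state triple $(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y})$, which together with the regularizer yields

$$\hat{\nabla}_V J(V; \mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) = \big[\gamma\kappa(\mathbf{y}, \cdot) - \kappa(\mathbf{x}, \cdot)\big]\big[r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x})\big] + \lambda V, \tag{10}$$

where the second bracketed factor, $\delta := r(\mathbf{x}, \pi(\mathbf{x}), \mathbf{y}) + \gamma V(\mathbf{y}) - V(\mathbf{x})$, is defined as the (instantaneous) temporal difference. Observe that we cannot obtain unbiased samples of $\nabla_V L(V)$ because the terms inside the inner expectations in (9) are dependent, a problem first identified in (Sutton et al., 2009) for finite MDPs. Therefore, we require a method that constructs a coupled stochastic descent procedure by considering noisy estimates of both terms in the product-of-expectations expression in (9).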

Because the first term in (10) is a difference of kernel maps, building up its total expectation will, in the limit, be of infinite complexity (Kivinen et al., 2004). Thus, we propose instead to construct a sequence based on samples of the second term. That is, based on realizations of $\delta$, we consider a fixed point recursion that builds up an estimate of $\bar{\delta}$ by defining a scalar sequence $z_t$ as

$$z_{t+1} = (1 - \beta_t) z_t + \beta_t \delta_t, \tag{11}$$

where we define $\delta_t := r(\mathbf{x}_t, \pi(\mathbf{x}_t), \mathbf{y}_t) + \gamma V_t(\mathbf{y}_t) - V_t(\mathbf{x}_t)$ (Sutton, 1988) as the temporal difference at time $t$ in (11). Thus, (11) approximately averages the temporal difference sequence $\delta_t$: $z_t$ estimates $\bar{\delta}$, and $\beta_t \in (0, 1)$ is a learning rate.
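The averaging behaviour of the auxiliary sequence can be checked numerically. The sketch below assumes the recursion takes the convex-combination form $z \leftarrow (1-\beta) z + \beta\delta$ and, as a simplification, feeds it temporal differences drawn from a fixed hypothetical distribution (in the actual method they depend on the current value function estimate).

```python
import numpy as np

def update_auxiliary(z, delta, beta):
    """One step of the averaging recursion: z <- (1 - beta) * z + beta * delta."""
    return (1.0 - beta) * z + beta * delta

rng = np.random.default_rng(0)
# Hypothetical noisy temporal differences with mean -1 (an assumption for this demo).
deltas = -1.0 + 0.5 * rng.standard_normal(2000)

# With beta_t = 1 / (t + 1) the recursion reproduces the running sample mean.
z_mean = 0.0
for t, delta in enumerate(deltas):
    z_mean = update_auxiliary(z_mean, delta, beta=1.0 / (t + 1))

# With a constant beta it is an exponentially weighted average that keeps adapting.
z_const = 0.0
for delta in deltas:
    z_const = update_auxiliary(z_const, delta, beta=0.05)
```

A diminishing rate averages out the noise completely, while a constant rate trades some residual variance for the ability to track a drifting target.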

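To define a stochastic descent step, we replace the first term inside the outer expectation in (9) with its instantaneous approximate, i.e., $\gamma\kappa(\mathbf{y}_t, \cdot) - \kappa(\mathbf{x}_t, \cdot)$, evaluated at a sample triple $(\mathbf{x}_t, \pi(\mathbf{x}_t), \mathbf{y}_t)$, which yields the stochastic quasi-gradient step (Ermoliev, 1983; Wang et al., 2017)

$$V_{t+1} = (1 - \alpha_t\lambda) V_t - \alpha_t\big(\gamma\kappa(\mathbf{y}_t, \cdot) - \kappa(\mathbf{x}_t, \cdot)\big) z_{t+1}, \tag{12}$$

where the coefficient $(1 - \alpha_t\lambda)$ comes from the regularizer, and $\alpha_t$ is a positive scalar learning rate. This update is a stochastic quasi-gradient step because the true stochastic gradient of $J(V)$ is $\big(\gamma\kappa(\mathbf{y}_t, \cdot) - \kappa(\mathbf{x}_t, \cdot)\big)\delta_t + \lambda V_t$, but this estimator is biased with respect to its average since the terms in this product are correlated. By replacing $\delta_t$ with the auxiliary variable $z_{t+1}$, this issue may be circumvented in the construction of coupled supermartingales (Section 4).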

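Kernel Parameterization Suppose $V_0 = 0$. Then the update in (12) at time $t$, making use of the Representer Theorem (7), implies the function $V_t$ is a kernel expansion of past states $\{(\mathbf{x}_u, \mathbf{y}_u)\}_{u<t}$ as

$$V_t(\mathbf{x}) = \sum_{n=1}^{2t} w_n \kappa(\mathbf{v}_n, \mathbf{x}) = \mathbf{w}_t^\top \boldsymbol{\kappa}_{\mathbf{X}_t}(\mathbf{x}). \tag{13}$$

On the right-hand side of (13) we introduce the notation $\mathbf{v}_n$ for the $n$-th stored sample (alternating between the states $\mathbf{x}_u$ and the successor states $\mathbf{y}_u$), together with the dictionary $\mathbf{X}_t = [\mathbf{x}_0, \mathbf{y}_0, \ldots, \mathbf{x}_{t-1}, \mathbf{y}_{t-1}]$ and the empirical kernel map $\boldsymbol{\kappa}_{\mathbf{X}_t}(\cdot) = [\kappa(\mathbf{x}_0, \cdot), \kappa(\mathbf{y}_0, \cdot), \ldots, \kappa(\mathbf{x}_{t-1}, \cdot), \kappa(\mathbf{y}_{t-1}, \cdot)]^\top$. The kernel expansion in (13), together with the functional update (12), yields the fact that functional SQG in $\mathcal{H}$ amounts to the following updates on the kernel dictionary $\mathbf{X}$ and coefficient vector $\mathbf{w}$:

$$\mathbf{X}_{t+1} = [\mathbf{X}_t,\ \mathbf{x}_t,\ \mathbf{y}_t], \qquad \mathbf{w}_{t+1} = [(1 - \alpha_t\lambda)\mathbf{w}_t,\ \alpha_t z_{t+1},\ -\alpha_t\gamma z_{t+1}]. \tag{14}$$

Observe that this update causes $\mathbf{X}_{t+1}$ to have two more columns than $\mathbf{X}_t$. We define the model order as the number of data points $M_t$ in the dictionary at time $t$, which for functional stochastic quasi-gradient descent is $M_t = 2t$. Asymptotically, then, the complexity of storing $V_t$ is infinite.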
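The growth of the un-projected parameterization can be sketched directly. The code below is a minimal illustration under stated assumptions: a Gaussian kernel with a hypothetical bandwidth, synthetic transitions, a constant reward of -1, and constant step-sizes. Each step shrinks the existing weights by the regularization factor and appends the two new kernel elements with weights `alpha * z` and `-alpha * gamma * z`.

```python
import numpy as np

def kernel(a, b, bandwidth=0.5):
    """Gaussian kernel between two points (bandwidth is a hypothetical choice)."""
    diff = np.asarray(a) - np.asarray(b)
    return float(np.exp(-np.sum(diff**2) / (2 * bandwidth**2)))

def value(x, dictionary, weights):
    """V(x) as a kernel expansion over the stored dictionary points."""
    return sum(w * kernel(d, x) for d, w in zip(dictionary, weights))

def sqg_step(dictionary, weights, z, x, y, reward, alpha, beta, gamma, lam):
    """One un-projected functional SQG step; the dictionary grows by two points."""
    delta = reward + gamma * value(y, dictionary, weights) - value(x, dictionary, weights)
    z = (1 - beta) * z + beta * delta                       # auxiliary averaging
    weights = [(1 - alpha * lam) * w for w in weights] + [alpha * z, -alpha * gamma * z]
    return dictionary + [x, y], weights, z

rng = np.random.default_rng(0)
dictionary, weights, z = [], [], 0.0
T = 50
for t in range(T):
    x = rng.uniform(-1, 1, size=2)                 # synthetic state
    y = x + 0.1 * rng.standard_normal(2)           # synthetic successor state
    dictionary, weights, z = sqg_step(dictionary, weights, z, x, y, reward=-1.0,
                                      alpha=0.1, beta=0.5, gamma=0.9, lam=1e-3)
```

After `T` steps the dictionary holds `2 * T` points, and every evaluation of the value function costs a pass over all of them.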

3.1 Sparse Stochastic Subspace Projections

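Since the update (12) has complexity $\mathcal{O}(t)$ due to the parameterization induced by the RKHS (Kivinen et al., 2004; Koppel et al., 2016), it is impractical in settings with streaming data or arbitrarily large training sets. We address this issue by replacing the stochastic descent step (12) with an orthogonally projected variant (Koppel et al., 2016), where the projection is onto a low-dimensional functional subspace $\mathcal{H}_{\mathbf{D}_{t+1}}$ of $\mathcal{H}$, i.e.,

$$V_{t+1} = \arg\min_{V \in \mathcal{H}_{\mathbf{D}_{t+1}}} \Big\|V - \Big((1 - \alpha_t\lambda)V_t - \alpha_t\big(\gamma\kappa(\mathbf{y}_t, \cdot) - \kappa(\mathbf{x}_t, \cdot)\big) z_{t+1}\Big)\Big\|_{\mathcal{H}}^2 := \mathcal{P}_{\mathcal{H}_{\mathbf{D}_{t+1}}}\Big[(1 - \alpha_t\lambda)V_t - \alpha_t\big(\gamma\kappa(\mathbf{y}_t, \cdot) - \kappa(\mathbf{x}_t, \cdot)\big) z_{t+1}\Big], \tag{15}$$

where again $\alpha_t$ is a scalar step-size, and $\mathcal{H}_{\mathbf{D}_{t+1}} = \mathrm{span}\{\kappa(\mathbf{d}_n, \cdot)\}_{n=1}^{M_{t+1}}$ for some collection of sample instances $\{\mathbf{d}_n\} \subset \{\mathbf{x}_u, \mathbf{y}_u\}_{u \leq t}$. The interpretation of the un-projected functional SQG method (12) (Section 3) in terms of subspace projections is in Appendix A.1, motivating (15).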

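We proceed to describe the construction of these subspace projections. Consider subspaces $\mathcal{H}_{\mathbf{D}} \subseteq \mathcal{H}$ that consist of functions that can be represented using some dictionary $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_M]$, i.e., $\mathcal{H}_{\mathbf{D}} = \{V : V(\cdot) = \sum_{n=1}^{M} w_n \kappa(\mathbf{d}_n, \cdot) = \mathbf{w}^\top \boldsymbol{\kappa}_{\mathbf{D}}(\cdot)\}$. For convenience, we define $\boldsymbol{\kappa}_{\mathbf{D}}(\cdot) = [\kappa(\mathbf{d}_1, \cdot), \ldots, \kappa(\mathbf{d}_M, \cdot)]^\top$, and $\mathbf{K}_{\mathbf{D},\mathbf{D}}$ as the resulting kernel matrix from this dictionary. We enforce function parsimony by selecting dictionaries $\mathbf{D}_t$ whose model order satisfies $M_t \ll \mathcal{O}(t)$.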

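Coefficient update The update (15), for a fixed dictionary $\mathbf{D}_{t+1}$, may be expressed in terms of the parameter space of coefficients only. To do so, first define the stochastic quasi-gradient update without projection, given function $V_t$ parameterized by dictionary $\mathbf{D}_t$ and coefficients $\mathbf{w}_t$, as

$$\tilde{V}_{t+1} = (1 - \alpha_t\lambda)V_t - \alpha_t\big(\gamma\kappa(\mathbf{y}_t, \cdot) - \kappa(\mathbf{x}_t, \cdot)\big) z_{t+1}. \tag{16}$$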

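This update may be represented using dictionary and weight vector

$$\tilde{\mathbf{D}}_{t+1} = [\mathbf{D}_t,\ \mathbf{x}_t,\ \mathbf{y}_t], \qquad \tilde{\mathbf{w}}_{t+1} = [(1 - \alpha_t\lambda)\mathbf{w}_t,\ \alpha_t z_{t+1},\ -\alpha_t\gamma z_{t+1}]. \tag{17}$$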

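Observe that $\tilde{\mathbf{D}}_{t+1}$ has $\tilde{M} = M_t + 2$ columns, which is the length of $\tilde{\mathbf{w}}_{t+1}$. For a fixed dictionary $\mathbf{D}_{t+1}$, the stochastic projection in (15) is a least-squares problem on the coefficient vector, i.e.,

$$\mathbf{w}_{t+1} = \mathbf{K}_{\mathbf{D}_{t+1}\mathbf{D}_{t+1}}^{-1} \mathbf{K}_{\mathbf{D}_{t+1}\tilde{\mathbf{D}}_{t+1}} \tilde{\mathbf{w}}_{t+1}, \tag{18}$$

where we define the cross-kernel matrix $\mathbf{K}_{\mathbf{D}_{t+1}\tilde{\mathbf{D}}_{t+1}}$ whose $(n, m)$-th entry is $\kappa(\mathbf{d}_n, \tilde{\mathbf{d}}_m)$. Kernel matrices $\mathbf{K}_{\tilde{\mathbf{D}}_{t+1}\tilde{\mathbf{D}}_{t+1}}$ and $\mathbf{K}_{\mathbf{D}_{t+1}\mathbf{D}_{t+1}}$ are similarly defined. Here $M_{t+1}$ is the number of columns in $\mathbf{D}_{t+1}$, while $\tilde{M} = M_t + 2$ is that of $\tilde{\mathbf{D}}_{t+1}$ [cf. (17)]. Appendix A.2 contains a derivation of (18). We now turn to selecting the dictionary $\mathbf{D}_{t+1}$ from the MDP trajectory $\{(\mathbf{x}_u, \pi(\mathbf{x}_u), \mathbf{y}_u)\}_{u \leq t}$.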
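The coefficient projection for a fixed dictionary can be sketched as a small linear solve. This is an illustrative sketch, not the authors' implementation: it assumes a Gaussian kernel with a hypothetical bandwidth, stores dictionary points as rows rather than columns, and adds a small `jitter` to the kernel matrix for numerical stability. The helper `hilbert_error` (a name introduced here) evaluates the RKHS norm of the difference between two kernel expansions.

```python
import numpy as np

def gram(A, B, bandwidth=0.5):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bandwidth**2))

def project_weights(D, D_tilde, w_tilde, jitter=1e-10):
    """Least-squares weights on D: solve K_DD w = K_{D, D_tilde} w_tilde."""
    K_dd = gram(D, D) + jitter * np.eye(len(D))
    return np.linalg.solve(K_dd, gram(D, D_tilde) @ w_tilde)

def hilbert_error(D, w, D_tilde, w_tilde):
    """RKHS norm of (sum_n w_n k(d_n, .)) - (sum_m w~_m k(d~_m, .))."""
    sq = (w @ gram(D, D) @ w
          - 2 * w @ gram(D, D_tilde) @ w_tilde
          + w_tilde @ gram(D_tilde, D_tilde) @ w_tilde)
    return float(np.sqrt(max(sq, 0.0)))

rng = np.random.default_rng(1)
D_tilde = rng.uniform(-1, 1, size=(6, 2))   # hypothetical un-projected dictionary
w_tilde = rng.normal(size=6)
D = D_tilde[:4]                             # keep only the first four points
w = project_weights(D, D_tilde, w_tilde)
err_projected = hilbert_error(D, w, D_tilde, w_tilde)
err_truncated = hilbert_error(D, w_tilde[:4], D_tilde, w_tilde)
```

Refitting the retained weights gives a smaller Hilbert-norm error than simply dropping the removed weights, which is why the projection is worth its linear solve.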

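  initialize $V_0(\cdot) = 0$, $\mathbf{D}_0 = []$, $\mathbf{w}_0 = []$, $z_0 = 0$, i.e. initial dict., coeffs., and aux. variable null
  for $t = 0, 1, 2, \ldots$ do
     Obtain trajectory realization $(\mathbf{x}_t, \pi(\mathbf{x}_t), \mathbf{y}_t)$
     Compute the temporal difference and update the auxiliary sequence $z_{t+1}$ [cf. (11)]: $\delta_t = r(\mathbf{x}_t, \pi(\mathbf{x}_t), \mathbf{y}_t) + \gamma V_t(\mathbf{y}_t) - V_t(\mathbf{x}_t)$, $z_{t+1} = (1 - \beta_t) z_t + \beta_t \delta_t$
     Compute unconstrained functional stochastic quasi-gradient step [cf. (12)]: $\tilde{V}_{t+1} = (1 - \alpha_t\lambda)V_t - \alpha_t(\gamma\kappa(\mathbf{y}_t, \cdot) - \kappa(\mathbf{x}_t, \cdot)) z_{t+1}$
     Revise dictionary $\tilde{\mathbf{D}}_{t+1} = [\mathbf{D}_t, \mathbf{x}_t, \mathbf{y}_t]$, weights $\tilde{\mathbf{w}}_{t+1} = [(1 - \alpha_t\lambda)\mathbf{w}_t, \alpha_t z_{t+1}, -\alpha_t\gamma z_{t+1}]$
     Obtain greedy compression of function parameterization via Algorithm 2: $(V_{t+1}, \mathbf{D}_{t+1}, \mathbf{w}_{t+1}) = \mathrm{KOMP}(\tilde{V}_{t+1}, \tilde{\mathbf{D}}_{t+1}, \tilde{\mathbf{w}}_{t+1}, \epsilon_t)$
  end for
Algorithm 1 PKGTD: Parsimonious Kernel Gradient Temporal Difference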

Dictionary Update We select the kernel dictionary $\mathbf{D}_{t+1}$ via greedy compression, a topic studied in compressive sensing (Needell et al., 2008). The function $\tilde{V}_{t+1}$ defined by the SQG method without projection (16) is parameterized by dictionary $\tilde{\mathbf{D}}_{t+1}$ [cf. (17)]. We form $\mathbf{D}_{t+1}$ by selecting a subset of $M_{t+1}$ columns from $\tilde{\mathbf{D}}_{t+1}$ that best approximate $\tilde{V}_{t+1}$ in terms of Hilbert norm error. To accomplish this, we use kernel orthogonal matching pursuit (KOMP) (Vincent and Bengio, 2002) with error tolerance $\epsilon_t$ to find a dictionary $\mathbf{D}_{t+1}$ based on the one which adds the latest samples, $\tilde{\mathbf{D}}_{t+1}$. We tune $\epsilon_t$ to ensure both stochastic descent (Lemma 6(ii)) and finite model order (Corollary 4).

With respect to the KOMP procedure above, we specifically use a variant called destructive KOMP with pre-fitting (Vincent and Bengio, 2002, Section 2.3); see Appendix A.3, Algorithm 2. This flavor of KOMP takes as an input a candidate function $\tilde{V}$ of model order $\tilde{M}$ parameterized by its dictionary $\tilde{\mathbf{D}}$ and coefficients $\tilde{\mathbf{w}}$. The method then approximates $\tilde{V}$ by a function $V$ with a lower model order. Initially, the candidate is the original, $V = \tilde{V}$, so that its dictionary is initialized with $\mathbf{D} = \tilde{\mathbf{D}}$, with coefficients $\mathbf{w} = \tilde{\mathbf{w}}$. Then, we sequentially and greedily remove model points from the initial dictionary $\tilde{\mathbf{D}}$ until removing one more would violate the threshold $\|V - \tilde{V}\|_{\mathcal{H}} \leq \epsilon_t$. The result is a sparse approximation $V$ of $\tilde{V}$.
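The greedy removal loop described above can be sketched as follows. This is a simplified illustration of destructive compression with refitting, not a transcription of the paper's Algorithm 2 (which lives in an appendix outside this section): the Gaussian kernel, its bandwidth, the `jitter` term, and all function names are assumptions. At each pass it tries removing every remaining point, refits the weights by least squares, and removes the point whose removal causes the smallest Hilbert-norm error, stopping once that error would exceed `eps`.

```python
import numpy as np

def gram(A, B, bandwidth=0.5):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bandwidth**2))

def fit_and_error(keep, D_tilde, w_tilde, jitter=1e-10):
    """Best weights on D_tilde[keep] and the Hilbert-norm error to the full function."""
    K_tt = gram(D_tilde, D_tilde)
    target_sq = w_tilde @ K_tt @ w_tilde
    if len(keep) == 0:
        return np.zeros(0), float(np.sqrt(max(target_sq, 0.0)))
    K_dd = K_tt[np.ix_(keep, keep)]
    K_dt = K_tt[keep, :]
    w = np.linalg.solve(K_dd + jitter * np.eye(len(keep)), K_dt @ w_tilde)
    err_sq = w @ K_dd @ w - 2 * w @ K_dt @ w_tilde + target_sq
    return w, float(np.sqrt(max(err_sq, 0.0)))

def destructive_compress(D_tilde, w_tilde, eps):
    """Greedily drop dictionary points while the approximation error stays <= eps."""
    keep = list(range(len(D_tilde)))
    w = np.asarray(w_tilde, dtype=float)
    while keep:
        trials = [(fit_and_error([k for k in keep if k != j], D_tilde, w_tilde), j)
                  for j in keep]
        (w_best, err_best), j_best = min(trials, key=lambda t: t[0][1])
        if err_best > eps:
            break                      # removing any further point violates the budget
        keep.remove(j_best)
        w = w_best
    return D_tilde[keep], w

# Hypothetical input with a duplicated point, which can be merged at no cost.
D_tilde = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
w_tilde = np.array([1.0, 0.5, 2.0])
D, w = destructive_compress(D_tilde, w_tilde, eps=1e-3)
```

In this toy input the two copies of the origin collapse into a single element carrying their combined weight, while the distinct point survives because dropping it would exceed the budget.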

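We summarize the proposed method, Parsimonious Kernel Gradient Temporal Difference (PKGTD), in Algorithm 1: we execute the stochastic projection of the functional SQG iterates onto sparse subspaces stated in (15). With the initial function null, $V_0 = 0$ (empty dictionary $\mathbf{D}_0 = []$ and coefficients $\mathbf{w}_0 = []$), at each step, given an i.i.d. sample $(\mathbf{x}_t, \pi(\mathbf{x}_t), \mathbf{y}_t)$ and step-sizes $\alpha_t, \beta_t$, we compute the unconstrained functional SQG iterate $\tilde{V}_{t+1}$ parameterized by $\tilde{\mathbf{D}}_{t+1}$ and $\tilde{\mathbf{w}}_{t+1}$ as stated in (17), which are fed into KOMP (Algorithm 2) with budget $\epsilon_t$, i.e., $(V_{t+1}, \mathbf{D}_{t+1}, \mathbf{w}_{t+1}) = \mathrm{KOMP}(\tilde{V}_{t+1}, \tilde{\mathbf{D}}_{t+1}, \tilde{\mathbf{w}}_{t+1}, \epsilon_t)$.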

4 Convergence Analysis

We now analyze the stability and memory requirements of Algorithm 1 developed in Section 3. Our approach is fundamentally different from stochastic fixed point methods such as TD learning, which are not descent techniques, and thus exhibit delicate convergence. The interplay between the Bellman operator contraction (Bertsekas and Shreve, 1978) and expectations prevents the construction of supermartingales underlying stochastic descent stability (Robbins and Monro, 1951). Attempts to mitigate this issue, such as those based on stochastic backward-differences (Kiefer et al., 1952), e.g., (Tsitsiklis, 1994; Jaakkola et al., 1994), or Lyapunov approaches (Borkar and Meyn, 2000), e.g., (Sutton et al., 2009), require the state space to be completely explored in the limit per step (intractable when $|\mathcal{X}| = \infty$), or stipulate that data dependent matrices be non-singular, respectively. Thus, there is a long-standing question of how to perform policy evaluation in MDPs under conditions applicable to practitioners while also guaranteeing stability. We provide an answer by connecting RKHS-valued stochastic quasi-gradient methods (Algorithm 1) with coupled supermartingale theory (Wang and Bertsekas, 2014).

Iterate Convergence Under the technical conditions stated at the outset of Appendix B, it is possible to derive the fact that the auxiliary variable $z_t$ and value function estimate $V_t$ satisfy supermartingale-type relationships, but their behavior is intrinsically coupled to one another. We generalize recently developed coupled supermartingale tools in (Wang and Bertsekas, 2014), i.e., Lemma 7 in Appendix B, to establish the following almost sure convergence result when the step-sizes and compression budget are diminishing.

Theorem 1

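Consider the sequences $\{z_t\}$ [cf. (11)] and $\{V_t\}$ [cf. (15)] as stated in Algorithm 1. Assume the regularizer is positive, $\lambda > 0$, Assumptions 1 - 3 hold, and the step-size conditions (19) hold. (One step-size sequence satisfying (19) is $\alpha_t = \mathcal{O}(t^{-(3/4+\zeta)})$, $\beta_t = \mathcal{O}(t^{-(1/2+\zeta)})$, where $\zeta > 0$ is an arbitrarily small constant so that the series $\sum_t \alpha_t$ and $\sum_t \beta_t$ diverge. Generally, satisfying (19) requires $\alpha_t = \mathcal{O}(t^{-p_\alpha})$, $\beta_t = \mathcal{O}(t^{-p_\beta})$, with $p_\alpha, p_\beta \in (1/2, 1]$ and $2p_\alpha - p_\beta > 1$.)

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \Big(\alpha_t^2 + \beta_t^2 + \frac{\alpha_t^2}{\beta_t}\Big) < \infty, \qquad \epsilon_t = \alpha_t^2. \tag{19}$$

Then $V_t \to V^*$ defined by (8) with probability 1, and thus achieves the regularized Bellman fixed point (4) restricted to the reproducing kernel Hilbert space.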

Proofs are given in Appendices B - C. Theorem 1 states that the value functions generated by Algorithm 1 converge almost surely to the optimal $V^*$ defined by (8). With the regularizer $\lambda$ made arbitrarily small but nonzero, using a universal kernel (e.g., a Gaussian), $V_t$ converges arbitrarily close to a function satisfying Bellman’s equation in infinite MDPs (3). This is the first guarantee with probability 1 for a true stochastic descent method with an infinitely and nonlinearly parameterized value function. Theorem 1 requires attenuating step-sizes such that the stochastic approximation error approaches null. In contrast, constant learning rates allow for the perpetual revision of the value function estimates without diminishing algorithm adaptivity, motivating the following result.

Theorem 2

Suppose Algorithm 1 is run with constant positive learning rates $\alpha_t = \alpha$ and $\beta_t = \beta$ and constant compression budget $\epsilon_t = \epsilon$ with sufficiently large regularization, i.e.


where is a scalar, and . Then, under Assumptions 1 - 3, the sub-optimality sequence converges in mean to a neighborhood:


Theorem 2 (proof in Appendix D) establishes that the value function estimates generated by Algorithm 1 converge in expectation to a neighborhood when the step-sizes $\alpha$ and $\beta$ and the sparsification budget $\epsilon$ in Algorithm 2 are small constants. In particular, the bias induced by sparsification does not cause instability even when it is not going to null. Moreover, this result only holds when the regularizer $\lambda$ is chosen large enough, which numerically induces a forgetting factor on past kernel dictionary weights (17). We may make the learning rates $\alpha$ and $\beta$ arbitrarily small, which yields a proportional decrease in the radius of convergence to a neighborhood of the Bellman fixed point (3).

Remark 3

(Aggressive Constant Learning Rates) In practice, one may obtain better performance by using larger constant step-sizes. To do so, the criterion (20) may be relaxed: we require but may be any positive scalar. Then, the radius of convergence is (see Appendix D)


The ratios and dominate (22) and must be made small to obtain accurate solutions.

Theorem 2 is the first constant learning rate result for nonparametric compositional stochastic programming of which we are aware, and allows for repeatedly revising the value function without the need for stochastic approximation error to approach null. With constant learning rates, the value function estimates have moderate complexity even in the worst case, as we detail next.

Model Order Control As noted in Section 3, the complexity of the functional stochastic quasi-gradient method in an RKHS is of order $\mathcal{O}(t)$, which grows without bound. To mitigate this issue, we develop the sparse subspace projection in Section 3.1. We formalize here that this projection does indeed limit the complexity of the value function when constant learning rates and compression budget are used. This result is a corollary, since it is an extension of Theorem 3 in (Koppel et al., 2016). To obtain this result, the reward function must be bounded (Assumption 4 in Appendix E).

Corollary 4

Denote $V_t$ as the value function sequence defined by Algorithm 1 with constant step-sizes $\alpha_t = \alpha$ and $\beta_t = \beta$, with compression budget $\epsilon_t = \epsilon$ and regularization parameter $\lambda$ as in Remark 3. Let $M_t$ be the model order of the value function $V_t$, i.e., the number of columns of the dictionary $\mathbf{D}_t$ which parameterizes $V_t$. Then there exists a finite upper bound $M^\infty$ such that, for all $t \geq 0$, the model order is always bounded as $M_t \leq M^\infty$. Consequently, the model order of the limiting function $V^\infty = \lim_t V_t$ is finite.

The results above establish that Algorithm 1 yields convergent behavior for the problem (8) in both diminishing and constant step-size regimes. With diminishing step-sizes [cf. (19)] and a diminishing compression budget $\epsilon_t$, we obtain exact convergence with probability 1 of the function sequence $\{V_t\}$ in the RKHS to the regularized fixed point of the Bellman evaluation equation (Theorem 1). This result holds for any positive regularizer $\lambda$, and thus the limit can be made arbitrarily close to the true Bellman fixed point [cf. (2)] by decreasing $\lambda$. However, an exact solution requires increasing the complexity of the function estimate such that its limiting memory becomes infinite. This drawback motivates us to consider the case where both the learning rates $\alpha_t$, $\beta_t$ and the compression budget $\epsilon_t$ are constant. Under specific selections (20), the algorithm converges to a neighborhood of the optimal value function, whose radius depends on the step-sizes and may be made small by decreasing them, at the cost of slower learning. Moreover, the use of constant step-sizes and compression budget with large enough regularization yields a value function parameterized by a dictionary whose model order is always bounded (Corollary 4).

5 Experiments

Figure 1: Experimental comparison of PKGTD to existing kernel methods for policy evaluation on the Mountain Car task. Test set error (left), and the parameterization complexity (center) vs. iterations. PKGTD learns fastest and most stably with the least complexity (best viewed in color). We plot the contour of the learned value function (right): its minimal value is in the valley, and states near the goal are close to null. Bold black dots are kernel dictionary elements, or retained instances.

Our experiments aim to compare PKGTD to other kernel-based policy evaluation techniques. Because it seeks memory-efficient solutions over an RKHS, we expect PKGTD to obtain accurate estimates of the value function using only a fraction of the memory required by the other methods. We perform experiments on the classical Mountain Car domain (Sutton and Barto, 1998): an agent applies discrete actions to a car that starts at the bottom of a valley and attempts to climb up to a goal at the top of one of the mountain sides. The state space is continuous, consisting of the car’s scalar position and velocity, i.e., $\mathcal{X} = \mathbb{R}^2$. The reward function $r(\mathbf{x}_t, \mathbf{a}_t, \mathbf{y}_t)$ is $-1$ unless $\mathbf{y}_t$ is the goal state at the mountain top, in which case it is $0$ and the episode terminates.

To obtain a benchmark policy for this task, we make use of trust region policy optimization (Schulman et al., 2015). To evaluate value function estimates, we form an offline training set of state transitions and associated rewards by running this policy through consecutive episodes until we have one training trajectory of 5000 steps, and then repeat this for 100 training trajectories to generate sample statistics. For ground truth, we generate one long trajectory of 10000 steps and randomly sample 2000 states from it. From each of these 2000 states, we apply the policy until episode termination and use the observed discounted return as $\hat{V}^\pi(\mathbf{x})$. Since our policy is deterministic, we only perform this procedure once per sampled state. For value function $V$, we define the percentage error metric: $\text{Percentage Error}(V) = \frac{1}{2000}\sum_{i=1}^{2000} \big|\big(V(\mathbf{x}_i) - \hat{V}^\pi(\mathbf{x}_i)\big) / \hat{V}^\pi(\mathbf{x}_i)\big|$.

We compare PKGTD with a Gaussian kernel to two other techniques for policy evaluation that also use kernel-based value function representations: (1) Gaussian process temporal difference (GPTD) (Engel et al., 2003), and (2) gradient temporal difference (GTD) (Sutton et al., 2009) using radial basis function (RBF) network features.

Figure 1 depicts the results of our experiment. We fix a kernel bandwidth across all techniques, and select parameter values that yield the best results for each method (Appendix F). For RBF feature generation, we use two fixed grids with different spacing. The first was one for which GTD yielded a value function estimate with percentage error similar to that which we obtained using PKGTD (RBF-49), and the second was one which yielded a number of basis functions that was similar to what PKGTD selected (RBF-25). Observe that GTD with fixed RBF features requires a much denser grid in order to reach the same Percentage Error as Algorithm 1. Moreover, PKGTD’s adaptive instance selection results in both faster initial learning and smaller error. Compared to GPTD, which chooses model points online according to a fixed linear-dependence criterion, PKGTD requires fewer model points and converges to a better estimate of the value function more quickly and stably.

6 Discussion

In this paper, we considered the problem of policy evaluation in infinite MDPs with value functions that belong to a RKHS. To solve this problem, we extended recent SQG methods for compositional stochastic programming to a RKHS, and used the result, combined with greedy sparse subspace projection, in a new policy-evaluation procedure called PKGTD (Algorithm 1). Under diminishing step sizes, PKGTD solves Bellman’s evaluation equation exactly under the hypothesis that its fixed point belongs to a RKHS (Theorem 1). Under constant step sizes, we can further guarantee finite-memory approximations (Corollary 4) that still exhibit mean convergence to a neighborhood of the optimal value function (Theorem 2). In our Mountain Car experiments, PKGTD yields excellent sample efficiency and model complexity, and therefore holds promise for large state space problems common in robotics where fixed state-action space tiling may prove impractical.


  • Anthony and Bartlett [2009] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
  • Baird [1995] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann, 1995.
  • Bellman [1957] Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition, 1957.
  • Bertsekas and Shreve [1978] Dimitri P Bertsekas and Steven E Shreve. Stochastic optimal control: The discrete time case, volume 23. Academic Press, 1978.
  • Bhatnagar et al. [2009] Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
  • Borkar and Meyn [2000] Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
  • Bradtke and Barto [1996] Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1-3):33–57, 1996.
  • Brezis [2010] Haim Brezis. Functional analysis, Sobolev spaces and partial differential equations. Springer Science & Business Media, 2010.
  • Candes [2008] Emmanuel J Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
  • Dai et al. [2016] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual kernel embeddings. arXiv preprint arXiv:1607.04579, 2016.
  • Engel et al. [2004] Y. Engel, S. Mannor, and R. Meir. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285, Aug 2004. ISSN 1053-587X. doi: 10.1109/TSP.2004.830985.
  • Engel et al. [2003] Yaakov Engel, Shie Mannor, and Ron Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proc. of the 20th International Conference on Machine Learning, 2003.
  • Ermoliev [1983] Yuri Ermoliev. Stochastic quasigradient methods and their application to system optimization. Stochastics: An International Journal of Probability and Stochastic Processes, 9(1-2):1–36, 1983.
  • Farahmand et al. [2016] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17(139):1–66, 2016.
  • Grünewälder et al. [2012] S Grünewälder, G Lever, L Baldassarre, M Pontil, and A Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, volume 1, pages 535–542, 2012.
  • Jaakkola et al. [1994] Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural computation, 6(6):1185–1201, 1994.
  • Jong and Stone [2007] Nicholas K Jong and Peter Stone. Model-based function approximation in reinforcement learning. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, page 95. ACM, 2007.
  • Kiefer et al. [1952] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
  • Kimeldorf and Wahba [1971] George Kimeldorf and Grace Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.
  • Kivinen et al. [2004] J. Kivinen, A. J. Smola, and R. C. Williamson. Online Learning with Kernels. IEEE Transactions on Signal Processing, 52:2165–2176, August 2004. doi: 10.1109/TSP.2004.830991.
  • Kober et al. [2013] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, page 0278364913495721, 2013.
  • Konda and Tsitsiklis [2004] Vijay R Konda and John N Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Annals of applied probability, pages 796–819, 2004.
  • Koppel et al. [2016] Alec Koppel, Garrett Warnell, Ethan Stump, and Alejandro Ribeiro. Parsimonious online learning with kernels via sparse projections in function space. arXiv preprint arXiv:1612.04111, 2016.
  • Korostelev [1984] A. Korostelev. Stochastic recurrent procedures: Local properties. Nauka: Moscow (in Russian), 1984.
  • Lagoudakis and Parr [2003] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.
  • Lever et al. [2016] Guy Lever, John Shawe-Taylor, Ronnie Stafford, and Csaba Szepesvari. Compressed conditional mean embeddings for model-based reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Melo et al. [2008] Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning, pages 664–671. ACM, 2008.
  • Micchelli et al. [2006] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Needell et al. [2008] Deanna Needell, Joel Tropp, and Roman Vershynin. Greedy signal recovery review. In Signals, Systems and Computers, 2008 42nd Asilomar Conference on, pages 1048–1050. IEEE, 2008.
  • Norkin and Keyzer [2009] Vladimir Norkin and Michiel Keyzer. On stochastic optimization and statistical learning in reproducing kernel Hilbert spaces by support vector machines (SVM). Informatica, 20(2):273–292, 2009.
  • Ormoneit and Sen [2002] Dirk Ormoneit and Śaunak Sen. Kernel-based reinforcement learning. Machine learning, 49(2-3):161–178, 2002.
  • Pati et al. [1993] Y. Pati, R. Rezaiifar, and P.S. Krishnaprasad. Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition. In Proceedings of the Asilomar Conference on Signals, Systems and Computers, 1993.
  • Powell and Ma [2011] Warren B Powell and Jun Ma. A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications, 9(3):336–352, 2011.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 09 1951. doi: 10.1214/aoms/1177729586.
  • Schölkopf et al. [2001] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.
  • Scott et al. [2014] Warren R Scott, Warren B Powell, and Somayeh Moazehi. Least squares policy iteration with instrumental variables vs. direct policy search: Comparison against optimal benchmarks using energy storage. arXiv preprint arXiv:1401.0843, 2014.
  • Shapiro et al. [2014] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory, volume 16. SIAM, 2014.
  • Slavakis et al. [2013] Konstantinos Slavakis, Pantelis Bouboulis, and Sergios Theodoridis. Online learning in reproducing kernel Hilbert spaces. Signal Processing Theory and Machine Learning, pages 883–987, 2013.
  • Sutton [1988] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • Sutton et al. [2009] Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. A convergent temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in neural information processing systems, pages 1609–1616, 2009.
  • Taylor and Parr [2009] Gavin Taylor and Ronald Parr. Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1017–1024. ACM, 2009.
  • Tsitsiklis [1994] John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
  • Tsitsiklis and Van Roy [1997] John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
  • Vincent and Bengio [2002] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48(1):165–187, 2002.
  • Wang and Bertsekas [2014] Mengdi Wang and Dimitri P Bertsekas. Incremental constraint projection-proximal methods for nonsmooth convex optimization. SIAM Journal on Optimization (to appear), 2014.
  • Wang et al. [2017] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.
  • Wheeden et al. [1977] R.L. Wheeden and A. Zygmund. Measure and Integral: An Introduction to Real Analysis. Chapman & Hall/CRC Pure and Applied Mathematics. Taylor & Francis, 1977. ISBN 9780824764999.
  • Xu et al. [2005] Xin Xu, Tao Xie, Dewen Hu, and Xicheng Lu. Kernel least-squares temporal difference learning. International Journal of Information Technology, 11(9):54–63, 2005.

Supplementary Material for
Breaking Bellman’s Curse of Dimensionality:
Efficient Kernel Gradient Temporal Difference

Appendix A Derivation of Parametric Updates for Algorithm 1

A.1 Functional Stochastic Quasi-Gradient Update and Orthogonal Projections

By selecting at each step, the sequence (12) may be interpreted as a sequence of orthogonal projections. To see this, rewrite (12) as the quadratic minimization


where the first equality in (A.1) comes from ignoring constant terms which vanish upon differentiation with respect to , and the second comes from observing that can be represented using only the points , using (14). Notice now that (A.1) expresses as the orthogonal projection of the update onto the subspace defined by dictionary .

Rather than select dictionary , we propose instead to select a different dictionary, , which is extracted from the data points observed thus far, at each iteration. The process by which we select is discussed in Section A.3, and is of dimension , with . As a result, the sequence differs from the functional stochastic quasi-gradient method presented in Section 3.

The function is parameterized by dictionary and weight vector . We denote columns of as for , where the time index is dropped for notational clarity but may be inferred from the context. We replace the update (A.1), in which the dictionary grows at each iteration, by the functional stochastic quasi-gradient sequence projected onto the subspace as


where we define the projection operator onto subspace by the update (A.1). This orthogonal projection is the modification of the functional SQG iterate [cf. (12)] defined at the beginning of this subsection (15). Next we discuss how this update amounts to modifications of the parametric updates (14) defined by functional SQG.
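The orthogonal projection of a kernel expansion onto the subspace spanned by a smaller dictionary can be illustrated numerically. The following is only a minimal sketch under stated assumptions: it uses a Gaussian kernel, and the names `gaussian_kernel`, `project_onto_dictionary`, `D`, `X_tilde`, and `w_tilde` are hypothetical choices for this example rather than the paper's notation. Minimizing the RKHS distance between a function with weights `w_tilde` on points `X_tilde` and a function supported on dictionary `D` gives the standard least-squares solution, in which the projected weights solve K_DD w = K_DX w_tilde; the small `jitter` term is a numerical convenience, not part of the method.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * bandwidth^2)).

    Rows of A and B are points; the kernel choice is an assumption of this sketch.
    """
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def project_onto_dictionary(D, X_tilde, w_tilde, bandwidth=1.0, jitter=1e-10):
    """Weights of the RKHS-orthogonal projection onto span{k(d_n, .)}.

    The input function is f_tilde = sum_j w_tilde[j] * k(X_tilde[j], .).
    The projected weights solve the normal equations K_DD w = K_DX w_tilde.
    """
    K_DD = gaussian_kernel(D, D, bandwidth)
    K_DX = gaussian_kernel(D, X_tilde, bandwidth)
    # jitter is a hypothetical safeguard against an ill-conditioned Gram matrix.
    return np.linalg.solve(K_DD + jitter * np.eye(len(D)), K_DX @ w_tilde)


# Unprojected function supported on four points (illustrative values only).
X_tilde = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
w_tilde = np.array([1.0, -0.5, 0.3, 2.0])

# Smaller dictionary: keep the first three points and drop the last one.
D = X_tilde[:3]
w = project_onto_dictionary(D, X_tilde, w_tilde)
```

By construction, the residual between the original and projected functions is orthogonal to every kernel element of the dictionary, and projecting onto a dictionary that already contains all support points leaves the weights unchanged up to numerical error.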

A.2 Coefficient Update Induced by Sparse Subspace Projections

We use the notation that is the sequence of projected quasi-FGSD iterates [cf. (15)] and is the update [cf. (16)] without projection in Section 3.1. The latter is parameterized by dictionary and weights (17). When the dictionary defining is assumed fixed, we may make use of the Representer Theorem to rewrite (A.1) in terms of kernel expansions, and note that the coefficient vector is the only free parameter to write