1 Introduction
###### Abstract

Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is Rowland et al. (2018)’s analysis of the C51 algorithm in terms of the Cramér distance, but their results only apply to the tabular setting and ignore C51’s use of a softmax to produce normalized distributions. In this paper we adapt the Cramér distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramér-based and can be combined to linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the model’s prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramér-based distributional methods may perform worse than directly approximating the value function.

Distributional reinforcement learning with linear function approximation

Marc G. Bellemare                        Nicolas Le Roux                        Pablo Samuel Castro                        Subhodeep Moitra Google Brain

## 1 Introduction

In reinforcement learning one often seeks to predict the expected sum of discounted rewards, also called return or value, of a given state. The distributional perspective on reinforcement learning takes this idea further by suggesting that we should predict the full distribution of this random return, called value distribution (Bellemare et al., 2017a). This has produced state-of-the-art performance on a number of deep reinforcement learning benchmarks (e.g. Hessel et al., 2018; Barth-Maron et al., 2018; Dabney et al., 2018a, b).

The original distributional algorithm from this line of work is Bellemare et al.’s C51 algorithm. Core to C51 are 1) the use of a softmax transfer function to represent the value distribution, 2) a heuristic projection step, and finally 3) the minimization of a Kullback-Leibler (KL) loss. Rowland et al. (2018) showed that the heuristic projection minimizes a probability metric called the Cramér distance. However, their work did not explain the role of the KL loss in the algorithm.

The combination of two losses (Cramér and KL) is less than ideal, and makes the learning process technically more challenging to implement than, e.g., the classic Q-Learning algorithm (Watkins, 1989). This combination also makes it difficult to provide theoretical guarantees, both in terms of convergence but also in the quality of the value distribution generated by an approximate learner.

A natural question is whether it is possible to do away with the softmax and KL loss, and derive a “100% Cramér” algorithm, both for simplicity and theoretical understanding. In this paper we seek an algorithm which directly minimizes the Cramér distance between the output of the model, for example a deep network, and a target distribution. As it turns out, we can construct such an algorithm by treating the model outputs as an improper probability distribution, and deriving a variant of the Cramér distance which gracefully handles such distributions.

This new algorithm enables us to derive theoretical guarantees on the behaviour of a distributional algorithm when combined to linear function approximation, in the policy evaluation setting. Although convergence is guaranteed under the usual conditions, our performance bound is worse than that of an algorithm which only approximates the value function. This suggests that predicting the full distribution as an intermediate step in estimating the expected value could hurt performance. As a whole, our results suggest that the good performance of C51 cannot solely be attributed to a better-behaved loss function.

## 2 Background

We consider an agent acting in an environment described by a finite Markov Decision Process (Puterman, 1994). In this paper we study the policy evaluation setting, in which we assume a fixed policy mapping states to distributions over actions and consider the resulting state to state transition function :

 Prπ(x′|x):=∑a∈Aπ(a|x)Pr(x′|x,a).

We view the reward function as a collection of random variables describing the bounded, random reward received when an agent exits a state . The value distribution (Bellemare et al., 2017a) describes the random return, or sum of discounted rewards, received when beginning in state :

 Zπ(x):=∞∑t=0γtR(Xt)X0=x,Xt+1∼Prπ(⋅|Xt).

The expectation of the value distribution corresponds to the familiar value function (Sutton & Barto, 1998). Similar to the value function satisfying the Bellman equation, satisfies the distributional Bellman equation with an equality in distribution:

 Zπ(x)D=R(x)+γZπ(X′)X′∼Prπ(⋅|x),

The distributional Bellman operator over value distributions is defined as

 TπZ(x)D:=R(x)+γPrπZ(x), (1)

where with some abuse of notation we write . The operator is a contraction mapping in the following sense: let be a metric between probability distributions on , and for two random variables denote by the application of to their distributions. We define the maximal metric between two value distributions , as

 ¯d(Z1,Z2):=supx∈Xd(Z1(x),Z2(x)).

Now, we say that is 1) sum invariant if for any random variable independent of and , and 2) scale sensitive of order if for all , (Bellemare et al., 2017b). For any metric which satisfies both of these conditions (with ), then is a contraction mapping with modulus in the maximal metric :

 ¯d(TπZ1,TπZ2)≤γβ¯d(Z1,Z2).

Under mild assumptions and as a consequence of Banach’s fixed point theorem, the process converges to in .

### 2.1 Metrics Over Distributions

Let and be two probability distributions. The Kullback-Leibler (KL) divergence of from is

 DKL(p,q)=∫∞−∞p(t)logp(t)q(t)dt.

Note that the KL divergence is not properly a metric, but does define a loss function. However, the KL divergence is not scale sensitive. Furthermore, it is infinite whenever is not absolutely continuous with respect to , which can be problematic when designing a distributional algorithm with finite support: applying the Bellman operator to a discrete random variable typically changes its support.

The KL divergence is generally used in conjunction with a softmax transfer function which guarantees that has unit mass; without this constraint, the minimizer of may not be . Furthermore, the KL divergence corresponds to the matching loss for the softmax function, guaranteeing that the resulting optimization is convex (with respect to the softmax weights; Auer et al., 1995).

Unlike the KL divergence, the Cramér distance (Székely, 2002) is a proper distance between probability distributions. Given two distributions and over with cumulative distribution functions and , the Cramér distance is defined as

 DC(p,q) =∫+∞−∞(Fp(t)−Fq(t))2dt. (2)

For the purposes of distributional reinforcement learning, the Cramér distance has a number of appealing properties. First, it is both sum invariant and scale sensitive of order . Then, the Cramér distance can be minimized by stochastic gradient methods.

### 2.2 Approximation in the Distributional Setting

Let us write for the distribution of the random variable . There are two common hurdles to learning : first, we typically do not have access to a simulator, and must instead rely on sample transitions; second, we cannot in general store the value distribution exactly, and instead must maintain an approximation. These two issues have been well studied in the expected value setting of reinforcement learning (see, e.g. Bertsekas & Tsitsiklis, 1996; Tsitsiklis & Van Roy, 1997), in particular relating the mean behaviour of sample-based algorithms such as TD (Sutton, 1988) to their operator counterparts, including in the context of linear function approximation. This section provides analogous notation describing sample-based methods for distributional reinforcement learning.

With a tabular representation, where distributions are stored exactly, Rowland et al. (2018) showed the existence of a mixture update with step-size :

 P(x)←P(x)+α(fr,γ(P(x′))−P(x)).

In this mixture update, is the distribution corresponding to the random variable , . This update rule converges to under the usual stochastic optimization conditions.

Rowland et al. (2018) also analyzed a mixture update for approximately tabular representations, when is constrained to be a distribution over uniformly-spaced atoms (we will describe this parametrization in greater detail in the next section). The modified update incorporates a projection step which finds the constrained distribution closest to in Cramér distance:

 P(x)←P(x)+α(ΠCfr,γ(P(x′))−P(x)). (3)

This projection step is used in the C51 algorithm, which parametrizes using a neural network with weights and whose final layer uses a softmax transfer function to generate the vector of probabilities . Ignoring second order optimization terms, the C51 update is

 (4)

where the use of the KL divergence is justified as the matching loss to the softmax, and is a “target” copy of (Mnih et al., 2015).

Although the update rule Eq. (4) works well in practice, it is difficult to justify. The KL divergence is not scale sensitive, and it is not clear that its combination with the Cramér projection and the softmax function leads to a convergent algorithm.

To address this issue, here we consider an update rule which directly minimizes the Cramér distance:

 Θ←Θ−α∇ΘDC(fr,γ(P~Θ(x′)),PΘ(x)). (5)

By the matching-loss argument, this suggests doing away with the transfer function and measuring the loss with respect to linear outputs. At first glance this might seem nonsensical, as these may not form a valid probability distribution. Yet, as we will see, the Cramér distance can be extended to deal with arbitrary vectors.

## 3 Generalizing the Cramér Distance

In this section we generalize the Cramér distance to vectors which do not necessarily describe probability distributions. We then transform this generalized distance to obtain a loss that is suited to the distributional setting. At a high level, our approach is as follows:

1. We rewrite the Cramér distance between distributions with discrete support as a weighted squared distance between vectors;

2. We show that this distance has undesirable properties when generalized beyond the space of probability distributions, and address this by modifying the eigenstructure of the weighting used in defining the distance;

3. We further modify the distance into a loss which regularizes the sum of vectors towards 1. This modification is key in our construction of an algorithm that is theoretically well-behaved when combined with linear function approximation.

We consider the space of distributions over returns with finite, common, bounded support with . In this context, Eq. (2) simplifies to a sum with simple structure:

with , in , where is the cumulative distribution function of :

 Fp(zi)=i∑j=1p(zj).

We shall also assume that is odd and , i.e. , . Without detracting from our results, this simplifies their exposition.

Let and denote the vectors associated with , and write for the lower-triangular matrix of 1s:

 C =⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣100…000110…000111…000⋮⋱⋮111…100111…110111…111⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦.

If (resp., ), these can be viewed as the probabilities of a distribution over . Then, is the cumulative distribution of , and the Cramér distance between and becomes

 l2CC⊤(p,q) :=∥Cp−Cq∥2=∥p−q∥2CC⊤. (6)

One can replace the cumulative distributions with the tail cumulative distributions to get

 l2C⊤C(p,q) =∥∥C⊤p−C⊤q∥∥2=∥p−q∥2C⊤C. (7)

If or do not correspond to proper probability distributions, the Cramér distance of Eq. 2 may be infinite, while Eq. 6 and 7 remain finite. This suggests the use of this definition when comparing vector-valued objects that are close to, or attempt to approximate distributions.

However, the two distances can disagree when and do not correspond to proper probability distributions. Let be the “mass” of , reflecting its relationship to the mass of a probability distribution. If and have different mass, then . The issue is that Eq. 6 and 7 measure differently the difference in mass.

To resolve this discrepancy, we modify the Cramér distance to deal unambiguously with uneven masses. This leads to a two-part distance: The first is insensitive to differences of total mass while the second only penalizes that difference. Let

 e=[1/√k,…,1/√k]⊤ andΠe⊥=Ik−ee⊤,

our distance is

 l2λ(p,q) :=(p−q)⊤Πe⊥CC⊤Πe⊥(p−q) +λ((p−q)⊤e)2. (8)

Denoting , we have

 l2λ(p,q)=(p−q)⊤Cλ(p−q)=∥p−q∥2Cλ.

First, one may note that, when and have the same total mass, we have for all values of . As such, this new distance clarifies the behaviour for arbitrary vectors while being consistent with the existing Cramér loss for proper distributions. For any given distribution , the solution to

 minq∈Rkl2λ(p,q)

is . On the other hand, if the minimization is done over a constrained set, determines the magnitude of the penalty from the difference in total mass.

As we will later see, the distance , used as a loss, is not sufficient to guarantee good behaviour with linear function approximation. Instead, we define a related loss but with an explicit normalization penalty:

 ^l2λ(p,q) =(p−q)⊤Πe⊥CC⊤Πe⊥(p−q) +λ(q⊤e−1)2. (9)

Intuitively, recognizes that a distribution-like object should benefit from having unit mass. In the context of the distributional Bellman operator, this is the difference between backing up the mass at successor states versus normalizing the state’s distribution to sum to 1. However, does not define a distance proper, and our theoretical treatment of it in Section 4.2.2 will require additional care.

## 4 Analysis

We now explore, through a series of lemmas, properties of the Cramér distance of relevance to distributional reinforcement learning. Two of these results will be related to the minimization of the Cramér loss directly over distributions and two will be related to the use of linear function approximation.

### 4.1 Optimization properties

We begin by analyzing properties resulting from the minimization of over , beginning with the approximately tabular setting (Section 2.2).

#### 4.1.1 Impact on optimization speed

A well-known result in convex optimization states that, when minimizing a quadratic function with positive definite Hessian using a batch first-order method, e.g., Eq. 5, the convergence to the optimum is linear with a rate of where is the condition number of . Assuming we directly optimize the Cramér loss over with such a method, the convergence rate would depend on the condition number of the matrix used, i.e. when using the Cramér loss or when using the extended loss .

\thmt@toks\thmt@toks

Let be the set of symmetric matrices for which for all proper distributions and . Let the lowest condition number attained by matrices in . Then all the matrices of the form with , where and are the second smallest and largest eigenvalues of , respectively, have condition number .

###### Lemma 1 (Condition number).

The proof of this result and the following may be found in the appendix.

Lemma 4.1.1 shows that the optimal convergence rate is obtained for a potentially wide range of values for . As an example, for , the condition number of is about 4296 while is around 1053, about 4 times lower, and this is true for in the range .

#### 4.1.2 Preservation of the expectation

Although the prediction may not be a distribution, it still makes sense to talk of the dot product between and the support as its “expectation”: indeed, in many cases of interest the optimization procedure does yield valid distributions. In designing a full distributional agent, this generalized notion of expectation is also a natural way to convert into a scalar value, e.g. for decision making.

This section discusses potential guarantees on the difference in expected return between and when is the minimizer of the Cramér loss over a restricted set. Typically, we will ask to have a specific support but other constraints might include that must be normalized or that some values of cannot be modified. Specifically, the following lemma studies the impact on the expectation when minimizing the Cramér loss over an affine subset. \thmt@toks\thmt@toks Let be an arbitrary distribution over a discrete support. Let the projection of onto the linear subset . Then, if the first and the last columns of are equal, i.e. , then and have the same expectation.

###### Lemma 2 (Expectation preserving).

Lemma 4.1.2 covers projections onto specific supports, as used in C51, as well as constraints on the total mass of , for instance that the projection has unit mass. Here, we use to denote both the support and the vector containing the elements of that support. More generally, the Cramér projection offers a certain amount of freedom on the constraints that can be enforced while still preserving the expectation. In particular, leaving the two boundaries unconstrained is enough to preserve the expectation.

### 4.2 Linear function approximation

We next quantify the behaviour of our generalized loss function when combined to linear function approximation. Section 4.2.2 is the main theoretical contribution of this paper: it shows that the combination of a loss based on Equation 9 together with linear approximation produces a stable dynamical system, and quantifies the approximation error that results from it.

#### 4.2.1 Two-step optimization

In categorical distributional RL, the target distribution is the product of an application of the distributional Bellman operator and does not usually have the same support as the parametrized output distribution . Recall that is the set of distributions with support . C51 first projects onto , yielding

 Πλ,D(p)=argminu∈Dl2λ(p,u),

assuming is a proper distribution. Then, as a second step in the update process, it minimizes the KL divergence between and .

In our experiments we retain the projection onto from the C51 algorithm, and subsequently minimize our loss with respect to this projection. Doing so is equivalent to directly minimizing the Cramér loss, even when is not a proper distribution. Extending the result from Lemma 3 of Rowland et al. (2018), we note that is an orthogonal projection for :

 l2λ(p,q(θ))=l2λ(p,Πλ,D(p))+l2λ(Πλ,D(p),q(θ)).

Taking the derivative of the two sides of this equation with respect to , the parameters of the model, yields

 ∂l2λ(p,q(θ))∂θ=∂l2λ(Πλ,D(p),q(θ))∂θ

and minimizing the distance with the projection of onto the support of leads to the same gradients. With some additional care, the argument extends to the loss with a normalization penalty, .

#### 4.2.2 Convergence to a fixed point

We are now ready to show the convergence of distributional RL in the context of linear function approximation. Recall that a proof of convergence for C51 is hindered by the failure of the KL minimization process to be nonexpansive in the Cramér distance; as we will see, our result critically depends on the loss defined in Equation 9.

We consider a feature matrix , with the number of states and the number of features, and a weight matrix . That is, we consider outputs of the form , which with some abuse of terminology we call value distributions. As before, we write to denote the -dimensional output for state .

We study a stochastic update rule of the form given by Eq. (5), but where is replaced by the loss . When the states to be updated are sampled according to a distribution , the expected behaviour of this update rule corresponds to an operator akin to a projection (Tsitsiklis & Van Roy, 1997). In our setting, the operator minimizes the -weighted Cramér loss derived from , denoted

 ^l2ξ,λ(P,Q):=∑x∈Xξ(x)^l2λ(P(x),Q(x)).

We denote this operator by (the notation is made explicit in the appendix). Given a value distribution , the operator finds the value distribution in the span of which minimizes :

 ^Πξ,λ,ΦP=ΦΘ∗%whereΘ∗=argminΘ^l2ξ,λ(P,ΦΘ).

Finally, our analysis is performed with respect the distance , rather than the loss (which is not a distance). This leads to the -weighted distance

 l2ξ,λ(P,Q):=∑x∈Xξ(x)l2λ(P(x),Q(x)),

with corresponding projection operator .

We now show that the combination of the distributional Bellman operator and the -weighted, projection-like operator describes a convergent algorithm. When , we can further bound the distance of this fixed point to the true value distribution in terms of the best approximation in the class, . As is usual, is taken to be the stationary distribution of the Markov chain described by : . \thmt@toks\thmt@toks Let be the stationary distribution induced by the policy . The process

###### Theorem 1 (Convergence of the projected distributional Bellman process).

converges to a set such that for any two , there is a -indexed vector of constants such that

 P(x)=P′(x)+α(x)e.

If , consists of a single point which is the fixed point of the process. Furthermore, we can bound the error of this fixed point with respect to the true value distribution :

 l2ξ,λ(~P,Pπ) ≤11−γl2ξ,λ(Πξ,λ,ΦPπ,Pπ) −γλ1−γ∥∥~P−Pπ∥∥2ξ,ee⊤,

where the second term measures the difference in mass between and . Theorem 4.2.2 is significant for a number of reasons. First, it answers the question left open by Rowland et al. (2018), namely whether a proof of convergence exists for the distributional setting with an approximate representation, and with which representation. Second, it shows that there is a trade-off between the different components of the loss – while our result concerns linear function approximation, it suggests that similar trade-offs must exist within other distributional algorithms.

The parameter plays an important role in the theorem, both to guarantee convergence and (indirectly) to determine the approximation error. At a high level, this makes sense: a high value of forces the algorithm to output something close to a distribution, at the expense of actual predictions. On the other hand, taking yields a process which may not converge to a single point. Finally, we note that to guarantee convergence to a unique fixed point, it is not enough to use the loss from Eq. 3: in that case, we can only guarantee convergence to the set , even for . The following lemma, used to prove Theorem 4.2.2, shows why: the distributional Bellman operator is only a nonexpansion along the dimension , which captures the mass of the output vectors.

\thmt@toks\thmt@toks

Let be the stationary distribution induced by the policy . Write to mean the distributional Bellman operator followed by a projection onto the support . For a matrix and , write

 ∥Δ∥2ξ,B=∑x∈Xξ(x)∥Δ(x)∥2B.
\thmt@toks\thmt@toks

Then for any two value distributions ,

###### Lemma 3.

where . When and have equal mass, we recover the contraction result by Bellemare et al. (2017b) (albeit in -weighted Cramér distance, rather than maximal Cramér distance) – however, this also shows that our generalization of the distributional Bellman operator deals differently with probability mass itself. This is why Theorem 4.2.2 requires the normalization penalty loss , rather than the simpler .

#### 4.2.3 Bound on the approximation error

Our analysis provides us with a partial answer to the question: why and when should distributional reinforcement learning perform better empirically? In the linear approximation case that we study here, one answer is that it might hurt performance, as the following theorem suggests: \thmt@toks\thmt@toks Let be the -weighted norm over value functions. The squared expectation error of the fixed point with respect to the true value function is bounded as

\thmt@toks\thmt@toks
###### Theorem 2 (Error bound for the expected value).

The proof relies on a Rayleigh quotient argument, and shows that the bound is tight if the error vector is collinear with . In particular, if we take such that , then the constant is . Then, as , the constant goes to infinity. By contrast, the bound on the approximate value function derived from Tsitsiklis & Van Roy (1997) is better in two respects: first, its equivalent constant is 1. Second, our bound contains an amplification factor from the error term , which in their bound becomes the smaller constant , because the usual Bellman operator is a -contraction in , while the Cramér distance is only a -contraction in the equivalent norm.

However, the bound is slightly misleading. In our analysis we have assumed that the width of the support, i.e. , also grows with . We can instead normalize the matrix and the support to reflect a fixed width: and . In this case, the constant remains but the squared loss may in some cases be times smaller. Still, it is not unreasonable to expect that, given that the distributional approach models more things, it should be more susceptible to misspecification.

## 5 Experiments

The Cramér distance enjoys many theoretical properties that the KL divergence used in C51 lacks. To complement our theoretical results in the policy evaluation setting, we now study how our new loss affects the overall performance in the more complex control setting (Sutton & Barto, 1998). Our goals are to demonstrate that we can achieve qualitatively comparable performance to C51 with an algorithm based on this loss, and to study the similarities and differences between the two algorithms.

We compare the original C51 algorithm with our Cramér variant from Eq. 9, dubbed S51, on five games supported by the Arcade Learning Environment (Bellemare et al., 2013), and using the Dopamine framework (Castro et al., 2018). In a nutshell, S51 learns from samples, using the sample-based version of the distributional Bellman operator (Eq. 1), but where the fixed policy is replaced by one which backs up the distribution with maximum expected value (what Rowland et al. (2018) calls “Categorical Q-Learning”). Further experimental details, including on how to transform C51 into S51, are given in Appendix A.

Figure 1 shows that S51 achieves higher scores than DQN, demonstrating that it maintains the empirical benefits of the distributional perspective, and performs as well as C51 in three out of five games. This is especially significant given the relative freedom of the network in outputting arbitrary vectors. Nonetheless, our results suggest that there are benefits to enforcing normalized distributions – possibly in reducing the update variance.

To better understand the qualitative differences between the two algorithms, we studied agents playing through episodes of different games and visualized the predicted distribution for their selected actions (videos available in the supplemental; Figure 3). We find that C51 outputs value distributions which are bell-shaped and may have a separate mode at 0. In contrast, the S51 distributions are much more diverse; we highlight two interesting results:

Double negatives. S51 agents often assign negative mass to negative returns in games where such returns are impossible, such as Pong (Figure 2, left). The total mass in these cases is still close to 1.

Compensation around 0. In Space Invaders (Figure 2, right), the 0 return prediction is bracketed with small negative and positive corrections that cancel each other out. One explanation is that the network compensates for its limited capacity by relying on negative return predictions. This is particularly interesting as this behaviour is not possible under published distributional algorithms.

Noisier predictions (left and right). S51 assigns a small amount of probability to almost all returns. We hypothesize that this effect is visually absent from the C51 histograms because of the squashing effect of the softmax transfer function, and that this added noise explains some of the difference in performance. In particular, to generate a small probability the C51 network need only output a sufficiently negative logit; by contrast, S51 must output a value which is neither too negative nor too positive (i.e., is actually close to 0).

## 6 Discussion and Conclusion

While the convergence of the distributional approach with linear approximation may have been predictable, our proof shows that the result is not completely straightforward, and that the normalization penalty plays an important role in convergence. Because the softmax produces bounded outputs, it may still be possible to derive some convergence guarantees for it; however, it seems difficult to bound on its approximation error once we leave the convex regime of the linear outputs/squared loss combination. Another question is whether minimizing the Cramér distance in the context of function approximation for optimal control somehow results in learning dynamics that are more stable than in the expected case, as a wealth of empirical results now suggest.

The Wasserstein distance also plays an important role in distributional reinforcement learning. Dabney et al. (2018b) demonstrated that one can obtain a stable distributional algorithm which minimizes the Wasserstein distance even in the approximate case by performing quantile regression rather than gradient descent on the sample Wasserstein loss. A similar analysis to ours may in fact prove convergence in the approximate setting; we expect that minimizing the Wasserstein metric should also be susceptible to pathological cases yielding a worse approximation of expected values.

Despite our attempts, we could not match the raw performance of C51. While this may only be a matter of hyperparameter tuning, we might have lost other properties when moving away from the KL. One might also wonder if there are other losses even more suited to the problem than our modified Cramér loss. In particular, since the ultimate goal is to preserve the expectation of the target distribution, one could adapt the loss to strengthen the link between loss minimization and expectation preservation.

## References

• Auer et al. (1995) Auer, Peter, Herbster, Mark, and Warmuth, Manfred K. Exponentially many local minima for single neurons. In Advances in Neural Information Processing, 1995.
• Barth-Maron et al. (2018) Barth-Maron, Gabriel, Hoffman, Matthew W., Budden, David, Dabney, Will, Horgan, Dan, TB, Dhruva, Muldal, Alistair, Heess, Nicolas, and Lillicrap, Timothy. Distributional policy gradients. In Proceedings of the International Conference on Learning Representations, 2018.
• Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279, 2013.
• Bellemare et al. (2017a) Bellemare, Marc G., Dabney, Will, and Munos, Rémi. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017a.
• Bellemare et al. (2017b) Bellemare, Marc G, Danihelka, Ivo, Dabney, Will, Mohamed, Shakir, Lakshminarayanan, Balaji, Hoyer, Stephan, and Munos, Rémi. The Cramér distance as a solution to biased Wasserstein gradients. arXiv, 2017b.
• Bertsekas & Tsitsiklis (1996) Bertsekas, Dimitri P. and Tsitsiklis, John N. Neuro-Dynamic Programming. Athena Scientific, 1996.
• Castro et al. (2018) Castro, Pablo S., Moitra, Subhodeep, Gelada, Carles, Kumar, Saurabh, and Bellemare, Marc G. Dopamine: A research framework for deep reinforcement learning. arXiv, 2018.
• Dabney et al. (2018a) Dabney, Will, Ostrovski, Georg, Silver, David, and Munos, Remi. Implicit quantile networks for distributional reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018a.
• Dabney et al. (2018b) Dabney, Will, Rowland, Mark, Bellemare, Marc G., and Munos, Rémi. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018b.
• Hessel et al. (2018) Hessel, Matteo, Modayil, Joseph, van Hasselt, Hado, Schaul, Tom, Ostrovski, Georg, Dabney, Will, Horgan, Dan, Piot, Bilal, Azar, Mohammad, and Silver, David. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
• Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
• Puterman (1994) Puterman, Martin L. Markov Decision Processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1994.
• Rowland et al. (2018) Rowland, Mark, Bellemare, Marc G, Dabney, Will, Munos, Rémi, and Teh, Yee Whye. An analysis of categorical distributional reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2018.
• Sutton (1988) Sutton, Richard S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
• Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, 1998.
• Székely (2002) Székely, Gabor J. E-statistics: The energy of statistical samples. Technical Report 02-16, Bowling Green State University, Department of Mathematics and Statistics, 2002.
• Tsitsiklis & Van Roy (1997) Tsitsiklis, John N. and Van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
• Watkins (1989) Watkins, Christopher J. C. H. Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.

## Appendix

We now provide the proofs of our lemmas.

### Impact on optimization speed

To prove Lemma 4.1.1, we first need to prove an additional lemma. \thmt@toks\thmt@toks The set of symmetric matrices such that for all normalized distributions and is the set

###### Proof.

Let . Since and are normalized, we have . Hence,

 (p−q)⊤M(p−q)= (p−q)⊤CC⊤(p−q) =(p−q)⊤CC⊤(p−q).

Conversely, let a symmetric matrix be such that for all normalized and . Then . For this to be true, must be colinear to . Thus, denoting where is not in the span of , we must have for all normalized and , i.e. . The symmetry constraint leads to . This concludes the proof. ∎

###### Proof.

Let be the vector associated with the maximum eigenvalue of and be an arbitrary vector. Because is a symmetric matrix whose only zero eigenvalue corresponds to , its eigenvectors are orthogonal to and in particular . Thus, we have

 L =v⊤LC0vL =v⊤LΠe⊥CC⊤Πe⊥vL =v⊤LCC⊤vL =v⊤LCC⊤vL+v⊤Lae⊤vL+v⊤Lea⊤vL

for any vector since . Denoting , we get

 L =v⊤LRavL ≤maxvv⊤Rav∥v∥2,

which is the largest eigenvalue of . Since this is true for every , has the lowest top eigenvalue from all the matrices in .

Similarly, let us denote be the vector associated with the second-smallest eigenvalue  111The smallest being 0.. As is the eigenvector associated with the eigenvalue , we have that and

 μ =v⊤μC0vμ =v⊤μCC⊤vμ =v⊤μCC⊤vμ+v⊤μae⊤vμ+v⊤μea⊤vμ =v⊤μRavμ ≥minvv⊤Rav∥v∥2.

Thus, for all , the second smallest eigenvalue of is larger than the smallest eigenvalue of .

This means that, for to have the smallest condition number of all the matrices in , it is sufficient to require that the eigenvalue associated with be between and , i.e. that . This concludes the proof. ∎

### Preservation of the expectation

To prove Lemma 4.1.2, we will need the following proposition: \thmt@toks\thmt@toks Let be defined as in Section 3, i.e. is the vector of evenly spaced returns between and with mean . Let . Then for all values of .

###### Proof.

We will prove that for all values of . First, we note that and .

Since , we have, denoting ,

 cj =∑iCijbi ={0if j = 11otherwise.

Multiplying by to get , we get

 di =∑jCij(C⊤b)j =i−1.

We now need to compute . Since , we have

 ri =di−V =i−1−k−12 =2i−1−k2 =zi.

This concludes the proof. ∎

###### Proof.

By definition,

 ΠA,b(p) =argminq(p−q)⊤Cλ(p−q)subject toAq=b.

Writing the Lagrange multipliers, this is a quadratic program whose solution is given by

 [ΠA,b(p)ν] =[CλA⊤A0]−1[Cλpb].

Inverting the block diagonal matrix yields

 [CλA⊤A0]−1 =[M11M12M21M22]

with

 M11 =C−1λ−C−1λA⊤(AC−1λA⊤)−1AC−1λ M12 =C−1λA⊤(AC−1λA⊤)−1 M21 =(AC−1λA⊤)−1AC−1λ M21 =−(AC−1λA⊤)−1.

Hence,

 ΠA,b(p) =M11Cλp+M12b =p−C−1λA⊤s

for some . Thus, the expected -value with respect to the projected distribution is equal to

 z⊤ΠA,b(p) =z⊤p−z⊤C−1λA⊤s

and the two expectations will be equal if . Using Proposition Preservation of the expectation, we know that . Thus, if it sufficient to have for the two expectations to match. Since only the first and the last components of are nonzeros and they are opposite of each other, we have when denoting the -th column of . This concludes the proof. ∎

### Convergence to a fixed point

This result requires additional definitions. A value distribution maps states to distributions on ; we extend this to vectors defined by a linear combination of features:

 P(x) =Θ⊤ϕ(x),

where is the feature vector at state and is the parameter matrix we try to estimate.

Concatening all feature vectors into a feature matrix , our linear approximation is . We assume that the vector approximates a distribution over the support , but it may have negative components and is not necessarily normalized.

We are given a distribution on and we shall use a Cramér distance between distributions over :

 l2λ(p,q):=∥p−q∥2Cλ.

We transform the matrix into an operator over continuous distributions, where with some abuse of notation we view as a distribution over a finite set of Diracs: . Then

 Πe⊥p(x) =p(x)−∫zky=z1p(y)dy Πe⊥q(x) =q(x)−∫zky=z1q(y)dy l2λ(p,q) =∫zkx=z1(∫xy=z1[Πe⊥p(y)−Πe⊥q(y)]dx)2dy +λ(∫zky=z1[p(y)−q(y)]dx)2. (10)

The first term on the right-hand side of Eq. (10) penalizes the difference in cdf of and while the second term penalizes the difference in mass. When applied to two distributions and over , this is equivalent to . We define the weighted Cramér distance over value distributions by

 l2ξ,λ(P,Q):=∑x∈Xξ(x)l2λ(P(x),Q(x)).

In what follows we identify three spaces of distributions or distribution-like objects. First, is the space of distributions with support the interval . is the space of distributions over . Finally, is the vector space spanned by the features , that is: .

While our value distribution will only output distributions over the support , the distributional Bellman operator transforms distributions over into distributions from . We thus need to consider the projection which projects onto :

 Πλ,Dp =argminq∈Dl2λ(p,q).

Lemma 3 from Rowland et al. (2018) states that, for any distribution , we have

 l2λ(p,q) =l2λ(p,Πλ,Dp)+l2λ(Πλ,Dp,q). (11)

We now move from the projection of distributions to the projection of value distributions. We define a projection in of a value distribution onto the subspace by

###### Definition 1.

The -weighted projection onto is

 Πξ,λ,VP:=argminQ∈Vl2ξ,λ(P,Q),

where both the projection and the distance are in bold to distinguish them from projection and distances in distribution space.

In particular, two projections are of interest. First, we consider the set of all value distributions from to distributions supported by . The projection onto this set is

 [Πξ,λ,DP](x) =Πλ,DP(x),

We are also interested in the -weighted projection onto , the set of linear value distributions:

 Πξ,λ,ΦP:=argminΦΘ,Θ∈Rm×kl2ξ,λ(P,ΘΦ),

The projection of the true value distribution gives us the closest linear value distribution according to the Cramér distance defined by .

\thmt@toks\thmt@toks

Let be an arbitrary value distribution supported on . The -weighted projection of onto , , is equal to the -weighted projection of .

###### Lemma 7 (Projection onto Φ).

The above lemma will let us restrict our attention to distributions on , that is .

###### Proof.

Fix . By definition, the support of is . Now

 l2ξ,λ(P,Q) =∑x∈Xξ(x)l2λ(P(x),Q(x)) =∑x∈Xξ(x)l2λ(Q(x),Πλ,DP(x)) +∑x∈Xξ(x)l2λ(Πλ,DP(x),P(x)) =l2ξ,λ(Q,Πλ,DP)+ ∑x∈Xξ(x)l2λ(Πλ,DP(x),P(x)),

using Eq. 11. From the above we deduce that the matrix which minimizes is also the minimizer of . ∎

\thmt@toks\thmt@toks

is a non-expansion in , i.e. for every pair of value distributions, we have

###### Proof.

We can view as a weighted norm over vectors in , with the corresponding projection onto the affine subspace spanned by . The result is standard from these observations. ∎

Recall that the loss between vectors is defined through the matrix : . To prove Theorem 4.2.2 we will consider two separate components of that loss: along and along the subspace orthogonal to . That is, let us write

 A:=Πe⊥C,

such that

 l2λ(p,q)=∥p−q∥2AA⊤+λ∥p−q∥2ee