Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning


Abstract

We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity in terms of a drop in the rank of the learned value network features, and show that this corresponds to a drop in performance. We demonstrate this phenomenon on widely studied domains, including Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse improves performance.


1 Introduction

Many effective deep reinforcement learning (RL) algorithms estimate value functions using bootstrapping, that is, by sequentially fitting value functions to target value estimates generated from the value function learned in the previous iteration. Despite high-profile achievements (Silver et al., 2017), these algorithms are highly unreliable (Henderson et al., 2017) due to poorly understood optimization issues. Although a number of hypotheses have been proposed to explain these issues, resulting in several effective fixes (Achiam et al., 2019; Bengio et al., 2020; Fu et al., 2019; Igl et al., 2020; Liu et al., 2018; Kumar et al., 2020a; Du et al., 2019), a complete understanding remains elusive.

We identify an “implicit under-parameterization” phenomenon (Figure 1) that emerges when value networks are trained using gradient descent combined with bootstrapping. This phenomenon manifests as an excessive aliasing of features learned by the value network across states, which is exacerbated with more gradient updates. While the supervised deep learning literature suggests that some feature aliasing is desirable for generalization (e.g.,  Gunasekar et al., 2017; Arora et al., 2019), implicit under-parameterization exhibits more pronounced aliasing than in supervised learning. This over-aliasing causes an otherwise expressive value network to implicitly behave as an under-parameterized network, often resulting in poor performance.

Implicit under-parameterization becomes aggravated when the rate of data re-use is increased, restricting the sample efficiency of deep RL methods. In online RL, increasing the number of gradient steps between data collection steps for data-efficient RL (Fu et al., 2019; Fedus et al., 2020b) causes the problem to emerge more frequently. In the extreme case when no additional data is collected, referred to as offline RL (Lange et al., 2012; Agarwal et al., 2020; Levine et al., 2020), implicit under-parameterization manifests consistently, limiting the viability of offline methods.

We demonstrate the existence of implicit under-parameterization in common value-based deep RL methods, including Q-learning (Mnih et al., 2015; Hessel et al., 2018) and actor-critic algorithms (Haarnoja et al., 2018), as well as neural fitted-Q iteration (Riedmiller, 2005; Ernst et al., 2005). To isolate the issue, we study the effective rank of the features in the penultimate layer of the value network (Section 3). We observe that after an initial learning period, the rank of the features drops steeply. As the rank decreases, the features are less suitable for fitting subsequent target values, resulting in a sharp decrease in performance (Section 3.3).

To better understand the emergence of implicit under-parameterization, we formally study the dynamics of Q-learning under two distinct models of neural net behavior (Section 4): kernel regression (Jacot et al., 2018; Mobahi et al., 2020) and deep linear networks (Arora et al., 2018). We corroborate the existence of this phenomenon in both models, and show that implicit under-parameterization stems from a pathological interaction between bootstrapping and the implicit regularization of gradient descent. Since value networks are trained to regress towards targets generated by a previous version of the same model, this leads to a sequence of value networks of potentially decreasing expressivity, which can result in degenerate behavior and a drop in performance.

Figure 1: Implicit under-parameterization. Schematic diagram depicting the emergence of an effective rank collapse in deep Q-learning. Minimizing TD errors using gradient descent with a deep neural network Q-function leads to a collapse in the effective rank of the learned features $\Phi$, which is exacerbated with further training.

The main contribution of this work is the identification of implicit under-parameterization in deep RL methods that use bootstrapping. Empirically, we demonstrate a collapse in the rank of the learned features during training, and show that it typically corresponds to a drop in performance in the Atari (Bellemare et al., 2013) and continuous control Gym (Brockman et al., 2016) benchmarks in both the offline and data-efficient online RL settings. We additionally analyze the causes of this phenomenon theoretically. We then show that mitigating this phenomenon via a simple penalty on the singular values of the learned features improves performance of value-based RL methods in the offline setting on Atari.

2 Preliminaries

The goal in RL is to maximize long-term discounted reward in a Markov decision process (MDP), defined as a tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$ (Puterman, 1994), with state space $\mathcal{S}$, action space $\mathcal{A}$, a reward function $R(s,a)$, transition dynamics $P(s'|s,a)$ and a discount factor $\gamma \in [0, 1)$. The Q-function $Q^{\pi}(s,a)$ for a policy $\pi$ is the expected long-term discounted reward obtained by executing action $a$ at state $s$ and following $\pi$ thereafter, $Q^{\pi}(s,a) := \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\big|\, s_0 = s, a_0 = a\big]$. $Q^{\pi}$ is the fixed point of the Bellman operator $\mathcal{T}^{\pi}$: $(\mathcal{T}^{\pi} Q)(s,a) := R(s,a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a),\, a' \sim \pi(\cdot|s')}\big[Q(s', a')\big]$, which can be written in vector form as $Q^{\pi} = R + \gamma P^{\pi} Q^{\pi}$. The optimal Q-function, $Q^{*}$, is the fixed point of the Bellman optimality operator $\mathcal{T}$: $(\mathcal{T} Q)(s,a) := R(s,a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[\max_{a'} Q(s', a')\big]$.
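To make these operators concrete, the following is a minimal numpy sketch of the Bellman operators on a small tabular MDP; the MDP instance, sizes, and variable names are illustrative stand-ins rather than anything from our experiments.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP used only to illustrate the operators above.
S, A, gamma = 3, 2, 0.99
rng = np.random.default_rng(0)
R = rng.uniform(size=(S, A))                    # reward R(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))      # transitions P(s' | s, a), shape (S, A, S)

def bellman_optimality(Q):
    """(T Q)(s, a) = R(s, a) + gamma * E_{s'}[ max_{a'} Q(s', a') ]."""
    return R + gamma * P @ Q.max(axis=1)

def bellman_policy(Q, pi):
    """(T^pi Q)(s, a) = R(s, a) + gamma * E_{s', a' ~ pi}[ Q(s', a') ]; pi has shape (S, A)."""
    return R + gamma * P @ (pi * Q).sum(axis=1)

# Repeatedly applying T converges to its fixed point Q* (value iteration).
Q = np.zeros((S, A))
for _ in range(1000):
    Q = bellman_optimality(Q)
```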

1:  Initialize Q-network $Q_{\theta_0}$, buffer $\mu$.
2:  for fitting iteration $k$ in {1, …, N} do
3:     Compute target values $y_k(s, a) := r + \gamma \max_{a'} Q_{\theta_{k-1}}(s', a')$ on $\mu$ for training
4:     Minimize TD error for $Q_{\theta_k}$ via T gradient descent updates, $\theta_k \leftarrow \arg\min_{\theta} \mathbb{E}_{s, a, s' \sim \mu}\big[\big(Q_{\theta}(s,a) - y_k(s,a)\big)^2\big]$
5:  end for
Algorithm 1 Fitted Q-Iteration (FQI)

Practical Q-learning methods (e.g.,  Mnih et al., 2015; Hessel et al., 2018; Haarnoja et al., 2018) convert the Bellman equation into a bootstrapping-based objective for training a Q-network, $Q_\theta$, via gradient descent. This objective, known as mean-squared temporal difference (TD) error, is given by: $\mathcal{L}(\theta) = \mathbb{E}_{s,a,s'}\big[\big(R(s,a) + \gamma \max_{a'} \bar{Q}_{\bar{\theta}}(s', a') - Q_{\theta}(s,a)\big)^2\big]$, where $\bar{Q}_{\bar{\theta}}$ is a delayed copy of the Q-function, typically referred to as the target network. These methods train Q-networks via gradient descent and slowly update the target network via Polyak averaging on its parameters. We refer to the output of the penultimate layer of the deep Q-network as the learned feature matrix $\Phi$, such that $Q_\theta(s,a) = w^{\top} \phi(s,a)$, where $w$ denotes the weights of the final linear layer and $\phi(s,a)$ denotes the row of $\Phi$ corresponding to the input $(s,a)$.

For simplicity of analysis, we abstract deep Q-learning methods into a generic fitted Q-iteration (FQI) framework (Ernst et al., 2005). We refer to FQI with neural nets as neural FQI (Riedmiller, 2005). In the $k$-th fitting iteration, FQI trains the Q-function, $Q_{\theta_k}$, to match the target values generated using the previous Q-function, $Q_{\theta_{k-1}}$ (Algorithm 1). Practical methods can be instantiated as variants of FQI, with different target update styles, different optimizers, etc.
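As a concrete reference point, the following PyTorch sketch instantiates Algorithm 1 on a batch of transition tensors. The network size, optimizer settings, and synthetic data are placeholders for illustration, not the configurations used in our experiments.

```python
import torch
import torch.nn as nn

def neural_fqi(s, a, r, s_next, n_actions, num_iters=10, grad_steps_T=100, gamma=0.99):
    """Minimal neural FQI: in each fitting iteration, freeze bootstrapped targets computed
    from the previous Q-function, then take T gradient steps on the mean squared TD error."""
    q_net = nn.Sequential(nn.Linear(s.shape[1], 64), nn.ReLU(), nn.Linear(64, n_actions))
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    for k in range(num_iters):
        with torch.no_grad():                           # targets from the previous iterate
            y_k = r + gamma * q_net(s_next).max(dim=1).values
        for _ in range(grad_steps_T):                   # T gradient descent updates
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((q_sa - y_k) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return q_net

# Usage on random (hypothetical) transitions:
N, d, n_actions = 256, 8, 4
s, s_next = torch.randn(N, d), torch.randn(N, d)
a, r = torch.randint(n_actions, (N,)), torch.randn(N)
q_net = neural_fqi(s, a, r, s_next, n_actions)
```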

3 Implicit Under-Parameterization in Deep Q-Learning

In this section, we empirically demonstrate the existence of implicit under-parameterization in deep RL methods that use bootstrapping. We characterize implicit under-parameterization in terms of the effective rank (Yang et al., 2019) of the features learned by a Q-network. The effective rank of the feature matrix $\Phi$, for a threshold $\delta$ (we choose $\delta = 0.01$), denoted as $\text{srank}_\delta(\Phi)$, is given by $\text{srank}_\delta(\Phi) = \min\big\{k : \frac{\sum_{i=1}^{k} \sigma_i(\Phi)}{\sum_{i=1}^{d} \sigma_i(\Phi)} \ge 1 - \delta\big\}$, where $\{\sigma_i(\Phi)\}$ are the singular values of $\Phi$ in decreasing order, i.e., $\sigma_1 \ge \cdots \ge \sigma_d \ge 0$. Intuitively, this quantity represents the number of “effective” unique components of the feature matrix $\Phi$ that form the basis for linearly approximating the Q-values. When the network maps different states to orthogonal feature vectors, such as when $\Phi$ is an identity matrix, then $\text{srank}_\delta(\Phi)$ takes on high values close to the feature dimension $d$. When the network “aliases” state-action pairs by mapping them to a smaller subspace, $\Phi$ has only a few active singular directions, and $\text{srank}_\delta(\Phi)$ takes on a small value. Based on this insight we can define implicit under-parameterization as:

Definition 1.

Implicit under-parameterization refers to a reduction in the effective rank of the features, $\text{srank}_\delta(\Phi)$, that occurs implicitly as a by-product of learning deep neural network Q-functions.
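The effective rank above is straightforward to compute from the singular values of a minibatch feature matrix; the following numpy sketch implements $\text{srank}_\delta$ as defined in this section (the example feature matrix is synthetic).

```python
import numpy as np

def srank(phi, delta=0.01):
    """Effective rank: smallest k whose top-k singular values account for a
    (1 - delta) fraction of the total singular value mass."""
    sigma = np.linalg.svd(phi, compute_uv=False)        # singular values, descending
    cumulative = np.cumsum(sigma) / np.sum(sigma)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Synthetic example: 512-dim features that mostly lie in a 10-dim subspace.
rng = np.random.default_rng(0)
phi = rng.normal(size=(2048, 10)) @ rng.normal(size=(10, 512))
phi += 1e-3 * rng.normal(size=phi.shape)                # small full-rank perturbation
print(srank(phi))                                       # close to 10, far below 512
```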

While rank decrease also occurs in supervised learning, it is usually beneficial for obtaining generalizable solutions (Gunasekar et al., 2017; Arora et al., 2019). However, we will show that in deep Q-learning, an interaction between bootstrapping and gradient descent can lead to more aggressive rank reduction (or rank collapse), which can hurt performance.

Figure 2: Data-efficient offline RL. $\text{srank}_\delta(\Phi)$ and performance of neural FQI on gridworld, DQN on Atari, and SAC on Gym environments in the offline RL setting. Note that low rank (top row) generally corresponds to worse policy performance (bottom row). Rank collapse is worse with more gradient steps per fitting iteration (larger vs. smaller values of T on gridworld). Even when a larger, high-coverage dataset is used, marked as DQN (4x data), rank collapse occurs (for Asterix, also see Figure A.2 for a complete figure with a larger number of gradient updates).

Experimental setup. To study implicit under-parameterization empirically, we compute $\text{srank}_\delta(\Phi)$ on a minibatch of state-action pairs sampled i.i.d. from the training data (i.e., the dataset in the offline setting, and the replay buffer in the online setting). We investigate offline and online RL settings on benchmarks including Atari games (Bellemare et al., 2013) and Gym environments (Brockman et al., 2016). We also utilize gridworlds described by Fu et al. (2019) to compare the learned Q-function against the oracle solution computed using tabular value iteration. We evaluate DQN (Mnih et al., 2015) on gridworld and Atari, and SAC (Haarnoja et al., 2018) on Gym domains.

3.1 Data-Efficient Offline RL

In offline RL, our goal is to learn effective policies by performing Q-learning on a fixed dataset of transitions. We investigate the presence of rank collapse when deep Q-learning is used with broad-state-coverage offline datasets from Agarwal et al. (2020). In the top row of Figure 2, we show that after an initial learning period, $\text{srank}_\delta(\Phi)$ decreases in all domains (Atari, Gym and the gridworld). The final value of $\text{srank}_\delta(\Phi)$ is often quite small – e.g., in Atari, only 20-100 singular components are active for the 512-dimensional features, implying significant underutilization of network capacity. Since under-parameterization is implicitly induced by the learning process, even high-capacity value networks behave as low-capacity networks as more training is performed with a bootstrapped objective (e.g., mean squared TD error).

On the gridworld environment, regressing to $Q^*$ using supervised regression results in a much higher $\text{srank}_\delta(\Phi)$ (black dashed line in Figure 2 (left)) than when using neural FQI. On Atari, even when a 4x larger offline dataset with much broader coverage is used (blue line in Figure 2), rank collapse still persists, indicating that implicit under-parameterization is not due to limited offline dataset size. Figure 2 (bottom row) illustrates that policy performance generally deteriorates as $\text{srank}_\delta(\Phi)$ drops, and eventually collapses simultaneously with the rank collapse. While we do not claim that implicit under-parameterization is the only issue in deep Q-learning, the results in Figure 2 show that the emergence of this under-parameterization is strongly associated with poor performance.

It is well-known that distributional shift between the offline dataset and the learned policy is a major reason for instability of standard RL algorithms in the offline regime (Fujimoto et al., 2019; Kumar et al., 2019). To control for this potential confounding factor, we also study CQL (Kumar et al., 2020b), a recently-proposed offline RL method designed to handle distribution mismatch by training the Q-function with an additional regularizer that encourages low Q-values at unseen actions. We find a similar degradation in effective rank and performance for CQL (Figure A.3), implying that implicit under-parameterization still appears when methods that correct for distribution shift are employed. The reason is that such methods still operate in the regime where multiple gradient steps are performed on a given unit of experience and so implicit under-parameterization can appear. We provide evidence for implicit under-parameterization in more Atari and Gym environments in Appendix A.1.

Figure 3: Data-efficient online RL. $\text{srank}_\delta(\Phi)$ and performance of neural FQI on gridworld, DQN on Atari, and SAC on Gym domains in the online RL setting, with varying numbers of gradient steps per environment step ($n$). Rank collapse happens earlier with more gradient steps, and the corresponding performance is poor.

3.2 Data-Efficient Online RL

Figure 4: Data-efficient Rainbow. $\text{srank}_\delta(\Phi)$ drop for the data-efficient Rainbow agent (van Hasselt et al., 2019), which uses 4 gradient updates per unit of environment interaction ($n = 4$) with multi-step returns.

Deep Q-learning methods typically use very few gradient updates ($n$) per unit amount of environment interaction: DQN performs 1 update for every unit of environment interaction on Atari (= 4 environment steps), and thus $n = 1$; SAC performs 1 update per unit of environment interaction (= 1 environment step). Improving the sample efficiency of these methods requires increasing $n$ to utilize the replay data more effectively. However, we find that using larger values of $n$ results in higher levels of rank collapse as well as performance degradation.

In the top row of Figure 3, we show that larger values of $n$ lead to a more aggressive drop in $\text{srank}_\delta(\Phi)$ (red vs. blue/orange lines), and that the rank continues to decrease with more training. Furthermore, the bottom row illustrates that larger values of $n$ result in worse performance, corroborating Fu et al. (2019); Fedus et al. (2020b). As in the offline setting, directly regressing to $Q^*$ via supervised learning does not cause rank collapse (black line in Figure 3), unlike when using bootstrapping.

Finally, we evaluate data-efficient Rainbow (DER) (van Hasselt et al., 2019), which modifies the Rainbow algorithm (Hessel et al., 2018) to obtain improved performance with limited environment interaction. DER utilizes 4 gradient updates per environment step ($n = 4$) with multi-step returns (20 steps) and a smaller Q-network. We observe that DER also exhibits rank collapse (Figure 4), similarly to the DQN algorithm with more updates. These results indicate that DER also suffers from implicit under-parameterization, and that using multi-step returns does not alleviate the issue. When training DER for longer, up to 25M steps, we also observe that it performs suboptimally compared to a standard Rainbow algorithm ($n = 1$), giving rise to similar trends in performance as the DQN variants in Figure 3. We present similar results demonstrating implicit under-parameterization on more Atari and Gym environments in the online data-efficient setting in Appendix A.2.

3.3 Understanding Implicit Under-parameterization and its Implications

Finally, we aim to understand the primary sources of error behind implicit under-parameterization and how the emergence of this phenomenon affects the performance of RL algorithms in the data-efficient setting, via controlled empirical experiments that we discuss next.

How does implicit under-parameterization degrade performance? Having established the presence of rank collapse in data-efficient RL, we now discuss how it can adversely affect performance. As the effective rank of the network features decreases, so does the network’s ability to fit the subsequent target values, eventually resulting in an inability to fit $Q^*$. In the gridworld domain, we measure this loss of expressivity via the error in fitting oracle-computed $Q^*$ values through a linear transformation of $\Phi$. When rank collapse occurs, the error in fitting $Q^*$ steadily increases during training, and the resulting network is not able to predict $Q^*$ at all by the end of training (Figure 5) – this entails a drop in performance. In Atari domains, we do not have access to $Q^*$, and so we instead measure the TD error, that is, the error in fitting the target value estimates. In Seaquest, as rank decreases, the TD error increases (Figure 6) and the value function is unable to fit the target values, culminating in a performance plateau (Figure 3). This observation is consistent across other environments; we present further supporting evidence in Appendix A.4.

Figure 5: $Q^*$ fitting error. Error in fitting $Q^*$ via a linear transformation of $\Phi$ vs. gradient steps for the gridworld runs in Figure 3 (left). Observe that rank collapse inhibits fitting $Q^*$: the fitting error rises over training while the rank collapses.
Figure 6: TD error (in log scale) for varying values of $n$ for Seaquest and Asterix shown in Figure 3 (middle). The TD error is larger for larger values of $n$ (see orange vs. red in both games), indicating that the values of $n$ that exhibit a lower effective feature-matrix rank also attain a larger TD error in both cases.
Figure 7: Periodic reinitialization of Q-networks still exhibits implicit under-parameterization.

Does bootstrapping cause implicit under-parameterization? We perform a number of controlled experiments in the gridworld and Atari environments to isolate the connection between rank collapse and bootstrapping. We first remove confounding with issues of poor network initialization (Fedus et al., 2020a) and non-stationarity (Igl et al., 2020) by showing that rank collapse occurs even when the Q-network is re-initialized from scratch at the start of each fitting iteration (Figure 7). To show that the problem is not isolated to the control setting, we show evidence of rank collapse in the policy evaluation setting as well. We trained a value network using fitted Q-evaluation for a fixed policy (i.e., using the Bellman operator $\mathcal{T}^{\pi}$ instead of $\mathcal{T}$), and found that the rank drop still occurs (FQE in Figure 8). Finally, we show that by removing bootstrapped updates and instead regressing directly to Monte-Carlo (MC) estimates of the value, the effective rank does not collapse (MC Returns in Figure 8). These results, along with similar findings on other Atari environments (Appendix A.3), indicate that bootstrapping is at the core of implicit under-parameterization.

Figure 8: Bootstrapping vs. Monte-Carlo evaluation. Trend of $\text{srank}_\delta(\Phi)$ for policy evaluation based on bootstrapped updates (FQE) vs. Monte-Carlo returns (no bootstrapping). Note that rank collapse still persists with reinitialization and FQE, but goes away in the absence of bootstrapping.

4 Theoretical Analysis of Implicit Under-Parameterization

In this section, we formally analyze implicit under-parameterization and prove that training neural networks with bootstrapping reduces the effective rank of the Q-network, corroborating the empirical observations in the previous section. For ease of analysis, we focus on policy evaluation (Figure 8 and Figure A.10), where we aim to learn a Q-function that satisfies $Q = R + \gamma P^{\pi} Q$ for a fixed policy $\pi$. We also presume a fixed dataset of transitions, $\mathcal{D}$, to learn the Q-function.

4.1 Analysis via Kernel Regression

We first study bootstrapping with neural networks through a mathematical abstraction that treats the Q-network as a kernel machine, following the neural tangent kernel (NTK) formalism (Jacot et al., 2018). Building on prior analysis of self-distillation (Mobahi et al., 2020), we assume that at each iteration of bootstrapping, the Q-function optimizes the squared TD error to the target labels with a kernel regularizer. Intuitively, this regularizer captures the inductive bias from gradient descent, resembling the regularization imposed by gradient descent under the NTK (Mobahi et al., 2020). The error is computed on the dataset $\mathcal{D}$, whereas the regularization, imposed by a universal kernel $u$ with a coefficient of $c$, is applied to the Q-values at all state-action pairs, as shown in Equation 1.

$Q^{k+1} \leftarrow \arg\min_{Q}\; \sum_{(s_i, a_i) \in \mathcal{D}} \big(Q(s_i, a_i) - y^{k}(s_i, a_i)\big)^2 \;+\; c \int\!\!\int u\big((s,a), (\tilde{s},\tilde{a})\big)\, Q(s,a)\, Q(\tilde{s},\tilde{a})\; \mathrm{d}(s,a)\, \mathrm{d}(\tilde{s},\tilde{a}) \qquad (1)$

The solution to Equation 1 can be expressed as $Q^{k+1}(s,a) = \mathbf{g}_{(s,a)}^{\top} (G + cI)^{-1} \mathbf{y}^{k}$, where $G$ is the Gram matrix for a special positive-definite kernel $g$ (Duffy, 2015) and $\mathbf{g}_{(s,a)}$ denotes the row of $G$ corresponding to the input $(s,a)$ (Mobahi et al., 2020, Proposition 1). A detailed proof is in Appendix C. When combined with the fitted Q-iteration recursion, setting labels $\mathbf{y}^{k} = R + \gamma P^{\pi} Q^{k}$, we recover a recurrence that relates subsequent value function iterates

$Q^{k+1} = S\big(R + \gamma P^{\pi} Q^{k}\big), \quad \text{where } S := G\,(G + cI)^{-1} \qquad (2)$
$Q^{k+1} = \Big[\textstyle\sum_{i=0}^{k} \big(\gamma\, S\, P^{\pi}\big)^{i}\Big] S\, R \qquad (3)$

Equation 3 follows from unrolling the recurrence and setting the algorithm-agnostic initial Q-value, $Q^{0}$, to be $0$. We now show that the sparsity of singular values of the matrix $M_k := \big[\sum_{i=0}^{k} (\gamma S P^{\pi})^{i}\big] S$ generally increases over fitting iterations, implying that the effective rank of $M_k$ diminishes with more iterations.

Theorem 4.1.

Let $M_k$ be a shorthand for $\big[\sum_{i=0}^{k} (\gamma S P^{\pi})^{i}\big] S$ and assume $A := \gamma\, S\, P^{\pi}$ is a normal matrix. Then there exists an infinite, strictly increasing sequence of fitting iterations $(k_l)_{l \ge 1}$, starting from some iteration $k_1$, such that, for any two singular values $\sigma_i$ and $\sigma_j$ of $M_{k_l}$ with $\sigma_i < \sigma_j$,

$\dfrac{\sigma_i\big(M_{k_{l+1}}\big)}{\sigma_j\big(M_{k_{l+1}}\big)} \;\le\; \dfrac{\sigma_i\big(M_{k_l}\big)}{\sigma_j\big(M_{k_l}\big)} \qquad (4)$

Hence, the effective rank of $M_{k_l}$ satisfies: $\text{srank}_\delta(M_{k_{l+1}}) \le \text{srank}_\delta(M_{k_l})$. Moreover, if $A$ is positive semi-definite, then $\text{srank}_\delta(M_{k+1}) \le \text{srank}_\delta(M_k)$ for every $k$, i.e., the effective rank continuously decreases in each fitting iteration.

We provide a proof of the theorem above, as well as a stronger variant that shows a gradual decrease in the effective rank for fitting iterations outside this infinite sequence, in Appendix C. As $k$ increases along the sequence of iterations $(k_l)_{l \ge 1}$, the effective rank of the matrix $M_k$ drops, leading to gradually decreasing expressivity of this matrix. Since $M_k$ linearly maps rewards to the Q-function (Equation 3), this drop in expressivity results in an inability to model the actual $Q^{\pi}$.
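To give a numerical feel for this mechanism, the sketch below iterates the recurrence of Equations 2-3 on a small synthetic problem and tracks the effective rank of $M_k$; the Gram matrix, transition matrix, and constants are arbitrary stand-ins chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, c = 50, 0.99, 1.0

X = rng.normal(size=(n, n))
G = X @ X.T + 1e-3 * np.eye(n)                 # positive-definite Gram matrix of the kernel
P_pi = rng.dirichlet(np.ones(n), size=n)       # row-stochastic transition matrix under pi
S = G @ np.linalg.inv(G + c * np.eye(n))       # S = G (G + cI)^{-1}

def srank(M, delta=0.01):
    sigma = np.linalg.svd(M, compute_uv=False)
    return int(np.searchsorted(np.cumsum(sigma) / sigma.sum(), 1 - delta) + 1)

# M_k = [sum_{i=0}^{k} (gamma S P^pi)^i] S maps rewards to the k-th Q-iterate (Equation 3).
A = gamma * S @ P_pi
M, term = S.copy(), S.copy()
for k in range(1, 201):
    term = A @ term                            # (gamma S P^pi)^k S
    M = M + term
    if k % 50 == 0:
        print(f"fitting iteration {k}: srank(M_k) = {srank(M)}")
```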

4.2 Analysis with Deep Linear Networks under Gradient Descent

While Section 4.1 demonstrates that rank collapse will occur in a kernel-regression model of Q-learning, it does not illustrate when the rank collapse occurs. To better characterize the point in training at which rank collapse emerges, we present a complementary derivation motivated by the implicit regularization of gradient descent over deep linear neural networks (Arora et al., 2019). Our analysis will show that rank collapse can emerge as the generated target values begin to approach the previous value estimate, in particular, in the vicinity of the optimal Q-function.

Additional notation and assumptions. We represent the Q-function as a deep linear network with $N \ge 3$ layers, such that $Q(s,a) = W_N W_{N-1} \cdots W_1\, x_{s,a}$, where $x_{s,a}$ is the input vector for the state-action pair $(s,a)$, $W_N \in \mathbb{R}^{1 \times d_{N-1}}$, and $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ for $i < N$. The product $W_{N-1:1} := W_{N-1} \cdots W_1$ maps an input $x_{s,a}$ to the corresponding penultimate-layer features $\phi(s,a)$. Let $W_i(k, t)$ denote the $i$-th weight matrix at the $t$-th step of gradient descent during the $k$-th fitting iteration (Algorithm 1). We define $\Phi(k,t) := W_{N-1:1}(k,t)$ and $\mathcal{L}_k(\cdot)$ as the TD error objective in the $k$-th fitting iteration. We study $W_{N-1:1}$ since the rank of the features $\Phi$ is equal to the rank of $W_{N-1:1}$, provided the state-action inputs have high rank.

We assume that the evolution of the network weights is governed by a continuous-time differential equation (Arora et al., 2018) within each fitting iteration $k$. Similar to Arora et al. (2018, 2019), we assume that all layers except the last share singular values (a.k.a. “balancedness”). In this model, we can characterize the evolution of the singular values of the feature matrix $\Phi(k,t)$ using techniques analogous to Arora et al. (2019):

Proposition 4.1.

The singular values of the feature matrix $\Phi(k, t)$ evolve according to:

$\dfrac{\mathrm{d}}{\mathrm{d}t}\, \sigma_i\big(\Phi(k,t)\big) \;=\; -(N-1)\, \big(\sigma_i(\Phi(k,t))\big)^{2 - \frac{2}{N-1}}\; \mathbf{u}_i(k,t)^{\top}\, \nabla_{\Phi}\, \mathcal{L}_k\big(\Phi(k,t)\big)\, \mathbf{v}_i(k,t) \qquad (5)$

for all $i$, where $\mathbf{u}_i(k,t)$ and $\mathbf{v}_i(k,t)$ denote the $i$-th left and right singular vectors of the feature matrix $\Phi(k,t)$, respectively.

Figure: Evolution of the singular values of the feature matrix $\Phi(k, t)$ on Seaquest.

Solving the differential equation (5) tells us that larger singular values will evolve at an exponentially faster rate than smaller singular values (as we also formally show in Appendix D.1), and the difference in their magnitudes disproportionately increases with increasing $t$. This behavior also occurs empirically, as illustrated in the figure on the right (also see Figure D.1), where the larger singular values are orders of magnitude larger than the smaller singular values. Hence, the effective rank of the features, $\text{srank}_\delta(\Phi(k,t))$, will decrease with more gradient steps within a fitting iteration $k$.
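The following toy PyTorch experiment illustrates this within-iteration effect under assumptions of our own choosing (a depth-3 linear network, small random initialization, and fixed random regression targets standing in for the frozen bootstrap targets of one fitting iteration); it is a sketch of the implicit-regularization effect, not a reproduction of the Seaquest measurement.

```python
import torch

torch.manual_seed(0)
d, n, steps = 32, 256, 5000
X, Y = torch.randn(n, d), torch.randn(n, 1)          # inputs and fixed (frozen) targets

# Depth-3 linear network; W2 @ W1 plays the role of the penultimate feature map.
W1 = (1e-2 * torch.randn(d, d)).requires_grad_()
W2 = (1e-2 * torch.randn(d, d)).requires_grad_()
w3 = (1e-2 * torch.randn(d, 1)).requires_grad_()
opt = torch.optim.SGD([W1, W2, w3], lr=1e-2)

def srank(sigma, delta=0.01):
    frac = torch.cumsum(sigma, 0) / sigma.sum()
    return int((frac < 1 - delta).sum().item() + 1)

for t in range(steps + 1):
    loss = ((X @ W1.T @ W2.T @ w3 - Y) ** 2).mean()   # squared error to fixed targets
    opt.zero_grad(); loss.backward(); opt.step()
    if t % 1000 == 0:
        sigma = torch.linalg.svdvals(W2 @ W1)         # spectrum of the feature product
        print(f"step {t}: loss={loss.item():.4f}, "
              f"sigma_max/sigma_min={(sigma[0] / sigma[-1]).item():.1f}, srank={srank(sigma)}")
```

In this small-initialization regime, only a few singular directions of the feature product need to grow in order to fit the targets while the rest stay near their initial scale, so the gap between the largest and smallest singular values tends to widen over training, mirroring the multiplicative dynamics in Equation 5.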

Explaining implicit under-parameterization across fitting iterations. Since gradient descent within a fitting iteration has an implicit bias towards solutions with low effective rank, we can express the solution obtained at the end of a fitting iteration as (with a coefficient $\lambda > 0$ capturing the strength of this implicit bias)

$\Phi^{k+1} \;\in\; \arg\min_{\Phi}\; \sum_{(s_i, a_i) \in \mathcal{D}} \big(W_N\, \phi(s_i, a_i) - y^{k}(s_i, a_i)\big)^{2} \;+\; \lambda\, \text{srank}_\delta(\Phi) \qquad (6)$

This provides an abstraction to analyze the evolution of $\Phi$ across fitting iterations. Before considering the full bootstrapping setting, we first analyze the self-training setting, where the target values are equal to the previous Q-values, $y^{k} = Q^{k}$. Since just copying over the weights from iteration $k$ to $k+1$ attains zero TD error without increasing the effective rank, the optimal solution for Equation 6 must attain a smaller value of $\text{srank}_\delta(\Phi)$. This tradeoff between Bellman error and $\text{srank}_\delta(\Phi)$ is also observed in practice (c.f. Figure 3 (middle) and Figure 5). Applying this argument inductively implies that $\text{srank}_\delta(\Phi)$ decreases with increasing $k$, right from the start (i.e., $k = 0$) with self-training.

Case with bootstrapped updates in RL. In the RL setting, the target values are now given by $y^{k} = R + \gamma P^{\pi} Q^{k}$. Unlike the self-training setting, $y^{k}$ is not directly expressible as a function of the previous $\Phi^{k}$ due to the additional reward and dynamics transformations, and the rank of the solution to Equation 6 may increase initially. However, as the value function gets closer to the Bellman fixed point, the learning dynamics begins to resemble the self-training regime, since the target values approach the previous value iterate $Q^{k}$. As learning approaches this regime, the rank decrease phenomenon discussed above begins to occur and under-parameterization emerges. This insight is captured by our next result, which shows rank decrease when target values are sufficiently close to the previous value estimate (e.g., near convergence).

Theorem 4.2.

Suppose the target values are close to the previous value estimate $Q^{k}$, i.e., $\|y^{k} - Q^{k}\|_2 \le \varepsilon$. Then, there is a constant $\varepsilon_0 > 0$ such that whenever $\varepsilon \le \varepsilon_0$, the solution to Equation 6, $\Phi^{k+1}$, has lower effective rank than $\Phi^{k}$:

$\text{srank}_\delta\big(\Phi^{k+1}\big) \;\le\; \text{srank}_\delta\big(\Phi^{k}\big) \qquad (7)$

The constant $\varepsilon_0$ depends on the magnitudes of the target values and of $\Phi^{k}$, as well as on the gap between the singular values of $\Phi^{k}$; we provide the complete form in Appendix D.2. One important consequence of this theorem is that rank decrease occurs when the value function is near (but not at) the fixed point, since gradient descent will preferentially choose solutions with decreased effective rank, which can also increase the TD error. This is consistent with the empirical evidence in Figure 2, where rank collapse happens right after a peak in performance is achieved.

5 Mitigating Under-Parametrization Improves Deep Q-Learning

We now show that mitigating implicit under-parameterization by preventing rank collapse can improve performance. We place special emphasis on the offline RL setting in this section, since it is particularly vulnerable to the adverse effects of rank collapse.

(a)
(b)
Figure 9: (a): $\text{srank}_\delta(\Phi)$ (top) and performance (bottom) of FQI on gridworld in the offline setting with 200 gradient updates per fitting iteration. Note the reduction in rank collapse and the higher performance with the regularizer $L_p$. (b): $L_p$ mitigates the rank collapse in DQN and CQL in the offline RL setting on Atari.

We devise a penalty (or a regularizer), $L_p$, that encourages a higher effective rank of the learned features, $\Phi$, to prevent rank collapse. The effective rank function is non-differentiable, so we choose a simple surrogate that can be optimized over deep networks. Since effective rank is maximized when the magnitudes of the singular values are roughly balanced, one way to increase effective rank is to minimize the largest singular value of $\Phi$, $\sigma_{\max}(\Phi)$, while simultaneously maximizing the smallest singular value, $\sigma_{\min}(\Phi)$. We construct a simple penalty derived from this intuition, formally given by:

$L_p(\theta) \;=\; \sigma_{\max}^{2}\big(\Phi_\theta\big) \;-\; \sigma_{\min}^{2}\big(\Phi_\theta\big) \qquad (8)$

$L_p$ can be computed by invoking the singular value decomposition subroutines in standard automatic differentiation frameworks. We estimate the singular values over the feature matrix $\Phi$ computed on a minibatch, and add the resulting value of $L_p$ as a penalty to the TD error objective, with a tradeoff factor $\alpha$.
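Concretely, the penalty can be implemented in a few lines with a differentiable SVD; the sketch below assumes a PyTorch Q-network that exposes its penultimate-layer features, and the names (`penultimate_features`, `alpha`, etc.) are placeholders rather than part of any specific codebase.

```python
import torch

def rank_penalty(phi: torch.Tensor) -> torch.Tensor:
    """L_p surrogate of Equation 8 on a minibatch feature matrix phi of shape (batch, d):
    sigma_max^2 - sigma_min^2, computed with a differentiable SVD."""
    sigma = torch.linalg.svdvals(phi)          # singular values in descending order
    return sigma[0] ** 2 - sigma[-1] ** 2

# Usage inside a Q-learning update (illustrative names):
#   phi = q_network.penultimate_features(states)        # (batch, d) features
#   td_loss = ((q_values - targets) ** 2).mean()
#   loss = td_loss + alpha * rank_penalty(phi)
#   loss.backward()
```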

Does $L_p$ address rank collapse? We first verify whether controlling the minimum and maximum singular values using $L_p$ actually prevents rank collapse. When using this penalty on the gridworld problem (Figure 9(a)), the effective rank does not collapse, instead gradually decreasing at the onset and then plateauing, akin to the evolution of effective rank in supervised learning. In Figure 9(b), we plot the evolution of the effective rank values on two Atari games in the offline setting, Seaquest and Breakout (all games in Appendix A.5), and observe that adding the penalty, $L_p$, also generally leads to increasing ranks.

Does mitigating rank collapse improve performance? We now evaluate the performance of the penalty using DQN (Mnih et al., 2015) and CQL (Kumar et al., 2020b) on the Atari dataset from Agarwal et al. (2020) (5% replay data) used in Section 3. Figure 10 summarizes the relative improvement from using the penalty on 16 Atari games. Adding the penalty to DQN improves performance on 16/16 games with a median improvement of 74.5%; adding it to CQL, a state-of-the-art offline algorithm, improves performance on 11/16 games with a median improvement of 14.1%. Prior work has discussed that standard Q-learning methods designed for the online setting, such as DQN, are generally ineffective with small offline datasets (Kumar et al., 2020b; Agarwal et al., 2020). Our results show that mitigating rank collapse makes even such simple methods substantially more effective in this setting, suggesting that rank collapse and the resulting implicit under-parameterization may be a crucial piece of the puzzle in explaining the challenges of offline RL.

Figure 10: DQN and CQL with the penalty $L_p$ vs. their standard counterparts in the 5% offline setting on Atari from Section 3. $L_p$ improves DQN on 16/16 and CQL on 11/16 games.

We also evaluated the regularizer in the data-efficient online RL setting, with results in Appendix A.6. This variant achieved a median improvement of 20.6% in performance with Rainbow (Hessel et al., 2018); however, it performed poorly with DQN, where it reduced median performance by 11.5%. Thus, while our proposed penalty is effective in many cases spanning both the offline and data-efficient online settings, it does not solve the problem fully, and a more sophisticated solution may better prevent the issues with implicit under-parameterization. Nevertheless, our results suggest that mitigating implicit under-parameterization can improve the performance of data-efficient RL.

6 Related Work

Prior work has extensively studied the learning dynamics of Q-learning with tabular and linear function approximation, to study error propagation (Munos, 2003; Farahmand et al., 2010; Chen and Jiang, 2019) and to prevent divergence (De Farias, 2002; Maei et al., 2009; Sutton et al., 2009; Dai et al., 2018), as opposed to the deep Q-learning analyzed in this work. Q-learning has been shown to have favorable optimization properties with certain classes of features (Ghosh and Bellemare, 2020), but our work shows that the features learned by a neural net when minimizing TD error do not enjoy such guarantees, and instead suffer from rank collapse. Recent theoretical analyses of deep Q-learning have shown convergence under restrictive assumptions (Yang et al., 2020; Cai et al., 2019; Zhang et al., 2020; Xu and Gu, 2019), but Theorem 4.2 shows that implicit under-parameterization appears when the estimates of the value function approach the optimum, potentially preventing convergence. Several works, including Chen and Jiang (2019); Du et al. (2019, 2020); Xie and Jiang (2020), attempted to address the complexity of fitted Q-iteration with restricted function classes (and not neural nets); however, our work is distinct in that it does not aim to provide complexity guarantees and is focused more on understanding the learning dynamics of deep Q-learning. Xu et al. (2005, 2007) present variants of LSTD (Boyan, 1999; Lagoudakis and Parr, 2003), which model the Q-function as a kernel machine but do not take into account the regularization from gradient descent, as done in Equation 1, which is essential for implicit under-parameterization.

Igl et al. (2020); Fedus et al. (2020a) argue that non-stationarity arising from distribution shift hinders generalization and recommend periodic network re-initialization. Under-parameterization is not caused by this distribution shift, and we find that network re-initialization does little to prevent rank collapse (Figure 7). Luo et al. (2020) propose a regularization similar to ours, but in a different setting, finding that more expressive features increase the performance of on-policy RL methods. Finally, Yang et al. (2019) study the effective rank of the Q-values when expressed as a matrix in online RL and find that low ranks for this Q-matrix are preferable. We analyze a fundamentally different object: the learned features (and illustrate that a rank collapse of features can hurt), not the Q-matrix, whose rank is upper-bounded by the number of actions (e.g., at most 18 for Atari).

7 Discussion

We identified an implicit under-parameterization phenomenon in deep RL algorithms that use bootstrapping, where gradient-based optimization of a bootstrapped objective can lead to a reduction in the expressive power of the value network. This effect manifests as a collapse of the rank of the features learned by the value network, causing aliasing across states and often leading to poor performance. Our analysis reveals that this phenomenon is caused by the implicit regularization due to gradient descent on bootstrapped objectives. We observed that mitigating this problem by means of a simple regularization scheme improves performance of deep Q-learning methods.

While our proposed regularization provides some improvement, devising better mitigation strategies for implicit under-parameterization remains an exciting direction for future work. Our method explicitly attempts to prevent rank collapse, but relies on the emergence of useful features solely through the bootstrapped signal. An alternative path may be to develop new auxiliary losses  (e.g.,  Jaderberg et al., 2016) that learn useful features while passively preventing underparameterization. More broadly, understanding the effects of neural nets and associated factors such as initialization, choice of optimizer, etc. on the learning dynamics of deep RL algorithms, using tools from deep learning theory, is likely to be key towards developing robust and data-efficient deep RL algorithms.

Acknowledgements

We thank Lihong Li, Aaron Courville, Aurick Zhou, Abhishek Gupta, George Tucker, Ofir Nachum, Wesley Chung, Emmanuel Bengio, Zafarali Ahmed, and Jacob Buckman for feedback on an earlier version of this paper. We thank Hossein Mobahi for insightful discussions about self-distillation and Hanie Sedghi for insightful discussions about implicit regularization and generalization in deep networks. We additionally thank Michael Janner, Aaron Courville, Dale Schuurmans and Marc Bellemare for helpful discussions. AK was partly funded by the DARPA Assured Autonomy program, and DG was supported by a NSF graduate fellowship and compute support from Amazon.

Appendices

Appendix A Additional Evidence for Implicit Under-Parameterization

In this section, we present additional evidence that demonstrates the existence of the implicit under-parameterization phenomenon from Section 3. In all cases, we plot the values of $\text{srank}_\delta(\Phi)$ computed on a batch of 2048 transitions sampled i.i.d. from the dataset.

a.1 Offline RL

Figure A.1: Offline DQN on Atari. $\text{srank}_\delta(\Phi)$ and performance of DQN on five Atari games in the offline RL setting using the 5% DQN Replay dataset (Agarwal et al., 2020) (marked as DQN) and a larger 20% DQN Replay dataset (Agarwal et al., 2020) (marked as DQN (4x data)). Note that low srank (top row) generally corresponds to worse policy performance (bottom row). The average across 5 runs is shown for each game along with individual runs.
Figure A.2: Offline DQN on Atari. $\text{srank}_\delta(\Phi)$ and performance of DQN on five Atari games in the offline RL setting using the 20% DQN Replay dataset (Agarwal et al., 2020) (marked as DQN) trained for 1000 iterations. Note that low srank (top row) generally corresponds to worse policy performance (bottom row). The average across 5 runs is shown for each game along with individual runs.
Figure A.3: Offline CQL on Atari. $\text{srank}_\delta(\Phi)$ and performance of CQL on five Atari games (average across 5 runs) in the offline RL setting using the 5% DQN Replay dataset. Rank values degrade significantly with prolonged training, and this corresponds to a sharp drop in performance. Note that lower rank values than DQN are likely because CQL trains Q-functions with an additional regularizer.
Figure A.4: Offline Control on MuJoCo. $\text{srank}_\delta(\Phi)$ and performance of SAC on three Gym environments in the offline RL setting. Implicit under-parameterization is conspicuous from the rank reduction, which highly correlates with performance degradation. We use 20% uniformly sampled data from the entire replay experience of an online SAC agent, similar to the 20% setting from Agarwal et al. (2020).
Figure A.5: Offline Control on MuJoCo. $\text{srank}_\delta(\Phi)$ and performance of CQL on three Gym environments in the offline RL setting. Implicit under-parameterization is conspicuous from the rank reduction, which highly correlates with performance degradation. We use 20% uniformly sampled data from the entire replay experience of an online SAC agent, similar to the 20% setting from Agarwal et al. (2020).

a.2 Data Efficient Online RL

In the data-efficient online RL setting, we verify the presence of implicit under-parameterization for both DQN and Rainbow (Hessel et al., 2018) when a larger number of gradient updates is made per environment step. In these settings, we find that more gradient updates per environment step lead to a larger decrease in effective rank, whereas the effective rank can increase when the amount of data re-use is reduced by taking fewer gradient steps.

Figure A.6: Online DQN on Atari. $\text{srank}_\delta(\Phi)$ and performance of DQN on 5 Atari games in the online RL setting, with varying numbers of gradient steps per environment step ($n$). Rank collapse happens earlier with more gradient steps, and the corresponding performance is poor. This indicates that implicit under-parameterization is aggravated as the rate of data re-use is increased.
Figure A.7: Online SAC on MuJoCo. $\text{srank}_\delta(\Phi)$ and performance of SAC on Gym environments in the online RL setting, with varying numbers of gradient steps per environment step ($n$). While in the simpler environments, HalfCheetah-v2, Hopper-v2 and Walker2d-v2, we actually observe an increase in the values of effective rank, which also corresponds to good performance with large $n$ in these cases, on the more complex Ant-v2 environment the rank decreases with larger $n$, and the corresponding performance is worse with more gradient updates.
Figure A.8: Online Rainbow on Atari. Rainbow on 16 Atari games in the data-efficient online setting, with varying numbers of gradient steps per environment step ($n$). Rank collapse happens earlier with more gradient steps, and the corresponding performance is poor. This plot indicates that multi-step returns, prioritized replay, and distributional C51 do not address the implicit under-parameterization issue.
Figure A.9: Data-Efficient Rainbow on Atari. Data-efficient Rainbow on 5 Atari games in the online setting. The horizontal line shows the performance of the Rainbow agent at the end of 100 iterations, where each iteration is 250K environment steps. Data-efficient Rainbow performs poorly compared to online Rainbow.

a.3 Does Bootstrapping Cause Implicit Under-Parameterization?

In this section, we provide additional evidence to support our claim from Section 3 that bootstrapping-based updates are a key component behind the existence of implicit under-parameterization. To do so, we demonstrate the following points empirically:

  • Implicit under-parameterization occurs even when the form of the bootstrapping update is changed from Q-learning, which utilizes a maximization-based backup operator, to a policy evaluation (fitted Q-evaluation) backup operator, which computes an expectation of the target Q-values under the distribution specified by a different policy. Thus, with different bootstrapped updates, the phenomenon still appears.

    Figure A.10: Offline Policy Evaluation on Atari. and performance of offline policy evaluation (FQE) on 5 Atari games in the offline RL setting using the 5% and 20% (marked as “large”) DQN Replay dataset (Agarwal et al., 2020). The rank degradation shows that under-parameterization is not specific to the Bellman optimality operator but happens even when other bootstrapping-based backup operators are combined with gradient descent. Furthermore, the rank degradation also happens when we increase the dataset size.
  • Implicit under-parameterization does not occur with Monte-Carlo regression targets, which compute regression targets for the Q-function via a non-parametric estimate of the future trajectory return and do not use bootstrapping. In this setting, we find that the values of effective rank actually increase over time and stabilize, unlike the corresponding case with bootstrapped updates. Thus, with all other factors kept identical, implicit under-parameterization happens only when bootstrapped updates are used.

    Figure A.11: Monte Carlo Offline Policy Evaluation. $\text{srank}_\delta(\Phi)$ on 5 Atari games when using Monte Carlo returns for targets and thus removing bootstrapped updates. Rank collapse does not happen in this setting, implying that bootstrapping is essential for under-parameterization. We perform the experiments using the 5% and 20% (marked as “large” in the figure) DQN replay datasets from Agarwal et al. (2020).
    Figure A.12: FQE vs. Monte Carlo Offline Policy Evaluation. Trend of $\text{srank}_\delta(\Phi)$ for policy evaluation based on bootstrapped updates (FQE) vs. Monte-Carlo returns (no bootstrapping). Note that rank collapse still persists with reinitialization and FQE, but goes away in the absence of bootstrapping. We perform the experiments using the DQN replay dataset from Agarwal et al. (2020).

a.4 How Does Implicit Regularization Inhibit Data-Efficient RL?

Figure A.13: $Q^*$ fitting error and srank in a one-hot variant of the gridworld environment.

Implicit under-parameterization leads to a trade-off between minimizing the TD error and encouraging low-rank features, as shown in Figure 6. This trade-off often results in a decrease in effective rank at the expense of an increase in TD error, resulting in lower performance. Here we present additional evidence to support this.

Figure A.13 shows a gridworld problem with one-hot features, which naturally leads to reduced state-aliasing. In this setting, we find that the amount of rank drop with respect to the supervised projection of oracle-computed $Q^*$ values is quite small, and that the regression error to $Q^*$ actually decreases, unlike the case in Figure 5, where it remains the same or even increases. The method is also able to learn policies that attain good performance. Hence, when there is very little rank drop (for example, 5 rank units in this example), FQI methods are generally able to learn features $\Phi$ that can fit $Q^*$. This provides evidence that typical Q-networks learn features that can fit the optimal Q-function when rank collapse does not occur.

In Atari, we do not have access to $Q^*$, and so we instead measure the error in fitting the target value estimates, i.e., the TD error. As the rank decreases, the TD error increases (Figure A.14) and the value function is unable to fit the target values, culminating in a performance plateau (Figure A.6).

Figure A.14: TD error vs. Effective rank on Atari. We observe that Huber-loss TD error is often higher when there is a larger implicit under-parameterization, measured in terms of drop in effective rank. The results are shown for the data-efficient online RL setting.

a.5 Trends in Values of Effective Rank With Penalty.

In this section, we present the trend in the values of the effective rank when the penalty $L_p$ is added. In each plot below, we show the value of $\text{srank}_\delta(\Phi)$ with and without the penalty, respectively.

Offline RL: DQN
Figure A.15: Effective rank values with the penalty on DQN. Trends in effective rank and performance for offline DQN. Note that the performance of DQN with the penalty is generally better than DQN and that the penalty (blue) is effective in increasing the values of effective rank. We report performance at the end of 100 epochs, as per the protocol set by Agarwal et al. (2020), in Figure 10.
Figure A.16: Effective rank values with the penalty on DQN on a 4x larger dataset. Trends in effective rank and performance for offline DQN with a 4x larger dataset, where distribution shift effects are generally removed. Note that the performance of DQN with the penalty is generally better than DQN and that the penalty (blue) is effective in increasing the values of effective rank in most cases. In fact, in Pong, where the penalty is not effective in increasing rank, we observe suboptimal performance (blue vs. red).
Offline RL: CQL With Penalty
Figure A.17: Effective rank values with the penalty on CQL. Trends in effective rank and performance for offline CQL. Note that the performance of CQL with the penalty is generally better than vanilla CQL and that the penalty (blue) is effective in increasing the values of effective rank. We report performance at the end of 100 epochs, as per the protocol set by Agarwal et al. (2020), in Figure 10.
Offline RL: Performance improvement With Penalty
(a) Offline DQN with .
(b) Offline CQL with .
Figure A.18: Performance improvement of (a) offline DQN and (b) offline CQL with the penalty on 20% Atari dataset, i.e., the dataset referred to as 4x large in Figure 2.

a.6 Data-Efficient Online RL: Rainbow

Rainbow With Penalty: Rank Plots
Figure A.19: Effective rank values with the penalty on Rainbow in the data-efficient online RL setting. Trends in effective rank and performance for online Rainbow. Note that the performance of Rainbow with the penalty is generally better than Rainbow and that the penalty (blue) is effective in increasing the values of effective rank in most cases. In fact, in Pong, where the penalty is not effective in increasing rank, we observe suboptimal performance (blue vs. red).
Rainbow With Penalty: Performance
Figure A.20: Performance of Rainbow and Rainbow with the penalty (Equation 8). Note that the penalty improves on the base Rainbow in 12/16 games.

In this section, we present additional results supporting the hypothesis that preventing rank collapse leads to better performance. In the first set of experiments, we apply the proposed penalty to Rainbow in the data-efficient online RL setting. In the second set of experiments, we present evidence for the prevention of rank collapse by comparing rank values across different runs.

As shown in Appendix A.2, the state-of-the-art Rainbow (Hessel et al., 2018) algorithm also suffers from rank collapse in the data-efficient online RL setting when more gradient updates are performed per environment step. In this section, we applied our penalty to Rainbow in this setting and obtained a median 20.66% improvement on top of the base method. This result is summarized below.

Figure A.21: Learning curves with more gradient updates per environment step for Rainbow (blue) and Rainbow with the penalty (Equation 8) (red) on 16 games, corresponding to the bar plot above. One unit on the x-axis is equivalent to 1M environment steps.

Appendix B Hyperparameters & Experiment Details

b.1 Atari Experiments

We follow the experiment protocol from Agarwal et al. (2020) for all our experiments, including the hyperparameters and agent architectures provided in Dopamine, and report them for completeness and ease of reproducibility in Table B.1. We only perform hyperparameter selection for the regularization experiments based on results from 5 Atari games (Asterix, Qbert, Pong, Breakout and Seaquest). We will also open source our code to further aid in reproducing our results.

Hyperparameter Setting (for both variations)
Sticky actions Yes
Sticky action probability 0.25
Grey-scaling True
Observation down-sampling (84, 84)
Frames stacked 4
Frame skip (Action repetitions) 4
Reward clipping [-1, 1]
Terminal condition Game Over
Max frames per episode 108K
Discount factor 0.99
Mini-batch size 32
Target network update period every 2000 updates
Training environment steps per iteration 250K
Update period every 4 environment steps
Evaluation $\epsilon$ 0.001
Evaluation steps per iteration 125K
Q-network: channels 32, 64, 64
Q-network: filter size 8×8, 4×4, 3×3
Q-network: stride 4, 2, 1
Q-network: hidden units 512
Hardware Tesla P100 GPU
Hyperparameter Online Offline
Min replay size for sampling 20,000 -
Training $\epsilon$ (for $\epsilon$-greedy exploration) 0.01 -
$\epsilon$-decay schedule 250K steps -
Fixed Replay Memory No Yes
Replay Memory size (Online) 1,000,000 steps
Fixed Replay size (5%) 2,500,000 steps
Fixed Replay size (20%) 10,000,000 steps
Replay Scheme Uniform Uniform
Training Iterations 200 500
Table B.1: Hyperparameters used by the offline and online RL agents in our experiments.

Evaluation Protocol. Following Agarwal et al. (2020), the Atari environments used in our experiments are stochastic due to sticky actions, i.e., there is a 25% chance at every time step that the environment will execute the agent’s previous action again, instead of the agent’s new action. All agents (online or offline) are compared using the best evaluation score (averaged over 5 runs) achieved during training, where evaluation is done online every training iteration using an $\epsilon$-greedy policy with $\epsilon = 0.001$. We report offline training results with the same hyperparameters over 5 random seeds of the DQN replay data collection, game simulator and network initialization.

Offline Dataset. As suggested by Agarwal et al. (2020), we randomly subsample the DQN Replay dataset containing 50 million transitions to create smaller offline datasets with the same data distribution as the original dataset. We use the 5% DQN replay dataset for most of our experiments. We also report results using the 20% dataset setting (4x larger) to show that our claims hold even when we have higher coverage over the state space.

Optimizer related hyperparameters. For existing off-policy agents, step size and optimizer were taken as published. We used the DQN (Adam) algorithm for all our experiments, given its superior performance over the DQN (Nature) which uses RMSProp, as reported by Agarwal et al. (2020).

Rainbow agent. Our empirical investigations in this paper are based on the Dopamine Rainbow agent (Castro et al., 2018). This is an open source implementation of the original agent (Hessel et al., 2018), but makes several simplifying design choices. The original agent augments DQN through the use of (a) a distributional learning objective, (b) multi-step returns, (c) the Adam optimizer, (d) prioritized replay, (e) double Q-learning, (f) duelling architecture, and (g) noisy networks for exploration. The Dopamine Rainbow agent uses just the first four of these adjustments, which were identified as the most important aspects of the agent in the original analysis of Hessel et al. (2018). For data efficient Rainbow, we incorporate the main changes suggested by van Hasselt et al. (2019) including the change of architecture, larger number of updates per environment step, and use of multi-step returns.

Atari 2600 games used. For all our experiments in Section 3, we used the same set of 5 games as utilized by Agarwal et al. (2020); Bellemare et al. (2017) to present analytical results. For our empirical evaluation in Appendix A.5, we use the set of games employed by Fedus et al. (2020b) which are deemed suitable for offline RL by Gulcehre et al. (2020). Similar in spirit to Gulcehre et al. (2020), we use the set of 5 games used for analysis for hyperparameter tuning for offline RL methods.

5 games subset: Asterix, Qbert, Pong, Seaquest, Breakout

16 game subset: In addition to 5 games above, the following 11 games: Gravitar, James Bond, Ms. Pacman, Space Invaders, Zaxxon, Wizard of Wor, Yars’ Revenge, Enduro, Road Runner, BeamRider, Demon Attack

b.2 Gridworld Experiments

We use the gridworld suite from Fu et al. (2019) to obtain gridworlds for our experiments. All of our gridworld results are computed using the Grid16smoothobs environment, which consists of a 256-cell grid, with walls arising randomly with a probability of 0.2. Each state allows 5 different actions (subject to hitting the boundary of the grid): move left, move right, move up, move down, and no-op. The goal in this environment is to minimize the cumulative discounted distance to a fixed goal location, with discount factor $\gamma$. The features for the Q-function are given by randomly chosen vectors which are smoothed spatially in a local neighborhood of each grid cell.

We use a deep Q-network with two hidden layers, and train it using soft Q-learning with an entropy coefficient of 0.1, following the code provided by the authors of Fu et al. (2019). We use a first-in-first-out replay buffer of size 10000 to store past transitions.

Appendix C Proofs for Section 4.1

In this section, we provide the technical proofs from Section 4.1. We first derive a solution to the optimization problem in Equation 1 and show that it indeed has the form described in Equation 3. We first introduce some notation, including the definition of the kernel used in this proof. The proof closely follows that of Mobahi et al. (2020).

Definitions.

For any universal kernel $u$, the Green's function (Duffy, 2015) of the linear kernel operator $L$, given by $[L\,Q](x) := \int u(x, x')\, Q(x')\, \mathrm{d}x'$, is the function $g(x, x')$ that satisfies:

$\int u(x, x')\, g(x', x'')\, \mathrm{d}x' \;=\; \delta(x - x'') \qquad \text{(C.1)}$

where $\delta(\cdot)$ is the Dirac-delta function. Thus, the Green's function can be understood as a kernel that “inverts” the universal kernel to the identity (Dirac-delta) function. We can then define the matrix $G$ as the matrix of the vectors $\mathbf{g}_{(s_i, a_i)}$ evaluated on the training dataset $\mathcal{D}$; however, note that the function $g$ can be evaluated at other state-action tuples not present in $\mathcal{D}$.

$G_{ij} := g\big((s_i, a_i), (s_j, a_j)\big), \qquad (s_i, a_i), (s_j, a_j) \in \mathcal{D} \qquad \text{(C.2)}$
Lemma C.0.1.

The solution to Equation 1 is given by Equation 3.

Proof.

This proof closely follows the proof of Proposition 1 from Mobahi et al. (2020). We revisit the key parts of that proof here.

We restate the optimization problem below, and solve for the optimum to this equation by applying the functional derivative principle.

The functional derivative principle states that the optimal solution $Q^*$ to this problem must satisfy, for any other function $h$ and for a small enough $\varepsilon$,

(C.3)

By setting the derivative of the above expression with respect to $\varepsilon$ to $0$, we obtain the following stationarity conditions on the optimal solution (denoted $Q^*$ for brevity):

(C.4)

Now, invoking the definition of the Green's function discussed above and utilizing the fact that the Dirac-delta function can be expressed in terms of the Green's function, we obtain a simplified version of the above relation:

(C.5)

Since the kernel is universal and positive definite, the optimal solution is given by:

(C.6)

Finally, we can replace the expression for the residual error, using the Green's kernel on the training data and solving for it in closed form, which gives us the solution in Equation 3.

(C.7)

Next, we state and prove a slightly stronger version of Theorem 4.1 that immediately implies the original theorem.

Theorem C.1.

Let $M_k$ be a shorthand for $\big[\sum_{i=0}^{k} (\gamma S P^{\pi})^{i}\big] S$ and assume $A := \gamma\, S\, P^{\pi}$ is a normal matrix. Then there exists an infinite, strictly increasing sequence of fitting iterations $(k_l)_{l \ge 1}$, starting from some iteration $k_1$, such that, for any two singular values $\sigma_i$ and $\sigma_j$ of $M_{k_l}$ with $\sigma_i < \sigma_j$,

(C.8)

Therefore, the effective rank of $M_{k_l}$ satisfies: $\text{srank}_\delta(M_{k_{l+1}}) \le \text{srank}_\delta(M_{k_l})$. Furthermore,

(C.9)

Therefore, the effective rank of $M_k$, $\text{srank}_\delta(M_k)$, outside the chosen subsequence is also controlled from above by the effective rank on the subsequence $(k_l)_{l \ge 1}$.

To prove this theorem, we first show that for any two fitting iterations $k_1 < k_2$, if the corresponding powers of $A$ are positive semi-definite, the ratio of singular values and the effective rank decrease from $k_1$ to $k_2$. As an immediate consequence, this shows that when $A$ is positive semi-definite, the effective rank decreases at every iteration, i.e., by setting $k_2 = k_1 + 1$ (Corollary C.1.1).

To extend the proof to arbitrary normal matrices, we show that for any approximation accuracy, a subsequence of fitting iterations can be chosen such that the corresponding power of $A$ is (approximately) positive semi-definite. For this subsequence of fitting iterations, the ratio of singular values and the effective rank also decrease. Finally, to control the ratio and effective rank on fitting iterations outside this subsequence, we construct an upper bound on the ratio of singular values and relate this bound to the ratio of singular values on the chosen subsequence.

Lemma C.1.1 ($\text{srank}_\delta$ decreases when powers of $A$ are PSD.).

Let $M_k$ be a shorthand for $\big[\sum_{i=0}^{k} (\gamma S P^{\pi})^{i}\big] S$ and assume $A := \gamma\, S\, P^{\pi}$ is a normal matrix. Choose any $k_1, k_2$ such that $k_1 < k_2$. If the corresponding powers of $A$ are positive semi-definite, then for any two singular values $\sigma_i$ and $\sigma_j$ of $M_{k_1}$ such that $\sigma_i < \sigma_j$:

(C.10)

Hence, the effective rank of $M_k$ decreases from iteration $k_1$ to $k_2$: $\text{srank}_\delta(M_{k_2}) \le \text{srank}_\delta(M_{k_1})$.

Proof.

First, note that $M_k$ is given by:

(C.11)

From here on, we omit the leading term since it is a constant scaling factor that does not affect the ratio of singular values or the effective rank. Since $A$ is normal, it admits a complex orthogonal eigendecomposition, where the eigenvalues and singular values are related as $\sigma_i(A) = |\lambda_i(A)|$. Thus, we can write $A = U \Lambda U^{*}$, and any power of $A$, i.e., $A^{t}$, can be expressed as $A^{t} = U \Lambda^{t} U^{*}$; hence, we can express $M_k$ as:

(C.12)

Then, the singular values of $M_k$ can be expressed as

(C.13)

When $A$ is positive semi-definite, its eigenvalues are real and non-negative, i.e., $\lambda_i(A) = \sigma_i(A) \ge 0$, enabling the following simplification:

(C.14)

To show that the ratio of singular values decreases from to , we need to show that is an increasing function of when . It can be seen that this is the case, which implies the desired result.

To further show that , we can simply show that , increases with , and this would imply that the cannot increase from to . We can decompose as:

(C.15)

Since decreases over time for all if , the ratio in the denominator of decreases with increasing implying that increases from to . ∎

Corollary C.1.1 ($\text{srank}_\delta$ decreases for PSD matrices.).

Let $M_k$ be a shorthand for $\big[\sum_{i=0}^{k} (\gamma S P^{\pi})^{i}\big] S$. Assuming that $A := \gamma\, S\, P^{\pi}$ is positive semi-definite, for any $k_1, k_2$ such that $k_1 < k_2$, and for any two singular values $\sigma_i$ and $\sigma_j$ of $M_{k_1}$ such that $\sigma_i < \sigma_j$,

(C.16)

Hence, the effective rank of $M_k$ decreases with more fitting iterations: $\text{srank}_\delta(M_{k_2}) \le \text{srank}_\delta(M_{k_1})$.

In order to now extend the result to arbitrary normal matrices, we must construct a subsequence of fitting iterations where the relevant power of $A$ is (approximately) positive semi-definite. To do so, we first prove a technical lemma that shows that rational numbers, i.e., numbers that can be expressed as $\frac{p}{q}$ for integers $p$ and $q$, are “dense” in the space of real numbers.

Lemma C.1.2 (Rational numbers are dense in the real space.).

For any real number $\alpha$, there exist infinitely many rational numbers $\frac{p}{q}$ such that $\alpha$ can be approximated by $\frac{p}{q}$ up to accuracy $\frac{1}{q^2}$:

$\left|\alpha - \dfrac{p}{q}\right| \;<\; \dfrac{1}{q^2} \qquad \text{(C.17)}$
Proof.

We first use Dirichlet’s approximation theorem (see Hlawka et al. (1991) for a proof of this result using a pigeonhole argument and extensions), which states that for any real number $\alpha$ and any natural number $N$, there exist integers $p$ and $q$ with $1 \le q \le N$ such that

$\left| q\,\alpha - p \right| \;<\; \dfrac{1}{N} \qquad \text{(C.18)}$

Now, dividing both sides by $q$ and using $q \le N$, we obtain:

$\left|\alpha - \dfrac{p}{q}\right| \;<\; \dfrac{1}{qN} \;\le\; \dfrac{1}{q^2} \qquad \text{(C.19)}$

To obtain infinitely many choices for , we observe that Dirichlet’s lemma is valid only for all values of that satisfy . Thus if we choose an such that where is defined as:

(C.20)

Equation C.20 essentially finds a new value of $N$ such that the current choices of $p$ and $q$, which were valid for the first value of $N$, no longer satisfy the approximation error bound. Applying Dirichlet’s lemma to this new value of $N$ hence gives us a new pair $(p, q)$ that satisfies the approximation error bound. Repeating this process gives us countably many pairs $(p, q)$ that satisfy the approximation error bound. As a result, rational numbers are dense in the space of real numbers, since for any arbitrarily chosen approximation accuracy, we can obtain at least one rational number that is closer to $\alpha$ than that accuracy. This proof is based on Johnson (2016). ∎

Now we utilize Lemmas C.1.1 and C.1.2 to prove Proposition 4.1.

Proof of Proposition 4.1 and Theorem c.1

Recall from the proof of Lemma C.1.1 that the singular values of $M_k$ are given by:

(C.21)

Bound on the Singular Value Ratio: The ratio between $\sigma_i(M_k)$ and $\sigma_j(M_k)$ can be expressed as