Quantile QT-Opt for Risk-Aware Vision-Based Robotic Grasping
The distributional perspective on reinforcement learning (RL) has given rise to a series of successful Q-learning algorithms, resulting in state-of-the-art performance in arcade game environments. However, it has not yet been analyzed how these findings from a discrete setting translate to complex practical applications characterized by noisy, high dimensional and continuous state-action spaces. In this work, we propose Quantile QT-Opt (Q2-Opt), a distributional variant of the recently introduced distributed Q-learning algorithm [kalashnikov2018scalable] for continuous domains, and examine its behaviour in a series of simulated and real vision-based robotic grasping tasks. The absence of an actor in Q2-Opt allows us to directly draw a parallel to the previous discrete experiments in the literature without the additional complexities induced by an actor-critic architecture. We demonstrate that Q2-Opt achieves a superior vision-based object grasping success rate, while also being more sample efficient. The distributional formulation also allows us to experiment with various risk-distortion metrics that give us an indication of how robots can concretely manage risk in practice using a Deep RL control policy. As an additional contribution, we perform experiments on offline datasets and compare them with the latest findings from discrete settings. Surprisingly, we find that there is a discrepancy between our results and the previous batch RL findings from the literature obtained on arcade game environments.
The new distributional perspective on RL has produced a novel class of Deep Q-learning methods that learn a distribution over the state-action returns, instead of using the expectation given by the traditional value function. These methods, which obtained state-of-the-art results in the arcade game environments [bellemare2017distributional, dabney2018distributional, dabney2018implicit], present several attractive properties.
First, their ability to preserve the multi-modality of the action values naturally accounts for learning from a non-stationary policy, most often deployed in a highly stochastic environment. This ultimately results in a more stable training process and improved performance and sample efficiency. Second, they enable the use of risk-sensitive policies that no longer select actions based on the expected value, but take entire distributions into account. These policies can represent a continuum of risk management strategies ranging from risk-averse to risk-seeking by optimizing for a broader class of risk metrics.
Despite the improvements distributional Q-learning algorithms demonstrated in the discrete arcade environments, it is yet to be examined how these findings translate to practical, real-world applications. Intuitively, the advantageous properties of distributional Q-learning approaches should be particularly beneficial in a robotic setting. The value distributions can have a significant qualitative impact in robotic tasks, usually characterized by highly-stochastic and continuous state-action spaces. Additionally, performing safe control in the face of uncertainty is one of the biggest impediments to deploying robots in the real world, an impediment that RL methods have not yet tackled. In contrast, a distributional approach can allow robots to learn an RL policy that appropriately quantifies risks for the task of interest.
However, given the brittle nature of deep RL algorithms and their often counter-intuitive behaviour [Henderson2017DeepRL], it is not entirely clear if these intuitions would hold in practice. Therefore, we believe that an empirical analysis of distributional Q-learning algorithms in real robotic applications would shed light on their benefits and scalability, and provide essential insight for the robot learning community.
In this paper, we aim to address this need and perform a thorough analysis of distributional Q-learning algorithms in simulated and real vision-based robotic manipulation tasks. To this end, we propose a distributional enhancement of QT-Opt [kalashnikov2018scalable] dubbed Quantile QT-Opt (Q2-Opt). The choice of QT-Opt, a recently introduced distributed Q-learning algorithm that operates on continuous action spaces, is dictated by its demonstrated applicability to large-scale vision-based robotic experiments. In addition, by being an actor-free generalization of Q-learning in continuous action spaces, QT-Opt enables a direct comparison to the previous results on the arcade environments without the additional complexities and compounding effects of an actor-critic-type architecture.
In particular, we introduce two versions of Q2-Opt, based on Quantile Regression DQN (QR-DQN) [dabney2018distributional] and Implicit Quantile Networks (IQN) [dabney2018implicit]. The two methods are evaluated on a vision-based grasping task in simulation and the real world. We show that these distributional algorithms achieve state-of-the-art grasping success rate in both settings, while also being more sample efficient. Furthermore, we experiment with a multitude of risk metrics, ranging from risk-seeking to risk-averse, and show that risk-averse policies can bring significant performance improvements. We also report on the interesting qualitative changes that different risk metrics induce in the robots’ grasping behaviour. As an additional contribution, we analyze our distributional methods in a batch RL scenario and compare our findings with an equivalent experiment from the arcade environments [Agarwal2019StrivingFS].
II Related Work
Deep learning has been shown to be a useful tool for learning visuomotor policies that operate directly on raw images. Examples include various manipulation tasks, where related approaches use either supervised learning to predict the probability of a successful grasp [mahler2017dex, levine2016end, Pinto2015SupersizingSL] or learn a reinforcement-learning control policy [levine2018learning, quillen2018deep, kalashnikov2018scalable].
Distributional Q-Learning algorithms have been so far a separate line of research, mainly evaluated on game environments. These algorithms replace the expected return of an action with a distribution over the returns and mainly vary by the way they parametrize this distribution. Bellemare et al. [bellemare2017distributional] express it as a categorical distribution over a fixed set of equidistant points. Their algorithm, C51, minimizes the KL-divergence to the projected distributional Bellman target. A follow-up algorithm, QR-DQN [dabney2018distributional], approximates the distribution by learning the outputs of the quantile function at a fixed set of points, the quantile midpoints. This latter approach has been extended by IQN [dabney2018implicit], which reparametrized the critic network to take as input any probability and learn the quantile function itself. Besides this extension, their paper also analyses various risk-sensitive policies that the distributional formulation enables. In this work, we apply these advancements to a challenging vision-based real-world grasping task with a continuous action-space.
Closest to our work is D4PG [barth2018distributed], a distributed and distributional version of DDPG [Lillicrap2015ContinuousCW] that achieves superior performance to the non-distributional version in a series of simulated continuous-control environments. In contrast to this work, we analyze a different variant of Q-learning with continuous action spaces, which allows us to focus on actor-free settings that are similar to the previous distributional Q-learning algorithms. Besides, we demonstrate our results on real robots on a challenging vision-based grasping task.
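In this paper we consider a standard Markov Decision Process [Puterman1994MarkovDP] formulation $(\mathcal{S}, \mathcal{A}, r, p, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $r(s, a)$ is a deterministic reward function, $p(\cdot \mid s, a)$ is the transition function and $\gamma \in (0, 1)$ is the discount factor.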
As previously stated, we build our method on top of QT-Opt [kalashnikov2018scalable], a distributed Q-learning algorithm suitable for continuous action spaces. The algorithm trains a parameterized state-action value function $Q_\theta(s, a)$ which is represented by a neural network with parameters $\theta$. The cross-entropy method (CEM) [Rubinstein2004TheCM] is used to iteratively optimize and select the best action for a given Q-function:
$$\pi_\theta(s) = \arg\max_a Q_\theta(s, a).$$
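The CEM-based action selection described above can be sketched as follows. This is a minimal illustration, assuming a one-dimensional action in [-1, 1] and a toy Q-function; `cem_argmax`, `toy_q`, the population size, the elite count and the number of iterations are all hypothetical choices, not the settings used by QT-Opt.

```python
import random
import statistics


def cem_argmax(q_fn, state, iters=3, pop=64, n_elite=6, seed=0):
    """Approximate argmax_a q_fn(state, a) over a in [-1, 1] with CEM."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    best_a, best_q = None, float("-inf")
    for _ in range(iters):
        # Sample candidate actions from the current Gaussian, clipped to the bounds.
        actions = [max(-1.0, min(1.0, rng.gauss(mean, std))) for _ in range(pop)]
        # Rank candidates by their Q-value and keep the elites.
        ranked = sorted(actions, key=lambda a: q_fn(state, a), reverse=True)
        elites = ranked[:n_elite]
        # Refit the sampling distribution to the elites.
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-6
        # Remember the best action seen in any iteration.
        top_q = q_fn(state, elites[0])
        if top_q > best_q:
            best_a, best_q = elites[0], top_q
    return best_a


def toy_q(state, action):
    # Hypothetical Q-function whose maximum lies at action == state.
    return 1.0 - (action - state) ** 2


best_action = cem_argmax(toy_q, state=0.3)
```

Because CEM only needs to evaluate the Q-function on sampled actions, no separate actor network is required, which is the property the text relies on.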
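In order to train the Q-function, a separate process called the “Bellman Updater” samples transition tuples $(s, a, r, s')$ containing the state $s$, action $a$, reward $r$, and next state $s'$ from a replay buffer and generates Bellman target values according to a clipped Double Q-learning rule [Hasselt2010DoubleQ, Sutton:1998:IRL:551283]:
$$\hat{Q}(s, a, s') = r(s, a) + \gamma V(s'),$$
where $V(s') = Q_{\bar{\theta}_1}(s', \pi_{\bar{\theta}_2}(s'))$, with $\pi_{\bar{\theta}_2}(s')$ the action selected by CEM under $Q_{\bar{\theta}_2}$, and $\bar{\theta}_1$ and $\bar{\theta}_2$ are the parameters of two delayed target networks. These target values are pushed to another replay buffer $\mathcal{D}$, and a separate training process optimizes the Q-function against a training objective:
$$\mathcal{E}(\theta) = \mathbb{E}_{(s, a, s') \sim \mathcal{D}} \left[ D\left( Q_\theta(s, a), \hat{Q}(s, a, s') \right) \right],$$
where $D$ is a divergence metric.
In particular, the cross-entropy loss is chosen for $D$, and the output of the network is passed through a sigmoid activation to ensure that the predicted Q-values are inside the unit interval.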
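As a rough sketch of the training signal described above, the snippet below builds a scalar Bellman target and scores a sigmoid-activated prediction against it with the cross-entropy loss. The function names and the numbers (reward, next-state value, discount) are hypothetical and chosen only so that the target stays inside the unit interval.

```python
import math


def sigmoid(x):
    """Squash a raw network output (logit) into the unit interval."""
    return 1.0 / (1.0 + math.exp(-x))


def bellman_target(reward, next_value, gamma):
    """Scalar target of the form r + gamma * V(s')."""
    return reward + gamma * next_value


def cross_entropy(target, logit, eps=1e-12):
    """Cross-entropy between a target in [0, 1] and sigmoid(logit)."""
    q = sigmoid(logit)
    return -(target * math.log(q + eps) + (1.0 - target) * math.log(1.0 - q + eps))


# Hypothetical numbers: no immediate reward, next-state value 0.8, gamma 0.9.
target = bellman_target(reward=0.0, next_value=0.8, gamma=0.9)
# Logit whose sigmoid equals the target, i.e. a perfect prediction.
matching_logit = math.log(target / (1.0 - target))
loss_at_match = cross_entropy(target, matching_logit)
loss_off = cross_entropy(target, 0.0)  # prediction of 0.5
```

The loss is smallest when the sigmoid output equals the target, so minimising it pulls the predicted Q-value toward the Bellman target while keeping it in [0, 1].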
IV Quantile QT-Opt (Q2-Opt)
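In Q2-Opt (Figure 1) the value function no longer predicts a scalar value, but rather a vector $\mathbf{q}_\theta(s, a, \boldsymbol{\tau})$ that predicts the quantile function output for a vector of input probabilities $\boldsymbol{\tau}$, with $\tau_i \in [0, 1]$ and $i = 1, \dots, N$. Thus the $i$-th element of $\mathbf{q}_\theta(s, a, \boldsymbol{\tau})$ approximates $F^{-1}_{s,a}(\tau_i)$, where $F_{s,a}$ is the CDF of the value distribution belonging to the state-action pair $(s, a)$. In practice, this means that the neural network has a multi-headed output. However, unlike QT-Opt where CEM optimizes directly over the Q-values, in Quantile QT-Opt, CEM maximizes a scoring function $\psi: \mathbb{R}^N \to \mathbb{R}$ that maps the vector $\mathbf{q}$ to a score $\psi(\mathbf{q})$:
$$\pi_\theta(s, \boldsymbol{\tau}) = \arg\max_a \psi\left(\mathbf{q}_\theta(s, a, \boldsymbol{\tau})\right).$$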
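Similarly, the target values produced by the “Bellman updater” are vectorized using a generalization of the clipped Double Q-learning rule from QT-Opt:
$$\hat{\mathbf{q}}_\theta(s, a, s', \boldsymbol{\tau}', \boldsymbol{\tau}'') = r(s, a)\mathbf{1} + \gamma \mathbf{v}(s', \boldsymbol{\tau}', \boldsymbol{\tau}''), \qquad \mathbf{v}(s', \boldsymbol{\tau}', \boldsymbol{\tau}'') = \mathbf{q}_{\bar{\theta}_1}\left(s', \pi_{\bar{\theta}_2}(s', \boldsymbol{\tau}''), \boldsymbol{\tau}'\right),$$
where $\mathbf{1}$ is a vector of ones, and, as before, $\bar{\theta}_1$ and $\bar{\theta}_2$ are the parameters of two delayed target networks. Even though this update rule has not been considered so far in the distributional RL literature, we find it effective in reducing the overestimation in the predictions.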
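The vectorized target can be illustrated with a small sketch. It assumes, as one plausible reading of the rule above, that the next action is selected with one target network and evaluated with the other, and it replaces the CEM search with a maximum over a small hypothetical set of discrete candidate actions. The function name, the action labels and all numbers are illustrative.

```python
from statistics import fmean


def vector_target(reward, gamma, q1_next, q2_next, score=fmean):
    """Build the target vector r * 1 + gamma * v for one transition.

    q1_next and q2_next map each candidate next action to the list of
    quantile values predicted by target networks 1 and 2 at the next state.
    """
    # Select the next action by scoring the second network's quantile vectors.
    best_action = max(q2_next, key=lambda a: score(q2_next[a]))
    # Evaluate that action with the first network and apply the Bellman backup
    # to every quantile (the reward is broadcast across the vector).
    return [reward + gamma * v for v in q1_next[best_action]]


# Hypothetical quantile predictions for two candidate actions.
q2_next = {"left": [0.1, 0.2, 0.3], "right": [0.4, 0.5, 0.9]}
q1_next = {"left": [0.0, 0.1, 0.2], "right": [0.2, 0.4, 0.6]}
target = vector_target(reward=0.0, gamma=0.5, q1_next=q1_next, q2_next=q2_next)
```

With the default `score=fmean`, the selected action is the one with the highest estimated expected return; passing a different scoring function changes only the selection step.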
In the following sections, we present two versions of Q2-Opt based on two recently introduced distributional algorithms: QR-DQN and IQN. The main differences between them arise from the inputs $\boldsymbol{\tau}$, $\boldsymbol{\tau}'$ and $\boldsymbol{\tau}''$ that are used. To avoid overloading our notation, from now on we omit the parameter subscript in $\hat{\mathbf{q}}_\theta$ and $\mathbf{q}_\theta$ and replace it with an index into these vectors, $\hat{q}_i$ and $q_j$.
IV-A Quantile Regression QT-Opt (Q2R-Opt)
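In Quantile Regression QT-Opt (Q2R-Opt), the vectors $\boldsymbol{\tau}$, $\boldsymbol{\tau}'$ and $\boldsymbol{\tau}''$ in $\mathbf{q}$ and $\hat{\mathbf{q}}$ are fixed. They all contain $N$ quantile midpoints of the value distribution. Concretely, $q_i(s, a)$ is assigned the fixed quantile target $\bar{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2}$ with $\tau_i = \frac{i}{N}$. The scoring function $\psi(\cdot)$ takes the mean of this vector, reducing the $N$ quantile midpoints to the expected value of the distribution when $N$ is sufficiently large. Because $\boldsymbol{\tau}$, $\boldsymbol{\tau}'$ and $\boldsymbol{\tau}''$ are always fixed we consider them implicit and omit adding them as an argument to $\mathbf{q}$ and $\hat{\mathbf{q}}$ for Q2R-Opt.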
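A short numerical check makes the last claim concrete: averaging a quantile function over the quantile midpoints approximates the expected value of the distribution. The quantile function `toy_inverse_cdf` below is a hypothetical example with a known mean of 1/3, not a distribution taken from the experiments.

```python
def quantile_midpoints(n):
    """Midpoints (tau_{i-1} + tau_i) / 2 of the grid tau_i = i / n, for i = 1..n."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]


def mean_score(quantile_values):
    """Scoring function that averages the quantile values."""
    return sum(quantile_values) / len(quantile_values)


def toy_inverse_cdf(tau):
    # Hypothetical quantile function of U**2 with U uniform on [0, 1]; its mean is 1/3.
    return tau ** 2


midpoints = quantile_midpoints(4)
estimate = mean_score([toy_inverse_cdf(t) for t in quantile_midpoints(100)])
```

Here `midpoints` is `[0.125, 0.375, 0.625, 0.875]`, and with 100 midpoints the estimate already agrees with 1/3 to about four decimal places.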
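The quantile heads are optimized by minimizing the Huber [huber1964] quantile regression loss:
$$\rho^{\kappa}_{\tau}(\delta_{ij}) = \left| \tau - \mathbb{I}\{\delta_{ij} < 0\} \right| \frac{L_\kappa(\delta_{ij})}{\kappa}, \qquad L_\kappa(\delta_{ij}) = \begin{cases} \frac{1}{2}\delta_{ij}^2, & \text{if } |\delta_{ij}| \le \kappa \\ \kappa\left(|\delta_{ij}| - \frac{1}{2}\kappa\right), & \text{otherwise} \end{cases}$$
for all the pairwise TD-errors:
$$\delta_{ij} = \hat{q}_i(s, a, s') - q_j(s, a).$$
Thus, the network is trained to minimize the loss function:
$$\mathcal{E}(\theta) = \mathbb{E}_{(s, a, s') \sim \mathcal{D}} \left[ \sum_{j=1}^{N} \mathbb{E}_{i} \left[ \rho^{\kappa}_{\bar{\tau}_j}(\delta_{ij}) \right] \right].$$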
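The loss described above can be sketched for a single transition as follows. The sketch assumes TD-errors of the form target minus prediction, attaches each quantile level to the predicted quantile it supervises, and divides the Huber term by the threshold, which is one common convention; all function names are hypothetical.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic near zero, linear beyond the threshold kappa."""
    a = abs(u)
    if a <= kappa:
        return 0.5 * u * u
    return kappa * (a - 0.5 * kappa)


def quantile_huber(u, tau, kappa=1.0):
    """Asymmetric Huber loss for quantile level tau and TD-error u."""
    # Under-estimates (u > 0) are weighted by tau, over-estimates by 1 - tau.
    weight = abs(tau - (1.0 if u < 0 else 0.0))
    return weight * huber(u, kappa) / kappa


def q2r_loss(targets, preds, taus, kappa=1.0):
    """Sum over predicted quantiles of the mean loss over all target quantiles."""
    total = 0.0
    for q_j, tau_j in zip(preds, taus):
        per_target = [quantile_huber(t - q_j, tau_j, kappa) for t in targets]
        total += sum(per_target) / len(per_target)
    return total


under = quantile_huber(0.5, tau=0.9)   # prediction below the target
over = quantile_huber(-0.5, tau=0.9)   # prediction above the target
```

For a high quantile level such as 0.9, under-estimating the target is penalised nine times more than over-estimating it by the same amount, which is what pushes each head toward its own quantile.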
IV-B Quantile Function QT-Opt (Q2F-Opt)
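In Q2F-Opt, the neural network itself approximates the quantile function of the value distribution, and therefore it can predict the inverse CDF $F^{-1}_{s,a}(\tau)$ for any $\tau \in [0, 1]$. Since $\boldsymbol{\tau}$, $\boldsymbol{\tau}'$ and $\boldsymbol{\tau}''$ are no longer fixed, we explicitly include them in the arguments of $\mathbf{q}$ and $\hat{\mathbf{q}}$. Thus, the TD-errors take the form:
$$\delta_{ij} = \hat{q}_i(s, a, s', \boldsymbol{\tau}', \boldsymbol{\tau}'') - q_j(s, a, \boldsymbol{\tau}),$$
where the entries of $\boldsymbol{\tau}$, $\boldsymbol{\tau}'$ and $\boldsymbol{\tau}''$ are sampled from independent uniform distributions. Using different input probability vectors also decreases the correlation between the networks. Note that now the lengths of the prediction and target vectors are determined by the lengths of $\boldsymbol{\tau}$ and $\boldsymbol{\tau}'$. The model is optimized using the same loss function as the one from Equation 8.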
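The effect of sampling the input probabilities can be seen in a small sketch of the pairwise TD-errors. The quantile function `toy_quantile_fn`, the vector lengths, the reward and the discount are hypothetical; the point is only that the shape of the TD-error matrix follows the lengths of the two sampled vectors.

```python
import random


def pairwise_td_errors(target_quantiles, predicted_quantiles):
    """Matrix of TD-errors: one row per target quantile, one column per prediction."""
    return [[t - p for p in predicted_quantiles] for t in target_quantiles]


def toy_quantile_fn(tau):
    # Hypothetical quantile function of a return that is uniform on [0, 1].
    return tau


rng = random.Random(0)
tau = [rng.random() for _ in range(4)]        # probabilities fed to the online network
tau_prime = [rng.random() for _ in range(3)]  # probabilities fed to the target network

reward, gamma = 0.0, 0.9
preds = [toy_quantile_fn(t) for t in tau]
targets = [reward + gamma * toy_quantile_fn(t) for t in tau_prime]
deltas = pairwise_td_errors(targets, preds)
```

Each predicted quantile is compared against every target quantile, so `deltas` has three rows and four columns here.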
IV-C Risk-Sensitive Policies
The additional information provided by a value distribution compared to the (scalar) expected return gives rise to a broader class of policies that go beyond optimizing for the expected value of the actions. Concretely, the expectation can be replaced with any risk metric, that is, any function that maps the random return to a scalar quantifying the risk. In Q2-Opt, this role is played by the function $\psi$ that acts as a risk metric. Thus the agent can handle the intrinsic uncertainty of the task in different ways depending on the specific form of $\psi$. It is important to specify that this uncertainty is generated by the environment dynamics ($p(\cdot \mid s, a)$) and the (non-stationary) policy collecting the real robot rollouts and that it is not a parametric uncertainty.
We distinguish two methods to construct risk-sensitive policies for Q2R-Opt and Q2F-Opt, each specific to one of the methods. In Q2R-Opt, risk-averse and risk-seeking policies can be obtained by changing the function $\psi$ when selecting actions. Rather than computing the mean of the target quantiles, $\psi$ can be defined as a weighted average over the quantiles, $\psi(\mathbf{q}(s, a)) = \frac{1}{N} \sum_{i=1}^{N} w_i q_i(s, a)$. This sum produces a policy that is in between a worst-case and best-case action selector and, for most purposes, it would be preferable in practice over the two extremes. For instance, a robot that would consider only the worst-case scenario would most likely terminate immediately since this strategy, even though it is not useful, does not incur any penalty. Behaviours like this have been encountered in our evaluation of very conservative policies.
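A weighted-average scoring function of this kind can be sketched as below. The weights and quantile values are hypothetical, and the weights are normalised to sum to one, which is an assumption of this sketch rather than a detail given in the text.

```python
def weighted_score(quantiles, weights):
    """Weighted average of quantile values; weights are normalised to sum to one."""
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, quantiles)) / total


# Hypothetical quantile values of one action, in ascending order.
quantiles = [0.1, 0.5, 0.9]
mean_case = weighted_score(quantiles, [1, 1, 1])    # risk-neutral mean
risk_averse = weighted_score(quantiles, [3, 2, 1])  # emphasises low quantiles
worst_case = weighted_score(quantiles, [1, 0, 0])   # only the lowest quantile
```

Shifting weight toward the low quantiles moves the score from the mean toward the worst case, so intermediate weightings interpolate between the two selectors discussed above.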
In contrast, Q2F-Opt provides a more elegant way of learning risk-sensitive control policies by using risk-distortion metrics [wang_1996]. Recently, Majumdar and Pavone [Majumdar2017HowSA] have argued for the use of risk distortion metrics in robotics. They proposed a set of six axioms that any risk measure should meet to produce reasonable behaviour and showed that risk distortion metrics satisfy all of them. However, to the best of our knowledge, they have not been tried on real robotic applications.
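The key idea is to use a policy:
$$\pi_\theta(s, \beta(\boldsymbol{\tau})) = \arg\max_a \psi\left(\mathbf{q}_\theta(s, a, \beta(\boldsymbol{\tau}))\right),$$
where $\beta: [0, 1] \to [0, 1]$ is an element-wise function that distorts the uniform distribution that $\boldsymbol{\tau}$ is effectively sampled from, and $\psi(\cdot)$ computes the mean of the vector as usual. Functions that are concave induce risk-averse policies, while convex functions induce risk-seeking policies.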
In our experiments, we consider the same risk distortion metrics used by Dabney et al. [dabney2018implicit]: the cumulative probability weighting (CPW) [Gonzlez1999OnTS], the standard normal CDF-based metric proposed by Wang [Wang00aclass], the conditional value at risk (CVaR) [Rockafellar00optimizationof], Norm [dabney2018implicit], and a power law formula (Pow) [dabney2018implicit]. Concretely, we plug in different parameters for these distortions. CPW(0.71) is known to be a good model of human behaviour [Wu1996]; Wang, Pow, CVaR and CVaR are risk-averse. Norm(3) decreases the weight of the distribution’s tails by averaging 3 uniformly sampled $\tau$. Finally, Wang(0.75) produces risk-seeking behaviour.
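For reference, the distortion functions listed above can be written compactly. The sketch follows the definitions given by Dabney et al. [dabney2018implicit]; the function names are hypothetical, and the parameter values in the example calls are illustrative and not necessarily the settings used in these experiments.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def cpw(eta, tau):
    """Cumulative probability weighting."""
    return tau ** eta / (tau ** eta + (1.0 - tau) ** eta) ** (1.0 / eta)


def wang(eta, tau):
    """Wang distortion; requires 0 < tau < 1. Negative eta lowers the sampled level."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(tau) + eta)


def cvar(eta, tau):
    """Conditional value at risk: only the lowest eta fraction of quantiles is used."""
    return eta * tau


def pow_distortion(eta, tau):
    """Power-law distortion; negative eta shifts levels toward the lower tail."""
    exponent = 1.0 / (1.0 + abs(eta))
    if eta >= 0:
        return tau ** exponent
    return 1.0 - (1.0 - tau) ** exponent


def norm_sample(n, rng):
    """Norm(n): average of n uniform samples, which concentrates levels near 0.5."""
    return sum(rng.random() for _ in range(n)) / n


# Illustrative evaluations at the median level tau = 0.5.
averse_level = wang(-0.75, 0.5)
seeking_level = wang(0.75, 0.5)
cvar_level = cvar(0.25, 0.8)
```

A distorted level below the original one makes the policy look at lower quantiles of the return, which is how the risk-averse settings arise; a level above it has the opposite effect.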
Due to the relationships between Q2F-Opt and the literature of risk distortion measures, we focus our risk-sensitivity experiments on the metrics mentioned above and leave the possibility of trying different functions in Q2R-Opt for future work.
IV-D Model Architecture
To maintain our comparisons with QT-Opt, we use very similar architectures for Q2R-Opt and Q2F-Opt. For Q2R-Opt, we modify the output layer of the standard QT-Opt architecture to be a vector of size $N$, rather than a scalar. For Q2F-Opt, we take a similar approach to [dabney2018implicit], and we embed every $\tau_k$ using a series of $n$ cosine basis functions:
$$\phi_j(\tau_k) = \mathrm{ReLU}\left( \sum_{i=0}^{n-1} \cos(\pi i \tau_k)\, w_{ij} + b_j \right).$$
We then perform the Hadamard product between this embedding and the convolutional features. Another difference in Q2F-Opt is that we replace batch normalization [Ioffe:2015:BNA:3045118.3045167] in the final fully-connected layers with layer normalization [Ba2016LayerN]. We notice that this better keeps the sampled values in the range allowed by our MDP formulation. The three architectures are all included in a single diagram in Figure 2.
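The embedding and its combination with the image features can be sketched in a few lines. The form of the embedding follows the cosine basis used by Dabney et al. [dabney2018implicit]; the weights, biases, feature values and dimensions below are hypothetical and far smaller than in a real network.

```python
import math


def cosine_embedding(tau, weights, biases):
    """Embed a probability tau with cosine features followed by a ReLU layer.

    weights[i][j] connects the i-th cosine feature to the j-th output unit.
    """
    n = len(weights)
    basis = [math.cos(math.pi * i * tau) for i in range(n)]
    return [
        max(0.0, sum(basis[i] * weights[i][j] for i in range(n)) + biases[j])
        for j in range(len(biases))
    ]


def hadamard(a, b):
    """Element-wise product of two equally sized vectors."""
    return [x * y for x, y in zip(a, b)]


# Hypothetical 2x2 identity weights, zero biases and two convolutional features.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
embedding = cosine_embedding(1.0, weights, biases)
modulated = hadamard(embedding, [2.0, 3.0])
```

In this toy case the second cosine feature is negative at the chosen probability and is zeroed by the ReLU, so the product suppresses the second image feature; in general the embedding lets each probability modulate the shared features differently.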
V Experimental Setup
We consider the problem of vision-based robotic grasping for our evaluations. In our grasping setup, the robot arm is placed at a fixed distance from a bin containing a variety of objects and tasked with grasping any object. The MDP specifying our robotic manipulation task provides a simple binary reward to the agent at the end of the episode: 0 for a failed grasp, and 1 for a successful grasp. To encourage the robot to grasp objects as fast as possible, we also use a time step penalty and a discount factor $\gamma$. The state is represented by an RGB image; the actions are a mixture of continuous 4-DOF tool displacements in $x$, $y$, $z$ with an azimuthal rotation, and discrete actions to open and close the gripper, as well as to terminate the episode.
In simulation, we grasp from a bin containing 8 to 12 randomly generated procedural objects (Figure 2). For an initial number of global training steps, we use a procedural exploration policy: the scripted policy lowers the end effector to a random position at the level of the bin and attempts to grasp. After this initial phase, we switch to an $\epsilon$-greedy policy.
In the real world, we train our model from an offline dataset. For evaluation, we attempt 6 consecutive grasps from a bin containing 6 objects without replacement, repeated across 5 rounds. We perform this experiment in parallel on 7 robots, resulting in a total of 210 grasp attempts (6 grasps × 5 rounds × 7 robots). All the robots use a similar object setup consisting of two plastic bottles, one metal can, one paper bowl, one paper cup, and one paper cup sleeve. In the results section, we report the success rate over the 210 attempts.
In this section, we present our results on simulated and real environments. In simulation, we perform both online and offline experiments, while for the real world, the training is exclusively offline.
VI-A Simulated Environment
Figure 3 shows the mean success rate as a function of the global training step together with the standard deviation across five runs for QT-Opt, Q2R-Opt and Q2F-Opt. Because Q2-Opt and QT-Opt are distributed systems, the global step does not directly match the number of environment episodes used by the models during training. Therefore, to understand the sample efficiency of the algorithm, we also include in Figure 4 the success rate as a function of the total number of environment episodes added to the buffer.
We find the results to be aligned with the findings on the arcade-game environments. Q2F-Opt achieves the best grasp success rate out of all the considered methods, stabilizing around 92.8%, while also being significantly more sample efficient than QT-Opt. Q2R-Opt exhibits an intermediary performance. Despite being less sample efficient than Q2F-Opt, it still learns significantly faster than QT-Opt in the late stages of training. The final performance statistics are given in full detail in Table I.
Table I (columns): Model | Mean Success | Success Std | Median Success
VI-B Risk-Sensitive Policies
We evaluate a series of risk distortion measures with different parameter settings in simulation. Figure 5 shows the success rate for various measures used in Q2F-Opt. We notice that risk-averse policies (Wang, Pow, CVaR) are generally more stable in the late stages of training and achieve a higher success rate. Pow remarkably achieves 95% grasp success rate. However, being too conservative can also be problematic. Particularly, the CVaR policy becomes more vulnerable to the locally optimal behaviour of stopping immediately (which does not induce any reward penalty). This makes its performance fluctuate throughout training, even though it ultimately obtains a good final success rate. Table II gives the complete final success rate statistics.
VI-C Real-World Environment
For the real-world evaluation, we train our models offline from a GB dataset of real-world experiences, containing episodes, collected over many months using a multitude of control policies. The chaotic physical interactions specific to real-world environments and the diversity of policies used to gather the experiences make this experiment an ideal scenario for applying distributional RL. Furthermore, this experiment is also of practical importance for robotics since any increase in grasp success rate from offline data reduces the amount of costly online training that has to be performed to obtain a good policy.
Table III (columns): Model | Grasp Success Rate
We report in Table III the grasp success rate statistics for all the considered models. We find that the best risk-averse version of Q2-Opt achieves an impressive 17.6% higher success rate than QT-Opt. While the real evaluation closely matches the model hierarchy observed in sim, the success rate differences between the models are much more significant.
The risk-averse distortion measures of Q2F-Opt have a significant qualitative impact. They show a tendency to re-position the gripper in positions that are more favourable or to move objects around to make grasping easier. CVaR(0.4), the most conservative metric we tested in the real world, presented a particularly interesting behaviour of intentionally dropping poorly grasped objects to attempt a better re-grasp. The CVaR policy mainly used this technique when attempting to grasp objects from the corners of the bin, in order to move them into a central position. However, a downside of risk-averse policies that we noticed is that, for the difficult-to-grasp paper cup sleeves, the agent often kept searching for an ideal position without actually attempting to grasp. We believe this is an interesting example of the trade-offs between being conservative and risk-seeking. The only tested risk-seeking policy, using Wang(0.75), made many high-force contacts with the bin and objects, which often resulted in broken gripper fingers and objects being thrown out of the bin. We also showcase some of these qualitative differences between the considered policies in the videos accompanying our paper.
VI-D Batch RL and Exploitation
Recently, Agarwal et al. [Agarwal2019StrivingFS] have argued that most of the advantages of distributional algorithms come from better exploitation. Their results demonstrated that QR-DQN could achieve in offline training a performance superior to online C51. Since environment interactions are particularly costly in robotics, reproducing these results in a robotics setting would be beneficial. Therefore, we perform an equivalent experiment in simulation and train the considered models on all the transitions collected during training by a QT-Opt agent with a final success rate of 90% (Figure 6).
We note that despite the minor success rate improvements brought by Q2R-Opt and Q2F-Opt, the two models are not even capable of achieving the final success rate of the policy trained from the same data. We hypothesize this is due to the out-of-distribution action problem [Kumar2019StabilizingOQ], which becomes more prevalent in continuous action spaces.
In this work, we have examined the impact that value distributions have on practical robotic tasks. Our proposed method, Q2-Opt, achieved state-of-the-art success rates on simulated and real vision-based robotic grasping tasks, while also being significantly more sample efficient than its non-distributional equivalent, QT-Opt. The success rate improvements brought by Q2-Opt by training only from real offline data could drastically reduce the number of environment interactions required to train a good robot control policy. Additionally, we have shown how safe reinforcement learning control can be achieved through risk-sensitive policies and reported the range of interesting behaviours these policies produce in practice. As a final contribution, we evaluated the proposed distributional methods in a batch RL setting similar to that of Agarwal et al. [Agarwal2019StrivingFS] and showed that, unfortunately, their findings do not translate to the continuous grasping environment presented in this work.
We would like to give special thanks to Ivonne Fajardo and Noah Brown for overseeing the robot operations.
VII-A Real World Setup
Figure 7 illustrates our workspace setup with the objects used for evaluation. We considered a diverse set of objects containing both rigid and soft bodies, presenting different textures, colours and shapes.
VII-B Risk-sensitive behaviours
An interesting metric we considered to analyze the behaviour of different risk-sensitive policies is the number of broken gripper fingers throughout the entire evaluation process. Figure 8 plots these numbers. Even though we do not have a statistically significant number of samples, we believe this data provides minimal insight into the behaviour of these policies.
VII-C Offline Evaluation
Besides the dataset obtained from the training of a model, we also analyze our distributional methods on two more fixed datasets containing transitions:
Collected by a scripted stochastic exploration policy with success rate.
Produced by a policy with grasp success rate.
Figure 9 plots the results for the first scenario, while Figure 10 plots those for the second. Perhaps surprisingly, the models achieve a higher success rate on the scripted exploration dataset than on the dataset collected during training. Even more surprisingly, the models fail to learn on the dataset collected by the almost optimal policy. These results suggest that the transitions dataset needs to have high entropy for the models to learn effectively offline.