Quantile QT-Opt for Risk-Aware Vision-Based Robotic Grasping
Abstract
The distributional perspective on reinforcement learning (RL) has given rise to a series of successful Q-learning algorithms, resulting in state-of-the-art performance in arcade game environments. However, it has not yet been analyzed how these findings from a discrete setting translate to complex practical applications characterized by noisy, high-dimensional and continuous state-action spaces. In this work, we propose Quantile QT-Opt (Q2-Opt), a distributional variant of the recently introduced distributed Q-learning algorithm QT-Opt [kalashnikov2018scalable] for continuous domains, and examine its behaviour in a series of simulated and real vision-based robotic grasping tasks. The absence of an actor in Q2-Opt allows us to directly draw a parallel to the previous discrete experiments in the literature, without the additional complexities induced by an actor-critic architecture. We demonstrate that Q2-Opt achieves a superior vision-based object-grasping success rate, while also being more sample efficient. The distributional formulation also allows us to experiment with various risk-distortion metrics, which give an indication of how robots can concretely manage risk in practice using a deep RL control policy. As an additional contribution, we perform experiments on offline datasets and compare the results with the latest findings from discrete settings. Surprisingly, we find a discrepancy between our results and the previous batch RL findings obtained in arcade game environments.
I Introduction
The new distributional perspective on RL has produced a novel class of Deep Q-learning methods that learn a distribution over the state-action returns, instead of only the expectation given by the traditional value function. These methods, which obtained state-of-the-art results in arcade game environments [bellemare2017distributional, dabney2018distributional, dabney2018implicit], present several attractive properties.
First, their ability to preserve the multi-modality of the action values naturally accounts for learning from a non-stationary policy, most often deployed in a highly stochastic environment. This ultimately results in a more stable training process and improved performance and sample efficiency. Second, they enable the use of risk-sensitive policies that no longer select actions based on the expected value, but take entire distributions into account. These policies can represent a continuum of risk-management strategies, ranging from risk-averse to risk-seeking, by optimizing for a broader class of risk metrics.
Despite the improvements distributional Q-learning algorithms have demonstrated in discrete arcade environments, it is yet to be examined how these findings translate to practical, real-world applications. Intuitively, the advantageous properties of distributional Q-learning approaches should be particularly beneficial in a robotic setting. The value distributions can have a significant qualitative impact in robotic tasks, which are usually characterized by highly stochastic and continuous state-action spaces. Additionally, performing safe control in the face of uncertainty is one of the biggest impediments to deploying robots in the real world, an impediment that RL methods have not yet tackled. In contrast, a distributional approach allows robots to learn an RL policy that appropriately quantifies risks for the task of interest.
However, given the brittle nature of deep RL algorithms and their often counter-intuitive behaviour [Henderson2017DeepRL], it is not entirely clear whether these intuitions hold in practice. Therefore, we believe that an empirical analysis of distributional Q-learning algorithms in real robotic applications would shed light on their benefits and scalability, and provide essential insight for the robot learning community.
In this paper, we aim to address this need and perform a thorough analysis of distributional Q-learning algorithms in simulated and real vision-based robotic manipulation tasks. To this end, we propose a distributional enhancement of QT-Opt [kalashnikov2018scalable] dubbed Quantile QT-Opt (Q2-Opt). The choice of QT-Opt, a recently introduced distributed Q-learning algorithm that operates on continuous action spaces, is dictated by its demonstrated applicability to large-scale vision-based robotic experiments. In addition, by being an actor-free generalization of Q-learning to continuous action spaces, QT-Opt enables a direct comparison to the previous results on the arcade environments, without the additional complexities and compounding effects of an actor-critic-type architecture.
In particular, we introduce two versions of Q2-Opt, based on Quantile Regression DQN (QR-DQN) [dabney2018distributional] and Implicit Quantile Networks (IQN) [dabney2018implicit]. The two methods are evaluated on a vision-based grasping task in simulation and the real world. We show that these distributional algorithms achieve a state-of-the-art grasping success rate in both settings, while also being more sample efficient. Furthermore, we experiment with a multitude of risk metrics, ranging from risk-seeking to risk-averse, and show that risk-averse policies can bring significant performance improvements. We also report on the interesting qualitative changes that different risk metrics induce in the robots' grasping behaviour. As an additional contribution, we analyze our distributional methods in a batch RL scenario and compare our findings with an equivalent experiment from the arcade environments [Agarwal2019StrivingFS].
II Related Work
Deep learning has proven to be a useful tool for learning visuomotor policies that operate directly on raw images. Examples include various manipulation tasks, where related approaches either use supervised learning to predict the probability of a successful grasp [mahler2017dex, levine2016end, Pinto2015SupersizingSL] or learn a reinforcement learning control policy [levine2018learning, quillen2018deep, kalashnikov2018scalable].
Distributional Q-learning algorithms have so far formed a separate line of research, mainly evaluated on game environments. These algorithms replace the expected return of an action with a distribution over returns and mainly differ in the way they parametrize this distribution. Bellemare et al. [bellemare2017distributional] express it as a categorical distribution over a fixed set of equidistant points. Their algorithm, C51, minimizes the KL divergence to the projected distributional Bellman target. A follow-up algorithm, QR-DQN [dabney2018distributional], approximates the distribution by learning the outputs of the quantile function at a fixed set of points, the quantile midpoints. This approach has been extended by IQN [dabney2018implicit], which reparametrizes the critic network to take as input any probability and learn the quantile function itself. Besides this extension, their paper also analyses various risk-sensitive policies that the distributional formulation enables. In this work, we apply these advancements to a challenging vision-based real-world grasping task with a continuous action space.
Closest to our work is D4PG [barth2018distributed], a distributed and distributional version of DDPG [Lillicrap2015ContinuousCW] that achieves superior performance to the non-distributional version in a series of simulated continuous-control environments. In contrast to this work, we analyze a different variant of Q-learning with continuous action spaces, which allows us to focus on actor-free settings similar to the previous distributional Q-learning algorithms. In addition, we demonstrate our results on a challenging vision-based grasping task on real robots.
III Background
In this paper we consider a standard Markov Decision Process [Puterman1994MarkovDP] formulation $(\mathcal{S}, \mathcal{A}, r, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $r(s, a)$ is a deterministic reward function, $P(s' \mid s, a)$ is the transition function and $\gamma \in [0, 1)$ is the discount factor.
As previously stated, we build our method on top of QT-Opt [kalashnikov2018scalable], a distributed Q-learning algorithm suitable for continuous action spaces. The algorithm trains a parameterized state-action value function $Q_\theta(s, a)$, which is represented by a neural network with parameters $\theta$. The cross-entropy method (CEM) [Rubinstein2004TheCM] is used to iteratively optimize over actions and select the best one for a given Q-function:
$\pi(s) = \arg\max_{a} Q_\theta(s, a) \qquad (1)$
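As an illustrative sketch (not the authors' implementation; the function name, population size and iteration count are our own choices), the CEM maximization over actions can be written as:

```python
import numpy as np

def cem_maximize(score_fn, action_dim, n_iters=5, pop_size=64, elite_frac=0.1,
                 rng=None):
    """Cross-entropy method: repeatedly sample candidate actions from a
    Gaussian, keep the top-scoring fraction, and refit the Gaussian to
    those elites."""
    rng = rng or np.random.default_rng(0)
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(pop_size, action_dim))
        scores = score_fn(samples)                       # e.g. Q(s, a) per candidate
        elite = samples[np.argsort(scores)[-n_elite:]]   # best-scoring candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # approximate argmax_a of score_fn
```

In QT-Opt, `score_fn` would be the learned Q-network evaluated at the current state for a batch of candidate actions.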
In order to train the Q-function, a separate process called the "Bellman Updater" samples transition tuples containing the state $s$, action $a$, reward $r$, and next state $s'$ from a replay buffer and generates Bellman target values according to a clipped Double Q-learning rule [Hasselt2010DoubleQ, Sutton:1998:IRL:551283]:
$\hat{Q}(s, a, s') = r(s, a) + \gamma V(s'), \qquad (2)$
where $V(s') = \min_{j=1,2} Q_{\bar{\theta}_j}\left(s', \arg\max_{a'} Q_{\bar{\theta}_1}(s', a')\right)$, and $\bar{\theta}_1$ and $\bar{\theta}_2$ are the parameters of two delayed target networks. These target values are pushed to another replay buffer, and a separate training process optimizes the Q-function against the training objective:
$\mathcal{E}(\theta) = \mathbb{E}_{(s, a, s')}\left[D\left(Q_\theta(s, a), \hat{Q}(s, a, s')\right)\right], \qquad (3)$
where $D$ is a divergence metric.
In particular, the cross-entropy loss is chosen for $D$, and the output of the network is passed through a sigmoid activation to ensure that the predicted Q-values lie inside the unit interval.
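A minimal sketch of this target and objective, with hypothetical helper names (`bellman_target`, `cross_entropy_loss`) and scalar inputs for clarity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bellman_target(r, gamma, q1_next, q2_next):
    """Clipped double Q-learning: bootstrap from the min of two
    delayed target networks to reduce overestimation."""
    return r + gamma * np.minimum(q1_next, q2_next)

def cross_entropy_loss(logit, target):
    """Binary cross-entropy between sigmoid(Q) and a Bellman target
    that lies in [0, 1]."""
    p = sigmoid(logit)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
```

The loss is minimized when the sigmoid of the predicted logit matches the Bellman target.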
IV Quantile QT-Opt (Q2-Opt)
In Q2-Opt (Figure 1) the value function no longer predicts a scalar value, but rather a vector $q_\theta(s, a, \tau)$ of quantile function outputs for a vector of input probabilities $\tau$, with $\tau_i \in [0, 1]$ and $\tau_i < \tau_{i+1}$. Thus the $i$th element of the prediction approximates $F^{-1}_{Z(s, a)}(\tau_i)$, where $F_{Z(s, a)}$ is the CDF of the value distribution $Z(s, a)$ belonging to the state-action pair $(s, a)$. In practice, this means that the neural network has a multi-headed output. However, unlike QT-Opt, where CEM optimizes directly over the Q-values, in Quantile QT-Opt CEM maximizes a scoring function $\psi$ that maps the vector $q_\theta(s, a, \tau)$ to a score:
$\pi(s) = \arg\max_{a} \psi\left(q_\theta(s, a, \tau)\right) \qquad (4)$
Similarly, the target values produced by the "Bellman Updater" are vectorized using a generalization of the clipped Double Q-learning rule from QT-Opt:
$\hat{q}(s, a, s', \tau') = r(s, a)\mathbf{1} + \gamma \min_{j \in \{1, 2\}} q_{\bar{\theta}_j}\left(s', \arg\max_{a'} \psi\left(q_{\bar{\theta}_1}(s', a', \tau')\right), \tau'\right), \qquad (5)$
where $\mathbf{1}$ is a vector of ones, the minimum is taken element-wise over the two quantile vectors, and, as before, $\bar{\theta}_1$ and $\bar{\theta}_2$ are the parameters of two delayed target networks. Even though this update rule has not been considered so far in the distributional RL literature, we find it effective in reducing overestimation in the predictions.
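The scoring function and the vectorized target can be sketched as follows (a simplified illustration with our own function names; the real system operates on batches of neural-network outputs):

```python
import numpy as np

def psi(q_vec):
    """Risk-neutral scoring function: the mean of the quantile vector
    approximates the expected return."""
    return np.asarray(q_vec).mean(axis=-1)

def vector_target(r, gamma, q1_next, q2_next):
    """Element-wise clipped double-Q target over two target-network
    quantile vectors (a sketch of Eq. 5)."""
    ones = np.ones_like(q1_next)
    return r * ones + gamma * np.minimum(q1_next, q2_next)
```

The element-wise minimum keeps each target quantile at the more pessimistic of the two target networks.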
In the following sections, we present two versions of Q2-Opt based on two recently introduced distributional algorithms: QR-DQN and IQN. The main differences between them arise from the input probability vectors $\tau$ and $\tau'$ that are used. To avoid overloading our notation, from now on we omit the parameter subscript $\theta$ and use an index $i$ to denote the $i$th element of the quantile vectors, e.g. $q_i$.
IV-A Quantile Regression QT-Opt (Q2R-Opt)
In Quantile Regression QT-Opt (Q2R-Opt), the probability vectors $\tau$ and $\tau'$ are fixed. They contain the quantile midpoints of the value distribution. Concretely, $\tau_i$ is assigned the fixed quantile target $\hat{\tau}_i = \frac{2i - 1}{2N}$ for $i = 1, \dots, N$. The scoring function $\psi$ takes the mean of this vector, reducing the quantile midpoints to the expected value of the distribution when $N$ is sufficiently large. Because $\tau$ and $\tau'$ are always fixed, we consider them implicit and omit them as arguments of $q$ and $\hat{q}$ for Q2R-Opt.
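The fixed midpoints follow the usual quantile-regression convention, $\hat{\tau}_i = (2i - 1) / (2N)$ for $i = 1, \dots, N$; a small sketch (the helper name is ours):

```python
import numpy as np

def quantile_midpoints(n):
    """Fixed quantile midpoints used by QR-style critics:
    tau_hat_i = (2i - 1) / (2N) for i = 1..N."""
    return (2.0 * np.arange(1, n + 1) - 1.0) / (2.0 * n)
```

By construction the midpoints are symmetric around 0.5, so their mean recovers the risk-neutral expectation as $N$ grows.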
The quantile heads are optimized by minimizing the Huber [huber1964] quantile regression loss:
$\rho^{\kappa}_{\hat{\tau}_i}(\delta_{ij}) = \left|\hat{\tau}_i - \mathbb{1}\{\delta_{ij} < 0\}\right| \frac{L_\kappa(\delta_{ij})}{\kappa}, \quad L_\kappa(\delta) = \begin{cases} \frac{1}{2}\delta^2, & \text{if } |\delta| \le \kappa \\ \kappa\left(|\delta| - \frac{1}{2}\kappa\right), & \text{otherwise} \end{cases} \qquad (6)$
for all the pairwise TD errors:
$\delta_{ij} = \hat{q}_j(s, a, s') - q_i(s, a) \qquad (7)$
Thus, the network is trained to minimize the loss function:
$\mathcal{L}(\theta) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho^{\kappa}_{\hat{\tau}_i}(\delta_{ij}) \qquad (8)$
IV-B Quantile Function QT-Opt (Q2F-Opt)
In Q2F-Opt, the neural network itself approximates the quantile function of the value distribution, and can therefore predict the inverse CDF for any $\tau_i \in [0, 1]$. Since $\tau$ and $\tau'$ are no longer fixed, we explicitly include them in the arguments of $q$ and $\hat{q}$. Thus, the TD errors take the form:
$\delta_{ij} = \hat{q}_j(s, a, s', \tau'_j) - q_i(s, a, \tau_i), \qquad (9)$
where $\tau$, $\tau'$ and $\tilde{\tau}$ are sampled from independent uniform distributions $U[0, 1]$. Using different input probability vectors also decreases the correlation between the networks. Note that the lengths of the prediction and target vectors are now determined by the lengths of $\tau$ and $\tau'$. The model is optimized using the same loss function as in Equation 8.
IV-C Risk-Sensitive Policies
The additional information provided by a value distribution compared to the (scalar) expected return gives rise to a broader class of policies that go beyond optimizing for the expected value of the actions. Concretely, the expectation can be replaced with any risk metric, that is, any function that maps the random return to a scalar quantifying the risk. In Q2-Opt, this role is played by the scoring function $\psi$, which acts as a risk metric. Thus the agent can handle the intrinsic uncertainty of the task in different ways, depending on the specific form of $\psi$. It is important to specify that this uncertainty is generated by the environment dynamics and the (non-stationary) policy collecting the real robot rollouts, and that it is not a parametric uncertainty.
We distinguish two ways of constructing risk-sensitive policies, one specific to Q2R-Opt and one to Q2F-Opt. In Q2R-Opt, risk-averse and risk-seeking policies can be obtained by changing the function $\psi$ used when selecting actions. Rather than computing the mean of the predicted quantiles, $\psi$ can be defined as a weighted average over the quantiles $q_i$. This weighted sum produces a policy that lies in between a worst-case and a best-case action selector and, for most purposes, would be preferable in practice over the two extremes. For instance, a robot that considered only the worst-case scenario would most likely terminate immediately, since this strategy, even though it is not useful, does not incur any penalty. Behaviours like this have been encountered in our evaluation of very conservative policies.
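Such a weighted scoring function might look like the following sketch (the helper name and the example weights are hypothetical):

```python
import numpy as np

def risk_weighted_score(q_vec, weights):
    """Weighted average over the sorted quantile estimates; putting more
    weight on the low quantiles yields a more conservative action score."""
    w = np.asarray(weights, dtype=float)
    return float((np.asarray(q_vec) * (w / w.sum())).sum())
```

Uniform weights recover the risk-neutral mean, while weights concentrated on the lowest quantiles interpolate towards a worst-case selector.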
In contrast, Q2F-Opt provides a more elegant way of learning risk-sensitive control policies by using risk-distortion metrics [wang_1996]. Recently, Majumdar and Pavone [Majumdar2017HowSA] have argued for the use of risk-distortion metrics in robotics. They proposed a set of six axioms that any risk measure should satisfy to produce reasonable behaviour and showed that risk-distortion metrics satisfy all of them. However, to the best of our knowledge, these metrics have not previously been applied to real robotic tasks.
The key idea is to use a policy of the form:

$\pi(s) = \arg\max_{a} \psi\left(q_\theta(s, a, \beta(\tau))\right),$

where $\beta : [0, 1] \to [0, 1]$ is an element-wise function that distorts the uniform distribution that $\tau$ is effectively sampled from, and $\psi$ computes the mean of the vector as usual. Functions $\beta$ that are concave induce risk-averse policies, while convex functions induce risk-seeking policies.
In our experiments, we consider the same risk distortion metrics used by Dabney et al. [dabney2018implicit]: the cumulative probability weighting (CPW) [Gonzlez1999OnTS], the standard normal CDF-based metric proposed by Wang [Wang00aclass], the conditional value at risk (CVaR) [Rockafellar00optimizationof], Norm [dabney2018implicit], and a power-law formula (Pow) [dabney2018implicit]. Concretely, we plug in different parameters for these distortions. CPW(0.71) is known to be a good model of human behaviour [Wu1996], while Wang(-0.75), Pow(-2), CVaR(0.25) and CVaR(0.4) are risk-averse. Norm(3) decreases the weight of the distribution's tails by averaging 3 uniformly sampled $\tau$s. Finally, Wang(0.75) produces risk-seeking behaviour.
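These distortions can be sketched as follows (our own minimal implementations of the standard formulas; the function names, including `pow_distortion`, are ours; Norm is omitted since it averages several sampled $\tau$s rather than distorting each one point-wise):

```python
from statistics import NormalDist

_norm = NormalDist()

def cvar(tau, eta):
    """CVaR(eta): restrict sampling to the lowest eta-fraction of returns."""
    return eta * tau

def wang(tau, eta):
    """Wang(eta): shift in Gaussian space; eta < 0 is risk-averse,
    eta > 0 risk-seeking."""
    return _norm.cdf(_norm.inv_cdf(tau) + eta)

def cpw(tau, eta):
    """Cumulative probability weighting; eta = 0.71 approximates human
    probability weighting."""
    return tau ** eta / (tau ** eta + (1.0 - tau) ** eta) ** (1.0 / eta)

def pow_distortion(tau, eta):
    """Pow(eta): power-law distortion; negative eta is risk-averse."""
    exponent = 1.0 / (1.0 + abs(eta))
    return tau ** exponent if eta >= 0 else 1.0 - (1.0 - tau) ** exponent
```

Risk-averse settings map $\tau$ below the identity, so the policy evaluates actions at pessimistic quantiles of the return distribution.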
Due to the relationship between Q2F-Opt and the literature on risk-distortion measures, we focus our risk-sensitivity experiments on the metrics mentioned above and leave the exploration of different $\psi$ functions in Q2R-Opt for future work.
IV-D Model Architecture
To maintain our comparisons with QT-Opt, we use very similar architectures for Q2R-Opt and Q2F-Opt. For Q2R-Opt, we modify the output layer of the standard QT-Opt architecture to be a vector of size $N$, rather than a scalar. For Q2F-Opt, we take a similar approach to IQN [dabney2018implicit] and embed every $\tau_i$ using a series of $n$ cosine basis functions:

$\phi_j(\tau_i) = \mathrm{ReLU}\left(\sum_{k=0}^{n-1} \cos(\pi k \tau_i) w_{kj} + b_j\right).$
We then take the Hadamard product between this embedding and the convolutional features. Another difference in Q2F-Opt is that we replace batch normalization [Ioffe:2015:BNA:3045118.3045167] in the final fully-connected layers with layer normalization [Ba2016LayerN]. We notice that this better keeps the predicted quantile values in the range allowed by our MDP formulation. The three architectures are all included in a single diagram in Figure 2.
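The cosine basis features can be sketched as follows (a simplified illustration; the learned linear layer and ReLU that follow in the actual architecture are omitted here):

```python
import numpy as np

def cosine_embedding(tau, n_basis=64):
    """Cosine basis features cos(pi * k * tau), k = 0..n_basis-1.
    A learned linear layer plus ReLU would normally map these features
    to the width of the convolutional feature map before the Hadamard
    product."""
    k = np.arange(n_basis)
    return np.cos(np.pi * k * tau)
```

Each sampled probability $\tau_i$ thus receives a distinct, smoothly varying feature vector that conditions the network's quantile prediction.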
V Experimental Setup
We consider the problem of vision-based robotic grasping for our evaluations. In our grasping setup, the robot arm is placed at a fixed distance from a bin containing a variety of objects and is tasked with grasping any object. The MDP specifying our robotic manipulation task provides a simple binary reward to the agent at the end of the episode: 0 for a failed grasp and 1 for a successful grasp. To encourage the robot to grasp objects as fast as possible, we also use a small time-step penalty and a discount factor $\gamma < 1$. The state is represented by an RGB image; the actions are a mixture of continuous 4-DOF tool displacements in $x$, $y$, $z$ and azimuthal rotation, together with discrete actions to open and close the gripper, as well as to terminate the episode.
In simulation, we grasp from a bin containing 8 to 12 randomly generated procedural objects (Figure 2). During the first part of training, we use a scripted exploration policy, which lowers the end effector at a random position at the level of the bin and attempts a grasp. Afterwards, we switch to an $\epsilon$-greedy policy.
In the real world, we train our model from an offline dataset. For evaluation, we attempt 6 consecutive grasps from a bin containing 6 objects, without replacement, repeated across 5 rounds. We perform this experiment in parallel on 7 robots, resulting in a total of 210 grasp attempts. All the robots use a similar object setup consisting of two plastic bottles, one metal can, one paper bowl, one paper cup, and one paper cup sleeve. In the results section, we report the success rate over the 210 attempts.
VI Results
In this section, we present our results on simulated and real environments. In simulation, we perform both online and offline experiments, while in the real world the training is exclusively offline.
VI-A Simulated Environment
Figure 3 shows the mean success rate as a function of the global training step, together with the standard deviation across five runs, for QT-Opt, Q2R-Opt and Q2F-Opt. Because Q2-Opt and QT-Opt are distributed systems, the global step does not directly correspond to the number of environment episodes used by the models during training. Therefore, to understand the sample efficiency of the algorithms, we also include in Figure 4 the success rate as a function of the total number of environment episodes added to the buffer.
We find the results to be aligned with the findings from the arcade-game environments. Q2F-Opt achieves the best grasp success rate of all the considered methods, stabilizing around 92.8%, while also being significantly more sample efficient than QT-Opt. Q2R-Opt exhibits intermediate performance. Despite being less sample efficient than Q2F-Opt, it still learns significantly faster than QT-Opt in the late stages of training. The final performance statistics are given in full detail in Table I.
TABLE I
Model           | Mean Success | Success Std | Median Success
QT-Opt          | 0.903        | 0.005       | 0.903
Q2R-Opt (Ours)  | 0.923        | 0.006       | 0.924
Q2F-Opt (Ours)  | 0.928        | 0.001       | 0.928
VI-B Risk-Sensitive Policies
We evaluate a series of risk distortion measures with different parameter settings in simulation. Figure 5 shows the success rate for the various measures used in Q2F-Opt. We notice that risk-averse policies (Wang(-0.75), Pow(-2), CVaR) are generally more stable in the late stages of training and achieve a higher success rate. Pow(-2) remarkably achieves a 95% grasp success rate. However, being too conservative can also be problematic. In particular, the CVaR policy becomes more vulnerable to the locally optimal behaviour of stopping immediately (which does not incur any reward penalty). This makes its performance fluctuate throughout training, even though it ultimately obtains a good final success rate. Table II gives the complete final success rate statistics.
TABLE II
Model                | Success | Std   | Median
Q2F-Opt CVaR(0.25)   | 0.933   | 0.013 | 0.941
Q2F-Opt CVaR(0.4)    | 0.938   | 0.008 | 0.938
Q2F-Opt CPW(0.71)    | 0.928   | 0.003 | 0.925
Q2F-Opt Wang(0.75)   | 0.898   | 0.012 | 0.893
Q2F-Opt Wang(-0.75)  | 0.942   | 0.007 | 0.944
Q2F-Opt Pow(-2.0)    | 0.950   | 0.004 | 0.952
VI-C Real-World Environment
For the real-world evaluation, we train our models offline from a large dataset of real-world experiences collected over many months using a multitude of control policies. The chaotic physical interactions specific to real-world environments and the diversity of the policies used to gather the experiences make this experiment an ideal scenario for applying distributional RL. Furthermore, this experiment is also of practical importance for robotics, since any increase in grasp success rate obtained from offline data reduces the amount of costly online training required to obtain a good policy.
TABLE III
Model                | Grasp Success Rate
QT-Opt               | 70.00%
Q2R-Opt              | 79.50%
Q2F-Opt              | 82.00%
Q2F-Opt Pow(-2)      | 83.81%
Q2F-Opt CVaR(0.4)    | 85.23%
Q2F-Opt Wang(-0.75)  | 87.60%
Q2F-Opt Wang(0.75)   | 78.10%
Q2F-Opt Norm(3)      | 80.47%
Q2F-Opt CPW(0.71)    | 75.71%
We report in Table III the grasp success rate statistics for all the considered models. We find that the best risk-averse version of Q2-Opt achieves a 17.6% higher success rate than QT-Opt. While the real-world evaluation closely matches the model hierarchy observed in simulation, the success rate differences between the models are much larger.
The risk-averse distortion measures of Q2F-Opt have a significant qualitative impact. They show a tendency to reposition the gripper in more favourable positions or to move objects around to make grasping easier. CVaR(0.4), the most conservative metric we tested in the real world, exhibited a particularly interesting behaviour of intentionally dropping poorly grasped objects to attempt a better regrasp. The CVaR policy mainly used this technique when attempting to grasp objects from the corners of the bin, moving them to a more central position. However, a downside of risk-averse policies that we noticed is that, for the difficult-to-grasp paper cup sleeves, the agent often kept searching for an ideal position without actually attempting a grasp. We believe this is an interesting example of the trade-offs between being conservative and risk-seeking. The only tested risk-seeking policy, Wang(0.75), made many high-force contacts with the bin and objects, which often resulted in broken gripper fingers and objects being thrown out of the bin. We showcase some of these qualitative differences between the considered policies in the videos accompanying our paper.
VI-D Batch RL and Exploitation
Recently, Agarwal et al. [Agarwal2019StrivingFS] have argued that most of the advantages of distributional algorithms come from better exploitation. Their results demonstrated that QR-DQN trained offline can achieve performance superior to online C51. Since environment interactions are particularly costly in robotics, reproducing these results in a robotic setting would be beneficial. Therefore, we perform an equivalent experiment in simulation and train the considered models on all the transitions collected during the training of a QT-Opt agent with a final success rate of 90% (Figure 6).
We note that, despite the minor success rate improvements brought by Q2R-Opt and Q2F-Opt, the two models are not capable of reaching the final success rate of the agent that collected the data. We hypothesize this is due to the out-of-distribution action problem [Kumar2019StabilizingOQ], which becomes more prevalent in continuous action spaces.
VII Conclusion
In this work, we have examined the impact that value distributions have on practical robotic tasks. Our proposed method, Q2-Opt, achieved state-of-the-art success rates on simulated and real vision-based robotic grasping tasks, while also being significantly more sample efficient than its non-distributional equivalent, QT-Opt. The success rate improvements brought by Q2-Opt when training only from real offline data could drastically reduce the number of environment interactions required to learn a good robot control policy. Additionally, we have shown how safer reinforcement learning control can be achieved through risk-sensitive policies and reported the range of interesting behaviours these policies produce in practice. As a final contribution, we evaluated the proposed distributional methods in a batch RL setting similar to that of Agarwal et al. [Agarwal2019StrivingFS] and showed that, unfortunately, their findings do not translate to the continuous grasping environment presented in this work.
Acknowledgment
We would like to give special thanks to Ivonne Fajardo and Noah Brown for overseeing the robot operations.
Appendix
VII-A Real-World Setup
Figure 7 illustrates our workspace setup with the objects used for evaluation. We considered a diverse set of objects containing both rigid and soft bodies, presenting different textures, colours and shapes.
VII-B Risk-Sensitive Behaviours
An interesting metric we considered for analyzing the behaviour of different risk-sensitive policies is the number of broken gripper fingers throughout the entire evaluation process. Figure 8 plots these numbers. Even though we do not have a statistically significant number of samples, we believe this data provides some minimal insight into the behaviour of these policies.
VII-C Offline Evaluation
Besides the dataset obtained from the training of a model, we also analyze our distributional methods on two more fixed datasets of transitions:

- one collected by a scripted stochastic exploration policy;
- one produced by a near-optimal policy.
Figure 9 plots the results for the first scenario, and Figure 10 for the second. Perhaps surprisingly, the models achieve a higher success rate on the scripted exploration dataset than on the dataset collected during training. Perhaps even more surprisingly, the models fail to learn on the dataset collected by the near-optimal policy. These results suggest that the dataset of transitions needs to have high entropy for effective offline learning.