Supervised Policy Update
We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.
Quan Ho Vuong, Yiming Zhang, Keith W. Ross. New York University Abu Dhabi, New York University, New York University Shanghai.
Preprint. Work in progress.
The policy gradient problem in deep reinforcement learning can be informally defined as seeking a parameterized policy π_θ that produces a high expected reward J(π_θ). The parameterized policy is realized with a neural network, and stochastic gradient ascent with back-propagation is used to optimize the parameters. An issue that plagues traditional policy gradient methods is poor sample efficiency schulman2015trust (); wang2016sample (); wu2017scalable (); schulman2017proximal (). In algorithms such as REINFORCE williams1992simple (), new samples are needed for every small gradient step. In environments for which generating trajectories is expensive (such as robotic environments), sample efficiency is of central concern. The sample efficiency problem can be informally stated as follows: beginning with the current policy π_k, and using only trajectories generated from π_k, obtain a new policy that improves on π_k as much as possible.
Several papers have addressed the sample efficiency problem by considering candidate new policies that are close to the original policy π_k schulman2015trust (); wu2017scalable (); achiam2017constrained (); schulman2017proximal (). Intuitively, if the candidate policy π is far from the original policy π_k, then the information from the samples (states visited, actions taken, and the estimated advantage values) loses its relevance. This guideline seems reasonable in principle, but it requires a notion of closeness between two policies. One natural approach is to define a distance or divergence d(π, π_k) between the current policy π_k and the candidate new policy π, and then attempt to solve the constrained optimization problem:
Here the objective in (1) attempts to maximize the improvement in performance of the updated policy over the current policy, and the constraint (2) ensures that the resulting policy is near the policy that was used to generate the data. The bound δ is a hyper-parameter that may be annealed over time.
We propose a new methodology, called Supervised Policy Update (SPU), for the sample efficiency problem. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a general methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be studied using this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.
We consider a Markov Decision Process (MDP) (S, A, r, P, μ), where S is the state space, A is the action space, r(s, a) is the reward function, P(s′|s, a) is the probability of transitioning to s′ from s after taking action a, and μ is the initial state distribution over S. Let π denote a policy, let Π be the set of all policies, and let the expected discounted reward be:

J(π) = E_{τ∼π} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ]
where γ ∈ (0, 1) is a discount factor, τ = (s_0, a_0, s_1, a_1, …) is a sample trajectory, and E_{τ∼π} is the expectation with respect to the probability of τ under policy π. Let A^π(s, a) denote the advantage function for policy π schulman2015high (). Deep reinforcement learning considers a set of parameterized policies Π_Θ = {π_θ : θ ∈ Θ} ⊂ Π, where each parameterized policy π_θ is defined by a neural network called the policy network. In this paper, we will consider optimizing over the parameterized policies in Π_Θ as well as over the non-parameterized policies in Π.
One popular approach to maximizing J(π_θ) over Π_Θ is to apply stochastic gradient ascent. The gradient of J(π_θ) evaluated at a specific θ can be shown to be williams1992simple (); sutton2000policy (); schulman2015high ():

∇_θ J(π_θ) = E_{τ∼π_θ} [ Σ_t ∇_θ log π_θ(a_t|s_t) A^{π_θ}(s_t, a_t) ]    (4)
To obtain an estimate of the gradient, we can sample m finite-length trajectories τ_1, …, τ_m from π_θ, and approximate (4) as:

ĝ = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T_i} ∇_θ log π_θ(a_t^i|s_t^i) Â(s_t^i, a_t^i)    (5)
where T_i is the length of the i-th trajectory, and Â(s, a) is an approximation of A^{π_θ}(s, a) obtained from a critic network. Using the approximate advantage in the gradient estimator introduces a bias but has the effect of lowering the variance (konda2000actor, ; mnih2016asynchronous, ).
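As a concrete, purely illustrative sketch of the estimator (5), consider a one-parameter Bernoulli policy, for which the score function is available in closed form; the function names and sample data below are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_log_pi(theta, a):
    # For a Bernoulli policy with pi(a=1 | theta) = sigmoid(theta),
    # the score function is d/dtheta log pi(a | theta) = a - sigmoid(theta).
    return a - sigmoid(theta)

def pg_estimate(theta, samples):
    # Sample average of grad-log-prob times estimated advantage,
    # i.e. the form of the estimator (5) for this one-parameter policy.
    return sum(grad_log_pi(theta, a) * adv for a, adv in samples) / len(samples)

# Hypothetical data: action 1 saw positive advantages, action 0 negative,
# so the estimated gradient pushes theta up (action 1 becomes more likely).
samples = [(1, 1.0), (0, -1.0), (1, 0.5), (0, -0.5)]
g_hat = pg_estimate(0.0, samples)
```

A single gradient step would then move θ in the direction of `g_hat`; in practice the policy is a neural network and the score function is computed by back-propagation rather than in closed form.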
Denote d^π for the future state probability distribution for policy π, and denote π(·|s) for the probability distribution over the action space when in state s and using policy π. Further denote D_KL(π_k ∥ π)[s] for the KL divergence from the distribution π_k(·|s) to the distribution π(·|s), and denote

D̄_KL(π_k ∥ π) = E_{s∼d^{π_k}} [ D_KL(π_k ∥ π)[s] ]

for the “aggregated KL divergence”.
2.1 Approximations for the Sample Efficiency Problem
For the sample efficiency problem, the objective J(π) is typically approximated using samples generated from π_k schulman2015trust (); achiam2017constrained (); schulman2017proximal (). (Although importance sampling can be used to form an unbiased estimator of J(π), the estimator has many product terms, which can lead to numerical instabilities degris2012off ().) One of two different approaches is typically used to approximate J(π) using samples from π_k. The first approach is to make a first-order approximation of J(π_θ) in the vicinity of θ_k peters2008natural (); peters2008reinforcement (); schulman2015trust ():

J(π_θ) ≈ J(π_{θ_k}) + ĝ^T (θ − θ_k)    (7)
where ĝ is the sample estimate (5). The second approach, which applies to all policies π ∈ Π and not just to parameterized policies π_θ, is to approximate the state distribution d^π with d^{π_k}, giving the approximation (achiam2017constrained, ; schulman2017proximal, ):

L(π) = E_{s∼d^{π_k}, a∼π_k} [ (π(a|s)/π_k(a|s)) A^{π_k}(s, a) ]    (8)
To estimate the expectation in (8), as in (5), we generate trajectories of finite length from π_k, create estimates of the advantage values using a critic network, and then form a sample average. There is a well-known bound on the error of the approximation (8) kakade2002approximately (); achiam2017constrained (). Furthermore, the approximation matches J(π_θ) to first order with respect to the parameter θ achiam2017constrained ().
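The sample-average form of the approximation (8) can be sketched as follows; the function name and inputs are illustrative, with the ratios π(a_i|s_i)/π_k(a_i|s_i) and advantage estimates assumed to be precomputed:

```python
def surrogate_objective(ratios, advantages):
    # Sample average of (pi(a|s) / pi_k(a|s)) * A_hat(s, a): the
    # importance-ratio form of the approximation (8), estimated from
    # trajectories generated under the current policy pi_k.
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

# If pi equals pi_k, every ratio is 1 and the estimate reduces to the
# mean estimated advantage.
same_policy_value = surrogate_objective([1.0, 1.0, 1.0], [0.5, -0.3, -0.2])
```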
3 Related Work
Natural gradient was proposed by Amari amari1998natural () and first applied to policy gradients by Kakade kakade2002natural (). Instead of following the direction of the gradient in Euclidean space, the Natural Policy Gradient (NPG) method follows the direction of steepest ascent in the policy space, which is typically a high-dimensional manifold. This is done by pre-multiplying the policy gradient with the inverse of the Fisher information matrix.
The goal of TRPO schulman2015trust (); peters2008natural (); peters2008reinforcement (); advanced_pg_joshua () is to solve the sample efficiency problem (1)-(2) with d(π, π_k) = D̄_KL(π_k ∥ π), i.e., to use the weighted KL divergence for the policy proximity constraint (2). TRPO addresses this problem in the parameter space Θ. First, it uses the first-order approximation (7) for the objective and a similar second-order approximation for the constraint. Second, it uses samples from π_k to form estimates of these two approximations. Third, using these estimates (which are functions of θ), it solves for the optimal θ. The optimal θ is a function of ĝ and of the sample average of the Hessian of the KL divergence evaluated at θ_k. TRPO takes an additional step by limiting the magnitude of the update to ensure D̄_KL(π_k ∥ π_θ) ≤ δ (i.e., checking that the sample-average estimate of the proximity constraint is met without the second-order approximation).
Actor-Critic using Kronecker-Factored Trust Region (ACKTR) wu2017scalable () proposed using Kronecker-factored approximate curvature (K-FAC) to update both the policy gradient and critic terms, giving a more computationally efficient method of calculating the natural gradients. ACER linearizes the KL divergence constraint and maintains an average policy network to enforce the KL divergence constraint rather than using the current policy π_k, leading to significant performance improvements for actor-critic methods ACER ().
PPO schulman2017proximal () takes a very different approach from TRPO. In order to obtain the new policy π_{k+1}, PPO seeks to maximize the objective:

E [ min( r_θ(s, a) Â(s, a), clip(r_θ(s, a), 1 − ε, 1 + ε) Â(s, a) ) ],  where r_θ(s, a) = π_θ(a|s)/π_k(a|s)
In the process of going from π_k to π_{k+1}, PPO makes many gradient steps while only using the data from π_k. It has been shown to have excellent sample-efficiency performance. To gain some insight into the PPO objective, note that without the clipping, it is simply the approximation (8) (while also removing the discounting and using a finite horizon). The clipping is analogous to the constraint (2) in that its goal is to keep π_θ close to π_k. Indeed, the clipping can be seen as an attempt at keeping π_θ(a|s) from becoming either much larger than (1 + ε) π_k(a|s) or much smaller than (1 − ε) π_k(a|s). Thus, although the PPO objective does not squarely fit into the optimization framework (1)-(2), it is quite similar in spirit.
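A minimal sketch of the per-sample clipped term, assuming the standard PPO form with clipping parameter ε = 0.2:

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    # Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    # When A > 0 the term caps the reward for pushing r above 1 + eps;
    # when A < 0 it caps the reward for pushing r below 1 - eps.
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * adv, clipped * adv)
```

The full objective is the sample average of this term over the data collected from π_k; the `min` makes the clipping one-sided, so the objective never rewards moving the ratio past the clip boundary in the advantageous direction.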
4 Optimizing in the Policy Space
As mentioned in Section 3, TRPO uses first- and second-order approximations to reformulate the sample efficiency problem (1)-(2) as a constrained optimization problem in the parameter space Θ, and then finds the parameter θ that optimizes the approximated problem.
The approach proposed in this paper is to first determine (or partially determine) the optimal policy π* in the larger non-parameterized policy space Π. We refer to such an optimal policy as the optimal target policy, and to the values π*(a|s) as the optimal targets. After determining the targets in the non-parameterized space, we then try to find a parameterized policy in Π_Θ that is close to the targets.
In this section, we consider finding the optimal target policy. Specifically, we consider the constrained MDP problem:

maximize_{π∈Π} L(π)    (10)
subject to d(π, π_k) ≤ δ    (11)
Note that π is not restricted to the set of parameterized policies Π_Θ. Also note, as is common practice, we are using an approximation for the objective function, namely the approximation (8). However, unlike TRPO, we are not approximating the constraint (2).
4.1 Solving TRPO MDP Problems in the Policy Space
4.1.1 Aggregated TRPO Problem
Here we take d(π, π_k) = D̄_KL(π_k ∥ π), in which we use the identity D̄_KL(π_k ∥ π) = E_{s∼d^{π_k}} [ D_KL(π_k ∥ π)[s] ].
The following result provides the structure of the optimal TRPO policy in the policy space: for each state s, the optimal policy takes the form π*(a|s) ∝ π_k(a|s) exp(Â(s, a)/λ) for some λ > 0 (Theorem 1).
As a consequence, for any two actions a and a′ we have π*(a|s)/π*(a′|s) = (π_k(a|s)/π_k(a′|s)) exp((Â(s, a) − Â(s, a′))/λ). This result indicates that, for a fixed s, the optimal solution in the policy space has targets that grow exponentially with respect to the advantage values. Therefore, if Â(s, a) is larger than Â(s, a′), then the target for (s, a) will be exponentially larger than the target for (s, a′).
This is the standard maximum-entropy problem ziebart2008maximum (); schulman2017equivalence (); haarnoja2018soft () in reinforcement learning. Its solution is given by (15) with λ chosen so that the constraint (17) is met with equality.
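For a finite action space, the exponential structure of the optimal targets can be computed as follows for a given multiplier λ (here `lam`); this is an illustrative sketch, not the paper's implementation:

```python
import math

def exponential_targets(pi_k, advantages, lam):
    # pi_star(a|s) proportional to pi_k(a|s) * exp(A_hat(s, a) / lam),
    # normalized over the (finite) action space. lam acts as a
    # temperature: a smaller lam gives a more aggressive update.
    unnorm = [p * math.exp(a / lam) for p, a in zip(pi_k, advantages)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two actions with equal prior probability; the higher-advantage action
# receives an exponentially larger target.
targets = exponential_targets([0.5, 0.5], [1.0, -1.0], lam=1.0)
```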
Instead of constraining the forward KL divergence, one could instead constrain the backward KL divergence, i.e., D_KL(π ∥ π_k)[s]. In this case, the optimization problem again decomposes, and the optimal targets are obtained by solving a simple optimization problem.
4.1.2 Solving Disaggregated TRPO MDP Problem in the Policy Space
An alternative way of formulating the TRPO problem is to require D_KL(π_k ∥ π)[s] ≤ δ for all states s, rather than using the aggregate constraint D̄_KL(π_k ∥ π) ≤ δ. In fact, (schulman2015trust, ) states that this alternative “disaggregated-constraint” version of the problem is preferable to the aggregated version, but that (schulman2015trust, ) uses the aggregated version for mathematical convenience. It turns out that when optimizing in the policy space, it is easier to solve the disaggregated version than the aggregated version. Indeed, as in the proof of Theorem 1, the optimization problem of maximizing L(π) subject to D_KL(π_k ∥ π)[s] ≤ δ for all s decomposes into fully separate optimization problems, one for each state s:
Note that in this case the constraint (20) uses D_KL(π_k ∥ π)[s], whereas the corresponding “aggregated” problem uses the more complicated D̄_KL(π_k ∥ π). Owing to this simplification, we can explicitly calculate the optimal Lagrange multiplier (as a function of δ).
There is an optimal policy to the disaggregated-constraints TRPO problem which takes the same form as the optimal policy given in Theorem 1. However, in this case, for each given s, we can explicitly obtain the optimal policy by solving two non-linear equations for the two unknowns, the Lagrange multiplier λ and the normalization constant Z. The first equation is obtained from the KL constraint holding with equality, and the second from the requirement that π*(·|s) sums to one (see Appendix A).
Note that even in this disaggregated version of the problem, the optimal policy again has the exponential structure π*(a|s) ∝ π_k(a|s) exp(Â(s, a)/λ) for each fixed state s.
4.2 Solving the PPO-inspired Problem in the Policy Space
Recall from Section 3 that the clipping in PPO can be seen as an attempt at keeping π(a|s) from becoming either much larger than (1 + ε) π_k(a|s) or much smaller than (1 − ε) π_k(a|s). In this subsection, we consider the general problem (10)-(11) with a constraint function under which (11) becomes

(1 − ε) π_k(a|s) ≤ π(a|s) ≤ (1 + ε) π_k(a|s) for all s, a
This problem can be solved explicitly:
For each fixed s, re-order the actions so that Â(s, a) is non-decreasing in a. There is an optimal policy to the PPO-inspired problem which takes the form

π*(a|s) = (1 − ε) π_k(a|s) for a < j and π*(a|s) = (1 + ε) π_k(a|s) for a > j,

where the boundary action j is chosen so that the remaining probability mass is feasible, and where π*(j|s) is set so that Σ_a π*(a|s) = 1.
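For a finite action space, this structure can be sketched with a greedy mass allocation; this reading of the result is illustrative, not the paper's code:

```python
def ppo_inspired_targets(pi_k, advantages, eps):
    # Start every action at the lower bound (1 - eps) * pi_k(a|s), then
    # spend the leftover probability mass on the highest-advantage
    # actions, raising each toward (1 + eps) * pi_k(a|s). One boundary
    # action ends up partially raised so that the targets sum to 1.
    targets = [(1.0 - eps) * p for p in pi_k]
    budget = 1.0 - sum(targets)
    for i in sorted(range(len(pi_k)), key=lambda j: -advantages[j]):
        room = (1.0 + eps) * pi_k[i] - targets[i]
        add = min(room, budget)
        targets[i] += add
        budget -= add
        if budget <= 0.0:
            break
    return targets
```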
Note how strikingly different the optimal policy for the TRPO problem (aggregated or disaggregated) is from the optimal policy for the PPO-inspired problem. In the former (Theorems 1 and 2), the targets grow exponentially as a function of the advantage values, whereas in the latter (Theorem 3), the targets are bounded between (1 − ε) π_k(a|s) and (1 + ε) π_k(a|s).
5 Supervised Policy Update
We now introduce SPU, a new sample-efficient methodology for deep reinforcement learning. SPU focuses on the non-parameterized policy space Π, first determining the targets that the non-parameterized policy should have. Once the targets are determined, it uses supervised regression to find a parameterized policy that nearly meets the targets. Since there is significant flexibility in how the targets can be defined, SPU is versatile while also providing good sample-efficiency performance.
In SPU, to advance from π_k to π_{k+1} we perform the following steps:
As usual, we first sample trajectories using policy π_k, giving sample data (s_i, a_i, Â_i), i = 1, …, N. Here Â_i is again an estimate of the advantage value A^{π_k}(s_i, a_i), which is obtained from an auxiliary critic network. (For notational simplicity, we henceforth index the samples with i rather than with (i, t) corresponding to the t-th sample in the i-th trajectory.)
For each i, using the advantage Â_i we define a specific target π̃_i for π_θ(a_i|s_i). For example, as discussed below, we can define π̃_i = π*(a_i|s_i), where π* is the optimal policy of one of the constrained MDP problems in Section 4. Alternatively, as discussed below, we can hand-engineer target functions.
We then fit the policy network π_θ to the labeled data (s_i, a_i, π̃_i), i = 1, …, N. Specifically, we solve a supervised regression problem, minimizing:

L(θ) = (1/N) Σ_{i=1}^{N} ℓ(π_θ(a_i|s_i), π̃_i)
where ℓ is a loss function such as the L2 loss.
After a fixed number of passes through the data to minimize L(θ), the resulting θ becomes our θ_{k+1}.
Thus SPU proceeds by solving a series of supervised learning problems, one for each policy update k. Note that SPU does not use traditional policy gradient steps in the parameter space Θ. Instead, SPU focuses on moving from policy to policy in the non-parameterized policy space Π, where each new target policy is approximately realized by the policy network by solving a supervised regression problem.
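The supervised regression loss can be sketched as follows, using an L2 loss between the policy network's probabilities and the targets (the names are illustrative; in practice the minimization is done by stochastic gradient steps on the network parameters):

```python
def spu_regression_loss(probs, targets):
    # L2 supervised loss between the policy network's probabilities
    # pi_theta(a_i | s_i) and the non-parameterized targets pi_tilde_i.
    return sum((p - t) ** 2 for p, t in zip(probs, targets)) / len(targets)

# A perfectly fit network attains zero loss on the labeled data.
perfect_fit = spu_regression_loss([0.6, 0.4], [0.6, 0.4])
```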
In minimizing L(θ), we have considered two approaches. The first is to initialize the policy network with small random weights; the second is to initialize the policy network with θ_k. For both approaches, we have tried using regularization (by putting aside a portion of the labeled data (s_i, a_i, π̃_i) for estimating the validation error). We have found that initializing with θ_k provides the best performance; and when initializing with θ_k, regularization does not seem to help.
5.1 TRPO-inspired targets with disaggregated constraints
where for simplicity we assume Â_i ≠ 0. To derive targets for this case, we estimate the expectation in the objective (27) using the samples and estimated advantage values generated from π_k. Replacing the expectation with its sampled value gives:
First consider the case Â_i > 0, in which case we want to make the target as large as possible. At the optimal solution, the targets take the exponential form of Theorem 2 for some multiplier λ. Substituting this form into the two constraints in (31)-(33) and doing some algebra gives:
The equations (34)-(35) can be readily solved for λ and Z; we then set the target π̃_i accordingly. Now consider the case Â_i < 0, in which case we want to make the target as small as possible. We can again determine λ and Z by solving (34)-(35) with the appropriate restriction.
In summary, for TRPO with disaggregated constraints, for each sample i, the target can be obtained by solving two equations in two unknowns. The procedure is repeated for each of the N samples to obtain the N target values. In the Appendix we show how the targets can be obtained for the aggregated KL constraint.
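For a finite action space, the two unknowns can be found numerically; the sketch below folds the normalization constant Z into a softmax-style normalization and finds λ by bisection on the KL constraint. This is an illustrative method under stated assumptions (strictly positive π_k, KL decreasing in λ), not necessarily the solver used in the paper:

```python
import math

def targets_for_delta(pi_k, advantages, delta, lo=1e-3, hi=1e3, iters=60):
    # pi_star(a) proportional to pi_k(a) * exp(A(a) / lam); normalizing
    # plays the role of Z, and lam is found by bisection so that
    # KL(pi_k || pi_star) = delta. KL shrinks as lam grows (larger lam
    # means a gentler update), so bisection on lam applies.
    def pi_star_at(lam):
        unnorm = [p * math.exp(a / lam) for p, a in zip(pi_k, advantages)]
        z = sum(unnorm)
        return [u / z for u in unnorm]

    def kl_at(lam):
        return sum(p * math.log(p / q) for p, q in zip(pi_k, pi_star_at(lam)))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_at(mid) > delta:
            lo = mid  # constraint violated: soften the update
        else:
            hi = mid
    return pi_star_at(0.5 * (lo + hi))

pi_star = targets_for_delta([0.5, 0.5], [1.0, -1.0], delta=0.01)
```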
5.2 PPO-inspired targets
The optimal solution to this problem gives the targets π̃_i = (1 + ε) π_k(a_i|s_i) for all i with Â_i > 0 and π̃_i = (1 − ε) π_k(a_i|s_i) for all i with Â_i < 0. We refer to these targets as the “default targets”.
5.3 Engineered target functions
With the default targets, for all values of Â_i, including small values, the corresponding target either equals (1 + ε) π_k(a_i|s_i) or (1 − ε) π_k(a_i|s_i). This is counter-intuitive, as we would expect the target ratios to be close to 1 for small values of Â_i. Motivated by the methodology in Section 4 and by the default PPO-inspired targets, we engineer three classes of alternative target functions with the properties: (i) the target ratio equals 1 when Â_i = 0; (ii) the target is a non-decreasing function of Â_i. All three classes of target functions outperform PPO on the MuJoCo domain, demonstrating the robustness of SPU with respect to how the targets are computed. Below are the exact forms of these functions and their respective plots.
Target Function 1:
where is clipped such that .
Target Function 2:
Target Function 3:
where is a tunable hyper-parameter.
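The following stand-in is purely hypothetical and is not one of the paper's three target functions; it merely satisfies the two stated properties (ratio equal to 1 at zero advantage, non-decreasing in the advantage), with `alpha` and `eps` as illustrative hyper-parameters:

```python
def engineered_ratio(adv, eps=0.2, alpha=0.1):
    # Hypothetical target ratio pi_tilde / pi_k: linear in the estimated
    # advantage near zero, clipped to the band [1 - eps, 1 + eps].
    # Satisfies: ratio(0) == 1, and ratio is non-decreasing in adv.
    return max(1.0 - eps, min(1.0 + eps, 1.0 + alpha * adv))
```

A target is then formed as `engineered_ratio(adv_i) * pi_k_prob_i` for each sample; unlike the default targets, small advantages here produce ratios near 1 rather than jumping to a clip boundary.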
6 Experimental Results
We tested each algorithm on seven MuJoCo todorov2012mujoco () simulated robotics tasks implemented in OpenAI Gym openaigym (). We only compare the performance of our algorithms against PPO, since PPO has become a popular baseline. Thus, our results here can be extrapolated to give a rough estimate of how our algorithms perform against other recent algorithms such as TRPO schulman2015trust () and A3C a3c (). The performance of PPO is obtained by running the publicly available OpenAI implementation openai_baselines (). Except for the hyper-parameters in the target functions, we used the same hyper-parameter values as the PPO paper schulman2017proximal () and kept them fixed across the different algorithmic settings.
| Target Function | Improvement over PPO |
| --- | --- |
| default targets | 8% |
| default targets with gradient clipping | 12% |
| Target Function 1 | 16% |
| Target Function 2 | 17% |
| Target Function 3 | % |
As in schulman2017proximal (), each run of an algorithm in one environment is trained with one million time-steps and is scored by averaging the episodic reward over the last 100 episodes. We set the score of the random policy to 0 and use the performance of the random policy to measure the performance improvement of an algorithm over PPO. The relative performance in 7 environments is averaged to produce a single scalar that represents the overall score for one algorithm. For each environment, the relative performance is also averaged over 5 different starting seeds. The source code will be released after the blind review process.
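One plausible reading of this scoring rule (random-policy performance pinned to 0, PPO pinned to 1) can be sketched as follows; the function name and the sample numbers are illustrative:

```python
def normalized_score(alg_score, ppo_score, random_score):
    # Performance relative to PPO after shifting so the random policy
    # scores 0; values above 1 mean the algorithm beats PPO, and
    # (value - 1) is the relative improvement over PPO.
    return (alg_score - random_score) / (ppo_score - random_score)

# Hypothetical rewards: algorithm 1160, PPO 1000, random policy 200
# in one environment -> normalized score 1.2, i.e. a 20% improvement.
score = normalized_score(1160.0, 1000.0, 200.0)
```

Per the text, such per-environment scores are then averaged over the 7 environments and over 5 seeds to produce the single scalar reported in Table 1.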
As shown in Table 1, SPU with Target Function 3 provides an improvement over PPO. Not only is the final average reward of SPU better than that of PPO, it also has higher sample efficiency, as measured by the number of time-steps taken to reach a particular performance level. Figure 2 illustrates that Target Function 1 consistently achieves higher reward than PPO in the latter half of training in 6 out of the 7 MuJoCo environments. PPO makes 10 passes (epochs) through the samples to update the policy. To ensure a fair computational comparison, for all 5 target functions, SPU also makes 10 passes through the samples.
As shown in Figure 3, if we increase the number of passes to 20 (but not the number of samples taken), the improvement of SPU with Target Function 1 climbs further, while PPO with 20 passes performs only marginally better than PPO with 10 passes. Note that performance does not improve for all environments when increasing the number of passes. For PPO, performance actually declines significantly for three of the seven environments, whereas for SPU it declines for only one environment. By limiting the number of epochs in SPU (early stopping), we prevent overfitting to the targets. For completeness and reproducibility, in the Appendix we discuss the implementation details and list the hyper-parameter values.
We developed a novel policy-space methodology, which can be used to compare and contrast various sample-efficient reinforcement learning algorithms, including PPO and different versions of TRPO. The methodology can also be used to study many other forms of constraints, such as constraining the aggregated and disaggregated reverse KL-divergence. We also proposed a new sample-efficient class of algorithms called SPU, for which there is significant flexibility in how we set the targets.
As compared to PPO, our experimental results show that SPU with simple target functions can lead to improved sample-efficiency performance without increasing wall-clock time. In the future, it may be possible to achieve further gains with yet-to-be-explored classes of target functions, annealing the targets, and changing the number of passes through the data.
We would like to acknowledge the extremely helpful support by the NYU Shanghai High Performance Computing Administrator Zhiguo Qi.
Appendix A Solving for the optimal policy in the disaggregated-constraints TRPO problem (Theorem 2)
Appendix B TRPO-Inspired Targets with Aggregated Constraint
Because we do not have an estimate of Â(s, a) for all s and a (and because the state space is usually huge for deep reinforcement learning problems), we cannot easily calculate the expectations in (40)-(41). We instead use the samples (s_i, a_i, Â_i), i = 1, …, N, from π_k to approximate the expectations with their sample averages, resulting in the following optimization problem:
where N is the total number of samples, m is the number of sampled trajectories, and t(i) is the trajectory time-step of the i-th sample. With a change of variables, the above optimization problem becomes
After we solve the above optimization problem for an optimal solution, we set the targets accordingly.
Solving the above optimization problem is a topic for further research. We conjecture that it can be solved as quickly as the conjugate gradient method in TRPO. One possible approach is to first fix per-sample allocations of the KL budget and solve the following disaggregated problem for each sample:
Each of these problems is the disaggregated TRPO problem, which can be rapidly solved, as discussed in Section 5.1. Let V_i denote the optimal value of the i-th disaggregated problem. The resulting optimization problem is then to maximize the sum of the V_i subject to the budget constraint. This problem can then be solved in a hierarchical manner.
Appendix C Performance Graph For Target Function 2 and Target Function 3
Appendix D Implementation Details and Hyperparameters
As in schulman2017proximal (), the policy is parameterized by a fully-connected feed-forward neural network with two hidden layers, each with 64 units and tanh nonlinearities. The policy outputs the mean of a Gaussian distribution with state-independent variable standard deviations, following schulman2015trust (); benchmark_drl_continuous (). The action dimensions are assumed to be independent. The probability of an action is given by the multivariate Gaussian probability density function. The baseline used in the advantage value calculation is parameterized by a similarly sized neural network, trained to minimize the MSE between the sampled states’ TD returns and their predicted values. To calculate the advantage values, we use Generalized Advantage Estimation schulman2015GAE (). States are normalized by subtracting the running mean and dividing by the running standard deviation before being fed to any neural network. The advantage values are normalized by subtracting the batch mean and dividing by the batch standard deviation before being used for the policy update.
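The running normalization described above can be sketched with Welford's online algorithm; this is an illustrative implementation choice, not necessarily the one in the baselines code:

```python
class RunningNormalizer:
    # Maintains a running mean and variance with Welford's online
    # algorithm, and normalizes inputs as (x - mean) / std.
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def normalize(self, x):
        # Sample standard deviation; fall back to centering only
        # while too few observations have been seen.
        std = (self.m2 / max(self.n - 1, 1)) ** 0.5
        return (x - self.mean) / std if std > 0 else x - self.mean
```

Batch normalization of the advantages works the same way, except the mean and standard deviation are computed over the current batch rather than maintained online.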
| Hyper-parameter | Value |
| --- | --- |
| Number of timesteps | 1e6 |
| Optimizer learning rate | 3e-4 |
| Optimizer learning rate anneal schedule | Linearly to 0 |
| Optimizer Adam epsilon | 1e-5 |
| Timesteps per batch | 2048 |
| Number of full passes | 10 |
| Target Function | Hyper-parameter value |
| --- | --- |
| default targets with gradient clipping | 0.48 |
| Target Function 1 | 0.84 |
| Target Function 1, 20 passes | 0.48 |
| Target Function 2 | 2.19 |
| Target Function 3 | 0.8, |
| PPO with 20 passes | 0.12 |
-  John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
-  Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
-  Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5285–5294, 2017.
-  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
-  Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31, 2017.
-  John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
-  Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
-  Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
-  Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 179–186, 2012.
-  Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
-  Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
-  Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
-  Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
-  Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
-  Joshua Achiam. Advanced policy gradient methods. http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_13_advanced_pg.pdf. Accessed: 2018-05-24.
-  Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
-  Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
-  John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2017.
-  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
-  Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
-  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
-  Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
-  Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. https://github.com/openai/baselines, 2017.
-  Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
-  John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.