# A General Family of Robust Stochastic Operators for Reinforcement Learning

Yingdong Lu
IBM Research AI
T.J. Watson Research Center
Yorktown Heights, NY 10598
yingdong@us.ibm.com
&Mark S. Squillante
IBM Research AI
T.J. Watson Research Center
Yorktown Heights, NY 10598
mss@us.ibm.com
&Chai Wah Wu
IBM Research AI
T.J. Watson Research Center
Yorktown Heights, NY 10598
cwwu@us.ibm.com
###### Abstract

We consider a new family of operators for reinforcement learning with the goal of alleviating the negative effects of, and becoming more robust to, approximation or estimation errors. Various theoretical results are established, which include showing on a sample path basis that our family of operators preserves optimality and increases the action gap. Our empirical results illustrate the strong benefits of our family of operators, significantly outperforming the classical Bellman operator and recently proposed operators.


Preprint. Work in progress.

## 1 Introduction

Reinforcement learning has a rich history within the machine learning community as an approach to solving a wide variety of decision-making problems in environments with unknown and unstructured dynamics. Through iterative application of a convergent operator, value-based reinforcement learning generates successive refinements of an initial value function. Q-learning Watkins (1989) is a particular reinforcement learning technique in which the value iteration computations consist of evaluating the corresponding Bellman equation without a model of the environment.

While Q-learning continues to be broadly and successfully used to determine the optimal actions of an agent in reinforcement learning, the development of new Q-learning approaches that improve convergence speed, accuracy and robustness remains of great interest. One approach might be based on having the agent learn optimal actions through the use of optimality conditions which are weaker than the Bellman equation such that value iteration continues to converge to an action-value function associated with an optimal policy, while at the same time increasing the separation between the value-function and $Q$-function limits. Exploiting these weaker conditions for optimality could lead to alternatives to the classical Bellman operator that improve convergence speed, accuracy and robustness in reinforcement learning, especially in the company of estimation or approximation errors.

A recent study by Bellemare et al. (2016) considers the problem of identifying new alternatives to the Bellman operator based on having the new operators satisfy the properties of optimality-preserving, namely convergence to an optimal action policy, and gap-increasing, namely convergence to a larger deviation between the $Q$-values of optimal actions and suboptimal actions. Here the former ensures optimality, while the latter can help the learning algorithm determine the optimal actions faster, more easily, and with fewer errors from mislabeling suboptimal actions. The authors propose a family of operators based on two inequalities between the proposed operator and the Bellman operator, and they show that the proposed family satisfies the properties of optimality-preserving and gap-increasing. Then, after empirically demonstrating the benefits of the proposed operator, the authors Bellemare et al. (2016) raise open fundamental questions with respect to the possibility of weaker conditions for optimality, the statistical efficiency of the proposed operator, and the possibility of a maximally efficient operator.

At the heart of the problem is a fundamental tradeoff between violating the preservation of optimality and increasing the action gap. Although the benefits of increasing the action gap in the presence of approximation or estimation errors are well known Farahmand (2011), increasing the action gap beyond a certain region in a deterministic sense can lead to violations of optimality preservation, thus resulting in value iterations that may not converge to optimal solutions. Our approach is intuitively based on the idea that the action gap can be increased beyond this region for individual value iterations as long as the overall value iterations are controlled in a probabilistic manner that ensures the preservation of optimality in a stochastic sense. In devising a family of operators endowed with these properties, we provide a more general approach that yields greater robustness to approximation or estimation errors.

In this paper we propose a general family of robust stochastic operators, which subsumes the family of operators in Bellemare et al. (2016) as a strict subset, and we address many of the open fundamental questions raised in Bellemare et al. (2016). Our approach is applicable to arbitrary $Q$-value approximation schemes and is based on support to devalue suboptimal actions while preserving the set of optimal policies in a stochastic sense. This makes it possible to increase the action gap between the $Q$-values of optimal and suboptimal actions to a greater extent beyond the aforementioned deterministic region, which can be critically important in practice because of the advantages of increasing the action gap in the company of approximation or estimation errors Farahmand (2011). Since the value-iteration sequence generated under our family of stochastic operators will be based on realizations of random variables, our theoretical results include establishing the fact that the random variables converge almost surely Billingsley (1999) to the same limit that produces the optimal actions. In addition, our theoretical results include showing that our robust stochastic operators are optimality-preserving and gap-increasing on a sample-path basis Billingsley (1999), establishing that our family of operators significantly broadens the set of weaker conditions for optimality over those in Bellemare et al. (2016), and showing that a statistical ordering of the key components of our operators leads to a corresponding ordering of the action gaps. Key implications of these results include: the search space for the maximally efficient operator should be an infinite dimensional space of random variables, instead of the finite space alluded to in Bellemare et al. (2016); and our statistical ordering results can lead to order relationships among the operators in terms of action gaps, which can in turn lead to maximally efficient operators.

We subsequently apply our robust stochastic operators to obtain empirical results for a wide variety of problems in the OpenAI Gym framework Brockman et al. (2016), and then compare these empirical results against those under both the classical Bellman operator and the consistent Bellman operator from Bellemare et al. (2016). These experimental results consistently show that our robust stochastic operators outperform both the Bellman operator and the consistent Bellman operator.

## 2 Preliminaries

We consider a standard reinforcement learning (RL) framework (see, e.g., Bertsekas and Tsitsiklis (1996)) in which a learning agent interacts with a stochastic environment. This interaction is modeled as a discrete-time discounted Markov Decision Process (MDP) given by $(\mathcal{X}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{X}$ is the set of states, $\mathcal{A}$ is the set of actions, $P$ is the transition probability kernel, $R$ is the reward function mapping state-action pairs to a bounded subset of $\mathbb{R}$, and $\gamma \in [0,1)$ is the discount factor. Let $\mathcal{Q}$ and $\mathcal{V}$ denote the set of bounded real-valued functions over $\mathcal{X} \times \mathcal{A}$ and $\mathcal{X}$, respectively. For $Q \in \mathcal{Q}$, we define $V(x) := \max_{a \in \mathcal{A}} Q(x,a)$ and use the same definition for $Q_k$ and $V_k$, and so on. Let $x'$ always denote the next state random variable. For the current state $x$ in which action $a$ is taken, i.e., the pair $(x,a)$, we denote by $P(\cdot\,|\,x,a)$ the conditional transition probability for the next state and we define $\mathbb{E}_P$ to be the expectation with respect to $P(\cdot\,|\,x,a)$.

A stationary policy $\pi$ defines the distribution of control actions given the current state $x$, which reduces to a deterministic policy when the conditional distribution renders a constant action for state $x$; with slight abuse of notation, we always write policy $\pi(x)$. The stationary policy $\pi$ induces a value function $V^\pi(x)$ and an action-value function $Q^\pi(x,a)$, where $V^\pi(x)$ defines the expected discounted cumulative reward under policy $\pi$ starting in state $x$, and $Q^\pi$ satisfies the Bellman equation

$$Q^\pi(x,a) = R(x,a) + \gamma\, \mathbb{E}_P\, Q^\pi(x', \pi(x')). \qquad (1)$$

Our goal is to determine a policy $\pi^*$ that achieves the optimal value function $V^*(x) := \sup_\pi V^\pi(x)$, which also produces the optimal action-value function $Q^*(x,a) := \sup_\pi Q^\pi(x,a)$. Define the Bellman operator $T_B$ pointwise as

$$T_B Q(x,a) := R(x,a) + \gamma\, \mathbb{E}_P \max_{b \in \mathcal{A}} Q(x',b), \qquad (2)$$

or equivalently $T_B Q = R + \gamma\, \mathbb{E}_P V$. The Bellman operator is known (see, e.g., Bertsekas and Tsitsiklis (1996)) to be a contraction mapping in the supremum norm whose unique fixed point coincides with the optimal action-value function, namely

$$Q^*(x,a) = R(x,a) + \gamma\, \mathbb{E}_P \max_{b \in \mathcal{A}} Q^*(x',b),$$

or equivalently $T_B Q^* = Q^*$. This in turn indicates that the optimal policy can be obtained by

$$\pi^*(x) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^*(x,a), \qquad \forall x \in \mathcal{X}.$$
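As a concrete illustration of iterating the Bellman operator to its fixed point, the following minimal sketch runs tabular value iteration on a small MDP; the 2-state, 2-action MDP (`P`, `R`) is a hypothetical example, not one taken from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative values only):
# P[x, a, x'] is the transition probability, R[x, a] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(500):
    V = Q.max(axis=1)        # V(x) = max_b Q(x, b)
    Q = R + gamma * (P @ V)  # Bellman operator T_B applied pointwise, as in (2)

pi = Q.argmax(axis=1)        # greedy policy recovered from the fixed point
```

Since $T_B$ is a $\gamma$-contraction in the supremum norm, the loop converges geometrically to the unique fixed point $Q^*$.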

As noted by Bellemare et al. (2016) and illustrated through a simple example, the optimal state-action value function obtained through the Bellman operator does not always describe the value of stationary policies. Although these nonstationary effects cause no problems when the MDP can be solved exactly, such nonstationary effects in the presence of estimation or approximation errors, which may lead to small differences between the optimal state-action value function and the suboptimal ones, can result in errors in identifying the optimal actions. To address issues of nonstationarity of this and related forms arising in practice, Bellemare et al. (2016) propose the so-called consistent Bellman operator defined as

$$T_C Q(x,a) := R(x,a) + \gamma\, \mathbb{E}_P\big[\, \mathbf{1}_{\{x \neq x'\}} \max_{b \in \mathcal{A}} Q(x',b) + \mathbf{1}_{\{x = x'\}}\, Q(x,a) \,\big], \qquad (3)$$

where $\mathbf{1}_{\{\cdot\}}$ denotes the indicator function. The consistent Bellman operator preserves a local form of stationarity by redefining the action-value function such that, if an action $a$ is taken from the state $x$ and the next state $x'$ equals $x$, then action $a$ is taken again. Bellemare et al. (2016) proceed to show that the consistent Bellman operator yields the optimal policy $\pi^*$, and in particular is both optimality-preserving and gap-increasing, each of which is defined as follows.

###### Definition 2.1 (Bellemare et al. (2016)).

An operator $T$ for $\mathcal{Q}$ is optimality-preserving if, for any $Q_0 \in \mathcal{Q}$ and $x \in \mathcal{X}$, with $Q_{k+1} := T Q_k$, then $\hat V(x) := \lim_{k \to \infty} \max_{a \in \mathcal{A}} Q_k(x,a)$ exists, is unique, $\hat V(x) = V^*(x)$, and for all $a \in \mathcal{A}$,

$$Q^*(x,a) < V^*(x) \;\Longrightarrow\; \limsup_{k \to \infty} Q_k(x,a) < V^*(x). \qquad (4)$$

Moreover, an operator $T$ for $\mathcal{Q}$ is gap-increasing if for all $Q_0 \in \mathcal{Q}$, $x \in \mathcal{X}$ and $a \in \mathcal{A}$, with $Q_{k+1} := T Q_k$ and $V_k(x) := \max_{b \in \mathcal{A}} Q_k(x,b)$, then

$$\liminf_{k \to \infty}\big[V_k(x) - Q_k(x,a)\big] \;\ge\; V^*(x) - Q^*(x,a). \qquad (5)$$

The operator property of optimality-preserving is important because it ensures that at least one optimal action remains optimal and that suboptimal actions remain suboptimal. As suggested above, the operator property of gap-increasing is important from the perspective of robustness when the inequality (5) is strict for at least one pair $(x,a)$. In particular, as the action gap of an operator increases while remaining optimality-preserving, the end result is greater robustness to approximation or estimation errors Farahmand (2011).

## 3 Robust Stochastic Operator

In this section we propose a general family of robust stochastic operators and then establish that this general family of operators is optimality-preserving and gap-increasing. We further show that our family of robust stochastic operators is strictly broader and more general than the family of consistent Bellman operators, while also addressing some of the fundamental open questions raised in Bellemare et al. (2016) concerning weaker conditions for optimality-preserving, statistical efficiency of new operators, and maximally efficient operators. It is important to note, as also emphasized in Bellemare et al. (2016), that our approach can be extended to variants of the Bellman operator such as SARSA Rummery and Niranjan (1994), policy evaluation Sutton (1988) and fitted $Q$-iteration Ernst et al. (2005).

For all $x \in \mathcal{X}$, $a \in \mathcal{A}$, $Q_0 \in \mathcal{Q}$, and $\{\beta_k\}$ a sequence of independent nonnegative random variables with finite support and expectation $\mathbb{E}[\beta_k] = \bar\beta$, we define

$$T_{\beta_k} Q_k(x,a) := R(x,a) + \gamma\, \mathbb{E}_P \max_{b \in \mathcal{A}} Q_k(x',b) - \beta_k\big(V_k(x) - Q_k(x,a)\big), \qquad (6)$$

or equivalently $T_{\beta_k} Q_k = T_B Q_k - \beta_k(V_k - Q_k)$. Then the members of the general family of robust stochastic operators include the $T_{\beta_k}$ defined over all probability distributions for the sequences $\{\beta_k\}$ with finite support and finite mean $\bar\beta$. Furthermore, we define $\mathbb{T}$ to be the general family of robust stochastic operators comprising all operators $T$ such that there exists a sequence $\{\beta_k\}$ for which the following inequalities hold

$$T_B Q(x,a) - \beta_k\big(V_k(x) - Q_k(x,a)\big) \;\le\; T Q(x,a) \;\le\; T_B Q(x,a), \qquad \forall x \in \mathcal{X},\, a \in \mathcal{A}. \qquad (7)$$

It is obvious that these are strictly weaker conditions than those identified in Bellemare et al. (2016); and since realizations of $\beta_k$ can clearly take on values outside of $[0,1)$, the family of operators $\mathbb{T}$ subsumes the family of operators identified in Bellemare et al. (2016). Our main theorem establishes that the general family of robust stochastic operators is also optimality-preserving and gap-increasing.
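To make the stochastic operator concrete, the following sketch applies one $T_{\beta_k}$ backup at a single state-action pair in the tabular setting; the uniform distribution for $\beta_k$ and the toy MDP values are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rso_backup(Q, R, P, x, a, gamma=0.9, beta_range=(0.0, 1.0)):
    """One application of the robust stochastic operator (6) at (x, a):
    the Bellman backup minus a random multiple of the action gap V(x) - Q(x, a)."""
    beta = rng.uniform(*beta_range)            # realization of the nonnegative beta_k
    V = Q.max(axis=1)                          # V_k(x) = max_b Q_k(x, b)
    bellman = R[x, a] + gamma * (P[x, a] @ V)  # T_B Q(x, a)
    return bellman - beta * (V[x] - Q[x, a])

# Illustrative toy data (assumptions, not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
Q = np.array([[0.5, 0.2], [0.1, 0.8]])
val = rso_backup(Q, R, P, x=0, a=1)
```

Because $\beta_k \ge 0$ and $V(x) \ge Q(x,a)$, the returned value never exceeds the Bellman backup, matching the right-hand inequality in (7).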

###### Theorem 3.1.

Let $T_B$ be the Bellman operator defined in (2) and $T_{\beta_k}$ the robust stochastic operator defined in (6). Considering the sequence $Q_{k+1} = T_{\beta_k} Q_k$ with $Q_0 \in \mathcal{Q}$, the conditions of optimality-preserving and gap-increasing hold almost surely, and therefore the operator $T_{\beta_k}$ is both optimality-preserving and gap-increasing almost surely. Moreover, all operators in the family $\mathbb{T}$ are optimality-preserving and gap-increasing almost surely.

###### Proof.

We first consider aspects of the optimality-preserving properties of the family of robust stochastic operators. Since the conditions of Lemma A.1 due to Bellemare et al. (2016) hold, it follows that the sequence $V_k(x)$ converges asymptotically on a sample path basis Billingsley (1999) such that $\lim_{k \to \infty} V_k(x) \le V^*(x)$, for all $x \in \mathcal{X}$. Defining $\tilde Q(x,a) := \limsup_{k \to \infty} Q_k(x,a)$ and applying a derivation from the proof of Theorem 2 in Bellemare et al. (2016) for each sample path, we can conclude that

$$\tilde Q(x,a) \;\le\; \limsup_{k \to \infty} T_B Q_k(x,a) = \limsup_{k \to \infty}\Big[R(x,a) + \gamma\, \mathbb{E}_P \max_{b \in \mathcal{A}} Q_k(x',b)\Big] \;\le\; R(x,a) + \gamma\, \mathbb{E}_P\Big[\max_{b \in \mathcal{A}} \limsup_{k \to \infty} Q_k(x',b)\Big] = T_B \tilde Q(x,a) \qquad (8)$$

holds almost surely Billingsley (1999).

Meanwhile, we have

$$Q_{k+1}(x,a) = T_{\beta_k} Q_k(x,a) = T_B Q_k(x,a) - \beta_k\big[V_k(x) - Q_k(x,a)\big].$$

Taking the conditional expectation with respect to the filtration $\mathcal{F}_k$ generated by the history of the iteration through step $k$ renders

$$\mathbb{E}[Q_{k+1}(x,a)\,|\,\mathcal{F}_k] = \mathbb{E}[T_{\beta_k} Q_k(x,a)\,|\,\mathcal{F}_k] = T_B Q_k(x,a) - \bar\beta\,\big[V_k(x) - Q_k(x,a)\big],$$

or equivalently

$$Q_k(x,a) + \mathbb{E}[Q_{k+1}(x,a) - Q_k(x,a)\,|\,\mathcal{F}_k] = \mathbb{E}[T_{\beta_k} Q_k(x,a)\,|\,\mathcal{F}_k] = T_B Q_k(x,a) - \bar\beta\,\big[V_k(x) - Q_k(x,a)\big].$$

Since $T_{\beta_k} Q_k(x,a) \le T_B Q_k(x,a)$, we know from Lemma A.1 that $V_k(x)$ converges on each sample path. Hence, for any $\epsilon > 0$ on each sample path, we observe that $\mathbb{E}[Q_{k+1}(x,a) - Q_k(x,a)\,|\,\mathcal{F}_k] \le \epsilon$ when $k$ is sufficiently large. We therefore have, for sufficiently large $k$,

$$Q_k(x,a) + \epsilon \;\ge\; \mathbb{E}[T_{\beta_k} Q_k(x,a)\,|\,\mathcal{F}_k] = T_B Q_k(x,a) - \bar\beta\,\big[V_k(x) - Q_k(x,a)\big].$$

Because $\epsilon$ is arbitrary, we can conclude that

$$Q_k(x,a) \;\ge\; \mathbb{E}[T_{\beta_k} Q_k(x,a)\,|\,\mathcal{F}_k] = T_B Q_k(x,a) - \bar\beta\,\big[V_k(x) - Q_k(x,a)\big]$$

holds for sufficiently large $k$. Taking the limit superior on both sides, we obtain

$$\tilde Q(x,a) \;\ge\; T_B \tilde Q(x,a) - \bar\beta\, \tilde V(x) + \bar\beta\, \tilde Q(x,a), \qquad (9)$$

which in combination with (8) leads to the conclusion that $\tilde V(x) = V^*(x)$ almost surely.

Next, to prove that $T_{\beta_k}$ is gap-increasing, the above arguments render $\lim_{k \to \infty} V_k(x) = V^*(x)$ on a sample path basis, and thus (5) is equivalent to

$$\limsup_{k \to \infty} Q_k(x,a) \;\le\; Q^*(x,a)$$

almost surely. This inequality follows on a sample path basis from $T_{\beta_k} Q_k(x,a) \le T_B Q_k(x,a)$ by definition and Lemma A.1, and therefore we have the desired result for the operators $T_{\beta_k}$. Furthermore, it can be readily verified that the above arguments can be similarly applied to cover all of the operators in $\mathbb{T}$.

Lastly, from the above results of (5) and $\tilde V(x) = V^*(x)$ almost surely, it follows that (4) also holds almost surely for $T_{\beta_k}$ as well as all operators in $\mathbb{T}$, thus completing the proof. ∎

The definition of $\mathbb{T}$ and Theorem 3.1 significantly enlarge the set of optimality-preserving and gap-increasing operators identified in Bellemare et al. (2016). In particular, our new sufficient conditions for optimality-preserving operators imply that significant deviation from the Bellman operator is possible without loss of optimality. More importantly, the definition of $\mathbb{T}$ and Theorem 3.1 imply that the search space for maximally efficient operators should be an infinite dimensional space of random variables, instead of the finite dimensional space that is alluded to in Bellemare et al. (2016). We now establish results on certain statistical properties of the sequences $\{\beta_k\}$ within our general family of robust stochastic operators, which offer key relational insights into important orderings of different operators in $\mathbb{T}$ in terms of their action gaps. This can then be exploited in searching for and attempting to find maximally efficient operators in practice.

###### Theorem 3.2.

Suppose $Q_k$ and $\hat Q_k$ are respectively updated with two different robust stochastic operators $T_{\beta_k}$ and $T_{\hat\beta_k}$ that are distinguished by $\beta_k$ and $\hat\beta_k$ satisfying $\mathbb{E}[\beta_k] = \mathbb{E}[\hat\beta_k]$ and $\mathrm{Var}[\beta_k] \ge \mathrm{Var}[\hat\beta_k]$; namely, the two sequences share the same mean but have ordered variances. Then we have $\mathrm{Var}[Q_{k+1}] \ge \mathrm{Var}[\hat Q_{k+1}]$.

###### Proof.

The desired result can be readily seen from

$$\mathrm{Var}[Q_{k+1}] = \mathbb{E}\big[\mathrm{Var}[Q_{k+1}\,|\,Q_k]\big] + \mathrm{Var}\big[\mathbb{E}[Q_{k+1}\,|\,Q_k]\big] = \mathrm{Var}[\beta_k]\, \mathbb{E}\big[(V_k(x) - Q_k(x,a))^2\big] + \mathrm{Var}\big[\mathbb{E}[Q_{k+1}\,|\,Q_k]\big]$$

and

$$\mathrm{Var}[\hat Q_{k+1}] = \mathbb{E}\big[\mathrm{Var}[\hat Q_{k+1}\,|\,\hat Q_k]\big] + \mathrm{Var}\big[\mathbb{E}[\hat Q_{k+1}\,|\,\hat Q_k]\big] = \mathrm{Var}[\hat\beta_k]\, \mathbb{E}\big[(\hat V_k(x) - \hat Q_k(x,a))^2\big] + \mathrm{Var}\big[\mathbb{E}[\hat Q_{k+1}\,|\,\hat Q_k]\big]. \qquad ∎$$

The theorem establishes that a larger variance for $\beta_k$ in fact leads to a larger variance for $Q_{k+1}$. We know that, in the limit, the optimal action will maintain its state-action value function. Then, when $k$ is sufficiently large, we can expect that the state-value function for the optimal action will be very close to the optimal value. In this case, a larger variance implies that the smaller suboptimal values will have larger probability, and thus they can be understood to have a larger action gap.

The results of Theorem 3.2 are consistent with our observations from the numerical experiments in Section 4, where the operator associated with a sequence $\beta_k$ drawn from a uniform distribution outperforms the operator associated with a constant sequence $\beta_k \equiv \bar\beta$ having the same mean.
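The variance ordering of Theorem 3.2 can be checked numerically. The sketch below compares one RSO update under a uniform $\beta_k$ against a constant $\beta_k$ with the same mean; the Bellman target and action-gap values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
bellman, gap = 5.0, 2.0    # hypothetical T_B Q_k(x,a) and V_k(x) - Q_k(x,a)

beta_unif = rng.uniform(0.0, 1.0, size=n)  # mean 1/2, variance 1/12
beta_const = np.full(n, 0.5)               # same mean, zero variance

q_unif = bellman - beta_unif * gap
q_const = bellman - beta_const * gap

# Same conditional mean, but the uniform-beta update has strictly larger
# variance, consistent with Theorem 3.2.
print(q_unif.mean(), q_const.mean())  # both near 4.0
print(q_unif.var() > q_const.var())   # True
```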

## 4 Experimental Results

Within the general RL framework of interest, we consider a standard, yet generic, form of $Q$-learning so as to cover the various experimental programs examined in this section. Specifically, for all $x \in \mathcal{X}$, $a \in \mathcal{A}$, and an operator of interest $T$, we consider the sequence of action-value $Q$-functions $\{Q_k\}$ based on the following generic update rule:

$$Q_{k+1}(x,a) = (1 - \alpha_k)\, Q_k(x,a) + \alpha_k\, T Q_k(x,a), \qquad (10)$$

where $\alpha_k$ is the learning rate for iteration $k$. Our empirical comparisons comprise the Bellman operator $T_B$, the consistent Bellman operator $T_C$, and instances of our family of robust stochastic operators $T_{\beta_k}$, denoted hereafter as RSO.
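In code, the generic update rule (10) is simply a convex combination of the current estimate and the operator target; a minimal sketch (the target value would come from whichever operator $T$ is under study):

```python
import numpy as np

def q_update(Q, x, a, target, alpha):
    """Update rule (10): Q_{k+1}(x,a) = (1 - alpha) * Q_k(x,a) + alpha * T Q_k(x,a)."""
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target

Q = np.zeros((3, 2))
q_update(Q, x=1, a=0, target=10.0, alpha=0.5)  # Q[1, 0] moves halfway toward 10.0
```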

We conduct numerous experiments on several well-known problems using the OpenAI Gym framework Brockman et al. (2016), namely Mountain Car, Acrobot, Cart Pole and Lunar Lander. This collection of problems spans a wide variety of RL examples with different characteristics, dimensions, parameters, and so on; in each case, the state space is continuous and discretized to a finite set of states. For every problem, the specific Q-learning algorithms considered are defined as in (10) where the appropriate operator of interest $T_B$, $T_C$ or $T_{\beta_k}$ is substituted for $T$; at each timestep, (10) is applied to a single point of the $Q$-function at the current state and action. The experiments for every problem from the OpenAI Gym were run using the existing code found at Vilches (); Alzantot () exactly as is, with the sole change comprising the replacement of the Bellman operator in the code with corresponding implementations of either the consistent Bellman operator or the RSO; see Appendix B for the corresponding Python code. Multiple experimental trials are run for each problem, where we ensured the setting of the random starting state to be the same in each experimental trial for all three types of operators by initializing them with the same random seed. We observe that, for different problems and different variants of the Q-learning algorithm, simply replacing the Bellman operator or the consistent Bellman operator with the robust stochastic operator generally results in improved performance.
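Since the Gym state spaces are continuous, the tabular updates above rely on discretizing each state dimension into a finite number of bins. The following sketch shows one way such a discretizer might look; the bin counts and bounds are illustrative assumptions, not those of the cited code.

```python
import numpy as np

def make_discretizer(low, high, bins):
    """Map a continuous state vector to a tuple of per-dimension bin indices."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    # Interior bin edges for each dimension (b bins need b - 1 interior edges).
    edges = [np.linspace(l, h, b + 1)[1:-1] for l, h, b in zip(low, high, bins)]

    def discretize(state):
        return tuple(int(np.digitize(s, e)) for s, e in zip(state, edges))

    return discretize

# Example with Mountain Car's 2-dimensional state (position, velocity).
disc = make_discretizer(low=[-1.2, -0.07], high=[0.6, 0.07], bins=[20, 20])
idx = disc([-0.5, 0.0])  # a tuple of two bin indices, each in range(20)
```

The resulting index tuple can be used directly as a key into a tabular $Q$-function.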

### 4.1 Mountain Car

This problem was first discussed in Moore (1990). The state vector is 2-dimensional (position and velocity) with a total of three possible actions, and the score represents the number of timesteps needed to solve the problem. We ran experimental trials over a set of training episodes, each consisting of a bounded number of steps; the problem is then solved over a set of test episodes using the policy obtained from the Q-function training. In both cases, the goal is to minimize the score. The RSO considered in each experimental trial consists of $\beta_k$ uniformly distributed over a bounded interval.

For the training phase, Figure 1(a) plots the score, averaged over moving windows of episodes across the trials, as a function of the number of episodes. We observe that the average score under the RSO exhibits much better performance than under the Bellman operator or the consistent Bellman operator. Moreover, as can be seen from the smoothness of the curves in Figure 1(a), the standard deviation is relatively small for all three operators. For the testing phase, we compare the average score and the standard deviation of the score over the experimental trials for the Bellman operator, the consistent Bellman operator, and the RSO. Here we observe that both the average score and its standard deviation under the RSO exhibit better performance than under the Bellman operator or the consistent Bellman operator.

### 4.2 Acrobot

This problem was first discussed in Sutton (1996). The state vector is 6-dimensional with three actions possible in each state, and the score represents the number of timesteps needed to solve the problem. We ran experimental trials over many episodes, with the goal of minimizing the score. The RSO considered in each experimental trial is defined by a corresponding sequence $\beta_k$.

Figure 1(b) plots the score, averaged over moving windows of episodes across the trials, as a function of the number of episodes. We observe that the average score under the RSO exhibits much better performance than under the Bellman operator or the consistent Bellman operator. Furthermore, as can be seen from the smoothness of the curves in Figure 1(b), the standard deviation is relatively small for all three operators.

### 4.3 Cart Pole

This problem was first discussed in Barto et al. (1983). The state vector is 4-dimensional with two actions possible in each state, and the score represents the number of steps for which the cart pole stays upright before either falling over or going out of bounds. We ran experimental trials over many episodes, each consisting of a bounded number of steps, with the goal of maximizing the score. When the score is above 195, the problem is considered solved. Two RSOs are considered for each experimental trial, namely one in which $\beta_k$ is uniformly distributed and another in which $\beta_k$ is fixed to a constant.

Figure 2(a) plots the score, averaged over moving windows of episodes across the trials, as a function of the number of episodes. A plot of the corresponding standard deviation, taken over the same number of score values, is presented in Figure 2(b). We observe that both the average score and its standard deviation under the RSOs exhibit better performance than under the Bellman operator or the consistent Bellman operator. In particular, the average score over the last episodes across the trials is higher under the RSO with uniformly distributed $\beta_k$ than under the Bellman and consistent Bellman operators, and the corresponding standard deviation of the scores is smaller for the RSO. We also observe that the RSO with uniformly distributed $\beta_k$ tends to perform better than the RSO with fixed $\beta_k$ over the sequences of episodes.

### 4.4 Lunar Lander

This problem is discussed in Brockman et al. (2016). The state vector is 8-dimensional with a total of four possible actions, and the physics of the problem is known to be more difficult than in the foregoing problems. The score represents the cumulative reward, comprising positive points for successful degrees of landing and negative points for fuel usage and crashing. We ran experimental trials over many episodes, each consisting of a bounded number of steps, with the goal of maximizing the score. The RSO considered in each experimental trial consists of $\beta_k$ uniformly distributed over a bounded interval.

For the training phase, Figure 3(a) plots the score, averaged over moving windows of episodes across the trials, as a function of the number of episodes. We observe that the average score under the RSO exhibits better performance than under the Bellman operator or the consistent Bellman operator. Moreover, as can be seen from the smoothness of the curves in Figure 3(a), the standard deviation is relatively small for all three operators. For the testing phase, we compare the average score over the experimental trials for the Bellman operator, the consistent Bellman operator, and the RSO. (Once again, the standard deviation is comparable across all three operators.) Here we observe once again that the average score under the RSO exhibits better performance than under the Bellman operator or the consistent Bellman operator. The improved performance under the RSO can be explained by Figure 3(b), which shows the distribution of scores for both the RSO and the consistent Bellman operator. Here we observe that the distribution for the RSO is shifted further to the right.

## 5 Conclusions

We proposed and analyzed a new general family of robust stochastic operators for reinforcement learning, which subsumes the classical Bellman operator and a recently proposed family of operators. Our goal was to provably preserve optimality while significantly increasing the action gap, thus providing robustness with respect to approximation or estimation errors. We established and discussed fundamental theoretical results for our general family of robust stochastic operators. In addition, our collection of empirical results – based on several well-known problems within the OpenAI Gym framework spanning a wide variety of reinforcement learning examples with diverse characteristics – consistently demonstrates and quantifies the significant performance improvements obtained with our operators over existing operators. We believe our work can lead to opportunities to find maximally efficient operators in practice.

## References

• (1) M. Alzantot. Solution of mountaincar OpenAI Gym problem using Q-learning.
• Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, Sept. 1983.
• Bellemare et al. (2016) M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. In Proc. Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 1476–1483. AAAI Press, 2016.
• Bertsekas and Tsitsiklis (1996) D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
• Billingsley (1999) P. Billingsley. Convergence of Probability Measures. Wiley, New York, Second edition, 1999.
• Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016.
• Ernst et al. (2005) D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
• Farahmand (2011) A. Farahmand. Action-gap phenomenon in reinforcement learning. Advances in Neural Information Processing Systems, 24, 2011.
• Moore (1990) A. Moore. Efficient Memory-Based Learning for Robot Control. PhD thesis, University of Cambridge, Cambridge, U.K., 1990.
• Rummery and Niranjan (1994) G. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical report, Cambridge University, 1994.
• Sutton (1988) R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
• Sutton (1996) R. S. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Advances in Neural Information Processing Systems, 8:1038–1044, 1996.
• (13) V. M. Vilches. Basic reinforcement learning tutorial 4: Q-learning in OpenAI Gym.
• Watkins (1989) C. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, Cambridge, U.K., 1989.

## Appendix A Theoretical Results

###### Lemma A.1 (Bellemare et al. (2016)).

Let $Q \in \mathcal{Q}$ and let $\pi$ be the policy greedy with respect to $Q$. Let $T'$ be an operator with the properties that, for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$,

1. $T' Q(x,a) \le T_B Q(x,a)$, and

2. $T' Q(x, \pi(x)) = T_B Q(x, \pi(x))$.

Consider the sequence $Q_{k+1} := T' Q_k$ with $Q_0 \in \mathcal{Q}$ and let $V_k(x) := \max_{a \in \mathcal{A}} Q_k(x,a)$. Then the sequence $V_k(x)$ converges, and furthermore, for all $x \in \mathcal{X}$,

$$\lim_{k \to \infty} V_k(x) \le V^*(x).$$

## Appendix B Python Code

We tested the various operators of interest on several RL problems and algorithms. For our empirical comparisons, the existing code that updates the Q-learning value based on the Bellman operator is replaced with the corresponding code for the $T_C$ and $T_{\beta_k}$ operators. In particular, the following snippets of code describe how this is generically implemented for the original Bellman operator together with the added consistent Bellman and RSO operators, respectively.

```python
def UpdateQBellman(self, currentState, action, nextState, reward, alpha, gamma):
    Qvalue = self.Q[currentState, action]
    rvalue = reward + gamma * max(self.Q[nextState, a] for a in self.actionsSet)
    self.Q[currentState, action] += alpha * (rvalue - Qvalue)

def UpdateQConsistent(self, currentState, action, nextState, reward, alpha, gamma):
    Qvalue = self.Q[currentState, action]
    # Consistent Bellman operator: keep Q(x, a) itself when the state does not change.
    rvalue = reward + gamma * (max(self.Q[nextState, a] for a in self.actionsSet)
                               if currentState != nextState else Qvalue)
    self.Q[currentState, action] += alpha * (rvalue - Qvalue)

def UpdateQRSO(self, currentState, action, nextState, reward, alpha, gamma, beta):
    # beta is a realization of the nonnegative random variable beta_k.
    Qvalue = self.Q[currentState, action]
    rvalue = reward + (gamma * max(self.Q[nextState, a] for a in self.actionsSet)
                       - beta * (max(self.Q[currentState, a] for a in self.actionsSet) - Qvalue))
    self.Q[currentState, action] += alpha * (rvalue - Qvalue)
```
