Reinforcement Learning Models of Human Behavior: Reward Processing in Mental Disorders

Baihan Lin, Columbia University
Guillermo Cecchi, IBM Research
Djallel Bouneffouf, IBM Research
Jenna Reinen, IBM Research
Irina Rish, IBM Research

Drawing inspiration from behavioral studies of human decision making, we propose a general parametric framework for reinforcement learning that extends the standard Q-learning approach with a two-stream model of reward processing, whose biases are biologically associated with several neurological and psychiatric conditions, including Parkinson’s and Alzheimer’s diseases, attention-deficit/hyperactivity disorder (ADHD), addiction, and chronic pain. For the AI community, developing agents that react differently to different types of rewards can help us understand a wide spectrum of multi-agent interactions in complex real-world socioeconomic systems. Empirically, the proposed model outperforms Q-Learning and Double Q-Learning in artificial scenarios with certain reward distributions and in real-world human decision-making gambling tasks. Moreover, from the behavioral modeling perspective, our parametric framework can be viewed as a first step towards a unifying computational model capturing reward processing abnormalities across multiple mental conditions and user preferences in long-term recommendation systems.



Preprint. Under review.

1 Introduction

In order to better understand and model human decision-making behavior, scientists usually investigate reward processing mechanisms in healthy subjects [1]. However, neurodegenerative and psychiatric disorders, often associated with reward processing disruptions, can provide an additional resource for deeper understanding of human decision making mechanisms. Furthermore, from the perspective of evolutionary psychiatry, various mental disorders, including depression, anxiety, ADHD, addiction and even schizophrenia can be considered as “extreme points” in a continuous spectrum of behaviors and traits developed for various purposes during evolution, and somewhat less extreme versions of those traits can be actually beneficial in specific environments (e.g., ADHD-like fast-switching attention can be life-saving in certain environments, etc.). Thus, modeling decision-making biases and traits associated with various disorders may actually enrich the existing computational decision-making models, leading to potentially more flexible and better-performing algorithms. Herein, we focus on reward-processing biases associated with several mental disorders, including Parkinson’s and Alzheimer’s disease, ADHD, addiction and chronic pain. Our questions are: is it possible to extend standard reinforcement learning algorithms to mimic human behavior in such disorders? Can such generalized approaches outperform standard reinforcement learning algorithms on specific tasks?

We show that both questions can be answered positively. We build upon Q-Learning, a state-of-the-art approach to the RL problem, and extend it to a parametric version that splits the reward information into a positive stream and a negative stream, with various reward-processing biases known to be associated with particular disorders. For example, it was shown that (unmedicated) patients with Parkinson’s disease appear to learn better from negative rather than from positive rewards [2]; another example is addictive behavior, which may be associated with an inability to forget strong stimulus-response associations from the past, i.e., to properly discount past rewards [3]. More specifically, we propose a parametric model which introduces weights on incoming positive and negative rewards, and on reward histories, extending the standard parameter update rules in Q-Learning; tuning the parameter settings allows us to better capture specific reward-processing biases.

1.1 Neuroscience Motivation

Cellular computation of reward and reward violation. Decades of evidence have linked dopamine function to reinforcement learning via neurons in the midbrain and its connections in the basal ganglia, limbic regions, and cortex. Firing rates of dopamine neurons computationally represent reward magnitude, expectancy, and violations (prediction error) and other value-based signals [4]. This allows an animal to update and maintain value expectations associated with particular states and actions. When functioning properly, this helps an animal develop a policy to maximize outcomes by approaching/choosing cues with higher expected value and avoiding cues associated with loss or punishment. The mechanism is conceptually similar to reinforcement learning widely used in computing and robotics [5], suggesting mechanistic overlap in humans and AI. Evidence of Q-learning and actor-critic models has been observed in spiking activity in midbrain dopamine neurons in primates [6] and in the human striatum using the BOLD signal [7].

Positive vs. negative learning signals. Phasic dopamine signaling represents bidirectional (positive and negative) coding for prediction error signals [8], but underlying mechanisms show differentiation for reward relative to punishment learning [9]. Though representation of cellular-level aversive error signaling has been debated [10], it is widely thought that rewarding, salient information is represented by phasic dopamine signals, whereas reward omission or punishment signals are represented by dips or pauses in baseline dopamine firing [4]. These mechanisms have downstream effects on motivation, approach behavior, and action selection. Reward signaling in a direct pathway links striatum to cortex via dopamine neurons that disinhibit the thalamus via the internal segment of the globus pallidus and facilitate action and approach behavior. Alternatively, aversive signals may have an opposite effect in the indirect pathway mediated by D2 neurons inhibiting thalamic function and ultimately action, as well [11]. Manipulating these circuits through pharmacological measures or disease has demonstrated computationally-predictable effects that bias learning from positive or negative prediction error in humans [2], and contribute to our understanding of perceptible differences in human decision making when differentially motivated by loss or gain [12].

Clinical Implications. Highlighting the importance of using computational models to understand and predict disease outcomes, many symptoms of neurological and psychiatric disease are related to biases in learning from positive and negative feedback [13]. Studies in humans have shown that when reward signaling in the direct pathway is over-expressed, this may enhance the value associated with a state and incur pathological reward-seeking behavior, like gambling or substance use. Conversely, when aversive error signals are enhanced, this results in dampening of reward experience and increased motor inhibition, causing symptoms that decrease motivation, such as apathy, social withdrawal, fatigue, and depression. Further, it has been proposed that exposure to a particular distribution of experiences during critical periods of development can biologically predispose an individual to learn from positive or negative outcomes, making them more or less susceptible to risk for brain-based illnesses [14]. These points highlight the need for a greater understanding of how intelligent systems differentially learn from rewards or punishments, and how experience sampling may impact reinforcement learning during influential training periods.

2 Related work

In this section, we review prior work in several areas which contributed to the ideas of this paper.

Reward Processing in Mental Disorders. The literature on reward processing abnormalities in particular neurological and psychiatric disorders is quite extensive; below we summarize some of the recent developments in this fast-growing field. It is well known that the neuromodulator dopamine plays a key role in reinforcement learning processes. Parkinson’s disease (PD) patients, who have depleted dopamine in the basal ganglia, tend to have impaired performance on tasks that require learning from trial and error. For example, [2] demonstrate that off-medication PD patients are better at learning to avoid choices that lead to negative outcomes than they are at learning from positive outcomes, while the dopamine medication typically used to treat PD symptoms reverses this bias. Alzheimer’s disease (AD) is the most common cause of dementia in the elderly and, besides memory impairment, it is associated with a variable degree of executive function impairment and visuospatial impairment. As discussed in [1], AD patients have decreased pursuit of rewarding behaviors, including loss of appetite; these changes are often secondary to apathy, associated with diminished reward system activity. Furthermore, poor performance on certain tasks is correlated with memory impairments. Behavioral variant frontotemporal dementia (bvFTD) typically involves a progressive change in personality and behavior including disinhibition, apathy, eating changes, repetitive or compulsive behaviors, and loss of empathy [1], and it is hypothesized that those changes are associated with abnormalities in reward processing. For example, changes in eating habits with a preference for sweet, carbohydrate-rich foods and overeating in bvFTD patients can be associated with abnormally increased reward representation for food, or impairment in the negative (punishment) signal associated with fullness. The authors of [15] suggest that the strength of the association between a stimulus and the corresponding response is more susceptible to degradation in attention-deficit/hyperactivity disorder (ADHD) patients, which suggests problems with storing the stimulus-response associations. Among other functions, storing the associations requires working memory capacity, which is often impaired in ADHD patients. In [3], it is demonstrated that patients suffering from addictive behavior have heightened stimulus-response associations, resulting in enhanced reward-seeking behavior for the stimulus which generated such an association. In [16], it is suggested that chronic pain results in a hypodopaminergic (low dopamine) state that impairs motivated behavior, resulting in a reduced drive in chronic pain patients to pursue rewards. Decreased reward response may underlie a key system mediating the anhedonia and depression which are common in chronic pain. A variety of computational models has been proposed for studying the disorders of reward processing in specific disorders, including, among others, [2, 17, 18, 19, 3, 20]. However, none of the above studies proposes a unifying model that can represent a wide range of reward processing disorders.

Computational Models of Reward Processing in Mental Disorders. A wide range of models was proposed for studying the disorders of reward processing. For example, [2] presented some evidence for a mechanistic account of how the human brain implicitly learns to make choices leading to good outcomes, while avoiding those leading to bad ones. Consistent results across two tasks (a probabilistic one and a deterministic one), in both medicated and non-medicated Parkinson’s patients, provide substantial support for a dynamic dopamine model of cognitive reinforcement learning. In [17], the authors review the evolving bvFTD literature and propose a simple, testable network-based working model for understanding bvFTD. Using a computational multilevel approach, a study presented in [18] suggests that ADHD is associated with impaired gain modulation in systems that generate increased behavioral variability. This computational, multilevel approach to ADHD provides a framework for bridging gaps between descriptions of neuronal activity and behavior, and provides testable predictions about impaired mechanisms. Based on the dopamine hypotheses of cocaine addiction and the assumption of decreased brain reward system sensitivity after long-term drug exposure, the work by [19] proposes a computational model for cocaine addiction. By utilizing average reward temporal difference reinforcement learning, this work incorporates the elevation of basal reward threshold after long-term drug exposure into the model of drug addiction proposed by [3]. The proposed model is consistent with the animal models of drug seeking under punishment. In the case of non-drug reward, the model explains increased impulsivity after long-term drug exposure.

In the study by [20], a simple heuristic model is developed to simulate individuals’ choice behavior by varying the level of decision randomness and the importance given to gains and losses. The findings revealed that risky decision-making seems to be markedly disrupted in patients with chronic pain, probably due to the high cost that pain and negative mood impose on executive control functions. Patients’ behavioral performance in decision-making tasks, such as the Iowa Gambling Task (IGT), is characterized by selecting cards more frequently from disadvantageous than from advantageous decks, and by switching more often between competing responses, as compared with healthy controls.

To the best of our knowledge, this work is the first to propose a generalized version of a reinforcement learning algorithm which incorporates a range of reward processing biases associated with various mental disorders, and to show how different parameter settings of the proposed model lead to behavior mimicking a wide range of impairments in multiple neurological and psychiatric disorders. Most importantly, our reinforcement learning algorithm, based on a generalization of Q-Learning, outperforms the baseline method on multiple artificial scenarios.

3 Problem Setting

3.1 Reinforcement Learning

Reinforcement learning defines a class of algorithms for solving problems modeled as a Markov decision process (MDP) [5]. An MDP is usually denoted by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is a set of possible states, $\mathcal{A}$ is a set of actions, $\mathcal{T}$ is a transition function defined by $\mathcal{T}(s, a, s') = \Pr(s' \mid s, a)$, where $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto \mathbb{R}$ is a reward function, and $\gamma$ is a discount factor that specifies how much long term reward is kept.

The goal in an MDP is to maximize the discounted long term reward received. Usually the infinite-horizon objective is considered:

$$\max \sum_{t=0}^{\infty} \gamma^{t} \, \mathcal{R}(s_t, a_t, s_{t+1}) \qquad (1)$$

Solutions come in the form of policies $\pi$, which specify what action the agent should take in any given state deterministically or stochastically. One way to solve this problem is through Q-learning with function approximation [21]. The Q-value of a state-action pair, $Q(s, a)$, is the expected future discounted reward for taking action $a$ in state $s$. A common method to handle very large state spaces is to approximate the $Q$ function as a linear function of some features. Let $\psi(s, a)$ denote relevant features of the state-action pair $(s, a)$. Then, we assume $Q(s, a) = \theta \cdot \psi(s, a)$, where $\theta$ is an unknown vector to be learned by interacting with the environment. Every time the reinforcement learning agent takes action $a$ from state $s$, obtains immediate reward $r$ and reaches new state $s'$, the parameter $\theta$ is updated using

$$\text{difference} = r + \gamma \max_{a'} Q(s', a') - Q(s, a), \qquad \theta \leftarrow \theta + \alpha \cdot \text{difference} \cdot \psi(s, a) \qquad (2)$$

where $\alpha$ is the learning rate. $\epsilon$-greedy is a common strategy used for exploration. That is, during the training phase, a random action is played with a probability of $\epsilon$ and the action with maximum Q-value is played otherwise. The agent follows this strategy and updates the parameter $\theta$ according to Equation (2) until the Q-values converge or for a large number of time-steps.
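A minimal Python sketch of the update in Equation (2) combined with epsilon-greedy exploration may help make the procedure concrete. The names `theta`, `features`, `q_update`, `epsilon_greedy` and the one-hot toy feature map are illustrative assumptions rather than part of the original description.

```python
import random

def q_value(theta, psi):
    """Linear Q-value: dot product of the weight vector and the features."""
    return sum(t * p for t, p in zip(theta, psi))

def epsilon_greedy(theta, features, state, actions, epsilon, rng):
    """Play a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_value(theta, features(state, a)))

def q_update(theta, features, s, a, r, s_next, next_actions, alpha, gamma):
    """One update step: theta += alpha * difference * features(s, a)."""
    psi = features(s, a)
    # Value of the best next action; zero when s_next has no actions (terminal).
    best_next = max(
        (q_value(theta, features(s_next, b)) for b in next_actions),
        default=0.0,
    )
    difference = r + gamma * best_next - q_value(theta, psi)
    return [t + alpha * difference * p for t, p in zip(theta, psi)]

# Hypothetical toy problem: 2 states x 2 actions with one-hot features,
# which makes the linear model equivalent to a lookup table.
def one_hot(s, a, n_actions=2, size=4):
    vec = [0.0] * size
    vec[s * n_actions + a] = 1.0
    return vec

theta = [0.0] * 4
theta = q_update(theta, one_hot, s=0, a=1, r=1.0, s_next=1,
                 next_actions=[], alpha=0.5, gamma=0.95)
greedy = epsilon_greedy(theta, one_hot, 0, [0, 1], 0.0, random.Random(0))
```

With one-hot features only the entry for the visited state-action pair changes, so the sketch reduces to tabular Q-learning; richer feature maps generalize across pairs.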

4 Human Q-Learning

We now introduce a more general formulation of Q-Learning that incorporates reward signals from a positive and a negative stream. We propose Human Q-Learning (HQL), outlined in Algorithm 1, which updates the Q values using four weight parameters: two are the weights of the previously accumulated positive and negative rewards, respectively, while the other two are the weights on the positive and negative rewards at the current iteration. The algorithm maintains two Q tables, which record the positive and the negative feedback, respectively.

1:  For each episode t do
2:    Initialize s
3:    Repeat
5:     action , observe ,
8:    until s is terminal
Algorithm 1 Human Q-Learning (HQL)
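One plausible reading of the two-stream update can be sketched in Python. This is an illustration under stated assumptions, not the exact update of Algorithm 1: the parameter names (`hist_pos`, `hist_neg`, `cur_pos`, `cur_neg`), the form of the update (the history weight scales the stored entry, the current weight scales the incoming reward), and acting greedily on the sum of the two tables are all hypothetical choices.

```python
from collections import defaultdict

class TwoStreamQ:
    """Tabular sketch with separate Q tables for positive and negative rewards."""

    def __init__(self, hist_pos=1.0, hist_neg=1.0, cur_pos=1.0, cur_neg=1.0,
                 alpha=0.1, gamma=0.95):
        # Assumed names: hist_* weight accumulated values, cur_* weight new rewards.
        self.hist = {"pos": hist_pos, "neg": hist_neg}
        self.cur = {"pos": cur_pos, "neg": cur_neg}
        self.alpha, self.gamma = alpha, gamma
        self.q = {"pos": defaultdict(float), "neg": defaultdict(float)}

    def value(self, s, a):
        """Combined value used for action selection (assumed: sum of streams)."""
        return self.q["pos"][(s, a)] + self.q["neg"][(s, a)]

    def greedy_action(self, s, actions):
        return max(actions, key=lambda a: self.value(s, a))

    def update(self, s, a, r_pos, r_neg, s_next, next_actions):
        """Update each table from its own stream (r_pos >= 0, r_neg <= 0)."""
        for stream, r in (("pos", r_pos), ("neg", r_neg)):
            table = self.q[stream]
            best_next = max((table[(s_next, b)] for b in next_actions),
                            default=0.0)
            target = self.cur[stream] * r + self.gamma * best_next
            old = table[(s, a)]
            table[(s, a)] = self.hist[stream] * old + self.alpha * (target - old)

def split_reward(r):
    """Route a scalar reward into the positive or the negative stream."""
    return (r, 0.0) if r > 0 else (0.0, r)

agent = TwoStreamQ(alpha=0.5)
agent.update("A", "right", *split_reward(10.0), "E", [])
```

In this sketch, setting all four weights to 1 makes each table follow an ordinary Q-learning update on its own stream, and zeroing both weights of one stream silences that stream entirely.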

4.1 Reward Processing Models with Different Biases

In this section we describe how specific constraints on the model parameters in the proposed algorithm can yield the different reward processing biases discussed earlier, and introduce several instances of the HQL model, with parameter settings reflecting particular biases. The parameter settings are summarized in Table 1, where we list our models associated with specific disorders.

It is important to underscore that the above models should be viewed as only a first step towards a unifying approach to reward processing disruptions, which requires further extensions, as well as tuning and validation on human subjects. Our main goal is to demonstrate the promise of our parametric approach at capturing certain decision-making biases, as well as its computational advantages over the standard Q-Learning algorithm, due to the increased generality and flexibility facilitated by multi-parametric formulation.

Note that the standard HQL (SQL) approach corresponds to setting the four (hyper)parameters used in our model to 1. We also introduce two variants which learn from only one stream of rewards: positive Q-Learning (PQL) and negative Q-Learning (NQL), obtained by setting the two weights of the negative stream or of the positive stream to zero, respectively. Next, we introduce a model which incorporates some mild forgetting of past rewards or losses, using weights of 0.5 as an example, and calibrate the other models with respect to it; we refer to this model as M for “moderate” forgetting, which serves here as a proxy for somewhat “normal” reward processing, without the extreme reward-processing biases associated with disorders. We will use the subscript M to denote the parameters of this model.

“Addiction” (ADD)
“Alzheimer’s” (AD)
“Chronic pain” (CP)
“Parkinson’s” (PD)
“moderate” (M)
Standard HQL (SQL) 1 1 1 1
Positive HQL (PQL) 1 1 0 0
Negative HQL (NQL) 0 0 1 1
Table 1: Algorithms Parameters

We will now introduce several models inspired by certain reward-processing biases in a range of mental disorders. It is important to note that, despite using disorder names for these models, we are not claiming that they provide accurate models of the corresponding disorders, but rather disorder-inspired versions of our general parametric family of models.

Recall that PD patients are typically better at learning to avoid negative outcomes than at learning to achieve positive outcomes [2]; one way to model this is to over-emphasize negative rewards by placing a high weight on them, as compared to the reward processing in healthy individuals. Specifically, we assume the weight on negative rewards for PD patients to be much higher than normal, while the rest of the parameters are in the same range for both healthy and PD individuals. Patients with bvFTD are prone to overeating, which may reflect increased reward representation. To model this impairment in bvFTD patients, the weight on positive rewards is set higher than normal (e.g., as shown in Table 1), while the rest of the parameters are equal to the normal ones. To model apathy in patients with Alzheimer’s, including downplaying rewards and losses, we assume that the weights on previously accumulated positive and negative rewards are somewhat smaller than normal (e.g., set to 0.1 in Table 1), which models the tendency to forget both positive and negative rewards. Recall that ADHD may involve impairments in storing stimulus-response associations. In our ADHD model, the weights on previously accumulated positive and negative rewards are smaller than normal, which models forgetting of both positive and negative rewards. Note that while this model appears similar to the Alzheimer’s model described above, the forgetting is less pronounced, i.e., these two weights are larger than those of the Alzheimer’s model (e.g., 0.2 instead of 0.1, as shown in Table 1). As mentioned earlier, addiction is associated with an inability to properly forget (positive) stimulus-response associations; we model this by setting the weight on previously accumulated positive reward (“memory”) higher than normal. We model the reduced responsiveness to rewards in chronic pain by lowering the weight on incoming positive rewards, so that there is a decrease in the reward representation, and by raising the weight on previously accumulated negative rewards, so that the negative rewards are not forgotten (see Table 1).

Of course, the above models should be treated only as first approximations of the reward processing biases in mental disorders, since the actual changes in reward processing are much more complicated, and the parametric setting must be learned from actual patient data, which is a nontrivial direction for future work. Herein, we simply consider those models as specific variations of our general method, inspired by certain aspects of the corresponding diseases, and focus primarily on the computational aspects of our algorithm, demonstrating that the proposed parametric extension of Q-Learning can learn better than the baseline Q-Learning due to added flexibility.

5 Empirical Results

Empirically, we evaluated the algorithms in two settings: a gambling game formulated as a simple Markov Decision Process (MDP) and the real-life Iowa Gambling Task (IGT) [22]. There is considerable randomness in the rewards and predefined multimodality in the reward distribution of each state-action pair, and as a result we will see that Q-learning indeed performs poorly. In all experiments, the discount factor was set to 0.95. Exploration used the $\epsilon$-greedy algorithm with $\epsilon$ set to 0.05. The learning rate was polynomial, which was shown in previous work to be better in theory and in practice [23]. All experiments were averaged over 100 runs of 500 decision-making steps from the initial state, performed on a machine with four CPU cores.

To evaluate the performance of the algorithms, we need a scenario-independent measure that depends neither on the specific reward distribution parameters nor on the pool of algorithms being considered. The final cumulative rewards might be subject to outliers because they are scenario-specific. The ranking of the algorithms might be subject to selection bias due to the different pools of algorithms being considered. The pairwise comparison of the algorithms, however, is independent of the selection of scenario parameters and of the selection of algorithms. For example, over the 100 randomly generated scenarios, we count how many times algorithm X beats Y and how many times Y beats X, and compare the robustness of each pair of algorithms by the proportion of these two counts.
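The pairwise comparison can be sketched in a few lines of Python. The function name `pairwise_wins`, the input layout (one final score per scenario for each algorithm), the rule that ties count for neither side, and the toy numbers are assumptions made for illustration only.

```python
from itertools import combinations

def pairwise_wins(scores):
    """Map {algorithm: [final score per scenario]} to {(x, y): (x_wins, y_wins)}."""
    result = {}
    for x, y in combinations(sorted(scores), 2):
        pairs = list(zip(scores[x], scores[y]))
        x_wins = sum(sx > sy for sx, sy in pairs)
        y_wins = sum(sy > sx for sx, sy in pairs)
        result[(x, y)] = (x_wins, y_wins)
    return result

# Made-up scores over three scenarios, purely for illustration.
toy = {"X": [10.0, 3.0, 7.0], "Y": [8.0, 5.0, 6.0]}
wins = pairwise_wins(toy)
print(wins)
```

Because each cell only compares two algorithms on the same scenarios, adding or removing a third algorithm from the pool leaves the cell unchanged, which is the property the measure is chosen for.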

MDP example. In this simple MDP example, a player starts from the initial state A and chooses between two actions: go left to reach state B, or go right to reach state C. Both states B and C yield zero reward. From state B, the player has only one action, which leads to state D and reveals $n$ draws of rewards from the reward distribution of D. From state C, the player has only one action, which leads to state E and reveals $n$ draws of rewards from the reward distribution of E. The reward distributions of states D and E are both multimodal (for instance, the reward can be drawn from a bi-modal mixture of two normal distributions, each selected with its own probability). In the simulations, $n$ is set to 50. By default, the left action (go to state B) has a lower expected payout than the right action. However, the reward distributions can be spread across both the positive and negative domains. For HQL, the reward is separated into a positive stream (if the revealed reward is positive) and a negative stream (if the revealed reward is negative). (Footnote 1: The raw data and code to reproduce all the numerical simulations are available for download.)

Figure 1 shows an example scenario, with the reward distributions, the percentage of choosing the better action (go right), the cumulative rewards and the changes of the two Q-tables over the number of iterations, drawn with standard errors over 100 runs. Each trial consisted of a synchronous update of all 500 actions. With polynomial learning rates, Human Q-learning converges much more quickly than Q-Learning.

To better evaluate the robustness of the algorithms, we simulated 100 randomly generated scenarios of bi-modal distributions, where each reward distribution is a mixture of two normal distributions with means drawn as uniformly random integers from -100 to 100, standard deviations drawn as uniformly random integers from 0 to 20, and a mixing probability drawn uniformly from 0 to 1 (assigning this probability to one normal distribution and its complement to the other). Each scenario was repeated 100 times. Table 2 summarizes the pairwise comparisons between Q-Learning (QL), Double Q-Learning (DQL) [24], Standard Human Q-Learning (SQL), Positive Q-Learning (PQL) and Negative Q-Learning (NQL), with the row labels as algorithm X and the column labels as algorithm Y; each cell gives the number of times X beats Y, followed by the number of times Y beats X. Among the five algorithms, SQL never seems to fail catastrophically, maintaining an overall advantage over the other algorithms (with the highest average winning percentage of 0.68, while all others are at or below 0.50). HQL seems to benefit from its sensitivity to two streams of rewards instead of collapsing them into an estimate of the mean as in Q-Learning.
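The scenario generation described above can be sketched as follows. The function names, the dictionary layout, and the use of the mixture mean to decide which terminal state is better are illustrative assumptions; only the sampling ranges come from the text.

```python
import random

def random_bimodal(rng):
    """Draw the parameters of one bi-modal reward distribution."""
    return {
        "means": (rng.randint(-100, 100), rng.randint(-100, 100)),
        "stds": (rng.randint(0, 20), rng.randint(0, 20)),
        "p_first": rng.random(),  # mixing probability of the first normal
    }

def sample_reward(dist, rng):
    """Pick one of the two normals by the mixing probability, then sample it."""
    idx = 0 if rng.random() < dist["p_first"] else 1
    return rng.gauss(dist["means"][idx], dist["stds"][idx])

def mixture_mean(dist):
    """Expected reward of the mixture."""
    p = dist["p_first"]
    return p * dist["means"][0] + (1 - p) * dist["means"][1]

rng = random.Random(0)
# One scenario: a reward distribution for each terminal state, D and E.
scenario = {"D": random_bimodal(rng), "E": random_bimodal(rng)}
better_state = max(scenario, key=lambda s: mixture_mean(scenario[s]))
rewards = [sample_reward(scenario["D"], rng) for _ in range(5)]
```

Two components with means of opposite sign give a distribution whose mean hides most of its structure, which is the situation where keeping separate positive and negative streams is expected to matter.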

To explore the variants of HQL representing different mental disorders, we also performed the same experiments on the 7 disease models proposed in Section 4.1. Table 3 summarizes their pairwise comparisons with SQL, DQL and QL, where the average wins of each mental agent are averaged over the three standard baseline models. Overall, PD (“Parkinson’s”), CP (“Chronic Pain”) and M (“moderate”) perform relatively well in these environments. Although it shares the same algorithmic framework as the mental agents, the standard HQL (SQL) beats them by the largest margin (an average 0.81 chance of beating a mental agent, compared with 0.65 for DQL and 0.58 for QL). The variation in behaviors also suggests that the proposed framework can potentially cover a wide spectrum of behavior by simply tuning the four hyperparameters.

Figure 1: Example bi-modal MDP scenario where HQL performs better than QL and DQL.
Figure 2: Short-term learning curves of different mental agents in IGT scheme 1.

Iowa Gambling Task. The original Iowa Gambling Task (IGT) studies decision making where the participant needs to choose one out of four card decks (named A, B, C, and D), and can win or lose money with each card when choosing a deck to draw from [25], over around 100 actions. In each round, the participant receives feedback about the win (the money he/she wins), the loss (the money he/she loses), and the combined gain (win minus loss). In the MDP setup, from the initial state I, the player selects one of the four decks to go to state A, B, C, or D, which reveals the positive reward (the win), the negative reward (the loss) and the combined reward simultaneously. By default, decks A and B have a lower expected payout (-25) than the better decks, C and D (+25). For QL and DQL, the combined reward is used to update the agents. For HQL, PQL and NQL, the positive and negative streams are fed and learned independently, given the win and the loss.

There are two major payoff schemes in IGT. In the traditional payoff scheme, the net outcome of every 10 cards from the bad decks (i.e., decks A and B) is -250, and +250 in the case of the good decks (i.e., decks C and D). There are two decks with frequent losses (decks A and C), and two decks with infrequent losses (decks B and D). All decks have consistent wins (+100 for A and B, +50 for C and D) and variable losses (summarized in Table 4, where scheme 1 [26] has more variable losses for deck C than scheme 2 [27]). (Footnote 2: The raw data and descriptions of the Iowa Gambling Task can be downloaded at [22].)

We ran each scheme 100 times over 500 actions. Among the variants of HQL and the baselines QL and DQL, CP (“chronic pain”) performs best in scheme 1, with a final cumulative reward of 2689641.25 over 500 draws of cards, followed by NQL (2685174.5) and QL (2673854.75). This is consistent with the clinical picture of chronic pain patients, who tend to forget positive reward information (modeled by a smaller weight on previously accumulated positive rewards) and lack the drive to pursue rewards (modeled by a smaller weight on incoming positive rewards). In scheme 2, SQL performs best with a final score of 2724046.5, followed by NQL (2700618.5) and QL (2689553.5). These examples suggest that the proposed framework has the flexibility to map out different behavior trajectories in real-life decision making (such as IGT). Figure 2 shows the short-term (within 100 actions) and long-term behaviors of different mental agents, which match clinical discoveries. For instance, ADD (“addiction”) quickly learns the actual values of each deck (as reflected by the short-term curve) but in the long term still sticks with the decks with larger wins (despite their even larger losses). At around 20 actions, ADD performs better than QL and DQL in learning about the decks with the better gains.

Table 2: Standard agents
               QL      DQL     SQL     PQL     NQL
QL             -       46:54   34:66   72:28   44:56
DQL            54:46   -       34:66   59:41   50:50
SQL            66:34   66:34   -       77:23   62:38
PQL            28:72   41:59   23:77   -       45:55
NQL            56:44   50:50   38:62   55:45   -
avg wins (%)   0.49    0.49    0.68    0.34    0.50

Table 3: Mental agents
               SQL     ADD     ADHD    AD      CP      bvFTD   PD      M       avg wins (%)
QL             29:71   60:40   65:35   73:27   43:57   75:25   38:62   49:51   0.58
DQL            22:78   54:46   80:20   81:19   61:39   77:23   52:48   53:47   0.65
SQL            -       78:22   94:6    95:5    67:33   89:11   66:34   81:19   0.81
avg wins (%)   -       0.36    0.20    0.17    0.40    0.16    0.48    0.39    -

MDP Task with 100 randomly generated scenarios of Bi-modal reward distributions
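As a worked check, the “avg wins” row of Table 2 can be reproduced from the pairwise cells, assuming it is the mean share of wins of each algorithm against its four opponents over the 100 scenarios. That interpretation and the variable names are assumptions; the win counts are copied from Table 2.

```python
# Wins of the row algorithm against each column algorithm (Table 2).
wins = {
    "QL":  {"DQL": 46, "SQL": 34, "PQL": 72, "NQL": 44},
    "DQL": {"QL": 54, "SQL": 34, "PQL": 59, "NQL": 50},
    "SQL": {"QL": 66, "DQL": 66, "PQL": 77, "NQL": 62},
    "PQL": {"QL": 28, "DQL": 41, "SQL": 23, "NQL": 45},
    "NQL": {"QL": 56, "DQL": 50, "SQL": 38, "PQL": 55},
}
N_SCENARIOS = 100

# Average fraction of scenarios won, taken over the four opponents.
avg_wins = {
    algo: sum(row.values()) / (N_SCENARIOS * len(row))
    for algo, row in wins.items()
}
for algo, value in avg_wins.items():
    print(f"{algo}: {value:.2f}")
```

Under this reading the rounded values agree with the reported row: 0.49, 0.49, 0.68, 0.34 and 0.50.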
Deck       Win per card   Loss per card                                                                      Expected value   Scheme
A (bad)    +100           Frequent: -150 (p=0.1), -200 (p=0.1), -250 (p=0.1), -300 (p=0.1), -350 (p=0.1)   -25              1
B (bad)    +100           Infrequent: -1250 (p=0.1)                                                          -25              1
C (good)   +50            Frequent: -25 (p=0.1), -75 (p=0.1), -50 (p=0.3)                                    +25              1
D (good)   +50            Infrequent: -250 (p=0.1)                                                           +25              1
A (bad)    +100           Frequent: -150 (p=0.1), -200 (p=0.1), -250 (p=0.1), -300 (p=0.1), -350 (p=0.1)   -25              2
B (bad)    +100           Infrequent: -1250 (p=0.1)                                                          -25              2
C (good)   +50            Infrequent: -50 (p=0.5)                                                            +25              2
D (good)   +50            Infrequent: -250 (p=0.1)                                                           +25              2
Table 4: Iowa Gambling Task schemes
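The expected values in Table 4 can be checked with a short calculation. The sketch below restates each loss probability as a count per 10 cards (p=0.1 as once, p=0.3 as three times, p=0.5 as five times), which is an assumption made to keep the arithmetic in integers; the function and variable names are likewise illustrative.

```python
# (win per card, {loss amount: occurrences per 10 cards}) from Table 4.
SCHEMES = {
    1: {
        "A": (100, {-150: 1, -200: 1, -250: 1, -300: 1, -350: 1}),
        "B": (100, {-1250: 1}),
        "C": (50, {-25: 1, -75: 1, -50: 3}),
        "D": (50, {-250: 1}),
    },
    2: {
        "A": (100, {-150: 1, -200: 1, -250: 1, -300: 1, -350: 1}),
        "B": (100, {-1250: 1}),
        "C": (50, {-50: 5}),
        "D": (50, {-250: 1}),
    },
}

def net_per_ten_cards(win, loss_counts):
    """Net outcome of 10 cards: ten wins plus the scheduled losses."""
    return 10 * win + sum(loss * count for loss, count in loss_counts.items())

for scheme, decks in SCHEMES.items():
    for deck, (win, loss_counts) in decks.items():
        net = net_per_ten_cards(win, loss_counts)
        print(f"scheme {scheme}, deck {deck}: net per 10 cards {net:+d}, "
              f"expected value per card {net / 10:+.1f}")
```

Every bad deck nets -250 per 10 cards and every good deck nets +250, matching the per-card expected values of -25 and +25. The two schemes differ only in how the losses of deck C are distributed, not in its mean.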

6 Conclusion

This research proposes a novel parametric family of algorithms for the RL problem, extending classical Q-Learning to model a wide range of potential reward processing biases. Our approach draws inspiration from the extensive literature on decision-making behavior in neurological and psychiatric disorders stemming from disturbances of the reward processing system, and demonstrates the high flexibility of our multi-parameter model, which allows tuning the weights on incoming two-stream rewards and on memories of the prior reward history. Our preliminary results support multiple prior observations about reward processing biases in a range of mental disorders, thus indicating the potential of the proposed model and its future extensions to capture reward-processing aspects across various neurological and psychiatric conditions. The contribution of this research is two-fold: from the AI perspective, we propose a more powerful and adaptive approach to RL, outperforming state-of-the-art QL on certain reward distributions; from the neuroscience perspective, this work is the first attempt at a general, unifying model of reward processing and its disruptions across a wide population including both healthy subjects and those with mental disorders, which has the potential to become a useful computational tool for neuroscientists and psychiatrists studying such disorders. Among the directions for future work, we plan to investigate the optimal parameters in a series of computer games evaluated on different criteria, for example, longest survival time vs. highest final score. Further work includes exploring multi-agent interactions given different reward processing biases. These discoveries can help build more interpretable real-world RL systems. On the neuroscience side, the next steps would include further tuning and extending the proposed model to better capture observations in the modern literature, as well as testing the model on both healthy subjects and patients with specific mental conditions.


  • [1] David C Perry and Joel H Kramer. Reward processing in neurodegenerative disease. Neurocase, 21(1):120–133, 2015.
  • [2] Michael J Frank, Lauren C Seeberger, and Randall C O’Reilly. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–1943, 2004.
  • [3] A David Redish, Steve Jensen, Adam Johnson, and Zeb Kurth-Nelson. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychological review, 114(3):784, 2007.
  • [4] W. Schultz, P. Dayan, and P. R. Montague. A Neural Substrate of Prediction and Reward. Science, 275(5306):1593–1599, 1997.
  • [5] Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
  • [6] Hannah M. Bayer and Paul W. Glimcher. Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal. Neuron, 47(1):129–141, 2005.
  • [7] John O’Doherty, Peter Dayan, Johannes Schultz, Ralf Deichmann, Karl Friston, and Raymond J. Dolan. Dissociable Roles of Ventral and Dorsal Striatum in Instrumental. Science, 304(16 April):452–454, 2004.
  • [8] Andrew S Hart, Robb B Rutledge, Paul W Glimcher, and Paul E M Phillips. Phasic Dopamine Release in the Rat Nucleus Accumbens Symmetrically Encodes a Reward Prediction Error Term. Journal of Neuroscience, 34(3):698–704, 2014.
  • [9] Ben Seymour, Tania Singer, and Ray Dolan. The neurobiology of punishment. Nature Reviews Neuroscience, 8(4):300–311, 2007.
  • [10] Peter Dayan and Yael Niv. Reinforcement learning: the good, the bad and the ugly. Current opinion in neurobiology, 18(2):185–196, 2008.
  • [11] Michael J Frank and Randall C O’Reilly. A Mechanistic Account of Striatal Dopamine Function in Human Cognition: Psychopharmacological Studies With Cabergoline and Haloperidol. Behavioral Neuroscience, 120(3):497–517, 2006.
  • [12] Amos Tversky and Daniel Kahneman. The Framing of Decisions and the Psychology of Choice. Science, 211(4481):453–458, 1981.
  • [13] Tiago V Maia and Michael J Frank. From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience, 14(2):154–162, 2011.
  • [14] Avram J. Holmes and Lauren M. Patrick. The Myth of Optimality in Clinical Neuroscience. Trends in Cognitive Sciences, 22(3):241–257, 2018.
  • [15] Marjolein Luman, Catharina S Van Meel, Jaap Oosterlaan, Joseph A Sergeant, and Hilde M Geurts. Does reward frequency or magnitude drive reinforcement-learning in attention-deficit/hyperactivity disorder? Psychiatry research, 168(3):222–229, 2009.
  • [16] Anna MW Taylor, Susanne Becker, Petra Schweinhardt, and Catherine Cahill. Mesolimbic dopamine signaling in acute and chronic pain: implications for motivation, analgesia, and addiction. Pain, 157(6):1194, 2016.
  • [17] William W Seeley, Juan Zhou, and Eun-Joo Kim. Frontotemporal dementia: what can the behavioral variant teach us about human brain organization? The Neuroscientist, 18(4):373–385, 2012.
  • [18] Tobias U Hauser, Vincenzo G Fiore, Michael Moutoussis, and Raymond J Dolan. Computational psychiatry of ADHD: neural gain impairments across Marrian levels of analysis. Trends in neurosciences, 39(2):63–73, 2016.
  • [19] Amir Dezfouli, Payam Piray, Mohammad Mahdi Keramati, Hamed Ekhtiari, Caro Lucas, and Azarakhsh Mokri. A neurocomputational model for cocaine addiction. Neural computation, 21(10):2869–2893, 2009.
  • [20] Leonardo Emanuel Hess, Ariel Haimovici, Miguel Angel Muñoz, and Pedro Montoya. Beyond pain: modeling decision-making deficits in chronic pain. Frontiers in behavioral neuroscience, 8, 2014.
  • [21] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.
  • [22] Helen Steingroever, Daniel J Fridberg, Annette Horstmann, Kimberly L Kjome, Veena Kumari, Scott D Lane, Tiago V Maia, James L McClelland, Thorsten Pachur, Preethi Premkumar, et al. Data from 617 healthy participants performing the Iowa gambling task: A “many labs” collaboration. Journal of Open Psychology Data, 3(1):340–353, 2015.
  • [23] Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
  • [24] Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
  • [25] Antoine Bechara, Antonio R Damasio, Hanna Damasio, and Steven W Anderson. Insensitivity to future consequences following damage to human prefrontal cortex. Cognition, 50(1-3):7–15, 1994.
  • [26] Daniel J Fridberg, Sarah Queller, Woo-Young Ahn, Woojae Kim, Anthony J Bishara, Jerome R Busemeyer, Linda Porrino, and Julie C Stout. Cognitive mechanisms underlying risky decision-making in chronic cannabis users. Journal of mathematical psychology, 54(1):28–38, 2010.
  • [27] Annette Horstmann, Arno Villringer, and Jane Neumann. Iowa gambling task: There is more to consider than long-term outcome. Using a linear equation model to disentangle the impact of outcome and frequency of gains and losses. Frontiers in Neuroscience, 6:61, 2012.