# Modelling Stock-market Investors as Reinforcement Learning Agents [Correction]

## Abstract

Decision making in uncertain and risky environments is a prominent area of research. Standard economic theories fail to fully explain human behaviour, while a potentially promising alternative may lie in the direction of Reinforcement Learning (RL) theory. We analyse data for 46 players extracted from a financial market online game and test whether Reinforcement Learning (Q-Learning) could capture these players' behaviour using a risk measure based on financial modelling. Moreover, we test an earlier hypothesis that players are “naïve” (short-sighted). Our results indicate that a simple Reinforcement Learning model which considers only the selling component of the task captures the decision-making process for a subset of players, but this is not sufficient to draw any conclusion on the population. We also find no significant improvement in the fit for the players when using a full RL model instead of a myopic version, in which only immediate reward is valued by the players. This indicates that players, if using a Reinforcement Learning approach, do so naïvely.

## 1 Introduction

One of the most challenging fields of research is human decision-making. Understanding the processes involved and trying to predict or replicate behaviours has been, historically, the ultimate goal of many disciplines. Economics, for example, has a long tradition of trying to formalise human behaviour into descriptive or normative models. These models have been employed for many years (e.g. the Expected Utility model [1]) but have been shown to be inadequate [2], giving rise to new research areas such as behavioural and experimental economics. Psychology, too, is inherently concerned with decision-making. Sequential decision problems have been used to evaluate people’s risk attitude, in order to predict actual risk proneness in real-life scenarios [6]. While economics and psychology are focused on the high-level manifestations and implications of decision-making, neuroscience aims at understanding the biological machinery and the neural processes behind human (or animal) behaviour [9].

Recently these fields of research have started to collaborate, contributing to the rise of an emerging multi-disciplinary field called neuroeconomics [16]. This discipline approaches the problem from several perspectives and on different levels of abstraction. RL is a theoretical framework [21], extensively used in neuroeconomics literature for addressing a wide array of problems involving learning in partially observable environments [22]. RL is based on the concept of reward/punishment for the actions taken. The agents act in an environment of which they possess only partial knowledge. To be able to achieve the best behaviour, i.e. maximise their reward, the agents have to learn through experience and update their beliefs. Learning happens as a result of the agent’s interpretation of the interactions with its surroundings and the consequences of a “reward” feedback signal. The ability of this framework to model and therefore understand behavioural data and its underlying neural implications, is of pivotal importance in decision making [28].

RL can accurately capture human and animal learning patterns and has been proven effective at describing the functioning of some areas of the human brain, like the basal ganglia, and the functions of neurotransmitters such as dopamine [29]. One of the most remarkable similarities between biological functioning and RL models is the one about Temporal Difference (TD) error [21] and the activation of mid-brain dopamine neurons [35]. These findings supported the notion that TD Learning is implemented in the brain with dopaminergic neurons in the striatum [29], making it a reasonable first choice for a modelling attempt. Humans and animals are very advanced signal detectors whose behaviour is susceptible to changes in the rewards resulting from their choices [51]. Both neuroscience and psychology have extensively employed tasks in which the exploration-exploitation trade-off was of crucial importance [53]. It is crucial for the individuals to maximise their reward using the information at their disposal but to do so advantageously they need to learn which actions lead to better rewards. Decision making in uncertain environments is a challenging problem because of the competing importance of the two strategies: exploitation is, of course, the best course of action, but only when enough knowledge about the quality of the actions is available, while exploration increases the knowledge about the environment.

A complicated task that encompasses all these features is stock selection in financial markets, where investors have to choose among hundreds of possible securities to create their portfolio. Stock trends are non-monotonic because they are not guaranteed to achieve a global maximum and the future distribution of reward is intrinsically stochastic. After purchasing a stock, investors are faced with the decision of when to sell it (the market timing problem [56]). To achieve the best return from their investments, people need to consider carefully how to maximise their profit in the long term and not only in a single episode. We speculate that RL is part of the decision-making process for investors. This speculation is supported by Choi et al. [57], who studied individual investors' decisions on 401(k) savings plans. Over the years, investors could decide to increase or decrease the percentage of their salary to commit to this retirement plan. Their results suggest that investors’ decisions are influenced by personal experiences: investors who have experienced a positive reward from previous investment in their 401(k) fund tend to increase the investment in that particular product, compared to those who experienced a lower reward outcome. This kind of behaviour follows a “naïve reinforcement learning” pattern and is in contrast with the disposition effect [65] (the unwillingness of investors to sell “losing” investments). Huang et al. investigated how personal experience in investments affects future decisions about the selection of stocks [58]. They used data from a large discount broker spanning 1991 to 1996. Again, they found a pattern of repurchasing financial products which had yielded a positive return. As Huang suggests, by understanding the way past experience affects investors’ decisions, it might be possible to make predictions about the financial markets involved.

RL has also been used, with promising results, to develop Stock Market Trading Systems [60] and to build Agent Based Stock Market Simulations [64]. While these works use RL to predict future prices, they do not try to describe human behaviour. With these notions as background we decided to investigate and try to model human choices in a stochastic, non-stationary environment. We hypothesise that RL is a component of decision making and to test this we compare two RL models against a purely random one. Our modelling attempts are based on two assumptions. First, we assume that risk is a proxy of the internal representation of the actions for some players. To test this we use a measure of systematic risk widely used in finance and economics to categorise the different choices into three discrete classes. We also assume that the reward signal is based on the cash income arising from the sales an investor makes. This assumption follows a widely researched behaviour referred to as the “disposition effect” in the literature [65]: the tendency of individual investors to sell stocks which have increased in value since they were purchased, while holding onto the stocks which have lost value. This phenomenon is stronger for individual investors but it is also exhibited by institutional investors, such as mutual funds and corporations [67]. Following these indications we mapped the sell transactions to a reward signal to fit our models. Finally, we hypothesise that not all players are short-sighted; to test this we compare a full RL model (3 free parameters) against a myopic model (2 free parameters, no discount factor gamma). The difference is that the latter can be considered a naïve RL model, as it does not take into account future rewards and only seeks to maximise immediate rewards.

## 2 Method

### 2.1 Dataset

The dataset has been extracted from the publicly accessible online trading simulation game VirtualTrader (see footnote 1), which is managed by IEX Media Group BV in the Netherlands. Players can subscribe for free and start playing the game with an assigned virtual cash budget of 100k GBP. The players then pick the stocks they prefer from the FTSE100 stock index pool (107 stocks at the time of data collection) and create their own portfolio. Players are ranked according to the return on their investment, which is composed of “holdings” and “cash”. The former represents the shares possessed by a player while the latter is the amount of money not invested (i.e. deriving from sold stocks or never invested). The simulation follows the evolution of real-world data, for example price fluctuations and price splits. The delay is usually in the order of 10-15 minutes and the player can access a visual representation of the stocks' time series. All the transactions are stored for each player. For this study we considered transactions that span from the 1st of January 2014 to the 31st of May 2014. This time period was chosen because during it the player ranking determined the winner of the monthly prize giveaway. Two possible rewards can be identified: a psychological one, consisting of the ranking position, and a tangible one, namely the prize for the highest achiever.

The transactions have been stored in a database in order to be manipulated and used to fit models with different combinations of free parameters. The rows are structured in 6 fields: Date, Type, Stock, Volume, Price and Total. The dataset initially contained about 100k transactions, which were reduced to about 1.4k by preprocessing: this removed the many inactive players who played only at the beginning and/or at the end of the time frame considered. In the final version of the dataset there are 46 players. The average number of transactions per player is 30. The most active player performed 107 transactions during the period considered. We considered all the transactions each player made in the game.

### 2.2 Reinforcement Learning Setup

We adopted a widely used off-policy RL framework called Q-learning [21]. The learning rule of this model is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (1)$$

where $Q(s_t, a_t)$ represents the value of action $a_t$ while in state $s_t$, at time $t$. $\alpha$ is the step-size parameter and controls the rate of learning. $\gamma$ is the discount factor and represents how far-sighted the model is: it encodes how much a future reward is worth at time $t$. When $\gamma = 0$ only immediate rewards are taken into account by the player.
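As an illustration of the update rule described above, the following is a minimal sketch of one tabular Q-learning step for a setup with two states and three actions. The function names, the zero initialisation of the table and the example numbers are assumptions made for illustration; they are not taken from the authors' scripts.

```python
# Minimal sketch of a tabular Q-learning step (illustrative names only).
STATES = ("win", "loss")
N_ACTIONS = 3  # e.g. three risk classes


def init_q():
    """Return a Q-table with all values set to zero (an assumed initialisation)."""
    return {state: [0.0] * N_ACTIONS for state in STATES}


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    best_next = max(q[next_state])  # value of the greedy action in the next state
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Myopic case: with gamma = 0 the next state's values are ignored.
q = init_q()
q_update(q, "win", 2, reward=0.5, next_state="win", alpha=0.1, gamma=0.0)
```

With `gamma=0.0` the update only tracks the immediate reward, which corresponds to the myopic variant discussed in the text.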

To test this framework the task has been mapped as follows. There are two states (win, loss) calculated according to the profit of the player (details in Equations 6 and 7). These two states reflect the dichotomy rooted in the gain/loss spectrum of Prospect Theory’s value function [73].

Since all players begin with the same initial budget, our calculation of the profit uses the returns accumulated by selling stocks. This choice reduces the scope of the model, focusing on the cash component of the players' assets. This will be referred to as the “Sell” model. The actions are mapped to the stocks available for trading.

To avoid dimensionality issues (107 stocks for 2 states give rise to 214 potential state-action pairs), we decided to classify the stocks into 3 classes of risk using a widely used financial modelling measure, $\beta_{CAPM}$. The acronym stands for Capital Asset Pricing Model, a model developed by Sharpe [74] and used to explain the relationship between the expected return of a security and its risk. In this report we will refer to this financial volatility measure as $\beta_F$.

This financial modelling measure quantifies the volatility of a security in comparison to the market or a reference benchmark [75]. Relatively safe investments like utilities stocks (e.g. gas and power) have a low $\beta_F$, while high-tech stocks (e.g. Nasdaq or MSH Morgan Stanley High-Tech) have a high $\beta_F$.

As an example, the $\beta_F$ of the index of reference (which represents the portion of the market considered) is exactly 1. A $\beta_F < 1$ indicates that the asset has a lower volatility than the market, or that its price movements have a low correlation with those of the market. A $\beta_F > 1$, instead, signifies an investment with higher volatility than the benchmark. Following the previous example, high-tech securities with a $\beta_F > 1$ could yield better returns than their benchmark index when the market is going up. This also poses more risk because, if the market loses value, the security would lose value at a higher rate than the index.

$\beta_F$ is considered a measure of the systematic risk and can be estimated by regression. Considering the returns of an asset $r_a$ and the returns of the corresponding benchmark $r_b$:

$$\beta_F = \frac{\mathrm{Cov}(r_a, r_b)}{\mathrm{Var}(r_b)}$$

$\beta_F$ has been calculated for each stock in the FTSE100 at the time of the game by considering daily returns in the year between 1st June 2013 and 31st May 2014. The measure associated with each stock is used to rank the stocks and subdivide them into three classes, containing 36, 36 and 35 stocks respectively.
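The estimation and ranking steps described above can be sketched as follows. This is a minimal sketch under stated assumptions: the volatility measure is computed as the covariance of asset and benchmark returns divided by the variance of the benchmark returns (which equals the slope of an ordinary least-squares regression of the former on the latter), and the classes are filled in ascending order of the measure. The function names, the class labels 0/1/2 and the order in which the class sizes are assigned are hypothetical.

```python
def beta_f(asset_returns, benchmark_returns):
    """Estimate the volatility measure as Cov(asset, benchmark) / Var(benchmark)."""
    n = len(asset_returns)
    mean_a = sum(asset_returns) / n
    mean_b = sum(benchmark_returns) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(asset_returns, benchmark_returns)) / (n - 1)
    var = sum((b - mean_b) ** 2 for b in benchmark_returns) / (n - 1)
    return cov / var


def risk_classes(betas, sizes=(36, 36, 35)):
    """Rank stocks by their measure and split them into consecutive classes.

    betas maps a stock identifier to its estimated measure; class 0 holds the
    lowest values. The assignment of sizes to classes is an assumption.
    """
    if sum(sizes) != len(betas):
        raise ValueError("class sizes must add up to the number of stocks")
    ranked = sorted(betas, key=betas.get)
    classes, start = {}, 0
    for label, size in enumerate(sizes):
        for stock in ranked[start:start + size]:
            classes[stock] = label
        start += size
    return classes
```

For example, a benchmark compared against itself has a measure of exactly 1, in line with the description of the reference index in the text.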

Reward at time step $t$ is defined by the gain (or loss) made in a sell transaction. Buying transactions are taken into account to track players' portfolios and to calculate the price difference. They were not used as actions, but in the future we might extend our modelling scenario by integrating a “Buy” model and considering purchase actions by changing the reward scheme. The reward is calculated as:

where $v_i$ and $p_i$ are the volume and the price of the stock traded at the i-th time step. The second term of the difference is a weighted average of the stock prices at previous times.

To avoid numerical instabilities, the reward has been flattened with a sigmoid function into the range $(-1, 1)$. Specifically, a hyperbolic tangent has been used, scaled to capture most of the variability of the rewards and flatten only the extreme values. This choice is in line with the prospect theory value function, which is concave for gains and convex for losses [73].

As in this study we focused on the sell subset of the players' interactions, the states are based solely on profit, which in turn is based on the reward of the sell transactions. The profit and states are defined as:
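The chain from a sell transaction to a state can be sketched as below. This is only one reading of the description above, and every specific choice is an assumption rather than the authors' definition: the raw reward is taken as the volume sold times the difference between the sale price and a volume-weighted average of earlier prices, the `scale` argument of the hyperbolic tangent is a hypothetical parameter, the profit is taken as the running sum of squashed rewards, and the win/loss threshold is placed at zero.

```python
import math


def sell_reward(volume, price, past_volumes, past_prices):
    """Gain or loss of a sale relative to a volume-weighted average of earlier prices.

    Assumed form; the exact expression is not recoverable from the text.
    """
    weighted_avg = (sum(v * p for v, p in zip(past_volumes, past_prices))
                    / sum(past_volumes))
    return volume * (price - weighted_avg)


def squash(reward, scale):
    """Flatten a reward into (-1, 1) with a hyperbolic tangent (scale is hypothetical)."""
    return math.tanh(reward / scale)


def profits_and_states(squashed_rewards):
    """Accumulate rewards into a running profit and label each step win or loss."""
    profit, history = 0.0, []
    for reward in squashed_rewards:
        profit += reward
        history.append((profit, "win" if profit >= 0 else "loss"))  # assumed threshold
    return history
```

A player whose cumulative profit from sales dips below zero would, under these assumptions, move from the "win" state to the "loss" state until later gains compensate.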

The RL framework is composed of a learning model (Equation 1) and an action-selection model that is responsible for picking the best action. In our setup the former is Q-learning and the latter is Soft-Max. An action $a$ is picked at time $t$ with probability:

$$P(a_t = a \mid s_t) = \frac{e^{\beta Q(s_t, a)}}{\sum_{b=1}^{n} e^{\beta Q(s_t, b)}}$$

where $n$ is the number of available actions (i.e. 3 in this study) and $\beta$ is the inverse temperature parameter, which represents the greediness of the action-selection model. In the limit $\beta \to 0$ the actions become equiprobable and the model reverts to random. Higher values of $\beta$ approximate a greedy model which picks the best known action (fully exploitative).

The full model has 3 free parameters: $\alpha$ (step-size parameter or learning rate), $\beta$ (exploration-exploitation trade-off) and $\gamma$ (discount factor). For this study we used a bounded gradient descent search with 27 combinations of initial guess points. These are the combinations of values of the free parameters from which the search starts. By having different entry points we hope to reduce the chance of the search getting stuck in a local minimum. The search has been performed with the following boundaries:

• (for the myopic model $\gamma = 0$)

The entry points are the combinations arising from the following values:

The search results have been obtained with Python 2.7.9, using scipy.optimize.minimize from SciPy 0.17.1.
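A multi-start bounded search of the kind described above might look like the sketch below. It assumes the `L-BFGS-B` method of `scipy.optimize.minimize` as the bounded optimiser, since the text does not name the method. The bounds, the three starting values per parameter and the toy objective are placeholders chosen for illustration; they are not the values or the likelihood used in the study.

```python
import itertools

from scipy.optimize import minimize


def multi_start_minimize(objective, start_values, bounds):
    """Run a bounded search from every combination of starting values; keep the best."""
    best = None
    for x0 in itertools.product(*start_values):
        result = minimize(objective, x0, method="L-BFGS-B", bounds=bounds)
        if best is None or result.fun < best.fun:
            best = result
    return best


# Placeholder bounds and starting values for (alpha, beta, gamma):
# 3 values per parameter give 3 * 3 * 3 = 27 entry points.
bounds = [(0.0, 1.0), (0.0, 10.0), (0.0, 1.0)]
start_values = [(0.1, 0.5, 0.9), (0.5, 5.0, 9.0), (0.1, 0.5, 0.9)]


def toy_objective(params):
    """Stand-in for a negative log-likelihood, with a known minimum inside the bounds."""
    alpha, beta, gamma = params
    return (alpha - 0.3) ** 2 + (beta - 2.0) ** 2 + (gamma - 0.7) ** 2


best = multi_start_minimize(toy_objective, start_values, bounds)
```

In an actual fit, `toy_objective` would be replaced by the negative log-likelihood of a player's choices, and the myopic model would be obtained by fixing the discount factor at zero and searching over the remaining two parameters.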

### 2.3 Model Testing Routine

Maximum Likelihood Estimate (MLE) has been used as a measure of model fitness, following Daw’s comprehensive analysis of methodology [76]. MLE is an appropriate method to assess model performance because it evaluates, using a probabilistic approach, which set of model parameters is most likely to have generated the data. Data likelihood is a powerful method because it takes into account the presence of noise in the choices. It does so by using probability estimates for the potential actions.

Given a model $M$ and its corresponding set of parameters $\theta_M$, the likelihood function is defined as $P(D \mid M, \theta_M)$, where $D$ is the dataset (the list of choices and the associated rewards). Applying Bayes’ rule:

$$P(\theta_M \mid D, M) \propto P(D \mid M, \theta_M) \, P(\theta_M \mid M)$$

The left-hand side of the proportionality is the posterior probability distribution over the free parameters, given the data. This quantity is proportional to the product of the likelihood of the data, given the parameters, and the prior probability of the parameters. Treating the latter as flat, we obtain that the most probable value for $\theta_M$ (the best set of free parameters) is the Maximum Likelihood Estimate (MLE), that is, the set of parameters which maximises the likelihood function; it is commonly denoted $\hat{\theta}_M$. The likelihood function is maximised through the following process. At each time step, the observation model (Soft-Max) estimates a probability for every action. The probabilities assigned to the actions actually chosen are then multiplied together. To avoid numerical problems that could arise when multiplying probabilities, the sum of their logarithms is calculated instead. The negative of this value, also known as the Negative Log-Likelihood, is then used. The aim is then to minimise this quantity, which is equivalent to maximising the likelihood function $P(D \mid M, \theta_M)$. In the following we will refer to the Negative Log-Likelihood as MLE for simplicity, keeping in mind that lower values represent a better fit.
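The procedure described above can be sketched as a function that replays a player's choices, asks Soft-Max for the probability of each chosen action, and accumulates the negative logarithms. This is a minimal sketch: the trial format `(state, action, reward, next_state)`, the zero-initialised Q-table and the function names are assumptions for illustration, not the authors' implementation.

```python
import math


def softmax_probs(q_values, beta):
    """Soft-Max action probabilities for one state, with inverse temperature beta."""
    shift = max(q_values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(beta * (q - shift)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(trials, alpha, beta, gamma, n_actions=3):
    """Negative log-likelihood of a sequence of choices under Q-learning + Soft-Max.

    trials is a list of (state, action, reward, next_state) tuples with
    states in {"win", "loss"} and actions in range(n_actions).
    """
    q = {"win": [0.0] * n_actions, "loss": [0.0] * n_actions}
    nll = 0.0
    for state, action, reward, next_state in trials:
        # Probability the model assigns to the action that was actually taken.
        nll -= math.log(softmax_probs(q[state], beta)[action])
        # Q-learning update applied after the choice has been scored.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
    return nll
```

With `beta=0.0` every action has probability 1/3, so the function returns the number of trials times ln 3, which is the value expected for a purely random model with three actions.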

The MLE values generated represent the goodness of fit of the model with its associated set of parameters. To compare the selected model with a random model, and to assess statistical significance, we adopted the Likelihood Ratio Test [77]. This statistical test uses the likelihood ratio to compare two nested models and takes into account their different numbers of free parameters: the difference between the two numbers is used as the degrees of freedom of the chi-square ($\chi^2$) test. Since the test statistic is $\chi^2$ distributed, it is straightforward to estimate the p-value associated with the $\chi^2$ value.
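As a sketch of the test described above, the statistic can be computed as twice the difference between the negative log-likelihoods of the simpler and the more complex model, and compared against a chi-square distribution whose degrees of freedom equal the number of extra free parameters. The closed-form survival functions below cover only 1, 2 and 3 degrees of freedom, which are the differences that arise between models with 0, 2 and 3 free parameters; `scipy.stats.chi2.sf` would be the general-purpose alternative. The function names are illustrative.

```python
import math


def chi2_sf(x, df):
    """Survival function of the chi-square distribution for df in {1, 2, 3}."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("only 1, 2 or 3 degrees of freedom are supported in this sketch")


def likelihood_ratio_test(nll_simple, nll_complex, extra_params):
    """Return the test statistic and p-value for two nested models.

    nll_simple and nll_complex are negative log-likelihoods; extra_params is
    the number of additional free parameters in the more complex model.
    """
    statistic = 2.0 * (nll_simple - nll_complex)
    return statistic, chi2_sf(statistic, extra_params)
```

For instance, comparing a two-parameter model against a parameter-free random model uses `extra_params=2`, while comparing the three-parameter model against its two-parameter nested version uses `extra_params=1`.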

The baseline for comparison is a random model which has 0 parameters, as there is no learning involved and the action-selection policy is random (a $1/3$ chance of picking any of the three stock bins). The first comparison is between this random model and the simpler of the proposed models, which has only two free parameters ($\alpha$, $\beta$). This setup represents the naïve learning procedure that could explain the investors’ behaviour shown in the literature [57]. The full model, with all three free parameters, has also been tested against the myopic model to assess whether some players are better fitted by a more complex version of the framework. Finally, the model goodness of fit has been evaluated with the adopted action classification (based on risk) against 500 randomly generated stock classifications. This has been done to test the assumption that players internally classify the stock range into discrete degrees of risk.

## 3 Results

Results for the test of the hypothesis that RL is a component of decision making are shown in Figure 1. The best set of parameters was found according to MLE through gradient descent search. The best model MLE has been compared to the random model MLE using the Likelihood Ratio Test [77]. The random model MLE is easily estimated as:

$$\mathrm{MLE}_{random} = -\sum_{t=1}^{n} \ln \frac{1}{3} = n \ln 3$$

where $n$ is the number of transactions for each player in the dataset. As shown in Figure 1 (a) and (b), 15% of the players in our dataset are better fitted by a myopic RL model than by a random model. In Figure 1 (c) and (d) we report an improvement in the fit for some players using a full RL model against the myopic (nested) version of the model. This improvement is not reflected in the comparison of the full RL model against the random model, as shown in Figure 1 (e) and (f). Most of the players that can be fitted with our models are well represented by a myopic model. These results are in line with the findings of Choi et al. [57] and Huang et al. [58].

We made the assumption that players, when faced with the choice to trade many stocks (107 for this task), internally model these in discrete groups of risk using readily available information such as historical stock prices and returns, which in turn are used to estimate their volatility ($\beta_F$, the CAPM-based measure). To test whether this assumption holds true for the players in our dataset, we ran the simpler version of our model on the risk-ranked discretisation and on 500 independent and randomly scrambled discretisations. The results are shown in Figure 2 and are generated using the Bayesian Information Criterion (BIC) as a measure for comparison of fitness and a Binomial Proportion Confidence Interval calculated with the Clopper-Pearson method using the Matlab 8.4.0.150421 (R2014b) function binofit. The BIC has been used because the Likelihood Ratio Test can only compare nested models, while in this case the comparison is between models with the same number of parameters that are tested on different data arrangements. This procedure estimates the probability that the ranked discretisation is better than the 500 scrambled discretisations ($P(\mathrm{BIC}_{ranked} < \mathrm{BIC}_s)$, where $\mathrm{BIC}_s$ is the BIC for the s-th scrambled discretisation). The results shown in Figure 2 are for a 99% confidence interval and only for those 7 players who are fitted significantly by the myopic model.

As shown in Figure 2, all the players are well above the chance threshold. This indicates that risk based on historical data could be considered a proxy for action selection for the players who are well fitted by our myopic RL model.
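The quantities used in this section can be sketched in a few functions: the negative log-likelihood of the random model, the BIC of a fitted model, the fraction of scrambled discretisations that the ranked one beats, and a Clopper-Pearson interval for that fraction. This is a stand-in written for illustration, not the Matlab `binofit` routine used by the authors: the interval is obtained here by bisection on the binomial distribution, the function names are hypothetical, and the counts in the usage line are made up.

```python
import math


def random_nll(n_transactions, n_actions=3):
    """Negative log-likelihood of choosing uniformly at random among n_actions."""
    return -n_transactions * math.log(1.0 / n_actions)


def bic(nll, n_params, n_obs):
    """Bayesian Information Criterion from a negative log-likelihood (lower is better)."""
    return n_params * math.log(n_obs) + 2.0 * nll


def count_beaten(ranked_bic, scrambled_bics):
    """Number of scrambled discretisations with a worse (higher) BIC than the ranked one."""
    return sum(1 for s in scrambled_bics if ranked_bic < s)


def _binom_cdf(k, n, p):
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, target, increasing):
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k, n, confidence=0.99):
    """Exact (Clopper-Pearson) confidence interval for k successes out of n trials."""
    tail = (1.0 - confidence) / 2.0
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1.0 - _binom_cdf(k - 1, n, p), tail, increasing=True)
    upper = 1.0 if k == n else _bisect(
        lambda p: _binom_cdf(k, n, p), tail, increasing=False)
    return lower, upper


# Hypothetical usage: the ranked discretisation beats 480 of 500 scrambled ones.
interval = clopper_pearson(480, 500, confidence=0.99)
```

If the lower end of such an interval lies above 0.5, the ranked discretisation would be judged better than chance for that player under this sketch.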

## 4 Conclusion

We investigated a publicly available dataset consisting of trading transactions made by players of an investment game. We based the discretisation of the actions on the assumption that risk can capture the internal modelling that players perform when facing this task. This assumption was shown to hold true and be statistically significant for a subset of the players: 31 out of 46, and specifically the 7 players who are best fitted by an RL model. This could signify that the remaining players might use other types of discretisation based on different measures (or a combination of them), or that they do not use technical analysis but fundamental analysis (e.g. using financial statements and reports). In this work, we investigated a model which combines two versions of a Reinforcement Learning framework, using Q-learning as the update rule and Soft-Max as the action-selection policy on an action space discretised according to the risk measure $\beta_F$ (the CAPM-based volatility measure). It is possible that different model combinations, which use different learning rules or different measures of risk, fit the player population in our dataset better. It is also likely that, by restricting our focus to the sell model, we missed some features of what constitutes the reward signal that players receive. In the full version of the game, in fact, players might try to maximise both holdings and cash simultaneously, in order to compete in the ranking.

The myopic model is a nested version with only two free parameters, representing the learning rate ($\alpha$) and the degree of greediness ($\beta$). The full version extends the simpler model with a discount factor ($\gamma$), which regulates how much of the future rewards is taken into account when updating the values of present state-action pairs. 15% of the players are well fitted by an RL model with $\gamma = 0$, and there is no significant improvement in fit when this model is extended by including gamma as a free parameter. Previous literature pointed in the direction of investors being naïve (short-sighted) [57] and these results, albeit for a subset of the dataset, confirm this indication. The hypothesis that RL is a component of the decision-making process for some investors is not confirmed, as either version of the tested model (short- or far-sighted) is statistically better than chance only for a subset of the players. This subset, within this population, is not large enough to draw a statistically meaningful positive result. By means of a Binomial Proportion Confidence Interval calculated with the Clopper-Pearson method we get a negative result for the entire population within a 99% confidence interval (Figure 3 in the Appendix). While this exploratory study gives some perspectives on how Reinforcement Learning can be used to model learning and action-selection for investing problems, future work will focus on different RL models and risk classification techniques, on a deeper investigation of the typical parameters of the best performing players, and on the correlation between different strategies and stock-trading performance.

## Acknowledgment

The authors would like to thank their colleagues of the Sheffield Neuroeconomics interdisciplinary research group for insightful discussion.

## A Appendix

This manuscript is a correction to the article “Modelling Stock-market Investors as Reinforcement Learning Agents” by the same authors, issued in the proceedings of the 2015 IEEE Conference on Evolving and Adaptive Intelligent Systems. The corrections include fixing a bug in the script which estimated the probabilities used in the calculation of model fitness. In the previous work we applied some constraints and used a different measure: the number of transactions considered for each player was capped at 25 and the measure of risk used to rank the stocks and classify them into discrete categories was defined by the authors as:

where $\beta_F$ is the financial modelling measure of volatility of a security used in the present work, $\sigma_j$ is the standard deviation of the j-th stock and $\sigma_{max}$ is the largest standard deviation in the stock pool. This measure of riskiness was used as it was believed to take into account the graphical interpretation of the fluctuations in time series ($\sigma_j$) and the overall trend of the security compared to the market ($\beta_F$).

### Footnotes

1. http://www.virtualtrader.co.uk - Copyright IEX Media Group BV

### References

1. J. Von Neumann and O. Morgenstern, Theory of games and economic behavior. 1947.
2. C. Starmer, “Developments in Non-Expected Utility Theory: The Hunt for a Descriptive Theory of Choice under Risk.” pp. 332–382, 2000.
3. S. Frederick, G. Loewenstein, and T. O’Donoghue, “Time Discounting and time preference: a critical review.” Journal of Economic Literature, vol. XL, pp. 351–401, 2002.
4. A. Tversky and D. Kahneman, “Judgment under uncertainty : heuristics and biases.” Science, vol. 185, no. 4157, pp. 1124–1131, 1974.
5. D. Kahneman, J. L. Knetsch, and R. H. Thaler, “Anomalies: The Endowment Effect, Loss Aversion, and Status Quo Bias.” Journal of Economic Perspectives, vol. 5, no. 1, pp. 193–206, 1991.
6. U. Hoffrage, A. Weber, R. Hertwig, and V. M. Chase, “How to Keep Children Safe in Traffic: Find the Daredevils Early.” Journal of Experimental Psychology: Applied, vol. 9, no. 4, pp. 249–260, 2003.
7. T. J. Pleskac, “Decision making and learning while taking sequential risks.” Journal of Experimental Psychology: Learning, vol. 34, no. 1, pp. 167–185, 2008.
8. T. S. Wallsten, T. J. Pleskac, and C. W. Lejuez, “Modeling Behavior in a Clinically Diagnostic Sequential Risk-Taking Task.” Psychological Review, vol. 112, no. 4, pp. 862–880, 2005.
9. K. H. Britten, W. T. Newsome, M. N. Shadlen, S. Celebrini, and J. A. Movshon, “A relationship between behavioral choice and the visual responses of neurons in macaque MT.” Visual neuroscience, vol. 13, no. 1, pp. 87–100, 1996.
10. K. H. Britten, M. N. Shadlen, W. T. Newsome, and J. A. Movshon, “The analysis of visual motion: a comparison of neuronal and psychophysical performance.” The Journal of neuroscience: the official journal of the Society for Neuroscience, vol. 12, no. 12, pp. 4745–4765, 1992.
11. J. I. Gold and M. N. Shadlen, “Neural computations that underlie decisions about sensory stimuli.” Trends in Cognitive Sciences, vol. 5, no. 1, pp. 10–16, 2001.
12. ——, “Banburismus and the brain: Decoding the relationship between sensory stimuli, decisions, and reward.” Neuron, vol. 36, no. 2, pp. 299–308, 2002.
13. ——, “The neural basis of decision making.” Annual review of neuroscience, vol. 30, pp. 535–574, 2007.
14. M. N. Shadlen, K. H. Britten, W. T. Newsome, and J. A. Movshon, “A computational analysis of the relationship between neuronal and behavioral responses to visual motion.” J Neurosci, vol. 16, no. 4, pp. 1486–1510, 1996.
15. M. N. Shadlen and W. T. Newsome, “Motion perception: seeing and deciding.” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 2, pp. 628–633, 1996.
16. A. G. Sanfey, G. Loewenstein, S. M. McClure, and J. D. Cohen, “Neuroeconomics: cross-currents in research on decision-making.” Trends in cognitive sciences, vol. 10, no. 3, pp. 108–16, mar 2006.
17. P. W. Glimcher and A. Rustichini, “Neuroeconomics: the consilience of brain and decision.” Science (New York, N.Y.), vol. 306, no. 5695, pp. 447–52, oct 2004.
18. G. Loewenstein, S. Rick, and J. D. Cohen, “Neuroeconomics.” Annual review of psychology, vol. 59, pp. 647–72, jan 2008.
19. D. J. Barraclough, M. L. Conroy, and D. Lee, “Prefrontal cortex and decision making in a mixed-strategy game.” Nature neuroscience, vol. 7, no. 4, pp. 404–410, 2004.
20. A. Soltani, D. Lee, and X. J. Wang, “Neural mechanism for stochastic behaviour during a competitive game.” Neural Networks, vol. 19, no. 8, pp. 1075–1090, 2006.
21. R. S. Sutton and A. G. Barto, Reinforcement Learning: An introduction. A Bradford Book, MIT Press, Cambridge, MA, 1998.
22. K. Doya, “Reinforcement learning: Computational theory and biological mechanisms.” HFSP Journal, vol. 1, no. 1, p. 30, 2007.
23. G. Tesauro, “TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play.” Neural Computation, vol. 6, no. 2, pp. 215–219, 1994.
24. P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An Application of Reinforcement Learning to Aerobatic Helicopter Flight.” Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, pp. 1–9, 2006.
25. R. Hafner and M. Riedmiller, “Neural reinforcement learning controllers for a real robot application.” Proceedings - IEEE International Conference on Robotics and Automation, no. April, pp. 2098–2103, 2007.
26. M. A. Walker, “An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email.” Journal of Artificial Intelligence Research, vol. 12, p. 387, 2000.
27. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning.” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
28. P. Dayan and N. D. Daw, “Decision theory, reinforcement learning, and the brain.” Cognitive, affective & behavioral neuroscience, vol. 8, no. 4, pp. 429–53, dec 2008.
29. W. Schultz, P. Dayan, and P. R. Montague, “A neural substrate of prediction and reward.” Science (New York, N.Y.), vol. 275, no. 5306, pp. 1593–9, mar 1997.
30. K. Doya, “Complementary roles of basal ganglia and cerebellum in learning and motor control.” Current Opinion in Neurobiology, vol. 10, pp. 732–739, 2000.
31. N. D. Daw and K. Doya, “The computational neurobiology of learning and reward.” Current Opinion in Neurobiology, vol. 16, no. 2, pp. 199–204, 2006.
32. C. J. C. H. Watkins, “Learning from delayed rewards.” Ph.D. thesis, University of Cambridge, 1989.
33. P. Werbos, “A menu of designs for reinforcement learning over time.” in Neural Networks for Control, MIT Press, Cambridge, Massachusetts, pp. 67–95, 1990.
34. O. Hikosaka, K. Nakamura, and H. Nakahara, “Basal ganglia orient eyes to reward.” Journal of Neurophysiology, vol. 95, no. 2, pp. 567–584, 2006.
35. W. Schultz, R. Romo, T. Ljungberg, J. Mirenowicz, J. R. Hollerman, and A. Dickinson, “Reward-related signals carried by dopamine neurons.” in Models of Information Processing in the Basal Ganglia, MIT Press, Cambridge, MA, 1995, pp. 233–248.
36. R. E. Suri and W. Schultz, “Learning of sequential movements by neural network model with dopamine-like reinforcement signal.” Experimental Brain Research, vol. 121, no. 3, pp. 350–354, 1998.
37. P. Waelti, A. Dickinson, and W. Schultz, “Dopamine responses comply with basic assumptions of formal learning theory.” Nature, vol. 412, no. 6842, pp. 43–48, 2001.
38. T. Satoh, S. Nakai, T. Sato, and M. Kimura, “Correlated coding of motivation and outcome of decision by dopamine neurons.” The Journal of Neuroscience, vol. 23, no. 30, pp. 9913–9923, 2003.
39. H. Nakahara, H. Itoh, R. Kawagoe, Y. Takikawa, and O. Hikosaka, “Dopamine Neurons Can Represent Context-Dependent Prediction Error.” Neuron, vol. 41, no. 2, pp. 269–280, 2004.
40. G. Morris, A. Nevet, D. Arkadir, E. Vaadia, and H. Bergman, “Midbrain dopamine neurons encode decisions for future action.” Nature Neuroscience, vol. 9, no. 8, pp. 1057–1063, 2006.
41. A. Barto, “Adaptive critics and the basal ganglia.” in Models of Information Processing in the Basal Ganglia, MIT Press, Cambridge, MA, 1995, pp. 215–232.
42. P. R. Montague, P. Dayan, and T. J. Sejnowski, “A framework for mesencephalic dopamine systems based on predictive Hebbian learning.” The Journal of Neuroscience, vol. 16, no. 5, pp. 1936–1947, 1996.
43. K. Samejima, Y. Ueda, K. Doya, and M. Kimura, “Representation of action-specific reward values in the striatum.” Science, vol. 310, no. 5752, pp. 1337–1340, 2005.
44. R. Kawagoe, Y. Takikawa, and O. Hikosaka, “Expectation of reward modulates cognitive signals in the basal ganglia.” Nature Neuroscience, vol. 1, no. 5, pp. 411–416, 1998.
45. R. Kawagoe, Y. Takikawa, and O. Hikosaka, “Reward-predicting activity of dopamine and caudate neurons: a possible mechanism of motivational control of saccadic eye movement.” Journal of Neurophysiology, vol. 91, no. 2, pp. 1013–1024, 2004.
46. W. Schultz, “Getting formal with dopamine and reward.” Neuron, vol. 36, no. 2, pp. 241–263, 2002.
47. J. J. Day, M. F. Roitman, R. M. Wightman, and R. M. Carelli, “Associative learning mediates dynamic shifts in dopamine signaling in the nucleus accumbens.” Nature Neuroscience, vol. 10, no. 8, pp. 1020–1028, 2007.
48. S. E. Hyman, R. C. Malenka, and E. J. Nestler, “Neural mechanisms of addiction: the role of reward-related learning and memory.” Annual Review of Neuroscience, vol. 29, pp. 565–598, 2006.
49. D. Joel, Y. Niv, and E. Ruppin, “Actor-critic models of the basal ganglia: new anatomical and computational perspectives.” Neural Networks, vol. 15, no. 4-6, pp. 535–547, 2002.
50. J. R. Wickens, J. C. Horvitz, R. M. Costa, and S. Killcross, “Dopaminergic mechanisms in actions and habits.” The Journal of Neuroscience, vol. 27, no. 31, pp. 8181–8183, 2007.
51. A. A. Stocker and E. P. Simoncelli, “Noise characteristics and prior expectations in human visual speed perception.” Nature Neuroscience, vol. 9, no. 4, pp. 578–585, 2006.
52. K. Körding, “Decision theory: what ‘should’ the nervous system do?” Science, vol. 318, no. 5850, pp. 606–610, 2007.
53. N. D. Daw, J. P. O’Doherty, P. Dayan, B. Seymour, and R. J. Dolan, “Cortical substrates for exploratory decisions in humans.” Nature, vol. 441, no. 7095, pp. 876–879, 2006.
54. G. Luksys, W. Gerstner, and C. Sandi, “Stress, genotype and norepinephrine in the prediction of mouse behavior using reinforcement learning.” Nature Neuroscience, vol. 12, no. 9, pp. 1180–1186, Sep. 2009.
55. R. Frey and R. Hertwig, “Sell in May and Go Away? Learning and Risk Taking in Nonmonotonic Decision Problems.” Journal of Experimental Psychology, vol. 41, no. 1, pp. 193–208, 2015.
56. S. Benartzi and R. H. Thaler, “Heuristics and Biases in Retirement Savings Behavior.” Journal of Economic Perspectives, vol. 21, no. 3, pp. 81–104, 2007.
57. J. Choi and D. Laibson, “Reinforcement learning and savings behavior.” The Journal of Finance, vol. 64, no. 6, 2009.
58. X. Huang, “Industry Investment Experience and Stock Selection.” Available at SSRN 1786271, no. November, 2012.
59. T. Odean, “Are Investors Reluctant to Realize Their Losses?” The Journal of Finance, vol. LIII, no. 5, pp. 1775–1798, 1998.
60. Y. Chen, S. Mabu, K. Hirasawa, and J. Hu, “Trading rules on stock markets using genetic network programming with sarsa learning.” Proceedings of the 9th annual conference on Genetic and evolutionary computation GECCO 07, vol. 12, p. 1503, 2007.
61. J. Lee, “Stock price prediction using reinforcement learning.” Industrial Electronics. Proceedings. ISIE 2001. IEEE International Symposium on, vol. 1, pp. 690–695, 2001.
62. J. O, J. Lee, J. Lee, and B. Zhang, “Adaptive stock trading with dynamic asset allocation using reinforcement learning.” Information Sciences, vol. 176, no. 15, pp. 2121–2147, 2006.
63. J. Moody and M. Saffell, “Learning to trade via direct reinforcement.” IEEE Transactions on Neural Networks, vol. 12, no. 4, pp. 875–889, 2001.
64. A. V. Rutkauskas and T. Ramanauskas, “Building an artificial stock market populated by reinforcement-learning agents.” Journal of Business Economics and Management, vol. 10, no. 4, pp. 329–341, 2009.
65. B. M. Barber and T. Odean, “The Behavior of Individual Investors.” Handbook of the Economics of Finance, vol. 2, part B, pp. 1533–1570, 2013.
66. B. M. Barber, Y. T. Lee, Y. J. Liu, and T. Odean, “Is the aggregate investor reluctant to realise losses? Evidence from Taiwan.” European Financial Management, vol. 13, no. 3, pp. 423–447, 2007.
67. P. Brown, N. Chappel, R. Da Silva Rosa, and T. Walter, “The Reach of the Disposition Effect: Large Sample Evidence Across Investor Classes.” International Review of Finance, vol. 6, no. 1-2, p. 43, 2006.
68. A. Frazzini, “The disposition effect and underreaction to news.” Journal of Finance, vol. 61, no. 4, pp. 2017–2046, 2006.
69. M. Grinblatt and M. Keloharju, “What Makes Investors Trade?” The Journal of Finance, vol. 56, no. 2, pp. 549–578, 2001.
70. C. Heath and M. Lang, “Psychological Factors and Stock Option Exercise.” Quarterly Journal of Economics, vol. 114, no. 2, pp. 601–627, 1999.
71. T. Odean, “Do Investors Trade Too Much?” American Economic Review, vol. 89, no. 5, pp. 1279–1298, 1999.
72. Z. Shapira and I. Venezia, “Patterns of behavior of professionally managed and independent investors.” Journal of Banking and Finance, vol. 25, no. 8, pp. 1573–1587, 2001.
73. D. Kahneman and A. Tversky, “Prospect Theory: An Analysis of Decision under Risk.” Econometrica, vol. 47, no. 2, pp. 263–291, 1979.
74. W. Sharpe, “Capital Asset Prices: A Theory of Market Equilibrium Under Conditions of Risk,” The Journal of Finance, vol. XIX, no. 3, pp. 425–442, 1964.
75. S. Benninga, Financial Modeling, 2000.
76. N. D. Daw, “Trial-by-trial data analysis using computational models,” in Decision Making, Affect, and Learning: Attention and Performance XXIII, 2011.
77. J. P. Huelsenbeck and K. A. Crandall, “Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood,” Annual Review of Ecology and Systematics, vol. 28, no. 1, pp. 437–466, 1997.
78. R. M. Costa, “Plastic Corticostriatal Circuits for Action Learning: What’s Dopamine Got to Do with It?,” Annals of the New York Academy of Sciences, pp. 172–191, 2007.