An intelligent financial portfolio trading strategy using deep Q-learning
Abstract
Portfolio traders strive to identify dynamic portfolio allocation schemes so that their total budgets are well allocated through the investment horizon. This study proposes a novel portfolio trading strategy in which an intelligent agent is trained to identify an optimal trading action by using an algorithm called deep Q-learning. This study formulates a portfolio trading process as a Markov decision process in which the agent can learn about the financial market environment, and it identifies a deep neural network structure as an approximation of the Q-function. To ensure applicability to real-world trading, we devise three novel techniques that are both reasonable and implementable. First, the agent’s action space is modeled as a combinatorial action space of trading directions with prespecified trading sizes for each asset. Second, we introduce a mapping function that can replace an initially determined action that may be infeasible with a feasible action that is reasonably close to the original, ideal action. Last, we introduce a technique by which an agent simulates all feasible actions in each state and learns about these experiences to derive a multi-asset trading strategy that best reflects financial data. To validate our approach, we conduct backtests for two representative portfolios and demonstrate superior results over the benchmark strategies.
keywords:
Portfolio trading, Reinforcement learning, Deep Q-learning, Deep neural network, Markov decision process
1 Introduction
A goal of financial portfolio trading is maximizing the trader’s monetary wealth by allocating capital to a basket of assets in a portfolio over the periods during the investment horizon. Thus, portfolio trading is the most important investment practice in the buy-side financial industry. Portfolio traders strive to establish trading strategies that can properly allocate capital to financial assets in response to time-varying market conditions. Typical objective functions for trading strategy optimization include expected returns and the Sharpe ratio (i.e., risk-adjusted returns). In addition to optimizing an objective function, a trading strategy should achieve a reasonable turnover rate so that it is applicable to real-world financial trading. If the turnover rate is not reasonable, transaction costs hurt overall trading performance.
Portfolio trading is an optimization problem that involves a sequential decision-making process across multiple rebalancing periods. In this process, the stochastic components of time-varying market variables should be considered. Thus, the problem of deriving an optimal portfolio trading strategy has traditionally been formulated as a stochastic optimization problem [Consigli and Dempster, 1998, Golub et al., 1995, Kouwenberg, 2001]. To handle these stochastic components over multiple periods, most related studies have developed heuristic methods [Brock et al., 1992, Chen and Yu, 2017, Chen and Chen, 2016, Derigs and Nickel, 2003, Leigh et al., 2002, Papailias and Thomakos, 2015, Zhang et al., 2015, Zhu and Zhou, 2009]. In addition to heuristic methods, reinforcement learning (RL) is another popular approach to solving stochastic optimization problems. In RL methods, an intelligent agent optimizes its trading strategy by attempting various trading actions and revising its trading action policy according to the rewards gained from the financial environment [Almahdi and Yang, 2017, 2019, Bertoluzzo and Corazza, 2012, Casqueiro and Rodrigues, 2006, Dempster and Leemans, 2006, Eilers et al., 2014, Moody and Saffell, 2001, Moody et al., 1998, Neuneier, 1996, 1998, O et al., 2006, Pendharkar and Cusatis, 2018, Zhang et al., 2015].
RL methods have recently experienced a new age with the advancement of deep neural networks (DNNs). Combined with DNNs, RL has evolved into the so-called deep RL (DRL) method. The deep Q-learning (DQL) algorithm, one of the DRL methods, derives an optimal policy by using a DNN to approximate a Q-function that represents the values of the actions in each state. Applying such methods to derive a trading strategy allows agents to learn about the complex financial environment through their experiences within the environment and then optimize their trading strategies based on these experiences. In addition, these methods have the important advantage that learning agents can update their trading strategies based on their experiences on future trading days. Instead of simply maintaining trading strategies derived from historical data, learning agents can adapt their strategies using their observed experiences on each real trading day [Wang et al., 2016]. With these advantages and the increasing popularity and superior performance of DRL algorithms, many studies have applied DRL to derive optimal trading strategies [Deng et al., 2016, Jeong and Kim, 2019, Jiang et al., 2017, Xiong et al., 2018].
By applying DRL to portfolio trading, a learning agent can understand a complex financial environment and derive an intelligent trading strategy from it. Previous studies have applied DRL algorithms to various portfolio trading problem settings. However, from our perspective, this line of study is yet to mature in terms of practical applicability. First, many studies focus on single-asset trading [Deng et al., 2016, Jeong and Kim, 2019, Xiong et al., 2018]. Because most traders generally trade more than one security, additional decision-making steps are necessary even after single-asset trading rules are derived. Given the benefits of having a multi-asset trading strategy, our study focuses on multi-asset portfolio trading. Second, previous studies on multi-asset portfolio trading often have limited practicality owing to their less practical action spaces [Jiang et al., 2017]. In response to this impracticality, our study defines an intuitive trading action set that enables the trading strategy to be applicable to real-world trading.
This study proposes an approach for deriving a multi-asset portfolio trading strategy using DQL. Unlike studies on single-asset trading strategies using DQL [Jeong and Kim, 2019], this study focuses on a multi-asset trading strategy. Unlike studies on multi-asset trading strategies using DRL [Jiang et al., 2017], we focus on improving practical aspects of trading actions. In the action space used in this study, each action includes trading directions corresponding to each asset in a portfolio, and each trading direction comprises either holding each asset or buying or selling each asset at a prespecified trading size. This discrete action space setting is similar to that of [Xiong et al., 2018], but this study uses a multi-asset action space rather than a single-asset action space, as in [Xiong et al., 2018]. Although a recent study [Pendharkar and Cusatis, 2018] argues that optimizing a trading strategy based on a discrete action space has a negative effect, we find that our discrete action space modeling allows for a lower turnover rate and is more practical than continuous action space modeling.
To develop a practical multi-asset trading strategy, this study tackles a few challenging aspects. First, setting a discrete action space may lead to infeasible actions, and, thus, we may derive an unreasonable trading strategy (i.e., a strategy with frequent and pointless portfolio weight changes that only leads to more transaction costs) as a result of handling these infeasible actions. To address this issue, we introduce a mapping function that prevents the agent from selecting unreasonable actions by mapping infeasible actions onto similar and valuable actions. By applying this mapping function, we can derive a reasonable trading strategy in the practical action space. Second, although we use years of financial data, these data may not provide enough training data for the DRL agent to learn a multi-asset trading strategy in the financial environment. There is a limit to increasing the amount of data, so we need to make the agent gain more experience within the training data and learn as much as possible. Thus, we achieve sufficient learning by simulating all feasible actions in each state and then updating the agent’s trading strategy using the learning experiences from the simulation results. This technique allows the agent to gain and learn enough experience to derive a multi-asset portfolio trading strategy.
The rest of this paper is organized as follows. In Section 2, we first review the related literature and present the differences between our study and previous studies. Section 3 describes the definition of our problem, and Section 4 introduces our approach for deriving an intelligent trading strategy. In Section 5, we provide experimental results to validate the advantages of our approach. Finally, we conclude in Section 6 by providing relevant implications and identifying directions for future research.
2 Literature Review
Portfolio trading is an optimization problem that involves a sequential decision-making process over multiple rebalancing periods. In addition, the stochastic components of market variables should be considered in this process. Thus, traditionally, the derivation of portfolio trading strategies has been formulated as a stochastic programming problem to find an optimal trading strategy. Recently, much effort has been made to solve this stochastic optimization problem using a learning-based approach, RL. To formulate this stochastic optimization problem, it is necessary to determine how to measure the features of the stochastic components corresponding to changes in the financial market. Utilizing technical indicators is more common than utilizing the fundamental indexes of securities in daily frequency portfolio trading, as in our study.
This section reviews how previous studies have attempted to model stochastic market components to formulate the portfolio trading problem and derive an optimal trading strategy. Section 2.1 provides a brief description of previous studies that formulate the stochastic components of the financial market. Section 2.2 reviews previous studies that discuss heuristic methods for deriving an optimal trading strategy. Section 2.3 reviews previous studies that address the stochastic optimization problem to derive an optimal trading strategy using RL.
2.1 Stochastic programming-based models
Early studies on portfolio trading and, sometimes, management used stochastic programming-based models. Stochastic programming models formulate a sequence of investment decisions over time that can maximize a portfolio manager’s expected utility at the end of the investment horizon. Golub et al. [1995] modeled an interest rate series as a binomial lattice scenario using Monte Carlo procedures to solve a money management problem with stochastic programming. Kouwenberg [2001] solved an asset-liability management problem using the event tree method to generate random stochastic programming coefficients. Consigli and Dempster [1998] used scenario-based stochastic dynamic programming to solve an asset-liability management problem. However, stochastic programming-based models have the limitation of needing to generate numerous scenarios to solve a complex problem, such as understanding a financial environment, resulting in a large computational burden.
2.2 Heuristic methods
Because of this limitation of stochastic programming-based models, many studies have devised heuristic methods (i.e., trading heuristics). One of the most famous such methods is technical analysis for asset trading. This method provides a simple yet sophisticated way to identify hidden relationships between market features and asset returns through the study of historical data. Using these identified relationships, investments are made in assets by taking appropriate positions. Brock et al. [1992] conducted backtests with real and artificial data using moving average and trading range strategies. Zhu and Zhou [2009] considered theoretical rationales for using technical analysis and suggested a practical moving average strategy to determine a portion of investments. Chourmouziadis and Chatzoglou [2016] suggested an intelligent stock-trading fuzzy system based on rarely used technical indicators for short-term portfolio trading. Another popular heuristic method is the pattern matching (i.e., charting heuristics) method, which detects critical market situations by comparing the current series of market features to meaningful patterns in the past. Leigh et al. [2002] developed a trading strategy using two types of bull flag pattern matching. Chen and Chen [2016] proposed an intelligent pattern-matching model based on two novel methods in the pattern identification process. The other well-known heuristic method is a metaheuristic algorithm that can find a near-optimal solution in acceptable computation time. Derigs and Nickel [2003] developed a decision support system generator for portfolio management using simulated annealing, and Potvin et al. [2004] applied genetic programming to generate trading rules automatically. Chen and Yu [2017] used a genetic algorithm to group stocks with similar price series to support investors in making more efficient investment decisions.
However, these heuristic methods have limited ability to fully search a very large feasible solution space because they are inflexible. Thus, we need to be careful about the reliability of obtaining an optimal trading strategy using these methods.
2.3 Reinforcement learning-based methods
A recent research direction is optimizing a trading strategy using RL such that a learning agent develops a policy while interacting with the financial environment. Using RL, a learning-based method, the learning agent can search for an optimal trading strategy flexibly in a high-dimensional environment. Unlike supervised learning, RL allows learning from experience, leading to training the agent with unlabeled data obtained from interactions with the environment.
In the earliest such studies, Neuneier [1996, 1998] optimized multi-asset portfolio trading using Q-learning, a model-free and value-based RL. In other early studies, Moody et al. [1998] and Moody and Saffell [2001] used Direct RL with Recurrent RL as a base algorithm and derived a multi-asset long-short portfolio trading strategy and a single-asset trading rule, respectively. Direct RL is policy-based RL, which optimizes an objective function by adjusting policy parameters, and Recurrent RL is an RL algorithm in which the last action is received as an input. These studies introduced several measures, such as profits and the differential Sharpe ratio, as objective functions and compared the trading strategies derived using different objectives. Casqueiro and Rodrigues [2006] derived a single-asset trading strategy using Q-learning, which can maximize the differential Sharpe ratio. Dempster and Leemans [2006] developed an automated foreign exchange trading system using an adaptive learning system with a base algorithm of Recurrent RL by dynamically adjusting a hyperparameter depending on the market situation. O et al. [2006] proposed a Q-learning-based local trading system that categorized an asset price series into four patterns and applied different trading rules. Bertoluzzo and Corazza [2012] suggested a single-asset trading system using Q-learning with linear and kernel function approximations. Eilers et al. [2014] developed a trading rule for an asset with a seasonal price trend using Q-learning. Zhang et al. [2015] derived a trading rule generator using extended classifier systems combined with RL and a genetic algorithm. Almahdi and Yang [2017] suggested a Recurrent RL-based trading decision system that enabled multi-asset portfolio trading and compared the performance of the system when several different objective functions were adopted.
Pendharkar and Cusatis [2018] suggested an indices trading rule derived using two different RL methods, an on-policy method (SARSA) and an off-policy method (Q-learning), compared the performance of these two methods, and also compared the performance of discrete and continuous agent action space modeling. Almahdi and Yang [2019] used a hybrid method that combined Recurrent RL and particle swarm optimization to derive a portfolio trading strategy that considers real-world constraints.
More recently, DRL, which combines deep learning and RL algorithms, was developed, and, thus, studies have suggested using DRL-based methods to derive portfolio trading strategies. DRL methods enable an agent to understand a complex financial environment through deep learning and to learn a trading strategy by automatically applying an RL algorithm. Jiang et al. [2017] used a deep deterministic policy gradient (DDPG), an advanced method of combining policy-based and value-based RL, and introduced various DNN structures and techniques to trade a portfolio consisting of cash and several cryptocurrencies. Deng et al. [2016] derived an asset trading strategy using a Recurrent RL-based algorithm and introduced a fuzzy deep recurrent neural network that used fuzzy representation to reduce uncertainty in noisy asset prices and used a deep recurrent neural network to consider the previous action and utilize high-dimensional nonlinear features. Xiong et al. [2018] used the DDPG method and defined a practical action space for buying and selling stocks per share to derive a stock trading strategy. Jeong and Kim [2019] derived an asset trading rule that determined actions for assets and the number of shares for the actions taken. To learn this trading rule, Jeong and Kim [2019] used a deep Q-network (DQN) with a novel DNN structure consisting of two branches, one of which learned action values while the other learned the number of shares to take to maximize the objective function.
The above studies used various RL-based methods in different problem settings. All of the methods performed well in each setting, but some issues limit the applicability of these methods to the real world. First, some problem settings did not consider transaction costs [Bertoluzzo and Corazza, 2012, Eilers et al., 2014, Jeong and Kim, 2019, O et al., 2006, Pendharkar and Cusatis, 2018, Xiong et al., 2018]. A trading strategy developed without assuming transaction costs is likely to be impractical for application to the real world. The second issue is that some strategies consider trading for only one asset [Almahdi and Yang, 2017, Bertoluzzo and Corazza, 2012, Casqueiro and Rodrigues, 2006, Dempster and Leemans, 2006, Eilers et al., 2014, Deng et al., 2016, Jeong and Kim, 2019, Moody and Saffell, 2001, Xiong et al., 2018, Zhang et al., 2015]. A trading strategy of investing in only one risky asset may have high risk exposure because it has no risk diversification effect. Finally, in previous studies deriving multi-asset portfolio trading strategies using RL, the agent’s action space was defined as the portfolio weights in the next period [Almahdi and Yang, 2017, 2019, Jiang et al., 2017, Moody et al., 1998]. The action spaces of these studies do not provide portfolio traders with a direct guide that is applicable to a real-world trading scenario that includes transaction costs. This is because there are many different ways to transition from the current portfolio weight to the next portfolio weight. Thus, previous studies using portfolio weights as the action space required finding a way to minimize transaction costs at each rebalancing moment. Rebalancing in a way that reduces both transaction costs and dispersion from the next target portfolio is not an easily solved problem [Grinold and Khan, 2000].
In addition, a portfolio trading strategy derived based on the action spaces of the previous studies may be difficult to apply to real-world trading because the turnover rate is likely to be high. An action space that determines portfolio weights can result in frequent asset switching because the amount of asset changes has no upper bound. Thus, we contribute to the literature by deriving a portfolio trading strategy that has no such issues.
3 Problem definition
In this study, we consider a portfolio consisting of cash and several risky assets. All assets in the portfolio are bought using cash, and the value gained from selling assets is held in cash. That is, the agent cannot buy an asset without holding cash and cannot sell an asset without holding the asset. This type of portfolio is called a long-only portfolio, which does not allow short selling. Our problem setting also has a multiplicative profit structure in that the portfolio value accumulates based on the profits and losses in previous periods. We consider proportional transaction costs that are charged according to a fixed proportion of the amount traded in transactions involving buying or selling. In addition, we allow the agent to partially buy or sell assets (e.g., the agent can buy or sell half of a share of an asset).
We set up some assumptions in our problem setting. First, transactions can only be carried out once a day, and all transactions in a day are made at the closing price in the market at the end of that day. Second, the liquidity of the market is high enough that each transaction can be carried out immediately for all assets. Third, the trading volume of the agent is very small compared to the size of the whole market, so the agent’s trades do not affect the state transition of the market environment.
To apply RL to solve our problem, we need a model of the financial environment that reflects the financial market mechanism. Using the notations summarized in Table 1, we formulate a Markov decision process (MDP) model that maximizes the portfolio return rate in each period by selecting sequential trading actions for the individual assets in the portfolio according to time-varying market features.
Decision variables  
agent’s action at the end of period  
Set and indices  
portfolio asset index (i=0 represents cash)  
time period index  
set of index if  
set of index if  
Parameters  
size of the time window containing recent previous market features  
portfolio value changed by the action at the end of period  
portfolio value before the agent takes an action at the end of period  
portfolio value at the end of period when the agent takes no action at the end of the previous period (static portfolio value in period )  
proportion of asset changed by the action at the end of period  
proportion of asset before the agent takes an action at the end of period  
auxiliary parameter used to derive  
decay rate of transaction costs at the end of period  
commission rate for selling  
commission rate for buying  
trading size for selling or buying  
return rate of the portfolio in period  
opening price of asset in period  
closing price of asset in period  
highest price of asset in period  
lowest price of asset in period  
volume of asset in period  
Features  
rate of change of the closing price of asset in period  
ratio of the opening price in period to the closing price in period for asset  
ratio of the closing price to the highest price of asset in period  
ratio of the closing price to the lowest price of asset in period  
rate of change of the volume of asset in period 
3.1 State space
The state space of the agent is defined as the weight vector of the current portfolio before the agent selects an action and the tensor that contains the market features (technical indicators) for the assets in the portfolio. This type of state space is similar to that used in a previous study [Jiang et al., 2017]. That is, the state in each period can be represented as below (Equations (1)-(3)):
(1)  
(2)  
(3) 
where denotes the weight vector of the current portfolio and represents the technical indicator tensor for the assets in the portfolio. For this tensor, we use five technical indicators for the assets in the portfolio, as below (Equations (4)-(8)):
(4)  
(5)  
(6)  
(7)  
(8) 
Every set of five technical indicators can be expressed as a matrix (Equations (9)-(13)), where the rows represent each asset in the portfolio and the columns represent the series of recent technical indicators in the time window. Here, if we set a time window of size (considering lag autocorrelation) and a portfolio of assets, the technical indicator tensor is an dimensional tensor, as in Figure 1.
(9)  
(10)  
(11)  
(12)  
(13) 
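As a rough illustration of how the five indicators and the tensor described above could be assembled, the following sketch uses NumPy. The function name, the argument layout (one row per asset, one column per period), the output axis order (indicator, asset, time), and the exact form of each ratio are assumptions made for illustration; Equations (4)-(13) remain the authoritative definitions.

```python
import numpy as np


def technical_indicator_tensor(open_, high, low, close, volume, window):
    """Return an array of shape (5, num_assets, window).

    Every input is an array of shape (num_assets, num_periods).
    The axis order and the indicator formulas are illustrative assumptions.
    """
    open_, high, low, close, volume = (
        np.asarray(x, dtype=float) for x in (open_, high, low, close, volume)
    )
    if window < 1 or window > close.shape[1] - 1:
        raise ValueError("need 1 <= window <= num_periods - 1")

    prev_close = close[:, :-1]
    k_close = close[:, 1:] / prev_close - 1.0        # rate of change of the closing price
    k_open = open_[:, 1:] / prev_close               # opening price vs. previous close (assumed lag)
    k_high = close[:, 1:] / high[:, 1:]              # closing price relative to the period high
    k_low = close[:, 1:] / low[:, 1:]                # closing price relative to the period low
    k_volume = volume[:, 1:] / volume[:, :-1] - 1.0  # rate of change of the volume

    features = np.stack([k_close, k_open, k_high, k_low, k_volume])
    return features[:, :, -window:]                  # keep the most recent `window` periods
```

Each slice along the first axis corresponds to one indicator matrix, with rows for assets and columns for the periods in the time window.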
3.2 Action space
We define the action space to overcome the limitations of the action spaces in previous studies. Agent actions determine which assets to hold and which assets to sell or buy by prespecifying a constant trading size. For example, if a portfolio includes two assets and the trading size is 10,000 , then the agent can select the action of buying 10,000 of asset1 and selling 10,000 of asset2. The action space includes the trading directions of buying, selling, or holding each asset in the portfolio, so the action space contains different actions. These actions are expressed in a vector form that includes trading directions for each asset in a portfolio. In addition, each trading direction is encoded as , respectively. For example, an action that involves selling asset1 and buying asset2 can be encoded into the vector .
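The combinatorial action space described above can be enumerated directly. The sketch below is only an illustration: the integer codes chosen for selling, holding, and buying (-1, 0, 1) and the function name are assumptions, not necessarily the encoding used in this study.

```python
from itertools import product

# Assumed encoding of the three trading directions (illustrative only).
SELL, HOLD, BUY = -1, 0, 1


def enumerate_actions(num_assets):
    """List every combination of trading directions, one direction per risky asset."""
    return list(product((SELL, HOLD, BUY), repeat=num_assets))


actions = enumerate_actions(2)
print(len(actions))           # three directions for each of two assets
print((SELL, BUY) in actions)  # selling asset1 and buying asset2 is one of them
```

Because every asset independently takes one of three directions, the number of actions grows exponentially with the number of assets, which is worth keeping in mind when sizing the output layer of the Q-network.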
Because the trading actions for individual assets are carried out in a fixed trading size, this action space is modeled as a discrete type. Although this discrete action space may not be able to derive a trading strategy that outperforms trading strategies derived using a continuous action space [Pendharkar and Cusatis, 2018], this action space can provide a direct trading guide that a portfolio trader can follow in the real world. Furthermore, this discrete action space can derive a portfolio trading strategy with lower turnover relative to the strategies developed in previous studies. In previous studies, if a portfolio with a very large amount of capital is changed by a small amount in portfolio weight, then the trader may pay significant transaction costs. In addition, the losses from these transaction costs can be very high because portfolio weight changes have no upper bound. In contrast, our action space has an upper bound for portfolio weight changes, and, thus, the issue of massive changes in portfolio weights and the resulting large losses from transaction costs do not arise. Our agent action space has these advantages, and the only disadvantage of the fixed trading amount is similar to the restrictions of hedge funds that allow portfolio traders to trade below a certain amount each day. Thus, our discrete agent action space is not too unrealistic to apply to real-world trading.
In our action space, some actions are infeasible in some states (e.g., the agent cannot buy assets because of a cash shortage or cannot sell assets because of a shortage of held assets). To handle infeasible actions, we first set the action values (i.e., Q-values) of infeasible actions to be very low so that the agent does not select these actions. We then devise a way to select the best action among the feasible actions. The details of this method are explained in Section 4.1.
3.3 MDP modeling
With the state space and action space defined in the previous subsections, we can define the MDP model as follows. The financial market environment operates according to this model during the investment horizon. To define the transitions in the financial market environment (i.e., the system dynamics in the MDP model), we need to define the following parameters:
(14)  
(15)  
(16)  
(17) 
where denotes the portfolio weight after the agent takes an action at the end of period (Equation (14)). Equation (15) provides the constraint that the portfolio weight elements sum to one in all periods. Equations (16) and (17) represent the change in the portfolio value and the change in the proportions of the assets in the portfolio given the changes in the value of each asset in the portfolio, respectively. Here, represents the elementwise product of two vectors, and is a vector of size I+1 with all elements equal to one. is an operator that not only increases a vector’s dimension by positioning zero as the first element but also adds it to the vector ( : ).
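One way to read the market-driven part of the transition (the portfolio value and weights drifting with asset prices while no trade takes place) is sketched below. This is an interpretation of the description above rather than a transcription of Equations (16) and (17); the function and argument names are hypothetical.

```python
import numpy as np


def market_step(weights_after_action, value_after_action, close_change):
    """Roll the portfolio forward one period with no trading.

    weights_after_action: length I + 1, index 0 is cash.
    close_change: length I, rate of change of each risky asset's closing price.
    """
    w = np.asarray(weights_after_action, dtype=float)
    # Prepend zero for cash (its value does not change) and add one
    # to turn rates of change into growth factors.
    growth = 1.0 + np.concatenate(([0.0], np.asarray(close_change, dtype=float)))
    gross_return = float(w @ growth)
    next_value = value_after_action * gross_return
    next_weights = w * growth / gross_return  # element-wise product, renormalized
    return next_value, next_weights
```

Dividing by the gross return keeps the weights summing to one, in line with the constraint stated for Equation (15).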
Now, we can define the state changes after the agent takes an action as follows:
(18)  
(19)  
(20)  
(21)  
(22)  
(23) 
After the agent takes an action, transaction costs arise, and the portfolio value is then decayed (Equations (18)-(19)). Here, is the size of set . denotes the auxiliary weight of the portfolio that is needed to connect the change in the portfolio weights before and after the agent takes an action at the end of period (Equation (20)). The procedure by which the action selected by the agent is handled for trading in the financial environment is as follows. The auxiliary weight of an asset in the portfolio increases (or decreases) as a proportion of the trading size when buying (or selling) the asset. In contrast, the auxiliary weights of the assets do not change when the agent holds the assets (Equation (21)). As a result of selling an asset, the proportion of cash increases by the proportion of the trading size discounted by the selling commission rate. As a result of buying an asset, the proportion of cash decreases by the proportion of the trading size multiplied by the buying commission rate (Equation (22)). To ensure that the sum of the portfolio weight elements equals one after the agent takes an action, a process for adjusting the auxiliary weights is required (Equation (23)). In summary, the financial market environment transition is illustrated in Figure 2.
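The trading step described above can be sketched as follows. The sketch makes several assumptions that are not confirmed by the text: actions are encoded as -1 (sell), 0 (hold), and 1 (buy); a sale credits cash with the trading size times one minus the selling commission rate; a purchase debits cash with the trading size times one plus the buying commission rate; and the auxiliary weights are renormalized by their sum. Feasibility is not checked here, and all names are hypothetical.

```python
import numpy as np


def apply_action(weights, value, action, trade_size, sell_fee, buy_fee):
    """Apply one trading action and return the decayed value and new weights.

    weights: length I + 1, index 0 is cash; action: length I with entries in {-1, 0, 1}.
    """
    w = np.asarray(weights, dtype=float)
    a = np.asarray(action, dtype=int)
    n_sell = int(np.sum(a == -1))
    n_buy = int(np.sum(a == 1))
    frac = trade_size / value  # trading size as a proportion of the portfolio value

    # Proportional transaction costs reduce (decay) the portfolio value.
    cost = trade_size * (sell_fee * n_sell + buy_fee * n_buy)
    new_value = value - cost

    # Auxiliary weights: risky assets move by +/- frac, cash absorbs the trades and fees.
    aux = w.copy()
    aux[1:] += frac * a
    aux[0] += frac * ((1.0 - sell_fee) * n_sell - (1.0 + buy_fee) * n_buy)

    # Renormalize so that the weights sum to one again.
    new_weights = aux / aux.sum()
    return new_value, new_weights
```

Under these assumptions the auxiliary weights sum to the ratio of the new to the old portfolio value, so the renormalized weights multiplied by the new value recover the holdings in currency terms.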
Last, the reward in the MDP model should reflect the contribution of the agent’s action to the portfolio return. This reward can be simply defined as the portfolio return. However, if the reward is defined only as the portfolio return, then the reward criteria differ depending on the market trend. For example, when the market trend is sufficiently positive, then no matter how poor the agent’s action is, a positive reward is provided to the agent. In contrast, if the market trend is sufficiently negative, then no matter how helpful the agent’s action is, a negative reward is provided to the agent. Thus, the reward must be defined as the rate of change in the portfolio value from which the market trend is removed. Therefore, we define the reward as the change in the portfolio value at the end of the next period relative to the static portfolio value (Equation (24)). The static portfolio value is the next portfolio value when the agent takes no action at the end of the current period (Equation (25)).
(24)  
(25) 
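A small numerical sketch may help show why measuring the reward against the static portfolio value removes the market trend. The relative-difference form used here is an assumption consistent with the description above, and the function name is hypothetical.

```python
def trend_adjusted_reward(next_value, static_next_value):
    """Change in the next portfolio value relative to the no-action (static) value."""
    return (next_value - static_next_value) / static_next_value


# Rising market: the portfolio grows from 100 either way, but the action
# did worse than doing nothing, so the reward is negative.
print(trend_adjusted_reward(104.0, 105.0) < 0)

# Falling market: the portfolio shrinks either way, but the action limited
# the loss compared with doing nothing, so the reward is positive.
print(trend_adjusted_reward(97.0, 95.0) > 0)
```

A raw portfolio return would have rewarded the first action and penalized the second, which is the behavior the definition is meant to avoid.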
4 Methodology
In this section, we introduce our proposed approach for deriving the portfolio trading strategy using DQL. In our action space, some issues may prohibit a DQL agent from deriving an intelligent trading strategy. We first explain how to resolve these issues by introducing some techniques and applying existing methodologies. Then, we describe our DQL algorithm with these techniques.
4.1 Mapping function
In our action space, we need to define a rule for selecting the appropriate action from the remaining actions when infeasible actions are excluded. The simplest rule is to select the action that has the largest Q-value among the feasible actions. However, this simple rule may lead the agent to derive an unreasonable trading strategy. For example, suppose that the agent’s strategy selects the action of selling both asset1 and asset2 but this action is infeasible owing to a lack of asset2. If the action of buying both asset1 and asset2 has the largest Q-value in the remaining action space, then that action is selected. Because learning the similarity between actions is difficult for an RL agent, the agent will take this action without any doubt even though it is the opposite of the original action determined by the agent’s strategy. This issue leads to the selection of unreasonable actions, which degrades the trading performance. Thus, a mapping rule is required to map infeasible actions to similar and valuable actions in the feasible action set. We resolve this problem by introducing such a mapping function.
The mapping function is a type of RL constraint that allows the agent to derive a reasonable trading strategy by mapping infeasible actions to similar and valuable actions in feasible action set. Pham et al. [2018] handled constrained action space by adding an optimization layer (OptLayer) for solving mathematical programming at the last layer of the agent’s policy network, determining an action that minimizes differences from the output at the previous layer while satisfying constraints. Bhatia et al. [2018] proposed three different methods to handle constrained action space of resource allocation problem, which contains lower/upper bounds constraint and global sum constraint. However, both studies cannot be applied to our situation because they can only deal with continuous action. Also, the addition of a layer to handle constraints of action space in the neural network results in additional computation costs at each learning phase. Thus, we implement the mapping function by applying the constraint to a separate module from the learning phase. As a result, this technique works as if the mapping rule constraint is applied in the learning phase without requiring an additional computation costs.
The mapping function has several different rules for each infeasible action case, and we call these rules mapping rules. Our mapping function has two mapping rules, each of which is required for mapping infeasible actions, that are divided into two cases. In the first case, the amount of cash is not sufficient to take an action that involves buying assets. In this case, a similar action set is derived by holding rather than buying a subset of the asset group to be bought in the original action. Thereafter, infeasible actions are mapped to the most valuable feasible actions in the similar action set. For example, if the action of buying both asset1 and asset2 is infeasible owing to a cash shortage, this action is mapped to the most valuable feasible action within the set of similar actions, which includes the action of buying asset1 and holding asset2, the action of holding asset1 and buying asset2, and the action of holding both asset1 and asset2. In the second case, an action that involves selling assets is infeasible because of a shortage of the assets. In this case, the original action is simply mapped to an action in which the assets that are not enough to sell are held. These examples are illustrated in Figure 3.
We provide the details of the two mapping rules and the mapping function in the following pseudocode in Algorithms (1) and (2). In Algorithm (1), the last part (i.e., Lines (21)(22)) of the mapping rule for the second case(Rule2) is necessary. Because, in the second case, converting the original action of selling assets that cannot be sold into an action that holds the selling assets which cannot be sold, then the cash amount gained from selling assets is removed, causing the first infeasible action case to arise. Furthermore, this part of the code can handle the special case in which an asset shortage and a cash shortage occur simultaneously. Next, the RL flow chart with the mapping function technique is shown in Figure 4.
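The two mapping rules can be sketched in Python as follows. This is a minimal illustration and not a transcription of Algorithms (1) and (2): the names `is_feasible` and `map_action`, the encoding of actions as tuples of -1 (sell), 0 (hold), and +1 (buy), the measurement of holdings in monetary units, and the assumption that sale proceeds can fund purchases in the same step are all hypothetical. Non-negative cash is assumed.

```python
from itertools import product


def is_feasible(action, cash, holdings, size, fee=0.0025):
    """Check asset and cash constraints for one action tuple."""
    if any(a == -1 and h < size for a, h in zip(action, holdings)):
        return False  # asset shortage: cannot sell `size` of this asset
    proceeds = sum(size * (1 - fee) for a in action if a == -1)
    cost = sum(size * (1 + fee) for a in action if a == 1)
    return cash + proceeds - cost >= 0  # cash shortage check


def map_action(action, q_values, cash, holdings, size, fee=0.0025):
    """Map an infeasible action to a similar feasible action."""
    if is_feasible(action, cash, holdings, size, fee):
        return action
    # Rule 2: hold every asset that cannot be sold.
    action = tuple(0 if a == -1 and h < size else a
                   for a, h in zip(action, holdings))
    if is_feasible(action, cash, holdings, size, fee):
        return action
    # Rule 1 (also reached after Rule 2 when lost sale proceeds cause a
    # cash shortage): hold a subset of the assets that were to be bought.
    buy_idx = [i for i, a in enumerate(action) if a == 1]
    candidates = []
    for mask in product((0, 1), repeat=len(buy_idx)):
        if all(mask):
            continue  # this is the infeasible action itself
        cand = list(action)
        for i, keep in zip(buy_idx, mask):
            cand[i] = keep
        cand = tuple(cand)
        if is_feasible(cand, cash, holdings, size, fee):
            candidates.append(cand)
    # Choose the most valuable feasible action in the similar action set.
    return max(candidates, key=lambda c: q_values[c])


q = {(1, 1): 0.9, (1, 0): 0.5, (0, 1): 0.7, (0, 0): 0.1}
cash_short = map_action((1, 1), q, cash=12000, holdings=(0, 0), size=10000)
asset_short = map_action((-1, -1), q, cash=0, holdings=(20000, 0), size=10000)
both_short = map_action((-1, 1), q, cash=5000, holdings=(0, 0), size=10000)
```

In the three usage lines, the first action is mapped within its similar action set by Q-value, the second keeps the feasible sale and holds the asset that cannot be sold, and the third shows the simultaneous asset and cash shortage handled by falling through from Rule 2 to Rule 1.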
4.2 DQN algorithm
We optimize the multi-asset portfolio trading strategy by applying the DQN algorithm. DQN is the primary algorithm for DQL. Mnih et al. [2013] developed the DQN algorithm, and Mnih et al. [2015] later introduced additional techniques that completed it. The base algorithm for DQN, Q-learning, is a value-based RL method that approximates an action value (i.e., a Q-value) in each state. Further, Q-learning is a model-free method: even if the agent has no knowledge of the environment, it can develop a policy from repeated experience gained by exploring. In addition, Q-learning is an off-policy algorithm; that is, the action policy for selecting the agent’s action is not the same as the update policy for selecting an action in the target value. An algorithm based on Q-learning that approximates the Q-function using a DNN is the basis of DQN [Mnih et al., 2013]. To prevent the DNN from learning only from the experience of a specific situation, experience replay was introduced to sample a general experience batch from memory. Additionally, the DQN algorithm uses two separate networks: a Q-network that approximates the Q-function and a target network that approximates the target value, so that the Q-network is updated toward a fixed target [Mnih et al., 2015]. Based on this algorithm, we introduce several techniques to support the derivation of an intelligent trading strategy.
The existing DQN algorithm updates the Q-network with experience by allowing the agent to take only one action in each stage. Because the agent has no information about the environment, it takes only one action and then proceeds to the next state. Thus, it is impossible to take multiple actions in the existing DQN. However, for this problem, we use historical technical indicator data of the assets in the portfolio as training data. Thus, our agent can take multiple actions in one state in each stage and observe the experiences that result from all of those actions. To utilize this advantage, we introduce a technique that simulates all feasible actions in one state at each stage and updates the trading strategy by using the resulting experiences from conducting these simulations.
Motivated by Tan et al. [2009], we utilize a simulation technique that virtually takes all feasible actions so that the agent learns from many experiences efficiently and derives a fully searched multi-asset trading strategy. Thus, this technique can relax the data shortage issue that arises when deriving a multi-asset trading strategy. Although simulating all feasible actions can result in a huge computational burden, multi-core parallel computing can keep this burden from increasing greatly. Moreover, even if the agent takes multiple actions in the current state, the next state depends only on the action selected by the action policy (epsilon-greedy) with the mapping rule. Applying this technique requires changing the data structure of the elements in replay memory so that each element stores the list of experiences in a state. The concepts related to this technique are illustrated in Figure 5. In this figure, means that the action of the agent is taken at the end of period . is the reward obtained by taking action , and is the next state that results from taking action .
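The change to the replay memory can be illustrated with a toy Python sketch in which each memory element is the list of experiences for all feasible actions in one state. The one-asset environment, the price path, and every name below are hypothetical, and the random choice is only a stand-in for epsilon-greedy selection with the mapping rule.

```python
import random
from collections import deque

PRICES = [100.0, 110.0, 99.0]   # hypothetical historical price path
ACTIONS = (-1, 0, 1)            # sell, hold, or buy one share


def is_feasible(state, action):
    t, cash, shares = state
    if action == 1:
        return cash >= PRICES[t]
    if action == -1:
        return shares >= 1
    return True


def simulate(state, action):
    """Virtually take `action`; return (reward, next_state) from the data."""
    t, cash, shares = state
    new_cash, new_shares = cash - action * PRICES[t], shares + action
    value = new_cash + new_shares * PRICES[t + 1]
    static_value = cash + shares * PRICES[t + 1]
    return (value - static_value) / static_value, (t + 1, new_cash, new_shares)


def experience_list(state):
    """One replay-memory element: experiences for every feasible action."""
    return [(state, a, *simulate(state, a))
            for a in ACTIONS if is_feasible(state, a)]


random.seed(0)
memory = deque(maxlen=2000)     # replay memory of experience lists
state = (0, 100.0, 1)           # (period, cash, shares)
for _ in range(len(PRICES) - 1):
    experiences = experience_list(state)
    memory.append(experiences)
    # Only the selected action determines the actual next state.
    _, _, _, state = random.choice(experiences)
```

Because the price path is historical data, every feasible action can be evaluated without affecting the trajectory, which follows only the selected action.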
In DQN, a multiple-output neural network is commonly adopted as the Q-network structure. In this network structure, the input of the neural network is the state, and the output is the Q-value of each action. Using the above technique, we can approximate the Q-values of all feasible actions by updating this multiple-output Q-network in parallel with the experience list. To maintain the Q-values of infeasible actions, the current Q-value of an infeasible state-action pair is assigned as the target value of the corresponding Q-network output so that its temporal difference error is zero. Furthermore, as in DQN, several experience lists are sampled from replay memory, and the Q-network is updated using this batch of experience lists. A detailed description of the process for updating the Q-network is shown in Figure 6.
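The construction of the target vector for one state can be sketched as follows. The sketch assumes the standard DQN target (reward plus discounted maximum target-network Q-value in the next state) and takes a plain maximum over next-state actions for brevity; the function name and the dictionary-based inputs are hypothetical.

```python
def build_targets(q_current, rewards, next_q, gamma=0.9):
    """Target vector for a multiple-output Q-network in one state.

    q_current: current Q-value of every action index in this state.
    rewards:   action index -> simulated reward (feasible actions only).
    next_q:    action index -> target-network Q-values in the next state.
    """
    # Infeasible actions keep their current Q-value, so their TD error is 0.
    targets = list(q_current)
    for action, r in rewards.items():
        targets[action] = r + gamma * max(next_q[action])
    return targets


q_current = [0.2, 0.5, -0.1]
rewards = {0: 0.01, 2: -0.02}            # action 1 is infeasible here
next_q = {0: [0.1, 0.3, 0.0], 2: [0.0, -0.2, -0.1]}
targets = build_targets(q_current, rewards, next_q)
td_errors = [t - q for t, q in zip(targets, q_current)]
```

All feasible outputs receive a learning signal from a single state, while the infeasible output contributes nothing to the loss.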
In addition, to apply RL, learning episodes must be defined for the agent to explore and experience the environment. Rather than defining all of the training data, which cover several years, as one episode, we divide the training data into several episodes. If we define a much longer training episode than the investment horizon of the test data that will be used to test the trading strategy, this difference in the lengths of the training and test data can produce negative results. For example, in our experiment, the training and testing processes begin with the same portfolio weights. In this case, the farther the agent is from the beginning of the long training episode, the farther the agent is from the initial portfolio weights. Thus, it is difficult for the agent to utilize the critical experience obtained from the latter half of the long episode in the early testing process. Therefore, we divide training data into sets of the same length as the investment horizon of the test data (i.e., one year, as the investment horizon of the test data is a year in our experiment). Thus, the criteria for dividing the training data are defined in yearly units so that the episodes do not overlap (e.g., episode1 contains data from 2016, and episode2 contains data from 2015). In each training epoch, the agent explores and learns in an episode sampled from the training data.
It is well known that more recent historical data have more explanatory power for predicting future data than less recent historical data. Thus, it is reasonable to assign higher sampling probabilities to episodes that are closer to the test data period [Jiang et al., 2017], and we use a truncated geometric distribution to do so. This truncated geometric sampling distribution is expressed in Equation (26). Here, is the year of the episode, is the year of the test data, and is the number of total training episodes. is a parameter for this sampling distribution that ranges from zero to one. If this parameter is closer to one, episodes closer to the test period are sampled more frequently.
(26) 
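Episode sampling of this kind can be sketched in Python as follows. The sketch uses one common parameterization of a truncated geometric distribution, with weights proportional to beta(1 - beta)^k for an episode k years before the most recent training year; Equation (26) may be parameterized differently, and the names `episode_probabilities` and `beta` are hypothetical. The year range and the parameter value 0.3 follow the US portfolio case and Table 3.

```python
import random


def episode_probabilities(n_episodes, beta):
    """Truncated geometric probabilities; index 0 is the most recent episode.

    Requires 0 < beta <= 1.
    """
    weights = [beta * (1 - beta) ** k for k in range(n_episodes)]
    total = sum(weights)  # equals 1 - (1 - beta) ** n_episodes
    return [w / total for w in weights]


years = list(range(2016, 2009, -1))      # 2016 down to 2010
probs = episode_probabilities(len(years), beta=0.3)

random.seed(0)
episode_year = random.choices(years, weights=probs, k=1)[0]
```

With a larger `beta`, the probability mass concentrates on the episodes nearest the test year, which matches the behavior described above.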
To implement DQN, we need to model the neural network structure for approximating the Q-function of an agent’s state and action. We construct a hybrid LSTM-DNN neural network that enables us to approximate the Q-value of an agent’s action in our predefined state and action space. First, we use LSTM, a deep learning model suitable for learning long-term time series patterns, to encode the technical indicator sequences for assets in the portfolio. The technical indicator sequences of all assets in the portfolio share the same LSTM layer, which encodes each of them into a low-dimensional encoding vector. It is known that a single deep learning model is more effective for learning the price patterns of different assets than multiple deep learning models that each learn an individual asset [Sirignano and Cout, 2018]. Because this LSTM layer encodes a multivariate time series of technical indicators for each asset into a low-dimensional latent vector, we refer to this layer as the pattern encoder. By using a sigmoid as the activation function for the output layer of the pattern encoder, we put its output on the same scale (0 to 1) as the other input, the portfolio weights. Then, the encoded outputs for each asset are concatenated to create the intermediate output, and this intermediate output is combined with the current portfolio weights to form the input to the DNN. Through this DNN layer, we can obtain the Q-value of each action of the agent. Because this DNN layer extracts meaningful features through nonlinear mapping using a multilayer neural network and conducts a regression for the Q-value, we refer to this layer as the DNN regressor. The overall Q-network structure is shown in Figure 7. In summary, the overall DQN algorithm of our approach for deriving the portfolio trading strategy is given in Algorithm 3.
4.3 Online learning
Online learning is a learning method that can be applied after deriving a trading strategy based on historical data using our algorithm. Even during the test period in which the trading strategy is applied, the trading strategy can be updated by learning from experiences already observed in the test data. In testing, the situation in the next period is highly correlated with the currently observed state. Thus, learning from the currently observed test experience and updating the trading strategy enables the agent to respond to the uncertain next period. However, unlike learning during the training episodes, the trading strategy is updated with only the single currently observed experience (it is impossible to take multiple actions in the test data) rather than through batch learning. This one-sample learning is similar to adaptive learning [Pendharkar and Cusatis, 2018], which updates the trading strategy with a bias toward the currently observed experience in the current test period. Learning that is biased toward the current experience can make the trading strategy more responsive to the situation in the next test period. In other words, online learning allows the agent to update the trading strategy using additional and more relevant experiences and to update it flexibly during real trading.
5 Experimental results
In this section, we demonstrate that the DQN strategy (i.e., the trading strategy derived using our proposed DQN algorithm for portfolio trading) can outperform benchmark strategies in real-world trading. We conduct a trading simulation for two different portfolio cases using both our DQN strategy and traditional trading strategies as benchmarks, and we verify that the DQN strategy is superior to the benchmark strategies based on several common performance measures.
5.1 Performance measures
We use three different output performance measures to evaluate trading strategies. The first measure is the cumulative return, which is the increase in the portfolio value at the end of the investment horizon relative to the initial portfolio value, as defined in Equation (27):
(27) 
where is the final date of the investment horizon and is the initial portfolio value.
The second measure is the Sharpe ratio, as defined in Equation (28):
(28) 
where is the standard deviation of the daily return rate, is the daily risk-free rate (assumed to be 0.01%), and is the number of transaction days in the investment horizon. This ratio is a common measure of the risk-adjusted return, and it is used to evaluate not only how high the risk premium is but also how small the variation in the return rate is.
For the last measure, we use the customized average turnover rate defined in Equation (29):
(29) 
The average turnover measures the average rate of change of the portfolio weight vector during the investment horizon. We do not have to consider changes in the cash proportion, so we customize this measure by excluding the change in the weight on cash before and after the agent takes an action. This rate can evaluate the change in the proportions of asset investments. Considering transaction costs, this measure should be low to better apply the trading strategy in the real world.
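The three measures can be sketched in Python as follows. Because Equations (27) to (29) are not reproduced here, the sketch relies on assumptions: the Sharpe ratio is scaled by the square root of the number of daily returns (one common convention), the sample standard deviation is used, and the turnover is the unnormalized mean absolute change in the asset weights with cash excluded. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cumulative_return(values):
    """Increase in portfolio value relative to the initial value."""
    return (values[-1] - values[0]) / values[0]


def sharpe_ratio(values, daily_rf=0.0001):
    """Risk-adjusted return from daily portfolio values (assumed scaling)."""
    returns = [b / a - 1 for a, b in zip(values, values[1:])]
    return sqrt(len(returns)) * (mean(returns) - daily_rf) / stdev(returns)


def average_turnover(weights_before, weights_after):
    """Mean absolute change in asset weights per period, cash excluded.

    Each element holds the asset weights (without cash) just before and
    just after the agent acts in one period.
    """
    changes = [sum(abs(a - b) for a, b in zip(after, before))
               for before, after in zip(weights_before, weights_after)]
    return mean(changes)


values = [100.0, 102.0, 101.0, 104.0]    # hypothetical daily portfolio values
cr = cumulative_return(values)
sr = sharpe_ratio(values)
to = average_turnover([[0.25, 0.25]], [[0.30, 0.25]])
```

The default `daily_rf=0.0001` corresponds to the 0.01% daily risk-free rate stated above.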
5.2 Data summary
We experiment with two different three-asset portfolios. The first consists of three exchange-traded funds (ETFs) in the US market that track the S&P 500 index, the Russell 1000 index, and the Russell Microcap index. This type of portfolio was tested in a previous study [Almahdi and Yang, 2017]. The second portfolio is a Korean portfolio consisting of the KOSPI 100 index, the KOSPI midcap index, and the KOSPI microcap index. More information on these test portfolios is provided in Table 2.
Assets  US Portfolio (USETF)  Korean Portfolio (KORIDX)  
Asset 1  SPDR S&P 500^{1}  KOSPI 100 index 
Asset 2  iShares Russell 1000 Value^{2}  Midcap KOSPI index 
Asset 3  iShares Microcap^{3}  Microcap KOSPI index 
^{1} ETF that tracks the S&P 500 index
^{2} ETF that tracks the Russell 1000 (mid- and large-cap US stocks) index
^{3} ETF that tracks the Russell Microcap index
We obtain data on the three US ETFs from Yahoo Finance and data on the Korean indices from Investing.com. Both cases are tested in 2017. The trading strategy for the US portfolio is derived by training on data from 2010 to 2016, and the trading strategy for the Korean portfolio is derived by training on data from 2012 to 2016.
5.3 Experiment setting
Through several rounds of tuning, we derive appropriate hyperparameters. In particular, the time window size is the most important hyperparameter, and we adopt the value of 20 among the candidates (5,20,60,120). This time window size and the other tuned hyperparameters are summarized in Table 3.
hyperparameter  value  hyperparameter  value 
time window size  20  replay memory size  2000 
learning rate  1e-7  number of epochs  500 
distribution parameter  0.3  discount factor  0.9 
DNN input dimension  64  batch size  32 
DNN layer  2  LSTM layer  3 
DNN 1st layer dimension  64  LSTM unit dimension  128 
DNN 2nd layer dimension  32  LSTM output dimension  20 
In the experiment, we also need to set trading parameters, such as the initial portfolio value and the trading size. We set the initial portfolio value as one million in both portfolio cases (e.g., 1M for the US portfolio and 1M for the Korean portfolio). Similarly, we set the trading size as ten thousand in both portfolio cases (e.g., 10K trading quantity for the US portfolio case and 10K for the Korean portfolio case). We set the commission rate for buying and selling in both the US and Korean markets as 0.25%. In both cases, the initial portfolio is set up as an equally weighted portfolio, in which every asset and cash has the same proportion.
5.4 Benchmark strategy
To evaluate our DQN strategy, we compare it to some traditional portfolio trading strategies. The first strategy is a buy-and-hold strategy (B&H) that does not take any action but rather holds the initial portfolio until the end of the investment horizon. The second strategy is a random strategy (RN) that randomly takes an action within the feasible action space in each state. The third strategy is a momentum strategy (MO). This strategy buys assets whose values increased in the previous period and sells assets whose values decreased in the previous period. However, if it cannot buy all assets with increased values, it gives buying priority to assets whose values increased more. If it is unable to sell assets whose values decreased, it simply holds the assets. The last strategy is a reversion strategy (RV), which is the opposite of the momentum strategy. This strategy sells assets whose values increased in the previous period and buys assets whose values decreased in the previous period. However, if it cannot buy all of the assets whose values decreased, it gives buying priority to the assets whose values decreased more. If it is unable to sell the assets whose values increased, then it simply holds the assets.
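The momentum and reversion rules can be sketched in Python as follows. The function name, the action encoding (+1 buy, 0 hold, -1 sell), the measurement of holdings in monetary units, and the assumption that sale proceeds can fund purchases in the same step are all hypothetical; the 0.25% fee follows the commission rate stated in the experiment setting.

```python
def momentum_action(prev_returns, cash, holdings, size, fee=0.0025,
                    reverse=False):
    """Per-asset action for the momentum rule (reversion if `reverse`)."""
    # Reversion is momentum applied to the negated previous returns.
    signal = [-r for r in prev_returns] if reverse else list(prev_returns)
    action = [0] * len(signal)
    # Sell assets with a negative signal; hold if there is too little to sell.
    for i, s in enumerate(signal):
        if s < 0 and holdings[i] >= size:
            action[i] = -1
            cash += size * (1 - fee)
    # Buy assets with a positive signal, largest move first, while cash lasts.
    buy_order = sorted((i for i, s in enumerate(signal) if s > 0),
                       key=lambda i: signal[i], reverse=True)
    for i in buy_order:
        cost = size * (1 + fee)
        if cash >= cost:
            action[i] = 1
            cash -= cost
    return action


mo = momentum_action([0.02, 0.01, -0.01], cash=12000,
                     holdings=[0, 0, 0], size=10000)
rv = momentum_action([0.02, -0.03, -0.01], cash=12000,
                     holdings=[0, 0, 0], size=10000, reverse=True)
```

In both usage lines, the cash covers only one purchase, so the asset with the largest move in the relevant direction is bought and the assets that cannot be sold are held.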
5.5 Result
We derive a trading strategy for both portfolio cases using DQN. For both cases, we identify the increase in the cumulative return over the investment horizon of the test period as episode learning continues. Figure 8 shows the trend in the cumulative return performance over the learning episodes in both cases.
Figure 9 shows the portfolio value trend when applying DQN and the benchmark strategies in the US portfolio case. In this figure, we observe that the DQN strategy outperforms the benchmark strategies for most of the test period. In relative terms, the cumulative return of the DQN strategy is 15.69% higher than that of the B&H strategy, 23.46% higher than that of the RN strategy, 21.81% higher than that of the MO strategy, and 114.47% higher than that of the RV strategy.
Figure 10 shows the portfolio value trend when applying DQN and the benchmark strategies in the Korean portfolio case. Likewise, we observe that the DQN strategy outperforms the benchmark strategies for most of the test period. In relative terms, the cumulative return of the DQN strategy is 25.52% higher than that of the B&H strategy, 114.35% higher than that of the RN strategy, 13.22% higher than that of the MO strategy, and 247.91% higher than that of the RV strategy.
Table 4 summarizes the performance measure results for the DQN and benchmark strategies in both portfolio cases. This table shows that the DQN strategy has the best cumulative return and Sharpe ratio for the US portfolio, and it has the lowest turnover rate apart from the B&H strategy, which has zero turnover. In the Korean portfolio case, the DQN strategy also has the best cumulative return and Sharpe ratio, and it again has the lowest turnover rate apart from the B&H strategy. Given that the B&H strategy does not incur any transaction costs during the investment horizon, it is notable that the DQN strategy outperforms the B&H strategy in terms of the cumulative return and Sharpe ratio.
strategy  cumulative return (US)  Sharpe ratio (US)  average turnover (US)  cumulative return (Korea)  Sharpe ratio (Korea)  average turnover (Korea)  
B&H  10.921%  1.302  0.000%  7.913%  0.872  0.000%  
RN  10.241%  1.139  0.969%  4.634%  0.374  1.027%  
MO  10.372%  1.109  1.368%  8.773%  0.823  1.233%  
RV  5.891%  0.639  1.404%  2.855%  0.144  1.370%  
DQN  12.634%  1.376  0.954%  9.933%  0.927  0.989% 
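The relative differences quoted for the two portfolio cases can be compared with the cumulative returns in Table 4 using a short Python check. The sketch assumes that the benchmark rows of Table 4 are ordered B&H, RN, MO, and RV and that each quoted percentage is the ratio of the DQN cumulative return to the benchmark cumulative return minus one; the names below are hypothetical.

```python
# Cumulative returns (%) from Table 4, with the assumed row labels.
cumulative_return = {
    "US": {"B&H": 10.921, "RN": 10.241, "MO": 10.372, "RV": 5.891,
           "DQN": 12.634},
    "Korea": {"B&H": 7.913, "RN": 4.634, "MO": 8.773, "RV": 2.855,
              "DQN": 9.933},
}


def relative_gain(dqn, benchmark):
    """Relative difference in percent between two cumulative returns."""
    return (dqn / benchmark - 1) * 100


gains = {
    market: {name: round(relative_gain(rets["DQN"], value), 2)
             for name, value in rets.items() if name != "DQN"}
    for market, rets in cumulative_return.items()
}
```

Printing `gains` gives values that can be set side by side with the percentages reported in the text.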
6 Conclusion
The main contribution of our study is applying the DQN algorithm to derive a multi-asset portfolio trading strategy. However, applying DQN to portfolio trading has some challenges. To overcome these challenges, we introduce an action space and several techniques. First, we define a discrete action space that can be applied to individual assets in a portfolio, and the resulting trading strategy has a low turnover rate and can provide a direct trading guide to a portfolio trader. Second, we introduce a mapping function for handling infeasible actions to derive a reasonable trading strategy. Trading strategies derived from RL agents can be unreasonable to apply in the real world. Thus, we apply a domain knowledge rule to develop a trading strategy with an infeasible action mapping constraint. This function works well in our experiments and allows us to derive a reasonable trading strategy. Third, we relax the data shortage issue for deriving multi-asset trading strategies in RL by introducing a technique that simulates all feasible actions and then updates the trading strategy based on the experiences of these simulated actions. The experimental results show that the DQN strategy outperforms most benchmark strategies in terms of overall performance in the two portfolio cases. We also find that the DQN strategy performs relatively well under general transaction cost levels. Thus, the DQN trading strategy can be applied to real-world trading.
However, as shown in Figure 8, in a certain training range, the cumulative return performance trend tends to decrease as learning goes on. In the US portfolio case, the cumulative return performance trend is decreasing in the early training phase, but it recovers to an increasing trend. In addition, in the Korean portfolio case, the cumulative return trend is decreasing in the latter half of the training phase. However, this decreasing trend is not significant when the decrease is considered in the context of the smoothing trend of the cumulative return. These flaws are tolerable and are not critical compared to the advantage of applying our proposed DQN method to portfolio trading. Thus, this DQN strategy derived using our approach is worth introducing.
In future work, we will compare the performance of the DQN strategy to that of a portfolio trading strategy derived using the DRL methods of previous studies, and we will verify how the lower turnover rate of our strategy compares to those of previous strategies. This comparative verification is difficult to do in our study because our setting is not the same as those of previous studies. Thus, we cannot compare the performance measures of trading strategies numerically. In a following study, we will, therefore, implement the methods of previous studies to derive a trading strategy using DRL in our problem setting, and we will compare our DQN strategy numerically to the performance results of previous trading strategies. In addition, in our current study, we use a long-only portfolio, but we will extend the analysis to a long-short portfolio setting. Furthermore, in our study, the reward of the MDP model is optimized only for returns and not for risk. We can extend this analysis to a risk management portfolio by adding a penalty term for risks, such as the variance of the return rate or the conditional value at risk. In addition, advanced DRL methodologies have been developed, and we plan to apply these methods to derive a portfolio trading strategy in future research.
References
 Almahdi and Yang [2017] S. Almahdi and S. Y. Yang. An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown. Expert Systems with Applications, 87:267–279, 2017.
 Almahdi and Yang [2019] S. Almahdi and S. Y. Yang. A constrained portfolio trading system using particle swarm algorithm and recurrent reinforcement learning. Expert Systems with Applications, 130:145–156, 2019.
 Bertoluzzo and Corazza [2012] F. Bertoluzzo and M. Corazza. Testing different Reinforcement Learning configurations for financial trading: Introduction and applications. Procedia Economics and Finance, 3:68–77, 2012.
 Bhatia et al. [2018] A. Bhatia, P. Varakantham, and A. Kumar. Resource constrained deep reinforcement learning. Proceedings of the International Conference on Automated Planning and Scheduling, 29:610–620, 2018.
 Brock et al. [1992] W. Brock, J. Lakonishok, and B. Lebaron. Simple Technical Trading Rules and the Stochastic Properties of Stock Returns. The Journal of Finance, 47:1731–1764, 1992.
 Casqueiro and Rodrigues [2006] P. X. Casqueiro and A. J. L. Rodrigues. Neurodynamic trading methods. European Journal of Operational Research, 175:1400–1412, 2006.
 Chen and Yu [2017] C. H. Chen and H. Y. Yu. A series-based group stock portfolio optimization approach using the grouping genetic algorithm with symbolic aggregate approximations. Knowledge-Based Systems, 125:146–163, 2017.
 Chen and Chen [2016] T. Chen and F. Chen. An intelligent pattern recognition model for supporting investment decisions in stock market. Information Sciences, pages 261–274, 2016.
 Chourmouziadis and Chatzoglou [2016] K. Chourmouziadis and P. D. Chatzoglou. An intelligent short term stock trading fuzzy system for assisting investors in portfolio management. Expert Systems with Applications, 43:298–311, 2016.
 Consigli and Dempster [1998] G. Consigli and M. A. H. Dempster. Dynamic stochastic programming for asset–liability management. Annals of Operations Research, 81:131–161, 1998.
 Dempster and Leemans [2006] M. A. H. Dempster and V. Leemans. An automated FX trading system using adaptive reinforcement learning. Expert Systems with Applications, 30:543–552, 2006.
 Deng et al. [2016] Y. Deng, B. Feng, Y. Kong, Z. Ren, and Q. Dai. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading. IEEE Transactions on Neural Networks and Learning Systems, 28:653–664, 2016.
 Derigs and Nickel [2003] U. Derigs and N. H. Nickel. Metaheuristic based decision support for portfolio optimization with a case study on tracking error minimization in passive portfolio management. OR Spectrum, 25:345–378, 2003.
 Eilers et al. [2014] D. Eilers, C. L. Dunis, H. J. Mettenheim, and M. H. Breitner. Intelligent trading of seasonal effects: A decision support algorithm based on reinforcement learning. Decision Support Systems, 64:100–108, 2014.
 Golub et al. [1995] B. Golub, M. Holmer, R. Mckendall, L. Polhlman, and S. A. Zenios. A stochastic programming model for money management. European Journal of Operational Research, 85:282–296, 1995.
 Grinold and Khan [2000] R. C. Grinold and R. N. Khan. Active portfolio management: A quantitative approach for producing superior returns and controlling risk. McGraw-Hill, 2, 2000.
 Jeong and Kim [2019] G. Jeong and H. Y. Kim. Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning. Expert Systems with Applications, 117:125–138, 2019.
 Jiang et al. [2017] Z. Jiang, D. Xu, and J. Liang. A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv preprint arXiv:1706.10059, 2017.
 Kouwenberg [2001] R. Kouwenberg. Scenario generation and stochastic programming models for asset liability management. European Journal of Operational Research, 134:279–292, 2001.
 Leigh et al. [2002] W. Leigh, N. Modani, R. Purvis, Q. Wu, and T. Robert. Stock market trading rule discovery using technical charting heuristics. Expert Systems with Applications, 23:155–159, 2002.
 Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
 Moody and Saffell [2001] J. Moody and M. Saffell. Learning to Trade via Direct Reinforcement. IEEE Transactions on Neural Networks, 12:875–889, 2001.
 Moody et al. [1998] J. Moody, L. Wu, Y. Liao, and M. Saffell. Performance Functions and Reinforcement Learning for Trading Systems and Portfolios. Journal of Forecasting, 17:441–470, 1998.
 Neuneier [1996] R. Neuneier. Optimal Asset Allocation using Adaptive Dynamic Programming. Advances in Neural Information Processing Systems, pages 952–958, 1996.
 Neuneier [1998] R. Neuneier. Enhancing Q-Learning for Optimal Asset Allocation. Advances in Neural Information Processing Systems, pages 936–942, 1998.
 O et al. [2006] J. O, J. Lee, J. W. Lee, and B. T. Zhang. Adaptive stock trading with dynamic asset allocation using reinforcement learning. Information Sciences, 176:2121–2147, 2006.
 Papailias and Thomakos [2015] F. Papailias and D. D. Thomakos. An improved moving average technical trading rule. Physica A, 428:458–469, 2015.
 Pendharkar and Cusatis [2018] P. C. Pendharkar and P. Cusatis. Trading financial indices with reinforcement learning agents. Expert Systems with Applications, 103:1–13, 2018.
 Pham et al. [2018] T. H. Pham, G. D. Magistris, and R. Tachibana. OptLayer - practical constrained optimization for deep reinforcement learning in the real world. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6236–6243, 2018.
 Potvin et al. [2004] J. Y. Potvin, P. Soriano, and M. Vallee. Generating trading rules on the stock markets with genetic programming. Computers & Operations Research, 31:1033–1047, 2004.
 Sirignano and Cout [2018] J. Sirignano and R. Cout. Universal features of price formation in financial markets: Perspectives from Deep Learning. arXiv preprint arXiv:1803.06917, 2018.
 Tan et al. [2009] Y. Tan, W. Liu, and Q. Qiu. Adaptive Power Management Using Reinforcement Learning. ICCAD, pages 461–467, 2009.
 Wang et al. [2016] Y. Wang, D. Wang, S. Zhang, Y. Feng, S. Li, and Q. Zhou. Deep Q-trading. http://cslt.riit.tsinghua.edu.cn, 2016.
 Xiong et al. [2018] Z. Xiong, X. Y. Liu, S. Zhong, H. Yang, and A. Waild. Practical deep reinforcement learning approach for stock trading. arXiv preprint arXiv:1811.07522, 2018.
 Zhang et al. [2015] X. Zhang, Y. Hu, K. Xie, W. Zhang, L. Su, and M. Liu. An evolutionary trend reversion model for stock trading rule discovery. Knowledge-Based Systems, 79:27–35, 2015.
 Zhu and Zhou [2009] Y. Zhu and G. Zhou. Technical analysis: An asset allocation perspective on the use of moving averages. Journal of Financial Economics, 92:519–544, 2009.