Deep Reinforcement Learning in High Frequency Trading
Abstract.
The ability to give precise and fast predictions of stock price movements is the key to profitability in high frequency trading. The main objective of this paper is to propose a novel way of modeling the high frequency trading problem using deep reinforcement learning and to argue why deep RL has great potential in this field. We analyze the model's performance based on its prediction accuracy as well as its prediction speed across full-day trading simulations.
1. Introduction
Deep reinforcement learning is prophesied to revolutionize the field of AI and represents a step towards building autonomous systems. The technique can provide machines with a higher-level understanding of the world around them and the ability to take actions directly and improve themselves regularly without a human in the loop (Arulkumaran et al., 2017).
Deep reinforcement learning has been providing promising results in situations where the current state of the environment directly determines which action to take next, without any intermediate modeling required. While there has recently been a lot of debate about the hype around deep reinforcement learning and its fundamental challenges, such as the huge training data requirement and the lack of interpretability of the learned values (Irpan, 2018; MIT Technology Review, 2017), there is no denying that its success is unparalleled in the domains where it works. For example, deep reinforcement learning is currently used in problems such as learning to play games directly from the pixels of an image capturing the current state of the game (Mnih et al., 2013).
The inference of predictive models from historical data is not new in quantitative finance; examples include coefficient estimation for the CAPM, the Fama and French factors (Fama and French, 1993), and related approaches. However, the special challenges that HFT presents for machine learning can be considered twofold:

Microsecond-sensitive live trading: As the complexity of the model increases, it becomes more and more computationally expensive to keep up with the speed of live trading and actually use the information the model provides.

Tremendous volume and fine granularity of data: The past data available in HFT is huge in size and extremely precise. However, there is a lack of understanding of how such low-level data, such as recent trading behavior, relates to actionable circumstances (such as profitable buying or selling of shares) (Kearns and Nevmyvaka). There has been a lot of research on which "features" to use for prediction or modeling (Kearns and Nevmyvaka; Tran et al., 2017; Lee and Ready, 1991; Cont et al., 2010), yet it is difficult to model the problem without a proper understanding of the underlying relations.
2. Related Work
2.1. Deep Reinforcement Learning
One of the primary goals of the field of artificial intelligence (AI) is to produce fully autonomous agents that interact with their environments to learn optimal behaviours, possibly improving over time through trial and error.
Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world (Mnih et al., 2013; Haarnoja et al., 2018; Gu et al., 2016).
2.2. Machine Learning in High Frequency Trading
Since the rapid advancement of machine learning, few fields have been left untouched by it, and high frequency trading is no exception. A lot of work has been done on feature engineering in HFT for simpler models such as linear regression, multiple kernel learning, maximum margin, and traditional model-based reinforcement learning (Kearns and Nevmyvaka; Ntakaris et al., 2017).
Due to the computational complexity of deep learning models, less work has been done on incorporating such recent, more complex models; the focus instead has been on extracting useful features from the current trading book state and recent trading behavior. Common feature values such as the bid-ask spread, percentage change in price, and weighted price (Kearns and Nevmyvaka), along with some specialized features involving order imbalance (Shen, 2015), were among the many that we used in our model.
3. Problem Statement
We aim to create a pipeline which uses all or some of the available information about past trading behavior and the current snapshot of the order book to predict near-term price movement, updating its decision every time new information comes in.
3.1. Tick Data
Tick data refers to the most granular market information available from electronically traded markets. Every order request and trade is reported as a "tick" event. The current state of the order book refers to the top few (five, in our case) bid and ask orders placed, along with their proposed prices and volumes (Trade Ideas).
Tick data is the raw, uncompressed record of the trading behavior of the market; it requires a lot of storage space along with minimum hardware requirements for capturing it. Tick data is essential for many different types of analysis of market trends.
3.2. Bid-Ask Spread
Bid and ask are the prices that buyers and sellers, respectively, are willing to transact at: the bid on the buying side and the ask on the selling side. A transaction takes place when either a potential buyer is willing to pay the asking price or a potential seller is willing to accept the bid price. The spread between the top bid and ask prices at any moment is called the bid-ask spread, and the mid-price is the mean of the top bid and top ask prices (Investopedia).
Most standard trading algorithms work on the principle of mid-price prediction or mid-price movement prediction. However, a big drawback of this technique can be seen in Fig 1. In case 1 there is an increase in the mid-price of the product, but since we need to enter the market at the ask price and exit at the bid price, we actually incur a loss on the transaction. In case 2, the price movement is large enough to cover the bid-ask spread and the transaction is therefore profitable.
We have, however, diverted from this traditional approach and instead focused on predicting only those price movements which are substantial enough to cross the bid-ask spread. It is important to understand that, since we only consider movements which would certainly generate profit, predicting "no movement" does not mean that there was absolutely no movement in the price of the product. It just means that the movement was not substantial enough to cross the spread and was thus ignored by our model.
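As a concrete illustration, a minimal labeling rule consistent with this description might look like the following sketch. The function name and the exact comparison against the top-of-book prices are our illustrative choices, not taken from the paper:

```python
def spread_crossing_label(bid_now, ask_now, bid_future, ask_future):
    """3-class label: a price move only counts if it crosses the bid-ask spread.

    +1: buying at the current ask and selling at the future bid is profitable.
    -1: selling at the current bid and buying back at the future ask is
        profitable (the short side).
     0: any movement is absorbed by the spread, so it is treated as
        "no movement" even if the mid-price changed.
    """
    if bid_future > ask_now:   # long round trip clears the spread
        return 1
    if ask_future < bid_now:   # short round trip clears the spread
        return -1
    return 0
```

Note that a mid-price rise as in case 1 of Fig 1 still yields label 0 here unless the future bid climbs above the current ask.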
3.3. Mean Reversion
Mean reversion is a financial theory which suggests that the price of a stock tends to return towards its long-running mean/average price over time (Investopedia). Trading on this strategy is done by noticing companies whose stock values have moved significantly away from their long-running mean in some direction and are thus expected to move in the opposite direction. Ignoring the erratic trading periods at the start and end of the day, this behavior in stock prices is evident in most stock markets across the world (Chakraborty and Kearns, 2011). The oscillation of the price around a slowly changing mean over the day, as seen in Fig 2, is an example of mean reversion.
Using mean reversion in stock price analysis involves both identifying the trading range of a stock and computing the mean around which the price oscillates. There exist a few good statistical algorithms for mean prediction; however, movements of the mean itself become evident a little too late for such statistical methods, which can cause wrong predictions. Allowing a deep learning model to understand this oscillating nature of the price movement, and letting it learn the mean values and the trading range, can substantially boost prediction accuracy.
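As an illustration of the statistical baseline that the network is meant to subsume, a trailing z-score is one common way to quantify how far the latest price has strayed from its running mean. This sketch is ours; the paper does not prescribe any particular statistic:

```python
import numpy as np

def rolling_zscore(prices, window):
    """Deviation of the latest price from its trailing mean, in units of the
    trailing (population) standard deviation. Large absolute values suggest
    the price has strayed far from its running mean and, under mean
    reversion, may move back towards it."""
    hist = np.asarray(prices[-window:], dtype=float)
    mu, sigma = hist.mean(), hist.std()
    if sigma == 0.0:
        return 0.0          # flat window: no deviation to measure
    return float((prices[-1] - mu) / sigma)
```

The weakness noted above shows up directly here: when the mean itself shifts, `mu` lags the new level, so the z-score misreads a regime change as an extreme deviation.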
4. Proposed Model
The proposed pipeline consists of a deep reinforcement learning model that predicts the stock price movement, followed by a financial model that places orders in the market based on the predicted movement. The training of the neural network can be broadly divided into two parts: thorough training using data from the previous day, and online training while receiving live data. Let us now discuss the prediction model in detail.
4.1. Deep Neural Model
The base model is an ensemble of three MLP models. The problem is modeled such that there are three possible future price movement categories every time a tick is received: increase in price, decrease in price, or no movement. We use the one-vs-one classification technique with this ensemble of three neural models.
The activation function is ReLU at every hidden layer, with softmax after the last layer, and the error function is categorical cross-entropy. The final outputs of all three models are used to calculate confidence scores for each of the three possible outcomes, and predictions in which the highest confidence score falls below the threshold are considered not good enough to act upon. These are tagged no confidence predictions, and our behavior in these cases is the same as when the prediction is no movement.
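A sketch of how the three one-vs-one outputs might be combined into per-class confidence scores and then thresholded. The pairing and averaging rule here is our assumption; the paper only states that confidence scores are computed from the three model outputs:

```python
CLASSES = ("up", "down", "none")
# One pairwise binary model per unordered pair of classes (one-vs-one).
PAIRS = (("up", "down"), ("up", "none"), ("down", "none"))

def ensemble_predict(pair_probs, confidence_bound):
    """Combine three one-vs-one models into a single prediction.

    pair_probs[i] is the probability the i-th pairwise model assigns to the
    FIRST class of PAIRS[i]. Each class's confidence score is the mean
    probability given to it by the two pairwise models that involve it; if
    the best score is below the confidence bound, the prediction is tagged
    'no confidence' (treated like 'no movement' downstream)."""
    votes = {c: [] for c in CLASSES}
    for (a, b), p in zip(PAIRS, pair_probs):
        votes[a].append(p)
        votes[b].append(1.0 - p)
    conf = {c: sum(v) / len(v) for c, v in votes.items()}
    best = max(conf, key=conf.get)
    if conf[best] < confidence_bound:
        return "no confidence", conf
    return best, conf
```

Raising `confidence_bound` trades participation for accuracy, exactly the knob analyzed in Section 6.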
Since every security in the market has a different frequency of activity, both the history used for feature creation and the range of future prediction are expressed in numbers of ticks instead of absolute time. This can be interpreted as ignoring the concept of wall-clock time and treating every tick in the life of a security as one second.
4.2. Reinforcement Learning
Market dynamics keep changing with time, and relying solely on knowledge of market behavior from previous days is not a smart choice, no matter how good the model is. We therefore also need to keep updating the model online while it autonomously trades. This is where we incorporate the basics of reinforcement learning and use backpropagation at regular intervals to keep the model weights up to date. As mentioned earlier, the processing speed of the model is a very important factor in the trading domain, so updating the model after every tick is not feasible; instead we create mini-batches of data for training.
We accumulate ticks to create these batches, and after a specified number of ticks we update the model using the mini-batch. Since the model weights are already initialized with the weights learned on the previous day's data, we hypothesize that the model understands the concept of mean reversion and just needs to fine-tune itself to the current time of day and market conditions, and thus needs fewer iterations during online training.
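The accumulate-then-update loop can be sketched as follows. `partial_fit` is a hypothetical stand-in for one backpropagation pass over the mini-batch, and the class name and defaults are illustrative:

```python
import numpy as np

class OnlineTrainer:
    """Accumulate (features, label) pairs as ticks resolve; once `batch_size`
    examples are available, run a few gradient passes and reset the buffer.
    Only a few epochs are used because the weights are already initialized
    from the previous day's training and only need fine-tuning."""

    def __init__(self, model, batch_size=100, epochs=2):
        self.model, self.batch_size, self.epochs = model, batch_size, epochs
        self.X, self.y = [], []

    def on_labeled_tick(self, features, label):
        self.X.append(features)
        self.y.append(label)
        if len(self.X) >= self.batch_size:
            X, y = np.asarray(self.X), np.asarray(self.y)
            for _ in range(self.epochs):
                self.model.partial_fit(X, y)  # one backprop pass (stand-in)
            self.X, self.y = [], []           # start accumulating afresh
```

Between updates the model trades on constant weights, which is the source of the batch-size trade-off analyzed in Section 6.3.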
4.3. Complete Pipeline
The complete pipeline can be broadly divided into three sections:

Pre-Training: The model needs to be trained beforehand using data from the market activity of previous days. The trained weights are saved and used to initialize the model before taking it live.

Prediction: Features are created on the run, and the model predicts the price movement of the stocks using these features at every tick. Bid and ask orders are placed in or withdrawn from the market using this information.

Online Training: Running parallel to the predictions, we accumulate the feature values and the corresponding ground truth values, which arrive in the near future. Once a mini-batch is formed, the weights are updated by tuning the model on this batch, and the same process of accumulating data starts again.
Refer to Fig 3 for a better understanding of the flow of the pipeline.
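The flow of the three stages can also be sketched as a single event loop per security. All the callables here (`make_features`, `predict`, `place_orders`, `label_of`, `on_labeled_tick`) are placeholders for the components described above, not the authors' implementation:

```python
def run_live(feed, make_features, predict, place_orders, label_of,
             on_labeled_tick, horizon=100):
    """Single-security event loop tying the pipeline stages together.

    `predict` wraps the pre-trained model, `label_of(entry_tick, now_tick)`
    returns the ground truth once the prediction horizon has elapsed, and
    `on_labeled_tick` hands the resolved example to the online trainer."""
    pending = []  # (tick index, features) awaiting ground truth
    for i, tick in enumerate(feed):
        feats = make_features(tick)
        place_orders(predict(feats))            # Prediction stage
        pending.append((i, feats))
        # Online training stage: resolve examples whose horizon has elapsed.
        while pending and i - pending[0][0] >= horizon:
            j, old_feats = pending.pop(0)
            on_labeled_tick(old_feats, label_of(j, i))
```

In a real deployment the prediction and training halves run as separate processes (C++ and Python here, per Section 6.6), but the data flow is the same.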
5. Dataset
The dataset consists of tick-by-tick (TBT) data of over 1200 securities for a whole week. Every tick contains the price and volume of the last transaction and the top 5 levels of the order book on both the bid and ask sides.
The market behavior is not mean-reverting in the first few minutes after the market opens and in the last few minutes before it closes. In order to capture the mean-reverting trend, we therefore removed the first 30 and the last 30 minutes of data from every day to create a cleaner dataset. We also categorized securities based on their daily frequency of market activity and only worked with securities averaging between 50,000 and 100,000 ticks per day.
The final dataset used for training and simulation contains around 200 filtered securities with an average activity of around 70,000 ticks per day. Thus we had approximately 70,000 * 200 = 14 million data points for every single day, and a whole week's worth of such data to work with.
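The two filtering steps described above can be sketched as follows; the function and variable names are illustrative:

```python
def trim_session(ticks, open_ts, close_ts, pad_minutes=30):
    """Drop the erratic first and last `pad_minutes` of the session, where
    prices are not mean-reverting. Each tick is (timestamp_seconds, data)."""
    pad = pad_minutes * 60
    return [t for t in ticks if open_ts + pad <= t[0] <= close_ts - pad]

def filter_universe(daily_tick_counts, lo=50_000, hi=100_000):
    """Keep securities whose mean daily tick count lies in [lo, hi].
    `daily_tick_counts` maps security id -> list of per-day tick counts."""
    return [sec for sec, counts in daily_tick_counts.items()
            if lo <= sum(counts) / len(counts) <= hi]
```

Applying both steps to the raw week of data yields the ~200-security, ~70,000-ticks-per-day universe used in the simulations.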
6. Model Analysis
Traditional comparison metrics such as accuracy and F-score across all possible output classes are not appropriate in this case, due to factors such as the risk involved, the opportunities to earn profit, and the speed of prediction.
The two extremes can be illustrated by varying the confidence bound of the model. A model with a low confidence bound will have negligible "no confidence" predictions and will thus provide many opportunities for the user to gain profit, but it will also be wrong an awful lot of the time. On the other hand, a model with a high confidence bound will tag almost all of its predictions as "no confidence" and will predict price movements only in the most obvious cases; it will thus provide close to zero opportunities to enter or exit the market, even though the model is extremely accurate.
Thus a combination of a few different measures and terms, defined below and used in conjunction with each other, serves to evaluate model performance and tune parameters. Every analysis from this point forward is done on full-day trading simulations of the dataset available to us. The terms defined below are used directly in the analysis that follows:

Actionable Predictions: Predicting an increase or decrease in the price is an actionable prediction, since it initiates the action of either entering or exiting the market. Predicting no movement in price, or a no confidence prediction, is not considered actionable.

Accuracy: Percentage of correct actionable predictions out of all actionable predictions.

Confidence Bound: The threshold on the confidence score above which a prediction is considered valid. If the confidence score of no output class is above the confidence bound, the final prediction is considered a no confidence prediction. The trading model's behavior in this case is the same as for a no movement prediction.

Participation Percentage: Percentage of actionable predictions out of all predictions.
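Given these definitions, both metrics can be computed directly from a simulation log. This is a straightforward sketch using the class names `up`/`down`/`none` as illustrative stand-ins:

```python
def accuracy_and_participation(predictions, truths):
    """Compute (accuracy %, participation %) per the definitions above.

    `predictions` contains 'up', 'down', 'none', or 'no confidence';
    only 'up'/'down' are actionable. Accuracy is measured over actionable
    predictions only, participation over all predictions."""
    actionable = [(p, t) for p, t in zip(predictions, truths)
                  if p in ("up", "down")]
    participation = 100.0 * len(actionable) / len(predictions)
    if not actionable:
        return 0.0, participation   # no actionable calls: accuracy undefined
    correct = sum(p == t for p, t in actionable)
    return 100.0 * correct / len(actionable), participation
```

Note that 'none' and 'no confidence' predictions affect participation but never accuracy, which is why the two metrics must be read together.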
6.1. Reinforcement Learning
To understand the impact and importance of online learning in our model, we ran two different models: one with regular weight updates and one with constant weights throughout the simulation. The two models were compared across a variety of parameter values, and every time the reinforcement learning model did considerably better than the plain neural model, as can be seen in Fig 4.
6.2. Trailing History and Prediction Range
As previously mentioned, the unit of time used in our pipeline is the number of ticks rather than absolute time. The two important parameters of our model are the trailing window of time used to create features for the current tick and the prediction range representing how far into the future we predict.
Both of these values need to be expressed as fixed numbers of ticks. We experimented with different combinations of the two and compared their performance. For a fair assessment, the participation percentage of each model was held constant at approximately 10% by varying the confidence bound, and then their accuracies were compared, as shown in Fig 5.
Using a history of 500 ticks for feature creation and then predicting 100 ticks into the future outperforms every other combination. Since our hypothesis is based on the model understanding the trend of mean reversion in the market, these optimal parameter values can clearly vary from one market to another. Our analysis uses this combination in all the following comparisons.
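One way to hold participation constant while comparing models, as done above, is to choose the confidence bound from the empirical distribution of confidence scores. This calibration sketch is our reconstruction, not the authors' code:

```python
def bound_for_participation(best_classes, best_scores, target=0.10):
    """Pick the confidence bound that makes roughly `target` of all
    predictions actionable, so models can be compared at equal participation.

    Only 'up'/'down' predictions can become actionable; raising the bound
    turns the least confident of them into 'no confidence' predictions.
    `best_classes[i]` / `best_scores[i]` are the top class and its
    confidence score for prediction i on a held-out simulation run."""
    scores = sorted((s for c, s in zip(best_classes, best_scores)
                     if c in ("up", "down")), reverse=True)
    k = int(round(target * len(best_scores)))  # actionable predictions to keep
    if k == 0:
        return 1.0                 # bound above every score: none actionable
    if k >= len(scores):
        return 0.0                 # every up/down prediction stays actionable
    return (scores[k - 1] + scores[k]) / 2  # midpoint between kept and cut
```

With the bound calibrated this way per model, the accuracy comparison in Fig 5 is made at (approximately) matched participation.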
6.3. Mini Batch
The size of the mini-batch is also a parameter of concern. With larger mini-batches, the model runs on constant weights for a longer period of time, so performance decreases a little. However, larger batch sizes have a speed benefit, since updates are required at longer intervals and less computational power needs to be spent on them. Larger batches also reduce the chance of noise in the data affecting our training adversely. The opposite holds for smaller batch sizes, so we need to strike a balance between the best prediction accuracy and good prediction speed.
With the 500-tick history, 100-tick prediction range, and a constant participation percentage of around 10%, the performance of the model for different mini-batch sizes is compared in Fig 6. Clearly, decreasing the batch size below 100 does not provide any substantial increase in accuracy. We also need to keep in mind that, in order to keep up with the live predictions, a batch size of 50 requires twice the computational power of a batch size of 100. Shifting to a batch size of 50 for this marginal improvement in accuracy may thus not be worth it, depending on one's preferences and the available computational power.
6.4. MultiLayer Perceptron
Once the input-output parameters are finalized, we need to tune our MLP to reflect the fact that we cannot afford a very complex model, due to the time constraint and the prediction speed requirement, while at the same time a simple regression model might not provide the desired predictive power. Thus some form of non-linearity needs to be introduced into the model while keeping the complexity to a minimum.
Different model architectures were experimented with. In every model compared, each hidden layer was followed by a ReLU activation and the final layer by a softmax. The final results can be seen in Fig 7. Clearly, the increase in accuracy beyond a (10, 10) model is not significant; as mentioned earlier, the choice of model depends on the available computational power.
6.5. Confidence Bound
One important aspect of our model is the ability to shift importance between the participation percentage of the model and the accuracy of its predictions. This trade-off is analyzed in Fig 8, and one can always adjust the bar between the two according to one's appetite for the risks involved. To put things in perspective: since around 14 million ticks happen on an average day, a one-point change in participation percentage amounts to a change of 140,000 chances of making a trade every day in absolute terms.
6.6. Prediction Speed Analysis
The above analysis was based solely on the profitability and accuracy of the model. However, an important aspect of the proposed pipeline is having enough computational power to keep updating the model weights of every product in parallel while making predictions in real time.
To give the reader some perspective, we implemented our real-time prediction model in C++ along with a working trading module, and the model weight update code in Python; the two ran in parallel, communicating in real time. The server on which we ran the two programs had an Intel 4-core, 3 GHz processor with 16 GB RAM. The two programs ran in parallel and kept pace with each other while trading live across 20 different securities using a 500-tick trailing history, a 100-tick prediction range, a mini-batch of size 100, and the (10, 10) MLP model.
7. Challenges and Future Work
One of the biggest challenges, and ultimately a crucial success factor of our proposed pipeline, was the novel way in which we modeled the problem statement.
Strategies such as using ticks as the measure of time instead of absolute time, considering only those price movements which cross the bid-ask spread, and introducing a confidence bound to distinguish actionable from no confidence predictions differ from the conventional approaches used by many successful trading algorithms.
It is clear that reinforcement learning, together with the modeling techniques described above, can substantially boost a model's performance in this domain. An analysis of other existing pipelines could therefore be carried out after integrating them with reinforcement learning, in order to examine the resulting prediction speeds and corresponding accuracies.
While our focus was only on the prediction part of the pipeline, there are also many financial models which could use the information provided by our model to create a better trading strategy for more profitability. One could also experiment to see how well our model integrates with these techniques to create the most profitable autonomous trading system.
Acknowledgements.
We would like to thank WealthNet Advisors for their continuous guidance and for access to the computer hardware and working trading module used in all the experimentation and analysis done in this paper.

References
 (1) Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A Brief Survey of Deep Reinforcement Learning. 2017.
 (2) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. 2013.
 (3) Alex Irpan. Deep Reinforcement Learning Doesn't Work Yet. 2018.
 (4) MIT Technology Review. The Dark Secret at the Heart of AI. 2017. https://www.technologyreview.com/s/604087/thedarksecretattheheartofai/
 (5) Eugene Fama and Kenneth French. Common Risk Factors in the Returns on Stocks and Bonds. Journal of Financial Economics, 1993.
 (6) Michael Kearns and Yuriy Nevmyvaka. Machine Learning for Market Microstructure and High Frequency Trading.
 (7) Dat Thanh Tran, Martin Magris, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Tensor Representation in High-Frequency Financial Data for Price Change Prediction. 2017.
 (8) Charles M. C. Lee and Mark J. Ready. Inferring Trade Direction from Intraday Data. The Journal of Finance, 1991.
 (9) Rama Cont, Arseniy Kukanov, and Sasha Stoikov. The Price Impact of Order Book Events. 2010.
 (10) Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey Levine. Composable Deep Reinforcement Learning for Robotic Manipulation. 2018.
 (11) Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates. 2016.
 (12) Adamantios Ntakaris, Martin Magris, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Benchmark Dataset for Mid-Price Prediction of Limit Order Book Data. 2017.
 (13) Investopedia. Bid and Asked. https://www.investopedia.com/terms/b/bidandask.asp
 (14) Tanmoy Chakraborty and Michael Kearns. Market Making and Mean Reversion. 2011.
 (15) Investopedia. Mean Reversion. https://www.investopedia.com/terms/m/meanreversion.asp
 (16) Trade Ideas. Tick Data. https://www.tradeideas.com/glossary/tickdata/
 (17) Darryl Shen. Order Imbalance Based Strategy in High Frequency Trading. 2015.