Deep prediction of investor interest: a supervised clustering approach
Abstract
We propose a novel deep learning architecture suitable for the prediction of investor interest for a given asset in a given time frame. This architecture performs both investor clustering and modelling at the same time. We first verify its superior performance on a synthetic scenario inspired by real data and then apply it to two real-world databases: a publicly available dataset about the position of investors in the Spanish stock market and proprietary data from BNP Paribas Corporate and Institutional Banking.
Keywords — investor activity prediction, deep learning, neural networks, mixture of experts, clustering
1 Introduction
Predicting investor activity is a challenging problem in Finance. The basic problem can be stated as follows: given many thousands of assets and many thousands of investors, predict which investors will be interested in buying/selling which assets in the next (short) time period. What makes this problem difficult is the large heterogeneity of both investors and assets, compounded by the non-stationary nature of markets and investors and the limited time over which predictions are relevant.
Ad hoc methods are surprisingly efficient at clustering investors according to their trades in a single asset (Tumminello et al., 2012). In addition, clusters of investors determined for several assets separately have a substantial overlap (Baltakys et al., 2018), which shows that one may be able to cluster investors for more than a few assets at a time. The activity of a given cluster may systematically depend on the previous activity of some clusters, which can then be used to predict the investment flow of investors (Challet et al., 2018). Here, we leverage deep learning to train a single neural network on all the investors and all the assets of a given market and give temporal predictions for each investor.
The heterogeneity of investors translates into a heterogeneity of investment strategies (Tumminello et al., 2012; Musciotto et al., 2018): for the same set of information, e.g., financial and past activity indicators, investors can take totally different actions. Take for instance the case of an asset whose price has just decreased: some investors will buy it because they expect its price to increase in the long term and are thus happy to buy it at a discount; conversely, other investors will interpret the recent price decrease as indicative of the future trend or risk and refrain from buying it.
Formally, in our setting, a strategy $f$ is a mapping from current information $x$ to an expression of interest to buy and/or sell a given asset, encoded by a categorical variable $y$: $f : x \mapsto y$. We call $S$ the set of all the investment strategies that an investor may follow. Unsupervised clustering methods suggest that the number of different strategies that describe investors’ decisions is finite (Musciotto et al., 2018). We therefore expect our dataset to have a finite number $K$ of clusters of investors, each following a given investment strategy $f^k$. Consequently, we expect $S$ to be such that $|S| = K$, i.e. $S = \{f^k\}_{k=1,\dots,K}$. Alternatively, $S$ can be thought of as the set of distinguishable strategies, which may be smaller than the total number of strategies and which may therefore be considered as an effective set of strategies. At any rate, a suitable algorithm to solve our problem needs to be able to infer the set of investment strategies $S$.
A simple experiment shows how investors differ. We first transform BNP Paribas CIB bonds’ Request for Quotation (RFQ) database, along with market data and categorical information related to the investors and bonds, into a dataset of custom-made, proprietary features describing the investors’ interactions with all the bonds under consideration. This dataset is built so that each row can be mapped to a triplet (Investor, Financial Product, Date). This structure allows us, for a given investor and at a given date, to provide probabilities of interest in buying and/or selling a given financial product in a given time frame. We consider a given day to be a positive event for a given investor and a given financial product when the investor actually signalled their interest in that product in a window of 5 business days around that day. The reason is twofold: first, bonds are by essence illiquid financial products, and second, this increases the proportion of positive events.
At each date, negative events are randomly sampled among the (Investor, Financial Product) pairs that were observed as positive events in the past and that are not positive at this date. Using this dataset, we conduct an experiment to illustrate the non-universality of investors, i.e. the fact that investors have distinct investment strategies. The methodology of this experiment is reminiscent of the one used in Sirignano and Cont (2018) to study the universality of equity limit order books.
We use a dataset constructed as described above with five months of bonds’ RFQ data. We split this dataset into many subsets according to the investors’ business sector, e.g. one of these subsets contains investors coming from the Insurance sector only. We consider here only the sectors with a sufficient number of data samples to train and test a model. The remaining sectors are grouped together under the Others flag. Note that this flag is mainly composed of Corporate sectors, such as car industry, media, technology, telecommunications, etc. For each sector, some of the latest data is held out, and a gradient boosting model is trained on the remaining data. This model is then used for prediction on the held-out data of the model’s underlying sector, and on that of all the other sectors as well. For comparison purposes, an aggregated model using all sectors at once is also trained and tested in the same way.
Because classes are unbalanced, we compute the average precision score of the obtained results, as advised by Davis and Goadrich (2006), macro-averaged over all the classes, which yields the universality matrix shown in Fig. 1. One axis labels the sector used for training, and the other the sector on which the predictions are made.
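The scoring procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the experiment: the function names and the `fit` and `score` callables are hypothetical, and ties between scores are broken by sample order, which may differ from library implementations of average precision.

```python
def average_precision(y_true, y_score):
    """Mean of precision@k taken at the rank k of each positive sample."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank samples by decreasing score (ties keep their original order).
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def macro_average_precision(labels, scores_per_class):
    """Unweighted mean over classes of the one-vs-rest average precision.

    labels: list of class labels, one per sample.
    scores_per_class: dict mapping each class to its list of scores.
    """
    aps = []
    for cls, scores in scores_per_class.items():
        binary = [1 if y == cls else 0 for y in labels]
        aps.append(average_precision(binary, scores))
    return sum(aps) / len(aps)


def universality_matrix(train_sets, test_sets, fit, score):
    """Train one model per sector and score it on every sector.

    fit(train_data) -> model and score(model, test_data) -> float are
    hypothetical callables supplied by the user.
    Returns matrix[train_sector][test_sector].
    """
    matrix = {}
    for train_sector, train_data in train_sets.items():
        model = fit(train_data)
        matrix[train_sector] = {
            test_sector: score(model, test_data)
            for test_sector, test_data in test_sets.items()
        }
    return matrix
```

Each row of the resulting matrix corresponds to one trained model, so reading along a row shows how well a model calibrated on one sector transfers to the others.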
We observe that some sectors are inherently difficult to predict, even when calibrated on their data only: this is the case for Asset Managers of Private Banks and Pension Funds. On the contrary, some sectors seem to be relatively easy to predict, e.g. Broker Dealers and, to some extent, Central Banks. Overall, we note that there is always some degree of variability of the scores obtained by a given model: no universal model gives good predictions for all the sectors of activity. The non-universality of clients follows. In addition, it is worth noting that the aggregated model obtained better performance on some sectors than the models trained on these sectors’ data only. As a consequence, a suitable grouping of sectors would improve predictions for some sectors. This observation is in agreement with the above investment strategies hypothesis.
Following on these hypotheses, this work leverages deep learning both to uncover the structure of similarity between investors, namely the clusters, or strategies, and to make relevant predictions using each inferred cluster. The advantage of deep learning lies in the fact that it allows both of these tasks to be solved at once, and thereby unveils the structure of investors that most closely corresponds to their trading behaviour in a self-consistent way.
2 Related work
This work finds its roots in mixture-of-experts research, which began with Jacobs et al. (1991), from which we keep the basic elements which drive the structure presented in Section 3, and more particularly the gating and expert blocks. A rather exhaustive history of the research performed on this subject can be found in Yuksel et al. (2012).
The main inspiration for our work is Shazeer et al. (2017), which, although falling within the conditional computation framework, presented the first adaptation of mixture of experts for deep learning models. We build on this work to come up with a novel structure designed to solve the particular problem presented in Section 1. As far as we know, the approach we propose is new. We use an additional loss term to improve learning of the strategies, reminiscent of the one introduced in Liu and Yao (1999).
3 Experts Network
We introduce here a new algorithm, inspired by Shazeer et al. (2017), which we call the Experts Network (ExNet). The ExNet is purposely designed to be able to capture the hypotheses formulated in Section 1, i.e. to capture a finite, unknown number $K$ of distinct investment strategies $S = \{f^k\}_{k=1,\dots,K}$.
3.1 Architecture of the network
The structure of an ExNet, illustrated in Fig. 2, comprises two main parts: a gating block and an experts block. Their purposes are the following:

The gating block is an independent neural network whose role is to learn how to dispatch investors to the experts defined below. This block receives a distinct, categorical input, the gating input, corresponding to an encoding of the investors and such that the $i$-th row of the gating input corresponds to the investor indexing the $i$-th row of the experts input. Its output consists of a vector of size $n$ which contains the probabilities that the input should be allocated to each of the $n$ experts, computed by a softmax activation.

The experts block is made of $n$ independent sub-networks, called experts. Each expert receives as input the same data, the experts input, corresponding to the features used to solve the classification or regression task at hand, e.g. in our case the features outlined in Section 1: for a given row, the intensity of the investor’s interest in the financial asset considered, the total number of RFQ done by the investor, the price and the volatility of the asset, etc. As investors are dispatched to the experts through the gating block, each expert will learn a mapping $f_i$ that most closely corresponds to the actions of its attributed investors. The role of an expert is therefore to retrieve a given $f^k$, corresponding to one of the $K$ underlying clusters of investors which we hypothesized.
The outputs of these two blocks are combined through $f(x|a) = \sum_{i=1}^{n} p(i|a) f_i(x)$, where $a$ denotes the investor related to data sample $x$ and $p(i|a)$ is the probability that investor $a$ is assigned to expert $i$. Our goal is that $K$ experts learn to specialize to the $K$ clusters. As $K$ is unknown, retrieving all clusters requires that $n \geq K$, i.e. $n$ should be ‘large enough’. We will show below that the ability of the network to retrieve the $K$ clusters is not impacted by high values of $n$; using large values therefore ensures that the condition $n \geq K$ is respected and only impacts computational efficiency. The described architecture corresponds in fact to a meta-architecture. The architecture of the experts is still to be chosen, and indeed any kind of neural network could be used. For the sake of simplicity and computational ease, we use here rather small feed-forward neural networks for the experts, all with the same architecture, but one could easily use experts of different architectures to represent a more heterogeneous space of strategies.
Both blocks are trained simultaneously using gradient descent and backpropagation, with a loss corresponding to the task at hand, be it a regression or classification task, and computed using the final output of the network only, $f(x|a)$. One of the most important features of this network lies in the fact that the two blocks do not receive the same input data. We saw previously that the gating block receives as input an encoding of the investors. As this input is not time-dependent, the gating block of the network can be used a posteriori to analyse how investors are dispatched to experts with a single pass of all investors’ encodings through this block alone, thereby unveiling the underlying structure of investors interacting in the considered market.
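For a given investor $a$, the gating block computes the attribution probabilities of investor $a$ to each expert,
$$p(\cdot|a) = \mathrm{Softmax}(W e_a),$$
where $e_a \in \mathbb{R}^d$ is a trainable $d$-dimensional embedding of the investor $a$, $W$ is a trainable $n \times d$-dimensional matrix whose $i$-th row corresponds to the embedding of the corresponding expert, and we define $\mathrm{Softmax}(z)_i = e^{z_i} / \sum_{j=1}^{n} e^{z_j}$.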
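To make the combination of the two blocks concrete, here is a minimal forward-pass sketch in plain Python. It assumes scalar-output experts given as callables, an expert embedding matrix `W` with one row per expert, and an investor embedding `e_a`; all names are illustrative and no training is performed.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def gating_probabilities(W, e_a):
    """Attribution probabilities Softmax(W e_a).

    W: list of n rows of length d (one embedding per expert).
    e_a: investor embedding of length d.
    """
    logits = [sum(w * e for w, e in zip(row, e_a)) for row in W]
    return softmax(logits)


def exnet_output(experts, W, e_a, x):
    """Gate-weighted average of the expert outputs for sample x.

    experts: list of n callables, each mapping x to a scalar.
    """
    p = gating_probabilities(W, e_a)
    return sum(p_i * f_i(x) for p_i, f_i in zip(p, experts))


# Toy usage: two constant experts and a zero gating matrix,
# so both experts receive probability 0.5.
experts = [lambda x: 1.0, lambda x: 0.0]
W = [[0.0, 0.0], [0.0, 0.0]]
e_a = [0.3, -0.2]
output = exnet_output(experts, W, e_a, x=[0.1, 0.2])
```

Because the gating input depends only on the investor, calling `gating_probabilities` on every investor embedding after training is enough to inspect how investors are dispatched to experts.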
3.2 Disambiguation of investors’ experts mapping
The ExNet architecture is similar to an ensemble of independent neural networks, where the weights of the average are given by the gating block of the network. We empirically noticed that, without additional penalization, ExNets may assign equal weights to all experts for all investors. To avoid this behaviour, and thereby to help each investor follow a single expert, we introduce an additional loss term
$$\mathcal{L}_{entropy} = -\frac{1}{|B|} \sum_{a \in B} \sum_{i=1}^{n} p(i|a) \log p(i|a),$$
where $B$ is the current batch of data considered, $n$ is the number of experts, and $p(i|a)$ is the attribution of investor $a$ to the $i$-th expert. This loss term corresponds exactly to the entropy of the probability distribution over experts of a given investor, averaged over the batch. Minimising this loss term will therefore encourage distributions over experts to peak on one expert only.
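The effect of this term can be checked numerically with a small sketch. The function below assumes the loss is the entropy of each investor's distribution over experts averaged over the batch; the `eps` constant is an illustrative safeguard against taking the logarithm of zero.

```python
import math


def entropy_loss(batch_probs, eps=1e-12):
    """Batch average of the entropy of each distribution over experts.

    batch_probs: list of probability vectors, one per sample in the batch.
    """
    total = 0.0
    for p in batch_probs:
        total -= sum(p_i * math.log(p_i + eps) for p_i in p)
    return total / len(batch_probs)


# A uniform attribution over 4 experts has maximal entropy log(4),
# whereas a one-hot attribution has entropy 0.
uniform = entropy_loss([[0.25, 0.25, 0.25, 0.25]])
peaked = entropy_loss([[1.0, 0.0, 0.0, 0.0]])
```

The uniform case is the degenerate behaviour described above, and it is the one the penalty discourages most strongly.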
3.3 Helping experts specialize
Without a suitable additional loss term, the network has a tendency to let a few experts learn the same investment strategy, which also leads to a more ambiguous mapping from investors to experts. Thus, to help the network find different investment strategies and to increase its discrimination power regarding investors, we add a specialization loss term, which involves cross-expert correlations, weighted according to their relative attribution probabilities. It is written as:
$$\mathcal{L}_{spec} = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{i,j}\, \rho_{i,j}, \quad \text{with } w_{i,j} = \begin{cases} \dfrac{\bar{p}_i \bar{p}_j}{\sum_{k=1}^{n} \sum_{l \neq k} \bar{p}_k \bar{p}_l} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$
Here, $i, j \in \{1, \dots, n\}$, $\rho_{i,j}$ is the batchwise correlation between the outputs of experts $i$ and $j$, averaged over the output dimension, and $\bar{p}_i$ is the batchwise mean attribution probability to expert $i$, with $p(i|a)$ the attribution probability of investor $a$ to expert $i$, computed on the current batch of investors that counted this expert in their top $L$ only. The intuition behind this weight is that we want to avoid correlation between experts that were confidently selected by investors, i.e. to make sure that the experts that matter do not replicate the same investment strategy. As the size of the cluster of investors around a given expert should not matter in this weighting, we only account for the top $L$ probabilities for all the considered investors in these weights. In some experiments, it was found useful to rescale $\rho_{i,j}$ from $[-1, 1]$ to $[0, 1]$.
This additional loss term is reminiscent of Liu and Yao (1999). As a matter of fact, in ensembles of machine learning models, negatively correlated models are expected to perform better than positively correlated ones. This can also be expected from the experts of an ExNet, as negatively correlated experts better span the space of investment strategies. As the number of very distinct strategies grows, we can expect to find strategies that more closely match the ones the investors use in the considered market, or the basis functions on which investment strategies can be decomposed.
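One plausible reading of the specialization term is sketched below: a sum over pairs of distinct experts of the correlation of their outputs, weighted by the normalised product of their mean attribution probabilities, where each mean only uses the investors that rank the expert in their top `top_l`. The exact weighting and normalisation are assumptions of this sketch, experts are taken to have scalar outputs, and all function names are illustrative.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0.0 or sv == 0.0:
        return 0.0
    return cov / (su * sv)


def top_l_mean_attribution(batch_probs, top_l):
    """Mean attribution to each expert over samples ranking it in their top L."""
    n = len(batch_probs[0])
    sums, counts = [0.0] * n, [0] * n
    for p in batch_probs:
        top = sorted(range(n), key=lambda i: -p[i])[:top_l]
        for i in top:
            sums[i] += p[i]
            counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def specialization_loss(expert_outputs, batch_probs, top_l=1):
    """Weighted sum of pairwise correlations between distinct experts.

    expert_outputs[i]: scalar outputs of expert i on the batch.
    batch_probs[b]: attribution probabilities of sample b over the experts.
    """
    n = len(expert_outputs)
    p_bar = top_l_mean_attribution(batch_probs, top_l)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    norm = sum(p_bar[i] * p_bar[j] for i, j in pairs)
    if norm == 0.0:
        return 0.0
    return sum(
        p_bar[i] * p_bar[j] / norm * pearson(expert_outputs[i], expert_outputs[j])
        for i, j in pairs
    )


# Two experts with perfectly anti-correlated outputs give a loss of -1,
# two experts with identical outputs give +1.
probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
anti = specialization_loss([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]], probs)
same = specialization_loss([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]], probs)
```

Minimising such a term pushes confidently selected experts towards uncorrelated or negatively correlated outputs, which is the behaviour motivated in the text.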
3.4 From gating to classification
Up to this point, we have only discussed gating inputs related to investors. However, as seen above, being able to retrieve the structure of attribution of inputs to experts only requires using categorical data as input to the gating part of the network after the training phase. We can therefore perform gating on whatever is thought to be suitable: for instance, it is reasonable to think that bond investors have different investment strategies depending on the bonds’ grades, or depending on the sector of activity of the bonds’ issuers. Higher-level details about investors could also be considered, for instance because investment strategies may depend on factors such as the sector of activity of the investor, i.e. whether it is a hedge fund, a central bank or an asset manager, or the region of the investor. The investor dimension could even be dropped entirely, and the gating performed on asset-related categories only.
Gating allows one to retrieve the underlying structure of interactions of a given category, or set of categories. One can therefore purposely set categories to study how they relate in the problem one wants to study. This may however impact performance of the model, as chosen categories do not necessarily have distinct decision rules.
Note also that the initialization of weights in the gating network has a major impact on the future performance of the algorithm. To find relevant clusters, i.e. clusters that are composed of unequivocally attributed categories and that correspond to the original clusters expected in the dataset, categories need to be able to explore many different cluster configurations before exploiting one relevant configuration. To allow for this exploration, the gating block must be initialized so that the expert weights are initially fairly evenly distributed. In our implementation, we therefore use a random normal initialization scheme for the $d$-dimensional embeddings of the categories and of the experts.
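The effect of such an initialization can be illustrated with a short sketch. The standard deviation of 0.01, the embedding size and the numbers of experts and categories used here are hypothetical values chosen for illustration only; the point is that small random normal embeddings yield near-uniform gating probabilities at the start of training.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def init_embeddings(n_rows, d, std, rng):
    """Draw n_rows embeddings of size d from a centred normal distribution."""
    return [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(n_rows)]


def gating(W, e_a):
    """Gating probabilities for one category embedding e_a."""
    return softmax([sum(w * e for w, e in zip(row, e_a)) for row in W])


rng = random.Random(0)
n_experts, n_categories, d, std = 5, 3, 8, 0.01  # hypothetical values
W = init_embeddings(n_experts, d, std, rng)
E = init_embeddings(n_categories, d, std, rng)
initial_probs = [gating(W, e_a) for e_a in E]
# Every probability is close to 1 / n_experts = 0.2 before training.
```

With a much larger standard deviation, the initial logits would spread out and some categories would start almost fully committed to one expert, leaving little room for exploration.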
3.5 Limitations of the approach
Our approach handles a known, fixed base of investors well. However, it cannot easily deal with new investors, or, at a higher level, new categories as seen in Section 3.4, as embeddings for these new types of elements would need to be trained from scratch. To cope with such situations, we therefore recommend using sets of fixed categories to describe the evolving ones. For instance, instead of performing gating on investors directly, one can use investors’ categories such as sector, region, etc., that are already present in the dataset and on which we can train embeddings. Doing so improves the robustness of our approach to unseen categories. Note that this is reminiscent of one of the classic problems of recommender systems, known in the literature as the cold start problem.
4 Experiments
Before testing the ExNet architecture on real data, we first check, on synthetic data, its ability to recover a known strategy set, to correctly attribute traders to strategies, and finally to classify the interest of traders. We then show how our methodology compares with other algorithms on two different datasets: a dataset open-sourced as part of the experiments presented in Gutiérrez-Roig et al. (2019), and a BNP Paribas CIB dataset. Source code for the experiments on synthetic data and the open-source dataset is provided, and can be found at https://github.com/BptBrr/deep_prediction.
4.1 Synthetic data
4.1.1 Generating the dataset
Taking a cue from BNP Paribas CIB bonds’ RFQ database, we define three clusters of investors, each having a distinct investment strategy, which we label as ‘high-activity’, ‘low-activity’ and ‘medium-activity’. Each cluster contains a different proportion of investors, and each trader within a cluster has the same activity frequency: the ‘high-activity’ cluster accounts for roughly of the dataset samples, while containing roughly of the total number of investors. The ‘low-activity’ cluster accounts for roughly of the samples, while containing roughly of the total number of investors. The ‘medium-activity’ cluster accounts for the remaining number of samples and investors. In all the clusters, we assume that investors are equally active.
We model the state of investors as a binary classification task, with a set of features, denoted by , and a binary output representing the fact that a client is interested or not in the considered asset. Investor belonging to cluster follows the decision rule given by , where , being the cluster weights and an investorspecific bias, for , is distributed according to the uniform distribution on , and is the logistic function.
The experiment is conducted using a generated dataset of samples, investors and features. This dataset is split into train/validation/test sets, corresponding to of the whole dataset. is set to , and the cluster weights are taken as follows:

Highactivity cluster:

Lowactivity cluster:

Mediumactivity cluster:
These weights are chosen so that the correlation between the low- and medium-activity clusters is positive, but both are negatively correlated with the high-activity cluster. In this way, we build a structure of clusters, core decision rules and correlation patterns that is sufficiently challenging to demonstrate the usefulness of our approach.
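The generating process can be sketched as follows, under loudly stated assumptions: the cluster weights, the scale `alpha`, the Gaussian form of the investor-specific bias and the uniform range of the features used here are all hypothetical placeholders and are not the values of the experiment. Only the overall structure (cluster weights plus an investor-specific bias, passed through a logistic function) follows the description above.

```python
import math
import random


def logistic(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def investor_weights(cluster_weights, alpha, rng):
    """Cluster weights plus an investor-specific bias.

    The Gaussian bias of scale alpha is an assumption of this sketch.
    """
    return [w + rng.gauss(0.0, alpha) for w in cluster_weights]


def sample_event(w_a, rng):
    """Draw one (features, label, probability) triplet for an investor.

    Features are drawn uniformly on [-1, 1], an assumed range.
    """
    x = [rng.uniform(-1.0, 1.0) for _ in w_a]
    p = logistic(sum(w * v for w, v in zip(w_a, x)))
    y = 1 if rng.random() < p else 0
    return x, y, p


rng = random.Random(42)
cluster_w = [1.0, -2.0, 0.5]  # hypothetical cluster weights
w_a = investor_weights(cluster_w, alpha=0.5, rng=rng)  # hypothetical alpha
x, y, p = sample_event(w_a, rng)
```

Since the true probability `p` is returned alongside the label, a model that outputs `p` directly can be scored on the same data, which gives an upper reference for predictive performance.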
4.1.2 Results
We examine the performance of our proposed algorithm, ExNet, against a benchmark algorithm, LightGBM (Ke et al., 2017). LightGBM is a popular implementation of gradient boosting, as shown for example by the percentage of top Kaggle submissions that use it. This algorithm is fed with both the experts input of the ExNet and an encoding of the considered investors, used as a categorical feature in the LightGBM algorithm. For comparison purposes, experiments are also performed on a LightGBM model fed with the experts input and an encoding of the investors’ underlying clusters, i.e. whether the investor belongs to the high-, low- or medium-activity cluster, called LGBMCluster.
ExNets are trained using the cross-entropy loss, since the problem we want to solve is a classification one. The network is optimized using Nadam (Dozat, 2016), a variation of the Adam optimizer (Kingma and Ba, 2014) using Nesterov’s Accelerated Gradient (Nesterov, 1983), reintroduced in the deep learning framework by Sutskever et al. (2013), and Lookahead (Zhang et al., 2019). For comparison purposes, experiments are also performed on a multi-layer perceptron model fed with the experts inputs concatenated with a trainable embedding of the investors, called EmbedMLP. This model therefore differs from a one-expert ExNet in that the latter does not use an embedding of the investor to perform its predictions. All neural network models presented here used the rectified linear unit, $\mathrm{ReLU}(x) = \max(0, x)$, as activation function (Nair and Hinton, 2010).
LightGBM, ExNet and EmbedMLP results are shown in Table 1. They were obtained using a combination of random search (Bergstra and Bengio, 2012) and manual fine-tuning. LGBMCluster results used the hyperparameters found for LightGBM. These results correspond to the model which achieved the best validation accuracy over all our tests. The LightGBM and LGBMCluster shown had leaves, a minimum of samples per leaf, a maximum depth of , a learning rate of and a subsample ratio of with a frequency of . ExNetOpt, the ExNet which achieved the best validation accuracy, used experts with three hidden layers of sizes , and , a dropout rate (Srivastava et al., 2014) of , loss weights and , a batch size of , and a learning rate of . The EmbedMLP model shown used two hidden layers of size and , a dropout rate of , an embedding size , a batch size of , and a learning rate of .
To study the influence of the number of experts on the performance of ExNets, we call ExNet$n$ an ExNet algorithm with $n$ experts and vary $n$. These ExNets used experts with no hidden layers, batch-normalized inputs (Ioffe and Szegedy, 2015), and an investor embedding of size . These neural networks were trained for epochs, using early stopping with a patience of . All these experiments were carried out with a learning rate equal to and a batch size of , which was found to lead to satisfactory solutions in all the tested configurations. In other words, we only vary $n$ so as to be able to disentangle the influence of $n$ for an overall reasonably good choice of other hyperparameters. Only the weights attributed to the specialization and entropy losses were allowed to change across experiments.
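Table 1: Accuracy (in %) of the tested models on the train, validation and test splits, and test accuracy on each of the three clusters.

| Algorithm | Train Acc. | Val Acc. | Test Acc. | High Acc. | Medium Acc. | Low Acc. |
|---|---|---|---|---|---|---|
| LGBM | 96.38 | 92.05 | 92.41 | 92.85 | 90.47 | 92.34 |
| LGBMCluster | 93.89 | 92.33 | 92.94 | 93.03 | 92.54 | 93.89 |
| EmbedMLP | 93.87 | 92.88 | 93.19 | 93.20 | 93.14 | 92.24 |
| ExNetOpt | 93.57 | 92.99 | 93.47 | 93.56 | 93.09 | 93.17 |
| ExNet1 | 74.86 | 74.56 | 74.56 | 80.39 | 48.67 | 38.72 |
| ExNet2 | 90.73 | 90.59 | 90.86 | 91.66 | 87.32 | 82.71 |
| ExNet3 | 92.73 | 92.50 | 93.06 | 92.97 | 93.47 | 93.89 |
| ExNet10 | 92.91 | 92.66 | 93.16 | 93.12 | 93.36 | 93.89 |
| ExNet100 | 92.71 | 92.55 | 93.04 | 92.96 | 93.41 | 93.89 |
| Perfect model | 93.62 | 93.51 | 93.71 | 93.75 | 93.52 | 94.82 |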
Table 1 contains the results for all the tested implementations. As the binary classification considered here is balanced, we use the accuracy as evaluation metric. This table reports results on train, validation and test splits of the dataset, and a view of the test results on the three different clusters generated. As the generation process provides us with the probabilities of positive events, it is also possible to compute metrics for a model that would output these probabilities, denoted here as perfect model, which sets the mark of what good predictive performance is in this experiment.
We see here that the LGBM implementation fails to completely retrieve the different clusters. LGBM focused on the high-activity cluster and mixed the two others, leading to poorer predictions for both of these clusters, particularly for the medium-activity one. In comparison, LGBMCluster performed significantly better on the medium- and low-activity clusters. EmbedMLP better captured the structure of the problem, but appears to mix the medium- and low-activity clusters as well, albeit with better predictive performance. ExNetOpt, found with random search, captured all clusters well and obtained the best overall performance.
Moreover, the ExNet$n$ experiments show how the algorithm behaves as $n$ increases. ExNet1 successfully captured the largest cluster in terms of samples, i.e. the high-activity one, partly ignoring the two others, and therefore obtained poor overall performance. ExNet2 behaved like the LGBM experiment, retrieving the high-activity cluster and mixing the remaining two. ExNet3 perfectly retrieved the three clusters, as expected. Even better, the same holds for ExNet10 and ExNet100: this is because the ExNet algorithm, thanks to the additional specialization loss, is not sensitive to the number of experts even if $n \gg K$, as long as there are enough of them. Thus, when $n \geq K$, the ExNet is able to retrieve the $K$ initial clusters and to predict the interests of these clusters satisfactorily.
4.1.3 Further analysis of specialization
The previous results show that as long as $n \geq K$, the ExNet algorithm is able to capture the investment strategies corresponding to the underlying investor clusters efficiently. One still needs to check that the attribution to experts is working well, i.e. that the investors are mapped to a single, unique expert. To this end, we retrieved a posteriori from the gating block the attribution probabilities of all the investors to the experts. For comparison, we also analyse the investors’ embeddings of EmbedMLP. The comparison of the final embeddings of ExNetOpt and the ones trained in the EmbedMLP algorithm is shown in Fig. 3.
To visualize embeddings, we use here the UMAP algorithm (McInnes et al., 2018), which is particularly relevant as it seeks to preserve the topological structure of the embeddings’ data manifold in a lower-dimensional space, thus keeping vectors that are close in the original space close in the embedding space, and making inter-cluster distances meaningful in the two-dimensional plot. The two-dimensional map given by UMAP is therefore a helpful tool for understanding how investors relate to each other according to each deep learning method. In these plots, high-activity investors are shown in blue, low-activity investors in red and medium-activity investors in green. We can see in Fig. 3 that the EmbedMLP algorithm did not make a totally clear distinction between the low- and medium-activity clusters, contrary to the ExNet, which separated these two categories with the exception of a few low-activity investors mixed in the medium-activity cluster. The ExNet algorithm was therefore able to retrieve the original clusters.
The attribution probabilities to the different experts of ExNetOpt are shown in Fig. 4. We see in this figure that the attribution structure of this ExNet instance is quite noisy, although three different behaviours are clearly discernible. The first group of investors corresponds to the low-activity cluster, the second group to the medium-activity cluster and the last one to the high-activity cluster. Investors of the same cluster are however not uniquely mapped to an expert.
It is however possible to achieve a more satisfactory experts attribution, as one can see in Fig. 5 with the plots of ExNet. This comes from the fact that the ExNet instance used a higher level of than ExNetOpt: at the expense of some performance, we are able to obtain far cleaner attributions to the experts. We see here on the left plot that all investors were attributed almost entirely to one expert only, with for each of the corresponding experts mean attribution probabilities , even with an initial number of experts of , i.e. the setting. One can directly see on the UMAP plot three well-defined, monochromatic clusters. We can also see here that a low-activity investor got mixed in the medium-activity cluster, and that two separated low-activity clusters appear. These separated clusters originate from the fact that some low- and medium-activity investors were marginally attributed to the expert corresponding to the other cluster, as appears on the experts distribution plot.
The ExNet therefore solved the problem that we originally defined, obtaining good predictive performance on the three original clusters and uniquely mapping investors to one expert only, thereby explicitly uncovering the initial structure of the investors, a feature that an algorithm such as EmbedMLP is unable to perform.
4.2 IBEX data
4.2.1 Constructing the dataset
This experiment uses a real-world, publicly available dataset published as part of Gutiérrez-Roig et al. (2019) (https://zenodo.org/record/2573031) which contains data about a few hundred private investors trading 8 Spanish equities from the IBEX index, from January 2000 to October 2007. For a given stock and each day and each investor, the dataset gives the end-of-day position, the open, close, maximum and minimum prices of the stock as well as the traded volume.
We focus here on the stock of the Spanish telecommunication company Telefónica, TEF, as it is the stock with the largest number of trades. Using this data, we try to predict, at each date, whether an investor will be interested in buying TEF or not. An investor is considered to have an interest in buying TEF when $p_t^a - p_{t-1}^a > 0$, where $p_t^a$ is the position of investor $a$ at time $t$. We only consider here the buy interest as the sell interest of private investors can be driven by exogenous factors that cannot be modelled, such as a liquidity shortage of an investor, whereas the buy interest of an investor depends, to some extent, on market conditions. We thus face a binary classification problem, which is highly unbalanced: on average, a buy event occurs with a frequency of .
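The labelling rule can be written as a short sketch, assuming that a buy interest corresponds to a strict increase of the end-of-day position from one day to the next. The function name is illustrative.

```python
def buy_labels(positions):
    """Binary buy labels from a chronological list of end-of-day positions.

    The label for day t (t >= 1) is 1 if the position strictly increased
    between day t-1 and day t, and 0 otherwise. The returned list has one
    element fewer than the input, since the first day has no previous day.
    """
    return [1 if cur > prev else 0 for prev, cur in zip(positions, positions[1:])]


# Toy usage: the position increases on the third and fifth days only.
labels = buy_labels([10, 10, 15, 12, 20])
```

A decrease of the position, which would signal a sell, is mapped to 0 like an unchanged position, consistently with the choice of modelling the buy interest only.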
We consider a temporal split of our data in three parts: training data is taken from January 2000 to December 2005, validation data from January 2006 to December 2006 and test data from January 2007 to October 2007. We restrict our investor perimeter to investors that bought TEF more than times during the training period. We build two kinds of features:

Position features. Position is shifted such that at date corresponds , and is normalized for each investor using statistics computed on the training set. This normalized, shifted position is used as is as a feature, along with its moving averages over windows of 1 month, 3 months, 6 months and 1 year.

Technical analysis features. We compute all the features available in the ta package (Padial, 2018), which are grouped under 5 categories: Volume, Volatility, Trend, Momentum and Others features. As most of these features use close price information, we shift them such that features at a date only use information available up to .
We are left with rather active investors and features.
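The position features can be sketched as follows. This illustration assumes a one-day shift, a normalisation by the mean and standard deviation of the training period, and windows expressed in numbers of days; the helper names and the toy window length are hypothetical.

```python
import math


def train_stats(train_values):
    """Mean and (population) standard deviation of the training values."""
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / n
    return mean, math.sqrt(var)


def normalise(series, mean, std):
    """Standardise a series with statistics computed on the training set."""
    return [(v - mean) / std for v in series]


def shift(series, lag=1):
    """Delay a series by lag days so that day t only sees earlier values."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return ([None] * lag + list(series))[: len(series)]


def moving_average(series, window):
    """Trailing mean over at most `window` values, ignoring missing ones."""
    out = []
    for t in range(len(series)):
        win = [v for v in series[max(0, t - window + 1): t + 1] if v is not None]
        out.append(sum(win) / len(win) if win else None)
    return out


# Toy usage on one investor: statistics come from the first two days only.
positions = [1.0, 3.0, 5.0, 7.0]
mean, std = train_stats(positions[:2])
shifted = shift(normalise(positions, mean, std))
smoothed = moving_average(shifted, window=2)
```

Computing the statistics on the training period only, and shifting before averaging, keeps every feature at a given date free of information from that date or later.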
4.2.2 Results
ExNet and LightGBM are both tuned using a combination of random search (Bergstra and Bengio, 2012) and manual fine-tuning. As the dataset is highly unbalanced, the ExNet is trained using the focal loss (Lin et al., 2017), an adaptive re-weighting of the cross-entropy loss. The $\gamma$ parameter of this loss is treated as a hyperparameter of the network, and is also randomly searched. We also used the baseline buy activity rate of each investor in the training period as a benchmark.
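For reference, the binary focal loss of Lin et al. (2017) can be sketched as below, in its form without class-balancing weight. The value `gamma=2.0` is only an illustrative default, not the value used in the experiments, and `eps` is a numerical safeguard.

```python
import math


def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-12):
    """Mean binary focal loss.

    For each sample, p_t is the predicted probability of the true class,
    and the cross-entropy term -log(p_t) is scaled by (1 - p_t) ** gamma,
    which down-weights samples that are already well classified.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p_t = p if y == 1 else 1.0 - p
        total += -((1.0 - p_t) ** gamma) * math.log(p_t + eps)
    return total / len(y_true)


# With gamma = 0 the focal loss reduces to the usual cross-entropy.
y, p = [1, 0], [0.8, 0.3]
cross_entropy = focal_loss(y, p, gamma=0.0)
focal = focal_loss(y, p, gamma=2.0)
```

On a highly unbalanced problem, the many easy negative samples then contribute little to the gradient, so training concentrates on the rare and harder buy events.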
Algorithm    Train   Val    Test
Historical   9.68    4.55   2.49
LGBM         22.22   7.53   5.35
ExNet4       18.37   8.63   6.45
The LightGBM shown in Table 2 used leaves with a minimum of samples per leaf, a maximum depth of , a learning rate of , a subsample ratio of with a frequency of , a sampling of of columns per tree, with a patience of for a maximum number of trees of . The ExNet shown used experts with two hidden layers of size and with a dropout ratio of , embeddings of size , an input dropout of , and , a focal loss of parameter , a batch size of , a learning rate of and was trained using Nadam and Lookahead, with an early stopping of patience . As can be seen in this table, both algorithms beat the historical baseline, and the ExNet achieved the best test performance. While the precision of LightGBM is better on the training set, it is clearly inferior to that of ExNet on the validation set, a sign that ExNet is less prone to overfitting than LightGBM.
Figure 6 gives a deeper view of the results obtained by the ExNet. Three distinct behaviours appear in the left plot. Some of the investors were entirely attributed to the blue expert, some investors used a combination of the blue expert and two others, and some used combinations of the light blue and red experts. These three clusters are well separated in the UMAP plot on the right. It therefore appears that the ExNet retrieved three distinct behavioural patterns from the investors interacting on the TEF stock, leading to an overall better performance than the LightGBM, which was not able to capture them, as the experiments performed in Section 4.1 show.
4.2.3 Experts analysis
We saw in Section 4.2.2 that the ExNet algorithm retrieved three different clusters. Let us investigate in more detail what these clusters correspond to. First, the typical trading frequencies of the traders attributed to each of these three clusters are clearly different. The investors that were mainly attributed to the blue expert in Fig. 6, corresponding to the blue cluster on the right of the UMAP plot, can be understood as low-activity investors, trading on average of the time. The blue cluster on the left-hand side of the UMAP plot can be understood as medium-activity investors, buying on average of the days; the red cluster on the left of the plot is made of high-activity investors (). The ExNet therefore clearly retrieved three distinct behaviours, corresponding to three different activity patterns.
To get a better understanding of these three clusters, we can try to assess these clusters’ sensitivity to the features used in the model. We use here permutation importance, a widespread method in machine learning, whose principle was described for a similar method in Breiman (2001). The idea is to replace a given feature by permutations of all its values in the inputs, and assess how the performance of the model evolves in that setting. Here, we applied this methodology to the six groups of features: we performed the shuffle a hundred times and averaged the corresponding performance variations. For each of the three clusters, we pick the investor who traded the most frequently, and apply permutation importance to characterize the behaviour of the cluster. Results are reported in Table 3.
Feature group   Cluster 1   Cluster 2   Cluster 3
Position        37.4%       43.1%       23.7%
Volume          18.6%       +10.6%      19.9%
Volatility      22.8%       4.5%        2.1%
Trend           9%          2.4%        3.7%
Momentum        +1.7%       +4.3%       13.5%
Others          0.7%        +8.3%       +0.2%
In this table, cluster 1 is the low-activity cluster, cluster 2 the medium-activity one and cluster 3 the high-activity one. We see that the three groups have different sensitivities to the groups of features used in this model. While all clusters are particularly sensitive to position features, their sensitivity to the other features varies: leaving aside cluster 2, which only looks sensitive to position, cluster 1 is also sensitive to volume, volatility and trend, whereas cluster 3 is also sensitive to volume and momentum. The clusters therefore not only encode the activity rate, but also the type of information that a strategy needs, and by extension the family of strategies used by traders, which validates the intuition that underpins the ExNet algorithm.
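The group-wise permutation importance used for Table 3 can be sketched as follows. This is an illustrative implementation under stated assumptions: all columns of a group are shuffled jointly with the same row permutation, the variation is reported relative to the unshuffled score in percent, and the toy model, the accuracy score and the name `group_permutation_importance` are hypothetical stand-ins for the trained ExNet and its evaluation metric.

```python
import numpy as np


def group_permutation_importance(predict, score, X, y, groups, n_repeats=100, seed=0):
    """Average relative score variation (in %) when a group of columns is shuffled."""
    rng = np.random.default_rng(seed)
    base = score(y, predict(X))
    importance = {}
    for name, cols in groups.items():
        variations = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # ASSUMPTION: the columns of a group are permuted together, row-wise.
            perm = rng.permutation(len(X))
            X_perm[:, cols] = X[perm][:, cols]
            variations.append(score(y, predict(X_perm)) - base)
        importance[name] = 100.0 * float(np.mean(variations)) / base
    return importance


# Toy usage: the label depends on column 0 only, columns 1 and 2 are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
predict = lambda data: (data[:, 0] > 0).astype(int)      # stand-in for a trained model
accuracy = lambda truth, pred: float(np.mean(truth == pred))
result = group_permutation_importance(
    predict, accuracy, X, y, {"signal": [0], "noise": [1, 2]}
)
```

In this toy setting, shuffling the noise group leaves the score unchanged, while shuffling the signal group removes roughly half of it, which is the kind of contrast Table 3 reports across feature groups and clusters.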
4.3 BNPP CIB data
The previous experiments demonstrated the ability of the network to retrieve the structure of investors with a finite set of fixed investment strategies, and the usefulness of our approach on a real-world dataset. We now give an overview of the results we obtain on the BNPP CIB bonds’ RFQ dataset specified in Section 1 for the non-universality of clients study.
As a reminder, the assets considered are corporate bonds. The data used ranges from early 2017 to the end of 2018 with temporal train/val/test splits, and is made of custom proprietary features using client-related, asset-related and trade-related data. Our goal is, on a given day, to predict the interest of a given investor in buying and/or selling a given bond; each row of the dataset is therefore indexed by a triplet (Investor, Bond, Date). Targets are constructed as previously explained in Section 1. In the experiment conducted here, we consider different investors interacting around a total of more than distinct bonds.
The left-hand plot of Fig. 7 shows the distribution over experts for all the considered investors. We see three different patterns appearing: one which used the brown expert only, another one the green expert only, and a composite one. These patterns lead to four clusters on the right-hand plot. In this plot, as previously done in the IBEX experiment, each point corresponds to an investor, whose color is determined by the expert to which she is mainly attributed, with colors matching those of the left-hand plot. We observe empirically that these clusters all have different activity levels: the larger brown cluster is twice as active as the green one, and the two smaller clusters have in-between average activities. The ExNet therefore retrieved distinct behavioural patterns, confirmed by a global rescaled specialization loss below , hence negatively correlated experts.
Obtaining finer clusters could be achieved in multiple ways. A higher-level category could be used as gating input: instead of encoding investors directly, one could encode their sector of activity, in the fashion of the non-universality of clients experiment. With an encoding of the investors, running an ExNet on the investors of one of the retrieved clusters only would also lead to a finer clustering of the investors; a two-stage gating process could even lead to it directly, and will be part of further investigations on the ExNet algorithm. Note however that these maps (and ExNets) are built from measures of simultaneous distance and hence do not exploit lead-lag relationships. How ExNets could be adapted to a temporal setting to retrieve lead-lag relationships is left for future investigation as well.
On a global scale, these plots help us understand how investors relate to each other. One can therefore use them, through a thorough analysis of the learnt experts, to obtain a better understanding of BNP Paribas CIB business and of how BNP Paribas CIB clients behave on a given market.
5 Conclusion
We introduced a novel algorithm, ExNet, based on the financial intuition that in a given market, investors may act differently when exposed to the same signals, and cluster around a finite number of investment strategies. This algorithm is able to perform both prediction, be it regression or classification, and clustering at the same time. The fact that these operations are trained simultaneously leads to a clustering that most closely serves the prediction task, and a prediction that is improved by the clustering. Moreover, one can use this clustering a posteriori, independently, to gain knowledge as to how individual agents behave and interact with each other. To help the clustering process, we introduced an additional loss term that penalizes correlation between the inferred investment strategies. We demonstrated the usefulness of our approach in an experiment with simulated data, and we discussed how the ExNet algorithm performs on an open-source dataset of Spanish stock market data and on data from BNP Paribas CIB. Further research on the subject will include how such architectures could be extended and staged, and how they could be adapted to retrieve lead-lag relationships in a given market.
On a final note, the ExNet architecture introduced in this article can be applied wherever one expects agents to use a finite number of decision patterns, e.g. in e-shopping or movie opinion databases (Bennett et al., 2007).
6 Acknowledgements
This work was conducted under the French CIFRE PhD Programme, in collaboration between the MICS Laboratory at CentraleSupélec and BNP Paribas CIB Global Markets. We thank Sarah Lemler, Frédéric Abergel and Julien Dinh for helpful discussions and feedback on early drafts of this work.
References
Baltakys et al. (2018). Multilayer aggregation with statistical validation: Application to investor networks. Scientific Reports 8 (1), pp. 8198.
Bennett et al. (2007). The Netflix Prize. In Proceedings of KDD Cup and Workshop, Vol. 2007, pp. 35.
Bergstra and Bengio (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305.
Breiman (2001). Random forests. Machine Learning 45 (1), pp. 5–32.
Challet et al. (2018). Statistically validated lead-lag networks and inventory prediction in the foreign exchange market. Advances in Complex Systems 21 (08), pp. 1850019.
Davis and Goadrich (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240.
Dozat (2016). Incorporating Nesterov momentum into Adam.
Gutiérrez-Roig et al. (2019). Mapping individual behavior in financial markets: synchronization and anticipation. EPJ Data Science 8 (1), pp. 10.
Ioffe and Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Jacobs et al. (1991). Adaptive mixtures of local experts. Neural Computation 3 (1), pp. 79–87.
Ke et al. (2017). LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.
Kingma and Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lin et al. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
Liu and Yao (1999). Simultaneous training of negatively correlated neural networks in an ensemble. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29 (6), pp. 716–725.
McInnes et al. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Musciotto et al. (2018). Long-term ecology of investors in a financial market. Palgrave Communications 4 (1), pp. 92.
Nair and Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
Nesterov (1983). A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, Vol. 269, pp. 543–547.
Padial (2018). Technical Analysis Library using Pandas. GitHub. https://github.com/bukosabino/ta
Shazeer et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Sirignano and Cont (2018). Universal features of price formation in financial markets: Perspectives from deep learning.
Srivastava et al. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
Sutskever et al. (2013). On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147.
Tumminello et al. (2012). Identification of clusters of investors from their real trading activity in a financial market. New Journal of Physics 14 (1), pp. 013041.
Yuksel et al. (2012). Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems 23 (8), pp. 1177–1193.
Zhang et al. (2019). Lookahead optimizer: k steps forward, 1 step back. arXiv preprint arXiv:1907.08610.