Multilayer Aggregation with Statistical Validation: Application to Investor Networks
Abstract
Investor trading networks are attracting growing attention in the financial market literature. In this paper, we propose three improvements to their analysis: information aggregation, transaction bootstrapping, and investor categorization. These components can be used individually or in combination. For information aggregation, we introduce a tractable multilayer aggregation procedure to integrate securitywise and timewise information about investor category trading networks. We use transaction bootstrapping to capture the properties of the actual data generation process and to have a more robust statistical testing procedure. Investor categorization allows for inferring constant size networks and more observations for each node, which is important especially for less liquid securities. We apply this procedure by analyzing a unique data set of Finnish shareholders during the period 2004–2009. We find that households play a central role in investor networks, as they have the most synchronized trading. Furthermore, we observe that the window size used for averaging has a substantial effect on the number of inferred relationships. Importantly, the use of our proposed aggregation framework is not limited to the field of investor trading networks; in fact, it can be used for different nonfinancial applications, with both observable and inferred relationships, spanning a number of different information layers.
Kęstutis Baltakys (1,*), Juho Kanniainen (1), Frank Emmert-Streib (2,3)
(1) Laboratory of Industrial and Information Management, Tampere University of Technology, Finland
(2) Predictive Medicine and Data Analytics Lab, Department of Signal Processing, Tampere University of Technology, Finland
(3) Institute of Biosciences and Medical Technology, Tampere, Finland
(*) Corresponding author, kestutis.baltakys@tut.fi
Introduction
Recent empirical evidence has triggered a disagreement over conventional tractable financial models[1]. In finance, complex network theory [2, 3] has mainly been applied to gauge the systemic risk posed by interconnected banks [4, 1, 5, 6, 7], but recently, multiple network inference methods have been developed to investigate trading behavior[8, 9] and portfolios[10] in investor network research.
An investor network is a representation of a real-world complex system where institutional and private investors indirectly interact with each other by trading or owning securities. In general, network science methods allow for analyzing and gaining a clearer understanding of the intricate relationships between the components of this system, and a key advantage of such an approach is that it allows for visualizing the resulting networks [11, 12]. However, estimating investor networks is not straightforward, as links between investors are not directly observable. Instead, a link represents the abstract similarity of a pair of investors in terms of trading behavior or portfolios. Therefore, the analysis requires investor-level transaction or portfolio data and an appropriate statistical inference method for inferring such networks from the data. Even though complex network methods have begun attracting attention to investor-level data [13], many methodological challenges remain, several of which we aim to address in this paper. For our analysis, we use data from a large shareholder registry to investigate the trading networks of different investor categories.
First, the main challenge in investor trading networks is considering multiple securities leading to a multilayer network representation. What if we wanted a simple network representation, which would have statistically significant relationships over multiple securities? Ever-changing investor behavior poses difficulties for correctly inferring their relationships. Most likely, performing network inference for a whole period will not reveal the whole picture, as localized relationships between investor categories occurring at different periods might be diluted when we look at longer horizons. At the same time, static networks inferred over a whole period do not provide information on how node relationships evolve over time. In order to analyze the varying associations between investor categories, we use a simple, window-based analysis to recover the time-evolving networks of investor category interactions. Moreover, having a sequence of network snapshots, one might want to summarize the most important reoccurring relationships over the whole period. Therefore, we propose a multilayer aggregation approach that can address this challenge and yield a network representation over multiple securities and/or estimation periods. We also consider the influence of window size on the resulting aggregated networks [14]. As we show, this approach allows for producing robust network structures using all of the transaction data over multiple securities and estimation periods without discarding a single transaction.
Second, we can think of investor trading as a data generation process that produces observations (transaction data) based on unobservable trading mechanisms. For example, trading algorithms have specific trading rules, and household investors with more or less intuitive trading strategies can have certain (stochastic) mechanisms, which are impossible to observe directly. The point is that the data set of observable transactions is just one realization of the underlying data generation process driven by certain mechanisms. Therefore, one might wonder which data sample to use for the network inference: all the transaction data together or one or more subsamples of the full set of trading data. In addition, in our case, the investor category consists of many investors, and we want to prevent cases where a couple of active investors or investors who trade large volumes overshadow the behaviors of other investors in the category. In our approach, we address these problems by performing the lowest-resolution bootstrapping at the investor transaction level. An empirical demonstration shows that the results clearly differ between the conventional approach of using the full data set directly and our data bootstrapping approach.
Third, the transaction data for network inference suffers from a high-dimension, low-sample-size problem [15], as the number of investors exceeds the number of trading days. Estimating investor networks based on trading similarities requires long observation periods and sufficient data for each investor [8]. Since the majority of household investors are rather inactive, only a fraction of investors (the active ones) can be included in the analysis. The exclusion of inactive investors leads to the description of a subsystem; therefore, the conclusions can be difficult to generalize at the market level. In this paper, we solve this problem by assigning investors to categories according to investor attributes that are available in the data set. Such a categorization allows us to reduce the number of variables in the system significantly, but we do not exclude data, as the categories contain aggregated data from the whole system. Importantly, this approach allows for considering inactive investors and less liquid stocks with fewer trading events. The size of such a categorized network remains the same over time, whereas the size of a network of individual investors can change over time, depending on the activeness of the investors. Since investor categories are based on real attributes, we can characterize the nature of each category. This makes the system interpretable in economic and sociological terms.
To demonstrate our multilevel aggregation approach, we use an investor-level transaction data set obtained from Euroclear Finland Ltd for our analysis. It includes transactions from 2004-01-01 to 2009-12-31 of all investors that traded stocks listed on the Nasdaq OMX Helsinki Exchange. Each transaction also contains metadata about the investors (the same data set is used, for example, in refs. [tumminello2012identification, grinblatt2000investment, berkman2014informed], while ref. [ozsoylev2013investor] uses a similar data set of trades on the Istanbul Stock Exchange). In this data set, the attributes used to categorize investors include gender, year of birth, and postal code for households and sector code for institutions. These attributes allow us to define investor categories on which our analyses are performed.
Our framework is based on several building blocks, inspired by the bagged conservative causal core (BC3NET) method [18, 19], originally introduced to infer gene regulatory networks based on genomics data. The blocks consist of the following:

Investor categorization, where all investors in the analyzed data set are assigned to a limited number of categories based on their economic and social attributes. (Investors → Investor Categories)

Data bootstrapping, where the analyzed data is resampled into multiple data sets. The advantage of the bootstrap is that it does not require any assumptions about the data distribution and it addresses the issue of finite time series. (Dataset → Resampled Datasets)

Network inference, where we apply a chosen inference technique to identify edges between investor categories for each resampled data set to produce an ensemble of networks. Any network inference method [20, 21, 22, 23, 24] that produces or can be converted into binary, nonweighted networks can be applied. Our main method choice in the results section for network inference is the conservative causal core (C3NET) [18] algorithm, for its computational efficiency (see the methods section for more details). (Dataset → Network)

Aggregation, where a network ensemble is aggregated to identify significant relationships that appear across the set of networks [25]. (Network Ensemble → Aggregated Network)
The novelty of our approach is that we can aggregate networks in the manner displayed in Fig. 1 to capture trading relationships over multiple securities and periods. Overall, we have two different layers. The first layer indicates securities and the second one indicates time. Interestingly, there are two different ways to integrate over these variables, indicated by the blue and red arrows. We show in the following that the results highlight different characteristics of the data.
For each time step and each security, we want to extract a network. These networks cannot be directly observed, but they are estimated using the transaction data set. By bootstrapping this data set, we generate bootstrap data sets. Network inference is applied to each of these data sets, resulting in an ensemble of networks. The aggregation of the networks results in one network, indicated by (1) (see Figure 1). Adjacent networks in the main matrix are similarly inferred for other time steps and securities. Each column is an ensemble of networks that contains information about trading relationships for different securities during the same time step, while each row is a network ensemble that contains information about trading relationships in individual securities over different time steps. In the following, we first describe the securitywise integration and then the timewise integration.
Initially, we integrate the securitywise information for each time step contained in the columns. Network (2) represents an aggregated network for time step 1 over all securities. Repeating a similar analysis for each of the different time steps results in further networks for the corresponding cases. To combine these networks, we perform aggregation again, resulting in one final network, indicated by (3) in the figure. The blue arrows in the figure represent the aforementioned steps. Alternatively, one can perform a timewise integration first in a similar way. This type of integration follows the red arrows in the figure and applies the aggregation method twice because two integration steps are required. This leads to the final network indicated by (5).
Interestingly, even though the final networks (3) and (5) summarize the same information, because of the different aggregation order, the captured relationships might be different, as shown in the results section.
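The effect of the aggregation order can be illustrated with a minimal Python sketch. It assumes that each network is stored as a set of undirected edges and, for brevity, replaces the statistical test with a hypothetical fixed occurrence threshold `min_count`; the function names and the toy grid are illustrative and not part of the original analysis.

```python
from collections import Counter

def aggregate(networks, min_count):
    """Keep edges that occur in at least `min_count` networks.

    Stand-in for the statistical aggregation step: a fixed threshold is an
    assumption made only to keep the sketch short.
    """
    counts = Counter(edge for net in networks for edge in net)
    return {edge for edge, c in counts.items() if c >= min_count}

def time_first(grid, min_count):
    """grid[s][t] is the network of security s at time step t."""
    per_security = [aggregate(row, min_count) for row in grid]
    return aggregate(per_security, min_count)

def security_first(grid, min_count):
    """Aggregate each time step over securities, then over time steps."""
    per_step = [aggregate(column, min_count) for column in zip(*grid)]
    return aggregate(per_step, min_count)

# Toy grid: 2 securities x 3 time steps, one candidate edge.
e = ("A", "B")
grid = [
    [{e}, {e}, set()],   # security 0: edge present at t0 and t1
    [{e}, set(), {e}],   # security 1: edge present at t0 and t2
]
result_time_first = time_first(grid, min_count=2)
result_security_first = security_first(grid, min_count=2)
```

In this toy case the edge recurs within each security but co-occurs across securities only at the first time step, so it survives the timewise-first order and is dropped by the securitywise-first order.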
Scientific literature investigating multilayer network aggregation is scarce, as the multilayer networks themselves have only recently started to gain more attention [26, 27, 28], especially in the financial area [29, 30]. The paper that most closely resembles ours regarding its topic proposes an ensemble-based network aggregation [31] method that leverages the rank-product method [32] to improve the accuracy of gene network reconstruction. However, the algorithm is intended to integrate gene networks inferred using different methods and genomics data sets. Other trivial network ensemble aggregation procedures include maximum and mean rules [33]. Another recent paper [34] proposes a method for reducing the complexity of multilayer networks by aggregating the redundant layers while retaining the pertinent information about the whole system. In practice, the goal of their method is to combine similar layers and keep dissimilar layers apart. The objective of our research is different; we are looking for the most important relationships that span multiple layers, rather than keeping information about different layers.
The main contribution of this paper to the field is twofold. First, in terms of investor network inference, we consider investor categories, instead of individual investors, and second, in terms of network aggregation, we propose the use of a tractable multilayer and multistep aggregation procedure. Hence, our approach is aimed at integrating an ensemble of networks, resulting in a network that captures the most significant consistencies in investor relationships over multiple time snapshots and many securities. Methodologically, this framework can be used for different nonfinancial applications, with various network estimation methods, even for observable networks, such as social networks[35], different communication channels[36, 37, 38], transportation[39], and coauthorship[40] networks, where network estimation is not needed.
In the next section, we present the results from the method application to our data set. We begin by applying our proposed techniques to single security networks. We investigate the impact of transaction bootstrapping on the network inference problem and compare a network inferred over the whole period to an aggregated network from a set of network snapshots. Next, we investigate multiple security networks. First, we use the aggregation technique to summarize information about trading in multiple securities and then we perform a two-layer aggregation, summarizing the information given by a series of network snapshots for a set of securities.
Results
In this section, we describe the network inference and aggregation process over single and multiple securities by performing the analysis over the whole period of analysis and multiple nonoverlapping subperiods. Mutual information (MI) values are estimated from daily net volume time series for each investor group pair. We choose the C3NET algorithm for network inference from the MI estimates; however, other methods can be used. Combined with transaction bootstrapping, our method closely resembles the bagged conservative causal core (BC3NET) approach, except that in our case, the sampling is performed at a lower transaction level.
Single Security Networks
Network inference
We begin our results section by comparing inferred networks using the C3NET algorithm with and without transaction bootstrapping. We perform this comparison for the most liquid security in the Helsinki stock exchange, namely, Nokia. By definition, C3NET allows for establishing as many links as there are nodes in the network, if each investor group has at least one statistically significant MI estimate with some other group. In our data set from 2004-01-01 to 2009-12-31, using C3NET, we infer 95 links. Interestingly, even after completing the categorization, some investor categories do not have a sufficient number of Nokia transactions to estimate relationships. For the bootstrapped version of network inference, we perform 100 transaction sampling iterations and form a network for each of them using the C3NET algorithm. The resulting ensemble of 100 networks contains links with different relationships. As a statistical null model for our ensembles, we choose the canonical Erdős–Rényi model, with a fixed number of nodes and an ensemble probability of a random link (see the methods section for more details). A fully connected ensemble would have links; therefore, the probability of having a random link in the ensemble is estimated to be . By choosing a significance of and adjusting it by the number of tests we perform (), we conclude that a relationship must be observed in at least networks for it to be considered nonrandomly occurring. The bootstrapped version identifies a total of relationships that are statistically significant. Hence, the topology is no longer limited to one link per node. Almost all relationships from the nonsampled C3NET network are found also in the bootstrapped version, that is, out of .
The two networks are depicted in subplots (a) and (b) of Figure 2. Both networks identify the same nodes as most connected, and the four most connected nodes represent households. Specifically, the most connected node represents mature Helsinki households, followed by the same age group of western Tavastians, then middle-aged western Tavastians, and finally, mature northern Finnish households. The most connected nonhousehold groups in the bootstrapped version are nonfinancial companies from western Tavastia, Ostrobothnia, and central Finland, with six relationships each. The most connected financial insurance group is from northern Savonia, with five relationships, followed by Helsinki, with four relationships in the bootstrapped version.
Timewise network aggregation
The third and fourth networks for Nokia security in Figures 2 (c) and (d) are obtained by aggregating two 12-network ensembles inferred from nonoverlapping 6-month periods covering the whole 6-year period analyzed. As in the previous section, we compare Nokia networks inferred with and without transaction bootstrapping. The number of relationships in the nonbootstrapped version ensemble varies from to and from to in the bootstrapped version. A total of different relationships are observed throughout the 12 networks in the transaction bootstrapped network ensemble and the total number of links in the ensemble is , while in the nonbootstrapped version, the numbers of relationships and links are and , respectively. Each network contains 110 nodes, and therefore, the total possible number of links in the ensemble is equal to and the probabilities of having a random link are estimated to be and . Again, by choosing the statistical significance of and adjusting for the number of tests performed, a link must appear at least times in order to be aggregated into the final network for the bootstrapped version and times for the nonbootstrapped version. From Table 1, we see that in the bootstrapped version, links appear at least times in the networks, and Figure 3 shows the link occurrence in the ensemble. In the latter figure, we can see that some relationships are accumulated in consecutive periods while others are more scattered over time.
Number of occurrences  12  11  10  9  8  7  6  5  4  3  2  1
Links  1  4  2  4  6  9  15  21  46  107  375  992
Cumulative  1  5  7  11  17  26  41  62  108  215  590  1582
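The ensemble quantities that feed the significance test can be recomputed from the occurrence table above. The following Python sketch is a worked check rather than part of the original analysis; it assumes that the table lists every link of the bootstrapped 12-network ensemble and that the networks are undirected with no self-loops, so a 110-node network has at most 110 × 109 / 2 links. The variable names are illustrative.

```python
# Occurrence table for the bootstrapped 12-network ensemble (values from the table).
occurrences = [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
links = [1, 4, 2, 4, 6, 9, 15, 21, 46, 107, 375, 992]

# Number of distinct relationships seen at least once in the ensemble.
distinct_relationships = sum(links)

# Total number of links in the ensemble: each relationship counted once
# per network in which it appears.
total_links = sum(k * n for k, n in zip(occurrences, links))

# Assumption: undirected networks without self-loops on 110 nodes.
n_nodes, n_networks = 110, 12
possible_links = n_networks * n_nodes * (n_nodes - 1) // 2

# Ensemble probability of a random link under the null model.
p_random = total_links / possible_links

# Cumulative row of the table, rebuilt from the link counts.
cumulative = []
running = 0
for n in links:
    running += n
    cumulative.append(running)
```

Under these assumptions the estimated probability of a random link is roughly 0.037.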
From Figure 4 we can see that links overlap with the bootstrapped version of C3NET for the whole period under analysis. Further, for the nonbootstrapped version, 30 relationships are inferred after timewise aggregation. Of those 30 relationships, 26 also appear in the bootstrapped version. All nodes that have relationships are households, except for two nonfinancial investor groups. A visual inspection of all four networks in Figure 2 reveals that the most important set of nodes in both networks inferred from the whole transaction data set is also identified as central in networks aggregated from various time window analyses.
Multiple Security Networks
Securitywise aggregation
Here, we aim to incorporate information about investor group trading relationships in 100 securities over the whole 6-year period. We start by inferring the bootstrapped version of the C3NET network for each security. The number of inferred relationships across different securities ranges from to , while the total number of detected relationships in the ensemble is . Subsequently, for the ensemble of 100 security networks, we apply the same aggregation procedure as before. From the observed number of links in the ensemble and total possible number of links in a fully connected ensemble of this size, we estimate the probability of random links to be . Then, for a significance level of , we apply Bonferroni adjustment in tests and end up with a threshold of link occurrences in the ensemble, which leaves links in the aggregated network. Households represent the majority of groups with relationships over multiple securities. Furthermore, two of the most central nodes are mature and middle-aged household investor groups from Helsinki, with 52 and 38 relationships, respectively. The two most central nonhousehold investor groups are financial and nonfinancial companies in Helsinki, both with 13 relationships to other investor groups.
Two level aggregation
In this section, we leverage the previously introduced timewise and securitywise network aggregation procedures. Our goal is to produce a single network that can summarize the trading relationship information inferred for 100 securities over multiple time windows of various sizes. We investigate networks inferred over seven different nonoverlapping time windows, that is, 1, 2, 3, 4, 6, 12, and 24 months. Each security respectively has 72, 36, 24, 18, 12, 6, and 3 such networks, covering the whole 6-year period under analysis. Our starting point is a set of network ensembles inferred using the bootstrapped C3NET algorithm for 100 securities for all analyzed time window sizes. For instance, in the case of the 6-month window, we have 12 networks for each of the 100 securities, that is, an ensemble of 1,200 networks (corresponding to the networks in the main matrix of Figure 1). We must also keep in mind that the aggregated network will differ depending on the order of information aggregation, that is, on whether timewise or securitywise relationship information is summarized first. Accordingly, we describe the results of using both approaches and compare the final results. By performing the timewise aggregation first, we end up with a 100-network ensemble, with one network for each security. Links in each network represent the most important reoccurring relationships in corresponding securities. Conversely, if we start with securitywise aggregation, we end up with an ensemble of 12 networks. Each of the 12 networks contains the most important relationships that are present over multiple securities, but this might be a different set of securities in each period. Next, for the two ensembles stemming from the first aggregation procedure, we perform the final aggregation, yielding a network summarizing the relationships of investor groups in their trading behavior over 100 securities for the whole period under analysis.
However, the two final networks are not the same (see the networks in Figure 5). Table 2 compares the links and nodes in the final networks for various window sizes. For each of the seven time windows, we obtain two networks, depending on the order of the aggregation procedure; thus, together with the securitywise aggregated network for the whole period from the previous section, we compare 15 networks. Figure 6 summarizes the node degrees in all 15 final networks. Node degree sequences are highly correlated, with Spearman's correlation ranging from to . Similar to the whole period securitywise aggregated network, networks in Figure 5 identify mature and middle-aged household investor groups from Helsinki as the most central groups, while financial and nonfinancial company investor groups from Helsinki are the most central nonhousehold investor groups.
Window size (months)  Nodes: only in one network  Nodes: in both  Nodes: only in the other  Nodes: Jaccard  Links: only in one network  Links: in both  Links: only in the other  Links: Jaccard
1  0  36  4  0.9000  2  335  87  0.7900
2  0  42  2  0.9545  8  339  33  0.8921
3  1  43  1  0.9555  65  314  4  0.8198
4  2  45  0  0.9574  67  304  3  0.8128
6  3  46  0  0.9387  98  275  3  0.7313
12  5  47  0  0.9038  150  221  6  0.5862
24  6  44  1  0.8627  101  132  17  0.5280
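The Jaccard values in the table can be reproduced from the three counts in each group of columns. The Python sketch below assumes that the counts are, in order, the elements found only in one final network, the elements shared by both, and the elements found only in the other; this reading is an assumption, although it agrees with every reported value up to rounding. The function name is illustrative.

```python
def jaccard_from_counts(only_first, shared, only_second):
    """Jaccard index |A & B| / |A | B| from the sizes of the three disjoint parts."""
    return shared / (only_first + shared + only_second)

# window size (months) -> (node counts, reported node Jaccard, link counts, reported link Jaccard)
table = {
    1: ((0, 36, 4), 0.9000, (2, 335, 87), 0.7900),
    2: ((0, 42, 2), 0.9545, (8, 339, 33), 0.8921),
    3: ((1, 43, 1), 0.9555, (65, 314, 4), 0.8198),
    4: ((2, 45, 0), 0.9574, (67, 304, 3), 0.8128),
    6: ((3, 46, 0), 0.9387, (98, 275, 3), 0.7313),
    12: ((5, 47, 0), 0.9038, (150, 221, 6), 0.5862),
    24: ((6, 44, 1), 0.8627, (101, 132, 17), 0.5280),
}

# Largest absolute gap between recomputed and reported values.
max_gap = max(
    max(abs(jaccard_from_counts(*nodes) - j_nodes),
        abs(jaccard_from_counts(*edges) - j_edges))
    for nodes, j_nodes, edges, j_edges in table.values()
)
```

Node sets stay similar for all window sizes, whereas the link overlap between the two aggregation orders drops markedly for the 12- and 24-month windows.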
Discussion
In this paper, we proposed several approaches to help circumvent the most common obstacles in investor network analysis. First, we extended the bootstrap aggregation approach to ensembles containing information about trading behavior in different securities and/or time windows. The advantage of the aggregation approach is that no arbitrary link-filtering threshold is needed. Instead, the algorithm adjusts this itself depending on a chosen significance level and the properties of the investigated network ensemble. We found that timewise aggregated networks and networks inferred over the whole period significantly differed in the number of relationships inferred and the number of nodes having relationships. However, a similar set of nodes was identified as central in both cases. Securitywise aggregation revealed the investor category trading network not over one but over multiple securities. It is important to remember that the two-layer aggregation yielded different network descriptions depending on the order of information aggregation. It is worth mentioning that the aggregation of timewise and securitywise trading relationships could be performed in a single step, in which case there would be no confusion about the aggregation order. However, in that case, the meaning of network relationships would be obscure. We would be neither certain that investor categories were similarly trading over a significant number of the same securities nor that they were trading similarly over a significantly large number of the same periods; further, the definition of a single step aggregation would be somewhere in between, in some cases perhaps failing to meet both criteria.
Second, to the best of our knowledge, we are the first to propose the use of lowest-resolution (that is, transaction-level) bootstrapping as the means for statistically validating investor network relationships. Transaction bootstrapping also enables network inference over shorter time windows. Networks inferred at different time points can provide insight into the dynamics of these relationships. Most of the research has been focused on inferring static or time-invariant investor networks, and much less has been done to infer the dynamic relationships that are constantly evolving over time. Indeed, over the course of time, multiple interchanging processes may determine the behavior of investor categories, and such processes can be dynamic and stochastic. Therefore, investor behavior at each time point is dependent on these processes, and investor networks can undergo significant topological changes, rather than being invariant over time. Using our proposed network aggregation procedure, these network snapshots can be summarized into a single static network that covers the most important information for the whole period. Transaction bootstrapping is a viable strategy for network inference because it not only allows for assigning statistical significance to link existence but also enhances the robustness of the relationships to specific realizations of the trading outcome.
Finally, we introduced investor grouping into categories based on their attributes. This approach allows for performing any analysis by discarding less information. Investor category networks based on investor attributes have not been investigated previously in the literature. The vulnerability of the investor categorization approach is that the ensuing analysis is ultimately dependent on the category definition. In practice, it is possible that the investor transaction data sets would not contain metadata about the investors, and therefore, it would be impossible to assign investors to categories or the arising categories would be economically meaningless or difficult to interpret. In that case, one can revert to the analysis of individual investors and use the multilayer aggregation procedure without a loss of generality.
In the results section, we observed that Helsinki households represented the most connected investor category, and this category, thus, has a central role in financial markets in terms of trading behavior. The central role of household investors has been identified in the literature [41, 42, 43, 44]. For example, according to ref. [kaniel2008individual], households are contrarian traders (i.e., they sell when stock prices have increased and buy when prices have decreased), leading them to serve as liquidity providers to institutional investors. The contrarian nature of households is also identified in ref. [grinblatt2000investment] using the same data set that we used in this paper.
This method can be applied in different, even nonfinancial fields, in order to extract the most important reoccurring relationships in multilayer networks.
Methods
Dataset. We use an investor-level transaction data set covering the period from 2004-01-01 to 2009-12-31 of all trades executed on the Helsinki Stock Exchange. The data set is composed of transactions belonging to investors trading in 100 securities over 6 years. The analyzed security list includes the top 100 securities ranked by number of investors and transactions. Each investor in the data set is assigned to a sector group: Financial and Insurance, Government, NonFinancial, and NonProfit companies, as well as Foreign investors, and Finnish Households. Households are further divided into five age groups: UnderAged, Young, MiddleAged, Mature, and Retired. Age attributes are derived for each transaction separately, taking into account the difference between the transaction date and the year of birth of the corresponding investor. All of these groups are also distributed geographically by assigning investor postal codes to 11 regions using Table 3. Together, these assignment rules, shown in Figure 7, form 110 investor categories.
The data that support the findings of this study are available from Euroclear Finland Ltd. However, the data are not available from the authors because of the nondisclosure agreement signed with the data provider.
Region  Postal code range 

Helsinki  [0, 3000) 
RestUusimaa  [3000, 11000) 
EasternTavastia  [11000, 20000) 
SouthWest  [20000, 30000) 
WesternTavastia  [30000, 40000) 
CentralFinland  [40000, 50000) 
SouthEast  [50000, 60000) 
Ostrobothnia  [60000, 70000) 
NorthernSavonia  [70000, 80000) 
EasternFinland  [80000, 90000) 
NorthernFinland  [90000, 100000) 
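The half-open postal code ranges above translate directly into a lookup. A minimal Python sketch follows; the function name and the handling of out-of-range input are assumptions for illustration, while the boundaries and region labels are taken from the table.

```python
from bisect import bisect_right

# Exclusive upper bounds of the half-open ranges [lower, upper) in the table.
UPPER_BOUNDS = [3000, 11000, 20000, 30000, 40000, 50000,
                60000, 70000, 80000, 90000, 100000]
REGIONS = ["Helsinki", "RestUusimaa", "EasternTavastia", "SouthWest",
           "WesternTavastia", "CentralFinland", "SouthEast", "Ostrobothnia",
           "NorthernSavonia", "EasternFinland", "NorthernFinland"]

def region_for_postal_code(code):
    """Map a postal code (int or digit string such as '00100') to its region."""
    value = int(code)
    if not 0 <= value < 100000:
        raise ValueError(f"postal code out of range: {code!r}")
    # bisect_right counts the upper bounds <= value, which is the region index.
    return REGIONS[bisect_right(UPPER_BOUNDS, value)]
```

For example, `region_for_postal_code("00100")` falls in the first range, and a code equal to a boundary such as 3000 belongs to the range that starts there.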
Transaction bootstrapping. For network inference, we perform $B$ bootstrap iterations. For each bootstrap iteration, we uniformly resample with replacement the whole transaction data set under investigation. Then, for each sampled transaction set, we aggregate daily transaction records for each category, resulting in a net traded volume matrix $V$, where $V = (v_{t,i})$ and $v_{t,i}$ is the net traded volume on day $t$ of investor group $i$.
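The resampling and aggregation steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: each transaction is taken to be a hypothetical `(day_index, category, signed_volume)` tuple, with buys positive and sells negative, and the function names are invented for the example.

```python
import random

def bootstrap_transactions(transactions, rng):
    """Uniformly resample the transaction list with replacement (same size)."""
    return rng.choices(transactions, k=len(transactions))

def net_volume_matrix(transactions, n_days, categories):
    """Sum signed volumes per (day, category); rows are days, columns categories."""
    column = {c: j for j, c in enumerate(categories)}
    matrix = [[0.0] * len(categories) for _ in range(n_days)]
    for day, category, signed_volume in transactions:
        matrix[day][column[category]] += signed_volume
    return matrix

# Toy data: two days, two categories (assumed sign convention: buy > 0, sell < 0).
transactions = [(0, "H", 10.0), (0, "H", -4.0), (0, "F", -6.0), (1, "F", 5.0)]
categories = ["H", "F"]

original_matrix = net_volume_matrix(transactions, 2, categories)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
resampled = bootstrap_transactions(transactions, rng)
resampled_matrix = net_volume_matrix(resampled, 2, categories)
```

Because resampling happens before daily aggregation, a single large or very active investor influences each bootstrap replicate only as often as its transactions happen to be drawn.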
Mutual information estimation. For simplicity, we assume that the joint distribution of net traded volumes is normal. Then we can calculate the MI analytically from Pearson's correlation coefficient $\rho$ using
$$I = -\frac{1}{2} \ln\left(1 - \rho^2\right) \qquad (1)$$
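For a bivariate normal distribution with correlation $\rho$, the MI has the closed form $I = -\frac{1}{2}\ln(1-\rho^2)$ (in nats). The Python sketch below computes this quantity from two net volume series; the function names are illustrative, and the choice of natural logarithm is an assumption about the units.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def gaussian_mi(rho):
    """MI (in nats) of a bivariate normal pair with correlation rho, |rho| < 1."""
    return -0.5 * math.log(1.0 - rho ** 2)

# Toy series for two investor groups.
rho = pearson([1, 2, 3, 4], [1, 3, 2, 4])
mi = gaussian_mi(rho)
```

The MI is zero for uncorrelated series and grows without bound as the absolute correlation approaches one, so ranking pairs by MI under this assumption is equivalent to ranking them by absolute correlation.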
Network inference. Using the net traded volume data $V$, we apply a network inference method. A specific requirement for such a method is that it is computationally efficient for handling a large bootstrap ensemble. For this reason, we use the C3NET [18] inference method. C3NET is intended to infer a significant maximum MI network. This algorithm comprises three basic steps. First, MI values are estimated for each investor category pair. Second, each MI value estimate is tested against a null hypothesis of vanishing MI. Finally, each investor group is allowed to keep a single link, which is the strongest statistically significant MI value. The resulting binary network has at most $N$ relationships in a system of $N$ nodes.
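The final selection step can be sketched in Python as follows. This is a simplified illustration of the maximum-MI selection, not the reference C3NET implementation: the significance test is reduced to a hypothetical scalar `threshold` on the MI value, and the function name is invented for the example.

```python
def max_mi_edges(mi, threshold):
    """For each node, keep its strongest link whose MI exceeds `threshold`.

    `mi` is a symmetric matrix (list of lists). Returns a set of undirected
    edges (i, j) with i < j, so there are at most len(mi) edges.
    """
    n = len(mi)
    edges = set()
    for i in range(n):
        # Candidate partners of node i that pass the (simplified) significance test.
        candidates = [(mi[i][j], j) for j in range(n)
                      if j != i and mi[i][j] > threshold]
        if candidates:
            _, best = max(candidates)
            edges.add((min(i, best), max(i, best)))
    return edges

mi_matrix = [
    [0.0, 0.5, 0.1],
    [0.5, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
edges_loose = max_mi_edges(mi_matrix, threshold=0.2)
edges_strict = max_mi_edges(mi_matrix, threshold=0.4)
```

Two nodes that select each other contribute a single undirected edge, which is why the network can contain fewer edges than nodes.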
Null distribution of mutual information values. In order to test the statistical significance of the MI estimates, we need to procure an appropriate null distribution. Therefore, we test the following null hypothesis:
$H_0$: The MI between investor groups $i$ and $j$ is zero.
For each transaction, we resample the dates, traded volumes, and categories to which those transactions are assigned, eliminating any relationship between them. Then we aggregate daily transaction records for each category, resulting in a net traded volume matrix under the null hypothesis. We repeat this multiple times, and each time we estimate MI values for pairs of investor groups. These values yield an estimate of the null distribution, which we use to identify statistically significant MI values.
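A possible way to build such a null distribution is sketched below in Python. Several choices here are assumptions made for illustration and are not stated in the text: transactions are tuples `(day, category, signed_volume)`, the resampling is implemented as independent permutations of the three fields, the MI is computed with the Gaussian closed form, and the empirical p-value uses the common add-one correction. All function names are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def gaussian_mi(rho):
    return math.inf if abs(rho) >= 1.0 else -0.5 * math.log(1.0 - rho * rho)


def shuffle_fields(transactions, rng):
    """Permute days, categories and volumes independently of each other."""
    days, cats, vols = (list(col) for col in zip(*transactions))
    rng.shuffle(days)
    rng.shuffle(cats)
    rng.shuffle(vols)
    return list(zip(days, cats, vols))


def null_mi_values(transactions, days, categories, n_rounds, rng):
    """Pool MI values of all category pairs over n_rounds shuffled data sets."""
    values = []
    for _ in range(n_rounds):
        totals = {}
        for day, cat, vol in shuffle_fields(transactions, rng):
            totals[(day, cat)] = totals.get((day, cat), 0.0) + vol
        series = {c: [totals.get((d, c), 0.0) for d in days] for c in categories}
        for a, b in combinations(categories, 2):
            values.append(gaussian_mi(pearson(series[a], series[b])))
    return values


def empirical_p_value(observed, null_values):
    """Share of null MI values at least as large as the observed one."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)
```

An observed MI estimate is then declared significant when its empirical p-value falls below the chosen (corrected) significance level.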
Multiple hypothesis test correction (MTC). In order to control the familywise error rate, we use the strict Bonferroni MTC procedure. We apply MTC at each stage of aggregation when testing edge occurrences in the ensembles. Following the Bonferroni procedure, we adjust the chosen significance level $\alpha$ by the number of tests $m$ we perform: $\alpha_{\mathrm{adj}} = \alpha / m$.
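As a small worked example of the Bonferroni adjustment, the Python sketch below divides the significance level by the number of tests. Taking the number of tests to be the number of unordered investor category pairs is an assumption made here for illustration only; the function names are hypothetical.

```python
def n_pairs(n_nodes):
    """Number of unordered node pairs, i.e. candidate undirected edges."""
    return n_nodes * (n_nodes - 1) // 2


def bonferroni_level(alpha, n_tests):
    """Per-test significance level that keeps the familywise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("at least one test is required")
    return alpha / n_tests


# With 110 investor categories there are 5995 candidate pairs.
adjusted = bonferroni_level(0.05, n_pairs(110))
```

Under this assumption, a nominal level of 0.05 shrinks to roughly 8.3e-6 per test, which shows how strict the correction becomes as the number of categories grows.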
Aggregation. Following ref. [19], the aggregation procedure takes an ensemble of $K$ independent undirected binary networks $G_1, \dots, G_K$ as input and gives a single network as output. First, the network ensemble is aggregated into a weighted network $G_w$. The edge weights in the weighted network correspond to the number of occurrences of each edge in the ensemble. For example, the weight of an edge between investor groups $i$ and $j$ is defined as the number of networks $G_k$ that contain this edge, denoted $w_{ij}$, where $w_{ij}$ may assume integer values between $0$ and $K$. Next, we conduct a statistical hypothesis test to remove the need for an arbitrary link threshold parameter:
$H_0$: The number of networks in the ensemble with an edge between $i$ and $j$ is less than $k_0(\alpha)$, where $\alpha$ is the significance level.
If we define $p_c$ as the probability of two investor groups being randomly connected, then $w_{ij}$ follows a binomial distribution, $w_{ij} \sim \mathrm{Bin}(K, p_c)$. Then $p_{ij}$ is the probability of observing by chance the link between investor groups $i$ and $j$ more than $w_{ij}$ times. The nodes in the final network are connected if $p_{ij} < \alpha$, where $\alpha$ is the significance level.
We estimate the probability $p_c$ for two groups to be connected by chance in a network ensemble as the ratio of the actual number of edges in the ensemble to the number of all possible links in the ensemble, $p_c = \sum_{i<j} w_{ij} \big/ \left( K\,N(N-1)/2 \right)$, where $N$ is the number of investor groups.
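The aggregation step can be sketched in Python as follows. This is an illustration rather than the code used in the study: each network is assumed to be a set of edges `(i, j)` with `i < j`, the p-value is taken as the conventional upper tail (the edge occurring at least as often as observed), `alpha` is assumed to be the significance level after any multiple testing correction, and the function names are hypothetical.

```python
import math
from collections import Counter


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def aggregate(ensemble, n_nodes, alpha):
    """Collapse an ensemble of binary networks into one validated network."""
    # Edge weight = number of ensemble networks that contain the edge.
    weights = Counter(edge for network in ensemble for edge in network)
    if not weights:
        return set()
    n_networks = len(ensemble)
    # Chance connection probability: observed edges / all possible edges.
    possible = n_networks * n_nodes * (n_nodes - 1) // 2
    p_chance = sum(weights.values()) / possible
    return {edge for edge, w in weights.items()
            if binomial_tail(w, n_networks, p_chance) < alpha}


# Toy ensemble: edge (0, 1) appears in all 20 networks, edge (2, 3) only once.
toy = [{(0, 1)} for _ in range(19)] + [{(0, 1), (2, 3)}]
validated = aggregate(toy, n_nodes=5, alpha=0.05)
```

In the toy ensemble only the persistent edge survives, since a single occurrence of an edge is well within what random connections would produce.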
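Multilayer aggregation procedure. For a set of $S$ securities and a number $T$ of inference periods, we infer $S \times T$ networks $G_{s,t}$, with $s = 1, \dots, S$ and $t = 1, \dots, T$. Each network represents significant relationships between investor groups for a given security in a given period. If we then apply the network aggregation procedure, denoted $\mathcal{A}$, over securities for each period $t$,
$$G_{\cdot,t} = \mathcal{A}\left(G_{1,t}, \dots, G_{S,t}\right),$$
we end up with an ensemble of $T$ networks $G_{\cdot,1}, \dots, G_{\cdot,T}$. Each of these networks represents significant relationships between investor groups that occur over multiple securities during period $t$. Similarly, if we apply the network aggregation procedure over time for each security $s$,
$$G_{s,\cdot} = \mathcal{A}\left(G_{s,1}, \dots, G_{s,T}\right),$$
we end up with an ensemble of $S$ networks $G_{1,\cdot}, \dots, G_{S,\cdot}$, where each network represents the most important relationships between investor groups that recur over time in security $s$.
Next, we aggregate the second layer of information by applying the procedure to each of the two ensembles in turn:
$$G^{(S \to T)} = \mathcal{A}\left(G_{\cdot,1}, \dots, G_{\cdot,T}\right), \qquad G^{(T \to S)} = \mathcal{A}\left(G_{1,\cdot}, \dots, G_{S,\cdot}\right).$$
Both aggregation sequences lead to unique networks.
$G^{(S \to T)}$ and $G^{(T \to S)}$ correspond, respectively, to the illustrated networks (3) and (5) in Figure 1.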
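The two aggregation sequences can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the inferred networks are stored in a dictionary keyed by `(security, period)`, each network is a set of edges `(i, j)` with `i < j`, the single-layer aggregation uses the conventional upper binomial tail, and all names are hypothetical.

```python
import math
from collections import Counter


def aggregate(ensemble, n_nodes, alpha):
    """Binomial-test aggregation of an ensemble of edge sets into one edge set."""
    weights = Counter(edge for network in ensemble for edge in network)
    if not weights:
        return set()
    k = len(ensemble)
    p = sum(weights.values()) / (k * n_nodes * (n_nodes - 1) // 2)

    def tail(w):  # P(X >= w) for X ~ Binomial(k, p)
        return sum(math.comb(k, i) * p ** i * (1 - p) ** (k - i)
                   for i in range(w, k + 1))

    return {edge for edge, w in weights.items() if tail(w) < alpha}


def multilayer_aggregate(networks, n_nodes, alpha):
    """Return (securities-then-time, time-then-securities) aggregated networks."""
    securities = sorted({s for s, _ in networks})
    periods = sorted({t for _, t in networks})
    # First layer over securities, second layer over time.
    per_period = [aggregate([networks[(s, t)] for s in securities], n_nodes, alpha)
                  for t in periods]
    securities_then_time = aggregate(per_period, n_nodes, alpha)
    # First layer over time, second layer over securities.
    per_security = [aggregate([networks[(s, t)] for t in periods], n_nodes, alpha)
                    for s in securities]
    time_then_securities = aggregate(per_security, n_nodes, alpha)
    return securities_then_time, time_then_securities


# Toy input: 4 securities x 4 periods, every network holds the single edge (0, 1).
toy = {(s, t): {(0, 1)} for s in range(4) for t in range(4)}
by_time_last, by_security_last = multilayer_aggregate(toy, n_nodes=5, alpha=0.05)
```

In general the two orders need not give the same network, because the intermediate ensembles differ in size and composition; the toy input is deliberately uniform so that both orders retain the same edge.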
Acknowledgements
The research project leading to these results received funding from the EU Research and Innovation Programme Horizon 2020 under grant agreement No. 675044 (BigDataFinance).
Competing financial interests
The authors declare no competing financial interests.
Contributions
All authors designed the experiment and wrote and reviewed the main manuscript text. K.B. prepared all figures (Figure 1 together with F.E.) and conducted the empirical analysis.
References
 Battiston, S. et al. Complexity theory and financial regulation. Science 351, 818–819 (2016).
 Newman, M. E. J. The structure and function of complex networks. SIAM Review 45, 167–256 (2003).
 Dehmer, M. & Emmert-Streib, F. (eds.) Analysis of Complex Networks: From Biology to Linguistics (Wiley-VCH, Weinheim, 2009).
 Cimini, G., Squartini, T., Garlaschelli, D. & Gabrielli, A. Systemic risk analysis on reconstructed economic and financial networks. Scientific reports 5, 15758 (2015).
 Barucca, P. et al. Network valuation in financial systems (2016).
 Haldane, A. G. & May, R. M. Systemic risk in banking ecosystems. Nature 469, 351 (2011).
 Cont, R., Moussa, A. et al. Network structure and systemic risk in banking systems (2010).
 Tumminello, M., Lillo, F., Piilo, J. & Mantegna, R. N. Identification of clusters of investors from their real trading activity in a financial market. New Journal of Physics 14, 013041 (2012).
 Ozsoylev, H. N., Walden, J., Yavuz, M. D. & Bildik, R. Investor networks in the stock market. The Review of Financial Studies 27, 1323–1366 (2013).
 Gualdi, S., Cimini, G., Primicerio, K., Di Clemente, R. & Challet, D. Statistically validated network of portfolio overlaps and systemic risk. Scientific reports 6 (2016).
 Bastian, M., Heymann, S., Jacomy, M. et al. Gephi: an open source software for exploring and manipulating networks. Icwsm 8, 361–362 (2009).
 Jacomy, M., Venturini, T., Heymann, S. & Bastian, M. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PloS one 9, e98679 (2014).
 Ranganathan, S., Kivelä, M. & Kanniainen, J. Dynamics of investor spanning trees around dotcom bubble. arXiv preprint arXiv:1708.04430 (2017).
 Emmert-Streib, F. & Dehmer, M. Influence of the time scale on the construction of financial networks. PloS one 5, e12884 (2010).
 Bernardo, J. et al. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian statistics 7, 733–742 (2003).
 Grinblatt, M. & Keloharju, M. The investment behavior and performance of various investor types: a study of Finland’s unique data set. Journal of financial economics 55, 43–67 (2000).
 Berkman, H., Koch, P. D. & Westerholm, P. J. Informed trading through the accounts of children. The Journal of Finance 69, 363–404 (2014).
 Altay, G. & Emmert-Streib, F. Inferring the conservative causal core of gene regulatory networks. BMC Systems Biology 4, 132 (2010).
 de Matos Simoes, R. & Emmert-Streib, F. Bagging statistical network inference from large-scale gene expression data. PLoS One 7, e33624 (2012).
 Mantegna, R. N. Hierarchical structure in financial markets. The European Physical Journal B-Condensed Matter and Complex Systems 11, 193–197 (1999).
 Tumminello, M., Aste, T., Di Matteo, T. & Mantegna, R. N. A tool for filtering information in complex systems. Proceedings of the National Academy of Sciences of the United States of America 102, 10421–10426 (2005).
 Peng, J., Wang, P., Zhou, N. & Zhu, J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104, 735–746 (2009).
 Boginski, V., Butenko, S. & Pardalos, P. M. Statistical analysis of financial networks. Computational statistics & data analysis 48, 431–443 (2005).
 Onnela, J.-P., Kaski, K. & Kertész, J. Clustering and information in correlation based financial networks. The European Physical Journal B-Condensed Matter and Complex Systems 38, 353–362 (2004).
 Breiman, L. Bagging predictors. Machine learning 24, 123–140 (1996).
 Kivelä, M. et al. Multilayer networks. Journal of Complex Networks 2, 203–271 (2014).
 De Domenico, M. et al. Mathematical formulation of multilayer networks. Physical Review X 3, 041022 (2013).
 Boccaletti, S. et al. The structure and dynamics of multilayer networks. Physics Reports 544, 1–122 (2014).
 Bargigli, L., Di Iasio, G., Infante, L., Lillo, F. & Pierobon, F. The multiplex structure of interbank networks. Quantitative Finance 15, 673–691 (2015).
 Musmeci, N., Nicosia, V., Aste, T., Di Matteo, T. & Latora, V. The multiplex dependency structure of financial markets. arXiv preprint arXiv:1606.04872 (2016).
 Zhong, R., Allen, J. D., Xiao, G. & Xie, Y. Ensemble-based network aggregation improves the accuracy of gene network reconstruction. PloS one 9, e106319 (2014).
 Breitling, R., Armengaud, P., Amtmann, A. & Herzyk, P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS letters 573, 83–92 (2004).
 Polikar, R. Ensemble based systems in decision making. IEEE Circuits and systems magazine 6, 21–45 (2006).
 De Domenico, M., Nicosia, V., Arenas, A. & Latora, V. Structural reducibility of multilayer networks. Nature communications 6, 6864 (2015).
 Scott, J. Social network analysis (Sage, 2017).
 Onnela, J.-P. et al. Structure and tie strengths in mobile communication networks. Proceedings of the national academy of sciences 104, 7332–7336 (2007).
 Newman, M. E., Forrest, S. & Balthrop, J. Email networks and the spread of computer viruses. Physical Review E 66, 035101 (2002).
 Isella, L. et al. What’s in a crowd? Analysis of face-to-face behavioral networks. Journal of theoretical biology 271, 166–180 (2011).
 Guimera, R., Mossa, S., Turtschi, A. & Amaral, L. N. The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles. Proceedings of the National Academy of Sciences 102, 7794–7799 (2005).
 Liu, X., Bollen, J., Nelson, M. L. & Van de Sompel, H. Co-authorship networks in the digital library research community. Information processing & management 41, 1462–1480 (2005).
 Grinblatt, M. & Keloharju, M. How distance, language, and culture influence stockholdings and trades. The Journal of Finance 56, 1053–1073 (2001).
 Grinblatt, M. & Keloharju, M. What makes investors trade? The Journal of Finance 56, 589–616 (2001).
 Kaniel, R., Liu, S., Saar, G. & Titman, S. Individual investor trading and return patterns around earnings announcements. The Journal of Finance 67, 639–680 (2012).
 Kaniel, R., Saar, G. & Titman, S. Individual investor trading and stock returns. The Journal of Finance 63, 273–310 (2008).