Fast Super-Paramagnetic Clustering
We map stock market interactions to spin models to recover their hierarchical structure using a simulated annealing based Super-Paramagnetic Clustering (SPC) algorithm. This is directly compared to a modified implementation of a maximum likelihood approach we call fast Super-Paramagnetic Clustering (f-SPC). The methods are first applied to standard toy test-case problems, and then to a data-set of 447 stocks traded on the New York Stock Exchange (NYSE) over 1249 days. The signal-to-noise ratio of stock market correlation matrices is briefly considered. Our results approximately recover clusters representative of standard economic sectors, as well as mixed clusters whose dynamics shed light on the adaptive nature of financial markets and raise concerns about the effectiveness of industry-based static financial market classification in the world of real-time data analytics. A key result is that we show that f-SPC maximum likelihood solutions converge to ones found within the Super-Paramagnetic phase where the entropy is maximal, and those solutions are qualitatively better for high-dimensionality data-sets.
PACS numbers: 05.10.Ln, 75.10.Nr, 89.65.Gh
We consider the problem of unsupervised statistical learning for feature selection and classification of financial time-series data from similarity measures that can be represented as correlation matrices. Concretely, we consider Potts model Wu (1982) based Blatt et al. (1996, 1997) methods optimized for performance Giada and Marsili (2001); Marsili (2002); Hendricks et al. (2016a, b) using a Maximum Likelihood Estimation (MLE) approach based on the ground-state Noh Ansatz Noh (2000) compared to the finite-temperature approach using a cooling schedule to select configurations based on the susceptibility Blatt et al. (1996, 1997).
We compare the cluster configurations from the fast clustering algorithms based on the ground state configurations Hendricks et al. (2016a, b) (Sec. A.2) with those based on finite temperature Simulated Annealing (SA) based Monte-Carlo Markov-Chain (MCMC) Swendsen and Wang (1987); Hoshen and Kopelman (1976) (see Sec. A.1) to generate the full dendrogram of cluster configurations Blatt et al. (1996, 1997). It was shown in Bonanno et al. (2001) that the clustering structure of financial assets is time-horizon dependent, and, given a translation of correlations into a Euclidean distance, they can be represented using Minimal Spanning Trees (MST) Kruskal (1956).
The Super-Paramagnetic Clustering (SPC) method originally developed in the early 1990s is a universal data clustering method Blatt et al. (1996) and has acquired a certain popularity for its implementation of the Maximum Entropy Principle (MEP) Jaynes (2003) where no assumptions are made about the distributions found in the data, and the number of clusters is revealed rather than predefined. SPC can be used in any environment as long as the features are embedded in an appropriate similarity metric. The method has been applied to chemical data using a sequential method and the Tanimoto similarity measure Ott et al. (2004), the detection and classification of spiking activity in neuronal tissue in Quiroga et al. (2004), the identification of regions of the brain with shared functionality in Stanberry et al. (2007), yeast genes profiles in Getz et al. (2000), histone modification data Li (2011), and image segmentation in Abramov et al. (2010).
Marsili and Giada Giada and Marsili (2001) developed an efficient maximum likelihood clustering method for high dimensional data using the same spin-model inspired approach used in the SPC method. However, given the ill-posed nature of clustering, they chose to evaluate the likelihood model for robustness relative to different optimization methods in Giada and Marsili (2002). The method was then applied to the detection of clusters of assets, and of financial market states, in Marsili (2002) to uncover collective behavior of agents in a complex system. Hendricks et al. created a GPU based Parallel Genetic Algorithm (PGA) implementation to maximize the likelihood in near real-time for the clustering of market states for quantitative trading applications Hendricks et al. (2016a).
In this work we explore the relationship between the SPC method of Domany-Wiseman-Blatt Blatt et al. (1996) and that of Giada-Marsili Giada and Marsili (2001). A key finding here is that we are able to confirm that the likelihood method is Super-Paramagnetic and conforms well to the entropy method when dimensionality is sufficiently high (see Sec. V). We are also able to further optimize the algorithm based on the PGA implementation Hendricks et al. (2016a). More importantly, however, we are able to demonstrate the variety of clustering problems that SPC can quickly and easily handle, with the advantage of leveraging a mature and well-understood foundational theoretical framework from statistical mechanics that has many, if not most, viable alternative algorithms as special cases. We are of the view that SPC type models grounded in the maximum entropy principle, and within the general framework of energy-based machine learning, offer a variety of research and development opportunities for building better and faster unsupervised learning algorithms.
Towards building this argument the paper is organized as follows: Section II gives a brief overview of Erdős–Rényi random graphs (see Sec. II.1) and their connection to the Potts models as special cases. We describe the inhomogeneous Potts model as a data-clustering engine (see Sec. II.2). Section II.3 covers the implementations: the SPC algorithm (Sec. II.3.1) followed by Marsili's maximum likelihood method (Sec. II.3.2). We then proceed to discuss validation procedures linking maximum likelihood methods to Super-Paramagnetic Clustering in Sec. II.4. Section III covers our choice of similarity metric (Sec. III.1), data pre-processing (Sec. III.2), and time-series noise filtering (Sec. III.3). Section IV provides toy test cases and stock market applications. Section V gives a summary analysis of the results and their implications, and finally Section VI concludes and mentions potential directions for further research on similar topics.
II Potts Clustering
II.1 Random Graphs
A graph is a mathematical model which formalizes pairwise relationships between objects West et al. (2001). Graphs are popular in complexity sciences because they provide a framework to model large systems of interacting components by representing the system directly through the pair-wise relationships between the components. The field has seen the rise of many models, each with their own assumptions and various nuances; almost all are of the generative form based on the premise of bottom-up causation. One feature that is useful in our context is that one can observe certain types of emergent dynamics, i.e. "phase transitions" BRUSH (1967).
A general class of models called the Random Cluster model 111See Grimmett Grimmett (2006) and references therein. was developed by Fortuin-Kasteleyn in 1972 Fortuin and Kasteleyn (1972). It is a "random" graph generated by a probability distribution

P(\omega) = \frac{1}{Z}\, q^{C(\omega)} \prod_{\langle i,j \rangle} p_{ij}^{\,\omega_{ij}} \left(1 - p_{ij}\right)^{1-\omega_{ij}}

where $\omega$ is a given edge configuration (or adjacency matrix of the graph), $C(\omega)$ is the number of clusters given $\omega$, $p_{ij}$ is the probability of nodes $i$ and $j$ being connected, and $Z$ is the normalization constant (also called the partition function in statistical mechanics Sokal (2005)). The adjacency matrix is linked to the probabilities by picking a value $r \in [0,1]$ such that $\omega_{ij} = 1$ if $p_{ij} > r$ or $\omega_{ij} = 0$ otherwise. $P(\omega)$ is essentially the probability of the graph being connected as given by the adjacency matrix $\omega$.
If we now set $q = 1$, the random cluster model reduces to an Erdős–Rényi random graph Erdos and Rényi (1960). Given the existence of only one class on the graph, bonds are linked independently of their respective states (i.e. they all have the same state) with equal probability $p$, with $N$ the number of nodes on the graph Erdos and Rényi (1960). An important generalization of this idea is the Barabási-Albert model Barabási and Albert (1999); this model works slightly differently: it starts with a low number $m_0$ of connected nodes, adds new nodes one at a time, each new node is able to connect to $m \le m_0$ nodes, and every time a node is connected its degree $k_i$ increases. The probability of a new node connecting to node $i$ is $\Pi(k_i) = k_i / \sum_j k_j$. This means that as a node makes connections it becomes "popular" and succeeding nodes have a higher likelihood of connecting to that same node: this is the principle of "preferential attachment", which hopes to explain how some social networks are formed. These models are all based on a generative model that builds on microscopic causal relationships from the system components to the bulk.
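As a minimal sketch of preferential attachment (function name and degree-weighted sampling scheme are our own, not from the references above), growth of such a graph can be simulated as:

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow a preferential-attachment graph: each new node attaches to m
    existing nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # start from a small fully connected core of m + 1 nodes
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # degree-weighted sampling via a list of repeated edge endpoints
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))  # Pi(k_i) = k_i / sum_j k_j
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges
```

High-degree nodes occur more often in `endpoints`, which realizes the degree-proportional attachment probability without explicit normalization.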
The Fortuin-Kasteleyn random cluster model is closely related to the Potts model via its distribution

P(\sigma \mid \omega) = q^{-C(\omega)} \prod_{\langle i,j \rangle\,:\, \omega_{ij} = 1} \delta_{\sigma_i, \sigma_j}

which is the conditional probability of the spin configuration $\sigma$ given the edge configuration $\omega$ Edwards and Sokal (1988). The marginal probability is recovered by summing the joint distribution over all spin configurations. The major difference the Potts model brings is entropy maximization, which assumes an exponential Boltzmann distribution of edge connections with $p_{ij} = 1 - e^{-J_{ij}/T}$ as the pairwise probabilities. The strength $J_{ij}$ captures the closeness between nodes and, with the cluster memberships, defines the topology of the graph. It is a central variable of the model. This is similar to the Bianconi-Barabási model Bianconi and Barabási (2001), which introduces a fitness $\eta_i$ that plays a related role as an add-on to the Barabási-Albert model Barabási and Albert (1999).
II.2 The Potts Model
The Ising model BRUSH (1967) simulates the existence of phase transitions in large systems of interacting particles. The model represents two-state spins interacting on an n-dimensional lattice. Ising's PhD Thesis BRUSH (1967) solved the 1D problem, which showed no phase transition, while Onsager provided an analytical solution using a transfer-matrix method Onsager (1944) for the 2D case. If we consider observations in our data sets as nodes with edges which link nodes together, it becomes natural to consider the data-set in the context of a Potts model Wu (1982). An edge is active or inactive with probability dependent on the distance between two nodes. The collection of nodes and edges forms the graph which is navigated for clustering. Every node can be assigned, for example, a +1 or -1 spin (for the Ising model). Interactions are simulated by randomly changing the spin values in the graph, and new configurations are accepted or rejected using the Swendsen-Wang MCMC algorithm at every step Swendsen and Wang (1987).
The Potts model Wu (1982) is a generalization of the Ising model BRUSH (1967) allowing spins to take $q$ values instead of 2. The parameter $q$ can be compared to the value $K$ in K-means used to fix the number of clusters. Here, however, $q$ is only the maximum number of classes: it must be chosen big enough to avoid clusters being forcefully merged together. The only inconvenience of a relatively high $q$ is the additional computational cost needed to perform the statistics after the system reaches thermal equilibrium.
The model is governed by a Hamiltonian energy equation Wu (1982) 222The more general Potts Hamiltonian can be contrasted with one of its special cases in the form of the energy of a Boltzmann machine with bias $b$ and weight matrix $W$ for features $v$: $E(v) = -b^{\top}v - \tfrac{1}{2}v^{\top}Wv$, with partition function $Z = \sum_{v} e^{-E(v)}$ Hinton and Salakhutdinov (2006); Osogami (2017); Ackley et al. (1985); Rumelhart and McClelland (1987).

H(S) = \sum_{\langle i,j \rangle} J_{ij} \left(1 - \delta_{s_i, s_j}\right)

with $S$ the spin vector assigned to our data, spins $s_i \in \{1, \ldots, q\}$, and $N$ nodes. The Kronecker delta $\delta_{s_i,s_j}$ is 1 for equal spins and 0 otherwise. For data embedded in a metric space the Euclidean distance $d_{ij}$ is computed between two nodes.
$d_{ij}$ is fed to the strength function $J_{ij}$ which, in turn, measures similarity. Many models for the strength exist, but their central feature is that they must decrease with distance. This is typically achieved with a function of the type $e^{-d_{ij}^2}$ or a power law, as seen in Blatt et al. (1996):

J_{ij} = \frac{1}{\hat{K}} \exp\left(-\frac{d_{ij}^2}{2 a^2}\right)

where $\hat{K}$ is the average number of neighbors per node, and $a$ is a local length scale: the average of all nearest-neighbor distances.
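The strength computation above can be sketched as follows (a minimal illustration on a plain distance matrix; the function name is ours):

```python
import math

def interaction_strengths(dist, k_hat, a):
    """Gaussian decay of the pairwise interaction strength with distance,
    J_ij = (1/K_hat) * exp(-d_ij^2 / (2 a^2)), as in Blatt et al. (1996)."""
    n = len(dist)
    return [[(1.0 / k_hat) * math.exp(-dist[i][j] ** 2 / (2.0 * a ** 2))
             for j in range(n)] for i in range(n)]
```

Pairs at distances much larger than the local scale `a` receive exponentially small couplings, so distant nodes barely interact.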
There are alternative choices for the local characteristic length scale $a$ Blatt et al. (1996). We only report the results obtained with the previous definition, and note that the adjustments to $a$ are problem dependent: higher values of $a$ ensure the first phase of the simulation is ferromagnetic, while lower values start the simulation in the Super-Paramagnetic (SP) phase.
The objective is to compute averages of thermodynamic quantities after simulating the system at a given temperature for a set number of MCMC iterations until thermal equilibrium.
The first simulation serves to uncover the existence of a critical temperature $T_c$ at which a first transition occurs. At low temperatures all spins are strongly correlated and all have the same state: this is called spontaneous magnetization (ferromagnetic phase). At $T_c$ the single cluster breaks down into smaller ones (SP-phase). Furthermore, inside the temperature range where the SP-phase exists, a system can go through additional transitions: these reflect the different hierarchical structures present in the data. Finally, at a higher temperature $T_p$ we go through a final transition into complete disorder (paramagnetic phase): the energy is high, all clusters dissolve, and the spins become independent.
The magnetization of the system is given by

m(S) = \frac{q\, N_{\max}(S) - N}{(q - 1)\, N}

where $N_{\max}$ is the size of the largest cluster. This quantity, which ranges from 0 to 1, expresses how dominated the system is by the largest cluster. The order parameter of the system is the average magnetization $\langle m \rangle$, and its variance $\chi = \frac{N}{T}\left(\langle m^2 \rangle - \langle m \rangle^2\right)$ is called the susceptibility density. Both can be used to detect a phase transition: $\langle m \rangle$ dives down while $\chi$ peaks at every transition.
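The order parameter can be computed directly from the cluster sizes at each MCMC step; a minimal sketch (function name ours):

```python
def magnetization(cluster_sizes, q):
    """Potts order parameter m = (q * N_max - N) / ((q - 1) * N):
    1 when a single cluster dominates, ~0 in the disordered phase."""
    n = sum(cluster_sizes)
    n_max = max(cluster_sizes)
    return (q * n_max - n) / ((q - 1) * n)
```

With all spins in one cluster the value is exactly 1; with $q$ equally sized clusters it drops to 0.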
For a quantity $A$ the thermal average will be

\langle A \rangle = \sum_{S} A(S)\, P(S) \qquad (5)

Here each $S$ represents a configuration visited at a single MCMC step. If the number of MCMC steps $M$ is large enough, Eqn. (5) is equivalent to the arithmetic mean $\frac{1}{M}\sum_{i=1}^{M} A(S_i)$. The probability of the system being in a particular state $S$ (referring to the energy of the system $H(S)$) is:

P(S) = \frac{e^{-H(S)/T}}{Z} \qquad (6)

where $e^{-H(S)/T}$ is the Boltzmann factor, and $Z = \sum_{S} e^{-H(S)/T}$ is the partition function and the normalization constant of the Gibbs-Boltzmann distribution.
Numerically, we use a mean-field mode of the Hamiltonian in which the energies entering the Boltzmann factors are shifted by their average. The motivation is that high values of $H$ lead to Boltzmann factors numerically close to 0, and hence $Z \approx 0$, which by definition makes the computation of $P(S)$ impossible in floating point; the value of $H$ also impacts the temperature range explored.
II.2.1 Maximum Entropy
We briefly remind ourselves of the MEP Jaynes (2003). We define a statistical mechanical system as an ensemble of objects, each in their respective micro-states (spin values $s_i$), so that the resulting microscopic state of the entire system can be used to derive parameters which characterize the distributions for macroscopic variables of interest (here the internal energy $U$). We assume that at equilibrium thermodynamic systems obey conservation of energy, which sets the constraints of the system such that on average $\langle H \rangle = U$ is a constant, and then from Eqn. (5) it follows that:

U = \sum_{S} H(S)\, P(S)

We then consider the distribution representative of the energy of the system as the one which incorporates our constraints and assumes nothing else. This maximizes the Shannon entropy as defined by:

S = -\sum_{S} P(S) \ln P(S) \qquad (8)
II.3 Super-Paramagnetic Clustering (SPC)
II.3.1 A Maximum Entropy Method
We define a neighbor on a lattice to be a node located in the vicinity of another node, such that a node at position $(i,j)$ on a 2D lattice will have neighbors $(i \pm 1, j)$ and $(i, j \pm 1)$. A neighborhood generated using these rules is valid for a 2D lattice with a fixed $J$ for all nodes (the interaction strength is said to be homogeneous). This is the original method used in simulating Ising/Potts models of ferromagnets. As a generalization to the problem of data clustering, we will consider a neighborhood which emerges from the inhomogeneous interaction strength $J_{ij}$. Every neighborhood is a mini-graph, and their aggregation constitutes a graph whose topology is determined by the matrix $J$: for two nodes to be neighbors, each must be included in the other's K-nearest neighbors.
We first define the neighborhood of size K. This is implemented following the steps in Table (A) below.
The graph is traveled via nodes, and edges can be set active or inactive with probability

p_{ij} = \left(1 - e^{-J_{ij}/T}\right) \delta_{s_i, s_j}

The next array is called links. It is the adjacency matrix in which the activation status of edges is stored, such that $\text{links}_{ij} = 1$ if $\text{rand} < p_{ij}$, and 0 otherwise.
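A minimal sketch of this bond-activation step (names and data layout are our own):

```python
import math
import random

def activate_bonds(spins, J, T, seed=0):
    """Swendsen-Wang bond step: an edge between aligned spins freezes with
    probability p_ij = 1 - exp(-J_ij / T); edges between differing spins
    stay open."""
    rng = random.Random(seed)
    n = len(spins)
    links = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if spins[i] == spins[j]:
                p = 1.0 - math.exp(-J[i][j] / T)
                if rng.random() < p:
                    links[i][j] = links[j][i] = 1
    return links
```

Because the Kronecker delta multiplies the probability, bonds can only form inside domains of equal spin, which is what lets clusters flip coherently.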
The original Hoshen-Kopelman (HK) algorithm Hoshen and Kopelman (1976) is the standard for labeling clusters in many statistical mechanics problems. The 2D version is mostly restricted to two neighbors per node: $(i-1, j)$ and $(i, j-1)$. SPC (Sec. A.1) deals with problems where the number of neighbors is not fixed, and K can be large, so we apply the extension of HK to non-lattice environments found in Al-Futaisi and Patzek (2003).
The clusters are labeled using the extended HK algorithm (See Table B below).
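The extended HK labeling amounts to finding connected components on an arbitrary graph; a union-find sketch (ours, not the paper's code) is:

```python
def label_clusters(n, links):
    """Extended Hoshen-Kopelman style labeling of connected components on
    an arbitrary graph given by an n x n adjacency matrix `links`."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if links[i][j]:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj  # union the two components
    # relabel roots to consecutive cluster ids
    roots, labels = {}, []
    for i in range(n):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels
```

The path-halving `find` keeps the label trees flat, which is what makes HK-style labeling effectively linear in the number of edges.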
Once the graph is fully constructed, the next step is to implement the Swendsen-Wang MCMC algorithm (see Table (C) below).
Finally, the spin-spin correlation $G_{ij} = \langle \delta_{s_i, s_j} \rangle$, the average probability of two spins being in the same cluster, is computed in two steps (see Table (D)). The spin-spin correlation is probably the most important quantity, as it is used to build the final clusters. The threshold $\theta$, for which two nodes are members of the same cluster, is picked to be higher than $1/q$ but less than 1. The bounds on that range are explained by the distribution of $G_{ij}$, which peaks at those two values: they are respectively the peak inter- and intra-cluster correlations 333Uncorrelated nodes have $G_{ij} \approx 1/q$, and correlated nodes $G_{ij} \approx 1$.. It is typical to use $\theta = 0.5$, as it does not significantly affect the results in previous examples Blatt et al. (1996).
II.3.2 A Maximum-Likelihood Method
Based on an analysis of the spectral properties of stock market correlation matrices, Noh Noh (2000) makes the following statistical ansatz: let's assume an existing market hierarchy where the dynamics of individual stocks are dependent on clusters of similar stocks. This can be illustrated by a simple model as follows:

x_i = \eta_{s_i} + \epsilon_i

where $x_i$ are the stock's features, $\eta_{s_i}$ the cluster-related influence, and $\epsilon_i$ the node's specific effect.
In Giada and Marsili (2001) Giada and Marsili formally develop a Potts model using Noh's idea, and in Hendricks et al. (2016a) Hendricks, Gebbie, and Wilcox solved the optimization problem using a PGA for unsupervised learning for quantitative trading. Let's consider a group of $N$ observations, embedded in a space with dimensionality $D$ as the features, and, as with SPC (Sec. A.1), every observation is assigned a spin value $s_i$. One version of the ansatz models the observation features such that

x_i = g_{s_i}\, \eta_{s_i} + \sqrt{1 - g_{s_i}^2}\; \epsilon_i \qquad (11)

where $x_i$ is one feature, $g_{s_i}$ the intra-cluster coupling parameter 444The thermal average $\langle g_s \rangle$ can be used to reconstruct data-sets sharing identical statistical features of the original time-series by using Eqn. (11) Giada and Marsili (2001)., $\eta_{s_i}$ the cluster-related influence, and $\epsilon_i$ the observation's specific effect and measurement error. A covariance analysis yields additional terms such as the size $n_s$ of cluster $s$, and the intra-cluster correlation $c_s$ 555Here $n_s = \sum_i \delta_{s_i, s}$, and $c_s = \sum_{i,j} C_{ij}\, \delta_{s_i, s}\, \delta_{s_j, s}$ with $C$ the correlation matrix Giada and Marsili (2001); Hendricks et al. (2016a)..
We explicitly mention that $n_s \le c_s < n_s^2$ must be enforced: the lower bound is required because the likelihood is undefined for values of $c_s < n_s$, and the upper bound requires a strict inequality because Eqn. (13) is undefined when $c_s = n_s^2$. We introduce a Dirac-delta function 666Let $\delta(x)$ be a Dirac delta function of $x$ which is 1 when $x = 0$, and 0 otherwise. to model the probability of observing data in a configuration close to criticality: This joint likelihood is the probability of a cluster configuration matching the observed data for every observation, and for every feature. The log-likelihood $L_c$ derived from it can be thought of as the Hamiltonian of this Potts system:

L_c = \frac{1}{2} \sum_{s\,:\, n_s > 1} \left[ \ln \frac{n_s}{c_s} + (n_s - 1) \ln \frac{n_s^2 - n_s}{n_s^2 - c_s} \right] \qquad (13)
The sum is accumulated over every feature, and $L_c$ represents the amount of structure present in the data. The value of $L_c$ is indirectly dependent on the spins via the terms $n_s$ and $c_s$.
There are a-priori advantages of this method over industry-standard alternatives. First, $L_c$ is completely dependent on the correlation matrix $C$, and the dimensionality $D$ of the dataset only plays a part in computing $C$. Second, it is unsupervised: there is no preset number of clusters. Clustering configurations are randomly generated, and the one which maximizes $L_c$ provides us with the number of clusters and their compositions.
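The likelihood of Eqn. (13) can be evaluated directly from a correlation matrix and a spin configuration; a minimal sketch (function name and data layout are ours):

```python
import math

def log_likelihood(C, spins):
    """Giada-Marsili log-likelihood L_c of a cluster configuration `spins`
    given a correlation matrix C; clusters with a single element
    contribute nothing."""
    n = len(spins)
    lc = 0.0
    for s in set(spins):
        members = [i for i in range(n) if spins[i] == s]
        ns = len(members)
        if ns < 2:
            continue
        # intra-cluster correlation c_s = sum of C over member pairs
        cs = sum(C[i][j] for i in members for j in members)
        lc += 0.5 * (math.log(ns / cs)
                     + (ns - 1) * math.log((ns ** 2 - ns) / (ns ** 2 - cs)))
    return lc
```

For uncorrelated data grouped into one cluster, $c_s = n_s$ and both log terms vanish, so $L_c = 0$: the likelihood only rewards genuine intra-cluster correlation.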
Further modification of the model can be made to reduce the Hamiltonian to that of the standard K-means algorithm Lloyd (1982).
The principles of our GA are given in Table (E) below.
The original PGA algorithm implemented in Gebbie et al. () contained a mating step involving a bespoke cross-over function, and a restriction: only parents with the same number of clusters could be mated, and the resulting children had to maintain the same characteristic. The enforcement of this rule restricted cluster merging and splitting to mutations only. Our f-SPC implementation removes this intermediary step in order to decrease the computational cost 777The algorithm is able to double the number of generations from 250 to 500 for a 5 min simulation (Iris, see Sec. IV.3)..
In addition to the diversity of individuals present in the population, mutations serve as GA diversity operators: They increase the genetic diversity, and send the system onto another path toward higher local maxima. We used six equally weighted types of mutations: i) New: A complete new individual, ii) Split: a random cluster split into two, iii) Merge: two clusters merged at random, iv) Swap: two spin labels are exchanged, v) Scramble: where a sequence of spins have their labels re-assigned in reverse order, and, vi) Flip: cluster (spin) labels are randomly re-assigned using the total cluster number (See SW Table C ).
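The six operators above can be sketched as follows (the internals of each operator are our own assumptions; only the operator names come from the text):

```python
import random

def mutate(spins, q, rng=None):
    """Apply one of six equally weighted f-SPC-style mutations to a spin
    (cluster-label) vector; a sketch, not the paper's implementation."""
    rng = rng or random.Random(0)
    s = list(spins)
    op = rng.choice(["new", "split", "merge", "swap", "scramble", "flip"])
    if op == "new":                       # i) brand-new random individual
        s = [rng.randrange(q) for _ in s]
    elif op == "split":                   # ii) split one cluster in two
        c, new = rng.choice(s), rng.randrange(q)
        s = [new if v == c and rng.random() < 0.5 else v for v in s]
    elif op == "merge":                   # iii) merge two clusters
        a, b = rng.choice(s), rng.choice(s)
        s = [a if v == b else v for v in s]
    elif op == "swap":                    # iv) exchange two spin labels
        i, j = rng.randrange(len(s)), rng.randrange(len(s))
        s[i], s[j] = s[j], s[i]
    elif op == "scramble":                # v) reverse a random sub-sequence
        i, j = sorted((rng.randrange(len(s)), rng.randrange(len(s))))
        s[i:j + 1] = reversed(s[i:j + 1])
    else:                                 # vi) flip: re-assign labels at random
        relabel = {c: rng.randrange(q) for c in set(s)}
        s = [relabel[v] for v in s]
    return s
```

Split and merge let the number of clusters change between generations, which is precisely the role the mating restriction used to suppress.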
Finally, the simulation has converged once the fitness of the best individual hasn't increased for a pre-determined number of iterations called "Stall Generations". It should also be noted that although we did not recode a CUDA implementation for direct GPU execution of our refined PGA algorithm, this can be implemented using a modified version of the original CUDA implementation Hendricks et al. (2016a).
II.4 Super-Paramagnetic Phase Validation
The goal of this project was to validate the link between the solutions obtained using the likelihood $L_c$ and the original Potts Hamiltonian $H$. We proceeded by comparing clustering configurations obtained using the two models for the data set in Sec. IV.5.
Shown in Fig. (0(a)), Fig. (0(b)), and Fig. (0(c)) are the Adjusted Rand Indices (ARI) as functions of temperature for our three cases. We compute the ARI at every temperature taking f-SPC as the true classification, and SPC as the candidates. Maximal ARI values are respectively 0.175, 0.6, and 0.05, at temperatures which are all located in the SP-phase where the susceptibility is non-zero, confirming the claim in Giada and Marsili (2001) that the maximization of $L_c$ should recover a clustering configuration of a system in the vicinity of the phase transition. We also note that despite SPC's neighborhood search restrictions, the Normalized f-SPC solution has the highest similarity to its SPC counterparts. The "Market Mode" case of f-SPC has a large mixed cluster, and the RMT one has its largest cluster mixed with securities from every economic sector. This would require further analysis, but we think the main difference between that and SPC's equivalent is this cluster, which could be of low correlation.
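The ARI used for these comparisons can be computed from pair counts over the two partitions; a minimal pure-Python sketch (not the implementation used for the figures):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two label vectors: 1 for identical
    partitions, ~0 for chance-level agreement."""
    n = len(a)
    pairs = Counter(zip(a, b))           # contingency-table cells
    row, col = Counter(a), Counter(b)
    sum_ij = sum(comb(v, 2) for v in pairs.values())
    sum_a = sum(comb(v, 2) for v in row.values())
    sum_b = sum(comb(v, 2) for v in col.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

The index is invariant under relabeling, so SPC and f-SPC configurations can be compared without aligning their cluster ids first.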
We push the analysis further by considering $L_c$ as a clustering quality evaluator. As was demonstrated in Giada and Marsili (2002), $L_c$ is a consistent objective function which, if maximized, can discriminate between clustering algorithms. Like the "Silhouette Coefficient" and the "Calinski-Harabasz Index", which are used to evaluate cluster definition when the ground-truth classes are not known, $L_c$ can play this role.
We test for this by evaluating all SPC candidates for their $L_c$ values, and we add a horizontal line on each plot indicating the respective f-SPC $L_c$. In every case, SPC's $L_c$ starts low at low $T$, reaches a maximum at intermediate $T$, and decreases slowly as $T$ increases into paramagnetic territory. This is yet another confirmation that higher $L_c$ values are located in an intermediate temperature regime which coincides with the SP-phase when the system is critical. SPC's $L_c$ maxima are 105, 43, and 170, respectively, in Fig. (1(a)), Fig. (1(b)), and Fig. (1(c)), and we observe that solutions recovered using f-SPC all have higher likelihoods than SPC's. Based on the results in this paper and in Giada and Marsili (2002), one could argue that f-SPC produces better clustering candidates than SPC, at least in this case.
II.4.1 Free-Energy Validation
We now define the Helmholtz free energy for a thermodynamic process of an isolated system:

F = U - TS \qquad (15)
The free energy can also be computed as

F = -T \ln Z \qquad (16)

with $Z$ the partition function as in Eqn. (6). 888Here we are implicitly using a coarse-grained approach to truncation over the Monte-Carlo replications; the thermodynamic averages over $Z$ can often be more effectively computed using the replica method Castellani and Cavagna (2005) with a prudent choice of the number of replications.
We consider an isothermal process: a system exchanging energy with a reservoir at constant temperature $T$ by absorbing heat until its own temperature converges to that of the reservoir. For processes such as the one just described, Eqn. (16) tells us that some of the energy needed for the system to be realized can be spontaneously transferred from the reservoir by heating ("$TS$"). In this sense, for systems on which no work is done, thermal equilibrium is reached if the free energy reaches a minimum.
Using mean-field models, it was shown in Blatt et al. (1996) and Giada and Marsili (2001) that the free energy reaches a local minimum within the Super-Paramagnetic or clustered ferromagnetic phase, and a maximum at the paramagnetic phase transition. We argue that the temperature at which the previously mentioned minimum happens is synonymous with the heat-bath inside of which the system is in its "lowest level".
Eqn. (16) requires computing $Z$, whereas Eqn. (15) needs the average energy $\langle H \rangle$ and the entropy $S$. Although we don't show it, these two methods agree. At any given temperature the system doesn't converge to a specific energy level but displays a distribution (see Fig. (2(a))). The task at hand is now about picking a number of bins for our MCMC simulation which is consistent and not arbitrary. We borrow a "low bias" method from Hacine-Gharbi et al. (2012) which follows:
B = \left[ \frac{\xi}{6} + \frac{2}{3\xi} + \frac{1}{3} \right]

where $B$ is the number of bins (with $[\cdot]$ denoting rounding), and

\xi = \left( 8 + 324 M + 12 \sqrt{36 M + 729 M^2} \right)^{1/3}

with $M$ as the number of samples, here the total number of MCMC steps. Because we fix $M$ to 2000, the number of bins remains set for every temperature, and the bin edges are set on the minimum and maximum possible energies depending on the problem. The Hamiltonian minimum energy is always 0 (for the ferromagnetic case), and the maximum for the BRICS data is shown in Fig. (2(b)). We then determine the distribution energy levels by picking the bin centers, which we compute by taking the mean of the distribution inside each bin. Once obtained, we can compute the Boltzmann distribution of energies, which we use to compute the thermal average energy $\langle H \rangle$, the entropy $S$ using Eqn. (8), and the free energy $F$.
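The bin-count rule can be sketched directly (function name ours):

```python
import math

def optimal_bins(m):
    """Low-bias histogram bin count of Hacine-Gharbi et al. (2012) for m
    samples: B = round(xi/6 + 2/(3*xi) + 1/3)."""
    xi = (8 + 324 * m + 12 * math.sqrt(36 * m + 729 * m ** 2)) ** (1.0 / 3.0)
    return round(xi / 6.0 + 2.0 / (3.0 * xi) + 1.0 / 3.0)
```

The count grows slowly (roughly as the cube root of $M$), so fixing $M = 2000$ fixes the binning once for the whole temperature sweep.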
In Figures (3(a)) and (3(b)) we show the results of an SPC simulation on the BRICS data (see Sec. IV.6): we show the free energy as a function of temperature for the SPC simulation, and we also plot the Adjusted Rand Index of the f-SPC solutions against the SPC ones. The free energy decreases until it reaches a minimum inside the Super-Paramagnetic phase, and then rises to a maximum at the paramagnetic transition. A quick look at Fig. (3(a)) shows that below the minimum free-energy temperature, the giant cluster breaks down inside the Super-Paramagnetic phase until the cluster sizes are comparable and seemingly stable. Above it, the cluster sizes are unstable and start decreasing, and it becomes impossible to significantly distinguish clusters, which signals a transition into the paramagnetic phase. More importantly, the ARI curve peaks close to the minimum free-energy temperature within the Super-Paramagnetic phase, thus revealing that f-SPC's algorithm and objective function minimize the free energy (maximize the entropy) of the system as they maximize $L_c$.
II.4.2 Sci-Kit learn: Varying Density Clusters
The problem consists of 3 clusters generated using the Sci-kit learn samples generator 999B. Thirion, G. Varoquaux, A. Gramfort, V. Michel, O. Grisel, G. Louppe, J. Nothman, 'make_blobs', 2017. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html. [Accessed: 12-Jun-2018]. We observed two cases of this problem: a 3D case, and a 500D case.
Fig. (3(c)) shows the susceptibility and the cluster sizes as functions of temperature. In the SP-phase, we observe 3 clusters in Fig. (3(c)) of sizes 167, 166, and 166, with an ARI of 1.
We follow this with f-SPC's solution, which scores an $L_c$ of 1460.47 and an ARI of 0.20, while the expected likelihood was 599.91. Yet again our solution's likelihood is higher than our expectation, and the real clusters are divided into smaller ones. In comparison, K-Means and DBSCAN respectively achieve ARIs of 1 and 0.8. In light of this, we decided to try the same problem again but with the dimensionality set to 500. As expected, SPC's results do not change. However, in this instance f-SPC's solution quickly converges to the best classification. A further investigation was done by simulating both the 3D and 500D cases using SPC, and computing the likelihood for every configuration, the assumption being that the maximum likelihood should be found within the temperature range where the system is in the SP-phase.
Figures (3(d)) and (3(e)) respectively show the likelihood as a function of temperature. We notice that in the 500D case in Fig. (3(e)), the maximum likelihood is found at temperatures within the SP-phase, and $L_c$ monotonically decreases at higher temperatures. The opposite happens in Fig. (3(d)), where the best classification doesn't correspond to the maximum likelihood, which in this case is found at high temperatures; looking at the susceptibility in Fig. (3(c)), this means we are effectively in the paramagnetic phase. We provide additional comments in the discussion section of this paper.
III Data Pre-Processing
III.1 The Distance Function
The wide variety of problems our clustering methods can tackle necessitates a careful choice of pairwise distances if we are to properly identify shared behavior. We will proceed by using the Euclidean distance whenever we assume independence of the features, and the Pearson correlation otherwise, especially for problems where the features consist of time-series.
This has implications for both algorithms: we use the Euclidean distance or the Pearson correlation distance for SPC, while for f-SPC we use the Pearson correlation matrix $C$, and the similarity matrix, which is the Euclidean distance matrix computed on $C$ and subtracted from 1.
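One common translation between correlations and metric distances, in the spirit of the MST construction mentioned earlier (Bonanno et al.), maps a Pearson correlation $\rho$ to $d = \sqrt{2(1 - \rho)}$; a sketch (function name ours):

```python
import math

def correlation_distance(rho):
    """Map a Pearson correlation rho in [-1, 1] to the metric distance
    d = sqrt(2 * (1 - rho)): perfectly correlated pairs sit at distance 0,
    perfectly anti-correlated pairs at distance 2."""
    return math.sqrt(2.0 * (1.0 - rho))
```

This transform satisfies the triangle inequality, which is what allows correlation matrices to be treated as embeddings in a Euclidean space.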
Raw data sets often contain features on different orders of magnitude, outliers, and missing data, all of which can have a significant impact on machine learning algorithms. One way to deal with these issues is through normalization of the features. This was achieved using the Min-Max scaling technique, which puts all features on a 0 to 1 scale by performing the following operation:

x' = \frac{x - \min(x)}{\max(x) - \min(x)}
Scaling has significant effects on the feature space: one example is seen in Fig. (8(a)) and Fig. (8(b)), which respectively show the unscaled and scaled plots of the 3-wines problem. The unscaled data set has two classes that are completely inseparable, whereas scaling the data effectively dissociates all three classes with minimal overlap.
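Applied per feature column, the operation is simply:

```python
def min_max_scale(xs):
    """Min-Max scaling of one feature column onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```

Note that the scaling is sensitive to outliers: a single extreme value compresses the rest of the column toward one end of [0, 1].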
The next and final pre-processing task consists in removing any noise present in our data. This is especially important for financial market time-series, which exhibit extreme randomness and possibly chaotic behavior. Stock market correlation matrices are noisy and positively skewed (Fig. (4(a))), which translates into what is referred to as the "Market Mode". We consider an intermediary step which consists in "removing" the market mode, thus ensuring we are able to recover the underlying correlation structures, if any, present in the system.
III.3.1 Random Matrix Theory (RMT)
We follow the predictions of RMT in Wilcox and Gebbie (2007) by assuming that stock market returns are IID random variables with zero mean and unit variance. These assumptions lead us to the conclusion that stock market correlations should all be zero, and if the assumptions are indeed true, RMT predicts that the eigenvalues of the random correlation matrices are Wishart distributed such that:
ρ(λ) = (Q / 2πσ²) √((λ₊ − λ)(λ − λ₋)) / λ,
where λ± = σ²(1 ± 1/√Q)², and Q = T/N.
Shown in Fig. (6) is the distribution of eigenvalues of the correlation matrix of our stock market data (Sec. IV.5). As can be observed, the eigenvalues inside the Wishart range are responsible for the noise, whereas those outside of the range are potentially correlated signal which should not be discarded.
We consider that the eigenvalues above the Wishart range represent the linear, first-order relations between time-series, while it is unclear what those on the left side of the Wishart distribution represent. The linear signals are the signals shared by assets at the sectoral level.
The RMT “Market Mode” removal method is implemented in the five steps listed in Table F:
We tested different time-series lengths ranging from 89 to 1249, and we note that the size of the Wishart range increases with dimensionality while the lower left tail decreases. The higher the dimensionality, the easier it is to rule out eigenvalues as random signals, and the more important the biggest eigenvalues are to the noise-less data reconstruction (in Giada and Marsili (2001), the authors achieve a similar result using the model in Sec. (II.3.2), confirming that “noise cleaning” mainly affects the small eigenvalues of stock market correlation matrices). An example of a cleaned correlation matrix resulting from this method is shown in Fig. (4(c)).
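A minimal sketch of this style of RMT cleaning, assuming standardized returns (σ² = 1) and the Wishart bounds stated above; the actual five steps of Table F may differ in detail (for instance in how the market, i.e. largest, eigenvalue is treated):

```python
import numpy as np

def rmt_clean(returns):
    """Rebuild a correlation matrix keeping only eigenvalues outside
    the theoretical Wishart range [lam_min, lam_max].

    `returns` is a T x N array of standardized daily returns.
    """
    T, N = returns.shape
    C = np.corrcoef(returns, rowvar=False)
    q = T / N
    lam_min = (1.0 - 1.0 / np.sqrt(q)) ** 2
    lam_max = (1.0 + 1.0 / np.sqrt(q)) ** 2
    vals, vecs = np.linalg.eigh(C)
    keep = (vals < lam_min) | (vals > lam_max)  # signal eigenvalues
    C_clean = (vecs * np.where(keep, vals, 0.0)) @ vecs.T
    np.fill_diagonal(C_clean, 1.0)  # restore unit self-correlations
    return C_clean
```

For purely random returns, essentially all eigenvalues fall inside the Wishart range and the off-diagonal entries of the cleaned matrix shrink towards zero.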
III.3.2 Iterative Matrix Normalization (IMN)
Another “Market Mode” removal method, due to Olshen and Rajaratnam (2010), treats the returns as IID random variables with zero mean and unit variance.
The Iterative Matrix Normalization “Market Mode” removal method is implemented in the three steps listed in Table G:
We observe in Fig. (4(b)) that the distribution of correlations is now centered around 0.
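The iterative normalization can be sketched as alternating row and column standardizations (an assumption on our part, in the spirit of Olshen and Rajaratnam's successive normalization; the exact three steps of Table G may differ):

```python
import numpy as np

def iterative_normalize(X, n_iter=25):
    """Alternately standardize the rows and columns of the data matrix.

    X is an N x T array of raw time-series; each pass subtracts the mean
    and divides by the standard deviation along one axis, then the other.
    Assumes no row or column is constant (non-zero standard deviations).
    """
    X = np.array(X, dtype=float)
    for _ in range(n_iter):
        # standardize each row (each time-series) ...
        X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
        # ... then each column (each date)
        X = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)
    return X
```

After the final column pass every column has zero mean and unit variance, which is what re-centers the resulting correlation distribution around 0.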
IV The Data Test-Cases
The following examples are used as a stress test for both methods. We obtained both synthetic and real data, which enabled us to discuss the features of each model.
As a comparison tool we use the Adjusted Rand Index (ARI) Hubert and Arabie (1985), which, given two classifications, measures their similarity. The ARI operates on a scale bounded above by 1, with positive values signifying increasing similarity. Where a true classification exists we use the ARI to measure the quality of clustering of both methods, but also of industry standards such as “K-Means” Lloyd (1982) and “DBSCAN” Ester et al. (1996). Using SPC (Sec. A.1) we cluster over a temperature range, which we then compare against the maximum likelihood (Eqn. (13)) solution recovered. We then select the SPC temperature with the highest ARI for a closer comparison with the f-SPC solution in the stock market case, where a true classification is not available.
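For concreteness, a small reference implementation of the ARI (the helper name is our own; in practice a library routine such as scikit-learn's adjusted_rand_score can be used instead):

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie Adjusted Rand Index between two labelings.

    Returns 1 for identical partitions (up to relabeling) and values
    near 0 for chance-level agreement.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    # contingency table between the two partitions
    table = np.zeros((ia.max() + 1, ib.max() + 1), dtype=int)
    for i, j in zip(ia, ib):
        table[i, j] += 1
    sum_ij = sum(comb(int(n), 2) for n in table.ravel())
    sum_a = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in table.sum(axis=0))
    total = comb(len(a), 2)
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)
```

The index counts pairs of items grouped together in both partitions and corrects for the agreement expected by chance, which is why label permutations do not change the score.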
For visualization, where possible, we provide the plots or we make use of a nonlinear dimensionality reduction package called UMAP McInnes et al. (2018). The graph of the MST is also provided, as it is a faithful representation of clusters on a 2D plane. The MST takes in the graph of our data and finds the minimum-weight tree connecting every node.
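A sketch of the MST construction from a correlation matrix, assuming SciPy is available and using the common mapping d_ij = √(2(1 − c_ij)) from correlations to distances (our assumed convention, standard in the MST literature on market data):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def correlation_mst(C):
    """Edge list of the MST of a correlation matrix C.

    The tree is the minimum-total-weight acyclic subgraph connecting
    all N nodes, so it has exactly N - 1 edges.
    """
    D = np.sqrt(2.0 * (1.0 - np.asarray(C, dtype=float)))
    tree = minimum_spanning_tree(D)
    rows, cols = tree.nonzero()
    # frozensets make the edge orientation irrelevant
    return [frozenset((int(i), int(j))) for i, j in zip(rows, cols)]
```

Highly correlated pairs map to short distances, so tightly correlated assets end up adjacent in the tree.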
IV.1 Sci-Kit learn: Concentric Circles
Our first problem is the identification of two concentric circles on a 2D plane using the Sci-kit learn Pedregosa et al. (2011) samples generator (B. Thirion, G. Varoquaux, A. Gramfort, V. Michel, O. Grisel, G. Louppe, J. Nothman, ’make_circles’, 2017. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html. [Accessed: 12-Jun-2018]) with 0.5 for the noise parameter, and the 2 dimensions represent the X and Y coordinates of the observations.
Observing Fig. (6(b)), we see no overlap between the two clusters present in the data, and we expect to recover close to perfect clusters after applying the algorithms.
We obtained the susceptibility as a function of temperature in Fig. (6(a)). Within the SP-phase we observe two clusters in Fig. (7(a)), each containing 250 nodes, with an ARI of 1. Once the temperature gets relatively high the system is deemed at “high energy”, and the clusters dissolve almost simultaneously. Unlike this particular example, this does not generally happen with real data, where clusters have different densities.
On this data, f-SPC runs for 10000 generations, maximizing the likelihood to 639 while the true classification scores 317. The f-SPC configuration is presented in Fig. (7(b)) with an ARI of 0.085. In comparison, K-Means and DBSCAN respectively achieve 0.16 and 1. K-Means has low clustering quality despite specifying the correct number of clusters; this is due to its inability to deal with non-spherical and non-Gaussian shaped clusters. Despite the high likelihood, Fig. (7(b)) shows a high number of clusters. The clusters are not mixed, but ultimately we fail to recover the initial two clusters.
IV.2 Sci-Kit learn: Wine Data
The second problem consists of a data set containing three clusters. It is reputedly an easy problem illustrating the importance of normalizing/standardizing features. There are 59, 71, and 48 samples respectively for classes 1, 2 and 3, and the data is generated using the Sci-kit learn loader (D. Cournapeau, F. Pedregosa, O. Grisel, ’load_wine’, 2007-2010. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html. [Accessed: 12-Jun-2018]). The 13 features are quantities extracted from a chemical analysis of 3 types of Italian wines: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline.
At first sight in Fig. (8(a)), 2 clusters are merged, whereas once the features have been normalized (Fig. (8(b))) the 3 clusters occupy separate regions of the space. Each cluster has one extremity close to its neighboring cluster, which may induce some misclassification error, and because of this we expect to recover 3 imperfect clusters.
Over an intermediate temperature range we observe three clusters, and Fig. (9(a)) provides the MST of the best classification recovered by SPC, with an ARI of 0.82.
Figure (9(b)) presents f-SPC’s solution with a likelihood of 83.94, an ARI of 0.51, and an expected likelihood of 66.97. As with the circles problem, our solution’s likelihood is higher than our expectation, and it has 7 clusters instead of 3. One cluster contains the observations of cluster 0, while clusters 1 and 2 are split into smaller ones without much misclassification. In comparison, K-Means and DBSCAN respectively achieve ARIs of 0.85 and 0.42.
IV.3 Sci-Kit learn: Fisher’s Iris Data
Fisher’s Iris Data, using the Sci-kit learn loader (D. Cournapeau, F. Pedregosa, O. Grisel, ’load_iris’, 2007-2010. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html. [Accessed: 12-Jun-2018]), includes individuals from 3 species: Iris Setosa, Virginica, and Versicolor, with 50 nodes per cluster.
As we can see in Fig. (10(b)), it is one of the more challenging toy problems because two of the three clusters, Virginica and Versicolor, are not linearly separable. We observe two phases in Fig. (10(a)): at low temperatures there are 2 clusters. The largest contains the Virginica and Versicolor nodes, while the smaller one includes most Setosa nodes. The phase transition is followed by the separation of most Virginica nodes into their own cluster. This SPC solution (Fig. (11(a))) has an ARI of 0.65.
The f-SPC solution in Fig. (11(b)) has an ARI of 0.627, a likelihood of 132, and our expected likelihood is 104. In comparison, K-Means and DBSCAN respectively achieve ARIs of 0.73 and 0.52. Similarly to the preceding examples, we recover five large clusters: cluster 1 contains the Setosa individuals, while the Virginica and Versicolor clusters are split into 4 smaller ones with minimal misclassification.
IV.4 Sci-Kit learn: MNIST Digits
The hand-written digits dataset, generated with the Sci-kit learn loader (D. Cournapeau, F. Pedregosa, O. Grisel, ’load_digits’, 2007-2010. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html. [Accessed: 12-Jun-2018]), is mainly used to test classification algorithms in supervised learning, but we are interested in how well both SPC and f-SPC deal with the nonlinear nature of hand-writing. The data contains 10 classes of digits ranging from 0 to 9. The full set has close to 2000 nodes, from which we select 500, 50 of each class. The images are 8 by 8 pixels.
The MST in Fig. (12(a)) and the UMAP McInnes et al. (2018) plot in Fig. (12(b)) show us that all digit classes are linearly separable; the data contains outliers, especially “1”s and “9”s, which may be the result of different writing styles.
We see that the big cluster breaks down in multiple steps (Fig. (12(c))) due to the differing densities of clusters present in the data. Fig. (12(c)) shows that right after the final peak, the configuration’s clusters in Fig. (13(a)) have a high ARI and show no significant misclassification.
f-SPC’s solution in Fig. (13(b)), after 25k generations, has a likelihood of 149.47, an ARI of 0.747, and an expected likelihood of 135. Once again we encounter results similar to the previous cases with respect to the higher likelihood and the number of clusters. The solution has close to 20 clusters, and while there is one main cluster per digit, which is the case for digits 0 and 6, and mostly for 3, 7, 8, and 9, the digits 1, 2, 4, and 5 are all split into two clusters. We explain this by the inconsistent nature of hand-writing, which produces different writing styles. In comparison, K-Means and DBSCAN respectively achieve ARIs of 0.56 and 0. There are many reasons why DBSCAN fails this problem: DBSCAN classifies some observations as noise into one cluster, and it also has issues tackling problems with clusters of different densities.
IV.5 Kaggle: NYSE Data
We obtained publicly available NYSE stock market data on Kaggle (C. Nugent, ’S&P 500 stock data - Historical stock data for all current S&P 500 companies’, 2017-2018. [Online]. Available: https://www.kaggle.com/camnugent/sandp500. [Accessed: 01-Dec-2017]). The original data contains daily open, high, low, closing prices, and volume from 8/13/2012 to 8/11/2017. Because not all stocks traded for the whole duration, we only selected the stocks which did for the last 1250 days (about 5 years), which left us with 447 stocks; we were in this case interested in a time horizon of 5 years in trading days, from 8/23/2012 to 8/11/2017. We consider the daily closing prices, which are used to compute the daily returns such that r_i(t) = ln P_i(t) − ln P_i(t − 1).
The final data set has a time-series length of 1249 days. Using Eqn. (20) we consider three cases for the correlation matrix: i.) the full correlations, ii.) denoising using IMN (see Sec. III.3.2), and iii.) cleaning the matrix using the RMT method (see Sec. III.3.1).
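The price-to-correlation pipeline can be sketched as follows (a minimal NumPy sketch; we assume daily log returns, and the helper names are our own):

```python
import numpy as np

def daily_log_returns(prices):
    """prices: T x N array of daily closes -> (T-1) x N daily log returns."""
    return np.diff(np.log(np.asarray(prices, dtype=float)), axis=0)

def stock_correlations(prices):
    """Pearson correlation matrix of the daily returns (the 'full' case)."""
    return np.corrcoef(daily_log_returns(prices), rowvar=False)
```

The resulting N x N matrix is the "full correlations" input; the IMN and RMT variants are obtained by post-processing it as described in Secs. III.3.2 and III.3.1.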
Financial markets are perpetually evolving living ecosystems, and this is illustrated in the lack of an available true sectoral classification of publicly traded companies. In the process of clustering stock market data (a very nice review of clustering methods applied to financial datasets is available in Marti et al. (2017)), we wanted to compare the results of our algorithms with industry standard classifications, but we faced the following difficulties:
We consider the following industry classifications (there is no consensus industry classification used in the financial services industry): The New York Stock Exchange (NYSE) uses the Global Industry Classification Standard (GICS), which we use here; the GICS counts 11 sectors, 24 industry groups, 68 industries, and 157 sub-industries, and is updated annually (for more details on the hierarchical industry classification system see https://www.msci.com/documents/10199/4547797/GICS+Structure+effective+Sep+1%2C+2016.xls). The National Association of Securities Dealers Automated Quotations (NASDAQ) and the London Stock Exchange (LSE) both use the Industry Classification Benchmark (ICB); the ICB counts 10 industries, 19 super-sectors, 41 sectors, and 114 sub-sectors (for details on the hierarchical industry classification system see https://www.ftse.com/products/downloads/ICB_Rules.pdf). Industry classifications have sectoral, industrial and sub-industrial levels. Although commonalities exist, one is left to determine the equivalences when information aggregation is required across different markets.
GICS and ICB are static classifications which are updated at irregular intervals (i.e. GICS every year, ICB from weekly to yearly updates). The focus of these companies is to provide long term structural trends of financial markets. As such they lose their usefulness if one wants to consider the impact of rare events, such as financial crashes, which significantly alter the behavior of businesses. They also do not consider how the diversification of investments and activities affects their respective classifications. The case of Amazon can be argued to illustrate the idea behind the Adaptive Market Hypothesis (AMH) Lo (2004): Amazon’s GICS sector is Consumer Discretionary. GICS uses this sector to classify companies whose activity they deem “most sensitive to economic cycles” (a description of the GICS sectors is available at https://www.msci.com/documents/10199/4547797/GICS+Sector+definitions-Sep+2016.pdf). It is unclear what is meant by “sensitive” in this instance, as there are many possible interpretations, and this sector is very heterogeneous. Perhaps it highlights the adaptive nature of Amazon’s business interests, which started first as an order-to-delivery e-commerce bookstore but, based on Fig. (14(a)), is now closest to the Information Technology sector.
The life cycle of publicly traded companies can be short. Firms go public and private at relatively high frequency when compared to biological evolution on a human timescale, as motivated by Farmer in Farmer and Geanakoplos (2009). The inclusion or exclusion of individuals in an ecosystem can and should have an impact on its structure, based on how important the individuals are to the groups. When we looked for GICS data for our time-series, a number of companies had gone private since Aug 2017, and the GICS classification had been updated without reflecting these changes for these companies. Gathering data on these companies which translated into the newer nomenclatures was thus rendered more difficult, yet again illustrating the need for expert-free unsupervised methods.
Finally, while, as previously stated, GICS and ICB intend to provide data which capture long term trends, financial markets are populated with participants (i.e. pension funds, high frequency traders, asset managers etc.), each holding a diverse set of objectives, who do not necessarily operate on the same time scales or have the same investment horizons. If one goal is to provide comprehensive analyses of the multiple existing dynamics in markets, tools which capture these trends, and methods which subsequently find relations between them, should be prioritized.
This motivates us to argue that the highly dynamic nature of financial markets renders the use of static classifications problematic to a certain extent.
We use GICS’s 11 sectors as the “true” economic sectors of the US financial market. These include Consumer Discretionary (74 stocks), Consumer Staples (31 stocks), Energy (28 stocks), Financials (62 stocks), Health Care (51 stocks), Information Technology (IT) (59 stocks), Industrials (58 stocks), Materials (26 stocks), Real Estate (26 stocks), Telecoms (4 stocks), and Utilities (28 stocks) (colors used for the 11 GICS economic sectors: Consumer Discretionary (royal blue), Consumer Staples (sky blue), Energy (orange), Financials (beige), Health Care (dark green), Information Technology (light green), Industrials (red), Materials (pink), Real Estate (purple), Telecom (magenta), Utilities (brown)). Although, as previously mentioned, we do not believe this classification to be valid, here we make use of it as a benchmark.
Looking at the MSTs in Figures (14(a)), (14(b)), and (14(c)), aided by the GICS classification as legend, we notice that nodes belonging to the same economic sectors are mostly located in proximity to each other, as one would expect in a static world or over time-scales where the static model is a reasonable approximation.
We briefly mention again one of SPC’s features, which consists of linking a node to its closest neighbor based on the spin-spin correlations. Using the threshold condition we construct a graph, but in the case where a node has no correlations meeting the condition, it is linked to its neighbor of highest spin-spin correlation. This feature forces SPC to produce graphs without isolated nodes. At the same time, and because of this fact, we consider small clusters to be equivalent to noisy, insufficiently correlated, or unclassified observations.
The SPC solutions recover GICS information, as seen by their respective ARIs: 0.317, 0.479, and 0.33. The solution with the highest number of noisy or unclustered stocks is Fig. (15(a)): the financial sector is merged with many stocks from other sectors, whereas most industries are found in one or two clusters. The complexity goes down when we move to Fig. (15(c)), where every sector has mostly separated into its own unique cluster, and Fig. (15(e)), which gives a similar picture although with more small unclassified clusters present.
f-SPC results were simulated for 25k generations, and we obtained likelihood values of 113.92, 54.13, and 367.93 respectively for the Full (Fig. (15(b))), the Normalized (Fig. (15(d))), and the RMT Correlations (Fig. (15(f))). The economic GICS information recovered via the ARI was, following the same order, 0.25, 0.35 and 0.41. While Fig. (15(b)) has the smallest number of clusters, 15, we find clusters which mostly contain firms from single industries such as the Financials, Utilities, Real Estate, and Energy sectors. The other clusters are more or less mixed, including a very large one which we could refer to as the “market”. Recall that Fig. (4(a)) shows the correlations of the “Market Mode” are mostly positive, and one can easily infer this kind of result. Fig. (15(d)) and Fig. (15(f)) provide a cleaner picture of the market: in Fig. (15(d)) there is no large “market” cluster and every industry is mostly represented in its own respective clusters. Most of the firms previously found in the “market” are now located in clusters representative of their respective industries. The situation is similar in Fig. (15(f)), except the industry sectors have a better definition, while a large mixed cluster remains present, similarly to Fig. (15(b)).
The neighborhood search SPC performs constrains the scope of the optimization: SA can only minimize the Hamiltonian over the neighborhoods, which necessitates an additional decision in picking the neighborhood size, which acts as a hyper-parameter. The likelihood is optimized over the whole range of observations, effectively removing such a need and making the optimization fully unsupervised. This also means that there exists a possibility that nodes which would not cluster together because of the neighborhood limitation would in this particular case. It is unclear which is the best way to proceed.
Our second goal is to explore the clustering differences which arise in our 3 cases. In Fig. (15(a)) and Fig. (15(b)) the biggest clusters have significant overlap with economic sectors, except for a few large ones such as the financials cluster, which also houses stocks from other sectors. The picture gets cleaner once we look at Fig. (15(c)) and Fig. (15(d)), where we now have less mixing in most clusters, and finally in Fig. (15(e)) and Fig. (15(f)) some of the clusters, such as the Real Estate, Utilities, Health Care, and Consumer Staples clusters found in the Normalized case, are split. We noticed a similar result in Sec. IV.3, where the f-SPC solution had a higher number of clusters, but they were essentially subgroups of the ones found by SPC.
IV.6 BRICS Data
We obtained publicly available BRICS (Brazil, Russia, India, China and South Africa) stock market data (C. Nugent, ’S&P 500 stock data - Historical stock data for all current S&P 500 companies’, 2017-2018. [Online]. Available: https://www.kaggle.com/camnugent/sandp500. [Accessed: 01-Dec-2017]). The original data contains daily closing prices of 226 stocks: Brazil (60), China (50), India (30), Russia (43), and South Africa (43). We will refer to BRICS as a way of listing the mentioned countries in the previously given specific order. The window spans 2005 to 2015, from which we retained the last 5 years of daily trading. The data set suffers from a missing data problem which would make it impossible to compute correlation matrices directly. We deal with the problem by computing correlations only on overlapping sections of time-series. The resulting correlation matrix is then made positive definite, and cleaned using IMN (see Sec. III.3.2).
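A minimal sketch of the overlap-based correlation computation, together with one common way of enforcing positive definiteness (eigenvalue clipping, which is our assumption; the paper does not specify the repair method used):

```python
import numpy as np

def pairwise_overlap_corr(X):
    """Correlation matrix of a T x N array containing NaNs, computing each
    pairwise correlation only over jointly observed dates."""
    X = np.asarray(X, dtype=float)
    N = X.shape[1]
    C = np.eye(N)
    for i in range(N):
        for j in range(i + 1, N):
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            if ok.sum() > 2:  # need at least 3 common points
                C[i, j] = C[j, i] = np.corrcoef(X[ok, i], X[ok, j])[0, 1]
    return C

def make_positive_definite(C, eps=1e-8):
    """Clip negative eigenvalues, then rescale back to a unit diagonal."""
    vals, vecs = np.linalg.eigh(C)
    C_pd = (vecs * np.clip(vals, eps, None)) @ vecs.T
    d = np.sqrt(np.diag(C_pd))
    return C_pd / np.outer(d, d)
```

Because each entry is estimated on a different subset of dates, the raw overlap matrix need not be positive definite, which is why the repair step is required before clustering.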
Using MR, the local maximum of the likelihood reached was 7.56, with 24 clusters: there is very little mixing, and the 5 countries are concentrated inside 1 to 3 clusters per country (see Fig. (16(b))). This confirms our expectations, which were that same-country stocks should mostly belong to the same clusters, with the existence of multiple clusters per country as evidence of meso-scale industry/sectoral level classification.
We continue with the SPC result given in Figures (16(a)) and (3(a)) with 9 clusters. The ARI between the SPC and f-SPC solutions is 0.5. Globally speaking, the same clusters are recovered, except that their sizes are slightly smaller for the MR solution, and cluster “purity” is higher in the MR candidate (stocks which belong to different countries do not mix). In both candidates Brazil’s financial market is divided into 2 clusters, which upon a more detailed cluster analysis could reveal industry (or sector) economic subdivisions.
In this paper we were able to successfully implement SPC and f-SPC, and we tested those methods on synthetic and real data. If there exist significantly different structures within the data, SPC will exhibit multiple transitions within the SP-phase. As the temperature is varied, the Susceptibility and Average Magnetization together signal the occurrence of phase transitions. The spin-spin correlation is indirectly linked to the interaction strength and the densities found in the data. The method has the advantage of being unsupervised for the most part: it does not necessitate a-priori knowledge of the number of clusters, and makes no assumptions about the distributions of the data. While SPC performs a neighborhood search, which requires picking a value for the neighborhood size, this does not affect the simulation significantly for large data sets, as was previously seen in Blatt et al. (1997). The membership threshold parameter is set to 0.5 and helps decide cluster membership. We clustered at every temperature within a pre-determined range such that we do not need to identify “clustering temperatures” like in Blatt et al. (1996), but we do so at the expense of additional computational cost. We note that in the literature there are modifications of Potts model clustering which automate parameter selection for the clustering temperature, the local length scale, and the cluster membership threshold through validation based calibration Murua and Wicker (2014).
Once we have a hierarchy of configurations, such as in Fig. (12(c)), one needs to select appropriate clusters representative of the different regimes of the SP-phase. This is an easy task, as can be seen in Fig. (8(c)), when the number of clusters is low and the clusters have similar densities, which establishes a stable phase for a relatively wide temperature range. On the other hand, if the number of clusters is high and the data is composed of clusters of different densities, the susceptibility, which tracks the variance of the biggest cluster's size, has its limitations Kullmann et al. (2000). As the size of the data-set increases, the susceptibility remains useful as a tool to locate the final transition and lowest clustering level.
f-SPC only requires the correlation matrix and is completely unsupervised. The randomly generated population is diversified at every iteration by applying as many as 7 different mutations. It is a fast and deterministic algorithm, while SPC is MCMC based and requires statistical averages. The computation time is affected by the order of the observations in the data Blondel et al. (2008). We noticed that ordering our data based on the order of one of the observations’ closest “neighbors” produced better results, which should motivate further exploration of potential heuristics dealing with this issue.
The likelihood measures the quality of cluster configurations: its value is computed from the cluster sizes and the intra-cluster correlations. The optimization is global which, as opposed to SPC, avoids the need to determine a neighborhood size. There exist problems where choosing a sufficiently big neighborhood size has a non-trivial impact on SPC’s solutions. One such example would be the existence of a relatively low density and sparse cluster in a data set mostly composed of high density clusters. Low values of the neighborhood size would fail to recover the low density clusters, which would remain unclassified, whereas this is not an issue for f-SPC, which would perform much better.
f-SPC results are consistent for high dimensionality data-sets, if we consider, as the metric used to evaluate the noise in correlation matrices, the ratio of the number of observations to the number of features (the time-series length for market data). We recall that in Wilcox and Gebbie (2007), this ratio encodes the noise level of the eigenvalues of the sample covariance matrix. The values for our problems are 250 for the two circles, 13.69 for the wines, 166.66 for the 3D blobs, 1 for the 500D blobs, 37.5 for Fisher’s Iris, 7.81 for the MNIST digits, and 0.35 for the NYSE Kaggle data. The results consistent with SPC were the 500D blobs, MNIST, and the NYSE stock data, which all confirm that a low ratio is necessary to compute appropriate correlation matrices. We want this ratio to be as small as possible, and if possible close to 0. This is not always the case, and we have tested ways to de-noise the correlation matrix (Secs. III.3.2 and III.3.1) in the case of financial time-series, but it is unclear at this time what the solutions would be in other cases. In Giada and Marsili (2001) Marsili and Giada derive the likelihood, and along the way they make a large-sample assumption, which in turn means one has to consider the finite size effects of the method. Fig. (3(d)) and Fig. (3(e)) tell us that if we were to visualize the likelihood’s objective surface, depending on the dimensionality of the problem we could face a “rough” space. One could consider the likelihood as a sort of modularity function, just like in the Network Science literature. One major Network Science problem is the efficient detection of communities inside networks. Similar to our work, cluster configurations are the input of the modularity function which, through diverse heuristics, is maximized. However, it is well known to Network Scientists Good et al. (2010) that modularity objective surfaces are degenerate: many significantly different clustering results have similar modularity, or in our case, higher likelihood than the true clustering.
We compare this to the modus operandi of SPC. The generative model of SPC is the Gibbs-Boltzmann distribution, which validates clusters locally using Eqn. (9). It is a bottom up approach, as opposed to global optimization methods which are top down. One assumes that there exist multiple realizations (micro-states) of the generative model, the so called “equivalence classes”, which are valid representations of the data. In order to link micro and macro-states one could pick any one micro-state translating into the desired macro-state: Maximum Likelihood methods essentially achieve this by searching the space of solutions for any candidate meeting the global objective. We argue that in complex systems, the existence of equivalence classes, as illustrated by the degeneracy of clustering objective surfaces, leads to the Maximum Entropy principle Jaynes (2003) as an alternative optimization device. The generative model generates equivalence classes (among which are included maximum likelihood candidates), each with differing probabilities, and one then needs to probabilistically combine them to achieve some sort of representative weighted average.
We suspect a way to deal with cases where the time-series length is small compared to the number of observations would be a modification of the likelihood by adding an additional term acting as a regularizer, which could account for the number of clusters. Our rationale is that, as an objective function, the likelihood is degenerate, with multiple spin configurations whose likelihoods are equal or very close. This degeneracy comes from the number of possible configurations, which, if we assume the minimum cluster size to be 2 (no singletons), is astronomically large even for modest data sets.
Finally, one is left to decide which de-noising method is deemed optimal, and as a consequence which clustering one prefers. The assumptions in both methods have their validity and should be carefully considered. Whereas IMN (Sec. III.3.2) considers the covariance matrix entries as IID normal random variables, RMT (Sec. III.3.1) predicts that a spectrum of random matrix eigenvalues exists which is purely noisy signal. The noise is removed by reconstructing the data without the noisy eigenvalues, whose number increases with dimensionality. We suspect a proper way of deciding which method is optimal is the implementation of such methods as bases of trading strategies.
In this paper we have presented two unsupervised data clustering algorithms inspired by the Potts Model Wu (1982). SPC (Sec. II.3.1), using SA, optimizes the Hamiltonian of a thermodynamic system, and the ground state energy solution recovered provides the best clustering structure present in our data. We show that the parameter-free Marsili-Giada (Sec. II.3.2) maximum likelihood method, implemented with a modified version of Hendricks et al.’s Hendricks et al. (2016a, b) Parallelized Genetic Algorithm, recovers solutions similar to those found in the super-paramagnetic phase (Fig. (1)).
This was done by comparing the SPC solutions to the f-SPC ones using the ARI Hubert and Arabie (1985). In addition to this we compare the likelihood of SPC solutions to f-SPC ones to show that the f-SPC solutions have higher likelihoods; this prompts an additional discussion on the implications for statistical inference in complex systems. The methods were tested both on several standard toy test cases as well as real stock market time-series data; this illustrates their universality provided an appropriate similarity metric is selected, such as the Pearson correlation coefficients or the correlation distance. We are able to show that the results are similar to the 11 standard GICS economic sectors; however, the differences in the number of clusters and their composition should be cause for concern with respect to the use of the GICS classification. Hence, we question the effectiveness of naively using GICS classifications for risk-management and for investment decision making on both the medium and short term.
Building on the work presented in this paper, and on our original aim, as in Hendricks et al. (2016a), we would have liked to perform cluster analysis and compare configurations of stock market intra-day time-series. The last thirty years have seen a magnificent increase in technological power which enabled simultaneous trading on multiple time scales. The so-called High Frequency Trading (HFT) paradigm increased tenfold the amount of stock market data, and has significantly impacted market participants’ behaviors at shorter time scales. It is therefore natural to consider market participants as all having different operating time horizons, and for this to somehow be reflected in the non-trivial composition of clusters. We assume traders use all information available to make decisions; however, the rate of economic information released about publicly traded firms can range from once a month to once a year. This rate is significantly lower than that of HFT, which leads us to think trading is not solely based on economic information, and because of the different time-scales one can suspect different objectives may be at play Wilcox and Gebbie (2014). The key motivation here was to build and test high-speed classification methods towards the goal of unsupervised classification of traders from live market data. It is notoriously difficult to obtain trading data linked to individual market participants’ accounts. This kind of data would be extremely useful not only to directly study traders’ behavior, but to begin to understand the kind of ecosystem a financial market is. Unfortunately one is left only with the possibility of proxy studies through the dynamics of the traded securities.
One possible approach is the unsupervised clustering of simulated technical trading agents implementing investment strategies in an artificial financial market; this has been a key motivation for the present work. One logical next step is what we call Dynamical Cluster Analysis (DCA): events such as the financial crisis which preceded the 2008 Great Recession can be investigated at the intra-day scale. Here again we conjecture that shocks to the system irremediably affect strategies, and that clustering structures become less persistent in time, as in Musmeci et al. (2015). Ultimately, some quantification of clustering on different temporal scales could be useful for probing potential hierarchical causal effects, given that different effective theories may dominate at different scales Wilcox and Gebbie (2014).
We have so far worked with price returns as our single-factor model. The derivation and formulation of the Giada-Marsili likelihood allows for multivariate clustering: we can use F by N by N correlation matrices, where F is the number of factors we want to include. A further challenge, however, is that we are not aware of an existing implementation of multi-factor correlation-based SPC. In closing, we promote the idea that a promising future research direction would be quantized spin models, and ultimately building unsupervised learning algorithms that more effectively accommodate state interference and phase information within some likelihood method, as well as additional optimization refinements to the algorithm, perhaps using community detection algorithms to enhance performance Blondel et al. (2008).
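Assembling the multivariate input itself is straightforward even though no multi-factor SPC implementation is available: each factor contributes its own N by N correlation matrix, and the F matrices are stacked into an F by N by N array. The sketch below uses random data and hypothetical factor names (returns and volume changes) purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 250  # assets, observations (illustrative sizes)

# Hypothetical factors: price returns and traded-volume changes.
# Each factor is an N x T panel of observations, one row per asset.
factors = {
    "returns": rng.standard_normal((N, T)),
    "volume_changes": rng.standard_normal((N, T)),
}

# Stack one N x N Pearson correlation matrix per factor: shape (F, N, N).
C = np.stack([np.corrcoef(panel) for panel in factors.values()])
print(C.shape)  # -> (2, 5, 5)
```

A multi-factor likelihood would then have to aggregate cluster evidence across the first axis of this array, which is precisely the open implementation question noted above.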
The authors thank Etienne Pienaar, Micheal Gant, Jeff Murugan, Nic Murphy and Diane Wilcox for discussions and comments. LY would like to thank Fangqiang Zhu for first introducing him to the elegance of Ising models and their simulation. TG acknowledges the University of Cape Town for FRC funding (fund 459282).
Appendix A The Algorithms
The algorithms implemented in this paper are coded in Python and are available in a GitHub repository Yelibi (2017).
A.1 SPC Algorithms
We provide pseudo-code for the SPC algorithm Blatt et al. (1996) introduced in Sec. II.3.1. Given a distance matrix and a neighborhood of size K, the algorithm uses the Swendsen-Wang MCMC method Swendsen and Wang (1987) to optimize the thermodynamic system.
The Hoshen-Kopelman algorithm Al-Futaisi and Patzek (2003) is the mechanism behind cluster discovery in SPC: it traverses the graph via its nodes and neighborhoods while labeling the clusters of the system locally.
Swendsen-Wang flips entire clusters of spins at each iteration, allowing quick convergence toward a local maximum.
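The two steps above can be sketched in a few lines. The following is a minimal single Swendsen-Wang sweep for a q-state Potts configuration, assuming q = 20 states and couplings stored as an edge dictionary; a union-find structure stands in for the Hoshen-Kopelman labeling, and the bookkeeping of spin-spin correlations across sweeps that SPC requires is omitted.

```python
import math
import random

def swendsen_wang_step(spins, J, T, q=20, rng=random):
    """One Swendsen-Wang update of a q-state Potts configuration.

    spins: list of spin values, one per node.
    J:     dict mapping an edge (i, j) to its coupling J_ij > 0.
    T:     temperature of the annealing schedule.
    """
    n = len(spins)
    parent = list(range(n))  # union-find, in the spirit of Hoshen-Kopelman

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Freeze a satisfied edge (equal spins) with probability 1 - exp(-J_ij/T);
    # frozen edges merge their endpoints into one cluster.
    for (i, j), Jij in J.items():
        if spins[i] == spins[j] and rng.random() < 1.0 - math.exp(-Jij / T):
            parent[find(i)] = find(j)

    # Flip every frozen cluster as a whole to one randomly drawn new spin.
    new_spin = {}
    for i in range(n):
        root = find(i)
        if root not in new_spin:
            new_spin[root] = rng.randrange(q)
        spins[i] = new_spin[root]
    return spins
```

At low temperature the freezing probability approaches one and large aligned regions flip together, which is what lets the sampler cross energy barriers that single-spin updates cannot.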
A.2 f-SPC Algorithms
- Abramov et al. (2010) Abramov, A., Kulvicius, T., Wörgötter, F., Dellen, B., 2010. Facing the multicore-challenge, in: Keller, R., Kramer, D., Weiss, J.P. (Eds.), Facing the Multicore-challenge. Springer-Verlag, Berlin, Heidelberg. chapter Real-time Image Segmentation on a GPU, pp. 131–142. URL: http://dl.acm.org/citation.cfm?id=1980597.1980615.
- Ackley et al. (1985) Ackley, D.H., Hinton, G.E., Sejnowski, T.J., 1985. A learning algorithm for boltzmann machines. Cognitive Science 9, 147 – 169. URL: http://www.sciencedirect.com/science/article/pii/S0364021385800124, doi:https://doi.org/10.1016/S0364-0213(85)80012-4.
- Al-Futaisi and Patzek (2003) Al-Futaisi, A., Patzek, T.W., 2003. Extension of hoshen-kopelman algorithm to non-lattice environments. Physica A: Statistical Mechanics and its Applications 321, 665 – 678. URL: http://www.sciencedirect.com/science/article/pii/S0378437102015868, doi:https://doi.org/10.1016/S0378-4371(02)01586-8.
- Barabási and Albert (1999) Barabási, A.L., Albert, R., 1999. Emergence of scaling in random networks. Science 286, 509–512. doi:10.1126/science.286.5439.509.
- Bianconi and Barabási (2001) Bianconi, G., Barabási, A.L., 2001. Bose-einstein condensation in complex networks. Phys. Rev. Lett. 86, 5632–5635. URL: https://link.aps.org/doi/10.1103/PhysRevLett.86.5632, doi:10.1103/PhysRevLett.86.5632.
- Blatt et al. (1996) Blatt, M., Wiseman, S., Domany, E., 1996. Superparamagnetic clustering of data. Phys. Rev. Lett. 76, 3251–3254. URL: https://link.aps.org/doi/10.1103/PhysRevLett.76.3251, doi:10.1103/PhysRevLett.76.3251.
- Blatt et al. (1997) Blatt, M., Wiseman, S., Domany, E., 1997. Data clustering using a model granular magnet. Neural Computation 9, 1805--1842. doi:10.1162/neco.1997.9.8.1805.
- Blondel et al. (2008) Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E., 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, P10008. URL: http://stacks.iop.org/1742-5468/2008/i=10/a=P10008.
- Bonanno et al. (2001) Bonanno, G., Lillo, F., Mantegna, R., 2001. High-frequency cross-correlation in a set of stocks. Quantitative Finance 1, 96–104. URL: https://doi.org/10.1080/713665554, doi:10.1080/713665554, arXiv:https://doi.org/10.1080/713665554.
- Brush (1967) Brush, S.G., 1967. History of the lenz-ising model. Rev. Mod. Phys. 39, 883--893. URL: https://link.aps.org/doi/10.1103/RevModPhys.39.883, doi:10.1103/RevModPhys.39.883.
- Castellani and Cavagna (2005) Castellani, T., Cavagna, A., 2005. Spin-glass theory for pedestrians. Journal of Statistical Mechanics: Theory and Experiment 2005, P05012. URL: https://doi.org/10.1088%2F1742-5468%2F2005%2F05%2Fp05012, doi:10.1088/1742-5468/2005/05/p05012.
- Edwards and Sokal (1988) Edwards, R.G., Sokal, A.D., 1988. Generalization of the fortuin-kasteleyn-swendsen-wang representation and monte carlo algorithm. Phys. Rev. D 38, 2009--2012. URL: https://link.aps.org/doi/10.1103/PhysRevD.38.2009, doi:10.1103/PhysRevD.38.2009.
- Erdős and Rényi (1960) Erdős, P., Rényi, A., 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5, 17--61.
- Ester et al. (1996) Ester, M., Kriegel, H.P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press. pp. 226--231. URL: http://dl.acm.org/citation.cfm?id=3001460.3001507.
- Farmer and Geanakoplos (2009) Farmer, J.D., Geanakoplos, J., 2009. The virtues and vices of equilibrium and the future of financial economics. Complexity 14, 11--38. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/cplx.20261, doi:10.1002/cplx.20261, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/cplx.20261.
- Fortuin and Kasteleyn (1972) Fortuin, C., Kasteleyn, P., 1972. On the random-cluster model: I. introduction and relation to other models. Physica 57, 536 -- 564. URL: http://www.sciencedirect.com/science/article/pii/0031891472900456, doi:https://doi.org/10.1016/0031-8914(72)90045-6.
- Gebbie et al. Gebbie, T., Wilcox, D., Mbambiso, B. Spin, stochastic factor models, and a ga. Citeseer.
- Getz et al. (2000) Getz, G., Levine, E., Domany, E., Zhang, M., 2000. Super-paramagnetic clustering of yeast gene expression profiles. Physica A: Statistical Mechanics and its Applications 279, 457 -- 464. URL: http://www.sciencedirect.com/science/article/pii/S0378437199005245, doi:https://doi.org/10.1016/S0378-4371(99)00524-5.
- Giada and Marsili (2001) Giada, L., Marsili, M., 2001. Data clustering and noise undressing of correlation matrices. Phys. Rev. E 63, 061101. URL: https://link.aps.org/doi/10.1103/PhysRevE.63.061101, doi:10.1103/PhysRevE.63.061101.
- Giada and Marsili (2002) Giada, L., Marsili, M., 2002. Algorithms of maximum likelihood data clustering with applications. Physica A: Statistical Mechanics and its Applications 315, 650 -- 664. URL: http://www.sciencedirect.com/science/article/pii/S0378437102009743, doi:https://doi.org/10.1016/S0378-4371(02)00974-3.
- Good et al. (2010) Good, B.H., de Montjoye, Y.A., Clauset, A., 2010. Performance of modularity maximization in practical contexts. Phys. Rev. E 81, 046106. URL: https://link.aps.org/doi/10.1103/PhysRevE.81.046106, doi:10.1103/PhysRevE.81.046106.
- Grimmett (2006) Grimmett, G.R., 2006. The random-cluster model. volume 333. Springer Science & Business Media.
- Hacine-Gharbi et al. (2012) Hacine-Gharbi, A., Ravier, P., Harba, R., Mohamadi, T., 2012. Low bias histogram-based estimation of mutual information for feature selection. Pattern Recognition Letters 33, 1302 -- 1308. URL: http://www.sciencedirect.com/science/article/pii/S0167865512000761, doi:https://doi.org/10.1016/j.patrec.2012.02.022.
- Hendricks et al. (2016a) Hendricks, D., Gebbie, T., Wilcox, D., 2016a. Detecting intraday financial market states using temporal clustering. Quantitative Finance 16, 1657--1678. URL: https://doi.org/10.1080/14697688.2016.1171378, doi:10.1080/14697688.2016.1171378, arXiv:https://doi.org/10.1080/14697688.2016.1171378.
- Hendricks et al. (2016b) Hendricks, D., Gebbie, T., Wilcox, D., 2016b. High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm. South African Journal of Science 112, 9. URL: https://www.sajs.co.za/article/view/4149, doi:10.17159/sajs.2016/20140340.
- Hinton and Salakhutdinov (2006) Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504--507. URL: http://science.sciencemag.org/content/313/5786/504, doi:10.1126/science.1127647, arXiv:http://science.sciencemag.org/content/313/5786/504.full.pdf.
- Hoshen and Kopelman (1976) Hoshen, J., Kopelman, R., 1976. Percolation and cluster distribution. i. cluster multiple labeling technique and critical concentration algorithm. Phys. Rev. B 14, 3438--3445. URL: https://link.aps.org/doi/10.1103/PhysRevB.14.3438, doi:10.1103/PhysRevB.14.3438.
- Hubert and Arabie (1985) Hubert, L., Arabie, P., 1985. Comparing partitions. Journal of Classification 2, 193--218.
- Jaynes (2003) Jaynes, E.T., 2003. Probability theory: The logic of science. volume II. Cambridge university press.
- Kruskal (1956) Kruskal, J.B., 1956. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society 7, 48--50. URL: http://www.jstor.org/stable/2033241.
- Kullmann et al. (2000) Kullmann, L., Kertesz, J., Mantegna, R., 2000. Identification of clusters of companies in stock indices via potts super-paramagnetic transitions. Physica A: Statistical Mechanics and its Applications 287, 412 -- 419. URL: http://www.sciencedirect.com/science/article/pii/S0378437100003800, doi:https://doi.org/10.1016/S0378-4371(00)00380-0.
- Li (2011) Li, J., 2011. Potts model clustering for discovering patterns of epigenetic marks. Ph.D. thesis. Rutgers University-Graduate School-New Brunswick.
- Lloyd (1982) Lloyd, S., 1982. Least squares quantization in pcm. IEEE Transactions on Information Theory 28, 129--137. doi:10.1109/TIT.1982.1056489.
- Lo (2004) Lo, A.W., 2004. The adaptive markets hypothesis. The Journal of Portfolio Management 30, 15--29. URL: http://jpm.iijournals.com/content/30/5/15, doi:10.3905/jpm.2004.442611, arXiv:http://jpm.iijournals.com/content/30/5/15.full.pdf.
- Marsili (2002) Marsili, M., 2002. Dissecting financial markets: sectors and states. Quantitative Finance 2, 297--302. URL: https://doi.org/10.1088/1469-7688/2/4/305, doi:10.1088/1469-7688/2/4/305, arXiv:https://doi.org/10.1088/1469-7688/2/4/305.
- Marti et al. (2017) Marti, G., Nielsen, F., Bińkowski, M., Donnat, P., 2017. A review of two decades of correlations, hierarchies, networks and clustering in financial markets. arXiv preprint arXiv:1703.00485 .
- McInnes et al. (2018) McInnes, L., Healy, J., Saul, N., Großberger, L., 2018. Umap: Uniform manifold approximation and projection. The Journal of Open Source Software 3, 861.
- Murua and Wicker (2014) Murua, A., Wicker, N., 2014. The conditional-potts clustering model. Journal of Computational and Graphical Statistics 23, 717--739. URL: https://doi.org/10.1080/10618600.2013.837828, doi:10.1080/10618600.2013.837828, arXiv:https://doi.org/10.1080/10618600.2013.837828.
- Musmeci et al. (2015) Musmeci, N., Aste, T., Di Matteo, T., 2015. Risk diversification: a study of persistence with a filtered correlation-network approach. Journal of Network Theory in Finance 1, 77--98. doi:10.21314/JNTF.2015.005.
- Noh (2000) Noh, J.D., 2000. Model for correlations in stock markets. Phys. Rev. E 61, 5981--5982. URL: https://link.aps.org/doi/10.1103/PhysRevE.61.5981, doi:10.1103/PhysRevE.61.5981.
- Olshen and Rajaratnam (2010) Olshen, R.A., Rajaratnam, B., 2010. Successive normalization of rectangular arrays. Annals of Statistics 38, 1638.
- Onsager (1944) Onsager, L., 1944. Crystal statistics. i. a two-dimensional model with an order-disorder transition. Phys. Rev. 65, 117--149. URL: https://link.aps.org/doi/10.1103/PhysRev.65.117, doi:10.1103/PhysRev.65.117.
- Osogami (2017) Osogami, T., 2017. Boltzmann machines and energy-based models. arXiv preprint arXiv:1708.06008 .
- Ott et al. (2004) Ott, T., Kern, A., Schuffenhauer, A., Popov, M., Acklin, P., Jacoby, E., Stoop, R., 2004. Sequential superparamagnetic clustering for unbiased classification of high-dimensional chemical data. Journal of Chemical Information and Computer Sciences 44, 1358--1364. URL: https://doi.org/10.1021/ci049905c, doi:10.1021/ci049905c, arXiv:https://doi.org/10.1021/ci049905c. pMID: 15272844.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine Learning in Python . Journal of Machine Learning Research 12, 2825--2830.
- Quiroga et al. (2004) Quiroga, R.Q., Nadasdy, Z., Ben-Shaul, Y., 2004. Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering. Neural Computation 16, 1661--1687.
- Rumelhart and McClelland (1987) Rumelhart, D.E., McClelland, J.L., 1987. Information Processing in Dynamical Systems: Foundations of Harmony Theory. Technical Report. University of Colorado at Boulder. URL: https://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6302931.
- Sokal (2005) Sokal, A.D., 2005. The multivariate tutte polynomial (alias potts model) for graphs and matroids, pp. 173--226. doi:10.1017/CBO9780511734885.009.
- Stanberry et al. (2007) Stanberry, L., Murua, A., Cordes, D., 2007. Functional connectivity mapping using the ferromagnetic potts spin model. Human Brain Mapping 29, 422--440. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/hbm.20397, doi:10.1002/hbm.20397, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/hbm.20397.
- Swendsen and Wang (1987) Swendsen, R.H., Wang, J.S., 1987. Nonuniversal critical dynamics in monte carlo simulations. Phys. Rev. Lett. 58, 86--88. URL: https://link.aps.org/doi/10.1103/PhysRevLett.58.86, doi:10.1103/PhysRevLett.58.86.
- West et al. (2001) West, D.B., et al., 2001. Introduction to graph theory. volume 2. Prentice hall Upper Saddle River.
- Wilcox and Gebbie (2007) Wilcox, D., Gebbie, T., 2007. An analysis of cross-correlations in an emerging market. Physica A: Statistical Mechanics and its Applications 375, 584 -- 598. URL: http://www.sciencedirect.com/science/article/pii/S0378437106010351, doi:https://doi.org/10.1016/j.physa.2006.10.030.
- Wilcox and Gebbie (2014) Wilcox, D., Gebbie, T., 2014. Hierarchical causality in financial economics. https://ssrn.com/abstract=2544327 .
- Wu (1982) Wu, F.Y., 1982. The potts model. Rev. Mod. Phys. 54, 235--268. URL: https://link.aps.org/doi/10.1103/RevModPhys.54.235, doi:10.1103/RevModPhys.54.235.
- Yelibi (2017) Yelibi, L., 2017. Potts model clustering. https://github.com/tehraio/potts-model-clustering.