Fast Super-Paramagnetic Clustering


Lionel Yelibi ylblio001@myuct.ac.za    Tim Gebbie tim.gebbie@uct.ac.za Department of Statistical Science, University of Cape Town, Rondebosch, South Africa
Abstract

We map stock market interactions to spin models to recover their hierarchical structure using a simulated annealing based Super-Paramagnetic Clustering (SPC) algorithm. This is directly compared to a modified implementation of a maximum likelihood approach we call fast Super-Paramagnetic Clustering (f-SPC). The methods are first applied to standard toy test-case problems, and then to a data-set of 447 stocks traded on the New York Stock Exchange (NYSE) over 1249 days. The signal to noise ratio of stock market correlation matrices is briefly considered. Our results approximately recover clusters representative of standard economic sectors, as well as mixed clusters whose dynamics shed light on the adaptive nature of financial markets and raise concerns relating to the effectiveness of industry based static financial market classification in the world of real-time data analytics. A key result is that we show that f-SPC maximum likelihood solutions converge to ones found within the Super-Paramagnetic Phase where the entropy is maximal, and that those solutions are qualitatively better for high dimensionality data-sets.

Econophysics, Potts Models, Unsupervised Learning, Clustering, Phase Transitions, Simulated Annealing
pacs:
05.10.Ln, 75.10.Nr, 89.65.Gh

I Introduction

We consider the problem of unsupervised statistical learning for feature selection and classification of financial time-series data from similarity measures that can be represented as correlation matrices. Concretely, we consider Potts model Wu (1982) based Blatt et al. (1996, 1997) methods optimized for performance Giada and Marsili (2001); Marsili (2002); Hendricks et al. (2016a, b) using a Maximum Likelihood Estimation (MLE) approach based on the ground-state Noh Ansatz Noh (2000) compared to the finite-temperature approach using a cooling schedule to select configurations based on the susceptibility Blatt et al. (1996, 1997).

We compare the cluster configurations from the fast clustering algorithms based on the ground state configurations Hendricks et al. (2016a, b) (Sec. A.2) with those based on finite temperature Simulated Annealing (SA) based Monte-Carlo Markov-Chain (MCMC) Swendsen and Wang (1987); Hoshen and Kopelman (1976) (see Sec. A.1) to generate the full dendrogram of cluster configurations Blatt et al. (1996, 1997). It was shown in Bonanno et al. (2001) that the clustering structure of financial assets is time horizon dependent, and, given a translation of correlations into a Euclidean distance, that it can be represented using Minimal Spanning Trees (MST) Kruskal (1956).

The Super-Paramagnetic Clustering (SPC) method originally developed in the early 1990s is a universal data clustering method Blatt et al. (1996) and has acquired a certain popularity for its implementation of the Maximum Entropy Principle (MEP) Jaynes (2003) where no assumptions are made about the distributions found in the data, and the number of clusters is revealed rather than predefined. SPC can be used in any environment as long as the features are embedded in an appropriate similarity metric. The method has been applied to chemical data using a sequential method and the Tanimoto similarity measure Ott et al. (2004), the detection and classification of spiking activity in neuronal tissue in Quiroga et al. (2004), the identification of regions of the brain with shared functionality in Stanberry et al. (2007), yeast genes profiles in Getz et al. (2000), histone modification data Li (2011), and image segmentation in Abramov et al. (2010).

Marsili and Giada Giada and Marsili (2001) were able to develop an efficient maximum likelihood clustering method for high dimensional data using the same spin-model inspired approach used in the SPC method. However, given the ill-posed nature of clustering, they chose to evaluate the likelihood model for robustness relative to different optimization methods in Giada and Marsili (2002). The method was then applied to the detection of clusters of assets, and of financial market states, in Marsili (2002) to uncover collective behavior of agents in a complex system. Hendricks et al. created a GPU based Parallel Genetic Algorithm (PGA) implementation to maximize the likelihood in near real time for the clustering of market states in quantitative trading applications Hendricks et al. (2016a).

In this work we explore the relationship between the SPC method of Domany-Wiseman-Blatt Blatt et al. (1996) as compared to that of Giada-Marsili Giada and Marsili (2001). A key finding here is that we are able to confirm that the likelihood method is Super-Paramagnetic and conforms well to the entropy method when dimensionality is sufficiently high (see Sec. V). We are also able to further optimize the algorithm based on the PGA implementation Hendricks et al. (2016a). However, more importantly, what we are really able to demonstrate is the variety of clustering problems that the SPC can quickly and easily handle, with the advantage of leveraging a mature and well-understood foundational theoretical framework from statistical mechanics that has many, if not most, viable alternative algorithms, as special cases. We are of the view that SPC type models grounded in the maximum entropy principle and within the general framework of energy-based machine learning offer a variety of research and development opportunities for building better and faster unsupervised learning algorithms.

Towards building this argument the paper is organized as follows: Section II gives a brief overview of Erdos-Renyi random graphs (see Sec. II.1) and their connection to the Potts models as special cases. We describe the inhomogeneous Potts Model as a data-clustering engine (see Sec. II.2). Section II.3 covers the implementations, first with the SPC algorithm (Sec. II.3.1) and then with Marsili's maximum likelihood method (Sec. II.3.2). We then proceed to discuss validation procedures linking maximum likelihood methods to Super-Paramagnetic Clustering in Sec. II.4. Section III goes over our choice of similarity metric (Sec. III.1), data pre-processing (Sec. III.2), and time-series noise filtering (Sec. III.3). Section IV provides toy test cases, and stock market applications. Section V gives a summary analysis of the results and their implications, and finally in Section VI the conclusion, and potential directions for further research on similar topics, are mentioned.

II Potts Clustering

II.1 Random Graphs

A graph is a mathematical model which formalizes pairwise relationships between objects West et al. (2001). Graphs are popular in complexity sciences because they provide a framework to model large systems of interacting components by representing the system directly through the pair-wise relationships between the components. The field has seen the rise of many models, each with their own assumptions and various nuances; almost all are of the generative form based on the premise of bottom-up causation. One feature that is useful in our context is that one can observe certain types of emergent dynamics i.e. “phase transitions” BRUSH (1967).

A general class of models called the Random Cluster model 111See Grimmett Grimmett (2006) and references therein. was developed by Fortuin-Kasteleyn in 1972 Fortuin and Kasteleyn (1972). It is a “random” graph generated by a probability distribution

P(\omega) = \frac{1}{Z}\left[\prod_{(i,j)} p_{ij}^{\,\omega_{ij}}\,(1-p_{ij})^{1-\omega_{ij}}\right] q^{C(\omega)} \qquad (1)

where $\omega$ is a given edge configuration (or adjacency matrix of the graph $G$), $C(\omega)$ is the number of clusters given $\omega$, $p_{ij}$ is the probability of nodes $i$ and $j$ being connected, and $Z$ is the normalization constant (also called the partition function in statistical mechanics Sokal (2005)). The adjacency matrix is linked to the probabilities by picking a value $r$ such that $\omega_{ij}=1$ if $p_{ij} \geq r$ and $\omega_{ij}=0$ otherwise. $P(\omega)$ is essentially the probability of the graph being connected as specified by the adjacency matrix $\omega$.

If we now set $q = 1$, the random cluster model reduces to an Erdos-Renyi random graph Erdos and Rényi (1960). Given the existence of only one class on the graph, bonds are linked independently from their respective states (i.e. they all have the same state) with equal probability $p$, with $N$ the number of nodes on the graph Erdos and Rényi (1960). An important generalization of this idea is the Barabási-Albert model Barabási and Albert (1999); this model works slightly differently: it starts with a low number of connected nodes, adds new nodes one at a time, each new node is able to connect to a fixed number of existing nodes, and every time a node is connected its degree increases. The probability of a new node connecting to node $i$ is proportional to its degree $k_i$. This means that as a node makes connections it becomes “popular” and succeeding nodes have a higher likelihood of connecting to that same node: this is the principle of “preferential attachment” which hopes to explain how some social networks are formed. These models are all based on a generative model that builds on microscopic causal relationships from the system components to the bulk.

The Fortuin-Kasteleyn random cluster model is closely related to the Potts model via its distribution

P(\sigma \mid \omega) = q^{-C(\omega)} \prod_{(i,j)\,:\,\omega_{ij}=1} \delta_{\sigma_i,\sigma_j} \qquad (2)

which is the conditional probability of the spin configuration $\sigma$ given the edge configuration $\omega$ Edwards and Sokal (1988). The marginal probability is recovered by summing over all spin configurations. The major difference the Potts model brings is entropy maximization, which assumes an exponential Boltzmann distribution of edge connections with $p_{ij}$ as the pairwise probabilities. The strength $J_{ij}$ captures the closeness between nodes and, with the cluster memberships, defines the topology of the graph. It is a central variable of the model. This is similar to the Bianconi-Barabási model Bianconi and Barabási (2001) which introduces a fitness which plays a related role as an add-on to the Barabási-Albert model Barabási and Albert (1999).

II.2 The Potts Model

The Ising model BRUSH (1967) simulates the existence of phase transitions in large systems of interacting particles. The model consists of spins arranged on an n-dimensional lattice. Ising's PhD Thesis BRUSH (1967) solved the 1D problem, which showed no phase transition, while Onsager provided an analytical solution using a transfer-matrix method Onsager (1944) for the 2D case. If we consider observations in our data sets as nodes with edges which link nodes together, it becomes natural to consider the data-set in the context of a Potts model Wu (1982). An edge is active or inactive with a probability dependent on the distance between two nodes. The collection of nodes and edges forms the graph which is navigated for clustering. Every node can be assigned, for example, a +1 or -1 spin (for the Ising model). Interactions are simulated by randomly changing the spin values in the graph, and new configurations are accepted or rejected at every step using the Swendsen-Wang MCMC algorithm Swendsen and Wang (1987).

The Potts model Wu (1982) is a generalization of the Ising model BRUSH (1967) allowing each spin to take one of $q$ values instead of 2. The parameter $q$ can be compared to the value $k$ in K-means used to fix the number of clusters. $q$ is the maximum number of classes: it must be chosen to be big enough to avoid clusters being forcefully merged together. The only inconvenience of a relatively high $q$ is the additional computational cost needed to perform the statistics after the system reaches thermal equilibrium.

The model is governed by a Hamiltonian Energy 222 The more general Potts Hamiltonian can be contrasted with one of its special cases in the form of the energy of a Boltzmann machine with bias $b$ and weight matrix $W$ for features $x$: $E(x) = -b^{\top}x - x^{\top}Wx$, with partition function $Z = \sum_{x} e^{-E(x)}$ Hinton and Salakhutdinov (2006); Osogami (2017); Ackley et al. (1985); Rumelhart and McClelland (1987) equation Wu (1982)

H(\sigma) = \sum_{\langle i,j \rangle} J_{ij}\left(1 - \delta_{\sigma_i,\sigma_j}\right) \qquad (3)

with $\sigma$ the spin vector assigned to our data, spins $\sigma_i \in \{1,\ldots,q\}$, and $N$ nodes. The Kronecker delta $\delta_{\sigma_i,\sigma_j}$ is 1 for equal spins and 0 otherwise. For data embedded in a metric space the Euclidean distance $d_{ij}$ is computed between two nodes.

$d_{ij}$ is fed to the strength function $J_{ij}$ which, in turn, measures similarity. Many models for the strength exist but their central feature is that they must decrease with distance. This is typically achieved with a function of the type $e^{-d_{ij}^{2}}$ or a power law, as seen in Blatt et al. (1996):

J_{ij} = \frac{1}{\hat{K}} \exp\left(-\frac{d_{ij}^{2}}{2a^{2}}\right) \qquad (4)

where $\hat{K}$ is the average number of neighbors per node, and $a$ is a local length scale: the average of all nearest-neighbor distances.

There are alternative choices for the local characteristic length scale Blatt et al. (1996). We only report the results obtained with the previous definition, and note that the adjustments to $a$ are problem dependent: higher values of $a$ ensure the first phase of the simulation is ferromagnetic while lower values start the simulation in the Super-Paramagnetic (SP) phase.
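To make the mapping from distances to interaction strengths concrete, the sketch below computes Eqn. (4) from a set of observations in Python. It is only an illustration: the use of the average K-nearest-neighbour distance for $a$, and the function name interaction_strengths, are assumptions of this sketch rather than the exact choices of Blatt et al. (1996).

import numpy as np

def interaction_strengths(X, K=10):
    """Sketch of Eqn. (4): J_ij = (1/K_hat) exp(-d_ij^2 / (2 a^2))."""
    # pairwise Euclidean distances d_ij
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # local length scale a: average distance to the K nearest neighbours
    a = np.sort(d, axis=1)[:, 1:K + 1].mean()
    K_hat = float(K)                      # average number of neighbours per node
    J = np.exp(-d ** 2 / (2.0 * a ** 2)) / K_hat
    np.fill_diagonal(J, 0.0)              # no self-interaction
    return J, d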

The objective is to compute averages of thermodynamic quantities after simulating the system at a given temperature for a set number of MCMC iterations until thermal equilibrium.

The first simulation serves to uncover the existence of a critical temperature at which a first transition occurs. Below it, all spins are strongly correlated and all have the same state: this is spontaneous magnetization (ferromagnetic phase). Above it, the single cluster breaks down into smaller ones (SP-phase). Furthermore, inside the temperature range where the SP-phase exists, a system can go through additional transitions: these reflect the different hierarchical structures present in the data. Finally, at sufficiently high temperature, we go through a final transition into complete disorder (Paramagnetic phase): the energy is high and all clusters dissolve.

The magnetization of the system is given by

m = \frac{q\,N_{\max}/N - 1}{q - 1}

where $N_{\max}$ is the size of the largest cluster. This quantity, which ranges from 0 to 1, expresses how dominated the system is by the largest cluster. The order parameter of the system is the average magnetization $\langle m \rangle$, and its variance $\chi = \frac{N}{T}\left(\langle m^{2}\rangle - \langle m\rangle^{2}\right)$ is called the susceptibility density. Both can be used to detect a phase transition: $\langle m \rangle$ dives down while $\chi$ peaks at every transition.

For a quantity $A$ the thermal average will be

\langle A \rangle = \sum_{s} P(s)\, A(s) \qquad (5)

Here each $s$ represents the state visited at a single MCMC step. If the number of steps $M$ is large enough, Eqn. (5) is equivalent to the arithmetic mean $\frac{1}{M}\sum_{t=1}^{M} A_{t}$. The probability of the system being in a particular state $s$ (referring to the energy of the system $E(s)$) is:

P(s) = \frac{e^{-E(s)/T}}{Z} \qquad (6)

where $e^{-E(s)/T}$ is the Boltzmann factor, and $Z$ is the partition function and the normalization constant of the Gibbs-Boltzmann distribution.

Numerically, we use a mean-field form of the Hamiltonian. The motivation is that large values of the energy lead to Boltzmann factors close to 0, which by definition makes the numerical computation of $Z$ impossible; the scale of the energy also impacts the temperature range explored.

II.2.1 Maximum Entropy

We briefly remind ourselves of the MEP Jaynes (2003). We define a statistical mechanical system as an ensemble of $N$ objects, each in their respective micro-states (spin values $\sigma_i$), so that the resulting microscopic state of the entire system can be used to derive parameters which characterize the distributions of the macroscopic variables of interest (here the internal energy $U$). We assume that at equilibrium thermodynamic systems obey conservation of energy, which sets the constraints of the system such that on average the energy is a constant, and then from Eqn. (5) it follows that:

\langle E \rangle = \sum_{s} P(s)\, E(s) = U \qquad (7)

We then consider the distribution representative of the energy of the system to be the one which incorporates our constraints and assumes nothing else. This maximizes the Shannon entropy defined by:

S = -\sum_{s} P(s) \ln P(s) \qquad (8)

The problem can then be reduced to a Lagrange optimization task for which the exponential family of distributions is a well known solution (see Eqn. (6)) with the inverse temperature $\beta = 1/T$ as its Lagrange multiplier Jaynes (2003).
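For completeness, a textbook sketch of that Lagrange optimization (standard material, not specific to this paper): maximize $S$ subject to normalization and the energy constraint of Eqn. (7),

\mathcal{L} = -\sum_{s} P(s)\ln P(s) - \lambda_{0}\Big(\sum_{s} P(s) - 1\Big) - \beta\Big(\sum_{s} P(s)E(s) - U\Big),

\frac{\partial \mathcal{L}}{\partial P(s)} = -\ln P(s) - 1 - \lambda_{0} - \beta E(s) = 0 \quad\Rightarrow\quad P(s) = \frac{e^{-\beta E(s)}}{Z}, \qquad Z = \sum_{s} e^{-\beta E(s)},

which is exactly the Gibbs-Boltzmann distribution of Eqn. (6).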

II.3 Super-Paramagnetic Clustering (SPC)

II.3.1 A Maximum Entropy Method

We define a neighbor on a lattice to be a node located in the vicinity of another node, such that a node at site $(i,j)$ will have neighbors $(i \pm 1, j)$ and $(i, j \pm 1)$. A neighborhood generated using these rules is valid for a 2D lattice with a fixed $J$ for all nodes (the interaction strength is said to be homogeneous). This is the original method used in simulating Ising/Potts models of ferromagnets. As a generalization to the problem of data clustering, we will consider a neighborhood which emerges from the inhomogeneous interaction strength $J_{ij}$. Every neighborhood is a mini-graph, and their aggregation constitutes a graph whose topology is determined by the matrix $J$: for two nodes to be neighbors they each must be included in their respective K-nearest neighbors.

We first define the neighborhood of size K. This is implemented following the steps in Table (A) below.

Pick a $K$: Build the nodenext matrix as the array containing the locations of every node's neighbors. We use a fixed $K$ (except when otherwise explicitly stated). Form the MST: Add edges to nodenext to make every graph connected regardless of $K$.

Table A: Setting the neighborhood size K for SPC: The neighborhood size $K$ determines the scope of the algorithm. It effectively cancels the pairwise interaction strengths of spins outside each spin's respective neighborhood, thus producing a speed-up in the computation of the Hamiltonian. Blatt et al. (1996)
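A possible sketch of Table A in Python: mutual K-nearest-neighbour edges augmented with the MST so the graph stays connected. The helper name build_neighbourhood and the use of SciPy's minimum_spanning_tree are choices of this sketch, not of the original implementation.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def build_neighbourhood(d, K=10):
    """Sketch of Table A: mutual K-nearest-neighbour edges plus MST edges.

    d : (N, N) symmetric distance matrix.
    Returns a sorted list of undirected edges (i, j) with i < j.
    """
    N = d.shape[0]
    order = np.argsort(d, axis=1)[:, 1:K + 1]   # K nearest neighbours of each node
    edges = set()
    for i in range(N):
        for j in order[i]:
            if i in order[j]:                   # mutual neighbourhood requirement
                edges.add((int(min(i, j)), int(max(i, j))))
    # add MST edges so the graph is connected regardless of K
    mst = minimum_spanning_tree(csr_matrix(d)).tocoo()
    for i, j in zip(mst.row, mst.col):
        edges.add((int(min(i, j)), int(max(i, j))))
    return sorted(edges)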

The graph is traversed via its nodes, and edges can be set active or inactive with probability

p_{ij} = 1 - \exp\left(-\frac{J_{ij}}{T}\,\delta_{\sigma_i,\sigma_j}\right) \qquad (9)

The next array is called links. It is the adjacency matrix where the activation status of edges is stored, such that $\text{links}_{ij} = 1$ if $\text{rand} < p_{ij}$, and 0 otherwise.

The original Hoshen-Kopelman (HK) algorithm Hoshen and Kopelman (1976) is the standard for labeling clusters in many statistical mechanics problems. The 2D version is mostly restricted to two neighbors per node: $(i-1,j)$ and $(i,j-1)$. SPC (Sec. A.1) deals with problems where the number of neighbors is not fixed, and K can be large, so we apply the extension of HK to non-lattice environments found in Al-Futaisi and Patzek (2003).

The clusters are labeled using the extended HK algorithm (See Table B below).

Initialize counter labels: set nodel, a 1xN array used to store labels, to 0. Check for activated edges: if a node has none (it is unlinked to occupied nodes), (2a.) create a new cluster label, else (2b.) find the root nodes of its labelled neighbours, store the smallest root in nodel, and record the label equivalences in nodelp. Sequentialize the recorded node labels in nodelp. Relabel nodes: nodel ← nodelp.

Table B: Labeling: The extended Hoshen-Kopelman (HK) algorithm: reads data from a matrix of bonds indicating spin pairwise associations and creates the clusters (also see Appendix for code patterns 2). The HK algorithm is inspired by the union-find algorithm which “finds” the root class of nodes and “unites” nodes belonging to the same class. Al-Futaisi and Patzek (2003)
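The extended HK labelling of Table B can be sketched with an equivalent union-find pass (a simplified stand-in for the authors' code pattern; the function name label_clusters is ours):

def label_clusters(N, active_edges):
    """Union-find sketch of the extended Hoshen-Kopelman labelling (Table B).

    N            : number of nodes.
    active_edges : iterable of (i, j) bonds that were activated.
    Returns a list of sequential cluster labels, one per node.
    """
    parent = list(range(N))

    def find(i):                                  # "find" the root label of node i
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path compression
            i = parent[i]
        return i

    for i, j in active_edges:                     # "unite" nodes sharing an active bond
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)     # keep the smallest root, as in step 2b
    roots = [find(i) for i in range(N)]
    relabel = {r: k for k, r in enumerate(sorted(set(roots)))}   # sequentialize labels
    return [relabel[r] for r in roots]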

Once the graph is fully constructed, the next step is to implement the Swendsen-Wang MCMC algorithm (see Table C below).

Assign Spin Values: The resulting clusters are all flipped independently from each other and each gets assigned a new spin value drawn from the $q$ possible values.

Table C: Flipping Clusters using Swendsen-Wang MCMC Swendsen and Wang (1987) (see code-pattern in 3 )
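Putting the pieces together, one Swendsen-Wang step could be sketched as below; it reuses label_clusters from the previous sketch, uses the bond probability of Eqn. (9), and the function name swendsen_wang_step is an assumption of this illustration.

import numpy as np

def swendsen_wang_step(spins, edges, J, T, q, rng):
    """Sketch of one SW step: freeze bonds, label clusters, flip them (Table C)."""
    N = len(spins)
    active = []
    for i, j in edges:
        if spins[i] == spins[j]:
            p = 1.0 - np.exp(-J[i, j] / T)        # Eqn. (9) for aligned spins
            if rng.random() < p:
                active.append((i, j))
    labels = label_clusters(N, active)            # extended HK labelling (Table B)
    new_spin = {c: rng.integers(q) for c in set(labels)}   # independent flip per cluster
    return np.array([new_spin[c] for c in labels]), labels

Typical usage would be rng = np.random.default_rng() followed by repeated calls at a fixed temperature until thermal equilibrium, after which the statistics of Table D are accumulated.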

Finally, the spin-spin correlation, the average probability of two spins being in the same cluster, is computed in two steps (see Table (D)).

Increment two-point correlations: At every MCMC step, the two-point connectedness $c_{ij}$ is incremented if $i$ and $j$ are clustered together. Compute spin-spin correlations: Once the simulation ends for the temperature explored we compute $G_{ij} = \left[(q-1)\,\langle c_{ij} \rangle + 1\right]/q$.

Table D: Compute the Spin-Spin Correlation $G_{ij}$ from the incremented two-point correlations $c_{ij}$. Blatt et al. (1996)

The spin-spin correlation is probably the most important quantity as it is used to build the final clusters. The threshold $\theta$, above which two nodes are members of the same cluster, is picked to be higher than $1/q$ but less than 1. The bounds on that range are explained by the distribution of $G_{ij}$, which peaks at those two values: they are respectively the peak inter- and intra-cluster correlations 333Uncorrelated nodes have $G_{ij} \approx 1/q$, and correlated nodes $G_{ij} \approx 1$.. It is typical to use $\theta = 0.5$ as it does not significantly affect the results in previous examples Blatt et al. (1996).
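The two steps of Table D and the final thresholding can be sketched as follows; the estimator written here, $G_{ij} = [(q-1)\langle c_{ij}\rangle + 1]/q$, is our reading of the Blatt et al. (1996) formula, and the threshold defaults to 0.5 as in the text.

import numpy as np

def spin_spin_correlation(label_history, q):
    """Average two-point connectedness over MCMC steps, turned into G_ij (Table D)."""
    M = len(label_history)
    N = len(label_history[0])
    c = np.zeros((N, N))
    for labels in label_history:
        labels = np.asarray(labels)
        c += (labels[:, None] == labels[None, :]).astype(float)   # same-cluster indicator
    c /= M                                                        # <c_ij>
    return ((q - 1.0) * c + 1.0) / q

def clusters_from_correlation(G, theta=0.5):
    """Final clusters: link nodes whose spin-spin correlation exceeds theta."""
    N = G.shape[0]
    edges = [(i, j) for i in range(N) for j in range(i + 1, N) if G[i, j] > theta]
    return label_clusters(N, edges)               # union-find sketch from Table B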

II.3.2 A Maximum-Likelihood Method

Based on an analysis of the spectral properties of stock market correlation matrices, Noh Noh (2000) makes the following statistical ansatz: assume an existing market hierarchy where individual stock dynamics are dependent on clusters of similar stocks. This can be illustrated by a simple model as follows:

x_{i} = f_{s_i} + \epsilon_{i} \qquad (10)

where $x_i$ are the stock's features, $f_{s_i}$ the cluster-related influence, and $\epsilon_i$ the node's specific effect.

In Giada and Marsili (2001) Giada and Marsili formally develop a Potts model using Noh's idea, and in Hendricks et al. (2016a) Hendricks, Gebbie, and Wilcox solved the optimization problem using a PGA for unsupervised learning in quantitative trading. Let us consider a group of $N$ observations embedded in a space with dimensionality $D$ as the features; as with SPC (Sec. A.1), every observation is assigned a spin value. One version of the ansatz models the observation features such that

x_{i}(d) = g_{s_i}\,\eta_{s_i}(d) + \sqrt{1 - g_{s_i}^{2}}\;\epsilon_{i}(d) \qquad (11)

where $x_i(d)$ is one feature, $g_{s_i}$ the intra-cluster coupling parameter 444 The thermal averages of the model parameters can be used to reconstruct data-sets sharing identical statistical features with the original time-series by using Eqn. (11) Giada and Marsili (2001) , $\eta_{s_i}$ the cluster-related influence, and $\epsilon_i$ the observation's specific effect and measurement error. A covariance analysis yields additional terms such as the size $n_s$ of cluster $s$, and the intra-cluster correlation $c_s$ 555Here $n_s = \sum_{i} \delta_{s_i,s}$ and $c_s = \sum_{i,j} C_{ij}\,\delta_{s_i,s}\,\delta_{s_j,s}$, with $C_{ij}$ the Pearson correlation matrix Giada and Marsili (2001); Hendricks et al. (2016a)..

We explicitly mention that $n_s \leq c_s < n_s^2$ must be enforced: the lower bound is required because the likelihood is undefined for values of $c_s < n_s$, and the upper bound requires a strict inequality because Eqn. (13) is undefined when $c_s = n_s^2$. We introduce a Dirac-delta function 666Let $\delta(x)$ denote a delta function of $x$ which is 1 when $x = 0$, and 0 otherwise. to model the probability of observing data in a configuration close to criticality:

P(\{x_i\} \mid \mathcal{G}, S) = \left\langle \prod_{i=1}^{N}\prod_{d=1}^{D} \delta\!\left( x_{i}(d) - g_{s_i}\,\eta_{s_i}(d) - \sqrt{1 - g_{s_i}^{2}}\;\epsilon_{i}(d) \right) \right\rangle_{\eta,\epsilon} \qquad (12)

This joint likelihood is the probability of a cluster configuration matching the observed data for every observation, and for every feature. The log-likelihood $L_c$ derived from it can be thought of as the Hamiltonian of this Potts system:

L_c = \frac{1}{2} \sum_{s\,:\,n_s > 1} \left[ \ln\frac{n_s}{c_s} + (n_s - 1)\ln\frac{n_s^{2} - n_s}{n_s^{2} - c_s} \right] \qquad (13)

The likelihood is computed over every feature, and $L_c$ represents the amount of structure present in the data. The value of $L_c$ is indirectly dependent on the spins via the terms $n_s$ and $c_s$.

A priori advantages of this method over industry standard alternatives: First, $L_c$ is completely dependent on the correlation matrix $C_{ij}$, and the dimensionality of the dataset only plays a part in computing $C_{ij}$; and Second, it is unsupervised: there is no preset number of clusters. Clustering configurations are randomly generated, and the one which maximizes $L_c$ provides us with the number of clusters, and their compositions.
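As an illustration of how cheap the objective is to evaluate, the sketch below computes Eqn. (13) directly from a Pearson correlation matrix and a candidate spin configuration. The function name and the convention that clusters with $n_s \leq 1$ or $c_s \leq n_s$ contribute zero are choices of this sketch.

import numpy as np

def giada_marsili_likelihood(C, spins):
    """Sketch of Eqn. (13): L_c from the correlation matrix C and spin labels."""
    spins = np.asarray(spins)
    Lc = 0.0
    for s in np.unique(spins):
        idx = np.where(spins == s)[0]
        n_s = len(idx)
        if n_s <= 1:
            continue
        c_s = C[np.ix_(idx, idx)].sum()            # intra-cluster correlation (incl. diagonal)
        if c_s <= n_s or c_s >= n_s ** 2:          # enforce n_s < c_s < n_s^2
            continue
        Lc += 0.5 * (np.log(n_s / c_s)
                     + (n_s - 1) * np.log((n_s ** 2 - n_s) / (n_s ** 2 - c_s)))
    return Lc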

Further modification of the model can be made to reduce the Hamiltonian to that of the standard K-means algorithm Lloyd (1982):

(14)

The f-SPC algorithm (Sec. 4) uses a PGA to find the global optimum of the likelihood (13).

The principles of our GA are given in Table (E) below.

1.) Generate Populations: Generate the populations as a set of randomly generated Potts configurations with spin values ranging from 1 to $q$. 2.) Evaluate Fitness: Use the computation of $L_c$. 3.) Select the Best Individuals. 4.) Mutate: A set number of individuals in the populations are mutated. 5.) Recombine: The parent and child generations are recombined and again selection of the best individuals takes place. 6.) Iterative Convergence: Repeat 2.) to 5.) until sufficient convergence has been achieved.

Table E: f-SPC PGA Implementation: This Genetic Algorithm has no crossover step where parents would be mated. The mutations are the main genetic diversity operator. Mutations and Likelihood computations are evaluated in parallel Hendricks et al. (2016a) (see code-pattern 4).

The original PGA algorithm implemented in Gebbie et al. () contained a mating step involving a bespoke cross-over function, and a restriction: only parents with the same number of clusters could be mated, and the resulting children had to maintain the same characteristic. The enforcement of this rule restricted cluster merging and splitting to mutations only. Our f-SPC implementation removes this intermediary step. This was done in order to decrease the computational cost 777 The algorithm is able to double the number of generations from 250 to 500 for a 5 min simulation (Iris, see Sec. IV.3).

In addition to the diversity of individuals present in the population, mutations serve as GA diversity operators: They increase the genetic diversity, and send the system onto another path toward higher local maxima. We used six equally weighted types of mutations: i) New: A complete new individual, ii) Split: a random cluster split into two, iii) Merge: two clusters merged at random, iv) Swap: two spin labels are exchanged, v) Scramble: where a sequence of spins have their labels re-assigned in reverse order, and, vi) Flip: cluster (spin) labels are randomly re-assigned using the total cluster number (See SW Table C ).
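The six operators can be sketched roughly as below; this is a simplified illustration of the list above, with uniform operator choice, and is not the exact f-SPC implementation.

import numpy as np

def mutate(spins, q, rng):
    """Apply one of the six f-SPC-style mutation operators to a spin configuration."""
    s = np.array(spins)
    N = len(s)
    op = rng.integers(6)                              # six equally weighted operators
    if op == 0:                                       # New: a completely new individual
        return rng.integers(q, size=N)
    if op == 1:                                       # Split: one random cluster split in two
        members = np.where(s == rng.choice(np.unique(s)))[0]
        s[rng.choice(members, size=len(members) // 2, replace=False)] = s.max() + 1
        return s
    if op == 2:                                       # Merge: two clusters merged at random
        labels = np.unique(s)
        if len(labels) > 1:
            a, b = rng.choice(labels, size=2, replace=False)
            s[s == b] = a
        return s
    if op == 3:                                       # Swap: two spin labels exchanged
        i, j = rng.choice(N, size=2, replace=False)
        s[i], s[j] = s[j], s[i]
        return s
    if op == 4:                                       # Scramble: reverse a sequence of spins
        i, j = sorted(rng.choice(N, size=2, replace=False))
        s[i:j + 1] = s[i:j + 1][::-1]
        return s
    relabel = {c: rng.integers(q) for c in np.unique(s)}   # Flip: re-assign cluster labels
    return np.array([relabel[c] for c in s])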

Finally, the simulation has converged once the fitness of the best individual hasn't increased for a pre-determined number of iterations called “Stall Generations”. It should also be noted that although we did not recode a CUDA implementation of our refined PGA algorithm for direct GPU execution, this can be implemented using a modified version of the original CUDA implementation Hendricks et al. (2016a).

II.4 Super-Paramagnetic Phase Validation

The goal of this project was to effectively validate the existing link between the solutions obtained using $L_c$ and the original Potts Hamiltonian $H$. We proceeded by comparing clustering configurations obtained using the two models for the data set in Sec. IV.5.

Shown in Fig. (0(a)), Fig. (0(b)), and Fig. (0(c)) are the Adjusted Rand Indices as functions of temperature for our three cases. We compute the ARI at every temperature taking f-SPC as the true classification, and SPC as the candidates. The maximal ARI values are respectively 0.175, 0.6, and 0.05, and the temperatures at which they occur are all located in the SP-phase where the susceptibility is non-zero, confirming the claim in Giada and Marsili (2001) that the maximization of $L_c$ should recover a clustering configuration of a system in the vicinity of the phase transition. We also note that despite SPC's neighborhood search restrictions, the Normalized (IMN) f-SPC solution has the highest similarity to its SPC counterpart. The “Market Mode” case of f-SPC has a large mixed cluster, and the RMT one has its largest cluster mixed with securities from every economic sector. This would require further analysis, but we think the main difference between these and SPC's equivalents is this cluster, which could be of low correlation.

(a) Market Mode
(b) IMN
(c) RMT
Figure 1: In figures (a), (b) and (c) find the ARI (See Sec. IV) for the following cases: (a) with a market mode (See Sec. III.3), (b) de-noising with IMN (see Sec. III.3.2) and (c) when a RMT method is used to clean the correlation matrix (See Sec. III.3.1). The ARI index expresses configuration similarity on [0,1] Hubert and Arabie (1985). Blue dots represent ARI values, and the red line the curve linking them. We looked at the NYSE S&P500 447 stocks data. In all 3 cases we compare the f-SPC method (See Sec. II.3.2) to each of the SPC candidates (See Sec. IV.5). This demonstrates that in all 3 cases the maximum likelihood candidates are close to solutions recovered within the Super-Paramagnetic Phase.

We push the analysis further by considering $L_c$ as a clustering quality evaluator. As was demonstrated in Giada and Marsili (2002), $L_c$ is a consistent objective function which, if maximized, can discriminate between clustering algorithms. $L_c$ plays a role similar to the “Silhouette Coefficient” and the “Calinski-Harabasz Index”, which are methods used to evaluate cluster definition when the ground truth class is not known.

(a) Market Mode
(b) IMN
(c) RMT
Figure 2: S&P500 (Sec. IV.5) N = 447 Stocks, D = 1249 trading days: We computed the Likelihood Giada and Marsili (2001) (13) of every SPC solution for all temperatures (red curve, and blue dots), and the Likelihood of f-SPC's solution (blue horizontal line), in a) the Market Mode case, in b) the Normalized (IMN) case, and in c) the RMT case. Every f-SPC solution has a higher likelihood than the SPC solutions over the entire temperature range in every case. f-SPC solutions are composed of clusters with higher correlation than SPC candidates.

We test for this by evaluating all SPC candidates for their $L_c$ values, and we add a horizontal line on each plot indicating the respective f-SPC $L_c$. In every case, SPC's $L_c$ starts low at low T, reaches a maximum at intermediate T and decreases slowly as T increases into Paramagnetic territory. This is yet another confirmation that higher $L_c$ values are located in an intermediate temperature regime which coincides with the SP-phase when the system is critical. SPC's maxima are 105, 43, and 170 respectively in Fig. (1(a)), Fig. (1(b)) and Fig. (1(c)), and we observe that solutions recovered using f-SPC all have higher likelihoods than SPC's. Based on the results in this paper and in Giada and Marsili (2002), one could argue that f-SPC produces better clustering candidates than SPC, at least in this case.

II.4.1 Free-Energy Validation

We now define the Helmholtz Free Energy for a thermodynamic process of a system in contact with a heat bath.

F = U - TS \qquad (15)

where $U$ is the internal energy of the system (see Eqn. (3), and (13)), $T$ is the temperature of the heat bath or reservoir in contact with the system, and $S$ is the entropy.

F = -T \ln Z \qquad (16)

The free energy can also be computed using (16) with $Z$ as the partition function as in Eqn. (6). 888Here we are implicitly using a coarse-grained approach to truncation over the Monte-Carlo replications; numerically computing the thermodynamic averages can often be done more effectively using the replica method Castellani and Cavagna (2005) with a prudent choice of the number of replications.

We consider an isothermal process: a system exchanging energy with a reservoir at constant $T$ by absorbing heat until its own temperature converges to that of the reservoir. For processes such as the one just described, Eqn. (16) tells us that some of the energy needed for the system to be realized can be spontaneously transferred from the reservoir by heating. In this sense, for systems on which no work is done, thermal equilibrium is reached when the free energy reaches a minimum.

Using mean-field models, it was shown in Blatt et al. (1996) and Giada and Marsili (2001) that the free energy reaches a local minimum within the Super-Paramagnetic or Clustered Ferromagnetic Phase, and a maximum at the Paramagnetic Phase transition. We argue that the temperature at which the previously mentioned minimum happens corresponds to the heat bath inside of which the system reaches its “lowest level”.

(a) Evolution of the internal energy at fixed temperature as MCMC steps occur. No convergence per se to a unique value, but oscillations around a level. The blue line is the thermal average.
(b) Histogram of the distribution of the internal energy at the same temperature with the number of bins calculated using (17).
Figure 3: BRICS Data: SPC's internal energy. In a) the energy as it converges, and in b) its binned distribution.

Eqn. (16) requires computing $Z$ whereas Eqn. (15) needs the average energy $U$ and the entropy $S$. Although we don't show it, these two methods agree. At any given temperature the system doesn't converge to a specific energy level but displays a distribution (see Fig. (2(a))). The task at hand is now about picking a number of bins for our MCMC simulation which is consistent and not arbitrary. We borrow a “low bias” method from Hacine-Gharbi et al. (2012) which follows:

B = \operatorname{round}\left( \frac{\xi}{6} + \frac{2}{3\xi} + \frac{1}{3} \right) \qquad (17)

where $B$ is the number of bins, and

\xi = \left( 8 + 324 M + 12\sqrt{36 M + 729 M^{2}} \right)^{1/3}

with $M$ the number of samples, here the total number of MCMC steps. Because we fix $M$ to 2000 the number of bins remains set for every temperature, and the bin edges are set on the minimum and maximum possible energies depending on the problem. The Hamiltonian minimum energy is always 0 (for the ferromagnetic case), and in the case of the BRICS data the maximum is shown in Fig. (2(b)). We then determine the distribution energy levels by picking the bin centers, which we compute by taking the mean of the distribution inside each bin. Once obtained we can compute $P(E)$, the Boltzmann distribution of energies, which we use to compute the thermal average energy $U$, the entropy $S$ using Eqn. (8), and the free energy $F$.
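A sketch of this free-energy estimate in Python, assuming the energies sampled by the MCMC at temperature T already follow the Gibbs-Boltzmann distribution so that normalized bin counts estimate P(E); bin centres are taken as midpoints here for simplicity, and the function names are ours.

import numpy as np

def hacine_gharbi_bins(M):
    """Low-bias bin number of Eqn. (17) for M samples (our reading of the rule)."""
    xi = (8 + 324 * M + 12 * np.sqrt(36 * M + 729 * M ** 2)) ** (1.0 / 3.0)
    return int(round(xi / 6.0 + 2.0 / (3.0 * xi) + 1.0 / 3.0))

def free_energy(energies, T, e_min, e_max):
    """Estimate U, S and F = U - T*S from MCMC energy samples at temperature T."""
    B = hacine_gharbi_bins(len(energies))
    counts, edges = np.histogram(energies, bins=B, range=(e_min, e_max))
    keep = counts > 0
    P = counts[keep] / counts.sum()                    # empirical Boltzmann weights P(E)
    centers = 0.5 * (edges[:-1] + edges[1:])[keep]     # bin centres as energy levels
    U = float(np.sum(P * centers))                     # thermal average energy, Eqn. (5)
    S = float(-np.sum(P * np.log(P)))                  # Shannon entropy, Eqn. (8)
    return U, S, U - T * S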

(a) BRICS DATA, Cluster sizes as a function of Temperature for SPC solutions. The vertical lines respectively show the temperatures of maximum ARI, and of minimum Free Energy / maximum entropy.
(b) BRICS DATA, Free Energy as a function of Temperature for SPC solutions (in red). Insets: (top right) Entropy, (center right) Susceptibility, (bottom right) ARI. The vertical lines respectively show the temperatures of maximum ARI, and of minimum Free Energy / maximum entropy.

In Figures (3(a)) and (3(b)) we show the results of an SPC simulation on the BRICS data (See Sec. IV.6): we show the free energy as a function of temperature for the SPC simulation, and we also plot the Adjusted Rand Index of the f-SPC solution against the SPC ones. We briefly describe the behavior of the free energy, which decreases to a minimum and later reaches a maximum. A quick look at Fig. (3(a)) shows that before the minimum the giant cluster is breaking down inside the Super-Paramagnetic Phase until the cluster sizes are comparable and seemingly stable. After it, the cluster sizes become unstable and start decreasing, and it becomes impossible to significantly distinguish clusters, which signals a transition into the Paramagnetic Phase. More importantly, the ARI curve peaks close to the minimum free energy temperature within the Super-Paramagnetic Phase, thus revealing that f-SPC's algorithm and objective function minimize the free energy (maximize the entropy) of the system as they maximize $L_c$.

II.4.2 Sci-Kit learn: Varying Density Clusters

The problem consists of 3 clusters generated using the Sci-kit learn samples generator 999B. Thirion, G. Varoquaux, A. Gramfort, V. Michel, O. Grisel, G. Louppe, J. Nothman, ’make_blobs’, 2017. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html. [Accessed: 12-Jun-2018]. We observed two cases of this problem: a 3D case, and a 500D case.

Fig. (3(c)) shows the susceptibility and the cluster sizes as functions of temperature. In the SP-phase, we observe 3 clusters in Fig. (3(c)) of size 167, 166, and 166 with an ARI of 1.

We follow this with f-SPC's solution, which scores 1460.47 with an ARI of 0.20, while the expected likelihood was 599.91. Yet again our solution's likelihood is higher than our expectation, and the real clusters are divided into smaller ones. In comparison K-Means and DBSCAN respectively achieve ARIs of 1, and 0.8. In light of this, we decided to try the same problem again but with the dimensionality set to 500. As expected, SPC's results do not change. However, in this instance f-SPC's solution quickly converges to the best classification. A further investigation was done by simulating both the 3D and 500D cases using SPC, and computing the likelihood for every configuration. The assumption is that the maximum likelihood should be found within the temperature range where the system is in the SP-phase.

(c) Size vs $T$
(d) 3D Case: $L_c$ vs $T$
(e) 500D Case: $L_c$ vs $T$
Figure 4: in a) 3D blobs (Sec. II.4.2): Size vs Temperature using SPC (Sec. II.3.1). 1 giant cluster at low temperature, and 3 stable clusters across the SP-phase. Insets: (in a) above right) Susceptibility $\chi$, and (in a) below right) Average Magnetization $\langle m \rangle$. $\chi$ peaks at the transition into the SP-phase, remains stable, and then slowly decreases to 0 in the Paramagnetic Phase. in b) the likelihood of SPC for 3D clusters, and 500D in c). When D is low, $L_c$ is stable until the transition into the Paramagnetic Phase. Where we would expect a decrease in $L_c$, we instead see an increase as $T$ goes up, which, as can be seen from a), signals the Paramagnetic Phase. The 500D case in c), on the other hand, peaks early within the SP-phase and then transitions into the Paramagnetic Phase. Once there, a net decrease in $L_c$ happens, and as $T$ goes up the slope of $L_c$ remains negative.

Figures (3(d)) and (3(e)) respectively show the likelihood as a function of temperature. We notice that in the 500D case in Fig. (3(e)), the maximum likelihood is found at temperatures within the SP-phase, and the likelihood decreases monotonically at higher temperatures. The opposite happens in Fig. (3(d)) where the best classification doesn't correspond to the maximum likelihood, which in this case is found at high temperatures which, by looking at Fig. (3(c)), means we are effectively in the Paramagnetic phase. We provide additional comments in the discussion section of this paper.

III Data Pre-Processing

III.1 The Distance Function

The wide variety of problems our clustering methods can tackle necessitates a careful choice of pairwise distances if we are to properly identify shared behavior. We will proceed by using the Euclidean distances whenever we assume independence of the features, and the Pearson correlations otherwise especially for problems where the features consist of time-series.

This has implications for both algorithms: we use the Euclidean distance or the Pearson correlation distance for SPC, while for f-SPC we use the Pearson correlation matrix, and the similarity matrix, which is the Euclidean distance matrix computed on the correlations and subtracted from 1.

We note from Kullmann et al. (2000) that our Eqn. (4) can be modified to incorporate negative correlations, but the authors explain this only affects the results at the ground state (i.e. at $T = 0$).

III.2 Scaling

Raw data sets often contain features on different orders of magnitude, outliers, and missing data, all of which can have a significant impact on Machine Learning algorithms. One way to deal with these issues is through normalization of the features. This was achieved using the Min-Max Scaling technique which puts all features on a 0 to 1 scale by performing the following operation:

x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (18)

Scaling has significant effects on the feature space: one example is seen in Fig. (8(a)), and Fig. (8(b)) which respectively show the unscaled and scaled plots of the 3 wines problem. The unscaled data set has two classes completely inseparable whereas scaling the data effectively dissociates all three classes with minimal overlap.

(a) Market Mode
(b) IMN
(c) RMT
Figure 5: Distribution of Pearson correlations of daily returns for 447 publicly traded companies on the S&P 500 stock exchange from 8/13/2012 to 8/11/2017 (Sec. IV.5). The “Market Mode” (Sec. III.3): noisy markets like in a) are highly correlated with most correlations positive. The noise is cleaned by removing the “Market Mode” either by IMN (Sec. III.3.2) in b) or RMT methods (Sec. III.3.1) in c), which produces distributions centered around 0.

III.3 Noise

The next and final pre-processing task consists of removing any noise present in our data. This is especially important for financial market time-series, which exhibit extreme randomness and possibly chaotic behavior. Stock market correlation matrices are noisy and positively skewed (Fig. (4(a))), which translates into what is referred to as the “Market Mode”. We consider an intermediary step which consists of “removing” the market mode, thus ensuring we are able to recover the underlying correlation structures, if any, present in the system.

III.3.1 Random Matrix Theory (RMT)

We follow the predictions of RMT in Wilcox and Gebbie (2007) by assuming that stock market returns are IID random variables with zero mean and unit variance. These assumptions lead to the conclusion that stock market correlations should all be zero, and, if the assumptions are indeed true, RMT predicts that the eigenvalues of the random correlation matrices are Wishart distributed such that:

\rho(\lambda) = \frac{Q}{2\pi\sigma^{2}} \frac{\sqrt{(\lambda_{+} - \lambda)(\lambda - \lambda_{-})}}{\lambda} \qquad (19)

where $Q = D/N$, and $\lambda_{\pm} = \sigma^{2}\left(1 \pm \sqrt{1/Q}\right)^{2}$.

Figure 6: The Eigenvalue distribution of the Correlation Matrix of 1249 daily returns for 447 publicly traded companies in the S&P500 (Sec. IV.5). The two vertical red lines delimit the Wishart range $[\lambda_{-}, \lambda_{+}]$: the eigenvalues located inside the Wishart range (see Sec. III.3.1) are noise whereas the ones outside aren't. Inset: (red curve) We show that the computed Wishart PDF of a random matrix (using Eqn. (19)) fits the eigenvalue distribution of our correlation matrix well, but only inside the Wishart range.

Shown in Fig. (6) is the distribution of eigenvalues of the correlation matrix of our stock market data Sec. IV.5. As can be observed the eigenvalues inside the Wishart range are responsible for the noise whereas those outside of the range are potentially correlated signal which shouldn’t be discarded.

We consider that the eigenvalues above the Wishart range ($\lambda > \lambda_{+}$) represent the linear, first-order relations between time-series, while it is unclear what those on the left side ($\lambda < \lambda_{-}$) of the Wishart distribution are. The linear signals are the signals shared by assets at the sectoral level.

The RMT “Market Mode” removal method is implemented in the five following steps below in Table F:

Compute the correlation matrix $C$. Extract the eigenvalues and eigenvectors $\lambda_i$ and $v_i$ from $C$. Select the eigenvalues outside the Wishart Range. Reconstruct the data using the compressed signal: let $X$ be our data, $V$ the matrix of eigenvectors found outside the Wishart Range, and $Y = V^{\top}X$ the compressed data; the reconstructed data is then $\tilde{X} = VY$. Re-compute the correlations from the reconstructed data.

Table F: Implementation of RMT Noise removal methods Wilcox and Gebbie (2007)
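A sketch of Table F under the stated IID assumptions: eigen-decompose the correlation matrix, keep only the modes outside the Wishart range, and reconstruct. Variable names and the use of numpy are choices of this sketch.

import numpy as np

def rmt_clean(X):
    """Sketch of Table F applied to an (N assets, D observations) matrix of returns."""
    N, D = X.shape
    Q = D / N
    lam_minus = (1.0 - np.sqrt(1.0 / Q)) ** 2        # Wishart bounds for unit variance
    lam_plus = (1.0 + np.sqrt(1.0 / Q)) ** 2
    C = np.corrcoef(X)                               # step 1: correlation matrix
    lam, V = np.linalg.eigh(C)                       # step 2: eigenvalues / eigenvectors
    keep = (lam < lam_minus) | (lam > lam_plus)      # step 3: outside the Wishart range
    V_keep = V[:, keep]
    Y = V_keep.T @ X                                 # step 4: compressed signal
    X_rec = V_keep @ Y                               #         reconstructed data
    return np.corrcoef(X_rec)                        # step 5: re-computed correlations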

We tested different time-series lengths ranging from 89 to 1249, and we note that the size of the Wishart Range increases with dimensionality, while the lower left tail decreases. The higher the dimensionality, the easier it is to rule out eigenvalues as random signals, and the more important the biggest eigenvalues are to the noise-less data reconstruction 101010 In Giada and Marsili (2001), the authors achieve a similar result by using the model in Sec. (II.3.2), confirming that “noise cleaning” mainly affects the small eigenvalues of stock market correlation matrices.. An example of a cleaned correlation matrix resulting from this method is shown in Fig. (4(c)).

III.3.2 Iterative Matrix Normalization (IMN)

Another “Market Mode” removal method Olshen and Rajaratnam (2010) iteratively standardizes the rows and columns of the data so that they behave as IID random variables with zero mean and unit variance.

The Iterative Matrix Normalization “Market Mode” removal method is implemented in the three following steps below in Table G:

Compute the Covariance. Standardize, i.e. standardise iteratively across rows, and then columns, for a set number of iterations (e.g. 500) or until a convergence criterion is met. Extract the adjusted correlation matrix from the normalized data.

Table G: Implementation of Noise Removal via Iterative Matrix Normalization Giada and Marsili (2002)
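A sketch of Table G: alternately standardize rows and then columns of the returns matrix for a fixed number of iterations before re-computing the correlations. The fixed iteration count stands in for the convergence criterion mentioned above.

import numpy as np

def iterative_matrix_normalization(X, n_iter=500):
    """Sketch of Table G: iteratively standardize rows, then columns, of X."""
    Y = np.array(X, dtype=float)
    for _ in range(n_iter):
        Y = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, keepdims=True)  # rows
        Y = (Y - Y.mean(axis=0, keepdims=True)) / Y.std(axis=0, keepdims=True)  # columns
    return np.corrcoef(Y)                        # adjusted correlation matrix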

We observe in Fig. (4(b)) that the distribution of correlations is now centered around 0.

IV The Data Test-Cases

The following examples are used as stress tests for both methods. We obtained both synthetic and real data, which enabled us to discuss the features of each model.

As a comparison tool we use the Adjusted Rand Index (ARI) Hubert and Arabie (1985) which, given two classifications, measures their similarity. The ARI operates on a scale with positive values signifying increasing similarity. Where a true classification exists we use the ARI to measure the quality of clustering of both methods, but also of industry standards such as “K-Means” Lloyd (1982), and “DBSCAN” Ester et al. (1996). Using SPC (Sec. A.1) we cluster over a temperature range which we then compare against the solution recovered by maximizing (13). We then select the SPC temperature with the highest ARI for a closer comparison with that solution in the stock market case where a true classification is not available.
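For the ARI itself we rely on the standard scikit-learn implementation; a minimal usage sketch with hypothetical toy labellings (the values below are placeholders, not results from the paper):

from sklearn.metrics import adjusted_rand_score

# hypothetical labellings: the f-SPC solution as reference, one SPC candidate per temperature
fspc_labels = [0, 0, 1, 1, 2, 2]
spc_candidates = {0.05: [0, 0, 1, 1, 1, 2], 0.10: [0, 0, 1, 1, 2, 2]}

ari_by_T = {T: adjusted_rand_score(fspc_labels, labels)
            for T, labels in spc_candidates.items()}
best_T = max(ari_by_T, key=ari_by_T.get)        # SPC temperature most similar to f-SPC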

For visualization, where possible, we provide the plots or we make use of a nonlinear dimensionality reduction package called UMAP McInnes et al. (2018). The graph of the MST is also provided as it is a faithful representation of clusters on a 2D plane. The MST takes in the graph of our data, and finds the minimum-weight tree connecting every node.

IV.1 Sci-Kit learn: Concentric Circles

Our first problem is the identification of two concentric circles on a 2D plane using the Sci-kit learn Pedregosa et al. (2011) samples generator 111111B. Thirion, G. Varoquaux, A. Gramfort, V. Michel, O. Grisel, G. Louppe, J. Nothman, ’make_circles’, 2017. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html. [Accessed: 12-Jun-2018] with $N = 500$, 0.5 for the noise parameter, and the 2 dimensions representing the X and Y coordinates of the observations.

(a) Scatter Plot
(b) Cluster size vs Temperature
Figure 7: in a) Two circle (Sec. IV.1) shaped 2D clusters, each of size 250, such that the blue points have higher density than the red ones. in b) Cluster size vs Temperature using SPC (Sec. II.3.1). As $T$ increases, the giant component successively breaks down: in the SP-phase we can observe 2 clusters which remain stable. Insets: (in a) above right) Susceptibility $\chi$. $\chi$ peaks at the first transition, remains stable inside the SP-phase, then dives down toward 0 at high temperature. (in a) below right) The Average Magnetization $\langle m \rangle$. $\langle m \rangle$ starts at 1, then remains stable inside the SP-phase. This stability only occurs when clusters have uniform or identical densities, and are linearly separable. Past the final transition, $\langle m \rangle$ goes down to 0 inside the Paramagnetic Phase.

Observing Fig. (6(b)), we see no overlap between the two clusters present in the data, and we expect to recover close to perfect clusters after applying the algorithms.

We obtained the susceptibility as a function of temperature in Fig. (6(a)). Within the SP-phase we observe two clusters in Fig. (7(a)), both containing 250 nodes, with an ARI of 1. Once the temperature gets relatively high, the system is deemed at “high energy”, and the clusters dissolve almost simultaneously. Unlike this particular example, this does not generally happen with real data where clusters have different densities.

(a) SPC
(b) f-SPC
Figure 8: in a) 2 Circles (Sec. IV.1): The MST of the (SPC Sec. II.3.1) Solution shows two subtrees each representing the two clusters in the data, and in b) with the f-SPC (Sec. II.3.2) Solution, a high number of clusters are found: there is no misclassification, however the 2 original clusters are split apart.

On this data, f-SPC runs for 10000 generations maximizing $L_c$ to 639 while the real classification scores 317. The f-SPC configuration is presented in Fig. (7(b)) with an ARI of 0.085. In comparison K-Means and DBSCAN respectively achieve 0.16, and 1. K-Means has low clustering quality despite specifying the correct number of clusters. This is due to its inability to deal with non-spherical and non-Gaussian shaped clusters. Despite the high likelihood, Fig. (7(b)) shows a high number of clusters. The clusters are not mixed, but ultimately we fail to recover the initial two clusters.

IV.2 Sci-Kit learn: Wine data

The second problem consists of a data set containing three clusters, with $N = 178$ and $D = 13$. It is reputedly an easy problem illustrating the importance of Normalizing/Standardizing features. There are 59, 71, and 48 samples respectively for class 1, 2 and 3, and the data is obtained using the Sci-kit learn loader 121212D. Cournapeau, F. Pedregosa, O. Grisel ’load_wine’, 2007-2010. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html. [Accessed: 12-Jun-2018]. The 13 features are quantities extracted from a chemical analysis of 3 types of Italian wines: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline.

(a) Raw Data
(b) Normalized Data
(c) Clusters size vs Temperature
Figure 9: in a) 3 wines (Sec. IV.2), 2D plot of dimensionality reduction of the 13 features using UMAP McInnes et al. (2018). With no scaling and/or normalizing of the features, Wines of type 1 and 2 are found in the same cluster while Wines in cluster 0 remain isolated. in b) MinMax features re-scaling Sec. III.2. The original clusters spread out, and the Wine 1 and 2 clusters are now linearly separable. in c) Size vs Temperature: at low temperature, the ferromagnetic Phase with one giant cluster, then the SP-phase with 3 clusters, which all start dissolving at higher temperatures. Insets: (in a) above right) Susceptibility $\chi$, and (in a) below right) Average Magnetization $\langle m \rangle$. $\chi$ peaks at the transition into the SP-phase, decreases, and peaks one last time at the transition into the Paramagnetic Phase.

At first sight in Fig. (8(a)), 2 clusters are merged, whereas once the features have been normalized (Fig. (8(b))) the 3 clusters occupy separate regions of the space. Each cluster has one extremity close to its neighboring cluster, which may induce some misclassification error, and because of this we expect to recover 3 imperfect clusters.

(a) SPC
(b) f-SPC
Figure 10: In a) 3 wines (Sec. IV.2) MST of SPC's solution: the 3 largest clusters respectively contain most of the observations from the original wine groups except for a few unclassified or misclassified samples. in b) f-SPC's solution: there are 7 clusters: 1 for Wine 0, 4 for Wine 1, and 2 for Wine 2.

Within the SP-phase we observe three clusters, and the best classification recovered, shown in Fig. (9(a)) as the MST of SPC's solution, has an ARI of 0.82.

Figure Fig. (9(b)) presents f-SPC's solution with a likelihood of 83.94, an ARI of 0.51, and an expected likelihood of 66.97. As with the circles problem, our solution's $L_c$ is higher than our expectation, and it has 7 clusters instead of 3. One cluster contains the observations of cluster 0, while clusters 1 and 2 are split into smaller ones without much misclassification. In comparison K-Means and DBSCAN respectively achieve ARIs of 0.85, and 0.42.

IV.3 Sci-Kit learn: Fisher's Iris data

Fisher's Iris Data, obtained using the Sci-kit learn loader 131313D. Cournapeau, F. Pedregosa, O. Grisel ’load_iris’, 2007-2010. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html. [Accessed: 12-Jun-2018], includes individuals from 3 species: Iris Setosa, Virginica, and Versicolor. $N = 150$, $D = 4$, and there are 50 nodes per cluster.

As we can see in Fig. (10(b)), it is one of the more challenging toy problems because two of the three clusters, Virginica and Versicolor, are not linearly separable. We set the neighborhood size $K$, and observe two phases in Fig. (10(a)): at low temperatures there are 2 clusters. The largest contains the Virginica and Versicolor nodes while the smaller one includes most Setosa nodes. A further phase transition follows, marked by the separation of most Virginica nodes into their own cluster. This SPC solution Fig. (11(a)) has an ARI of 0.65.

(a) Scatter Plot
(b) Cluster sizes vs Temperature
Figure 11: In a) Iris 3 species (Cluster 0 for “Setosa”, 1 for “Versicolor”, and 2 for “Virginica”) (Sec. IV.3): 2D plot of dimensionality reduction of 4 MinMax Scaled features. The Setosa, Versicolor, and Virginica clusters are respectively clusters 0, 1, and 2. Clusters 1 and 2 are not linearly separable whereas Cluster 0 is. in b) Cluster sizes vs Temperature using SPC Sec. II.3.1: 1 cluster at low temperature, then 2 clusters, and finally 3 clusters. At higher temperature the system transitions into the Paramagnetic Phase, and clusters start dissolving. Insets: (in a) above right) Susceptibility $\chi$, and (in a) below right) Average Magnetization $\langle m \rangle$. $\chi$ peaks twice: at each transition one or more clusters detach from the giant component.

The solution in Fig. (11(b)) has an ARI of 0.627, a likelihood of 132, and our expected likelihood is 104. In comparison K-Means and DBSCAN respectively achieve ARIs of 0.73, and 0.52. Similarly to the preceding examples, we recover five large clusters: Cluster 1 contains the Setosa individuals, while the Virginica and Versicolor clusters are split into 4 smaller ones with minimal misclassification.

(a) SPC
(b) f-SPC
Figure 12: in a) Iris 3 species (Sec. IV.3): MST of SPC's solution. 3 large clusters representing the original Iris species (Cluster 0 for “Setosa”, 1 for “Versicolor”, and 2 for “Virginica”), and 6 smaller ones. in b) f-SPC's solution after 10000 generations. 5 large clusters: 1 for Setosa, 2 for Versicolor, and 2 for Virginica.

IV.4 Sci-Kit learn: MNIST digits

The hand-written digits dataset, obtained with the Sci-kit learn loader 141414D. Cournapeau, F. Pedregosa, O. Grisel ’load_digits’, 2007-2010. [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html. [Accessed: 12-Jun-2018], is mainly used to test classification algorithms in supervised learning, but we are interested in how well both SPC and f-SPC deal with the nonlinear nature of hand-writing. The data contains 10 classes of digits ranging from 0 to 9. The full set has close to 2000 nodes, from which we select 500, 50 of each class. The images are 8 by 8, so $D = 64$.

(a) MST
(b) Scatter Plot
(c) Cluster sizes vs Temperature
Figure 13: in a) the MST of the MNIST hand-written digits (Sec. IV.4): overall, digits of the same class are close to one another. in b) 2D plot of dimensionality reduction of the 64 features using UMAP McInnes et al. (2018), 10 classes from 0 to 9. in c) Cluster sizes vs Temperature using SPC (Sec. II.3.1). Insets: (above right) Susceptibility $\chi$, and (below right) Average Magnetization $\langle m \rangle$.

The MST in Fig. (12(a)) and the UMAP McInnes et al. (2018) plot in Fig. (12(b)) show us that all digit classes are linearly separable; the data contains outliers, especially “1”s and “9”s, which may be the result of different writing styles.

We see that the big cluster breaks down in multiple steps (Fig. (12(c))) due to the differing densities of clusters present in the data. Fig. (12(c)) shows that right after the final susceptibility peak the configuration's clusters in Fig. (13(a)) show no significant misclassification.

(a) SPC
(b) f-SPC
Figure 14: in a) MNIST hand-written digits (Sec. IV.4): MST of SPC’s solution at . 10 classes recovered: Cluster 8 is split in two, and some observations from cluster 9 & 1 are found in one mixed cluster. There are non-linearities in how digits are drawn which may explain the closeness of 9s and 1s. in b) f-SPC’s solution: 10 classes recovered: 1 cluster for 0s and 6s, 3 for 1s, 2 for 2s, 3s, 4s, 5s, 7s, 8s and 9s.

f-SPC's solution in Fig. (13(b)), after 25k generations, has a likelihood of 149.47, an ARI of 0.747, and an expected likelihood of 135. Once again we encounter results similar to the previous cases regarding the higher likelihood and the number of clusters. The solution has close to 20 clusters, and while there is one main cluster per digit, which is the case for digits 0 and 6, and mostly for 3, 7, 8, and 9, the digits 1, 2, 4, and 5 are all split into two clusters. We explain this by the inconsistent nature of hand-writing which produces different writing styles. In comparison K-Means and DBSCAN respectively achieve ARIs of 0.56, and 0. There are many reasons why DBSCAN fails this problem: DBSCAN classifies some observations as noise into one cluster, and it also has issues tackling problems with clusters of different densities.

IV.5 Kaggle: NYSE Data

We obtained publicly available NYSE stock market data on Kaggle151515C. Nugent ’S&P 500 stock data - Historical stock data for all current S&P 500 companies’, 2017-2018. [Online]. Available: https://www.kaggle.com/camnugent/sandp500. [Accessed: 01-Dec-2017]. The original data contains daily open, high, low, and closing prices, and volume, from 8/13/2012 to 8/11/2017. Because not all stocks traded for the whole duration, we only select the stocks which did for the last 1250 days (about 5 years), which left us with 447 stocks; furthermore we were, in this case, interested in a time horizon of 5 years in trading days from 8/23/2012 to 8/11/2017. We consider the daily closing prices, which are used to compute the daily returns such that:

(20)

The final data set has a time-series length $D = 1249$. Using Eqn. (20) we consider three cases for the correlation matrix: i.) the full correlations, ii.) denoising using IMN (See Sec. III.3.2), and iii.) cleaning the matrix using an RMT method (See Sec. III.3.1).
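A sketch of this preparation step, assuming log-returns in Eqn. (20) (the exact return definition is an assumption of this sketch) and a wide price table with one column per ticker; it reuses the iterative_matrix_normalization and rmt_clean sketches from Sec. III.3.

import numpy as np
import pandas as pd

def daily_returns(close_prices: pd.DataFrame) -> pd.DataFrame:
    """Daily returns from closing prices; log-returns are assumed here."""
    return np.log(close_prices).diff().dropna()

def correlation_cases(returns: pd.DataFrame):
    """The three correlation-matrix cases of Sec. IV.5."""
    X = returns.to_numpy().T                          # (N assets, D trading days)
    return {
        "market_mode": np.corrcoef(X),                # i.)  full correlations
        "imn": iterative_matrix_normalization(X),     # ii.) IMN de-noising (Sec. III.3.2)
        "rmt": rmt_clean(X),                          # iii.) RMT cleaning (Sec. III.3.1)
    }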

(a) Market Mode
(b) IMN
(c) RMT
Figure 15: in a) S&P500: Market Mode Correlation-based MST of 447 stocks over 1249 trading days (Sec. IV.5). in b) The Market Mode was removed using IMN (Sec. III.3.2), and in c) using RMT (Sec. III.3.1). Colors refer to GICS sectors (See footnote 21 )

Financial markets are perpetually evolving living ecosystems, and this is illustrated in the lack of available true sectoral classification of publicly traded companies. In the process of clustering stock market data 161616A very nice review of clustering methods applied to financial datasets is available at Marti et al. (2017) , we wanted to compare the results of our algorithms with industry standard classification but we faced the following difficulties:

We consider the following industry classifications [17: there is no consensus industry classification used in the financial services industry]. The New York Stock Exchange (NYSE) uses the Global Industry Classification Standard (GICS) [18: the GICS counts 11 sectors, 24 industry groups, 68 industries, and 157 sub-industries, and is updated annually. For more details on their hierarchical industry classification system see https://www.msci.com/documents/10199/4547797/GICS+Structure+effective+Sep+1%2C+2016.xls], which we use here. The National Association of Securities Dealers Automated Quotations (NASDAQ) and the London Stock Exchange (LSE) both use the Industry Classification Benchmark (ICB) [19: the ICB counts 10 industries, 19 super-sectors, 41 sectors, and 114 sub-sectors. For details on their hierarchical industry classification system see https://www.ftse.com/products/downloads/ICB_Rules.pdf]. Industry classifications have sectoral, industrial and sub-industrial levels. Although commonalities exist, one is left to determine the equivalences when information aggregation is required across different markets.

GICS and ICB are static classifications which are updated at irregular intervals (GICS every year, ICB from weekly to yearly updates). The focus of these providers is on long-term structural trends of financial markets. As such, the classifications lose their usefulness if one wants to consider the impact of rare events, such as financial crashes, which significantly alter the behavior of businesses. They also do not consider how the diversification of investments and activities affects a company's classification. The case of Amazon can be argued to illustrate the idea behind the Adaptive Market Hypothesis (AMH) Lo (2004): Amazon's GICS sector is Consumer Discretionary. GICS uses this sector to classify companies whose activity they deem “most sensitive to economic cycles” [20: a description of the GICS sectors is available at https://www.msci.com/documents/10199/4547797/GICS+Sector+definitions-Sep+2016.pdf]. It is unclear what is meant by “sensitive” in this instance, as there are many possible interpretations, and this sector is very heterogeneous. Perhaps it highlights the adaptive nature of Amazon's business interests, which started with an order-to-delivery e-commerce bookstore but, based on Fig. (14(a)), are now closest to the Information Technology sector.

The life cycle of publicly traded companies can be short. Firms go public and private at a relatively high frequency when compared to biological evolution on a human timescale, as motivated by Farmer in Farmer and Geanakoplos (2009). The inclusion or exclusion of individuals in an ecosystem can, and should, have an impact on its structure, depending on how important the individuals are to their groups. When we looked for GICS data for our time-series, a number of companies had gone private since Aug 2017, and the GICS classification had been updated without reflecting these changes for those companies. Gathering data on these companies that translated into the newer nomenclature was thus rendered more difficult, yet again illustrating the need for expert-free unsupervised methods.

Finally, as previously stated, GICS and ICB intend to provide data which capture long-term trends. Financial markets are populated with participants (pension funds, high frequency traders, asset managers, etc.), each holding a diverse set of objectives, who do not necessarily operate on the same time scales or have the same investment horizons. If one goal is to provide comprehensive analyses of the multiple dynamics existing in markets, tools which capture these trends, and methods which subsequently find relations between them, should be prioritized.

This motivates us to argue that the highly dynamic nature of financial markets renders the use of static classifications problematic to a certain extent.

We use GICS's 11 sectors as the “true” economic sectors of the US financial market. These include Consumer Discretionary (74 stocks), Consumer Staples (31 stocks), Energy (28 stocks), Financials (62 stocks), Health Care (51 stocks), Information Technology (IT) (59 stocks), Industrials (58 stocks), Materials (26 stocks), Real Estate (26 stocks), Telecoms (4 stocks), and Utilities (28 stocks) [21: colors used for the 11 GICS economic sectors: Consumer Discretionary (royal blue), Consumer Staples (sky blue), Energy (orange), Financials (beige), Health Care (dark green), Information Technology (light green), Industrials (red), Materials (pink), Real Estate (purple), Telecom (magenta), Utilities (brown)]. Although, as previously mentioned, we do not believe this classification to be fully valid, here we make use of it as a benchmark.

Looking at the MSTs in figures (14(a)), (14(b)), and (14(c)), aided by the GICS classification as a legend, we notice that nodes belonging to the same economic sectors are mostly located in close proximity to each other, as one would expect in a static world or over time-scales where a static model is a reasonable approximation.

We report SPC results in figures (15(a)), (15(c)), and (15(e)), at their respective clustering temperatures, for the full (K = 5), normalized, and RMT cases.

(a) Market Mode - SPC
(b) Market Mode - f-SPC
(c) IMN - SPC
(d) IMN - f-SPC
(e) RMT - SPC
(f) RMT - f-SPC
Figure 16: S&P500: stocks traded over 1249 days (Sec. IV.5). in a), c), and e) SPC's solutions respectively for the full “Market Mode” sample correlation matrix, the iteratively normalized (Sec. III.3.2), and the noise-cleaned RMT (Sec. III.3.1) cases. in b), d), and f) the corresponding f-SPC solutions after 25k generations. Colors refer to GICS sectors (see footnote 21).

We briefly mention again one of SPC's features, which consists in linking a node to its closest neighbor based on the spin-spin correlations. Using the threshold condition on the spin-spin correlation we construct a graph, but in the case where a node has no correlations meeting the condition, it is linked to its neighbor of highest spin-spin correlation. This feature forces SPC to produce graphs without isolated nodes. At the same time, and because of this fact, we consider small clusters to be equivalent to noisy, insufficiently correlated, or unclassified observations.
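A minimal sketch of this linking rule, assuming a spin-spin correlation matrix G, a K-nearest-neighbour list, and the 0.5 threshold used here (names are ours):

import numpy as np

def link_nodes(G, neighbors, theta=0.5):
    """Build the edge set of the cluster graph from spin-spin correlations."""
    edges = set()
    for i in range(G.shape[0]):
        linked = [j for j in neighbors[i] if G[i, j] > theta]
        if not linked:
            # no neighbour exceeds the threshold: fall back to the strongest neighbour,
            # so that no node is left isolated
            linked = [max(neighbors[i], key=lambda j: G[i, j])]
        for j in linked:
            edges.add((min(i, j), max(i, j)))
    return edges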

The SPC solutions recover GICS information as seen by their respective ARIs: 0.317, 0.479, and 0.33. The solution with the highest number of noisy or unclustered stocks is Fig. (15(a)): the financial sector is merged with many stocks from other sectors, whereas most industries are found in one or two clusters. The complexity goes down when we move to Fig. (15(c)), where every sector has mostly separated into its own unique cluster, and to Fig. (15(e)), which gives a similar picture although with more small unclassified clusters present.

f-SPC results were simulated for 25k generations, and we obtained likelihood values of 113.92, 54.13, and 367.93 respectively for the full correlations Fig. (15(b)), the normalized Fig. (15(d)), and the RMT correlations Fig. (15(f)). The economic GICS information recovered, as measured by the ARI, was, in the same order, 0.25, 0.35 and 0.41. While Fig. (15(b)) has the smallest number of clusters, 15, we find clusters which mostly contain firms from single industries, such as the Financials, Utilities, Real Estate, and Energy sectors. The other clusters are more or less mixed, including a very large one which we could refer to as the “market”. Recall that Fig. (4(a)) shows the correlations of the “Market Mode” are mostly positive, and one can easily infer this kind of result. Fig. (15(d)) and Fig. (15(f)) provide a cleaner picture of the market: in Fig. (15(d)) there is no large “market” cluster, and every industry is mostly represented by its own cluster. Firms previously found in the “market” cluster are now, for the most part, located in clusters representative of their respective industries. The situation is similar in Fig. (15(f)), except that the industry sectors have a better definition while a large mixed cluster remains present, as in Fig. (15(b)).

The neighborhood search SPC performs constrains the scope of the optimization: SA can only minimize the Hamiltonian over the neighborhoods, which necessitates an additional decision in picking the neighborhood size K, which acts as a hyper-parameter. The likelihood, by contrast, is optimized over the whole set of observations, effectively removing this need and making the optimization fully unsupervised. This also means that nodes which would not cluster together under the neighborhood limitation may do so in this case. It is unclear which is the better way to proceed.

Our second goal is to explore the clustering differences which arise in our three cases. In Fig. (15(a)) and Fig. (15(b)) the biggest clusters have significant overlap with economic sectors, except for a few large ones such as the financials cluster, which also houses stocks from other sectors. The picture gets cleaner once we look at Fig. (15(c)) and Fig. (15(d)), where we now have less mixing in most clusters, and finally in Fig. (15(e)) and Fig. (15(f)) some of the clusters, such as the Real Estate, Utilities, Health Care, and Consumer Staples clusters found in the normalized case, are split. We noticed a similar result in Sec. IV.3, where the f-SPC solution had a higher number of clusters, but they were essentially subgroups of the ones found by SPC.

IV.6 BRICS Data

We obtained publicly available BRICS (Brazil, Russia, India, China and South Africa) stock market data [22]. The data contains daily closing prices of 226 stocks: Brazil (60), China (50), India (30), Russia (43), and South Africa (43). We will use BRICS as a way of listing the mentioned countries in this specific order. The window spans 2005 to 2015, from which we retained the last 5 years of daily trading. The data set suffers from a missing-data problem which would make it impossible to compute correlation matrices directly. We deal with the problem by using a pairwise treatment of the missing data, which consists in computing correlations only on overlapping sections of the time-series. The resulting correlation matrix is then made positive definite, and cleaned using IMN (see Sec. III.3.2).
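A minimal sketch of this missing-data treatment, assuming the returns sit in a pandas DataFrame with NaNs marking missing days; the 30-day minimum overlap and the eigenvalue-clipping repair below are illustrative choices of ours, not necessarily the paper's exact procedure:

import numpy as np
import pandas as pd

def pairwise_overlap_correlations(returns: pd.DataFrame) -> np.ndarray:
    # each pairwise correlation is computed only over days where both series are observed
    C = returns.corr(min_periods=30).to_numpy()   # 30-day minimum overlap is an assumption
    np.fill_diagonal(C, 1.0)
    return C

def make_positive_definite(C: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # clip negative eigenvalues, then rescale so that the diagonal is exactly 1
    w, V = np.linalg.eigh(C)
    C_pd = V @ np.diag(np.clip(w, eps, None)) @ V.T
    d = np.sqrt(np.diag(C_pd))
    return C_pd / np.outer(d, d)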

(a) SPC BRICS
(b) f-SPC BRICS
Figure 17: 226 BRICS stocks. Data cleaned using IMN (see Sec. III.3.2). in a) the SPC solution, and in b) the f-SPC solution.

Using MR, the local maximum of the likelihood reached was 7.56, with 24 clusters: there is very little mixing, and each of the 5 countries is concentrated in 1 to 3 clusters (see Fig. (16(b))). This confirms our expectation that same-country stocks should mostly belong to the same clusters, with the existence of multiple clusters per country as evidence of meso-scale industry/sector level classification.

We continue with the SPC result given in Figures (16(a)) and (3(a)), with 9 clusters. The ARI between the SPC and f-SPC solutions is 0.5. Globally, the same clusters are recovered, except that they are slightly smaller in the MR solution, and the clusters' “purity” is potentially higher in the MR candidate (stocks which belong to different countries do not mix). In both candidates, Brazil's financial market is divided into 2 clusters, which, upon a more detailed cluster analysis, could reveal industry (or sector) economic subdivisions.

V Discussion

In this paper we successfully implemented SPC and f-SPC, and tested both methods on synthetic and real data. If significantly different structures exist within the data, SPC will exhibit multiple transitions within the SP-phase. As the temperature is varied, the susceptibility and the average magnetization together signal the occurrence of phase transitions. The spin-spin correlation is indirectly linked to the interaction strengths and the densities found in the data. The method has the advantage of being unsupervised for the most part: it does not necessitate a-priori knowledge of the number of clusters and makes no assumption about the distributions of the data. While SPC performs a neighborhood search, which requires picking a value for K, this does not affect the simulation significantly for large data sets, as was previously seen in Blatt et al. (1997). The spin-spin correlation threshold is set to 0.5 and helps decide cluster membership. We clustered at every temperature within a pre-determined range so that we do not need to identify “clustering temperatures” as in Blatt et al. (1996), but we do so at the expense of additional computational cost. We note that in the literature there are modifications of Potts model clustering which automate the selection of the clustering temperature, the local length scale, and the cluster membership threshold through validation-based calibration Murua and Wicker (2014).

Once we have a hierarchy of configurations, such as in Fig. (12(c)), one needs to select appropriate clusters representative of the different regimes of the SP-phase. This is an easy task, as can be seen in Fig. (8(c)), when the number of clusters is low and the clusters have similar densities, which establishes a stable phase over a relatively wide temperature range. On the other hand, if the number of clusters is high and the data is composed of clusters of different densities, the susceptibility, which tracks the variance of the biggest cluster, has its limitations Kullmann et al. (2000). As the size of the data-set increases, the susceptibility remains useful as a tool to locate the final transition and the lowest clustering level.

f-SPC only requires the correlation matrix and is completely unsupervised. The randomly generated population is diversified at every iteration by applying as many as 7 different mutations. It is a fast algorithm, while SPC is MCMC based and requires statistical averages. The computation time is affected by the order of the observations in the data Blondel et al. (2008). We noticed that ordering our data according to the closest “neighbors” of one of the observations produced better results, which should motivate further exploration of potential heuristics dealing with this issue.

The likelihood measures the quality of cluster configurations: its value is computed from the cluster sizes and the intra-cluster correlations. The optimization is global, which, as opposed to SPC, avoids the need to determine a neighborhood size K. There exist problems where choosing a sufficiently big K has a non-trivial impact on SPC's solutions. One such example would be the existence of a relatively low-density, sparse cluster in a data set mostly composed of high-density clusters. Low values of K would fail to recover the low-density cluster, which would remain unclassified, whereas this is not an issue for f-SPC, which would perform much better.
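For concreteness, here is a minimal sketch of the Giada-Marsili log-likelihood of Eqn. (13) as we read it: each cluster s contributes a term depending only on its size n_s and its internal correlation sum c_s = Σ_{i,j∈s} C_ij, with singletons contributing nothing (the sketch assumes 0 < c_s < n_s² for every non-trivial cluster and ignores edge cases):

import numpy as np

def gm_log_likelihood(C, labels):
    # C: N x N correlation matrix; labels: length-N array of cluster assignments
    L = 0.0
    for s in np.unique(labels):
        idx = np.where(labels == s)[0]
        n_s = len(idx)
        if n_s < 2:
            continue                                  # singletons contribute nothing
        c_s = C[np.ix_(idx, idx)].sum()               # includes the diagonal, so c_s >= n_s
        L += 0.5 * (np.log(n_s / c_s)
                    + (n_s - 1) * np.log((n_s**2 - n_s) / (n_s**2 - c_s)))
    return L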

f-SPC results are consistent for high dimensionality data-sets if we consider, as the metric used to evaluate the noise in correlation matrices, the ratio of the dimension over the number of observations. We recall that in Wilcox and Gebbie (2007) this ratio encodes the noise level of the eigenvalues of the sample covariance matrix. Its values for our problems are 250 for the two circles, 13.69 for the wines, 166.66 for the 3D blobs, 1 for the 500D blobs, 37.5 for Fisher's Iris, 7.81 for the MNIST digits, and 0.35 for the NYSE Kaggle data. The results consistent with SPC were the 500D blobs, MNIST, and the NYSE stock data, which all confirm that a low ratio is necessary to compute appropriate correlation matrices. We want this ratio to be as small as possible, ideally close to 0. This is not always the case, and we have tested ways to de-noise the correlation matrix (Sec. III.3.2 and III.3.1) in the case of financial time-series, but it is unclear at this time what the solutions would be in other cases. In Giada and Marsili (2001), Marsili and Giada derive the likelihood and, along the way, assume long time-series, which in turn means one has to consider the finite-size effects of the method. Fig. (3(d)) tells us that, were we to visualize the likelihood's objective surface, depending on the dimensionality of the problem we could face a “rough” space. One could consider the likelihood as a sort of modularity function, just as in the Network Science literature. One major Network Science problem is the efficient detection of communities inside networks. Similar to our work, cluster configurations are the input of the modularity function, which, through diverse heuristics, is maximized. However, it is well known to Network Scientists Good et al. (2010) that modularity objective surfaces are degenerate: many significantly different clustering results have similar modularity, or, in our case, higher likelihood than the true clustering.

We compare this to the modus operandi of SPC. The generative model of SPC is the Gibbs-Boltzmann distribution, which validates clusters locally using Eqn. (9); it is a bottom-up approach, as opposed to global optimization methods, which are top-down. One assumes that there exist multiple realizations (micro-states) of the generative model, the so-called “equivalence classes”, which are valid representations of the data. In order to link micro- and macro-states one could pick any one micro-state translating into the desired macro-state: maximum likelihood methods essentially achieve this by searching the space of solutions for any candidate meeting the global objective. We argue that in complex systems the existence of equivalence classes, as illustrated by the degeneracy of clustering objective surfaces, leads to the Maximum Entropy principle Jaynes (2003) as an alternative optimization device. The generative model generates equivalence classes (among which are the maximum likelihood candidates), each with differing probabilities, and one then needs to probabilistically combine them to achieve some sort of representative weighted average.

We suspect that a way to deal with cases where the number of observations is small relative to the dimension would be a modification of the likelihood by adding a regularization term which accounts for the number of clusters. Our rationale is that, as an objective function, the likelihood is degenerate, with multiple spin configurations whose likelihoods are equal or very close. This degeneracy comes from the sheer number of possible configurations: even if we assume the minimum cluster size to be 2 (no singletons), the number of possible configurations grows combinatorially with the number of objects.

Finally, one is left to decide which de-noising method is optimal and, as a consequence, which clustering one prefers. The assumptions in both methods have their validity and should be carefully considered. Whereas IMN (Sec. III.3.2) treats the covariance matrix entries as IID normal random variables, RMT (Sec. III.3.1) predicts a spectrum of random-matrix eigenvalues which is pure noise. The noise is removed by reconstructing the data without the noisy eigenvalues, whose number increases with dimensionality. We suspect that a proper way of deciding which method is optimal is the implementation of such methods as the bases of trading strategies.

VI Conclusion

In this paper we have presented two unsupervised data clustering algorithms inspired by the Potts model Wu (1982). SPC (Sec. II.3.1) uses SA to optimize the Hamiltonian of a thermodynamic system, and the recovered spin configurations provide the best clustering structure present in our data. We show that the parameter-free Marsili-Giada (Sec. II.3.2) maximum likelihood method, implemented with a modified version of Hendricks et al.'s Hendricks et al. (2016a, b) parallelized genetic algorithm, recovers solutions similar to those found in the super-paramagnetic phase Fig. (1).

This was done by comparing the SPC solutions to the f-SPC ones using the ARI Lloyd (1982). In addition, we compared the likelihood of SPC solutions to that of f-SPC solutions to show that the f-SPC solutions have higher likelihoods; this prompts an additional discussion on the implications for statistical inference in complex systems. The methods were tested both on several standard toy test cases and on real stock market time-series data; this illustrates their universality, provided an appropriate similarity metric is selected, such as the Pearson correlation coefficient or the correlation distance. We are able to show that the results are similar to the 11 standard GICS economic sectors; however, the differences in the number of clusters and in their composition should be cause for concern with respect to the use of the GICS classification. Hence, we question the effectiveness of naively using GICS classifications for risk-management and for investment decision making on both the medium and short term.

Building on the work presented in this paper, and on its original aim, as in Hendricks et al. (2016a), we would have liked to perform cluster analysis and compare configurations of intra-day stock market time-series. The last thirty years have seen a tremendous increase in technological power, which has enabled simultaneous trading on multiple time scales. The so-called High Frequency Trading (HFT) paradigm increased the amount of stock market data tenfold and has significantly impacted market participants' behavior at shorter time scales. It is therefore natural to consider market participants as all having different operating time horizons and for this to somehow be reflected in the non-trivial composition of clusters. We assume traders use all information available to make decisions; however, the rate at which economic information about publicly traded firms is released can range from once a month to once a year. This rate is significantly lower than that of HFT, which leads us to think that trading is not solely based on economic information, and because of the different time-scales one can suspect that different objectives may be at play Wilcox and Gebbie (2014). The key motivation here was to build and test high-speed classification methods towards the goal of unsupervised classification of traders from live market data. It is notoriously difficult to obtain trading data linked to individual market participants' accounts. This kind of data would be extremely useful not only to directly study traders' behavior, but also to begin to understand the kind of ecosystem a financial market is. Unfortunately, one is left only with the possibility of proxy studies through the dynamics of the traded securities.

One possible approach is the unsupervised clustering of simulated technical trading agents implementing investment strategies in an artificial financial market. This has been a key motivation for the present work. One logical next step is what we call Dynamical Cluster Analysis (DCA): events such as the financial crisis which preceded the 2008 Great Recession can be investigated at the intra-day scale. Here again we conjecture that shocks to the system irremediably affect strategies, and that clustering structures become less persistent with time, as in Musmeci et al. (2015). Ultimately, some quantification of clustering on different temporal scales could be useful towards probing potential hierarchical causal effects, given that different effective theories may dominate at different scales Wilcox and Gebbie (2014).

We have so far worked with price returns as our single factor model. The derivation and formulation of the Giada-Marsili likelihood allows for multivariate clustering: we can use F-by-N-by-N correlation arrays, where F is the number of factors we want to include. Another challenge, however, is that we are not aware at this time of an implementation of multi-factor correlation-based SPC. In closing, we promote the idea that a promising future research direction would be quantized spin-models and, ultimately, building unsupervised learning algorithms that more effectively accommodate state-interference and phase information within some likelihood method, as well as additional optimization refinements to the algorithm, perhaps using community detection algorithms to enhance performance Blondel et al. (2008).

VII Acknowledgements

The authors thank Etienne Pienaar, Micheal Gant, Jeff Murugan, Nic Murphy and Diane Wilcox for discussions and comments. LY would like to thank Fangqiang Zhu for first introducing him to the elegance of Ising models and their simulation. TG acknowledges the University of Cape Town for FRC funding (fund 459282).

Appendix A The Algorithms

The algorithms implemented in this paper have been coded in python and are available on a github repository at Yelibi (2017).

A.1 SPC Algorithms

We provide a pseudo-code for the SPC Blatt et al. (1996) algorithm introduced in Sec. II.3.1. Given a distance matrix, and a neighborhood of size K, the algorithm uses the Swendsen-Wang Swendsen and Wang (1987) MCMC method to optimize the thermodynamic system.

1:INPUT: Strength matrix; neighbor list nodenext; temperature range
2:for each temperature in the range do
3:    Create edge configuration matrix “link”
4:    Create Swendsen-Wang clusters
5:    Compute and store thermodynamic quantities (e.g. magnetization, susceptibility)
6:end for
7:OUTPUT: thermodynamic quantities; spin-spin correlation matrix
Algorithm 1 SPC (Sec. II.3, and Blatt et al. (1996)), and function “runz” in “Super-Paramagnetic-clustering.py” in Yelibi (2017)

Hoshen-Kopelman Al-Futaisi and Patzek (2003) is the mechanism behind cluster discovery in SPC. It allows the graph to be traversed via its nodes and neighborhoods while clustering the system locally.

1:INPUT: Matrix of links
2:set label counter to 0, Initialize nodel Nx1 array
3:Create nodelp, an empty array
4:for i = 1 to N do
5:   if node i isn’t linked at all then
6:      Set label counter to i’s label
7:       Store i’s label to nodelp
8:      Increase label counter by 1
9:   else
10:      Find i’s linked neighbors, and store their nodelp labels
11:      if None are labeled then
12:         Set label counter to i’s label
13:          Store i’s label to nodelp
14:         Increase label counter by 1
15:      else
16:         store the labels of the linked neighbors
17:         Store root of the labels of the linked neighbors
18:         Set min the smallest root label
19:         Set the nodel of i to min
20:         In nodelp change linked neighbors root labels to min
21:      end if
22:   end if{Make nodelp sequential}
23:   for  to  do
24:      
25:      while  The root of n is less than n  do
26:         Set n to the root of n
27:      end while
28:       Set the root of y to n
29:   end for{ Relabel the labels with their roots }
30:   for  to  do
31:       Find labels in nodel == i
32:       Update them with their root in nodelp
33:   end for
34:end for
35:OUTPUT: Spin configuration
Algorithm 2 Extended Hoshen-Kopelman (Table B in Al-Futaisi and Patzek (2003), and Swendsen and Wang (1987)), and function “eHK” in “Super-Paramagnetic-clustering.py” in Yelibi (2017)

Swendsen-Wang flips whole clusters at each iteration, allowing quick convergence toward a local optimum.

1:INPUT: Strength Matrix
2:Create new spin configuration using Hoshen-Kopelman
3:Flip all clusters
4:OUTPUT: Flipped Spin Configuration
Algorithm 3 Swendsen-Wang (Table C in Swendsen and Wang (1987)), and function “eHK” in “Super-Paramagnetic-clustering.py” in Yelibi (2017)
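As a rough illustration of the bond-freezing and cluster-flip mechanics, here is a minimal sketch of one Swendsen-Wang sweep for a q-state Potts system. The freezing probability 1 − exp(−J_ij/T) for aligned neighbours follows the standard construction; the interaction strengths J_ij and neighbour lists stand in for those of Sec. II.3.1, and the helper names are ours:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def swendsen_wang_sweep(spins, J, neighbors, T, q, rng):
    n = len(spins)
    rows, cols = [], []
    for i in range(n):
        for j in neighbors[i]:
            # freeze a bond between aligned neighbours with probability 1 - exp(-J_ij / T)
            if spins[i] == spins[j] and rng.random() < 1.0 - np.exp(-J[i, j] / T):
                rows.append(i)
                cols.append(j)
    bonds = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    n_clusters, labels = connected_components(bonds, directed=False)
    new_states = rng.integers(0, q, size=n_clusters)   # flip every cluster to a random state
    return new_states[labels], labels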

A.2 f-SPC Algorithms

We discussed f-SPC in Sec. II.3.2; here we provide pseudo-code for the implementation of the parallelized genetic algorithm, which generates clustering candidates, evaluates their likelihood Giada and Marsili (2001) using Eqn. (13), selects the best candidates, and discards the others.

1:INPUT: N individuals; G generations; likelihood function; Mutation function; Recombination function; Correlation Matrix
2:Create Initial Population of N individuals
3:for  (In parallel ) All individuals do
4:   Evaluate Fitness
5:end for
6:for Number of Generations G do
7:   Create Offspring (from the entire population)
8:   Mutate offspring
9:   for  (In parallel ) All offspring do
10:       Evaluate Fitness
11:   end for
12:    Recombine Parents and Offspring
13:    Select Next Population (N individuals)
14:end for
15:OUTPUT: Generation with highest likelihoods
Algorithm 4 f-SPC PGA (Sec. II.3.2, and Hendricks et al. (2016a)), and “fast-SP-clustering.py” in Yelibi (2017)

References
