CorrGAN: Sampling Realistic Financial Correlation Matrices Using Generative Adversarial Networks


We propose a novel approach for sampling realistic financial correlation matrices. This approach is based on generative adversarial networks. Experiments demonstrate that generative adversarial networks are able to recover most of the known stylized facts about empirical correlation matrices estimated on asset returns. This is the first time such results are documented in the literature. Practical financial applications range from trading strategies enhancement to risk and portfolio stress testing. Such generative models can also help ground empirical finance deeper into science by allowing for falsifiability of statements and more objective comparison of empirical methods.


Gautier Marti, Shell Street Labs

Keywords: generative adversarial networks, correlation matrices, stock returns, random matrices, hierarchical clustering

1 Introduction

In [12], we can read:

“To the best of our knowledge, there is no algorithm available for the generation of reasonably random [financial] correlation matrices with the Perron-Frobenius property.

Concerning the generation of [financial] correlation matrices whose [Minimum Spanning Trees (MSTs)] exhibit the scale-free property, to the best of our knowledge there is no algorithm available, and due to the generating mechanism of the MST we expect the task of finding such correlation matrices to be highly complex.”

In this paper, we propose a novel approach to solve the problem of generating realistic financial correlation matrices. Using Generative Adversarial Networks (GANs) to sample realistic financial correlation matrices has never been documented, to the best of our knowledge, despite the importance of the problem. Simulating financial data, and correlation matrices in particular, has many applications, such as testing the robustness of trading strategies and stress testing portfolios. Another major application could be the objective comparison of empirical methods (combination of signals and strategies, statistical filtering methods [21]) which would otherwise be claimed superior based on a given arbitrarily chosen sample. This endemic problem in empirical finance prevents the field from becoming a science in the Popperian sense: one cannot easily contradict such results [16].

Generating multivariate financial time series is more general and more difficult than focusing on their correlations: besides the dependence structure (relatively static in comparison), one has to correctly capture the univariate time series features (e.g. autocorrelation) and the distributional properties of the margins altogether. In this work, we only focus on generating empirical correlation matrices, which may already be an approximation of the dependence structure between several financial assets (cf. copula theory [19]).

Despite the importance of the problem, the lack of research (and results) can be explained by the fact that GANs, a recent class of generative modelling approaches (seminal paper in 2014 [9]) which stemmed from the computer science community, are not yet part of the toolbox of econometricians, risk analysts and quantitative analysts. This work can also be relevant for the signal processing community, where robust estimation of large covariance matrices (a correlation matrix being the covariance matrix of standardized variables) is a common problem [2, 1].


The contributions of this article are:

  • sampling financial correlation matrices using GANs, and documenting results for the first time,

  • showing that the samples generated look realistic, and verify the stylized facts known in the econophysics literature,

  • using S&P 500 stock returns which are widely available for reproducibility of the experiments.

2 Related work

To the best of our knowledge, there has been no previous attempt at generating realistic financial correlation matrices using GANs. No known model is able to capture, even approximately, all the known characteristics of financial correlation matrices [12]. In the following subsection, we briefly review typical applications of GANs and highlight the lack of published results concerning financial data. Then, we describe the stylized facts of financial correlation matrices, which will be useful to evaluate the samples generated by the different GAN-based approaches tested.

2.1 Generative Adversarial Networks

Generative Adversarial Networks (GANs) were introduced in [9]. Two models, $G$ (the generative model) and $D$ (a discriminative model), are trained simultaneously: $G$ is trained to maximize the probability of $D$ making a mistake; $D$ is trained to estimate the probability that a sample comes from the training data rather than from $G$. These models are notoriously complex to train and evaluate. Their greatest success so far has been to generate realistic pictures. There are few successful applications published outside natural images, e.g. [22, 3]; results produced by GANs are not competitive in natural language generation, for example.
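For reference, the two-player training described at the start of the preceding paragraph is usually written as the minimax game of [9]. The notation below follows [9] and is not specific to this paper: $G$ maps latent noise $z \sim p_z$ to samples, and $D(x)$ is the estimated probability that $x$ comes from the training data.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```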

Another recent field of applications for GANs is graph generation. NetGAN [4] is a novel approach to generate graphs based on an input graph of $N$ nodes, defined by a binary adjacency matrix $A \in \{0, 1\}^{N \times N}$. Random walks of length $T$ are sampled from $A$. This collection of random walks constitutes the training set. The generator $G$, an LSTM, learns to generate similar sequences which are then converted back to binary adjacency matrices describing the graphs. An $n$-by-$n$ correlation matrix can be viewed as a complete graph with edge weights in $[-1, 1]$. Adopting an approach similar to NetGAN would require extending their work in two respects: (i) many (mid-sized) graphs as input instead of a single (large) one; (ii) a (complete) weighted graph (edge weights in $[-1, 1]$) instead of a (sparse) unweighted one. This is not the approach adopted in this paper.

Related to finance, we are aware of [10], which aims at simulating SABR (a stochastic volatility model) parameters, and [14], which generates univariate time series of asset returns using a conditional GAN. The latter discards the whole dependence structure, e.g. correlations, existing between the time series of many assets. This may not matter when focusing on time series strategies (actively trading a single asset through time), but it makes the approach unsuitable for cross-sectional strategies, or for large portfolio and risk management. In other words, it does not model the multivariate joint distribution of the co-movements of many assets.

Unlike natural images, which lend themselves well to visual inspection, samples obtained from GANs are in general hard to evaluate. Researchers are for now limited to checking a few statistics, for example the degree distribution of the graph nodes in [4] or the ACF/PACF of the time series in [14]. However, one needs to know which statistics are important to verify. Fortunately, financial correlation matrices have been extensively researched over the past two decades.

2.2 Financial correlation matrices

Financial correlation matrices have been extensively studied in econophysics, an empirical field applying statistical physics methods to economics and finance. Around 1999, Bouchaud et al. [15] showed how Random Matrix Theory (RMT) can be used to better understand financial correlations, and they started a two-decade-long research program developing and refining methods using tools from RMT to clean large empirical correlation matrices [5]. At about the same time, Mantegna, another econophysicist, discovered the hierarchical structure of financial correlations [17]; this seminal and influential work sparked a rich body of empirical research on financial networks and clustering. An extensive review of this literature can be found in [18].

This body of knowledge about financial correlation matrices can be summarized in a few stylized facts:

  • Distribution of pairwise correlations is significantly shifted to the positive,

  • Eigenvalues follow the Marchenko-Pastur distribution [15], except for:

    • a very large first eigenvalue (the market),

    • a couple of other large eigenvalues (industries),

  • Perron-Frobenius property (first eigenvector has positive entries),

  • Hierarchical structure of correlations [17],

  • Scale-free property of the corresponding Minimum Spanning Tree (MST) [6].
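Some of the stylized facts listed above can be checked numerically on a single correlation matrix. The following is a minimal sketch, assuming NumPy; the function name `stylized_fact_summary` and the returned keys are hypothetical, and the Marchenko-Pastur upper edge $(1 + \sqrt{n/T})^2$ is the textbook value for i.i.d. unit-variance data, with $T$ the number of observations used to estimate the matrix.

```python
import numpy as np

def stylized_fact_summary(corr, n_obs):
    """Summarize a few stylized facts for one n-by-n correlation matrix.

    n_obs is the number of return observations used to estimate corr.
    """
    n = corr.shape[0]
    # Pairwise correlations are the strictly upper triangular coefficients.
    pairwise = corr[np.triu_indices(n, k=1)]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(corr)
    # Upper edge of the Marchenko-Pastur bulk for ratio q = n / n_obs.
    mp_upper = (1.0 + np.sqrt(n / n_obs)) ** 2
    # Eigenvectors are defined up to sign: orient the dominant one.
    first_vec = eigvecs[:, -1]
    first_vec = first_vec * np.sign(first_vec.sum())
    return {
        "mean_correlation": float(pairwise.mean()),
        "largest_eigenvalue": float(eigvals[-1]),
        "mp_upper_edge": float(mp_upper),
        "n_eigenvalues_above_mp": int((eigvals > mp_upper).sum()),
        "perron_frobenius": bool((first_vec > 0).all()),
    }
```

The hierarchical structure and the scale-free property of the MST require a clustering and a spanning tree respectively, so they are not covered by this summary.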

It is possible that some stylized facts are still to be discovered. Exploring the latent space of GANs [7] could help find unknown properties of financial correlations; however, generative adversarial networks, alongside deep learning in general, are not yet part of the toolkit in empirical finance. This paper is meant to show that they are a relevant tool, and that the problem of sampling financial correlation matrices using GANs deserves further exploration.

3 The space of correlation matrices

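Let $C \in \mathbb{R}^{n \times n}$ be a correlation matrix, that is, $C = C^\top$, $C_{ii} = 1$ for all $i$, and $x^\top C x \geq 0$ for all $x \in \mathbb{R}^n$.

Let the elliptope $\mathcal{E}_n$ of dimension $n(n-1)/2$ be the set corresponding to the upper triangular coefficients of $n \times n$ correlation matrices. More formally,

$$\mathcal{E}_n = \left\{ (C_{ij})_{1 \leq i < j \leq n} \in \mathbb{R}^{n(n-1)/2} \;\middle|\; C = C^\top,\ C_{ii} = 1 \ \forall i,\ C \succeq 0 \right\}.$$

An $n \times n$ correlation matrix can be viewed as a point in $\mathcal{E}_n$.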
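The correspondence between a correlation matrix and a point of the elliptope can be sketched as follows, assuming NumPy. The helper names are hypothetical; the point is simply the vector of strictly upper triangular coefficients, and membership is checked through the three defining properties (symmetry, unit diagonal, positive semi-definiteness).

```python
import numpy as np

def is_correlation_matrix(c, tol=1e-8):
    """Check symmetry, unit diagonal and positive semi-definiteness."""
    c = np.asarray(c, dtype=float)
    if c.ndim != 2 or c.shape[0] != c.shape[1]:
        return False
    symmetric = np.allclose(c, c.T, atol=tol)
    unit_diagonal = np.allclose(np.diag(c), 1.0, atol=tol)
    # Symmetrize before eigvalsh so that tiny asymmetries do not matter.
    min_eigenvalue = np.linalg.eigvalsh((c + c.T) / 2.0).min()
    return bool(symmetric and unit_diagonal and min_eigenvalue >= -tol)

def to_elliptope_point(c):
    """Flatten the strictly upper triangular coefficients, row by row."""
    c = np.asarray(c, dtype=float)
    return c[np.triu_indices(c.shape[0], k=1)]

def from_elliptope_point(x, n):
    """Rebuild the symmetric unit-diagonal matrix from n(n-1)/2 coefficients."""
    upper = np.zeros((n, n))
    upper[np.triu_indices(n, k=1)] = x
    return upper + upper.T + np.eye(n)
```

Note that `from_elliptope_point` accepts any vector of the right length; only vectors for which `is_correlation_matrix` returns true belong to the elliptope.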

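3.1 $n = 3$ case

To build intuition, let’s first consider the $n = 3$ case. We can visually verify that a simple GAN is able to recover the whole space of empirical correlations.

In Figure 1, 10,000 blue points are sampled uniformly (in the Lebesgue measure sense) from $\mathcal{E}_3$ using the onion method [8], where, writing $(x, y, z) = (C_{12}, C_{13}, C_{23})$,

$$\mathcal{E}_3 = \left\{ (x, y, z) \in [-1, 1]^3 \;\middle|\; 1 + 2xyz - x^2 - y^2 - z^2 \geq 0 \right\}.$$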

The orange points are 10,000 3-by-3 matrices obtained by randomly selecting (without replacement) 3 stocks among the 500 in the S&P 500, and then estimating the correlations between their daily returns over one year (252 business days). We can notice that the orange set (empirical correlations) is a strict subset of the blue set (whole space of valid correlation matrices), concentrated around average to high positive values. A simple GAN is able to recover this distribution (green points): it generates only valid correlation matrices, with a support closely matching that of the empirical ones, and a higher concentration around the average to high positive values.
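The blue cloud of Figure 1 can be approximated without the onion method. The sketch below, assuming NumPy, uses plain rejection sampling instead (a swapped-in technique, not the method of [8]): points are drawn uniformly in the cube $[-1, 1]^3$ and kept when the determinant of the corresponding 3-by-3 unit-diagonal matrix is non-negative, which inside the cube is equivalent to positive semi-definiteness. The function names are hypothetical.

```python
import numpy as np

def in_elliptope_3(x, y, z):
    """Determinant test for [[1, x, y], [x, 1, z], [y, z, 1]] with |x|,|y|,|z| <= 1."""
    return 1.0 + 2.0 * x * y * z - x**2 - y**2 - z**2 >= 0.0

def sample_elliptope_3(n_samples, seed=0):
    """Uniform samples from the 3D elliptope by rejection from the cube."""
    rng = np.random.default_rng(seed)
    kept, total = [], 0
    while total < n_samples:
        # Draw a batch in the cube and keep the points passing the test.
        pts = rng.uniform(-1.0, 1.0, size=(2 * n_samples, 3))
        mask = in_elliptope_3(pts[:, 0], pts[:, 1], pts[:, 2])
        kept.append(pts[mask])
        total += int(mask.sum())
    return np.concatenate(kept)[:n_samples]
```

Rejection sampling is only practical in very low dimension, since the fraction of the cube occupied by the elliptope shrinks quickly as $n$ grows; this is one reason dedicated methods such as the onion method exist.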

Figure 1 (panels: the 3D elliptope; empirical correlation matrices; correlations sampled from a GAN, green points): We can visually inspect the results: a simple GAN is able to sample realistic financial correlation matrices.

3.2 $n > 3$ case

In financial applications, $n$ typically ranges from a few dozen to a couple of hundred, and up to a few thousand in the most extreme cases. The large $n$ case is more difficult for many reasons, from statistical to computational. For our purposes, it is (i) harder to assess the quality and coverage of the generated samples and (ii) harder to train GANs, as standard neural networks are data inefficient on correlation matrices because of their equivalence under permutation: when estimating a correlation matrix on a set of stock returns, the order of these stocks is arbitrary. There are $n!$ possible orders, and therefore $n!$ different correlation matrices, but they all essentially describe the same correlation structure. We would like the output of a neural network (here, the decision of the GAN discriminator, or critic: fake or real) to be invariant to permutations.

To solve this problem, we need to enforce permutation invariance either in the network (some early attempts exist in the literature [23]) or in the representation of the correlation matrix. The latter is the approach chosen in this work, namely we choose a representative for the equivalence class. We propose to consider $\pi(C)$, where $\pi$ is a permutation (applied to both the rows and the columns of $C$) induced by a hierarchical clustering algorithm, inspired by one of the stylized facts, namely the hierarchical structure of financial correlations [17]. We show in Figure 2 the result of applying $\pi$ to a given correlation matrix.

Figure 2: Two equivalent correlation matrices. The one on the left has been obtained by estimation on returns of arbitrarily ordered stocks; the one on the right by applying the permutation $\pi$ induced by hierarchical clustering.
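A permutation induced by hierarchical clustering can be sketched as follows, assuming NumPy and SciPy. The text does not specify the linkage or the distance used, so the average linkage and the distance $\sqrt{2(1 - C_{ij})}$ of [17] are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

def hierarchical_order(corr, method="average"):
    """Leaf order of a dendrogram built on correlation distances."""
    # Distance of [17]; clip guards against tiny negative rounding errors.
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    return leaves_list(linkage(condensed, method=method))

def sort_by_hierarchy(corr, method="average"):
    """Apply the same permutation to rows and columns of corr."""
    order = hierarchical_order(corr, method=method)
    return corr[np.ix_(order, order)], order
```

The leaf order of a dendrogram is only defined up to flips of its subtrees, so this choice of representative reduces the ordering ambiguity rather than removing it entirely.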

4 Results and Evaluation

We apply a deep convolutional generative adversarial network (DCGAN) [20], whose architecture is known to be able to learn a hierarchy of representations from object parts to scenes in natural images, to approximately 10,000 empirical correlation matrices estimated on S&P 500 returns and sorted using the hierarchical clustering permutation $\pi$. Note that the matrices generated by the GAN models are not exactly correlation matrices: their diagonal is not exactly equal to 1 (coefficients obtained are around 0.998); the matrices look visually symmetric but are not exactly so; and small negative eigenvalues can be found. We post-process the results using the alternating projections method described in [11] to find the nearest correlation matrix.
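The post-processing step can be sketched as follows, assuming NumPy. This is a simplified, unweighted version of the alternating projections scheme of [11] with Dykstra's correction, written for illustration; the function name, the stopping rule and the default parameters are assumptions rather than the settings used in the experiments.

```python
import numpy as np

def nearest_correlation(a, max_iter=500, tol=1e-8):
    """Approximate the nearest correlation matrix in Frobenius norm."""
    y = (np.asarray(a, dtype=float) + np.asarray(a, dtype=float).T) / 2.0
    correction = np.zeros_like(y)  # Dykstra's correction term
    for _ in range(max_iter):
        r = y - correction
        # Projection onto the positive semi-definite cone.
        eigvals, eigvecs = np.linalg.eigh(r)
        x = (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T
        correction = x - r
        # Projection onto symmetric matrices with unit diagonal.
        y_new = (x + x.T) / 2.0
        np.fill_diagonal(y_new, 1.0)
        change = np.linalg.norm(y_new - y, "fro")
        y = y_new
        if change <= tol * max(1.0, np.linalg.norm(y, "fro")):
            break
    return y
```

The returned matrix is symmetric with an exact unit diagonal; its smallest eigenvalue is non-negative only up to the tolerance reached when the iteration stops.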

Results obtained are evaluated using the stylized facts described in Section 2.2: do we recover the main characteristics of financial correlation matrices? Essentially, yes. The tails of the distributions are not perfectly simulated, though. Comparisons between empirical and synthetic samples are displayed in Figures 3, 4, 5 and 6.

Figure 3: (Left) Distribution of correlations; (Right) log distribution. The distributions of empirical and DCGAN-generated correlation coefficients match closely: they have approximately the same mean (0.36) and standard deviation (0.13). The log-plot shows some discrepancies in the tails.
Figure 4: (Left) Distribution of eigenvalues: the synthetic eigenvalue distribution shares the characteristics of the empirical one, i.e. a very large first eigenvalue and a few others outside the bulk of the distribution; (Right) First eigenvector entries: all entries of the dominant eigenvector are positive.
Figure 5: Top row: Three randomly selected empirical correlation matrices; Bottom row: Three DCGAN-generated correlation matrices. We can notice the existence of hierarchical clusters in both sets of matrices.
Figure 6: Log-log plot of the distribution of node degrees in the MST. The DCGAN captures the distribution of degrees well (seemingly following a power law) except for the tail: a very small number of nodes have very high degrees. Typically, General Electric is known to occupy a central position in the S&P 500 MST.
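The node degrees plotted in Figure 6 can be computed from a correlation matrix as sketched below, assuming NumPy. The MST is built with Prim's algorithm on the distance $\sqrt{2(1 - C_{ij})}$ of [17]; both the choice of algorithm and the function name are assumptions made for illustration.

```python
import numpy as np

def mst_degrees(corr):
    """Degree of each node in the minimum spanning tree of the correlation graph."""
    n = corr.shape[0]
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()             # cheapest known link from the tree to each node
    parent = np.zeros(n, dtype=int)   # tree endpoint of that cheapest link
    degrees = np.zeros(n, dtype=int)
    for _ in range(n - 1):
        # Pick the node outside the tree that is closest to the tree.
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        degrees[j] += 1
        degrees[parent[j]] += 1
        in_tree[j] = True
        # Update the cheapest links using the newly added node.
        closer = (dist[j] < best) & ~in_tree
        best[closer] = dist[j][closer]
        parent[closer] = j
    return degrees
```

Applying `np.bincount` to the degrees of many matrices gives the empirical degree distribution, which can then be compared between empirical and generated samples on a log-log scale.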

5 Discussion

We have proposed a novel approach using GANs to generate realistic financial correlation matrices. The approach can be perfected, notably by spending more time and resources on experimental settings, but results showcased in this work are convincing. With this new tool, we can, for example, revisit the results described in [12] quoted in our introduction, which compares portfolios based on graphs to Markowitz-optimal portfolios.

It would be interesting to explore the use of Topological Data Analysis to compare the empirical data manifold to the synthetic data manifold as proposed in [13]. In this paper, we verified that the generated samples are realistic, but do they span the whole subspace of realistic financial correlation matrices? We might only sample from a restricted part of the space, for example due to a mode collapse during the GAN training.

This work could be an important component in improving Monte Carlo backtesting [16]: Many paths can be sampled from a multivariate distribution parameterized by GAN-generated correlation matrices. Exploring conditional generation, for example conditioning on a market regime variable (risk-on, risk-off / quantitative easing, quantitative tightening / global crisis or not), could lead to new ways of stress testing portfolios.
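As an illustration of the Monte Carlo use case, the sketch below, assuming NumPy, draws return paths from a multivariate Gaussian whose correlation matrix is given. The Gaussian assumption, the volatility vector `vols` and the function name are hypothetical choices made for the example; any other multivariate distribution parameterized by a correlation matrix could be substituted.

```python
import numpy as np

def sample_return_paths(corr, vols, n_days, n_paths, seed=0):
    """Draw Gaussian return paths of shape (n_paths, n_days, n_assets)."""
    rng = np.random.default_rng(seed)
    vols = np.asarray(vols, dtype=float)
    # Covariance from correlations and per-asset volatilities.
    cov = np.asarray(corr, dtype=float) * np.outer(vols, vols)
    # Cholesky requires a positive definite covariance matrix.
    chol = np.linalg.cholesky(cov)
    z = rng.standard_normal((n_paths, n_days, len(vols)))
    # Each row z @ chol.T has covariance chol @ chol.T = cov.
    return z @ chol.T
```

A correlation matrix with zero eigenvalues, which the nearest correlation matrix projection can produce, would need a small regularization or an eigendecomposition-based factorization instead of Cholesky.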

Finally, investigating the latent space of these models could lead to a better understanding of financial correlations, and maybe the discovery of unknown stylized facts.


  • [1] A. Aubry, A. De Maio, and L. Pallotta (2017) A geometric approach to covariance matrix estimation and its applications to radar problems. IEEE Transactions on Signal Processing 66 (4), pp. 907–922.
  • [2] B. Balaji, F. Barbaresco, and A. Decurninge (2014) Information geometry and estimation of Toeplitz covariance matrices. In 2014 International Radar Conference, pp. 1–4.
  • [3] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan (2019) High fidelity speech synthesis with adversarial networks. arXiv preprint arXiv:1909.11646.
  • [4] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann (2018) NetGAN: generating graphs via random walks. arXiv preprint arXiv:1803.00816.
  • [5] J. Bun, J. Bouchaud, and M. Potters (2017) Cleaning large correlation matrices: tools from random matrix theory. Physics Reports 666, pp. 1–109.
  • [6] G. Caldarelli, S. Battiston, D. Garlaschelli, and M. Catanzaro (2004) Emergence of complexity in financial networks. In Complex Networks, pp. 399–423.
  • [7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
  • [8] S. Ghosh and S. G. Henderson (2003) Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation (TOMACS) 13 (3), pp. 276–294.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [10] P. Henry-Labordere (2019) Generative models for financial data. Available at SSRN 3408007.
  • [11] N. J. Higham (2002) Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis 22 (3), pp. 329–343.
  • [12] A. Hüttner, J. Mai, and S. Mineo (2018) Portfolio selection based on graphs: does it align with Markowitz-optimal portfolios?. Dependence Modeling 6 (1), pp. 63–87.
  • [13] V. Khrulkov and I. Oseledets (2018) Geometry score: a method for comparing generative adversarial networks. arXiv preprint arXiv:1802.02664.
  • [14] A. Koshiyama, N. Firoozye, and P. Treleaven (2019) Generative adversarial networks for financial trading strategies fine-tuning and combination. arXiv preprint arXiv:1901.01751.
  • [15] L. Laloux, P. Cizeau, M. Potters, and J. Bouchaud (2000) Random matrix theory and financial correlations. International Journal of Theoretical and Applied Finance 3 (03), pp. 391–397.
  • [16] M. Lopez de Prado (2019) Tactical investment algorithms. Available at SSRN 3459866.
  • [17] R. N. Mantegna (1999) Hierarchical structure in financial markets. The European Physical Journal B-Condensed Matter and Complex Systems 11 (1), pp. 193–197.
  • [18] G. Marti, F. Nielsen, M. Bińkowski, and P. Donnat (2017) A review of two decades of correlations, hierarchies, networks and clustering in financial markets. arXiv preprint arXiv:1703.00485.
  • [19] R. B. Nelsen (2007) An introduction to copulas. Springer Science & Business Media.
  • [20] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • [21] M. Tumminello, F. Lillo, and R. N. Mantegna (2007) Shrinkage and spectral filtering of correlation matrices: a comparison via the Kullback-Leibler distance. arXiv preprint arXiv:0710.0576.
  • [22] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pp. 82–90.
  • [23] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401.