Cauchy noise loss
for stochastic optimization of random matrix models
via free deterministic equivalents
Abstract.
Based on free probability theory and stochastic optimization, we introduce a new parameter estimation method for random matrix models. Our method is inspired by free deterministic equivalents and by iterative methods for computing Cauchy transforms.
Moreover, we study an asymptotic property of a generalization gap and present numerical experiments on the optimization. We treat two random matrix models: the compound Wishart model and the information-plus-noise model. In addition, we propose a new rank recovery method for the information-plus-noise model, and experimentally demonstrate that it recovers the true rank even if the rank is not low.
Our simulation code is available at https://github.com/ThayaFluss/cnl.
Key words and phrases:
Random Matrix Theory, Free Probability Theory, Stochastic Optimization, Compound Wishart Matrix, Information-plus-noise Matrix, Rank Estimation, Rank Recovery

Contents
 1 Introduction
 2 Related Work
 3 Basic Notation
 4 Random Matrix Models
 5 Theory and Algorithm
 6 Results of Numerical Experiments
 7 Conclusion and Future work
1. Introduction
The development of free probability theory (FPT for short), invented by Voiculescu [26], has expanded the scope of research on random matrices. Free probability theory is an invaluable tool for describing the asymptotic behavior of many random matrices when their size is large. For example, consider a fixed multivariate polynomial , independent random matrices , and the following:

(1) deterministic matrices ,

(2) the empirical spectral distribution of .
Then FPT answers how to know (2) from (1) for many polynomials and random matrices. However, little is known about the inverse direction, that is, how to know (1) from (2). This direction can be regarded as a statistical problem: how to estimate the parameters of a random matrix model from observed empirical spectral distributions. In addition, estimating the parameters of systems involving random matrices appears in several fields of engineering, for example, signal processing and machine learning. Therefore, we are interested in finding a common framework that treats several random matrix models using their algebraic structures.
Compound Wishart matrices and information-plus-noise matrices are typical classes of random matrices. In this paper, we apply our methods to their parametric families: compound Wishart models (CW models, for short) and information-plus-noise models (IPN models). Compound Wishart matrices were introduced by Speicher [23]; they also appear as sample covariance matrices of correlated samplings. Their modifications appear in the analysis of some statistical models ([9], [7], [11], and [13]). See [15] for more details. The IPN model appears in signal processing [22]. In addition, the matrix factorization model is a version of the IPN model. The matrix factorization model appears in several fields; it is also used for a rank estimation problem in the reduced-rank singular value decomposition (see [6] for details).
Our work is strongly inspired by the free deterministic equivalent (FDE for short) introduced by Speicher and Vargas [24][25]. Roughly speaking, we can interpret independent random matrices as deterministic matrices of operators in an infinite-dimensional probability space. The origin of FDE can be found in Neu–Speicher [19] as a mean-field approximation method for an Anderson model in statistical physics. One can also consider FDE as an approximation method that eliminates randomness for computing the expectation. In particular, FDE is a “lift” of the deterministic equivalent. More precisely, FDE is an approximation of a random matrix model at the level of operators, whereas the deterministic equivalent is an approximation at the level of Cauchy transforms. The deterministic equivalent is known as an approximation method for Cauchy transforms of random matrices in the literature on wireless networks (see [9]). Despite its rich background in FPT, the algorithm of FDE is not complicated. Roughly speaking, its main step is to replace each Gaussian random variable in the entries of a random matrix model by an “infinite size” Gaussian random matrix, which is called a circular element in FPT.
Moreover, our algorithm depends on two iterative methods for computing Cauchy transforms, based on Helton–Far–Speicher [14] and Belinschi–Mai–Speicher [2], since analytic computations of Cauchy transforms are difficult in many cases.
In the parameter optimization step, we apply a stochastic gradient descent method to minimize the Cauchy cross entropy, using the gradient of a random loss function, the Cauchy noise loss. The Cauchy noise loss and the Cauchy cross entropy are the main topics of this article. In fact, the Cauchy cross entropy against an empirical spectral distribution plays the role of an empirical risk in our framework. However, computing the Cauchy cross entropy requires convolutions with Poisson kernels (note that the Poisson kernel is related to the Cauchy transform). To avoid computing convolutions directly, we consider a further randomized version of the Cauchy cross entropy. This is exactly the Cauchy noise loss. In other words, the Cauchy noise loss can be regarded as an averaged log-likelihood of samples perturbed by independent Cauchy random variables. See Section 4.3 for the main idea of the Cauchy noise loss.
Here we summarize our contributions.
Our major contribution is introducing a common framework for the parameter optimization of random matrix models, which combines the Cauchy noise loss, FDE, iterative methods for computing Cauchy transforms, and a stochastic gradient descent method. In addition, we give a concise and general method for computing gradients of Cauchy transforms of FDE.
The second contribution is proving the asymptotic properties of the Cauchy noise loss and the Cauchy cross entropy.
The last one is to show experimental results of our method on random matrix models (the CW and IPN models) under several settings of the scale of the Cauchy noise. In addition, we apply the FDE-IPN model to a rank estimation problem.
2. Related Work
There are several applications of deterministic equivalents and FDE to the analysis of multi-input multi-output channels [9][24].
Ryan [22] applied the free deconvolution to IPN models. Their method is based on evaluating the difference of moments, namely the mean square error of moments. Since it uses only a finite number of lower-order moments, it captures only partial information about the empirical distribution. By contrast, our method uses the full information of the empirical distribution.
The variational Bayesian method (see [6]) is a deterministic approximation in the Bayesian framework. Nakajima–Sugiyama–Babacan–Tomioka [17][18] gave the global analytic optimal solution of the empirical variational Bayesian matrix factorization (EVBMF, for short), and applied it to a rank estimation problem. Recall that the matrix factorization model is a version of the IPN model. Our method gives a different way of deterministic approximation for the non-Bayesian model. Note that the Bayesian framework is out of the scope of this paper.
3. Basic Notation
In this paper, we fix a probability space . A random variable is a valued Borel measurable function on the probability space.
Measures

for any integrable or nonnegative real random variable .

for a square-integrable real random variable .

.


.

.

for .
Function Spaces
In addition, we use the following notion of function spaces.

We denote by the set of coefficient polynomials.

.

.

.
Entropy
Let us recall the entropy for strictly positive continuous probability density functions.

the set .

For any , we denote by .

Let with . The cross entropy of against the reference is defined by

For any such that , we denote by the entropy of .

The relative entropy (or Kullback–Leibler divergence) of against is defined as .
We formulate the principle of minimum cross entropy for as follows.
Proposition 3.1.
Let with . Then
where for a valued function on a set which has a minimum point.
Proof.
Let with . Since , it holds that and the equality holds if and only if , which proves the assertion. ∎
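For the record, the decomposition used in this proof can be written out explicitly; writing $q$ and $p$ for the densities of the two measures, $H(q,p)$ for the cross entropy, $H(q)$ for the entropy, and $D(q\|p)$ for the relative entropy (these symbol choices are ours, since the original formulas did not survive extraction):

```latex
H(q,p) \;=\; -\int q(x)\log p(x)\,dx
       \;=\; \underbrace{-\int q(x)\log q(x)\,dx}_{H(q)}
        \;+\; \underbrace{\int q(x)\log\frac{q(x)}{p(x)}\,dx}_{D(q\,\|\,p)\;\ge\;0}
       \;\ge\; H(q),
```

with equality if and only if $q = p$ almost everywhere, by Gibbs' inequality.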
Matrices
Let be or , and .

Let us denote by the set of rectangular matrices over .

.

for , and .
Moments and Spectral Distributions
For any , we denote the th moment of for by . For any random variable whose law is , we define its moment by . All moments of a probability measure (resp. a random variable ) are well-defined if (resp. ).
For any self-adjoint matrix , we denote the set of all its eigenvalues by . The spectral distribution of , denoted by , is defined as the discrete measure
We define the moment of as . Note that .
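Concretely, the moment of the spectral distribution of a self-adjoint matrix is the normalized trace of the corresponding matrix power. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def spectral_moment(A, k):
    """k-th moment of the spectral distribution of a self-adjoint matrix A:
    the normalized trace (1/N) tr(A^k), i.e. the average of the k-th powers
    of the eigenvalues of A."""
    N = A.shape[0]
    return float(np.trace(np.linalg.matrix_power(A, k)).real) / N

# The spectral distribution of diag(1, 2, 3) puts mass 1/3 on each eigenvalue,
# so its second moment is (1 + 4 + 9) / 3.
A = np.diag([1.0, 2.0, 3.0])
print(spectral_moment(A, 2))
```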
4. Random Matrix Models
In this section, we introduce random matrix models and our main idea.
4.1. Gaussian Random Matrix models
Definition 4.1.
A real normalized Gaussian random matrix (GRM over for short) of is the matrix whose entries are independent and identically distributed with . A complex normalized Gaussian random matrix (GRM over for short) of is the matrix whose entries are given by , where the family is independent and each element is distributed with .
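As a sketch of sampling such matrices (the exact normalization in Definition 4.1 was lost in extraction; below we assume the common convention that a p x d matrix has i.i.d. entries of variance 1/d, so that singular values stay of order one — this is an assumption, not the paper's stated choice):

```python
import numpy as np

def real_grm(p, d, rng=None):
    # Real normalized Gaussian random matrix: i.i.d. entries N(0, 1/d).
    # The variance 1/d is an assumed convention, not taken from the paper.
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=(p, d))

def complex_grm(p, d, rng=None):
    # Complex normalized Gaussian random matrix: independent real and
    # imaginary Gaussian parts, scaled so that E|Z_ij|^2 = 1/d.
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(size=(p, d)) + 1j * rng.normal(size=(p, d))
    return z / np.sqrt(2.0 * d)
```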
Definition 4.2.
Let be a self-adjoint polynomial (that is, it is stable under replacing by ) in noncommutative dummy variables and their adjoints . Let be a family of pairs of natural numbers and be the corresponding evaluation
of , where products and sums satisfy dimension compatibility.
Then the real (resp. complex) Gaussian random matrix model (GRM model, for short) of type on a subset is the restriction of the following map to ;
where the family is a collection of independent GRM over (resp. ).
We introduce examples of GRM models which are in the scope of our numerical experiments.
Definition 4.3.

A compound Wishart model (CW model for short) of type on , denoted by , is the GRM model of type on the subset , where . Note that

An information-plus-noise model (IPN model for short) of type on a subset , denoted by , is the GRM model of type on , where is a polynomial of dummy variables given by Note that
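Since the defining polynomials did not survive extraction, the following sketch only illustrates the usual shapes of these two models; the conventions W = Z*BZ for the compound Wishart matrix and Y = (A + sigma Z)(A + sigma Z)* for the information-plus-noise matrix, as well as all names and normalizations, are our assumptions:

```python
import numpy as np

def compound_wishart_sample(B, p, rng):
    # Compound Wishart sample W = Z^* B Z, with Z a d x p Gaussian matrix
    # of i.i.d. N(0, 1/d) entries and B a deterministic d x d parameter.
    # (Normalization assumed; the paper's exact formula was lost.)
    d = B.shape[0]
    Z = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, p))
    return Z.T @ B @ Z

def ipn_sample(A, sigma, rng):
    # Information-plus-noise sample Y = (A + sigma Z)(A + sigma Z)^*,
    # with A a deterministic p x d "signal" and Z a p x d Gaussian matrix.
    p, d = A.shape
    Z = rng.normal(0.0, 1.0 / np.sqrt(d), size=(p, d))
    X = A + sigma * Z
    return X @ X.T
```

Both samples are self-adjoint (for symmetric B), so their empirical spectral distributions are real measures, matching Definition 4.4.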
Definition 4.4.
For , the empirical spectral distribution is defined by the following random measure;
Remark 4.5.
Let be a singular value decomposition of , where is a unitary matrix, is a unitary matrix, and is a rectangular diagonal matrix. Then . Hence the empirical distribution of is equal to that of , since the joint distribution of the entries of and that of is the same. Similarly, it holds that , where is a diagonalization of . Hence, in the parameter estimation of CW or IPN models from each empirical spectral distribution, we cannot recover such unitary matrices. Therefore, we consider the following restricted domains of parameters.
Notation 4.6.
Let with . We denote by the diagonal embedding defined by if , otherwise, . We write and .
Definition 4.7.
Let with and . Let us define the domains
(4.1)  
(4.2) 
4.2. Cauchy Transform and Cauchy Distribution
The Cauchy distribution plays a central role in this paper.
Definition 4.8.

The Cauchy distribution with scale parameter is the probability measure over whose density function is given by the following Poisson kernel;
We call a random variable on a Cauchy random variable of scale , denoted by , if its probability law equals .

The Cauchy transform of is the holomorphic function on defined as
For any Cauchy random variable , the random variable has absolute moments of every order;
(4.3) 
This follows from the facts and , for any . In particular, the entropy of each Cauchy random variable is well-defined and given by .
One of the relationships between the Cauchy transform and the Cauchy distribution is given by the convolution with the measure;
(4.4) 
where is the convolution of and . Note that .
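For an empirical (discrete) measure, the identity (4.4) can be checked directly: the negative imaginary part of the Cauchy transform on the horizontal line at height gamma, divided by pi, is exactly the density of the measure convolved with the Cauchy distribution of scale gamma. A small numerical sketch (the names are ours):

```python
import numpy as np

def cauchy_transform_discrete(z, eigs):
    # Cauchy transform of the empirical measure (1/N) sum_i delta_{lambda_i}:
    # G(z) = (1/N) sum_i 1 / (z - lambda_i).
    return np.mean(1.0 / (z - eigs))

def poisson_smoothed_density(x, eigs, gamma):
    # Density of the empirical measure convolved with Cauchy(gamma):
    # an average of Poisson kernels centered at the eigenvalues.
    return np.mean(gamma / (np.pi * ((x - eigs) ** 2 + gamma ** 2)))

eigs = np.array([0.0, 1.0, 3.0])
x, gamma = 0.7, 0.1
slice_value = -cauchy_transform_discrete(x + 1j * gamma, eigs).imag / np.pi
print(np.isclose(slice_value, poisson_smoothed_density(x, eigs, gamma)))  # True
```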
Definition 4.9.
Let . For , we call the strictly positive valued function the slice of . If there is no confusion, we also call it the slice of the Cauchy transform.
For fixed , the slices have enough information to distinguish original measures as follows.
Lemma 4.10.
Fix . Let . Then if and only if .
Proof.
For any probability measure , let us denote its Fourier transform by . Similarly, we define the Fourier transform of any probability density function. Fix . Assume that . Since the Fourier transform of the convolution of two measures is equal to the product of their individual Fourier transforms, we have . Because , we have . Since the Fourier transformation is injective, we have . ∎
See [10] for basic results of the Fourier transform.
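The decisive ingredient in the proof above is a standard fact: the Fourier transform of the Poisson kernel never vanishes. Writing $P_\gamma$ for the Poisson kernel of scale $\gamma$ and $\hat\mu$ for the Fourier transform of $\mu$ (our notation, since the original symbols were lost in extraction):

```latex
\widehat{P_\gamma}(t)
  = \int_{\mathbb{R}} e^{-itx}\,\frac{\gamma}{\pi(x^2+\gamma^2)}\,dx
  = e^{-\gamma|t|} \neq 0
  \qquad\text{for all } t \in \mathbb{R},
\qquad\text{so}\qquad
\widehat{\mu * P_\gamma} = \hat\mu \cdot \widehat{P_\gamma}.
```

Dividing by the nowhere-vanishing factor $e^{-\gamma|t|}$ recovers $\hat\mu$, and hence $\mu$, by the injectivity of the Fourier transform.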
4.3. Main Idea and Example
In our approach, to evaluate how close is to a reference probability measure , we estimate the cross entropy of the slice of against the slice of , instead of that of against . We call the former cross entropy the Cauchy cross entropy (see Definition 5.8). The first reason why we use the slices is that the Cauchy transforms of spectral distributions of random matrix models become numerically computable if we replace the random matrix model by its free deterministic equivalent (FDE for short, see Definition 5.4). Note that the convergence of the iterative methods is rigorously proven (see Section 5.4). In addition, the gradient of each slice is also numerically computable (see Section 5.5). The second reason is that the Cauchy cross entropy is always well-defined (see Section 5.3.1). Recall that the usual cross entropy is possibly ill-defined if the measures have singular parts or compact supports. The last reason is that the slice has enough information to distinguish the original measures (see Lemma 4.10).
In our statistical estimation method, we evaluate, for a fixed , the Cauchy cross entropy of against the empirical distribution instead of that against the slice of the deterministic reference probability measure . However, computing the Cauchy cross entropy against the empirical distribution via convolution with the Poisson kernel has a high time complexity. To overcome this difficulty, we generate i.i.d. Cauchy random variables (Cauchy noises) of scale and further approximate the empirical cross entropy by a random variable, which we call the Cauchy noise loss.
The upper-left subfigure of Figure 1 shows a single-shot ESD perturbed by Cauchy noises and the slice of the Cauchy transform of the FDE compound Wishart model (FDE-CW model, for short, see Definition 5.5). More precisely, the histogram shows the perturbed single-shot empirical eigenvalues , where are the eigenvalues of a single-shot observation of the compound Wishart random matrix and is an independent observation of Cauchy noises of scale . We set , and . The red dashed line indicates the slice of the Cauchy transform of the FDE of the random matrix model. We observe that the perturbed single-shot ESD approximates the slice of the FDE model well (see Section 5.3 for the theoretical reason). Therefore, we can expect that the perturbed single-shot ESD contains enough information to estimate the parameters of the random matrix model.
The last idea of our method is to use a stochastic gradient descent method (SGD, for short, see [4]) to avoid repeatedly evaluating the cross entropy against the entire perturbed single-shot ESD. Roughly speaking, we iteratively update the parameters of the FDE model based on the cross entropy against a mini-batch of the perturbed single-shot ESD randomly picked from the whole at each iteration. Note that each mini-batch contains few elements ( to , in our experiments). See Algorithm 1 for details.
5. Theory and Algorithm
5.1. Free Deterministic Equivalents
In this section, we introduce a version of free deterministic equivalents, following Speicher–Vargas [24]. To implement our algorithm, infinite-dimensional operators are not needed; our method works using only finite-dimensional ones. However, infinite-dimensional operators play a theoretically critical role.
First we summarize some definitions from operator algebras and free probability theory. See [16] for details.
Definition 5.1.

A C*-probability space is a pair satisfying the following.

The set is a unital C*-algebra, that is, a possibly noncommutative subalgebra of an algebra of bounded linear operators on a Hilbert space over satisfying the following conditions:

it is stable under the adjoint ,

it is closed under the topology of the operator norm of ,

it contains the identity operator as the unit of .


The function on is a faithful tracial state, that is, a -valued linear functional with

for any , and the equality holds if and only if ,

,

for any .



A possibly noncommutative subalgebra of a C*-algebra is called a ∗-subalgebra if is stable under the adjoint operation . Moreover, it is called a unital C*-subalgebra if the ∗-subalgebra is closed under the operator norm topology and contains as its unit.

Two unital C*-algebras are called isomorphic if there is a bijective linear map between them which preserves the ∗-operation and the multiplication.

Let us denote by the set of self-adjoint elements (that is, ) of .

, for any .

The distribution of is the probability measure determined by

For , we define its Cauchy transform by . Note that .
Definition 5.2.
A family of subalgebras of is said to be free if the following factorization rule holds: for any and indices with , and with , it holds that
A family of self-adjoint elements is called free if the family of the subalgebras of polynomials in is free.
We introduce special elements in a noncommutative probability space.
Definition 5.3.
Let be a C*-probability space.

An element is called standard semicircular if its distribution is given by the semicircular law
where is the indicator function for any subset .

Let . An element is called circular of variance if
where is a pair of free standard semicircular elements.

A free circular family (resp. standard free circular family) is a family of circular elements such that is free (resp. and each element is of variance ).
A free deterministic equivalent (FDE, for short) of a polynomial of GRM and deterministic matrices is constructed by replacing the GRM by matrices of circular elements. Equivalently, FDE is obtained by taking the limit of the amplified models, which are constructed by (1) copying the deterministic matrices by taking tensor products with the identity and (2) enlarging the GRM by simply increasing the number of entries. See [24] and [25, p. 19] for details. We reformulate FDE for random matrix models.
Definition 5.4 (Free deterministic equivalents).
Fix a Cprobability space . Let be a Gaussian random matrix model over or of type on a subset . Then its free deterministic equivalent (FDE for short) is the map defined as
where are rectangular matrices such that the collection of all rescaled entries
is a standard free circular family in .
In addition, if and the FDE is self-adjoint for any element in , the Cauchy transform of the FDE is called the deterministic equivalent.
Note that each coefficient is multiplied for compatibility with the normalization of Gaussian random matrices. In addition, we note that the FDE model does not depend on the field or .
The original definition of FDE [24][25] treats not only Ginibre matrices but also more general random matrices.
Definition 5.5.
Let .

The free deterministic equivalent information-plus-noise model (FDE-IPN model, for short) of type is defined as the FDE of . Note that
where is a standard free circular family.

The free deterministic equivalent compound Wishart model (FDE-CW model, for short) of type is defined as the FDE of . Note that
where is a standard free circular family.
Remark 5.6.
We reformulate the convergence of the gap between the FDE and the original random matrix model in the case where the parameters belong to bounded sets.
Proposition 5.7.

Let us consider the following sequences:

such that with and ,

such that and .
Then for any , it holds that almost surely,


Let us consider the following sequences:

such that with and ,

such that and ,

such that and .
Then for any , it holds that almost surely,

Proof.
First, let us consider the case of the CW model. By the genus expansion of Gaussian random matrices (see [16] for both the real and complex cases) and the uniform boundedness of the sequences, there are constants such that for any ,
(5.1) 
By the expansion of the fluctuations of Gaussian random matrices (see [8] for the complex case and [21] for the real case) and the uniform boundedness of the sequences, there are constants such that for any ,
(5.2) 
As a consequence, we have, for any ,
(5.3) 
Since , the almost-sure convergence holds.
The proof for the IPN model follows from the same argument. ∎
5.2. Cauchy Noise Loss
Definition 5.8.
Let . Assume that . Then the Cauchy cross entropy of against with the scale parameter , denoted by , is defined as
We postpone the proof of the well-definedness of the Cauchy cross entropy and the fact to Section 5.3.1.
We have the principle of minimum Cauchy cross entropy as follows.
Proposition 5.9.
Fix . For any , it holds that
Proof.
By Proposition 5.9, the Cauchy cross entropy has enough information to distinguish probability measures with compact supports. This is one of the reasons why we define the loss function based on it. That is, we are interested in solving the following minimization problem:
(5.4) 
using only a single-shot observation of the empirical spectral distribution. Since is unknown, we are not able to evaluate the cross entropy directly. Instead, we minimize the following empirical risk.
Definition 5.10.
Let be the pair of a self-adjoint random matrix model and its FDE. Fix . The empirical FDE risk is defined as the real random variable
Note that
Recall that . Since we compute by iterative methods, computing the empirical risk by integration against the Poisson kernel has a high time complexity. To reduce the time complexity, we approximate the empirical risk by a further randomized risk using Cauchy random variables.
Definition 5.11.
Let and be real random variables. Let and . Let be a family of i.i.d. Cauchy random variables of scale which is independent of . We define the Cauchy noise loss of the scale parameter and the noise dimension as the map on given by
where
Definition 5.12.
Let be the pair of a random matrix model and its FDE. Fix and . Let be the empirical eigenvalues of for a . We define the Cauchy noise loss of the FDE model, denoted by , as the map on given by
(5.5) 
We write .
Note that
(5.6) 
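A minimal sketch of evaluating this loss (the function and variable names are ours; in the actual method the slice density of the FDE model comes from the iterative Cauchy-transform computations, which we replace here by an explicit discrete model for illustration):

```python
import numpy as np

def cauchy_noise_loss(perturbed_eigs, slice_density):
    # Average negative log of the model's slice density, evaluated at
    # eigenvalues that have been perturbed by Cauchy noise of scale gamma.
    return -np.mean(np.log(slice_density(perturbed_eigs)))

# Illustration: the slice density of a discrete "model" measure, i.e. the
# Poisson-smoothed density -Im G(x + i*gamma) / pi.
model_eigs = np.array([0.0, 1.0])
gamma = 0.5

def slice_density(x):
    x = np.asarray(x, dtype=float)
    kernels = gamma / (np.pi * ((x[:, None] - model_eigs) ** 2 + gamma ** 2))
    return kernels.mean(axis=1)

perturbed = np.array([0.2, 0.9, 1.5])  # e.g. sample eigenvalues + Cauchy noise
print(cauchy_noise_loss(perturbed, slice_density))  # a finite positive number
```

Because the slice density is strictly positive on all of the real line, the logarithm is always defined, no matter how far the heavy-tailed Cauchy noise throws a perturbed eigenvalue.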
5.2.1. Optimization Algorithm
Algorithm 1 shows the optimization algorithm based on the Cauchy noise loss.
It requires the maximum number of epochs (that is, the number of iterations of the outer loop), the learning rate policy, the settings of the Cauchy noise loss, the initial parameter of the FDE model, and the bounded convex parameter space .
In our method, the sample is assumed to be a single-shot observation of a self-adjoint square random matrix model. Throughout the algorithm, the scale is fixed. We consider the collection of the eigenvalues of the sample matrix. Each epoch of the algorithm consists of the following steps. First, we compute the learning rate based on a learning rate policy (see Section 5.6.1), and randomly permute the indices of the eigenvalues. Second, for , we update the parameters as follows: (1) generate independent samples of Cauchy random variables of scale , (2) compute by Proposition 5.23, Proposition 5.32, Theorem 5.36, Corollary 5.38 (their proofs are postponed to a later section) and (5.6), (3) update the parameters by using the gradient of the loss function, and (4) project them onto the parameter space . We continue the iteration while the number of epochs is smaller than .
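The loop just described can be sketched as follows; this is a schematic under our own naming, and the gradient of the Cauchy noise loss (computed in the paper via the iterative Cauchy-transform methods) is abstracted into a callback:

```python
import numpy as np

def optimize_cauchy_noise(eigs, grad_loss, theta0, project, gamma,
                          n_epochs, lr_schedule, batch_size=1, seed=0):
    """Schematic of the algorithm: projected SGD on the Cauchy noise loss.

    grad_loss(theta, perturbed_batch): gradient of the loss of the FDE model
    at parameter theta, given a mini-batch of Cauchy-perturbed eigenvalues.
    project(theta): projection onto the bounded convex parameter space.
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    for epoch in range(n_epochs):
        lr = lr_schedule(epoch)                 # learning rate policy
        order = rng.permutation(len(eigs))      # shuffle eigenvalue indices
        for start in range(0, len(eigs), batch_size):
            idx = order[start:start + batch_size]
            # perturb the mini-batch with fresh Cauchy noise of scale gamma
            perturbed = eigs[idx] + gamma * rng.standard_cauchy(len(idx))
            theta = project(theta - lr * grad_loss(theta, perturbed))
    return theta
```

As a sanity check, with gamma = 0, an identity projection, and the gradient of a simple quadratic loss, the loop reduces to plain SGD and converges to the sample mean of the eigenvalues.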
Remark 5.13.
Assume that the input matrix is an observation of the random matrix , where is fixed and is a true parameter. Then we can expect that the algorithm returns a parameter which is close to a local minimum point of the possibly non-convex function
since the perturbed samples are generated independently from the distribution at each epoch (recall that is fixed). The theoretical proofs of the convergence and its stability are out of the scope of this paper and postponed to future work. See Bottou [4], which summarizes the proofs of convergence of stochastic gradient descent methods under several assumptions on possibly non-convex loss functions.
If the convergence holds, then by the analysis of the FDE generalization gap (see Section 5.3), we can expect that the algorithm returns a value close to a local minimum of the function . Hence, by Proposition 5.9, if the local minimum is a global one, we can expect that is close to . Its theoretical proof is also postponed to future work.
5.3. Analysis of the FDE Generalization Gap
In this section we prove that the FDE generalization gap, which is defined as the real random variable
(5.7) 
uniformly converges to as the matrix size becomes large.
5.3.1. Basic Properties
We begin by proving that the Cauchy cross entropy is well-defined.
Lemma 5.14.
Fix . Then the following hold.

For any , it holds that

for any , where .

for any and with . Hence is well-defined.
Proof.
By the definition of the Poisson kernel, . It holds that