Cauchy noise loss

# Cauchy noise loss for stochastic optimization of random matrix models via free deterministic equivalents

Tomohiro Hayase Graduate School of Mathematics
University of Tokyo
Komaba, Tokyo 153-8914, Japan
July 14, 2019
###### Abstract.

Based on free probability theory and stochastic optimization, we introduce a new parameter estimation method of random matrix models. Our method is inspired by free deterministic equivalents and iterative methods for computing Cauchy transforms.

Moreover, we study an asymptotic property of a generalization gap and show numerical experiments of the optimization. We treat two random matrix models; the compound Wishart model and the information-plus-noise model. In addition, we propose a new rank recovery method for the information-plus-noise model, and experimentally demonstrate that it recovers the true rank even if the rank is not low.

Our simulation code is available at https://github.com/ThayaFluss/cnl.

###### Key words and phrases:
Random Matrix Theory, Free Probability Theory, Stochastic Optimization, Compound Wishart Matrix, information-plus-noise Matrix, Rank Estimation, Rank Recovery
\pdfstringdefDisableCommands

## 1. Introduction

The development of free probability theory (FPT for short) invented by Voiculescu [26] expands the scope of research of random matrices. The free probability theory is an invaluable tool for describing the asymptotic behavior of many random matrices when their size is large. For example, consider a fixed multivariate polynomial , independent random matrices , and the following;

1. deterministic matrices ,

2. the empirical spectral distribution of .

Then FPT answers how to know (2) from (1) for many polynomials and random matrices. However, little is known about its inverse direction, that is, how to know (1) from (2). This direction is regarded as a statistical problem how to estimate parameters of a random matrix model from observed empirical spectral distributions. In addition, estimating parameters of system involving random matrices appears in several fields of engineering, for example, signal processing and machine learning. Therefore, we are interested in finding a common framework to treat several random matrix models using their algebraic structures.

Compound Wishart matrices and information-plus-noise matrices are typical classes of random matrices. In this paper, we apply our methods to their families ; compound Wishart models (CW model, for short) and information-plus-noise models (IPN model). Compound Wishart matrices are introduced by Speicher [23], which also appear as sample covariance matrices of correlated samplings. Their modifications appear in analysis of some statistical models ([9], [7], [11], and [13]). See [15] for more detail. The IPN model appears in signal precessing [22]. In addition, the matrix factorization model is a version of IPN model. The matrix factorization model appears in several fields, it is also used to a rank estimation problem in the reduced-rank singular value decomposition (See [6] for the detail).

Our work is strongly inspired by the free deterministic equivalent (FDE for short) introduced by Speicher and Vargas [24][25]. Roughly speaking, we can interpret independent random matrices as deterministic matrices of operators in an infinite dimensional -probability space. The origin of FDE can be found in Neu-Speciher[19] as a mean-filed approximation method of an Anderson model in statistical physics. One can also consider FDE as one of approximation methods to eliminate randomness for computing the expectation. Particularly, FDE is a “lift” of the deterministic equivalent. More precisely, FDE is an approximation of a random matrix model at the level of operators, on the other hand, the deterministic equivalent is that at the level of Cauchy transforms. Now the deterministic equivalent is known as an approximation method of of Cauchy transforms of random matrices in several literature of wireless-network (see [9]). Despite its rich background in FPT, the algorithm of FDE is not complicated. Roughly speaking, its main step is to replace each Gaussian random variable in entries of a random matrix model by an “infinite size” Gaussian random matrix, which is called a circular element in FPT.

Moreover, our algorithm depends on two iterative computation methods of Cauchy transform based on Helton-Far-Speicher [14] and Belinschi-Mai-Speicher [2], since analytical computations of Cauchy transforms are difficult in many cases.

In the parameter optimization step, we apply a stochastic gradient decent method to minimize the Cauchy cross entropy by using gradient of a random loss function Cauchy noise loss. The Cauchy noise loss and the the Cauchy cross entropy are main topics of this article. In fact, the Cauchy cross entropy against an empirical spectral distribution has a role of an empirical risk in our framework. However, the computation of the Cauchy cross entropy needs convolutions with Poisson kernels (note that the Poisson kernel is related with the Cauchy transform). To avoid computing convolutions directly, we consider a further random version of the Cauchy cross entropy. That is just the Cauchy noise loss. In other words, the Cauchy noise loss is regarded as an averaged log-likelihood of samples perturbed by independent Cauchy random variables. See Section 4.3 for the main idea of the Cauchy noise loss.

Here we summarize our contributions.

Our major contribution is introducing a common framework for the parameter optimization of random matrix models, which is a combination of the Cauchy noise loss, FDE, iterative methods for computing Cauchy transforms, and a stochastic gradient decent method. In addition, we also give a brief and general computing method of gradients of Cauchy transforms of FDE.

The second contribution is proving the asymptotic properties of the Cauchy noise loss and the Cauchy cross entropy.

The last one is to show experimental results of our method in random matrices models (CW model and IPN model) under several settings of the scale of the Cauchy noise. In addition, we use the FDEIPN model to a rank estimation problem.

## 2. Related Work

There are several applications of deterministic equivalents and FDE to the analysis of multi input multi output channels [9] [24].

Ryan [22] applied the free deconvolution to IPN models. Their method is based on evaluating the difference of moments, the mean square error of moments. Since it uses only finite number of lower order moments, the error has subtotal information of the empirical distribution. On the contrary, our method uses full information of the empirical distribution.

The variational Bayesian method (see [6]) is a deterministic approximation in the Bayesian framework. Nakajima-Sugiyama-Babacan-Tomioka [17] and [18] gave the global analytic optimal solution of the empirical variational Bayesian matrix factorization (EVBMF, for short), and used it to a rank estimation problem. Recall that the matrix factorization model is a version of IPN model. Our method gives a different way of deterministic approximation of the non-Bayesian model. Note that the Bayesian framework is out of the scope of this paper.

There are applications of the fluctuation of FDE to some autoregressive moving-average models [12] [13]. Their methods are based on the fluctuation of CW model, and focus on the good-of-fit test of the parameter estimation, not for the parameter estimation itself.

## 3. Basic Notation

In this paper, we fix a probability space . A random variable is a -valued Borel measurable function on the probability space.

#### Measures

1. for any integrable or nonnegative real random variable .

2. for a square-integrable real random variable .

3. .

4. .

5. .

6. for .

#### Function Spaces

In addition, we use the following notion of function spaces.

1. We denote by the set of -coefficient polynomials.

2. .

3. .

4. .

#### Entropy

Let us recall on the entropy for strictly positive continuous probability density functions.

1. the set .

2. For any , we denote by .

3. Let with . The cross entropy of against the reference is defined by

4. For any such that , we denote by the entropy of .

5. The relative entropy (or Kullback-Leibler divergence) of against to is defined as .

We formulate the principle of minimum cross entropy for as follows.

###### Proposition 3.1.

Let with . Then

 argminp∈PDF+, logp∈L1(q)D[q∥p]=argminp∈PDF+, logp∈L1(q)H[q,p]={q},

where for a -valued function on a set which has a minimum point.

###### Proof.

Let with . Since , it holds that and the equality holds if and only if , which proves the assertion. ∎

#### Matrices

Let be or , and .

1. Let us denote by the set of rectangular matrices over .

2. .

3. for , and .

#### Moments and Spectral Distributions

For any , we denote the -th moment of for by . For any random variable whose low is , we define its moment by . All moments of a probability measure (resp. a random variable ) are well-defined if (resp. ).

For any self-adjoint matrix , we denote by set of its all eigenvalues by . The spectral distribution of , denoted by , is defined as the discrete measure

 μA=1d∑λ∈spAδλ.

We define the moment of as . Note that .

## 4. Random Matrix Models

In this section, we introduce random matrix models and our main idea.

### 4.1. Gaussian Random Matrix models

###### Definition 4.1.

A real normalized Gaussian random matrix (GRM over for short) of is the -matrix whose entries are independent and identically distributed with . A complex normalized Gaussian random matrix (GRM over for short) of is the -matrix whose entries are given by , where the family is independent and each element is distributed with .

###### Definition 4.2.

Let be a self-adjoint polynomial (that is, it is stable under replacing by ) in non-commutative dummy variables and their adjoint . Let be a family of pairs of natural numbers and be corresponding evaluation

 PI:m∏k=1Mrk,ℓk(L∞)×n∏k=1Mpk,dk(C)→Mp,d(L∞)

of , where products and sums satisfy dimension compatibility.

Then the real (resp. complex) Gaussian random matrix model (GRM model, for short) of type on a subset is the restriction of the following map to ;

 ¯PI:n∏k=1Mpk,dk(C)→Mp,d(L∞), ¯PI(D1,…,Dn):=PI(Z1,…,Zm,D1,…,Dn),

where the family is a collection of independent GRM over (resp. ).

We introduce examples of GRM models which are in the scope of our numerical experiments.

###### Definition 4.3.
1. A compound Wishart model (CW model for short) of type on , denoted by , is the GRM model of type on the subset , where . Note that

 WCW(A)=¯PJ(A)=Z∗AZ.
2. An information-plus-noise model (IPN model for short) of type on a subset , denoted by , is the GRM model of type on , where is a polynomial of dummy variables given by Note that

 WIPN(A,σ)=¯PJ(A,σ)=(A+σZ)∗(A+σZ),
###### Definition 4.4.

For , the empirical spectral distribution is defined by the following random measure;

 ESD(W):=μW=1d∑λ∈spWδλ.
###### Remark 4.5.

Let be a singular value decomposition of , where is a unitary matrix, is a unitary matrix, and is a rectangular diagonal matrix. Then . Hence the empirical distribution of is equal to that of , since the joint distribution of the entries of and that of is same. Similarly, it holds that , where is a diagonalization of . Hence in the parameter estimation of CW or IPN models from each empirical spectral distribution, we cannot know such unitary matrices. Therefore, we consider the following restricted domains of parameters.

###### Notation 4.6.

Let with . We denote by the diagonal embedding defined by if , otherwise, . We write and .

###### Definition 4.7.

Let with and . Let us define the domains

 ΘCW(p,M) :=ιp({v∈Rp∣\normv∞≤M}), (4.1) ΘIPN(p,d,M) :=(ιdp,d×id)({(a,σ)∈Rd×R∣\norma∞≤M,\absσ≤M}). (4.2)

### 4.2. Cauchy Transform and Cauchy Distribution

The Cauchy distribution plays a central role in this paper.

###### Definition 4.8.
1. The Cauchy distribution with scale parameter is the probability measure over whose density function is given by the following Poisson kernel;

 Pγ(x):=1πγx2+γ2, x∈R.

We call a random variable on a Cauchy random variable of scale , denote by , if its probability law is equals to .

2. The Cauchy transform of is the holomorphic function on defined as

 Gμ(z):=∫1z−tμ(dt).

For any Cauchy random variable , the random variable has every absolute moments;

 E[\abslogPγ(T)k]=∫\abslogPγ(t)kPγ(t)dt<∞. (4.3)

This follows from the facts and , for any . In particular, the entropy of each Cauchy random variable is well-defined and given by .

One of the relationships between the Cauchy transform and the Cauchy distribution is given by the convolution with the measure;

 −1πImGμ(x+iγ)=1−2πi[Gμ(x+iγ)−Gμ(x−iγ)]=Pγ∗μ(x), (4.4)

where is the convolution of and . Note that .

###### Definition 4.9.

Let . For , we call the strictly positive valued function the -slice of . If there is no confusion, we also call it the -slice of the Cauchy transform.

For fixed , the -slices have enough information to distinguish original measures as follows.

###### Lemma 4.10.

Fix . Let . Then if and only if .

###### Proof.

For any probability measure , let us denote its Fourier transform by . Similarly, we define the Fourier transform of any probability density function. Fix . Assume that . Since the Fourier transform of the convolution of two measures is equal to the product of their individual Fourier transforms, we have . Because , we have . Since the Fourier transformation is injective, we have . ∎

See [10] for basic results of the Fourier transform.

### 4.3. Main Idea and Example

In our approach, to evaluate how is close to a reference probability measure , we estimate the cross entropy of the -slice of against the -slice of , instead of that of against . We call the former cross entropy the Cauchy cross entropy (see Definition 5.8). The first reason why we use the -slices is that the Cauchy transforms of spectral distributions of random matrix models become numerically computable if we replace the random matrix model by its free deterministic equivalent (FDE for short, see Definition 5.4). Note that the convergence of the iterative methods are rigorously proven (see Section 5.4). In addition, the gradient of each -slice is also numerically computable (see Section 5.5). The second reason is that the Cauchy cross entropy is always well-defined (see Section 5.3.1). Recall that the usual cross entropy possibly ill-defined if the measures have singular parts or compact supports. The last reason is that -slice has enough information to distinguish original measures (see Lemma 4.10).

In our statistical estimation method, we evaluate the Cauchy cross entropy for a fixed of against the empirical distribution instead of that against the -slice of the deterministic reference probability measure . However, it requires the high time complexity to compute the Cauchy cross entropy against the empirical distribution by the convolution with Poisson kernel. To overcome this difficulty, we generate i.i.d Cauchy random variables (Cauchy noises) of scale and further approximate the empirical cross entropy by a random variable, which is called the Cauchy noise loss.

The left upper sub-figure of Figure 1 shows a single-shot ESD perturbed by Cauchy noises and the -slice of the Cauchy transform of FDE compound Wishart model (FDECW model, for short, see Definition 5.5). More precisely, the histogram indicates a perturbed single-shot empirical eigenvalues , where are eigenvalues of a single-shot observation of the compound Wishart random matrix and is an independent observation of Cauchy noises of scale . We set , and . The red dashed line indicates the -slice of the Cauchy transform of FDE of the random matrix model. We observed that the perturbed single-shot ESD well approximated the -slice of FDE model (see Section 5.3 for the theoretical reason). Therefore, we can expect that perturbed single-shot ESD contains enough information to estimate the parameters of the random matrix model.

The last idea of our method is using a stochastic gradient decent method (SGD, for short, see [4]) to avoid evaluating many time the cross entropy against the total perturbed single-shot ESD. Roughly speaking, we iteratively update parameters of FDE model based on the cross entropy against a subtotal perturbed single-shot ESD randomly picked from the total one at each iteration. Note that each subtotal one contains few elements ( to , in our experiments). See Algorithm 1 for the detail.

Each blue line in Figure 1 indicates the -slice of FDECW model with another parameter optimized by Algortihm 1. We observed that the blue line approached the true -slice, that is, the -slice of FDECW model with the true parameter, as the number of iterations increased.

## 5. Theory and Algorithm

### 5.1. Free Deterministic Equivalents

In this section, we introduce a version of free deterministic equivalents by Speicher-Vargas[24]. To implement our algorithm, infinite dimensional operators are not needed; our method works only using finite dimensional ones. However, infinite dimensional operators have theoretically critical roles.

First we summarize some definitions from operator algebras and free probability theory. See [16] for the detail.

###### Definition 5.1.
1. A C-probability space is a pair satisfying followings.

1. The set is a unital -algebra, that is, a possibly noncommutative subalgebra of an algebra of bounded -linear operators on a Hilbert space over satisfying the following conditions:

1. it is stable under the adjoint ,

2. it is closed under the topology of the operator norm of ,

3. it contains the identity operator as the unit of .

2. The function on is a faithful tracial state, that is a -valued linear functional with

1. for any , and the equality holds if and only if ,

2. ,

3. for any .

2. A possibly noncommutative subalgebra of a C-algebra is called a -subalgebra if is stable under the adjoint operator . Moreover, it is called a unital C-subalgebra if the -subalgebra is closed under the operator norm topology and contains as its unit.

3. Two unital -algebras are called -isomorphic if there is a bijective linear map between them which preserves the -operation and the multiplication.

4. Let us denote by the self-adjoint elements, that is, of .

5. , for any .

6. The distribution of is the probability measure determined by

 ∫xkμa(dx)=τ(ak), k∈N.
7. For , we define its Cauchy transform by . Note that .

###### Definition 5.2.

A family of -subalgebras of is said to be free if the following factorization rule holds: for any and indexes with , and with , it holds that

 τ(a1⋯al)=0.

A family of self-adjoint elements is called free if the family of the -subalgebras of polynomials of is free.

We introduce special elements in a noncommutative probability space.

###### Definition 5.3.

Let be a C-probability space.

1. An element is called standard semicircular if its distribution is given by the semicircular law

 μs(dx)=√4−x22π1[−2,2](x)dx,

where is the indicator function for any subset .

2. Let . An element is called circular of variance if

 c=√vs1+is2√2,

where is a pair of free standard semicircular elements.

3. A -free circular family (resp. standard -free circular family) is a family of circular elements such that is free (resp. and each elements is of variance ).

A free deterministic equivalent (FDE, for short) of a polynomial of GRM and deterministic matrices is constructed by replacing GRM by matrices of circular elements. Equivalently, FDE is obtained by taking the limit of the amplified models which is constructed by (1) copying deterministic matrices by taking tensor product with identity and (2) GRM are enlarged by simply increasing the number of entries. See [24] and [25, pp.19] for the detail. We reformulate FDE for random matrix models.

###### Definition 5.4 (Free deterministic equivalents).

Fix a C-probability space . Let be a Gaussian random matrix model over or of type on a subset . Then its free deterministic equivalent (FDE for short) is the map defined as

 P□J(D1,…,Dn):=P(C1,…,Cm,D1,…,Dn),

where are rectangular matrices such that the collection of all rescaled entries

 {√ℓkCk(i,j)∣i=1,…,rk,j=1,…,ℓk,k=1,…,m}

is a standard -free circular family in .

In addition, if and FDE is self-adjoint for any elements in , the Cauchy transform of FDE is called the deterministic equivalent.

Note that each coefficient is multiplied for the compatibility with the normalization of Gaussian random matrices. In addition, we note that the FDE model does not depend on the filed or .

The original definition of FDE [24][25] treats not only Ginibre matrices but also more general random matrices.

###### Definition 5.5.

Let .

1. The free deterministic equivalent information-plus-noise model (FDEIPN model, for short) of type is defined as the FDE of . Note that

 W□IPN(A,σ)=(A+σC)∗(A+σC),

where is a standard -free circular family.

2. The free deterministic equivalent compound Wishart model (FDECW model, for short) of type is defined as the FDE of . Note that

 W□CW(A)=C∗AC,

where is a standard -free circular family.

###### Remark 5.6.

By Remark 4.5, we consider the optimization of restricted models on or for some . The boundedness of each parameter space is required for the uniform convergence of our loss function (see Corollary 5.19).

We reformulate the convergence of the gap between FDE and original random matrix model in the case parameters belong to bounded sets.

###### Proposition 5.7.
1. Let us consider following sequences;

1. such that with and ,

2. such that and .

Then for any , it holds that -almost surely,

2. Let us consider following sequences;

1. such that with and ,

2. such that and ,

3. such that and .

Then for any , it holds that -almost surely,

###### Proof.

Firstly let us consider the case of CW model. By the genus expansion of Gaussian random matrices (see [16] for both real and complex case) and the uniform boundedness of the sequences, there are constants such that for any ,

By the expansion of the fluctuation of Gassian random matrices (see [8] for the complex case, [21] for the real case) and the uniform boundedness of the sequences, there are constants such that for any ,

As the consequence, we have for any ,

Since , the almost-sure convergence holds.

The proof for IPN model is given by the same argument. ∎

### 5.2. Cauchy Noise Loss

###### Definition 5.8.

Let . Assume that . Then the Cauchy cross entropy of against with the scale parameter , denoted by , is defined as

 Hγ(ν,μ):=H(Pγ∗ν,Pγ∗μ).

We postpone the proof of the well-definedness of Cauchy cross entropy and the fact to Section 5.3.1.

We have the principal of minimum Cauchy cross entropy as follows.

###### Proposition 5.9.

Fix . For any , it holds that

 argminμ∈Bc(R)Hγ(ν,μ)={ν}.
###### Proof.

If attains the minimum, then by Proposition 3.1, we have . By Lemma 4.10, the assertion holds. ∎

By the Proposition 5.9, the Cauchy cross entropy has enough information to distinguish probability measures with compact supports. This is one of the reasons why we define the loss function based on it. That is, we are interested in solving the following minimizing problem :

 argminϑ∈ΘHγ(μ□ϑ0,μ□ϑ), (5.4)

by only using a single-shot observation of the empirical spectral distribution. Since is unknown, we are not able to evaluate the cross entropy directly. Instead, we minimize the following empirical risk.

###### Definition 5.10.

Let be the pair of a self-adjoint random matrix model and its FDE. Fix . The empirical FDE risk is defined as the real random variable

 Hγ(ESD(Wϑ0),μW□ϑ).

Note that

 Hγ(ESD(Wϑ0),μW□ϑ)=∫∫−log[−1πImGW□ϑ(x−t+iγ)]ESD(Wϑ0)(dx)Pγ(t)dt.

Recall that . Since we compute by iterative methods, it requires high time complexity if we compute the empirical risk by the integration with the Poisson kernel. To reduce the time complexity, we approximate the empirical risk by a further random risk by using Cauchy random variables.

###### Definition 5.11.

Let and be real random variables. Let and . Let be a family of i.i.d. Cauchy random variables of scale which is independent of . We define the Cauchy noise loss of the scale parameter and the noise dimension as the map on given by

 Lλγ,m(μ):=1dnd∑j=1m∑k=1ℓμγ(λj−Tj,k),

where

 ℓμγ(x):=−log[−1πImGμ(x+iγ)]=−logPγ∗μ(x).
###### Definition 5.12.

Let be the pair of a random matrix model and its FDE. Fix and . Let be the empirical eigenvalues of for a . We define the Cauchy noise loss of FDE model, denoted by , as the map on given by

 Lγ,m(ϑ) :=Lλγ,m(μW□ϑ). (5.5)

We write .

Note that

#### 5.2.1. Optimization Algorithm

Algorithm 1 shows the optimization algorithm based on the Cauchy noise loss.

It requires the settings of the maximum number of epochs (that is, the iterations of the outer loop), the learning rate policy, settings of the Cauchy noise loss, the initial parameter of the FDE model, and the bounded convex parameter space .

In our method, the sample is assumed to be a single-shot observation of a self-adjoint square random matrix model. Through the algorithm, the scale is fixed. We consider the collection of the eigenvalues of the sample matrix. Each epoch of the algorithm consists of the following steps. First, we compute the learning rate based on a learning rate policy (See Section 5.6.1), and permute randomly the indexes of eigenvalues. Second, for , we update parameters as follows: (1) generate independently samples of Cauchy random variables of scale , (2) compute by Proposition 5.23, Proposition 5.32, Theorem 5.36, Corollary 5.38 (their proofs are postponed to a later section) and (5.6), (3) update parameters by using the gradient of the loss function, and (4) project it onto the parameter space . We continue the iteration while the number of epochs is smaller than .

###### Remark 5.13.

Assume that the input matrix is an observation of random matrix , where is fixed and is a true parameter. Then we can expect that the algorithm returns a parameter which is close to a local minimum point of the possibly non-convex function

 ϑ↦Hγ(ESD(Wϑ0)(ω),μW□ϑ),

since the perturbed samples are generated independently from the distribution at each epoch (recall that is fixed). The theoretical proofs of the convergence and its stability are out of the scope of this paper, and postponed to future work. See Bottou[4], which summarizes the proof of the convergence of stochastic gradients descent methods under several assumptions of possibly non-convex loss functions.

If the convergence holds, by the analysis of the FDE-generalization gap (see Section 5.3), we can expect that the algorithm returns a close value to a local minimum of the function . Hence by Proposition 5.9, if the local minimum is a global one, we can expect that is close to . Its theoretical proof is also postponed to future work.

### 5.3. Analysis of FDE-generalization Gap

In this section we prove the FDE generalization gap, which is defined as the real random variable

 Hγ(ESD(Wϑ0),μW□ϑ)−Hγ(μW□ϑ0,μW□ϑ), (5.7)

uniformly converges to as the matrix size becomes large.

#### 5.3.1. Basic Properties

We begin by proving the Cauchy cross entropy is well-defined.

###### Lemma 5.14.

Fix . Then the followings hold.

1. For any , it holds that

2. for any , where .

3. for any and with . Hence is well-defined.

###### Proof.

By the definition of Poisson kernel, . It holds that