finite_size_corrections_wigner

## Abstract

In this paper we study principal components analysis in the regime of high dimensionality and high noise. Our model of the problem is a rank-one deformation of a Wigner matrix where the signal-to-noise ratio (SNR) is of constant order, and we are interested in the fundamental limits of detection of the spike. Our main goal is to gain a fine understanding of the asymptotics for the log-likelihood ratio process, also known as the free energy, as a function of the SNR. Our main results are twofold. We first prove that the free energy has a finite-size correction to its limit—the replica-symmetric formula—which we explicitly compute. This provides a formula for the Kullback-Leibler divergence between the planted and null models. Second, we prove that below the reconstruction threshold, where it becomes impossible to reconstruct the spike, the log-likelihood ratio has fluctuations of constant order and converges in distribution to a Gaussian under both the planted and (under restrictions) the null model. As a consequence, we provide a general proof of contiguity between these two distributions that holds up to the reconstruction threshold, and is valid for an arbitrary separable prior on the spike. Formulae for the total variation distance, and the Type-I and Type-II errors of the optimal test are also given. Our proofs are based on Gaussian interpolation methods and a rigorous incarnation of the cavity method, as devised by Guerra and Talagrand in their study of the Sherrington-Kirkpatrick spin-glass model.

## 1Introduction

Spiked models, which are distributions over matrices of the form “signal + noise,” have been a mainstay in the statistical literature since their introduction by [25] as models for the study of high-dimensional principal component analysis that are tractable yet realistic. Spectral properties of these models have been extensively studied, in particular in random matrix theory, where they are known as deformed ensembles [40]. Landmark investigations in this area [5] have established the existence of a spectral threshold above which the top eigenvalue detaches from the bulk of eigenvalues and becomes informative about the spike, and below which the top eigenvalue bears no information. Estimation using the top eigenvector undergoes the same transition, where it is known to “lose track” of the spike below the spectral threshold [38]. Although these spectral analyses have provided many insights, as have analyses based on more thoroughgoing usage of spectral data and/or more advanced optimization-based procedures [4], they stop short of characterizing the fundamental limits of estimating the spike, or detecting its presence from the observation of a sample matrix. These questions, information-theoretic and statistical in nature, are more naturally approached by looking at objects such as the posterior law of spike and the associated likelihood ratio process.

The main approach to date to the challenging problem of controlling the likelihood ratio is via the second moment method. Controlling the second moment enables one to show contiguity, in the sense of [29], between the planted and null models and thus declare impossibility of strong detection—i.e., the impossibility of vanishing Type-I and Type-II errors of any given test—in the region where this second moment is bounded [9]. This method is known, however, to require careful conditioning and truncation due to the existence of rare but catastrophic events under which the likelihood ratio becomes exponentially large. These events thus dominate the second moment, although they are virtually irrelevant to the detection task. Moreover, even after conditioning the method may fail in identifying the detection thresholds, depending on the structure of the spike. Furthermore, contiguity has little or no bearing on the problem of weak detection: When errors are inevitable, what is the smallest error achievable by any test?

Motivated by a desire to overcome these limitations, we consider a particularly simple spiked model—the rank-one spiked Wigner model—and provide an alternative approach to the detection problem that obviates the use of the second moment method altogether. This is achieved by obtaining asymptotic distributional results for the log-likelihood ratio process, then appealing to standard results from the theory of statistical experiments. We are thereby able to provide solutions to both the strong and weak variants of the detection problem. To study the likelihood ratio in this setting we build on the technology developed by Aizenman, Guerra, Panchenko, Talagrand, and many others, in their study of the Sherrington-Kirkpatrick (SK) spin glass-model. Specifically, we make use of Gaussian interpolation methods and Talagrand’s cavity method.

### 1.1Setup and summary of the results

In the spiked Wigner model, one observes a rank-one deformation of a Wigner matrix :

where and are independent for all . The spike vector represents the signal to be recovered, or its presence detected. We assume that the entries of the spike are drawn i.i.d. from a prior distribution on having bounded support. The parameter plays the role of the signal-to-noise ratio, and the scaling by is such that the signal and noise components of the observed data are of comparable magnitudes. This places the problem in a high-noise regime where consistency is not possible but partial recovery still is. As a matter of convenience, we discard the diagonal terms from the observations. (Adding the diagonal back does not pose any additional technical difficulties, and our results can be straightforwardly extended to this case.) We endow the real line with the Borel -algebra and define on it. We denote by the joint probability law of the observations as per and define the likelihood ratio

A simple computation based on conditioning on reveals that

We define the free energy (density) associated with the model to be

We see that , where is the Kullback-Leibler divergence between probability measures. (The free energy is usually defined differently in the literature as the log-normalizing constant in the posterior of given . The two definitions are strictly equivalent.) It was initially argued via heuristic replica and cavity computations [31] that converges to a limit , which is referred to as the replica-symmetric formula. This formula, variational in nature, encodes in principle a full characterization of the limits of estimating the spike with non-trivial accuracy. Indeed, various formulae for other information-theoretic quantities can be deduced from it, including the mutual information between and , the minimal mean squared error of estimating based on , and the overlap of a draw from the posterior with the spike . Most of these claims have subsequently been proved rigorously in a series of papers [17] under various assumptions on the prior. However, these results stop short of providing explicit characterizations of thresholds for the detection problem.

The main goal of this paper is to gain a more refined understanding of the asymptotic behavior of the log-likelihood ratio , and its mean , under as becomes large. We first determine the finite-size correction of to its limit : we prove (under conditions on ) that converges to a limit with rate . Besides providing an explicit rate of convergence of to its limit, this result translates into a formula for the Kullback-Leibler divergence , which is particularly interesting below the reconstruction threshold: we will see that in this regime , so ceases to be extensive in the size of the system and converges to a finite value .

Second, we prove that in this same regime, the log-likelihood ratio has fluctuations of constant order under , and converges asymptotically to a Gaussian with a mean equal to half the variance. This allows us to provide an alternative proof of contiguity between and , valid in the entire regime where contiguity can possibly hold, as well as a formula for the Type-II error for testing between these two distributions.

Under the null distribution on the other hand, the model is equivalent to the widely studied Sherrington-Kirkpatrick model (provided that ). In one of the first rigorous results on this model, [1] proved that in the high-temperature regime and in the absence of an external field, the fluctuations of the log-partition function of the model about its mean, which is given by the “annealed” computation, are asymptotically Gaussian with explicit mean and variance. By mapping their result into our setting, we obtain the fluctuations of under in the non-reconstruction phase, as well as a formula for the Type-I error. Although we only obtain this last formula for the Rademacher prior, we conjecture its validity for arbitrary priors. An interesting symmetry emerges from these results: the limiting Gaussians under and have means of equal magnitude and opposite signs, and equal variances. This symmetry causes the Type-I and Type-II errors to be equal. Adding up the two latter quantities, we obtain a formula for the total variation distance .

Our results are in the spirit of those of [36], who studied the likelihood ratio of the joint eigenvalue densities under the spiked covariance model with a sphericity prior, and showed its asymptotic normality below the spectral threshold. Their results thus pertain to eigenvalue-based tests while ours are for arbitrary tests, albeit for a simpler model. Our results show in particular that it is still possible to distinguish the planted model from the null model with non-vanishing probability below the reconstruction threshold; i.e., even when estimation of the spike becomes impossible. Performing such a test in practice of course hinges on the computational problem of efficiently computing the likelihood ratio; we leave open this question of constructing computationally efficient tests in the non-reconstruction phase.1

### 1.2Background

The formula. For , consider the function

where , and . This is the divergence between the distributions of the random variables and . We define the Replica-Symmetric () potential

and finally define the formula

A central result in this context is that free energy converges to the formula for all [31]:

The values of that maximize the potential and their properties play an important role in the theory. [30] proved that the map has a unique maximizer for all where is the set of points where the function is differentiable. By convexity of (see next section), . Moreover, they showed that the map is non-decreasing, and

One should interpret the value as the best overlap an estimator based on observing can have with the spike . Indeed, the overlap between the spike and a random draw from the posterior should concentrate in the large limit about (hence the name “replica-symmetry”). A matrix variant of this result (where one estimates ) was proved in [30]. In Section 3, we prove strong (vector) versions of this result where under mild assumptions, optimal rates of convergence are given.

The reconstruction threshold. The first limit in shows that when the prior is not centered, it is always possible to have a non-trivial overlap with for any . On the other hand, when the prior has zero mean, and since is a non-decreasing function of , it is useful to define the critical value of below which estimating becomes impossible:

We refer to as the critical or reconstruction threshold. The next lemma establishes a natural bound on .

Indeed, assume that is centered, and let . Since and , we see that and . So cannot be a maximizer of . Therefore and .

The importance of Lemma ? stems from the fact that the value is the spectral threshold previously discussed. Above this value, the first eigenvalue of the matrix leaves the bulk, and is at the edge of the bulk below it [39]. This value also marks the limit below which the first eigenvector of captures no information about the spike [11]. Inequality can be strict or turn into equality depending on the prior . For instance, there is equality if the prior is Gaussian or Rademacher—so that the first eigenvector overlaps with the spike as soon as estimation becomes possible at all—and strict inequality in the case of the (sufficiently) sparse Rademacher prior . More precisely, there exists a value

such that for , and for . In the latter case, the spectral approach to estimating fails for , and it is believed that no polynomial time algorithm succeeds in this region [31].

## 2Main results

### 2.1Finite size corrections to the RS formula

The results we are about to present hold in a possibly slightly smaller set than . While uniqueness of only needs first differentiability of the formula, our results need a second derivative to exist. In physics parlance, our results do not hold at values of at which a particular kind of first-order phase transition occurs, namely, one in which the order parameter is not differentiable. The presence of these transitions depends again on the prior . For the Gaussian and Rademacher prior, there are no such transitions, while for the sparse Rademacher prior discussed above, there is one first-order transition where is not defined for every . Thus we define the set

Since is the point-wise limit of a sequence of convex functions, it is also convex. Then by Alexandrov’s theorem [3], the set is of full Lebesgue measure in (cf. .) Moreover, we can see that , since if , we have , therefore . By continuity, vanishes on the entire interval . Our first main result is to establish the existence of a function defined on such that either below or above it when the prior is not symmetric about the origin, we have

An explicit formula for will be given. But first we need to introduce some notation. Let and consider the quantities

where

with and the expectation operator is w.r.t. and . The Gibbs measure can be interpreted as the posterior distribution of given the observation . (More on this point of view in Section 3.) Now let

and finally define

We will prove (Lemma ?) that for all so that this function is well defined on .

The theorem asserts that either below the reconstruction threshold, or above it when the prior is not symmetric, the free energy has a finite-size correction of order to its limit and a subsequent term of order in the expansion. In the case with symmetric prior, the problem is invariant under a sign flip of the spike, so the overlap has a symmetric distribution, and hence concentrates equiprobably about two distinct values . Our techniques do not survive this symmetry, and resolving this case seems to require a new approach.

We see that is an extensive quantity in whenever , or equivalently, . On the other hand, this is of constant order below :

Centered prior. Let us consider the case where the prior has zero mean, and unit variance (the latter can be assumed without loss of generality by rescaling ), so that Lemma ? reads . If , we have , , and one can check that in this case

Therefore, expression simplifies to

By the above calculation, we have a formula for the divergence between and below the reconstruction threshold (see plot in Figure ?):

More information on . Expression looks mysterious at first sight. Let us briefly explain its origin. A slightly less processed expression for is the following

after which follows by simple integration. The integrand in the above expression is obtained, as we will show, as the first entry of the solution of the linear system

where and is the “cavity” matrix

The above matrix happens to have two eigenvalues which are exactly and . The matrix and the above linear system will emerge naturally as a result of the cavity method. On the other hand, the integral over the time parameter is along an interpolation path invented by [21], [24] in the context of the Sherrington–Kirkpatrick model, and the integrand can be interpreted as the asymptotic variance in a central limit theorem satisfied by the overlap between two “replicas” under the law induced by a certain interpolating Gibbs measure. A definition of these notions with the corresponding results can be found in Sections Section 3 and Section 4. The full execution of the cavity method is relegated to Section 5.

### 2.2Fluctuations below the reconstruction threshold

Corollary ? asserts that below the reconstruction threshold, the expectation of the log-likelihood ratio under is of constant order (in ) and is asymptotically equal to . In this section we are interested in the fluctuations of this quantity about its expectation. It can be seen by a standard concentration-of-measure argument that for all , concentrates about its expectation with fluctuations bounded by . While this bound is likely to be of the right order above [23], it is very pessimistic below . Indeed, we will show that the fluctuations are of constant order with a Gaussian limiting law in this regime. This phenomenon was noticed early on in the case of the SK model: [1] showed that in the absence of an external field, the log-partition function of this model has (shifted) Gaussian fluctuations about its easily computed “annealed average” in high temperature. We will directly deduce from their result a central limit theorem for under in the case where the prior is Rademacher. Furthermore, a proof by [44] of their result provided us with a road map for proving a similar result under . We now present our second main result along with consequences for hypothesis testing. For , let

The symbol denotes convergence in distribution as . The formal connection to the SK model and the proof of the above theorem are presented in Section 6.

A remarkable feature is the symmetry between the above two statements. Roughly speaking, this symmetry takes its roots in the fact that the model under the alternative distribution “lives on the Nishimori line”: under , which is the spiked model, the interaction of the spike with a replica creates terms that account for twice the contribution of the interaction between two independent replicas and , and thus flips the sign of the mean from to . This mechanism will become apparent from the proof. Moreover, the fact that the mean is half the variance in these limiting Gaussians has interesting consequences for hypothesis testing. The next subsection is devoted to this problem.

We believe that statement (ii) is still valid up to for a general prior (of zero mean and unit variance). It is possible to prove the convergence up to some value with essentially the same approach as ours, but reaching the optimal threshold seems to require more technical work. In particular, our interpolation bound (see our “main estimate”, Section 4.2) has to be significantly improved to deal with this case. Progress will be reported in a future work.

We also point out that similar fluctuation results were recently proved by [6] for a spherical model where one integrates over the uniform measure on the sphere in the definition of . Their model, due to its integrable nature, is amenable to analysis using tools from random matrix theory. The authors are thus able to also analyze a “low temperature” regime (absent in our problem) where the fluctuations are no longer Gaussian but given by the Tracy-Widom distribution. Their techniques seem to be tied to the spherical case however.

#### Strong and weak detection below λc.

Consider the problem of deciding whether an array of observations is likely to have been generated from for a fixed or from . Let us denote by the null hypothesis and the alternative hypothesis. Two formulations of this problem exist: one would like to construct a sequence of measurable tests that returns “0” for and “1” for , for which either

or less stringently, the total mis-classification error, or risk

is minimized among all possible tests . The question of existence of a test that answers to the requirement is referred to as the strong detection problem, and the question of minimizing the criterion is referred to as the weak detection, or simply hypothesis testing problem.

Strong detection. Using a second moment argument based on the computation of a truncated version of , [9] and [41] showed that and are mutually contiguous when , where the latter quantity equals for some priors while it is suboptimal for others (e.g., the sparse Rademacher case, see discussion below). It is easy to see that contiguity implies impossibility of strong detection since for instance, if then . Here we show that Theorem ? provides a more powerful approach to contiguity:

A consequence of statement (i) in Theorem ? is that if

under along some subsequence and for some random variable , then by the continuous mapping theorem we necessarily have

We have , and since , we have . We now conclude using Le Cam’s first lemma in both directions [45].

This approach allows one to circumvent second moment computations which are not guaranteed to be tight in general, and necessitate careful and prior-specific conditioning that truncates away undesirable events.

We note that in the case of the sparse Rademacher prior , contiguity holds for all as soon as by the above corollary, thus closing the gaps in the results of [9] and [41]. Indeed, as argued below Lemma ?, the reconstruction and spectral thresholds are equal () for all , and differ () below . This implies that strong detection is impossible for and possible otherwise when , while it becomes impossible only below but possible otherwise when .

Weak detection. We have seen that strong detection is possible if and only if . It is then natural to ask whether weak detection is possible below , i.e., is it possible to test with accuracy better than that of a random guess below the reconstruction threshold? The answer is yes, and this is another consequence of Theorem ?. More precisely, the optimal test minimizing the risk is the likelihood ratio test which rejects the null hypothesis (i.e., returns “1”) if , and its error is

One can readily deduce from Theorem ? the Type-I and Type-II errors of the likelihood ratio test: for all the Type-II error is

and in the case of the Rademacher prior, the Type-I error is

for all . Here, is the complementary error function. These can be combined into a formula for and the total variation distance between and (see plot in Figure ?):

We similarly conjecture that the formula for Type-I error, hence formula , should be correct up to for all (bounded) priors with zero mean and unit variance.

## 3Overlap convergence: optimal rates

A crucial component of proving our main results is understanding the rate of convergence of the overlap , where is drawn from , to its limit . By Bayes’ rule, we see that

where is the Hamiltonian

From the formulas and , it is straightforward to see that

This provides another way of interpreting as the expected log-partition function (or normalizing constant) of the posterior . For an integer and , we define the Gibbs average of w.r.t. as

This is nothing else that the average of with respect to . The variables are called replicas, and are interpreted as random variables independently drawn from the posterior. When we simply write instead of . Throughout the rest of the manuscript, we use the following notation: for , we let

In this section we show the convergence of the first 4 moments of the overlap at optimal rates under some conditions: if either the prior is not symmetric about the origin or the Hamiltonian is “perturbed” in the following sense. Let and consider the“ interpolating” Hamiltonian (this qualification will become clear in the next section)

where the ’s are i.i.d. standard Gaussian r.v.’s independent of everything else, and . We similarly define the Gibbs average as in where is replaced by . We now state a fundamental property satisfied by both and .

The Nishimori property. The fact that the Gibbs measure is a posterior distribution has far-reaching consequences. A crucial implication is that the -tuples and have the same law under . This fact, which is a simple consequence of Bayes’ rule [30] prevents replica-symmetry from breaking [27]. In particular, and have the same distribution. This bares the name of the Nishimori property in the spin glass literature [35]. Moreover, this property is preserved under the interpolating Gibbs measure for all . Indeed, the interpolation is constructed in such a way that is the posterior distribution of the signal given the augmented set of observations

where one receives side information about from a scalar Gaussian channel, , and the signal-to-noise ratios of the two channels are altered in a time dependent way. Now we state our concentration result.

If is symmetric about the origin then the distribution of under is also symmetric, so . If moreover (i.e., ) then becomes trivial at since both sides are constant. On the other hand, if either or is asymmetric, the sign symmetry of the spike is broken. This forces the overlap to be positive and hence concentrate about . Finally, if , and the sign symmetry becomes irrelevant since the overlap converges to zero regardless. Let us mention that in the symmetric unperturbed case (), we expect a variant of to hold where is replaced by its absolute value in the statement, and the upper bound would be . Unfortunately, our methods do not allow us to prove such a statement, but we are able to prove a weaker result (see Lemma ?): for all ,

Although this a minor technical point, we also point out that the estimate in the statement is suboptimal. A heuristic argument allows us to get as , but we are currently unable to rigorously justify it.

Mmse. The bound can be used to deduce the optimal error of estimating based on the observations . The posterior mean is the estimator with Minimal Mean Squared Error (MMSE) among all estimators , and the MMSE is

The last line follows from the Nishimori property, since . Theorem ? implies in particular (under the conditions of its validity) that , yielding the value of the MMSE. It is in particular possible to estimate the spike from the observations with non-trivial accuracy if and only if . Note that at (no side information) the result still holds below or when the prior is not symmetric. Otherwise, as mentioned before, the problem is invariant under a sign flip of so one has to change the measure of performance. Beside the result , we are unable to say much in this situation.

Asymptotic variance. By Jensen’s inequality we deduce from the convergence of the second moment:

To establish our finite-size correction result (Theorem ?) we need to prove a result stronger than , namely that converges to a limit. For and , we let

where and are defined in .

The proofs of Theorems ? and ? rely on the cavity method, and will be presented in Section 5. Finally, the techniques we use could be easily extended to prove convergence of all the moments at optimal rates: for all integers ,

but we will not need this stronger statement.

## 4The interpolation method

In this section we present the interpolation method of [21]. All our main arguments will rely, in one way or another, on this method. Along the way, we prove Theorem ?. The idea is to construct a continuous interpolation path between the Hamiltonian and a simpler Hamiltonian that decouples all the variables, and analyze the incremental change in the free energy along the path. We present two versions of this method. The first one is the classical method which is applied to the free energy of the entire system, and a second one applied to the free energy of a more restricted system.

### 4.1The Guerra interpolation

Our interpolating Hamiltonian is from with for some . Now we consider the interpolating free energy

We see that and . This function is moreover differentiable in , and by differentiation, we have

Now we use Gaussian integration by parts to eliminate the variables and . The details of this computation are explained extensively in many sources. See [43]. We get

Completing the squares yields

The first line in the above expression involves overlaps between two independent replicas, while the second one involves overlaps between one replica and the planted solution. Using the Nishimori property, the derivative of can be written as

The last term follows by symmetry between sites. Now, integrating over , the difference between the free energy and the potential can be written in the form of a sum rule:

We see from that converges to if and only if the overlap concentrates about . This happens only for a value of that maximizes the potential