# On the Noise-Information Separation of a Private Principal Component Analysis Scheme

## Abstract

In a survey disclosure model, we consider an additive noise privacy mechanism and study the trade-off between privacy guarantees and statistical utility. Privacy is approached from two different but complementary viewpoints: information and estimation theoretic. Motivated by the performance of principal component analysis, statistical utility is measured via the spectral gap of a certain covariance matrix. This formulation and its motivation rely on classical results from random matrix theory. We prove some properties of this statistical utility function and discuss a simple numerical method to evaluate it.

## 1 Introduction

In recent decades, privacy breaches have made clear the necessity of privacy mechanisms with provable guarantees. In this context, additive noise mechanisms are a popular choice among practitioners given their ease of implementation and mathematical tractability [1]. To understand the trade-off between the privacy guarantees of this type of mechanism and its statistical cost, it is necessary to precisely quantify privacy and statistical utility. In this paper we consider two common measures of privacy, one based on mutual information and the other on the minimum mean-squared error (MMSE). In the context of a survey with $p$ queries and $n$ respondents, we introduce a measure of statistical utility motivated by the performance of principal component analysis (PCA), a statistical method aimed at finding the smallest number of variables that explain a given data set [2, Ch. 9]. More specifically, statistical utility is measured by the gap between the eigenvalues of a certain covariance matrix associated with the responses. This formulation and its motivation rely on classical results from random matrix theory. To facilitate mathematical tractability, we focus on a toy model where the eigenvalues of the data covariance matrix are either large or negligible. For this model, we derive a simple numerical method to compute the utility function. A general treatment of spectrum separation can be found in [3, Ch. 6].

Private versions of PCA have been analyzed in the past, especially under the framework of differential privacy; see [4] and references therein. Many of these analyses rely on results stemming from finite dimensional (random) matrix theory, see, e.g., [5]. The approach in the present paper follows a different path, relying on asymptotic random matrix theory considerations. Our main motivation is twofold: the behavior of the eigenvalues of certain random matrices becomes simpler when the dimensions go to infinity (Thm. 1), and this asymptotic behavior essentially appears in finite dimension (Thm. 2).

In Sec. 2 we present the setting of our problem. The statistical utility function and some of its properties are then introduced in Sec. 3, followed by a privacy analysis in Sec. 4. In particular, we study the privacy-utility trade-off in the spirit of [6], which is also related to the privacy-utility trade-offs in [7] and references therein. In Sec. 5 a simple numerical method for the computation of the utility function is provided. Due to space limitations, all the proofs are deferred to [8].

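Notation. Let $\mathbb{C}^+ := \{z \in \mathbb{C} : \Im(z) > 0\}$ and $\mathbb{C}^- := \{z \in \mathbb{C} : \Im(z) < 0\}$, where $\mathbb{C}$ is the set of complex numbers and $\Im(z)$ is the imaginary part of $z$. For a complex matrix $A$, we let $A_{ij}$ be its $(i,j)$-entry and $A^*$ be its conjugate transpose. The indicator function of a set $E$ is denoted by $\mathbb{1}_E$. For a probability distribution $F$, we let $\mathrm{supp}(F)$ be its support, i.e., the smallest closed set $S$ with $F(S) = 1$. For a $p \times p$ matrix $A$ with eigenvalues $\lambda_1, \dots, \lambda_p$, the probability distribution defined by $F^A([a,b]) = \frac{1}{p}\sum_{i=1}^{p} \mathbb{1}_{\lambda_i \in [a,b]}$ is called the eigenvalue distribution of $A$.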

## 2 Setting

Assume that a survey with $p$ queries is handed to $n$ respondents. Let $\hat{X}$ be the $p \times n$ matrix associated with this survey. We assume that $\hat{X}$ is a realization of a random matrix

$$X = \Sigma^{1/2} W,$$

where $\Sigma$ is a $p \times p$ (deterministic) covariance matrix and $W$ is a $p \times n$ random matrix whose entries are independent and identically distributed (i.i.d.) real random variables with zero mean and unit variance. Note that the columns of $X$ are independent realizations of a random vector with covariance $\Sigma$. A popular instance of this model corresponds to the case where the entries of $W$ are i.i.d. Gaussian random variables; thus the entries of $X$ are possibly correlated Gaussian random variables. The covariance matrix $\Sigma$ possesses valuable statistical information about the respondent population. Hence, in many applications the data aggregator is interested in obtaining an estimate of $\Sigma$. In this setting, the canonical estimator is the sample covariance matrix

$$\hat{\Sigma} := \frac{1}{n}\hat{X}\hat{X}^*.$$

Because of privacy concerns, the respondents might not want to disclose their answers, $\hat{X}$, to the data aggregator. Instead, they might want to use a randomized mechanism to alter their answers, giving them plausible deniability regarding their responses. In this paper we focus on an additive noise model: instead of providing $\hat{X}$ to the data aggregator, the respondents provide

$$\hat{X}_t := \hat{X} + \sqrt{t}\,\hat{Z},$$

where $t$ is a design parameter and $\hat{Z}$ is a realization of $Z$, a $p \times n$ random matrix which is independent of $X$ and whose entries are i.i.d. random variables with zero mean and unit variance. In this case, the sample covariance matrix equals

$$\hat{\Sigma}_t = \frac{1}{n}\hat{X}_t\hat{X}_t^*,$$

a realization of , where . Note that and that , where denotes the identity matrix. The probability distribution of the additive noise may change according to the nature of the data, e.g., discrete or continuous. In particular, both and the distribution of the noise are the design parameters of the privacy mechanism. Observe that given these parameters, this additive mechanism can be implemented locally at each user, making unnecessary the presence of a trustworthy data aggregator.
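As a rough illustration of the mechanism just described, the following Python sketch simulates the perturbed data matrix and its sample covariance under the Gaussian instance of the model. The function name `perturbed_sample_covariance`, the use of NumPy, and the parameter values in the example are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def perturbed_sample_covariance(Sigma, n, t, rng):
    """Draw X = Sigma^{1/2} W, add sqrt(t) Z, and return (1/n) X_t X_t^T.

    Assumes Gaussian W and Z (one instance of the model, chosen here for
    illustration) and a real symmetric positive semidefinite Sigma.
    """
    p = Sigma.shape[0]
    # Symmetric square root of Sigma through its eigendecomposition.
    evals, evecs = np.linalg.eigh(Sigma)
    sqrt_Sigma = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    W = rng.standard_normal((p, n))  # data innovations
    Z = rng.standard_normal((p, n))  # additive privacy noise
    X_t = sqrt_Sigma @ W + np.sqrt(t) * Z
    return X_t @ X_t.T / n

# Hypothetical example: p = 50 queries, n = 5000 respondents, t = 1.
rng = np.random.default_rng(0)
Sigma = np.diag([0.0] * 35 + [7.0] * 10 + [10.0] * 5)
S_t = perturbed_sample_covariance(Sigma, n=5000, t=1.0, rng=rng)
eigenvalues = np.linalg.eigvalsh(S_t)
```

Each column of the perturbed matrix depends only on the corresponding respondent's own answers and noise, which is why the perturbation can be applied locally. With $n$ much larger than $p$, as in this example, the eigenvalues of `S_t` should cluster near those of $\Sigma + tI_p$.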

If for the application at hand the noise distribution is fixed, then the trade-off between privacy and statistical utility becomes evident: as $t$ increases, the respondents' privacy improves because their answers get more distorted but, at the same time, the sample covariance matrix $\hat{\Sigma}_t$ differs more from $\Sigma$. Note that for fixed $p$ and large $n$, the latter is not a problem. Indeed, under some mild assumptions, the law of large numbers implies

$$\lim_{n\to\infty} \bigl\|\Sigma_t - (\Sigma + tI_p)\bigr\|_2^2 \overset{\text{a.s.}}{=} 0,$$

where a.s. stands for almost surely. Hence, the data aggregator might use as an estimate of without incurring a big statistical loss. However, when both and are large (), the estimator is known to be a poor estimate of , e.g., the eigenvalues of might be very different from those of , see Thm. 1. Since in many contemporary applications and are within the same order of magnitude, it is necessary to quantify the statistical cost incurred by the additive noise mechanism in this regime. In the next section we do so by introducing a utility function connected to the performance of PCA.

## 3 Statistical Utility Function

We now introduce a statistical utility function that captures the performance of PCA applied to . In order to motivate its definition, let us consider the following example.

To simplify the exposition, in this section we assume that the entries of and are Gaussian. At the end of this section we comment on the universality of the subsequent analysis.

Example 1. Let and . Assume that is diagonal with eigenvalues , , and with multiplicities , , and , respectively. A histogram of the eigenvalues of an instance of is given in Fig. 1. Note that this distribution is a blurred version of the eigenvalue distribution of ,

$$F^{\Sigma}([a,b]) = \frac{7}{10}\,\mathbb{1}_{0\in[a,b]} + \frac{2}{10}\,\mathbb{1}_{7\in[a,b]} + \frac{1}{10}\,\mathbb{1}_{10\in[a,b]}. \tag{1}$$

As increases, the additive noise becomes stronger, making the eigenvalue distribution of more diffuse, as shown in Fig. 1. This behavior has a direct impact on the benefits of PCA, which provides a dimensionality reduction inversely proportional to the number of largest eigenvalues. For example, PCA performed on would propose the five largest eigenvalues as the most informative components. Similarly, PCA performed on or would suggest the fifteen largest eigenvalues. Since all the eigenvalues of are merged together, PCA in this latter case might be ineffective.

The forthcoming definition of the statistical utility function relies on the following asymptotic considerations. The Gaussianity assumed in this section implies

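$$X_t = \Sigma^{1/2}W + \sqrt{t}\,Z \overset{d}{=} (\Sigma + tI_p)^{1/2}\,W, \tag{2}$$

where $\overset{d}{=}$ stands for equality in distribution. In particular,

$$\Sigma_t \overset{d}{=} \frac{1}{n}(\Sigma + tI_p)^{1/2}\,WW^*\,(\Sigma + tI_p)^{1/2}. \tag{3}$$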
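The distributional identity (2) can be checked column by column. As a brief sketch, write $w_j$ and $z_j$ for the $j$-th columns of $W$ and $Z$ (lower-case names introduced here only for illustration). Under the Gaussian assumption of this section they are independent standard Gaussian vectors, so

```latex
\Sigma^{1/2} w_j + \sqrt{t}\, z_j
  \sim \mathcal{N}\bigl(0,\ \Sigma^{1/2} I_p\, \Sigma^{1/2} + t I_p\bigr)
  = \mathcal{N}\bigl(0,\ \Sigma + t I_p\bigr),
\qquad
(\Sigma + t I_p)^{1/2} w_j \sim \mathcal{N}\bigl(0,\ \Sigma + t I_p\bigr).
```

Both sides of (2) therefore have i.i.d. columns with the same Gaussian law, and (3) follows by forming $\frac{1}{n} X_t X_t^*$ on each side.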

The next theorem [9, Thm. 1.1] is a generalization of the Marchenko-Pastur theorem, a cornerstone of random matrix theory. For a probability distribution function $F$, its Cauchy transform is the (analytic) function $G$ defined for $\Im(z) > 0$ by

$$G(z) = \int_{\mathbb{R}} \frac{1}{z - x}\,dF(x).$$

The Cauchy transform characterizes a distribution function. Indeed, the Stieltjes inversion formula states that

$$F([a,b]) = -\frac{1}{\pi}\lim_{\epsilon\to 0^+}\int_a^b \Im\, G(x + i\epsilon)\,dx,$$

for all continuity points $a < b$ of $F$. When $F$ is regular enough, its density equals $-\frac{1}{\pi}\lim_{\epsilon\to 0^+} \Im\, G(x + i\epsilon)$.

###### Theorem 1 ([9]).

Assume on a common probability space:

(a) For , is , are identically distributed for all , independent across for each , ;

(b) with as ;

(c) is random Hermitian nonnegative definite, with eigenvalue distribution converging a.s. in distribution to a probability distribution on as ;

(d) and are independent.

Let be the Hermitian nonnegative square root of , and let . Then, a.s., the eigenvalue distribution of converges in distribution, as , to a non-random probability distribution , whose Cauchy transform satisfies

$$G(z) = \int_{\mathbb{R}} \frac{1}{z - x\,\bigl(1 - c + c z G(z)\bigr)}\,dH(x), \tag{4}$$

in the sense that, for each , is the unique solution to (4) in .

Note that if is discrete, the integral in (4) reduces to a sum. In particular, if , then is the only root in of a polynomial of degree . For instance, if , then solves the equation , a quadratic polynomial in . The next example shows the predictive power of Thm. 1.

Example 2. In Fig. 2, the histogram of the eigenvalues of a realization of is depicted for two different values of and with . In both cases, the eigenvalue distribution of is given by (1). The asymptotic density of the eigenvalues provided by Thm. 1 is also depicted. Observe the close agreement between the empirical and asymptotic eigenvalue distributions, even for as small as 50.
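An asymptotic density of the kind described in Example 2 can be approximated numerically. The sketch below assumes a discrete limiting distribution $H$ with finitely many atoms, clears the denominators in (4) to obtain a polynomial in $G$, and applies the Stieltjes inversion formula at a small imaginary offset. The function names, the value of `eps`, and the example parameters (in particular $c = 0.1$ and the shift $t = 1$) are assumptions for illustration, not values from the paper.

```python
import numpy as np

def cauchy_transform(z, atoms, weights, c):
    """Solve (4) for a discrete H = sum_k weights[k] * delta_{atoms[k]}.

    Clears denominators to get a polynomial in G and returns the root with
    the smallest imaginary part, which for Im(z) > 0 is taken to be the
    solution in the lower half-plane.
    """
    u = np.poly1d([c * z, 1.0 - c])  # u(G) = 1 - c + c z G
    # Each factor is the polynomial z - tau * u(G) in the variable G.
    factors = [u * (-float(tau)) + z for tau in atoms]
    lhs = np.poly1d([1.0, 0.0])  # the monomial G
    for f in factors:
        lhs = lhs * f
    rhs = np.poly1d([0.0])
    for k, w in enumerate(weights):
        term = np.poly1d([float(w)])
        for j, f in enumerate(factors):
            if j != k:
                term = term * f
        rhs = rhs + term
    roots = np.roots((lhs - rhs).coeffs)
    return roots[np.argmin(roots.imag)]

def limiting_density(x, atoms, weights, c, eps=1e-6):
    """Stieltjes inversion at a small offset: -(1/pi) Im G(x + i*eps)."""
    return -cauchy_transform(x + 1j * eps, atoms, weights, c).imag / np.pi

# Hypothetical example: the atoms of (1) shifted by t = 1, with c = 0.1.
atoms, weights, c = [1.0, 8.0, 11.0], [0.7, 0.2, 0.1], 0.1
grid = np.linspace(0.01, 25.0, 4000)
density = np.array([limiting_density(x, atoms, weights, c) for x in grid])
# Trapezoidal rule; the total mass should be close to one.
mass = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid))
```

Root selection relies on the uniqueness statement of Thm. 1: among the roots of the cleared polynomial, only one should lie in the lower half-plane when $\Im(z) > 0$. A histogram of simulated eigenvalues can be overlaid on `density` for a visual comparison.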

The previous example demonstrates not only that the distribution of the eigenvalues closely follows the corresponding asymptotic density, but also that there are no eigenvalues outside the support of the asymptotic prediction. This observation is formalized in the following theorem [10]. Given $c > 0$ and a probability distribution $H$ on $[0,\infty)$, we let $F_{c,H}$ be the limiting distribution determined by (4).

###### Theorem 2 ([10]).

Assume:

(a) , , are i.i.d. random variables with , , and ;

(b) with as ;

(c) For each , is Hermitian nonnegative definite with eigenvalue distribution converging in distribution to a probability distribution ;

(d) where with , and is any Hermitian square root of ;

(e) The interval with lies outside the support of and for all large .

Then, with probability one, no eigenvalue of appears in for all large .

The previous theorem readily implies that the gaps in the support of appear in finite dimension. This is of particular interest for this paper, as PCA is more useful when there are few large eigenvalues, i.e., there is a gap between large and small eigenvalues. Now we introduce the promised statistical utility function.

In order to keep the analysis tractable, we consider the following toy model for the situation in which there is a clear distinction between large and small eigenvalues: the covariance matrix $\Sigma$ has only one non-zero eigenvalue, say $s$, with multiplicity $\lfloor rp \rfloor$ for some $r \in (0,1)$. Under this assumption, the eigenvalue distribution of $\Sigma + tI_p$ equals (see footnote 1)

$$H_t(x) = (1 - r)\,\mathbb{1}_{x \ge t} + r\,\mathbb{1}_{x \ge s + t}. \tag{5}$$

By (3) and Thm. 1, a.s., the eigenvalue distribution of $\Sigma_t$ converges, as $n \to \infty$, to

$$F_t := F_{c,H_t}. \tag{6}$$

Finally, let be the support of and be the number of its connected components. Note that, by Lemma 1, is finite for every .

###### Definition 1.

The utility function is defined as follows. If , we let . If and are the connected components of ,

$$U(t) = \min_{a \in A_t,\ b \in B_t} |a - b|.$$

In words, approximates the separation between the large and the small eigenvalues of , as long as such separation exists. As exhibited by equations (2) and (3), the large eigenvalues of correspond mainly to the non-zero eigenvalues of , while the small ones come from the added noise . Note that in order for to be well defined, it is necessary for the range of to be a subset of , as established next.

###### Theorem 3.

With the assumptions from (5) and (6), we have that for all .

One way to compute is finding , determining its connected components, and measuring their distance. However, there is a more efficient method based on the discriminant of a cubic equation. To avoid an unnecessary digression, this method is discussed in Sec. 5. Using this method, in Fig. 3 we plot the graph of .

Under our standing assumptions, the performance of PCA is heavily compromised for a noise power such that , as the gap between noise and information disappears. Indeed, for large enough the gap always disappears, as established by the following proposition.

###### Proposition 1.

There exists such that for all .

Note that in Fig. 3 there exists a such that if and only if . Thus, in principle, any noise power does not compromise the performance of PCA. This property makes useful in the design of privacy mechanisms. In view of Thm. 3 and Prop. 1, the existence of such is equivalent to the following.

###### Conjecture 1.

is non-increasing in .

In addition to simulations, there are theoretical reasons to believe in the above conjecture, e.g., similar results are known to be true for other random matrix models [11]. Ultimately, we are interested in the statistical utility function and not only in the set . For this utility function, there is numerical evidence supporting the following stronger conjecture.

###### Conjecture 2.

is non-increasing and convex in .

Remark. In this section we assumed that both data and noise are Gaussian. Nonetheless, one can appeal to universality arguments to establish that the conclusions reached in this section hold for a much wider range of random matrix models. In the square case, when , one can appeal to the universality of the circular law, as established in [12], and the asymptotic freeness of several random matrices, see, e.g., [13, 14]. The non-square case can be handled similarly using the ideas in [15] and references therein.

## 4 Privacy Measures

Having defined as the utility function, we need to specify a privacy function to quantify the trade-off between utility and privacy. A natural option is to measure the information leakage of the user’s raw data in its perturbed version. In this section we discuss two specific measures of information leakage: mutual information and MMSE.

Mutual Information. Let and , , denote the -th column of and , respectively. Since the entries of and are i.i.d., the mutual information does not depend on . Thus, w.l.o.g., we define

$$P_{\mathrm{IT}}(t) := I\bigl(X^{(1)}; X_t^{(1)}\bigr),$$

as a privacy measure. Measuring privacy in terms of mutual information has been explored extensively in the past, see, e.g., [16]. Assuming that both data and noise are Gaussian,

$$P^{G}_{\mathrm{IT}}(t) = \frac{1}{2}\log\det\Bigl(I_p + \frac{\Sigma}{t}\Bigr).$$

In particular, for the toy model of the previous section,

$$P^{r,s,p}_{\mathrm{IT}}(t) = \frac{\lfloor rp \rfloor}{2}\log\Bigl(1 + \frac{s}{t}\Bigr).$$

In the context of the last remark of the previous section, i.e., when data and/or noise are not necessarily Gaussian, it is relevant to consider the following.

Assume that the noise is Gaussian but the data is drawn from an arbitrary distribution having a density and finite third moment. Let $\theta := 1/\sqrt{t}$. With this notation,

$$P_{\mathrm{IT}}(t) = h\bigl(\theta X^{(1)} + Z^{(1)}\bigr) - \frac{p}{2}\log 2\pi e,$$

where denotes differential entropy. In particular, studying amounts to studying . If , then it follows from [17, Lemma 1] that, as ,

$$I\bigl(X^{(1)}; \theta X^{(1)} + Z^{(1)}\bigr) = \frac{\theta^2}{2} + o(\theta^2),$$

and thus in the high privacy regime (). For , the chain rule implies

$$I\bigl(X^{(1)}; \theta X^{(1)} + Z^{(1)}\bigr) \le p\,\mathrm{Tr}(\Sigma)\,\frac{\theta^2}{2} + o(\theta^2),$$

and hence in the high privacy regime.

Now assume that neither data nor noise is Gaussian. Recall that the non-Gaussianity of a random vector is defined as , where denotes the Kullback-Leibler divergence, and is a Gaussian random vector with the same mean and covariance matrix as . It can be shown that

$$I\bigl(X^{(1)}; X_t^{(1)}\bigr) = P^{G}_{\mathrm{IT}}(t) + D\bigl(\sqrt{t}\,Z^{(1)}\bigr) - D\bigl(X_t^{(1)}\bigr).$$

In this case, regardless of distributions of and ,

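$$P_{\mathrm{IT}}(t) \le P^{G}_{\mathrm{IT}}(t) + D\bigl(\sqrt{t}\,Z^{(1)}\bigr) = P^{G}_{\mathrm{IT}}(t) + p\,D\bigl(\sqrt{t}\,Z_{11}\bigr),$$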

where the last equality holds as the entries of are i.i.d.

MMSE. In [18], see also [16], the authors proposed an estimation-theoretic measure in terms of MMSE. Following this approach, we define

$$P_{\mathrm{ET}}(t) := \sum_{i=1}^{p} \mathrm{mmse}(X_{i1} \mid X_t) = \mathbb{E}\Bigl[\bigl\|X^{(1)} - \mathbb{E}[X^{(1)} \mid X_t]\bigr\|^2\Bigr],$$

where . If both data and noise are Gaussian, then we can write

$$P^{G}_{\mathrm{ET}}(t) = \mathrm{Tr}\bigl[(I_p + t^{-1}\Sigma)^{-1}\Sigma\bigr].$$

In particular, for the toy model in the previous section

$$P^{r,s,p}_{\mathrm{ET}}(t) = \lfloor rp \rfloor\,\frac{ts}{t + s}.$$

It is worth pointing out that and are connected by the so-called I-MMSE relation, see [19]. For example, when the noise is Gaussian, . In this case, quantifies the rate of decrease of the information-theoretic privacy leakage .
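The Gaussian expressions for the two privacy measures are easy to evaluate numerically. The following sketch compares the general log-determinant and trace formulas with the toy-model closed forms, and checks the I-MMSE relation by finite differences in the form $\frac{d}{dt} P^{G}_{\mathrm{IT}}(t) = -P^{G}_{\mathrm{ET}}(t)/(2t^2)$, which is how the relation reads after the change of variable $\mathrm{snr} = 1/t$. The function names and parameter values are hypothetical, and the form of the relation used here may be written differently from the paper's statement.

```python
import numpy as np

def gaussian_mutual_information(Sigma, t):
    """(1/2) log det(I_p + Sigma / t), in nats."""
    p = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(p) + Sigma / t)
    return 0.5 * logdet

def gaussian_mmse(Sigma, t):
    """Tr[(I_p + Sigma / t)^{-1} Sigma]."""
    p = Sigma.shape[0]
    return np.trace(np.linalg.solve(np.eye(p) + Sigma / t, Sigma))

def toy_mutual_information(r, s, p, t):
    """Closed form for the toy model: (floor(rp) / 2) * log(1 + s / t)."""
    return np.floor(r * p) / 2.0 * np.log1p(s / t)

def toy_mmse(r, s, p, t):
    """Closed form for the toy model: floor(rp) * t * s / (t + s)."""
    return np.floor(r * p) * t * s / (t + s)

# Hypothetical parameters, chosen only for illustration.
r, s, p, t = 0.25, 10.0, 20, 2.0
k = int(np.floor(r * p))
Sigma = np.diag([s] * k + [0.0] * (p - k))

mi = gaussian_mutual_information(Sigma, t)
mmse = gaussian_mmse(Sigma, t)

# Central finite difference of the mutual information with respect to t.
h = 1e-5
dmi_dt = (gaussian_mutual_information(Sigma, t + h)
          - gaussian_mutual_information(Sigma, t - h)) / (2.0 * h)
```

For a diagonal covariance with $\lfloor rp \rfloor$ entries equal to $s$, the determinant factorizes as $(1 + s/t)^{\lfloor rp \rfloor}$, which is why `mi` and `mmse` should agree with the closed forms up to rounding.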

Privacy-Utility Function. In order to formally connect privacy and utility, we define the following privacy-utility function in the spirit of [6]; see also [7] and references therein. For a given privacy level $\epsilon$, we define

$$g_{\mathrm{IT}}(\epsilon) := \sup_{t\,:\,P_{\mathrm{IT}}(t) \le \epsilon} U(t).$$

In words, equals the largest utility under the privacy constraint . Conditional on the non-increasing behavior of (Conj. 2), it is easy to verify that for the model of the previous section

$$g^{r,s,p}_{\mathrm{IT}}(\epsilon) = U\Bigl(\bigl(P^{r,s,p}_{\mathrm{IT}}\bigr)^{-1}(\epsilon)\Bigr), \tag{7}$$

where . Since can be computed using the tools from the following section, (7) provides a useful way to compute the privacy-utility function . Fig. 4 depicts for , and . Observe that, conditional on the existence of (Conj. 1), if and only if .
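For concreteness, the inverse appearing in (7) can be worked out from the toy-model expression for the information-theoretic privacy function. The following is a sketch assuming that the inverse in (7) is the ordinary functional inverse on $t > 0$ and that $\lfloor rp \rfloor \ge 1$. Setting the toy-model expression equal to $\epsilon > 0$ and solving for $t$ gives

```latex
\epsilon = \frac{\lfloor rp \rfloor}{2}\,\log\Bigl(1 + \frac{s}{t}\Bigr)
\quad\Longleftrightarrow\quad
t = \frac{s}{\exp\bigl(2\epsilon / \lfloor rp \rfloor\bigr) - 1}.
```

Because this privacy function is decreasing in $t$, the constraint in the definition of the privacy-utility function is equivalent to requiring $t$ to be at least the value on the right. A non-increasing utility then attains its supremum at that boundary value, which is the content of (7).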

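The privacy-utility trade-off for $P_{\mathrm{ET}}$ can be handled similarly by replacing the constraint $P_{\mathrm{IT}}(t) \le \epsilon$ with $P_{\mathrm{ET}}(t) \ge \epsilon$, as two highly correlated random variables possess a high mutual information but, at the same time, a small MMSE.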

## 5 Numerical Computation of U

Throughout this section , , and are fixed. For , we let be the Cauchy transform of with defined as in (5). By Thm. 1, for each , is a solution to the equation

$$G = \frac{1 - r}{z - t\,(1 - c + czG)} + \frac{r}{z - (t + s)\,(1 - c + czG)}.$$

Alternatively, is a root of the polynomial

$$P_{t,z}(G) = A_{t,z}G^3 + B_{t,z}G^2 + C_{t,z}G + D_{t,z} \in \mathbb{C}[G],$$

where , , , and with and . The following lemma provides a characterization of .

###### Lemma 1.

Let be the (real) polynomial given by

$$x \mapsto 18\,A_{t,x}B_{t,x}C_{t,x}D_{t,x} - 4\,B_{t,x}^3 D_{t,x} + B_{t,x}^2 C_{t,x}^2 - 4\,A_{t,x} C_{t,x}^3 - 27\,A_{t,x}^2 D_{t,x}^2.$$

Then, is the closure of .

The above lemma suggests a simple method to compute : find the positive roots of , identify where is positive and negative, and subtract the roots delimiting the gap of interest. This process is depicted in Fig. 5, where the support of is represented by thick blue lines and the value of equals the third minus the second positive root of .
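The method above can be sketched in a few lines of Python. The cubic coefficients used below are one consistent choice, obtained here by clearing denominators in the fixed-point equation of this section with $u = (1 - c) + czG$; their normalization may differ from the paper's. The sketch also assumes that the support corresponds to the region where the discriminant is negative (the case in which the cubic has a pair of non-real roots), that $t > 0$ and $0 < c < 1$, and it locates sign changes on a grid instead of solving for the roots exactly. All function names are hypothetical.

```python
import numpy as np

def cubic_coefficients(x, t, s, r, c):
    """Coefficients (A, B, C, D) of the cubic in G at the real point z = x.

    Derived by clearing denominators with u = alpha + beta * G; this is an
    illustrative normalization, not necessarily the paper's.
    """
    alpha, beta = 1.0 - c, c * x
    kappa = t + (1.0 - r) * s
    A = t * (t + s) * beta**2
    B = beta * (2.0 * t * (t + s) * alpha - (2.0 * t + s) * x)
    C = (x**2 - (2.0 * t + s) * alpha * x
         + t * (t + s) * alpha**2 + kappa * beta)
    D = kappa * alpha - x
    return A, B, C, D

def discriminant(x, t, s, r, c):
    """Discriminant of the cubic, as in the polynomial of Lemma 1."""
    A, B, C, D = cubic_coefficients(x, t, s, r, c)
    return (18.0 * A * B * C * D - 4.0 * B**3 * D + B**2 * C**2
            - 4.0 * A * C**3 - 27.0 * A**2 * D**2)

def utility(t, s, r, c, n_grid=20_001):
    """Grid approximation of U(t): gap between the two support components.

    Returns 0.0 unless exactly two components are detected on the grid.
    """
    # The support is contained in (0, (t + s) * (1 + sqrt(c))^2).
    x_max = 1.05 * (t + s) * (1.0 + np.sqrt(c))**2
    x = np.linspace(x_max / n_grid, x_max, n_grid)
    in_support = discriminant(x, t, s, r, c) < 0.0
    changes = np.flatnonzero(np.diff(in_support.astype(int)))
    if len(changes) != 4:
        return 0.0
    # changes[1]: last grid point of the lower component;
    # changes[2] + 1: first grid point of the upper component.
    return x[changes[2] + 1] - x[changes[1]]

# Hypothetical parameters: a well-separated case and a merged case.
U_separated = utility(t=1.0, s=10.0, r=0.5, c=0.01)
U_merged = utility(t=100.0, s=10.0, r=0.5, c=0.5)
```

The grid resolution limits the accuracy of the returned gap to about two grid steps; a root finder applied to `discriminant` near each sign change would refine the endpoints if higher precision is needed.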

### Footnotes

1. More precisely, which is negligible.

### References

1. C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, Our Data, Ourselves: Privacy Via Distributed Noise Generation.   Springer Heidelberg, 2006, pp. 486–503.
2. R. J. Muirhead, Aspects of multivariate statistical theory.   John Wiley & Sons, 2009, vol. 197.
3. Z. Bai and J. W. Silverstein, Spectral analysis of large dimensional random matrices.   Springer, 2010, vol. 20.
4. K. Chaudhuri, A. D. Sarwate, and K. Sinha, “A near-optimal algorithm for differentially-private principal components,” JMLR, vol. 14, no. 1, pp. 2905–2943, 2013.
5. L. Wei, A. D. Sarwate, J. Corander, A. Hero, and V. Tarokh, “Analysis of a privacy-preserving PCA algorithm using random matrix theory,” in Signal and Information Processing (GlobalSIP).   IEEE, 2016, pp. 1335–1339.
6. S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, “Information extraction under privacy constraints,” Information, vol. 7, no. 1, p. 15, 2016.
7. L. Sankar, S. R. Rajagopalan, and H. V. Poor, “Utility-privacy tradeoffs in databases: An information-theoretic approach,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 6, pp. 838–852, 2013.
8. M. Diaz, S. Asoodeh, F. Alajaji, T. Linder, S. Belinschi, and J. Mingo, “On the noise-information separation of a private principal component analysis scheme,” To appear.
9. J. W. Silverstein, “Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices,” Journal of Multivariate Analysis, vol. 55, no. 2, pp. 331–339, 1995.
10. Z.-D. Bai and J. W. Silverstein, “No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices,” Annals of Probability, pp. 316–345, 1998.
11. P. Biane, “On the free convolution with a semi-circular distribution,” Indiana University Mathematics Journal, pp. 705–718, 1997.
12. T. Tao and V. Vu, “Random matrices: Universality of ESDs and the circular law,” The Ann. Probability, vol. 38, no. 5, pp. 2023–2065, 2010, with an appendix by M. Krishnapur.
13. A. Nica and R. Speicher, Lectures on the combinatorics of free probability.   Cambridge University Press, 2006, vol. 13.
14. J. Mingo and R. Speicher, Free Probability and Random Matrices, ser. Fields Institute Monographs.   Springer-Verlag New York, 2017, vol. 35.
15. F. Benaych-Georges, “Rectangular random matrices, entropy, and Fisher’s information,” Journal of Operator Theory, pp. 371–419, 2009.
16. F. P. Calmon, A. Makhdoumi, M. Médard, M. Varia, M. Christiansen, and K. R. Duffy, “Principal inertia components and applications,” IEEE Trans. Inf. Theory, vol. 63, no. 8, pp. 5011–5038, Aug 2017.
17. M. S. Pinsker, V. V. Prelov, and S. Verdú, “Sensitivity of channel capacity,” IEEE Trans. Inf. Theory, vol. 41, no. 6, pp. 1877–1888, Nov 1995.
18. S. Asoodeh, F. Alajaji, and T. Linder, “Privacy-aware MMSE estimation,” in IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 1989–1993.
19. D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1261–1282, April 2005.