Optimal query complexity for estimating the trace of a matrix

Karl Wimmer (Duquesne University), Yi Wu (Purdue University), and Peng Zhang (Purdue University). Most of this work was done while the first two authors were visiting the Simons Institute for the Theory of Computing, University of California, Berkeley. Karl Wimmer is supported in part by NSF grant CCF-1117079.
Abstract

Given an implicit $n\times n$ matrix $A$ with oracle access $x^\top A x$ for any $x\in\mathbb{R}^n$, we study the query complexity of randomized algorithms for estimating the trace of the matrix. This problem has many applications in quantum physics, machine learning, and pattern matching. Two metrics are commonly used for evaluating the estimators: i) variance; ii) a high probability multiplicative-approximation guarantee. Almost all the known estimators are of the form $\frac{1}{k}\sum_{i=1}^{k} x_i^\top A x_i$ for $x_1,\dots,x_k$ being i.i.d. samples from some special distribution.

Our main results are summarized as follows:

  1. We give an exact characterization of the minimum variance unbiased estimator in the broad class of linear nonadaptive estimators (which subsumes all the existing known estimators).

  2. We also consider the query complexity lower bounds for any (possibly nonlinear and adaptive) estimators:

    1. We show that any estimator requires $\Omega\!\left(\frac{1}{\epsilon}\right)$ queries to have a guarantee of variance at most $\epsilon\|A\|_F^2$.

    2. We show that any estimator requires $\Omega\!\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ queries to achieve a $(1\pm\epsilon)$-multiplicative approximation guarantee with probability at least $1-\delta$.

    Both of the above lower bounds are asymptotically tight.

As a corollary, we also resolve a conjecture in the seminal work of Avron and Toledo (Journal of the ACM 2011) regarding the sample complexity of the Gaussian Estimator.

1 Introduction

Given an $n\times n$ matrix $A$, we study the problem of estimating its trace

$$\operatorname{tr}(A)=\sum_{i=1}^{n} A_{ii}$$

with a randomized algorithm that can query $x^\top A x$ for any $x\in\mathbb{R}^n$. The goal is to minimize the number of queries used to achieve a certain type of accuracy guarantee, such as the variance of the estimate or a multiplicative approximation (which holds with high probability). Finding an estimator that achieves such an accuracy guarantee with few queries has several applications. For example, this problem is well studied in the subject of lattice quantum chromodynamics, since such queries are physically feasible and can be used to efficiently estimate the trace of a function of a large matrix, such as $\operatorname{tr}(f(A))$. Such an estimator can also be used as a building block for many other applications including solving least-squares problems [Hut89], computing the number of triangles in a graph [Avr10, Tso08], and string pattern matching [ACD01, AGW13].

This problem has been well studied in the literature. All of the previously analyzed estimators are of the form $\sum_{i=1}^{k} w_i\,x_i^\top A x_i$ for queries $x_1,\dots,x_k$; nearly all take $x_1,\dots,x_k$ to be independent and identically distributed (i.i.d.) from some well designed distribution. For example, in [Hut89], the author takes each query to be a random vector whose entries are i.i.d. Rademacher random variables (i.e., each coordinate is a uniformly random sample from $\{-1,+1\}$); we call this the Rademacher estimator. There are also several other alternative distributions on the queries, such as drawing each query from a multivariate normal distribution [SR97]; we call this the Gaussian estimator. Here, the coordinates of each vector are i.i.d. Gaussian random variables. The work of [IE04] considers the case where only one query is allowed, but that query can be an arbitrary unit vector in $\mathbb{R}^n$. Other estimators occur in [DS93, Wan94]. In recent work [AT11], the authors propose several new estimators such as the unit vector estimator, the normalized Rayleigh-quotient trace estimator, and the mixed unit vector estimator. One estimator that does not use i.i.d. queries is due to [RKA13]; in that work, the authors propose querying random standard basis vectors sampled without replacement.
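
To make the two most common estimators concrete, here is a minimal sketch in Python/NumPy (the function names are ours, not from the literature); both average the quadratic-form responses $x^\top A x$ over i.i.d. queries.

```python
import numpy as np

def rademacher_estimator(A, k, rng=np.random.default_rng()):
    """Hutchinson-style estimator: average of x^T A x over k i.i.d.
    Rademacher query vectors (each coordinate uniform in {-1, +1})."""
    n = A.shape[0]
    X = rng.choice([-1.0, 1.0], size=(k, n))           # k Rademacher queries
    return np.mean([x @ A @ x for x in X])

def gaussian_estimator(A, k, rng=np.random.default_rng()):
    """Gaussian estimator: average of x^T A x over k i.i.d. standard
    Gaussian query vectors."""
    n = A.shape[0]
    X = rng.standard_normal((k, n))                    # k Gaussian queries
    return np.mean([x @ A @ x for x in X])

# Both are unbiased for tr(A): with many queries the estimates concentrate.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)); A = (A + A.T) / 2   # symmetric test matrix
print(np.trace(A), rademacher_estimator(A, 2000, rng), gaussian_estimator(A, 2000, rng))
```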

To characterize the performance of an estimator, perhaps the most natural metric is the variance of the estimator. It is known that (with a single query) the Gaussian estimator has variance $2\|A\|_F^2$ and the random Rademacher vector estimator has variance $2\big(\|A\|_F^2-\sum_{i} A_{ii}^2\big)$, where $\|A\|_F$ is the Frobenius norm; averaging $k$ i.i.d. queries reduces each variance by a factor of $k$. In recent work by Avron and Toledo [AT11], it is suggested that the notion of a multiplicative approximation guarantee might be a better success metric of an estimator than the variance. Formally, we say an estimator is an $(\epsilon,\delta)$-estimator if it outputs an estimate in the interval $[(1-\epsilon)\operatorname{tr}(A),(1+\epsilon)\operatorname{tr}(A)]$ with probability at least $1-\delta$. It should be noted that some assumptions on the matrices need to be made to have a valid $(\epsilon,\delta)$-estimator, as it is impossible to achieve any multiplicative approximation when the matrix could have a trace of $0$. A natural choice is to assume that $A$ comes from the class of symmetric positive semidefinite (SPD) matrices. For an SPD matrix, the authors in [AT11] prove that the Gaussian estimator with $O\!\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ queries to the oracle is an $(\epsilon,\delta)$-estimator. It was recently shown in [RKA13] that the random Rademacher vector estimator is also an $(\epsilon,\delta)$-estimator with the same sample complexity.
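
As a quick sanity check of the variance formulas quoted above (a sketch of ours, not taken from the paper), one can estimate the single-query variances empirically and compare them with $2\|A\|_F^2$ and $2\big(\|A\|_F^2-\sum_i A_{ii}^2\big)$ for a symmetric test matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
A = rng.standard_normal((n, n)); A = (A + A.T) / 2    # symmetric test matrix
fro2 = np.sum(A * A)                                  # ||A||_F^2
diag2 = np.sum(np.diag(A) ** 2)                       # sum_i A_ii^2

m = 100_000                                           # number of single-query trials
G = rng.standard_normal((m, n))                       # Gaussian queries
R = rng.choice([-1.0, 1.0], size=(m, n))              # Rademacher queries
quad = lambda X: np.sum((X @ A) * X, axis=1)          # row-wise x^T A x

print(np.var(quad(G)), 2 * fro2)                      # Gaussian:   ~ 2 ||A||_F^2
print(np.var(quad(R)), 2 * (fro2 - diag2))            # Rademacher: ~ 2 (||A||_F^2 - sum_i A_ii^2)
```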

An open problem asked in [AT11] is the following: does the Gaussian estimator require $\Omega\!\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ queries in order to be an $(\epsilon,\delta)$-estimator? The authors showed that this number of queries suffices and conjectured that their analysis of the Gaussian estimator is tight, with supporting evidence from empirical experiments. The paper gives some intuition on how to show an $\Omega\!\left(\frac{1}{\epsilon^2}\right)$ lower bound. The authors suggested that the difficulty of turning this argument into a formal proof is that “current bounds [on the cumulative distribution function] are too complex to provide a useful lower bound”. Regarding lower bounds for trace estimators, we note the related work of [LNW14], which considers the problem of sketching the nuclear norm of $A$ using bilinear sketches (which can be viewed as nonadaptive queries of the form $x^\top A y$). That problem is similar to estimating the trace when the underlying matrix is positive semidefinite.

All of the above mentioned estimators (with one exception in [RKA13]) use independent, identically distributed queries from some special distributions, and the output is a linear combination of the query results. On the other hand, when viewing an estimator as a randomized algorithm, we can choose any distribution over the queries, and the output can be any (possibly randomized) function of the results of the queries. Given the success of the previously mentioned estimators, it is natural to ask whether these extensions are helpful. For example, can we get a significantly better estimator with a non-i.i.d. distribution? Can we do better with adaptive queries? Can we do better with a nonlinear combination of the query results?

In this paper, we make progress on answering the above questions and understanding the optimal query complexity for randomized trace estimators. Below is an informal summary of our results.

  1. Among all the linear nonadaptive trace estimators (which subsumes all the existing trace estimators), we prove that the “random orthogonal vector” estimator is the minimum variance estimator. The distribution on the queries is not i.i.d., and we are unable to find an occurrence of this estimator in the literature regarding trace estimators.

  2. We also prove two asymptotically optimal lower bounds for any (possibly adaptive and possibly nonlinear) estimator.

    1. We show that every trace estimator requires $\Omega(1/\epsilon)$ queries to have a guarantee that the variance of the estimator is at most $\epsilon\|A\|_F^2$.

    2. We show that every $(\epsilon,\delta)$-estimator requires $\Omega\!\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ queries.

As a simple corollary, our result also confirms the above-mentioned conjecture in [AT11] (as well as the tightness of the analysis of the Rademacher estimator in [RKA13]). Notice that our result is a much stronger statement: the original conjectured lower bound is only for an estimator that returns a linear combination of i.i.d. Gaussian queries; we prove that the lower bound holds for any estimator. Our lower bound also suggests that adaptivity as well as nonlinearity does not help asymptotically, as all these lower bounds are matched by the nonadaptive Gaussian estimator. On the other hand, our upper bound suggests that the exact minimum variance estimator might not use i.i.d. queries.

1.1 Problem Definitions

Definition 1 (estimator for the trace)

A trace estimator is a randomized algorithm that, given query access to an oracle for an unknown matrix $A\in\mathbb{R}^{n\times n}$, makes a sequence of queries $q_1,q_2,\dots,q_k\in\mathbb{R}^n$ to the oracle and receives the answers $q_1^\top A q_1,\,q_2^\top A q_2,\dots,\,q_k^\top A q_k$. The output of the estimator is a real number determined by the queries and the answers to the queries.

Definition 2 (nonadaptive linear unbiased trace estimator)

We say a trace estimator is nonadaptive if the distribution of the queries $q_1,\dots,q_k$ does not depend on the answers to the queries. A trace estimator is linear if it samples from a joint distribution over queries as well as their weights, $(q_1,\dots,q_k)$ and $(w_1,\dots,w_k)$, and outputs $\sum_{i=1}^{k} w_i\,q_i^\top A q_i$. In addition, a linear trace estimator is unbiased if $\mathbb{E}\big[\sum_{i=1}^{k} w_i\,q_i^\top A q_i\big]=\operatorname{tr}(A)$ for every matrix $A$.

Without loss of generality, we can assume that all the queries in a linear estimator are of unit length, where the actual lengths of the queries are absorbed by the weights.

The most natural measure of quality of an estimator is its variance. There is a large body of work on the existence of and finding a minimum variance unbiased estimator. Such an estimator has a strong guarantee; it is the estimator for which the variance is minimized for all possible values of the parameter to estimate. In general, finding such an estimator is quite difficult. It is easy to see that the variance depends on the scale of the matrix. To normalize, we assume that the Frobenius norm of the matrix is fixed.

Definition 3

We define the variance of a trace estimator as the worst case of the variance over all matrices with Frobenius norm $1$. To be specific, given a matrix $A$, let $\operatorname{Var}_A$ denote the variance of the estimator's output on $A$; then the variance of the estimator is $\sup_{\|A\|_F=1}\operatorname{Var}_A$.

If the variance of an estimator is at most $\epsilon$, we say that it is an $\epsilon$-variance estimator.

Given an unbiased estimator class, the minimum variance unbiased estimator has the minimum variance among all the (unbiased) estimators in the class.

Another natural accuracy guarantee for a trace estimator is the notion of an $(\epsilon,\delta)$-estimator, which is introduced in [AT11].

Definition 4 ($(\epsilon,\delta)$-estimator)

A trace estimator is said to be an $(\epsilon,\delta)$-estimator of the trace if, for every matrix $A$, its output lies in the interval $[(1-\epsilon)\operatorname{tr}(A),(1+\epsilon)\operatorname{tr}(A)]$ with probability at least $1-\delta$.

We stress that both Definitions 3 and 4 involve worst case estimates over the choice of the matrix, and the randomness only comes from the internal randomness of the estimator.

1.2 Main Results

Our main results are as follows:

Theorem 1.1

Among all linear nonadaptive unbiased trace estimators, the minimum variance unbiased estimator that makes $k$ queries is achieved by sampling $k$ random orthogonal unit vectors $q_1,\dots,q_k$ (see Definition 5) and outputting $\frac{n}{k}\sum_{i=1}^{k} q_i^\top A q_i$.

Theorem 1.2

Any trace estimator with variance at most $\epsilon$ requires $\Omega(1/\epsilon)$ queries.

Theorem 1.3

Any $(\epsilon,\delta)$-estimator for the trace requires $\Omega\!\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ queries, even if the unknown matrix is known to be positive semidefinite.

The bounds in Theorems 1.2 and 1.3 are tight: both bounds can be asymptotically matched by the Gaussian estimator and the random Rademacher vector estimator.

1.3 Proof Techniques Overview

All of our results crucially use a powerful yet simple trick, which we call symmetrization. The heart of this trick lies in the fact that the trace of a matrix is unchanged under similarity transformations: $\operatorname{tr}(Q^\top A Q)=\operatorname{tr}(A)$ for every $A$ and orthogonal $Q$. If we have a nonadaptive estimator with query distribution $\mathcal{D}$ and an orthogonal matrix $Q$, using queries distributed as $Q\mathcal{D}$ (that is, applying $Q$ to each query) should not be too different in terms of worst-case behavior. (We have to be more careful with adaptive estimators, which we discuss in Section 3.) Thus, applying symmetrization to a nonadaptive estimator yields a nonadaptive estimator that draws queries as in the original estimator, but transforms the queries using a random orthogonal transformation. This “symmetrizes” the estimator. We prove that the performance of the estimator never decreases when symmetrization is applied, so we can exclusively consider symmetrized estimators.

In order to characterize the minimum variance linear nonadaptive unbiased estimator, we notice that after the symmetrization, the distribution over queries for any such estimator is defined by a distribution over the pairwise angles of the queries. We then show that the queries should be taken to be orthogonal with certainty in order to minimize variance.

As for the lower bounds for adaptive and nonlinear estimators, the symmetrization also plays an important role. Consider the problem of proving a query lower bound for $(\epsilon,\delta)$-approximation: the most common approach to proving such a lower bound is to use Yao’s minimax principle. To apply this principle, we would need to construct two distributions of matrices such that the distributions cannot be distinguished after making a number of queries, even though the traces of the matrices are very different in the two distributions. There are several technical difficulties in applying the minimax principle directly here. First of all, the query space is continuous (all of $\mathbb{R}^n$), so it is unclear whether one can assert that there exists a sufficiently generalized minimax principle to handle this case. Second, even if one can apply a suitable version of the minimax principle, we do not have general techniques for analyzing the distribution of adaptive queries, especially when the queries involve real numbers and thus the algorithm might have infinitely many branches.

We overcome the above two barriers and avoid using a minimax principle entirely by applying symmetrization. One nice property of the symmetrization process is that a symmetrized estimator outputs the same distribution of results on all matrices with the same diagonalization. In the proof, we carefully construct two distributions of matrices such that all matrices within each distribution share the same diagonalization, while the traces differ between the two distributions. Each distribution is simply the “orbit” of a single diagonal matrix; the support consists of all matrices orthogonally similar to it. Using the symmetrization, it suffices to show that we cannot distinguish these two distributions of matrices by adaptive queries, as for symmetrized trace estimators this is equivalent to distinguishing two diagonal matrices. The argument for achieving a lower bound for adaptive estimators is more subtle; due to the structure of symmetrized estimators, we define a stronger query model in which adaptive estimators behave the same as nonadaptive estimators, and we achieve the same lower bound even with the stronger query model.

1.4 Organization

In Section 2, we define the mathematical tools that are needed in our analysis. In Section 3, we introduce the idea of symmetrization. We prove Theorem 1.1 in Section 4. In Section 5, we prove Theorem 1.2. In Section 6, we prove Theorem 1.3.

2 Preliminaries

Definition 5 (random Gaussian matrix and random orthogonal matrix)
  • We call a vector in $\mathbb{R}^n$ a random Gaussian vector if each coordinate is sampled independently from $\mathcal{N}(0,1)$.

  • We call an $n\times n$ matrix a random Gaussian matrix if its entries are sampled independently from $\mathcal{N}(0,1)$.

  • We call an $n\times n$ matrix a random orthogonal matrix if it is drawn from the distribution whose probability measure is the Haar measure on the group of $n\times n$ orthogonal matrices; specifically, it is the unique probability measure that is invariant under orthogonal transformations.

  • We call vectors $u_1,\dots,u_k$ ($k\le n$) random orthogonal unit vectors if they are chosen as $k$ row vectors of a random orthogonal matrix.

We note that one way to generate a random orthogonal matrix is to generate a random Gaussian matrix and perform Gram-Schmidt orthonormalization on its rows.
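
A minimal sketch of that recipe in NumPy (using a QR factorization in place of explicit Gram-Schmidt; the sign correction on the diagonal of $R$ is the standard adjustment that makes the output exactly Haar-distributed):

```python
import numpy as np

def haar_orthogonal(n, rng=np.random.default_rng()):
    """Sample an n x n orthogonal matrix from the Haar measure: QR-factorize
    a random Gaussian matrix, then rescale the columns of Q by the signs of
    the diagonal of R so the distribution is exactly rotation invariant."""
    G = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(G)
    return Q * np.sign(np.diag(R))

# k random orthogonal unit vectors = the first k rows of a Haar orthogonal matrix.
Q = haar_orthogonal(5)
print(np.allclose(Q @ Q.T, np.eye(5)))   # True: the rows are orthonormal
```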

Definition 6 (total variation distance)

Let $p$ and $q$ be two distributions with density functions $f_p$ and $f_q$ over a domain $\Omega$. The total variation distance between $p$ and $q$ is defined as $d_{\mathrm{TV}}(p,q)=\frac{1}{2}\int_{\Omega}\big|f_p(x)-f_q(x)\big|\,dx$.

Proposition 1

Suppose we are given one sample from either distribution $p$ or distribution $q$, defined on the same sample space, and we are asked to distinguish which distribution the sample came from. The success probability of any algorithm is at most $\frac{1}{2}+\frac{1}{2}\,d_{\mathrm{TV}}(p,q)$.

Definition 7 (KL-divergence)

Let $p$ and $q$ be two distributions with density functions $f_p$ and $f_q$ over a domain $\Omega$. The Kullback-Leibler divergence of $q$ from $p$ is defined as $D_{\mathrm{KL}}(p\,\|\,q)=\int_{\Omega} f_p(x)\log\frac{f_p(x)}{f_q(x)}\,dx$.

We know the following relationship between Kullback-Leibler divergence and total variation distance, which is also known as Pinsker’s Inequality.

Theorem 2.1

Suppose the KL-divergence between distributions $p$ and $q$ is $D_{\mathrm{KL}}(p\,\|\,q)$, and their total variation distance is $d_{\mathrm{TV}}(p,q)$. Then

$$d_{\mathrm{TV}}(p,q)\le\sqrt{\tfrac{1}{2}\,D_{\mathrm{KL}}(p\,\|\,q)}.$$

To bound the total variation distance of the distributions we consider, we compute the KL-divergence and apply Pinsker’s Inequality. For example, the KL-divergence between two multivariate Gaussian distributions is well known; we will only use the following special case for multivariate Gaussian distributions with mean $\mathbf{0}$ and covariance matrices $\Sigma_0$ and $\Sigma_1$.

Theorem 2.2

Given two $n$-dimensional Gaussian distributions $\mathcal{N}(\mathbf{0},\Sigma_0)$ and $\mathcal{N}(\mathbf{0},\Sigma_1)$, we have that

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mathbf{0},\Sigma_0)\,\big\|\,\mathcal{N}(\mathbf{0},\Sigma_1)\big)=\frac{1}{2}\left(\operatorname{tr}\big(\Sigma_1^{-1}\Sigma_0\big)-n+\ln\frac{\det\Sigma_1}{\det\Sigma_0}\right).$$
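
The following sketch (with arbitrary illustrative covariance matrices of our choosing) evaluates this closed form numerically, along with the bound on the total variation distance that Pinsker's inequality then gives:

```python
import numpy as np

def kl_zero_mean_gaussians(S0, S1):
    """KL( N(0,S0) || N(0,S1) ) = (1/2) * ( tr(S1^{-1} S0) - n + ln(det S1 / det S0) )."""
    n = S0.shape[0]
    tr_term = np.trace(np.linalg.solve(S1, S0))
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (tr_term - n + (logdet1 - logdet0))

# Pinsker's inequality: d_TV <= sqrt(KL / 2).  Illustrative covariance matrices:
n = 4
S0 = np.eye(n)
S1 = np.eye(n) + 0.1 * np.ones((n, n))
kl = kl_zero_mean_gaussians(S0, S1)
print(kl, np.sqrt(kl / 2))   # the KL-divergence and the resulting bound on d_TV
```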

3 Symmetrization of an estimator

In this section, we introduce the idea of the symmetrization of an estimator, which is a crucial element of all our remaining proofs. We first define the rotation of an estimator $T$, which we will denote $T^Q$ for an orthogonal matrix $Q$. Intuitively, the construction of $T^Q$ is such that $T^Q$ emulates the behavior of $T$ on a rotated version of the matrix $A$. More specifically, $T^Q$ makes queries in the following way:

  • Letting $x_1$ be a random variable whose distribution is the same as the first query of $T$, the distribution of the first query of $T^Q$ is the same as the random variable $Qx_1$.

  • Given queries $Qx_1,\dots,Qx_{i-1}$ made by $T^Q$ so far with responses $r_1,\dots,r_{i-1}$, the $i$th query of $T^Q$ has the same distribution as $Qx_i$, where $x_i$ is distributed the same as the $i$th query that $T$ makes, given queries $x_1,\dots,x_{i-1}$ with responses $r_1,\dots,r_{i-1}$.

In the case that $T$ is a nonadaptive estimator, the queries of $T^Q$ are just $Qx_1,\dots,Qx_k$, where $x_1,\dots,x_k$ is a set of queries from the distribution of queries that $T$ makes.
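
The identity driving this construction is $(Qx)^\top A\,(Qx)=x^\top\big(Q^\top A Q\big)x$: running $T^Q$ on $A$ produces exactly the query responses that $T$ would see on $Q^\top A Q$. A quick numerical check of this identity (with arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2    # arbitrary symmetric test matrix
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))      # some orthogonal matrix
x = rng.standard_normal(n)                            # a query that T would make

# The response T^Q receives on A with query Qx equals the response T would
# receive on Q^T A Q with query x:
print(np.isclose((Q @ x) @ A @ (Q @ x), x @ (Q.T @ A @ Q) @ x))   # True
# And the trace is invariant under the similarity transformation:
print(np.isclose(np.trace(Q.T @ A @ Q), np.trace(A)))             # True
```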

Lemma 1

For any estimator $T$ and orthogonal matrix $Q$,

  • $\operatorname{Var}(T^Q)=\operatorname{Var}(T)$.

  • $T$ is an $(\epsilon,\delta)$-estimator if and only if $T^Q$ is also an $(\epsilon,\delta)$-estimator.

Proof

We know that, given a matrix $A$, the behavior of $T^Q$ on $A$ is the same as the behavior of $T$ on estimating $Q^\top A Q$. On the other hand, we know that $\operatorname{tr}(Q^\top A Q)=\operatorname{tr}(A)$ and $\|Q^\top A Q\|_F=\|A\|_F$. Therefore, the variance of $T^Q$ on $A$ is the same as the variance of $T$ on $Q^\top A Q$. Now suppose $T$ is an $(\epsilon,\delta)$-estimator. We know that the approximation guarantee of $T^Q$ on $A$ is the same as that of $T$ on $Q^\top A Q$. Therefore, we know that with probability at least $1-\delta$, the estimator $T^Q$'s output is within $[(1-\epsilon)\operatorname{tr}(A),(1+\epsilon)\operatorname{tr}(A)]$.

Definition 8 (averaging estimators over a distribution)

Suppose we have a collection of estimators $\mathcal{T}=\{T_\theta\}_{\theta\in\Theta}$. For any probability distribution $\mu$ on $\Theta$, we define $T_\mu$ as the following estimator:

  1. Randomly sample an estimator $T_\theta$ with $\theta\sim\mu$.

  2. Output according to the estimation of $T_\theta$.

Lemma 2

Averaging a collection of estimators cannot increase the variance or weaken an $(\epsilon,\delta)$-guarantee. Specifically:

  • If all the estimators in $\mathcal{T}$ are unbiased and have variance at most $\sigma^2$, then $T_\mu$'s variance is also at most $\sigma^2$.

  • If all the estimators in $\mathcal{T}$ are $(\epsilon,\delta)$-estimators, then $T_\mu$ is also an $(\epsilon,\delta)$-estimator.

Proof

For the first claim, we apply the law of total variance, conditioning on the draw of $T_\theta$:

$$\operatorname{Var}\big(T_\mu(A)\big)=\mathbb{E}_{\theta\sim\mu}\big[\operatorname{Var}\big(T_\theta(A)\big)\big]+\operatorname{Var}_{\theta\sim\mu}\big(\mathbb{E}\big[T_\theta(A)\big]\big).$$

The second term above is $0$, since all estimators in $\mathcal{T}$ are unbiased. Since $\operatorname{Var}\big(T_\theta(A)\big)\le\sigma^2$ for every $\theta$, we have $\operatorname{Var}\big(T_\mu(A)\big)\le\sigma^2$ as well.

For the second claim, assuming that

$$\Pr\big[\,T_\theta(A)\in[(1-\epsilon)\operatorname{tr}(A),(1+\epsilon)\operatorname{tr}(A)]\,\big]\ge 1-\delta$$

for each $\theta$, we have

$$\Pr\big[\,T_\mu(A)\in[(1-\epsilon)\operatorname{tr}(A),(1+\epsilon)\operatorname{tr}(A)]\,\big]=\mathbb{E}_{\theta\sim\mu}\Big[\Pr\big[\,T_\theta(A)\in[(1-\epsilon)\operatorname{tr}(A),(1+\epsilon)\operatorname{tr}(A)]\,\big]\Big]\ge 1-\delta.$$

Definition 9 (Symmetrization of a trace estimator)

Given an estimator $T$, we define the symmetrization of $T$, denoted $T^{\mathrm{sym}}$, to be the estimator where we

  1. sample a random orthogonal matrix $Q$ (see Definition 5), and

  2. use $T^{Q}$ to estimate the trace.

We say an estimator is symmetric if it is equivalent to the symmetrization of some estimator.

By Lemma 1 and Lemma 2, we know that $T^{\mathrm{sym}}$'s variance as well as its $(\epsilon,\delta)$-approximation guarantee is always no worse than that of $T$. Therefore, without loss of generality, we can always assume that the optimal estimator is symmetric.
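
For the nonadaptive case, symmetrization can be written down explicitly. The sketch below is our own illustration (using the Hutchinson estimator as the base estimator): draw the base estimator's queries and weights, apply one fresh Haar-random rotation to all of the queries, and output the usual weighted sum of the quadratic-form responses.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Random orthogonal matrix from the Haar measure (QR with sign fix)."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

def symmetrize_nonadaptive(base):
    """Symmetrization of a nonadaptive linear estimator: draw the base
    estimator's queries and weights, rotate every query by one common
    Haar-random orthogonal matrix, and output the weighted sum."""
    def estimator(A, rng=np.random.default_rng()):
        n = A.shape[0]
        queries, weights = base(n, rng)
        Q = haar_orthogonal(n, rng)
        return sum(w * ((Q @ q) @ A @ (Q @ q)) for q, w in zip(queries, weights))
    return estimator

# Illustrative base estimator: k Rademacher queries with weights 1/k (Hutchinson).
def hutchinson_base(n, rng, k=10):
    return [rng.choice([-1.0, 1.0], size=n) for _ in range(k)], [1.0 / k] * k

sym_hutchinson = symmetrize_nonadaptive(hutchinson_base)
A = np.diag(np.arange(1.0, 6.0))          # trace = 15
print(np.trace(A), sym_hutchinson(A))
```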

One nice property of the symmetric estimator is that it has the same performance on all matrices with the same diagonalization.

Lemma 3

Given a symmetrized estimator $T^{\mathrm{sym}}$, its variance and approximation guarantee are the same for any matrix $A$ and $Q^\top A Q$, for any orthogonal matrix $Q$.

Proof

Given any matrix $A$, we know that the output of $T^{\mathrm{sym}}$ on $A$ is distributed as the output of $T$ on $Q'^\top A Q'$ for a random orthogonal matrix $Q'$, and the output of $T^{\mathrm{sym}}$ on the matrix $Q^\top A Q$ is distributed as the output of $T$ on $(QQ')^\top A\,(QQ')$. We know that $Q'$ and $QQ'$ are identically distributed, by the invariance of the Haar measure; therefore, $T^{\mathrm{sym}}$ has the same estimation variance on $A$ and $Q^\top A Q$.

Similarly, for the approximation guarantee, suppose $T^{\mathrm{sym}}$ is an $(\epsilon,\delta)$-estimator, which means that

$$\Pr\big[\,T^{\mathrm{sym}}(A)\in[(1-\epsilon)\operatorname{tr}(A),(1+\epsilon)\operatorname{tr}(A)]\,\big]\ge 1-\delta.$$

We know that for the matrix $Q^\top A Q$, $T^{\mathrm{sym}}(Q^\top A Q)$ has the same distribution as $T^{\mathrm{sym}}(A)$. Therefore,

$$\Pr\big[\,T^{\mathrm{sym}}(Q^\top A Q)\in[(1-\epsilon)\operatorname{tr}(Q^\top A Q),(1+\epsilon)\operatorname{tr}(Q^\top A Q)]\,\big]\ge 1-\delta.$$

4 Optimal Linear Nonadaptive Estimator

Without loss of generality, we can assume that the optimal estimator is symmetric. For a symmetric nonadaptive estimator, we can think of the queries $q_1,\dots,q_k$ as generated by the following process.

  • Sample a configuration $\theta$ (a list of pairwise angles $\theta_{ij}$, $1\le i<j\le k$) from some distribution $\mathcal{D}$. For each configuration $\theta$, there is a corresponding weight vector $w(\theta)=(w_1,\dots,w_k)$.

  • Generate $q_1,\dots,q_k$ by drawing random unit vectors conditioned on the angle between $q_i$ and $q_j$ being $\theta_{ij}$ for all $i<j$. (This can be done efficiently.)

  • Output $\sum_{i=1}^{k} w_i\,q_i^\top A q_i$.

The proof of Theorem 1.1 consists of two steps. First we will show that we can set all of the pairwise angles (deterministically) to be $\pi/2$ without increasing the variance, so we can assume that the queries are orthogonal. In the second step, we will then show that the optimal way of assigning weights is to (deterministically) set each weight to be $\frac{n}{k}$.

We first prove that we can replace the queries by random orthogonal unit vectors without increasing the variance.

Lemma 4

Let $u_1,\dots,u_k$ be random orthogonal unit vectors, and let $q_1,\dots,q_k$ with weights $w_1,\dots,w_k$ be generated from a configuration as above. We have that

$$\operatorname{Var}\Big(\sum_{i=1}^{k} w_i\,u_i^\top A u_i\Big)\;\le\;\operatorname{Var}\Big(\sum_{i=1}^{k} w_i\,q_i^\top A q_i\Big). \qquad (1)$$
Proof

It is easy to see that the marginal distribution on each $u_i$ is the same as the marginal distribution on $q_i$: both are uniformly random unit vectors. Therefore, we have that

$$\mathbb{E}\Big[\sum_{i=1}^{k} w_i\,u_i^\top A u_i\Big]=\mathbb{E}\Big[\sum_{i=1}^{k} w_i\,q_i^\top A q_i\Big].$$

This implies that $\sum_{i=1}^{k} w_i\,u_i^\top A u_i$ is also an unbiased estimator.

Since both estimators have the same expectation, in order to show (1), it suffices to prove that

$$\mathbb{E}\Big[\Big(\sum_{i=1}^{k} w_i\,u_i^\top A u_i\Big)^2\Big]\;\le\;\mathbb{E}\Big[\Big(\sum_{i=1}^{k} w_i\,q_i^\top A q_i\Big)^2\Big]. \qquad (2)$$

By the process of generating $q_1,\dots,q_k$, we know that the marginal distribution of each $q_i$ is independent of the configuration and equal to the marginal distribution of $u_i$. If we expand the left hand side of (2), we have that

$$\mathbb{E}\Big[\Big(\sum_i w_i\,u_i^\top A u_i\Big)^2\Big]=\sum_{i}\mathbb{E}\big[w_i^2\,(u_i^\top A u_i)^2\big]+\sum_{i\neq j}\mathbb{E}\big[w_i w_j\,(u_i^\top A u_i)(u_j^\top A u_j)\big].$$

If we expand the right hand side of (2) we have that

$$\mathbb{E}\Big[\Big(\sum_i w_i\,q_i^\top A q_i\Big)^2\Big]=\sum_{i}\mathbb{E}\big[w_i^2\,(q_i^\top A q_i)^2\big]+\sum_{i\neq j}\mathbb{E}\big[w_i w_j\,(q_i^\top A q_i)(q_j^\top A q_j)\big].$$

The diagonal terms are equal because the marginal distributions agree. Therefore, in order to prove (2), it suffices to prove that for any $i$ and $j$ with $i\neq j$, we have

$$\mathbb{E}\big[(u_i^\top A u_i)(u_j^\top A u_j)\big]\;\le\;\mathbb{E}\big[(q_i^\top A q_i)(q_j^\top A q_j)\big]. \qquad (3)$$

To compare $\mathbb{E}\big[(u_i^\top A u_i)(u_j^\top A u_j)\big]$ and $\mathbb{E}\big[(q_i^\top A q_i)(q_j^\top A q_j)\big]$, we note that the marginal distribution on the pair $(q_i,q_j)$ is equivalent to drawing from the following process:

  1. Draw two random orthogonal unit vectors $u$ and $v$.

  2. Set $q_i=u$ and $q_j=\cos(\theta_{ij})\,u+\sin(\theta_{ij})\,v$.

It is easy to check that the joint distribution on $q_i$ and $q_j$ has the same distribution as two random unit vectors with angle $\theta_{ij}$.

Therefore,

$$\mathbb{E}\big[(q_i^\top A q_i)(q_j^\top A q_j)\big]=\cos^2(\theta_{ij})\,\mathbb{E}\big[(u^\top A u)^2\big]+\sin^2(\theta_{ij})\,\mathbb{E}\big[(u^\top A u)(v^\top A v)\big]+\cos(\theta_{ij})\sin(\theta_{ij})\,\mathbb{E}\big[(u^\top A u)(u^\top A v+v^\top A u)\big]. \qquad (4)$$

In order to simplify the above expression, we first claim that

$$\mathbb{E}\big[(u^\top A u)(u^\top A v+v^\top A u)\big]=0.$$

To see this, note that $v$ is a random unit vector orthogonal to $u$. Conditioned on any fixed realization of $u$, the distribution on $v$ is symmetric about the origin; $v$ has the same distribution as $-v$, which leaves $u^\top A u$ unchanged and flips the sign of $u^\top A v+v^\top A u$.

In addition, using Cauchy-Schwarz and the fact that $u^\top A u$ and $v^\top A v$ have the same distribution, we have that

$$\mathbb{E}\big[(u^\top A u)(v^\top A v)\big]\;\le\;\sqrt{\mathbb{E}\big[(u^\top A u)^2\big]\,\mathbb{E}\big[(v^\top A v)^2\big]}\;=\;\mathbb{E}\big[(u^\top A u)^2\big].$$

Therefore, we have that

$$\mathbb{E}\big[(q_i^\top A q_i)(q_j^\top A q_j)\big]\;\ge\;\cos^2(\theta_{ij})\,\mathbb{E}\big[(u^\top A u)(v^\top A v)\big]+\sin^2(\theta_{ij})\,\mathbb{E}\big[(u^\top A u)(v^\top A v)\big]\;=\;\mathbb{E}\big[(u_i^\top A u_i)(u_j^\top A u_j)\big],$$

which proves (3), completing the proof of Lemma 4.

Now that we can assume that the queries are mutually orthogonal, we can view this as an estimator with randomized weights $w_1,\dots,w_k$ applied to the orthogonal queries $u_1,\dots,u_k$. Below we treat the weight vector $(w_1,\dots,w_k)$ as a random variable, as it is independent from $u_1,\dots,u_k$.

Lemma 5

Let $u_1,\dots,u_k$ be random orthogonal unit vectors. Then the estimator $\sum_{i=1}^{k} w_i\,u_i^\top A u_i$ has minimum variance when $w_1=\dots=w_k=\frac{n}{k}$.

Proof

First, we must have $\mathbb{E}\big[\sum_{i=1}^{k} w_i\big]=n$ to make the estimator unbiased, since $\mathbb{E}\big[u_i^\top A u_i\big]=\frac{\operatorname{tr}(A)}{n}$ for every $i$. Also, writing $a=\mathbb{E}\big[(u_i^\top A u_i)^2\big]$ and $b=\mathbb{E}\big[(u_i^\top A u_i)(u_j^\top A u_j)\big]$ for $i\neq j$ (these quantities do not depend on $i$ and $j$, and $a\ge b$ by Cauchy-Schwarz),

$$\mathbb{E}\Big[\Big(\sum_i w_i\,u_i^\top A u_i\Big)^2\Big]=(a-b)\,\mathbb{E}\Big[\sum_i w_i^2\Big]+b\,\mathbb{E}\Big[\Big(\sum_i w_i\Big)^2\Big].$$

Minimizing the variance is equivalent to minimizing this expression. Since $\sum_i w_i^2\ge\frac{1}{k}\big(\sum_i w_i\big)^2$ by Cauchy-Schwarz and $\mathbb{E}\big[\big(\sum_i w_i\big)^2\big]\ge\big(\mathbb{E}\big[\sum_i w_i\big]\big)^2=n^2$ by Jensen's inequality, both terms are minimized when $\sum_i w_i=n$ holds with certainty and all the weights are equal.

Equality holds for $w_1=\dots=w_k=\frac{n}{k}$, completing the proof.

Combining Lemma 4 and Lemma 5, the minimum variance linear nonadaptive unbiased estimator making $k$ queries is $\frac{n}{k}\sum_{i=1}^{k} u_i^\top A u_i$, where $u_1,\dots,u_k$ is a collection of random orthogonal unit vectors. This completes the proof of Theorem 1.1.
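
Putting Lemma 4 and Lemma 5 together, here is a minimal sketch of the resulting estimator (assuming the output form $\frac{n}{k}\sum_i q_i^\top A q_i$ as reconstructed above), with an empirical check of its unbiasedness:

```python
import numpy as np

def orthogonal_vector_estimator(A, k, rng=np.random.default_rng()):
    """Query k random orthogonal unit vectors (rows of a Haar-distributed
    orthogonal matrix) and output (n/k) * sum_i q_i^T A q_i."""
    n = A.shape[0]
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q = Q * np.sign(np.diag(R))            # sign fix -> exactly Haar-distributed
    queries = Q[:k]                        # k random orthogonal unit vectors
    return (n / k) * sum(q @ A @ q for q in queries)

# Unbiasedness check: E[q^T A q] = tr(A) / n for a random unit vector q,
# so the average of many independent runs should approach the true trace.
rng = np.random.default_rng(3)
A = rng.standard_normal((30, 30)); A = (A + A.T) / 2
estimates = [orthogonal_vector_estimator(A, 5, rng) for _ in range(2000)]
print(np.trace(A), np.mean(estimates))
```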

5 Every $\epsilon$-variance estimator requires $\Omega(1/\epsilon)$ queries

Theorem 5.1

Any estimator with variance at most $\epsilon$ requires $\Omega(1/\epsilon)$ queries.

The main idea for showing such a lower bound is to reduce the problem of obtaining an $\epsilon$-variance estimator to the easier problem of distinguishing two distributions. Here, the unknown matrix is drawn from either distribution $\mathcal{D}_1$ or distribution $\mathcal{D}_2$, and our goal is to determine, with high probability, which distribution the sample came from. Importantly, the trace of every matrix in the support of $\mathcal{D}_1$ is far from the trace of every matrix in the support of $\mathcal{D}_2$.

For the problem here, we define $\mathcal{D}_1$ and $\mathcal{D}_2$ as follows, with a parameter to be precisely specified later:

  1. A draw from distribution is the matrix , where are two random orthogonal unit vectors.

  2. A draw from distribution is the matrix , where are two random orthogonal unit vectors, for

It is easy to check that we have . It will be more convenient to write as that

where we have set and . We will also write . For the rest of the proof, we do not explicitly write as an expression of . Instead, our proof only requires that and . It is straightforward to verify that .

The proof of Theorem 5.1 consists of the following two lemmas.

Lemma 6

Suppose there is an $\epsilon$-variance estimator that makes $k$ queries. Then we can distinguish $\mathcal{D}_1$ from $\mathcal{D}_2$ with probability at least $3/4$ using $k$ queries.

Lemma 7

Any algorithm that can distinguish $\mathcal{D}_1$ from $\mathcal{D}_2$ with probability at least $3/4$ requires $\Omega(1/\epsilon)$ queries.

Combining Lemma 7 and Lemma 6, we have that for any estimator with variance at most $\epsilon$, we need at least $\Omega(1/\epsilon)$ queries.

Proof

(Proof of Lemma 6). Without loss of generality, let us assume that our estimator $T$ is symmetric. By assumption, we have $\operatorname{Var}_A(T)\le\epsilon$ for every matrix $A$ with $\|A\|_F=1$. The randomness comes from the distribution of the matrix as well as the internal randomness of the estimator $T$. We use the fact that a symmetric estimator has the same variance on any matrix with the same diagonalization.

Consider the following algorithm for distinguishing $\mathcal{D}_1$ and $\mathcal{D}_2$: it uses $T$ to first get an estimate of the trace of the unknown matrix $A$. Since every matrix in the support of $\mathcal{D}_1$ has one fixed value of the trace while every matrix in the support of $\mathcal{D}_2$ has a different fixed value, the algorithm outputs that $A$ comes from $\mathcal{D}_1$ if the estimate is closer to the trace of the matrices in $\mathcal{D}_1$, and $\mathcal{D}_2$ otherwise. Below we will show that the accuracy of the above algorithm is at least $3/4$.

We know that $T(A)$ is a random variable satisfying $\mathbb{E}[T(A)]=\operatorname{tr}(A)$ and $\operatorname{Var}(T(A))\le\epsilon$, where the randomness is only over the internal randomness of $T$. By Chebyshev's inequality, the probability that $T(A)$ deviates from $\operatorname{tr}(A)$ by more than half of the gap between the two possible traces is at most $1/4$. Thus, with probability at least $3/4$, if $A$ is drawn from $\mathcal{D}_1$, then the estimate is closer to the trace of the matrices in $\mathcal{D}_1$, so the algorithm succeeds. Similarly, with probability at least $3/4$, if $A$ is drawn from $\mathcal{D}_2$, then the estimate is closer to the trace of the matrices in $\mathcal{D}_2$, so the algorithm succeeds in this case as well.
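
The following sketch captures the reduction used in this proof in abstract form (the nearest-trace rule and the base estimator below are our illustrations, not the paper's construction): given an unbiased estimator whose standard deviation is at most a quarter of the gap between the two candidate traces, a single run followed by a nearest-trace decision succeeds with probability at least $3/4$ by Chebyshev's inequality.

```python
import numpy as np

def distinguish(estimator, A, trace1, trace2):
    """Given a matrix A whose trace is either trace1 or trace2, run the
    estimator once and report whichever candidate trace is closer.  If the
    estimator is unbiased with standard deviation at most |trace2 - trace1|/4,
    Chebyshev's inequality gives success probability at least 3/4."""
    est = estimator(A)
    return 1 if abs(est - trace1) <= abs(est - trace2) else 2

# Illustrative usage with a simple unbiased estimator (Gaussian, k queries):
def gaussian_estimator(A, k=200, rng=np.random.default_rng(4)):
    X = rng.standard_normal((k, A.shape[0]))
    return np.mean(np.sum((X @ A) * X, axis=1))

A1, A2 = np.diag([1.0, 0.0, 0.0]), np.diag([2.0, 0.0, 0.0])
print(distinguish(gaussian_estimator, A1, 1.0, 2.0),   # typically 1
      distinguish(gaussian_estimator, A2, 1.0, 2.0))   # typically 2
```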

Proof

(Proof of Lemma 7) First let us define a general problem.

Definition 10 (distinguishing low-rank matrices)

We are given a matrix that is from one of the following two classes of distributions.

  1. Distribution : for and for to be two uniformly random orthogonal unit vectors.

  2. Distribution : for and for to be two uniformly random orthogonal unit vectors.

It is easy to see that our goal is to understand the sample complexity of distinguishing from .

Recall that the traces of the matrices differ between the two distributions. We will prove that the testing problem of distinguishing $\mathcal{D}_1$ from $\mathcal{D}_2$ requires $\Omega(1/\epsilon)$ queries, even with stronger queries. We define a strong query to be a query $q$ that, instead of returning $q^\top A q$, returns $u^\top q$ and $v^\top q$. The vectors $u$ and $v$ are the vectors used to construct $A$ from a draw of $\mathcal{D}_1$ or $\mathcal{D}_2$ in Definition 10. It suffices to show that we need at least $\Omega(1/\epsilon)$ strong queries to distinguish $\mathcal{D}_1$ from $\mathcal{D}_2$.

One interpretation of the information given by a strong query is that it gives us the projections of $u$ and $v$ onto the query point. After making queries $q_1,\dots,q_k$, we have the projections of both $u$ and $v$ on