Gaussian approximation for the sup-norm of high-dimensional matrix-variate U-statistics and its applications

Abstract

This paper studies the Gaussian approximation of high-dimensional and non-degenerate U-statistics of order two under the supremum norm. We propose a two-step Gaussian approximation procedure that does not impose structural assumptions on the data distribution. Specifically, subject to mild moment conditions on the kernel, we establish an explicit rate of convergence that decays polynomially in sample size for a high-dimensional scaling limit, where the dimension can be much larger than the sample size. We also provide a practical Gaussian wild bootstrap method to approximate the quantiles of the maxima of centered U-statistics and prove its asymptotic validity. The wild bootstrap is demonstrated on statistical applications for high-dimensional non-Gaussian data including: (i) principled and data-dependent tuning parameter selection for regularized estimation of the covariance matrix and its related functionals; (ii) simultaneous inference for the covariance and rank correlation matrices. In particular, for the thresholded covariance matrix estimator with the bootstrap-selected tuning parameter, we show that Gaussian-like convergence rates can be achieved for heavy-tailed data, which are less conservative than those obtained by the Bonferroni technique that ignores the dependence in the underlying data distribution. In addition, we show that even for subgaussian distributions, error bounds of the bootstrapped thresholded covariance matrix estimator can be much tighter than those of the minimax estimator with a universal threshold.

1 Introduction

Let be a sample of independent and identically distributed (iid) random vectors in with the distribution function . Let be a separable Banach space equipped with the norm and be a -valued measurable and symmetric kernel function such that for all and . Consider the U-statistics of order two

The main focus of this paper is to study the asymptotic behavior of the random variable in the high-dimensional setup when . Since the introduction of U-statistics by Hoeffding [?], their limit theorems have been extensively studied in the classical asymptotic setup where diverges and is fixed [?]. Recently, driven by the explosion of available data, regularized estimation and dimension reduction of high-dimensional data have attracted a great deal of research attention, with examples including covariance matrix estimation [?], graphical models [?], discriminant analysis [?], factor models [?], among many others. These problems all involve the consistent estimation of an expectation of U-statistics of order two , where and are two independent random vectors in with the distribution . Below are two examples.
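To fix ideas, a minimal numerical sketch of an order-two matrix-valued U-statistic is given below (separate from the paper's two examples, which are not reproduced here). It uses the covariance kernel h(x, y) = (x - y)(x - y)^T / 2, whose expectation is the covariance matrix; the variable names and dimensions are illustrative assumptions.

```python
import numpy as np

def u_statistic(X, kernel):
    """Order-two U-statistic: (n choose 2)^{-1} * sum_{i<j} h(X_i, X_j)."""
    n = X.shape[0]
    total = None
    for i in range(n):
        for j in range(i + 1, n):
            hij = kernel(X[i], X[j])
            total = hij if total is None else total + hij
    return total / (n * (n - 1) / 2)

def cov_kernel(x, y):
    """Symmetric covariance kernel h(x, y) = (x - y)(x - y)^T / 2,
    which satisfies E[h(X_1, X_2)] = Cov(X_1) for iid X_1, X_2."""
    d = x - y
    return 0.5 * np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))           # n = 50 observations in R^10
U = u_statistic(X, cov_kernel)              # unbiased estimate of the covariance matrix
sup_norm = np.max(np.abs(U - np.eye(10)))   # sup-norm deviation from the true covariance
```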

In this paper, we are interested in the following central questions: how does the dimension impact the asymptotic behavior of U-statistics and how can we make statistical inference when ? The motivation of this paper comes from the estimation and inference problems for the large covariance matrix and its related functionals [?]. To establish rates of convergence for the regularized estimators or to study the -norm Gaussian approximations in high dimensions, a key issue is to characterize the supremum norm of . Therefore, as the primary concern of the current paper, we shall consider and .

Our first main contribution is to provide a Gaussian approximation scheme for high-dimensional non-degenerate U-statistics under the sup-norm. In contrast to the central limit theorem (CLT) type results for the maxima of sums of iid random vectors [?], which are directly approximated by the Gaussian counterparts with matching first and second moments, approximating the sup-norm of U-statistics is more subtle because of its dependence and nonlinearity. Here, we propose a two-step Gaussian approximation method in Section 2. In the first step, we approximate the U-statistics by the leading component of a linear form in the Hoeffding decomposition (a.k.a. the Hájek projection); in the second, the linear term is further approximated by Gaussian random vectors. To approximate the distribution of the sup-norm of U-statistics by a linear form, a maximal moment inequality is developed to control the nonlinear and canonical, i.e. completely degenerate, form of the remainder term. Then the linear projection is handled by recent developments in high-dimensional Gaussian approximation [?]. An explicit rate of convergence of the Gaussian approximation for high-dimensional U-statistics is established for unbounded kernels subject to sub-exponential and uniform polynomial moment conditions. Specifically, under either moment condition, we show that the same convergence rate that decays polynomially in sample size as in the Gaussian approximation for the maxima of sums of iid random vectors is attained and the validity of the Gaussian approximation is proved for a high-dimensional scaling limit, where can be much larger than .

The second contribution of this paper is to propose a Gaussian wild bootstrap procedure for approximating the quantiles of . Since the (unobserved) linear projection terms of the centered U-statistics depend on the unknown underlying data distribution and there is a nonlinear remainder term, we use an additional estimation step beyond the Gaussian approximation. Here, we employ the idea of decoupling and estimate the linear projection on an independent dataset. Validity of the Gaussian wild bootstrap is established under the same set of assumptions as in the Gaussian approximation results. One important feature of the Gaussian approximation and the bootstrap procedure is that no structural assumptions on the distribution are required and the strong dependence in is allowed, which in fact helps the Gaussian and bootstrap approximation. In Section 4, we demonstrate the proposed bootstrap method on a number of important high-dimensional problems, including the data-dependent tuning parameter selection in the thresholded covariance matrix estimator and the simultaneous inference of the covariance and Kendall’s tau rank correlation matrices. Two additional applications, to the estimation of sparse precision matrices and sparse linear functionals, are given in the Supplemental Materials (SM). In those problems, we show that Gaussian-like convergence rates can be achieved for non-Gaussian data with heavy tails. For the sparse covariance matrix estimation problem, we also show that the thresholded estimator with the tuning parameter selected by the bootstrap procedure can attain potentially much tighter performance bounds than the minimax estimator with a universal threshold that ignores the dependence in [?].

To establish the Gaussian approximation result and the validity of the bootstrap method, we have to bound the expected sup-norm of the second-order canonical term in the Hoeffding decomposition of the U-statistics and establish its non-asymptotic maximal moment inequalities. An alternative simple data splitting approach by reducing the U-statistics to sums of iid random matrices can give the exact rate for bounding the moments in the non-degenerate case [?]. Nonetheless, the reduction to iid summands via data splitting does not exploit the complete degeneracy structure of the canonical term and it does not lead to the convergence result in the Gaussian approximation for the non-degenerate U-statistics; see Section 5.1 for details. In addition, unlike the Hoeffding decomposition approach, the data splitting approximation is not asymptotically tight in distribution and therefore it is less useful for making inference on high-dimensional U-statistics.
Notation and definitions. For a vector , we use , , and to denote its entry-wise , , and norms, respectively. For a matrix , we use and to denote its Frobenius and spectral norms, respectively. Denote and . We shall use to denote positive finite absolute constants, and and , to denote positive finite constants whose values do not depend on and and may vary from place to place. We write if for some constant , and if and . For a random variable , we write for . We use and . Throughout the paper, we write and let and be two independent random vectors in with the distribution , which are independent of . We write and , where and . For a matrix-valued kernel , we say that: (i) is non-degenerate w.r.t. if for all ; (ii) is canonical or completely degenerate w.r.t. if for all and for all . Without loss of generality, we shall assume throughout the paper that and the matrix is symmetric, i.e. .

2 Gaussian approximation

In this section, we study the Gaussian approximation for in (Equation 1), or equivalently the approximation for the sup-norm of the centered U-statistics by considering and . If ’s are non-Gaussian, a seemingly intuitive method would be to generate Gaussian random vectors ’s by matching the first and second moments of ; i.e. to approximate by . However, empirical evidence suggests that this may not be a good approximation, and theoretically it seems that the nonlinearity in and accounts for a statistically invalid approximation. To illustrate this point, we simulate iid observations from the -variate elliptic -distribution in (Equation 27) with mean zero and degrees of freedom in the SM. We consider the covariance matrix kernel ( ?) as an example. For , the P-P plot of the empirical cdfs of the sup-norm of the centered covariance matrices made from and is shown in Figure ? (left) over 5000 simulations.

A closer inspection reveals that is an approximately linear statistic and that its linear projection part in the Hoeffding decomposition is the leading term. This observation motivates us to propose a two-step approximation method to correct the bias. Let

Clearly, is a -valued symmetric and canonical U-statistic of order two w.r.t. the distribution . Then the Hoeffding decomposition of the kernel is given by

from which we have

On the right-hand side of the last expression, the second term is expected to be the leading term (a.k.a. the Hájek projection) and the first term to be negligible under the sup-norm. Therefore, we can reasonably expect that

where the latter can be further approximated by for iid Gaussian random vectors and is the positive-definite covariance matrix of ; cf. [?]. Denote . Here, we slightly abuse notation and write for the half-vectorization of the lower triangular part of by columns. Therefore is the covariance matrix indexed by such that and . Similarly, we shall use to denote either the matrix or its half-vectorized version. For the previous elliptic -distribution example, we plot the empirical cdfs of against . Figure ? (right) shows a much better approximation using the leading term in the Hoeffding decomposition.
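The two-step decomposition can be mimicked numerically. The sketch below is a rough illustration under stated assumptions: it takes a generic symmetric kernel and a known expectation theta, and replaces the unknown projection g(x) = E[h(x, X')] - theta by a Monte Carlo average over an independent sample.

```python
import numpy as np

def hoeffding_split(X, X_prime, kernel, theta):
    """Split the centered order-two U-statistic into its Hajek (linear) projection
    part and the completely degenerate remainder.

    The projection g(x) = E[h(x, X')] - theta is replaced by a Monte Carlo average
    over the independent sample X_prime (an illustrative stand-in for the unknown
    population expectation)."""
    n = X.shape[0]
    # full U-statistic: (n choose 2)^{-1} * sum_{i<j} h(X_i, X_j)
    U = sum(kernel(X[i], X[j]) for i in range(n) for j in range(i + 1, n))
    U = U / (n * (n - 1) / 2)
    # approximate Hajek projections g(X_i)
    g = np.stack([np.mean([kernel(x, xp) for xp in X_prime], axis=0) - theta
                  for x in X])
    linear_part = 2.0 * g.mean(axis=0)        # leading term: (2/n) * sum_i g(X_i)
    remainder = (U - theta) - linear_part     # canonical (completely degenerate) part
    return linear_part, remainder
```

With the covariance kernel from the earlier sketch and theta set to the true covariance matrix, the sup-norm of remainder is typically much smaller than that of linear_part, which is the empirical content of the two-step approximation.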

Let , , , and , where are iid . Denote , , and . Let

be the Kolmogorov distance between and . Let be a sequence of real numbers possibly tending to infinity. We consider two types of conditions on the kernel moments. First, we establish the explicit convergence rate for the kernels with sub-exponential moments; e.g. the -contaminated normal distribution (Equation 26) in the SM.

The assumptions in Theorem ? have meaningful interpretations. (GA.1) ensures the non-degeneracy of the Gaussian approximation and that the truncation does not lose too much information due to the sub-exponential tails. (GA.2) describes the high-dimensional scaling regime over which the Gaussian approximation is valid. In the high-dimensional context, the dimension grows with the sample size and the distribution function also depends on . Therefore, is allowed to increase with . Theorem ? shows that the approximation error in the Kolmogorov distance converges to zero even if can be much larger than and no structural assumptions on are required. In particular, Theorem ? applies to kernels with the sub-exponential distribution such that for all , in which case and the dimension is allowed to have a subexponential growth rate in the sample size , i.e. . Condition (GA.1) also covers bounded kernels , where may increase with .

Next, we consider kernels with uniform polynomial moments (up to the fourth order); e.g. the elliptic -distribution (Equation 27) in the SM.

Theorems ? and ? allow us to approximate the quantiles of by those of , with the knowledge of . In practice, the covariance matrix and the Hájek projection terms depend on the data distribution , which is unknown. Thus, quantiles of need to be estimated in real applications. However, we shall see in Section 3 that Theorems ? and ? can still be used to derive a feasible resampling-based method to approximate the quantiles of Gaussian maxima and therefore .

3 Wild bootstrap

The main purpose of this section is to approximate the quantiles of . Let be an observed independent copy of ; we call this the training data. Such data can always be obtained by half-sampling or data splitting of the original data. Therefore, we assume that the total sample size is . Since is unknown, we construct an estimator for it. Let be an estimator of using the original and training data. Recall that for any fixed , which can be viewed as the population version with respect to the second variable . Therefore, we build an empirical version as our estimator of . Specifically, we consider

Conditional on , is an unbiased estimator of . It is interesting to view as a decoupled estimator of . Let

where are iid standard Gaussian random variables that are also independent of . Then and are bootstrapped versions of . Denote the conditional quantiles of and given the data as

where denotes the probability taken w.r.t. . Now, we can compute the conditional quantile by the Gaussian wild bootstrap method. Specifically, can be numerically approximated by resampling the Gaussian multipliers , and we wish to use to approximate the quantiles of .
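A minimal sketch of the Gaussian wild bootstrap is given below. The decoupled estimator of the projection, its centering, and the scaling constant are assumptions chosen to mirror the description above rather than the paper's exact displays.

```python
import numpy as np

def wild_bootstrap_quantile(X, X_train, kernel, alpha=0.05, B=200, rng=None):
    """Gaussian wild bootstrap for the sup-norm of the scaled linear projection.

    ghat(X_i) is estimated by averaging kernel(X_i, x') over the independent
    training sample and centering; each bootstrap draw multiplies the ghat(X_i)
    by iid N(0, 1) weights and records the entrywise maximum."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    ghat = np.stack([np.mean([kernel(x, xp) for xp in X_train], axis=0) for x in X])
    ghat = ghat - ghat.mean(axis=0)               # center across the original sample
    stats = np.empty(B)
    for b in range(B):
        e = rng.standard_normal(n)                # iid Gaussian multipliers
        boot = 2.0 / np.sqrt(n) * np.tensordot(e, ghat, axes=(0, 0))
        stats[b] = np.max(np.abs(boot))
    return np.quantile(stats, 1.0 - alpha)        # conditional (1 - alpha) quantile
```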

To assess the quality of the Gaussian wild bootstrap, we show two examples for the covariance matrix kernel on the -contaminated normal distribution (Equation 26) with sub-exponential moments and on the elliptic -distribution (Equation 27) with uniform polynomial moments. In each simulation, we generate 200 bootstrap samples for . Then, we estimate for the whole range of probabilities . Figure ? shows the empirical approximation result. Here, we choose in (Equation 26) and (Equation 27), where is the vector of all ones. From Figure ?, the bootstrap approximation seems to be better in the sub-exponential moment case than in the polynomial moment case; see (GA.1)+(GA.2) versus (GA.1’)+(GA.2’). More simulation examples can be found in the SM.

4 Statistical applications

In this section, we present two statistical applications of the theoretical results established in Sections 2 and 3. Two additional examples can be found in the SM. Here, for notational convenience, we rescale and let . Recall that , where .

4.1 Tuning parameter selection for the thresholded covariance matrix estimator

Consider the problem of sparse covariance matrix estimation. Let and

be the class of sparse covariance matrices in terms of the strong -ball. Here, is a constant and may grow with . Let and

be the thresholded sample covariance matrix estimator of . The class was introduced in [?] and the high-dimensional properties of were analyzed in [?] for iid sub-Gaussian data and in [?] for heavy-tailed time series with algebraic tails. In both scenarios, the rates of convergence were obtained with the Bonferroni (i.e. union bound) technique and one-dimensional concentration inequalities. These performance bounds of the thresholded estimator critically depend on the tuning parameter . The ideal choice of the threshold for establishing the rate of convergence under the spectral and Frobenius norms is , whose distribution depends on the unknown underlying data distribution . In high-dimensional sparse covariance matrix estimation, data-dependent tuning parameter selection is often done empirically with cross-validation (CV), whose theoretical properties remain largely unknown. High-probability bounds of are given in [?]. Here, we provide a principled and data-dependent way to determine the threshold .
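As an illustration of this data-dependent rule, the sketch below hard-thresholds the sample covariance at a level taken from a bootstrap quantile of the sup-norm statistic. Whether the diagonal is excluded from thresholding and the exact rescaling of the quantile are assumptions for illustration, and the usage lines reference the hypothetical helper from the earlier bootstrap sketch.

```python
import numpy as np

def hard_threshold(S, lam, keep_diag=True):
    """Entrywise hard thresholding of a symmetric matrix at level lam."""
    T = np.where(np.abs(S) >= lam, S, 0.0)
    if keep_diag:
        np.fill_diagonal(T, np.diag(S))   # leave the diagonal untouched
    return T

# Hypothetical usage: take lam from the bootstrap quantile of the sup-norm of
# sqrt(n) * (sample covariance - target), rescaled back to the covariance scale.
# q = wild_bootstrap_quantile(X, X_train, cov_kernel, alpha=0.05)
# S = np.cov(X, rowvar=False, bias=True)
# Sigma_hat = hard_threshold(S, lam=q / np.sqrt(X.shape[0]))
```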

The upper bound in ( ?) is not essential; it is chosen for convenient comparison with , which is the Orlicz norm of for and . In general, the variance factor for a subgaussian random variable is not equivalent to the variance. For a sequence of random variables , if and , then by Markov’s inequality we always have , while may depend on and may diverge at a faster rate than , such as and as . As a simple example, let be a sequence of real numbers such that and consider random variables such that and . Obviously, and . Let for some constant . Then for all large enough ; i.e. . In fact, if , then . Therefore, we are mainly interested in the general case when , as in the statistical applications.

There are a number of interesting features of Theorem ?. Consider ; i.e. is truly sparse such that for . Then we can take and the convergence rates are

Hence, the tuning parameter can be adaptively selected from the bootstrap samples while the rate of convergence is nearly optimal in the following sense. Since the distribution of mimics that of , achieves the same convergence rate as the thresholded estimator with the oracle choice of the threshold with probability at least . On the other hand, the bootstrap method is not fully equivalent to the oracle procedure in terms of the constants in the estimation error bounds. Suppose that we know the support of , i.e. the locations of the nonzero entries in . Then, the oracle estimator is simply and we have

Therefore, the constant in the convergence rate for the bootstrap method does not attain that of the oracle estimator. However, we shall comment that is not a tuning parameter since it does not depend on and the effect of only appears in the constants in front of the convergence rates ( ?) and ( ?).

Assuming that the observations are subgaussian and the variance factor is a fixed constant, it is known that the threshold value achieves the minimax rate for estimating the sparse covariance matrix [?]. Compared with the minimax optimal tuning parameter , our bootstrap threshold exhibits several advantages, which we highlight below (under stronger side conditions). First, is non-adaptive since the constant depends on the underlying distribution through and it is more conservative than the bootstrap threshold in view of ( ?). The reason is that the minimax lower bound is based on a worst-case analysis and the matching upper bound is obtained by the union bound, which ignores the dependence structure in . In contrast, takes into account the dependence information of by conditioning on the observations. Second, the bootstrap threshold does not need the knowledge of and it allows to increase with and . In this case, the universal thresholding rule can be overly conservative even for , in which the variances are uniformly bounded by a constant. In contrast, from ( ?), the bootstrap threshold , where the constant of depends only on . Therefore as and can potentially gain much tighter performance bounds than . One exception that rules out the increase of when is the Gaussian distribution . However, the main focus of this paper is statistical estimation and inference for high-dimensional non-Gaussian data, and therefore the Gaussian example is of limited interest here. Third, as we shall demonstrate in Theorem ?, the Gaussian-type convergence rate of the bootstrap method in Theorem ? remains valid even for heavy-tailed data with polynomial moments. Specifically, we have the following result.

From Theorem ?, the subgaussian assumption on is not essential: for non-Gaussian data with heavier-than-subgaussian tails, the thresholded covariance matrix estimator with the threshold selected by the wild bootstrap approach again attains the Gaussian-type convergence rate at the asymptotic confidence level . In particular, the dimension may still be allowed to increase subexponentially fast in the sample size . The cost of the heavy-tailed distribution is only a sacrifice of the convergence rate from to . However, as commented in Remark ?, this gap becomes smaller and eventually vanishes under stronger moment conditions (here, we need ).

Next, we compare Theorem ? with the threshold obtained by the union bound approach. Assume that for . By the Nagaev inequality [?] applied to the split sample in Remark ?, one can show that

is the right threshold that gives a large probability bound for . For , we see that when . Therefore, in high-dimensional settings, the bootstrap method adapts to the dependence in and gives a better convergence rate under the spectral and Frobenius norms. Moreover, for observations with polynomial moments, the minimax lower bound is currently not available to justify .

4.2 Simultaneous inference for covariance and rank correlation matrices

Another important problem related to sparse covariance matrix estimation is the consistent recovery of its support, i.e. the non-zero off-diagonal entries in [?]. Towards this end, a lower bound on the minimum signal strength (the -min condition) is necessary to separate weak signals from true zeros. Yet, the -min condition is never verifiable. To avoid this undesirable condition, we can alternatively formulate the recovery problem as a more general hypothesis testing problem

where is a known matrix. In particular, if , then the support recovery can be restated as the following simultaneous testing problem: for all and ,

The test statistic we construct is , which is an -type statistic obtained by taking the maximum magnitude over the off-diagonal entries. Then is rejected if .
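A sketch of the resulting max-type test is given below. It assumes the statistic is the maximum absolute off-diagonal deviation scaled by sqrt(n) and that the critical value comes from the wild bootstrap; both are assumptions following the description above rather than the paper's displays.

```python
import numpy as np

def max_offdiag_test(Sigma_hat, Sigma_star, n, crit):
    """Reject H0: sigma_jk = sigma*_jk for all j < k when the scaled maximum
    absolute off-diagonal deviation exceeds the bootstrap critical value."""
    dev = np.sqrt(n) * np.abs(Sigma_hat - Sigma_star)
    offdiag = ~np.eye(dev.shape[0], dtype=bool)    # mask selecting off-diagonal entries
    T = dev[offdiag].max()
    return T, T > crit
```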

From Corollary ?, the test based on has asymptotically exact size for subgaussian data. A similar result can be established for observations with polynomial moments; due to the space limit, the details are omitted. [?] proposed a similar test statistic for comparing two large covariance matrices in the two-sample setting. Their results (Theorem 1 in [?]) are analogous to Corollary ? in this paper in that no structural assumptions on are needed in order to obtain the asymptotic validity of either test. However, we note that their assumptions (C.1), (C.2), and (C.3) on the non-degeneracy are stronger than our condition . For subgaussian observations , (C.3) in [?] assumed that for some constant , where . If , then [?] requires that for all obey a uniform lower bound that diverges to infinity. For the covariance matrix kernel, since , we only need that for some fixed lower bound.

Next, we comment that a distinguishing feature of our bootstrap test, compared with the test statistic of [?], is that no structural assumptions are made on and we allow for strong dependence in . For example, consider again the elliptic distributions with the positive-definite such that the covariance matrix is proportional to . Then, we have

For any , as . Therefore, the limiting distribution of the test statistic in [?] is no longer normal and its asymptotic distribution remains unclear.

Finally, the covariance matrix testing problem (Equation 5) can be further generalized to nonparametric forms, which gain robustness to outliers and to nonlinearity in the dependence structure. Let be the expectation of the random matrix associated with and be a known matrix. Consider the testing problem

Then, the test statistic can be constructed as (or ) and is rejected if (or ), where the bootstrap samples are generated w.r.t. the kernel . The above test covers Kendall’s tau rank correlation matrix as a special case where is the bounded kernel defined in ( ?).
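For concreteness, a standard form of the bounded Kendall's tau kernel is sketched below; the exact display in ( ?) is not reproduced here, so this form is an assumption based on the usual definition.

```python
import numpy as np

def kendall_tau_kernel(x, y):
    """Bounded matrix kernel with entries sign(x_j - y_j) * sign(x_k - y_k);
    its U-statistic expectation is the Kendall's tau rank correlation matrix."""
    s = np.sign(x - y)
    return np.outer(s, s)

# Usage sketch: the same bootstrap machinery applies with this kernel, e.g.
# tau_hat = u_statistic(X, kendall_tau_kernel), using the routine from the first sketch.
```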

Therefore, the asymptotic validity of the bootstrap test for large Kendall’s tau rank correlation matrix is obtained when without imposing structural and moment assumptions on .

5 Proofs of the main results

The rest of the paper is organized as follows. In Section 5.1, we first present a useful inequality for bounding the expectation of the sup-norm of canonical U-statistics and then compare it with an alternative simple data splitting bound obtained by reducing to a moment bounding exercise for the sup-norm of sums of iid random matrices. We shall discuss several advantages of the U-statistics approach, which exploits the degeneracy structure. Section 5.2 contains the proofs of the main results on Gaussian approximation and Section 5.3 proves the convergence rate of the Gaussian wild bootstrap. Proofs of the statistical applications are given in Section 5.4. Additional proofs and technical lemmas are given in the SM.

5.1 A maximal inequality for canonical U-statistics

Before proving our main results, we first establish a maximal inequality of the canonical U-statistics of order two. The derived expectation bound is useful in controlling the size of the nonlinear and completely degenerate error term in the Gaussian approximation.

Note that Theorem ? is non-asymptotic. As immediate consequences of Theorem ?, we can derive the rate of convergence of with kernels under the subexponential and uniform polynomial moment conditions.

5.2 Proofs of results in Section 2

Let and for

be the smooth-max function for approximating . Denote as the first-order partial derivative w.r.t. and the space of all bounded functions that are three times continuously differentiable on with for . Let be such that: (i) and for some absolute constant and for ; (ii) if and if . Let and for any , let for . Then for . Here, and are smoothing parameters for approximating the max and indicator functions, respectively. In particular, we have

Define

Clearly for . Let

and be similarly defined with replaced by , where follow iid . Put . To prove Theorems ? and ?, we first need a smoothing lemma.

The proof is based on a delicate combination of Lemma ? and Theorem ?. Since the proof of Theorem ? is quite involved, here we only explain the main idea and give a sketch of the proof. All proof details can be found in the SM.

Main idea. Lemma ? gives a general rate of convergence for the Gaussian approximation with some unspecified smoothing parameters and . The error bound in Lemma ? involves three parts: (i) one from approximating the linear projection , (ii) one from the second-order canonical remainder , and (iii) one from the smoothing errors of the max and indicator functions. Recall that is the degree of smoothing for the max function (Equation 7) and controls the approximation of the indicator function ( ?). For larger (or ), (or ) is closer to the non-smooth function (or ). Specifically, for larger and , (iii) contributes less, while (i) and (ii) contribute more, to the error bound for the Gaussian approximation ( ?). Hence, the key step is to optimize the Gaussian approximation error bound ( ?) over the smoothing parameters and to find the best trade-off among the three parts.
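For reference, the smooth-max function commonly used in this type of argument and its approximation error take the following standard form (the paper's notation for the smoothing parameter may differ):
\[
F_\beta(z) \;=\; \beta^{-1} \log\Big( \sum_{j=1}^{p} \exp(\beta z_j) \Big),
\qquad
0 \;\le\; F_\beta(z) - \max_{1 \le j \le p} z_j \;\le\; \beta^{-1} \log p .
\]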

Step 1. Choose the smoothing parameters. Note that both and depend on the truncation parameter in Lemma ?. Therefore, the optimization problem eventually boils down to choosing a proper threshold . Once is chosen, we first have a natural choice of that makes the constraint of Lemma ? on active in order to apply ( ?); i.e. . So is a strictly decreasing function in and we need to choose a non-decreasing to counterbalance the smoothing errors. Motivated by the proof of [?], in which only the linear part of was dealt with, here we need to choose a larger because of the extra nonlinear term . Since in the linear case of [?] where and , it is intuitive to choose in our case such that takes the form , where for and the corresponding indicator smoothing parameters as

where is defined in Lemma ?. For , we require different , to balance the error bound ( ?). Let be the solution for balancing the two components in . Then, is strictly increasing when and it is truncated to a constant level for .

Step 2. Calculate the error bound for the chosen parameters. Now, we invoke Theorem ? to quantify the contributions of , and to . Combining ( ?) and ( ?), it will be shown (after some algebraic manipulations on the two cases and , where ) that the optimal choice of in order to achieve the overall error bound

is given by , , and . Then, the explicit rate of convergence ( ?) is immediate by substituting the choice of into (Equation 9).

In the following proofs of Theorems ?, ?, and ?, the constants of depend only on and in (GA.1) and (GA.2) in the sub-exponential kernel case and in (GA.1’) and (GA.2’) in the uniform polynomial kernel case.

Choose for some . Let . By Theorem ? and Lemma ?, we have

By ( ?), we have . Since , ( ?) follows.

Choose for some . Let . By Theorem ? and Lemma ? with , we have

Choose . Then, it follows that .

5.3 Proofs of results in Section 3

Let , where