Cluster-Seeking James-Stein Estimators
Abstract
This paper considers the problem of estimating a high-dimensional vector of parameters $\theta$ from a noisy observation. The noise vector is i.i.d. Gaussian with known variance. For a squared-error loss function, the James-Stein (JS) estimator is known to dominate the simple maximum-likelihood (ML) estimator when the dimension exceeds two. The JS estimator shrinks the observed vector towards the origin, and the risk reduction over the ML estimator is greatest for $\theta$ that lie close to the origin. JS estimators can be generalized to shrink the data towards any target subspace. Such estimators also dominate the ML estimator, but the risk reduction is significant only when $\theta$ lies close to the subspace. This leads to the question: in the absence of prior information about $\theta$, how do we design estimators that give significant risk reduction over the ML estimator for a wide range of $\theta$?
In this paper, we propose shrinkage estimators that attempt to infer the structure of $\theta$ from the observed data in order to construct a good attracting subspace. In particular, the components of the observed vector are separated into clusters, and the elements in each cluster are shrunk towards a common attractor. The number of clusters and the attractor for each cluster are determined from the observed vector. We provide concentration results for the squared-error loss and convergence results for the risk of the proposed estimators. The results show that the estimators give significant risk reduction over the ML estimator for a wide range of $\theta$, particularly in large dimensions. Simulation results are provided to support the theoretical claims.
High-dimensional estimation, Large deviations bounds, Loss function estimates, Risk estimates, Shrinkage estimators
1 Introduction
\IEEEPARstart{C}{onsider} the problem of estimating a vector of parameters $\theta \in \mathbb{R}^p$ from a noisy observation $x \in \mathbb{R}^p$ of the form
$$x = \theta + w.$$
The noise vector $w$ is distributed as $\mathcal{N}(0, \sigma^2 I_p)$, i.e., its components are i.i.d. Gaussian random variables with mean zero and variance $\sigma^2$. We emphasize that $\theta$ is deterministic, so the joint probability density function of $x$ for a given $\theta$ is
$$f(x \mid \theta) = \frac{1}{(2\pi\sigma^2)^{p/2}} \exp\!\left( -\frac{\|x - \theta\|^2}{2\sigma^2} \right). \quad (1)$$
The performance of an estimator $\hat{\theta}$ is measured using the squared-error loss function given by
$$L(\theta, \hat{\theta}) = \|\hat{\theta} - \theta\|^2,$$
where $\|\cdot\|$ denotes the Euclidean norm. The risk of the estimator for a given $\theta$ is the expected value of the loss function:
$$R(\theta, \hat{\theta}) = \mathbb{E}\big[\|\hat{\theta} - \theta\|^2\big],$$
where the expectation is computed using the density in (1). The normalized risk is $R(\theta, \hat{\theta})/p$.
Applying the maximum-likelihood (ML) criterion to (1) yields the ML estimator $\hat{\theta}_{ML} = x$. The ML estimator is unbiased, and its risk is $R(\theta, \hat{\theta}_{ML}) = p\sigma^2$. The goal of this paper is to design estimators that give significant risk reduction over $\hat{\theta}_{ML}$ for a wide range of $\theta$, without any prior assumptions about its structure.
In 1961, James and Stein published a surprising result [1], proposing an estimator that uniformly achieves lower risk than $\hat{\theta}_{ML}$ for any $\theta$, provided $p \geq 3$. Their estimator is given by
$$\hat{\theta}_{JS} = \left(1 - \frac{(p-2)\sigma^2}{\|x\|^2}\right) x, \quad (2)$$
and its risk is [2, Thm. 5.1]
$$R(\theta, \hat{\theta}_{JS}) = p\sigma^2 - (p-2)^2 \sigma^4\, \mathbb{E}\!\left[\frac{1}{\|x\|^2}\right]. \quad (3)$$
Hence for $p \geq 3$,
$$R(\theta, \hat{\theta}_{JS}) < p\sigma^2 = R(\theta, \hat{\theta}_{ML}) \quad \text{for all } \theta. \quad (4)$$
An estimator $\hat{\theta}_1$ is said to dominate another estimator $\hat{\theta}_2$ if
$$R(\theta, \hat{\theta}_1) \leq R(\theta, \hat{\theta}_2) \quad \text{for all } \theta,$$
with the inequality being strict for at least one $\theta$. Thus (4) implies that the James-Stein estimator (JS estimator) dominates the ML estimator. Unlike the ML estimator, the JS estimator is nonlinear and biased. However, the risk reduction over the ML estimator can be significant, making it an attractive option in many situations; see, for example, [3].
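As a quick numerical illustration (not part of the paper; the setup below is our own), the following Python sketch draws one observation with $\theta$ at the origin, where the risk reduction of the JS estimator over the ML estimator is largest, and compares the two losses:

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma2 = 100, 1.0
theta = np.zeros(p)                       # theta at the origin: best case for JS
x = theta + rng.normal(scale=np.sqrt(sigma2), size=p)

theta_ml = x                              # ML estimate is the observation itself

# James-Stein estimate (2): shrink x towards the origin
shrink = 1.0 - (p - 2) * sigma2 / np.sum(x ** 2)
theta_js = shrink * x

loss_ml = np.sum((theta_ml - theta) ** 2)   # ~ p * sigma^2 on average
loss_js = np.sum((theta_js - theta) ** 2)   # far smaller with high probability
print(loss_js < loss_ml)
```

For this $\theta$, the JS loss concentrates near $2\sigma^2$ while the ML loss concentrates near $p\sigma^2$, consistent with the risk expression (3).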
By evaluating the expression in (3), it can be shown that the risk of the JS estimator depends on $\theta$ only via $\|\theta\|$ [1]. Further, the risk decreases as $\|\theta\|$ decreases. (For intuition about this, note in (3) that for large $p$, $\mathbb{E}[1/\|x\|^2] \approx 1/(\|\theta\|^2 + p\sigma^2)$.) The dependence of the risk on $\|\theta\|$ is illustrated in Fig. 1, where the average loss of the JS estimator is plotted versus $\|\theta\|$, for two different choices of $\theta$.
The JS estimator in (2) shrinks each element of $x$ towards the origin. Extending this idea, JS-like estimators can be defined by shrinking towards any vector, or more generally, towards a target subspace $V$. Let $x_V$ denote the projection of $x$ onto $V$. Then the JS estimator that shrinks $x$ towards the subspace $V$ is
$$\hat{\theta}_V = x_V + \left(1 - \frac{(p-d-2)\sigma^2}{\|x - x_V\|^2}\right)(x - x_V), \quad (5)$$
where $d$ is the dimension of $V$. An important special case is Lindley's estimator, which shrinks $x$ towards the span of the all-one vector $\mathbf{1}$:
$$\hat{\theta}_L = \bar{x}\mathbf{1} + \left(1 - \frac{(p-3)\sigma^2}{\|x - \bar{x}\mathbf{1}\|^2}\right)(x - \bar{x}\mathbf{1}), \quad (6)$$
where $\bar{x} = \frac{1}{p}\sum_{i=1}^{p} x_i$ is the empirical mean of $x$.
It can be shown that the different variants of the JS estimator, such as (2), (5), and (6), all dominate the ML estimator.
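To make the subspace form concrete, here is a small Python sketch of an estimator in the spirit of (5); the function name and the assumption of an orthonormal basis matrix are ours:

```python
import numpy as np

def subspace_js(x, V, sigma2):
    """JS-type shrinkage towards a subspace, in the spirit of (5).
    V is a p x d matrix with orthonormal columns spanning the target subspace."""
    p, d = V.shape
    x_V = V @ (V.T @ x)                      # projection of x onto the subspace
    resid = x - x_V
    factor = 1.0 - (p - d - 2) * sigma2 / np.sum(resid ** 2)
    return x_V + factor * resid

# With V spanning the all-one vector, this reduces to Lindley's estimator (6).
p = 8
V = np.ones((p, 1)) / np.sqrt(p)
x = np.arange(p, dtype=float)
est = subspace_js(x, V, sigma2=1.0)
```

With this $V$ we have $d = 1$, so the shrinkage constant becomes $p - 3$, matching (6).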
Throughout this paper, the term “attracting vector” refers to the vector that $x$ is shrunk towards. For $\hat{\theta}_{JS}$ in (2), the attracting vector is $\mathbf{0}$, and the risk reduction over $\hat{\theta}_{ML}$ is larger when $\theta$ is close to zero. Similarly, if the components of $\theta$ are clustered around some value $c$, a JS estimator with attracting vector $c\mathbf{1}$ would give significant risk reduction over $\hat{\theta}_{ML}$. One motivation for Lindley's estimator in (6) comes from a guess that the components of $\theta$ are close to its empirical mean $\bar{\theta}$; since we do not know $\bar{\theta}$, we approximate it by $\bar{x}$ and use the attracting vector $\bar{x}\mathbf{1}$.
Fig. 1 shows how the performance of $\hat{\theta}_{JS}$ and $\hat{\theta}_L$ depends on the structure of $\theta$. In the left panel of the figure, the empirical mean $\bar{\theta}$ is always zero, so the risks of both estimators increase monotonically with $\|\theta\|$. In the right panel, the components of $\theta$ are all equal to a common value $c$. In this case, the distance of $\theta$ from the attracting vector of $\hat{\theta}_L$ does not grow with $c$, so its risk does not vary with $c$; in contrast, the risk of $\hat{\theta}_{JS}$ increases with $|c|$ as its attracting vector is $\mathbf{0}$.
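The behavior in the right panel can be reproduced numerically; the following is our own illustrative sketch, not the paper's simulation setup:

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma2, c = 100, 1.0, 5.0
theta = np.full(p, c)                     # all components equal: ideal for Lindley
x = theta + rng.normal(scale=np.sqrt(sigma2), size=p)

# Lindley's estimator (6): shrink towards the empirical-mean vector
resid = x - x.mean()
theta_l = x.mean() + (1.0 - (p - 3) * sigma2 / np.sum(resid ** 2)) * resid

# Plain JS (2): shrinks towards the origin, a poor attractor here
theta_js = (1.0 - (p - 2) * sigma2 / np.sum(x ** 2)) * x

loss_l = np.sum((theta_l - theta) ** 2)
loss_js = np.sum((theta_js - theta) ** 2)
print(loss_l < loss_js)                   # Lindley wins when theta = c * 1
```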
The risk reduction obtained by using a JS-like shrinkage estimator over $\hat{\theta}_{ML}$ crucially depends on the choice of attracting vector. To achieve significant risk reduction for a wide range of $\theta$, in this paper we infer the structure of $\theta$ from the data and choose attracting vectors tailored to this structure. The idea is to partition the components of $x$ into clusters, and shrink the components in each cluster towards a common element (attractor). Both the number of clusters and the attractor for each cluster are to be determined based on the data $x$.
As a motivating example, consider a $\theta$ in which half the components are equal to $c$ and the other half are equal to $-c$. Fig. 1(a) shows that the risk reductions of both $\hat{\theta}_{JS}$ and $\hat{\theta}_L$ diminish as $c$ gets larger. This is because the empirical mean $\bar{\theta}$ is close to zero, hence $\hat{\theta}_{JS}$ and $\hat{\theta}_L$ both shrink towards $\mathbf{0}$. An ideal JS estimator would shrink the $x_i$'s corresponding to $\theta_i = c$ towards the attractor $c$, and the remaining observations towards $-c$. Such an estimator would give handsome gains over $\hat{\theta}_{ML}$ for all $\theta$ with the above structure. On the other hand, if $\theta$ is such that all its components are equal (to $c$, say), Lindley's estimator is an excellent choice, with significantly smaller risk than $\hat{\theta}_{ML}$ for all values of $c$ (Fig. 1(b)).
We would like an intelligent estimator that can correctly distinguish between different structures (such as the two above) and choose an appropriate attracting vector, based only on $x$. We propose such estimators in Sections 3 and 4. For reasonably large $p$, these estimators choose a good attracting subspace tailored to the structure of $\theta$, and use an approximation of the best attracting vector within the subspace.
The main contributions of our paper are as follows.

We construct a two-cluster JS estimator, and provide concentration results for its squared-error loss and asymptotic convergence results for its risk. Though this estimator does not dominate the ML estimator, it is shown to provide significant risk reduction over Lindley's estimator and the regular JS estimator when the components of $\theta$ can be approximately separated into two clusters.

We present a hybrid JS estimator that, for any $\theta$ and for large $p$, has risk close to the minimum of that of Lindley's estimator and the proposed two-cluster JS estimator. Thus the hybrid estimator asymptotically dominates both the ML estimator and Lindley's estimator, and gives significant risk reduction over the ML estimator for a wide range of $\theta$.

We generalize the above idea to define general multiple-cluster hybrid JS estimators, and provide concentration and convergence results for the squared-error loss and risk, respectively.

We provide simulation results that support the theoretical results on the loss function. The simulations indicate that the hybrid estimator gives significant risk reduction over the ML estimator for a wide range of $\theta$, even for modest values of $p$. The empirical risk of the hybrid estimator converges rapidly to the theoretical value with growing $p$.
1.1 Related work
George [7, 8] proposed a “multiple shrinkage estimator”, which is a convex combination of multiple subspace-based JS estimators of the form (5). The coefficients defining the convex combination give larger weight to the estimators whose target subspaces are closer to $\theta$. Leung and Barron [9, 10] also studied similar ways of combining estimators and their risk properties. Our proposed estimators also seek to emulate the best among a class of subspace-based estimators, but there are some key differences. In [7, 8], the target subspaces are fixed a priori, possibly based on prior knowledge about where $\theta$ might lie. In the absence of such prior knowledge, it may not be possible to choose good target subspaces. This motivates the estimators proposed in this paper, which use a target subspace constructed from the data $x$. The nature of clustering in $\theta$ is inferred from $x$, and used to define a suitable subspace.
Another difference from earlier work is in how the attracting vector is determined given a target subspace $V$. Rather than choosing the attracting vector as the projection of $x$ onto $V$, we use an approximation of the projection of $\theta$ onto $V$. This approximation is computed from $x$, and concentration inequalities are provided to guarantee the goodness of the approximation.
The risk of a JS-like estimator is typically computed using Stein's lemma [5]. However, the data-dependent subspaces we use result in estimators that are hard to analyze using this technique. We therefore use concentration inequalities to bound the loss function of the proposed estimators. Consequently, our theoretical bounds get sharper as the dimension $p$ increases, but may not be accurate for small $p$. However, even for relatively small $p$, simulations indicate that the risk reduction over the ML estimator is significant for a wide range of $\theta$.
Noting that the shrinkage factor multiplying $x$ in (2) could be negative, Stein proposed the following positive-part JS estimator [1]:
$$\hat{\theta}_{JS+} = \left(1 - \frac{(p-2)\sigma^2}{\|x\|^2}\right)^{\!+} x, \quad (7)$$
where $u^+$ denotes $\max(u, 0)$. We can similarly define positive-part versions of JS-like estimators such as (5) and (6). The positive-part Lindley's estimator is given by
$$\hat{\theta}_{L+} = \bar{x}\mathbf{1} + \left(1 - \frac{(p-3)\sigma^2}{\|x - \bar{x}\mathbf{1}\|^2}\right)^{\!+} (x - \bar{x}\mathbf{1}). \quad (8)$$
Baranchik [11] proved that $\hat{\theta}_{JS+}$ dominates $\hat{\theta}_{JS}$, and his result also proves that $\hat{\theta}_{L+}$ dominates $\hat{\theta}_L$. Estimators that dominate $\hat{\theta}_{JS+}$ are discussed in [12, 13]. Fig. 1 shows that the positive-part versions can give noticeably lower loss than the regular JS and Lindley estimators. However, for large $p$, the shrinkage factor is positive with high probability, hence the positive-part estimator is nearly always identical to the regular JS estimator. Indeed, for large $p$, $\|x\|^2 \approx \|\theta\|^2 + p\sigma^2$, and the shrinkage factor is approximately
$$1 - \frac{(p-2)\sigma^2}{\|\theta\|^2 + p\sigma^2} > 0.$$
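A minimal Python sketch of the positive-part construction (7) (the function name is ours): when the shrinkage factor would be negative, the estimate collapses to the attracting vector instead of flipping the sign of the data.

```python
import numpy as np

def js_positive_part(x, sigma2):
    """Positive-part James-Stein estimate, as in (7)."""
    p = x.size
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / np.sum(x ** 2))
    return factor * x

# When ||x||^2 < (p - 2) * sigma^2, the plain JS factor is negative;
# the positive-part version returns the zero vector instead.
x_small = np.array([0.1, -0.2, 0.05, 0.1])
est = js_positive_part(x_small, sigma2=1.0)
```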
We analyze the positive-part version of the proposed hybrid estimator using concentration inequalities. Though we cannot guarantee that the hybrid estimator dominates the positive-part JS or Lindley estimators for any finite $p$, we show that for large $p$, the loss of the hybrid estimator is equal to the minimum of that of the positive-part Lindley's estimator and the cluster-based estimator with high probability (Theorems 3 and 4).
The rest of the paper is organized as follows. In Section 2, a two-cluster JS estimator is proposed and its performance analyzed. Section 3 presents a hybrid JS estimator along with its performance analysis. General multiple-attractor JS estimators are discussed in Section 4, and simulation results to corroborate the theoretical analysis are provided in Section 5. The proofs of the main results are given in Section 6. Concluding remarks and possible directions for future research constitute Section 7.
1.2 Notation
Bold lowercase letters are used to denote vectors, and plain lowercase letters their entries. For example, the entries of $x$ are $x_i$, $1 \leq i \leq p$. All vectors have length $p$ and are column vectors, unless otherwise mentioned. For vectors $a, b$, $\langle a, b \rangle$ denotes their Euclidean inner product. The all-zero vector and the all-one vector of length $p$ are denoted by $\mathbf{0}$ and $\mathbf{1}$, respectively. The complement of a set $A$ is denoted by $A^c$. For a finite set $A$ with real-valued elements, $\min(A)$ denotes the minimum of the elements in $A$. We use $\mathbb{1}\{E\}$ to denote the indicator function of an event $E$. A central chi-squared distributed random variable with $k$ degrees of freedom is denoted by $\chi^2_k$. The positive-part function is given by $u^+ = \max(u, 0)$. For a random variable $X$, $\mathbb{E}[X]$ denotes its expectation. For real-valued functions $f$ and $g$, the notation $f(p) \sim g(p)$ means that $\lim_{p \to \infty} f(p)/g(p) = 1$, and $f(p) = O(g(p))$ means that $f(p) \leq c\, g(p)$ for some positive constant $c$.
For a sequence of random variables $X_1, X_2, \ldots$, the notations $X_p \xrightarrow{P} X$, $X_p \xrightarrow{a.s.} X$, and $X_p \xrightarrow{L^2} X$ respectively denote convergence in probability, almost sure convergence, and convergence in $L^2$ norm to the random variable $X$.
We use the following shorthand for concentration inequalities. Let $X_1, X_2, \ldots$ be a sequence of random variables. The notation $X_p \doteq c$, where $c$ is either a random variable or a constant, means that for any $\epsilon > 0$,
$$P\big(|X_p - c| \geq \epsilon\big) \leq K e^{-\kappa p \epsilon^2}, \quad (9)$$
where $K$ and $\kappa$ are positive constants that do not depend on $p$ or $\epsilon$. The exact values of $K$ and $\kappa$ are not specified.
The shrinkage estimators we propose have the general form
$$\hat{\theta} = \hat{a} + \left(1 - \frac{c\,\sigma^2}{\|x - \hat{a}\|^2}\right)^{\!+} (x - \hat{a}),$$
where $\hat{a}$ is a data-dependent attracting vector and $c$ is a positive constant. For $1 \leq i \leq p$, the $i$th component $\hat{a}_i$ of the attracting vector is the attractor for $x_i$ (the point towards which it is shrunk).
2 A two-cluster James-Stein estimator
Recall the example in Section 1 where $\theta$ has half its components equal to $c$, and the other half equal to $-c$. Ideally, we would like to shrink the $x_i$'s corresponding to the first group towards $c$, and the remaining points towards $-c$. However, without an oracle, we cannot accurately guess which point each $x_i$ should be shrunk towards. We would like to obtain an estimator that identifies separable clusters in $x$, constructs a suitable attractor for each cluster, and shrinks the $x_i$ in each cluster towards its attractor.
We start by dividing the observed data into two clusters based on a separating point, which is obtained from $x$. A natural choice for the separating point would be the empirical mean $\bar{\theta}$ of $\theta$; since this is unknown, we use $\bar{x}$. Define the clusters
$$\mathcal{C}_1 = \{i : x_i \geq \bar{x}\}, \qquad \mathcal{C}_2 = \{i : x_i < \bar{x}\}.$$
The points in $\mathcal{C}_1$ and $\mathcal{C}_2$ will be shrunk towards attractors $\hat{a}_1$ and $\hat{a}_2$, respectively, where $\hat{a}_1, \hat{a}_2$ are defined in (21) later in this section. For brevity, we henceforth do not indicate the dependence of the attractors on $x$. Thus the attracting vector is
$$\hat{a} = \hat{a}_1 \mathbf{1}_{\mathcal{C}_1} + \hat{a}_2 \mathbf{1}_{\mathcal{C}_2}, \quad (10)$$
where $\mathbf{1}_{\mathcal{C}}$ denotes the vector whose $i$th entry is $1$ if $i \in \mathcal{C}$ and $0$ otherwise, and $\hat{a}_1$ and $\hat{a}_2$ are defined in (21). The proposed estimator is
$$\hat{\theta}_{2c} = \hat{a} + g(x)\,(x - \hat{a}), \quad (11)$$
where the function $g$ is defined as
$$g(x) = \left(1 - \frac{(p-4)\sigma^2}{\|x - \hat{a}\|^2}\right)^{\!+}. \quad (12)$$
The attracting vector $\hat{a}$ in (10) lies in the two-dimensional subspace defined by the orthogonal vectors $\mathbf{1}_{\mathcal{C}_1}$ and $\mathbf{1}_{\mathcal{C}_2}$. To derive the values of $\hat{a}_1$ and $\hat{a}_2$ in (10), it is useful to compare $\hat{a}$ to the attracting vector of Lindley's estimator in (6). Recall that Lindley's attracting vector lies in the one-dimensional subspace spanned by $\mathbf{1}$. The vector lying in this subspace that is closest in Euclidean distance to $\theta$ is its projection $\bar{\theta}\mathbf{1}$. Since $\bar{\theta}$ is unknown, we use the approximation $\bar{x}$ to define the attracting vector $\bar{x}\mathbf{1}$.
Analogously, the vector in the two-dimensional subspace defined by (10) that is closest to $\theta$ is the projection of $\theta$ onto this subspace. Computing this projection, the desired values for $\hat{a}_1, \hat{a}_2$ are found to be
$$a_1 = \frac{1}{|\mathcal{C}_1|}\sum_{i \in \mathcal{C}_1} \theta_i, \qquad a_2 = \frac{1}{|\mathcal{C}_2|}\sum_{i \in \mathcal{C}_2} \theta_i. \quad (13)$$
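The within-group means arise because projecting any vector onto the span of two disjoint-support indicator vectors replaces its entries by the group means. A quick check of this linear-algebra fact, with arbitrary illustrative numbers:

```python
import numpy as np

theta = np.array([4.0, 6.0, -3.0, -5.0, 5.0])
c1 = np.array([1.0, 1.0, 0.0, 0.0, 1.0])    # indicator of cluster C1
c2 = 1.0 - c1                               # indicator of cluster C2

# Projection onto span{c1, c2}: since c1 and c2 are orthogonal,
# project onto each separately and add.
proj = (theta @ c1 / (c1 @ c1)) * c1 + (theta @ c2 / (c2 @ c2)) * c2

# The coefficient of each indicator is the within-cluster mean of theta:
# entries are 5, 5, -4, -4, 5 here.
print(proj)
```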
As the $\theta_i$'s are not available, we define the attractors as approximations of $a_1$ and $a_2$ in (13), obtained using the following concentration results.
Lemma 1.
The proof is given in Appendix B.1.
Using Lemma 1, we can obtain estimates for $a_1$ and $a_2$ in (13), provided we have an estimate for the remaining term involving $\theta$. This is achieved via the following concentration result.
Lemma 2.
Fix . Then for any , we have
(20) 
where is a positive constant and .
The proof is given in Appendix B.2.
Note 1.
Henceforth in this paper, is used to denote a generic bounded constant (whose exact value is not needed) that is a coefficient of in expressions of the form where is some constant. As an example to illustrate its usage, let , where and . Then, we have .
Using Lemmas 1 and 2, the two attractors are defined to be
(21) 
With $\epsilon$ chosen to be a small positive number, this completes the specification of the attracting vector $\hat{a}$ in (10), and hence the two-cluster JS estimator in (11).
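To summarize the construction, here is a simplified Python sketch of a two-cluster shrinkage estimator. It is not the exact estimator (11): we split at the empirical mean and use plain within-cluster means of $x$ as attractors, whereas the attractors in (21) include correction terms from Lemmas 1 and 2; the shrinkage constant $p - 4$ is also our assumption.

```python
import numpy as np

def two_cluster_shrink(x, sigma2):
    """Simplified two-cluster shrinkage in the spirit of (10)-(11):
    split at the empirical mean, shrink each cluster towards its own mean."""
    p = x.size
    in_c1 = x >= x.mean()                    # cluster C1; C2 is the complement
    a = np.where(in_c1, x[in_c1].mean(), x[~in_c1].mean())
    resid = x - a
    # shrink towards the 2-D attracting subspace; p - 4 = p - d - 2 with d = 2
    factor = max(0.0, 1.0 - (p - 4) * sigma2 / np.sum(resid ** 2))
    return a + factor * resid

# Bimodal theta: half the components at -5, half at +5
rng = np.random.default_rng(3)
theta = np.r_[np.full(50, -5.0), np.full(50, 5.0)]
x = theta + rng.normal(size=100)
est = two_cluster_shrink(x, sigma2=1.0)
```

For well-separated clusters such as this one, the squared-error loss of the sketch is typically far below that of the ML estimate $x$.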
Note that $\hat{a}$, defined by (10) and (21), is an approximation of the projection of $\theta$ onto the two-dimensional subspace $S$ spanned by the vectors $\mathbf{1}_{\mathcal{C}_1}$ and $\mathbf{1}_{\mathcal{C}_2}$. We remark that $\hat{a}$, which approximates the vector in $S$ that is closest to $\theta$, is distinct from the projection of $x$ onto $S$. While the analysis would be easier (there would be no terms involving $\sigma$) if $\hat{a}$ were chosen to be the projection of $x$ (instead of $\theta$) onto $S$, our numerical simulations suggest that this choice yields significantly higher risk. The intuition behind choosing the projection of $\theta$ onto $S$ is that if all the $x_i$ in a group are to be attracted to a common point (without any prior information), a natural choice would be the mean of the $\theta_i$ within the group, as in (13). This mean is different from the within-group mean of the $x_i$, because the groups are defined by thresholding $x$ itself, so the noise values within each group do not average to zero. The term involving $\sigma$ in (21) approximates this bias.
Note 2.
The attracting vector $\hat{a}$ is dependent not just on $x$ but also on $\epsilon$, through the two attractors $\hat{a}_1$ and $\hat{a}_2$. In Lemma 2, for the deviation probability in (20) to fall exponentially in $p$, $\epsilon$ needs to be held constant and independent of $p$. From a practical design point of view, $\epsilon$ should be chosen small. Indeed, for $\hat{a}$ to be a reliable approximation, the condition shown in Appendix B.2, specifically in (100), needs to hold. Numerical experiments suggest that a small $\epsilon$ yields a good approximation once $p$ is moderately large.
We now present the first main result of the paper.
Theorem 1.
The loss function of the two-cluster JS estimator in (11) satisfies the following:

For a sequence of with increasing dimension , if , we have
(23)
The constants are given by
(24) 
(25) 
where
(26) 
The proof of the theorem is given in Section 6.2.
Remark 1.
In Theorem 1, represents the concentrating value for the distance between and the attracting vector . (It is shown in Sec. 6.2 that concentrates around .) Therefore, the closer is to the attracting subspace, the lower the normalized asymptotic risk . The term represents the concentrating value for the distance between and . (It is shown in Sec. 6.2 that concentrates around .)
Remark 2.
Comparing in (24) and in (25), we note that because
(27) 
To see (27), observe that in the sum in the numerator, the function assigns larger weight to the terms with than to the terms with .
Furthermore, for large if either , or . In the first case, if , for , we get . In the second case, suppose that of the values equal and the remaining values equal for some . Then, as , it can be verified that . Therefore, the asymptotic normalized risk converges to in both cases.
The proof of Theorem 1 further leads to the following corollaries.
Corollary 1.
The loss function of the positive-part JS estimator in (7) satisfies the following:

For any ,
where , and and are positive constants.

For a sequence of with increasing dimension , if , we have
Note that the positive-part Lindley's estimator in (8) is essentially a single-cluster estimator which shrinks all the points towards $\bar{x}$. Henceforth, we denote it by $\hat{\theta}_{L+}$.
Corollary 2.
The loss function of the positive-part Lindley's estimator in (8) satisfies the following:

For any ,
where and are positive constants, and
(28) 
For a sequence of with increasing dimension , if , we have
(29)
Remark 3.
The asymptotic statement of Corollary 1, which is known in the literature [14], implies that $\hat{\theta}_{JS+}$ is asymptotically minimax over Euclidean balls. Indeed, if $\Theta$ denotes the set of $\theta$ such that $\|\theta\|^2 \leq p\,c$ for a constant $c$, then Pinsker's theorem [15, Ch. 5] implies that the minimax risk over $\Theta$ is asymptotically (as $p \to \infty$) equal to the risk attained by $\hat{\theta}_{JS+}$.
The other statement of Corollary 1 and both the statements of Corollary 2 are new, to the best of our knowledge. Comparing Corollaries 1 and 2, we observe that the asymptotic risk of $\hat{\theta}_{L+}$ is no larger than that of $\hat{\theta}_{JS+}$ for all $\theta$, with strict inequality whenever $\bar{\theta} \neq 0$. Therefore the positive-part Lindley's estimator asymptotically dominates the positive-part JS estimator.
It is well known that both $\hat{\theta}_{JS+}$ and $\hat{\theta}_{L+}$ dominate the ML estimator [11]. From Corollary 1, it is clear that asymptotically, the normalized risk of $\hat{\theta}_{JS+}$ is small when $\|\theta\|^2/p$ is small, i.e., when $\theta$ is close to the origin. Similarly, from Corollary 2, the asymptotic normalized risk of $\hat{\theta}_{L+}$ is small when the components of $\theta$ are all very close to the mean $\bar{\theta}$. It is then natural to ask if the two-cluster estimator dominates $\hat{\theta}_{L+}$, and when its asymptotic normalized risk is close to zero. To answer these questions, we use the following example, shown in Fig. 2. Consider $\theta$ whose components take one of two values, $c$ or $-c$, such that $\bar{\theta}$ is as close to zero as possible. The key asymptotic risk term in Theorem 1 is plotted as a function of $c$ in Fig. 2 for various values of $p$.
Two important observations can be made from the plots. Firstly, the normalized risk of the two-cluster estimator exceeds that of the positive-part Lindley's estimator for certain values of $c$ and $p$. Hence, the two-cluster estimator does not dominate the positive-part Lindley's estimator. Secondly, for any $p$, the normalized risk of the two-cluster estimator goes to zero for large enough $c$. Note that when $c$ is large, both $\|\theta\|$ and the spread of the components of $\theta$ around $\bar{\theta}$ are large, and hence the normalized risks of both the positive-part JS and Lindley estimators are close to that of the ML estimator. So, although the two-cluster estimator does not dominate the positive-part JS or Lindley estimators, there is a range of $\theta$ for which its risk is much lower than both. This serves as motivation for designing a hybrid estimator that attempts to pick the better of the two-cluster and Lindley estimators for the $\theta$ in context. This is described in the next section.
In the example of Fig. 2, it is worth examining why the two-cluster estimator performs poorly for a certain range of $c$, while giving significant risk reduction for large enough $c$. First consider an ideal case, where it is known which components of $\theta$ are equal to $c$ and which ones are equal to $-c$ (although the value of $c$ may not be known). In this case, we could use a James-Stein estimator of the form (5) with the target subspace being the two-dimensional subspace with basis vectors