
# Cluster-Seeking James-Stein Estimators

## Abstract

This paper considers the problem of estimating a high-dimensional vector of parameters $\theta \in \mathbb{R}^n$ from a noisy observation. The noise vector is i.i.d. Gaussian with known variance. For a squared-error loss function, the James-Stein (JS) estimator is known to dominate the simple maximum-likelihood (ML) estimator when the dimension exceeds two. The JS-estimator shrinks the observed vector towards the origin, and the risk reduction over the ML-estimator is greatest for $\theta$ that lie close to the origin. JS-estimators can be generalized to shrink the data towards any target subspace. Such estimators also dominate the ML-estimator, but the risk reduction is significant only when $\theta$ lies close to the subspace. This leads to the question: in the absence of prior information about $\theta$, how do we design estimators that give significant risk reduction over the ML-estimator for a wide range of $\theta$?

In this paper, we propose shrinkage estimators that attempt to infer the structure of $\theta$ from the observed data in order to construct a good attracting subspace. In particular, the components of the observed vector are separated into clusters, and the elements in each cluster are shrunk towards a common attractor. The number of clusters and the attractor for each cluster are determined from the observed vector. We provide concentration results for the squared-error loss and convergence results for the risk of the proposed estimators. The results show that the estimators give significant risk reduction over the ML-estimator for a wide range of $\theta$, particularly for large $n$. Simulation results are provided to support the theoretical claims.

*Index Terms* — High-dimensional estimation, large deviations bounds, loss function estimates, risk estimates, shrinkage estimators.

## 1 Introduction


Consider the problem of estimating a vector of parameters $\theta \in \mathbb{R}^n$ from a noisy observation $y \in \mathbb{R}^n$ of the form

$$y = \theta + w.$$

The noise vector $w$ is distributed as $\mathcal{N}(0, \sigma^2 I_n)$, i.e., its components are i.i.d. Gaussian random variables with mean zero and variance $\sigma^2$. We emphasize that $\theta$ is deterministic, so the joint probability density function of $y$ for a given $\theta$ is

$$p_\theta(y) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{\|y-\theta\|^2}{2\sigma^2}}. \tag{1}$$

The performance of an estimator $\hat\theta(y)$ is measured using the squared-error loss function given by

$$L\big(\theta, \hat\theta(y)\big) := \|\hat\theta(y) - \theta\|^2,$$

where $\|\cdot\|$ denotes the Euclidean norm. The risk of the estimator for a given $\theta$ is the expected value of the loss function:

$$R(\theta, \hat\theta) := \mathbb{E}\big[\|\hat\theta(y) - \theta\|^2\big],$$

where the expectation is computed using the density in (1). The normalized risk is $R(\theta, \hat\theta)/n$.

Applying the maximum-likelihood (ML) criterion to (1) yields the ML-estimator $\hat\theta_{ML}(y) = y$. The ML-estimator is an unbiased estimator, and its risk is $R(\theta, \hat\theta_{ML}) = n\sigma^2$. The goal of this paper is to design estimators that give significant risk reduction over $\hat\theta_{ML}$ for a wide range of $\theta$, without any prior assumptions about its structure.

In 1961, James and Stein published a surprising result [1], proposing an estimator that uniformly achieves lower risk than $\hat\theta_{ML}$ for any $\theta \in \mathbb{R}^n$, for $n \geq 3$. Their estimator is given by

$$\hat\theta_{JS} = \left[1 - \frac{(n-2)\sigma^2}{\|y\|^2}\right] y, \tag{2}$$

and its risk is [2, Thm. 5.1]

$$R(\theta, \hat\theta_{JS}) = n\sigma^2 - (n-2)^2\sigma^4\, \mathbb{E}\left[\frac{1}{\|y\|^2}\right]. \tag{3}$$

Hence, for $n \geq 3$,

$$R(\theta, \hat\theta_{JS}) < n\sigma^2 = R(\theta, \hat\theta_{ML}). \tag{4}$$

An estimator $\hat\theta_1$ is said to dominate another estimator $\hat\theta_2$ if

$$R(\theta, \hat\theta_1) \leq R(\theta, \hat\theta_2), \qquad \forall\, \theta \in \mathbb{R}^n,$$

with the inequality being strict for at least one $\theta$. Thus (4) implies that the James-Stein estimator (JS-estimator) dominates the ML-estimator. Unlike the ML-estimator, the JS-estimator is non-linear and biased. However, the risk reduction over the ML-estimator can be significant, making it an attractive option in many situations; see, for example, [3].
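As a quick numerical illustration of this dominance, the following sketch (not from the paper; the parameter choices are arbitrary) compares the average loss of the JS-estimator (2) against the ML-estimator over repeated noise draws:

```python
import numpy as np

def js_estimate(y, sigma2):
    """James-Stein estimator (2): shrink y towards the origin."""
    n = len(y)
    return (1.0 - (n - 2) * sigma2 / np.dot(y, y)) * y

rng = np.random.default_rng(0)
n, sigma2 = 100, 1.0
theta = np.full(n, 0.5)  # arbitrary true parameter for the demo
js_loss, ml_loss = [], []
for _ in range(500):
    y = theta + rng.normal(0.0, np.sqrt(sigma2), n)
    js_loss.append(np.sum((js_estimate(y, sigma2) - theta) ** 2))
    ml_loss.append(np.sum((y - theta) ** 2))

avg_js, avg_ml = np.mean(js_loss), np.mean(ml_loss)
print(avg_js < avg_ml)  # JS achieves lower average loss than ML here
```

Since this $\theta$ is close to the origin, the average JS loss falls far below the ML risk $n\sigma^2 = 100$.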

By evaluating the expression in (3), it can be shown that the risk of the JS-estimator depends on $\theta$ only via $\|\theta\|$ [1]. Further, the risk decreases as $\|\theta\|$ decreases. (For intuition about this, note in (3) that for large $n$, $\mathbb{E}[1/\|y\|^2] \approx 1/(\|\theta\|^2 + n\sigma^2)$.) The dependence of the risk on $\|\theta\|$ is illustrated in Fig. 1, where the average loss of the JS-estimator is plotted for two different choices of $\theta$.

The JS-estimator in (2) shrinks each element of $y$ towards the origin. Extending this idea, JS-like estimators can be defined by shrinking $y$ towards any vector, or more generally, towards a target subspace $\mathcal{V}$. Let $P_{\mathcal{V}}(y)$ denote the projection of $y$ onto $\mathcal{V}$. Then the JS-estimator that shrinks $y$ towards the subspace $\mathcal{V}$ is

$$\hat\theta = P_{\mathcal{V}}(y) + \left[1 - \frac{(n-d-2)\sigma^2}{\|y - P_{\mathcal{V}}(y)\|^2}\right]\big(y - P_{\mathcal{V}}(y)\big), \tag{5}$$

where $d$ is the dimension of $\mathcal{V}$.¹ A classic example of such an estimator is Lindley's estimator [4], which shrinks $y$ towards the one-dimensional subspace defined by the all-ones vector $\mathbf{1}$. It is given by

$$\hat\theta_L = \bar{y}\mathbf{1} + \left[1 - \frac{(n-3)\sigma^2}{\|y - \bar{y}\mathbf{1}\|^2}\right](y - \bar{y}\mathbf{1}), \tag{6}$$

where $\bar{y} := \frac{1}{n}\sum_{i=1}^n y_i$ is the empirical mean of $y$.
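A sketch of Lindley's estimator (6), with a toy comparison against the origin-shrinking JS-estimator (2) when the components of $\theta$ cluster around a common non-zero value; all parameter choices below are illustrative, not prescribed by the paper:

```python
import numpy as np

def lindley_estimate(y, sigma2):
    """Lindley's estimator (6): shrink y towards its empirical mean ybar*1."""
    n = len(y)
    resid = y - y.mean()
    return y.mean() + (1.0 - (n - 3) * sigma2 / np.dot(resid, resid)) * resid

def js_estimate(y, sigma2):
    """James-Stein estimator (2): shrink y towards the origin."""
    n = len(y)
    return (1.0 - (n - 2) * sigma2 / np.dot(y, y)) * y

rng = np.random.default_rng(1)
n, sigma2 = 100, 1.0
theta = np.full(n, 3.0)  # components cluster at 3, far from the origin
l_loss, j_loss = [], []
for _ in range(300):
    y = theta + rng.normal(0.0, 1.0, n)
    l_loss.append(np.sum((lindley_estimate(y, sigma2) - theta) ** 2))
    j_loss.append(np.sum((js_estimate(y, sigma2) - theta) ** 2))

print(np.mean(l_loss) < np.mean(j_loss))  # Lindley wins for this theta
```

Because $\theta = 3\cdot\mathbf{1}$ lies exactly on Lindley's attracting subspace, the Lindley loss is far below both the JS loss and the ML risk.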

It can be shown that the different variants of the JS-estimator such as (2), (5), (6) all dominate the ML-estimator.² Further, all JS-estimators share the following key property [6, 7, 8]: the smaller the Euclidean distance between $\theta$ and the attracting vector, the smaller the risk.

Throughout this paper, the term "attracting vector" refers to the vector that $y$ is shrunk towards. For $\hat\theta_{JS}$ in (2), the attracting vector is $\mathbf{0}$, and the risk reduction over $\hat\theta_{ML}$ is larger when $\theta$ is close to zero. Similarly, if the components of $\theta$ are clustered around some value $c$, a JS-estimator with attracting vector $c\mathbf{1}$ would give significant risk reduction over $\hat\theta_{ML}$. One motivation for Lindley's estimator in (6) comes from a guess that the components of $\theta$ are close to its empirical mean $\bar\theta$: since we do not know $\bar\theta$, we approximate it by $\bar y$ and use the attracting vector $\bar y\mathbf{1}$.

Fig. 1 shows how the performance of $\hat\theta_{JS}$ and $\hat\theta_L$ depends on the structure of $\theta$. In the left panel of the figure, the empirical mean $\bar\theta$ is always $0$, so the risks of both estimators increase monotonically with the magnitude of the components. In the right panel, all the components of $\theta$ are equal to $\tau$. In this case, the distance from the attracting vector for $\hat\theta_L$ is $0$, so the risk does not vary with $\tau$; in contrast, the risk of $\hat\theta_{JS}$ increases with $\tau$ as its attracting vector is $\mathbf{0}$.

The risk reduction obtained by using a JS-like shrinkage estimator over $\hat\theta_{ML}$ crucially depends on the choice of attracting vector. To achieve significant risk reduction for a wide range of $\theta$, in this paper, we infer the structure of $\theta$ from the data and choose attracting vectors tailored to this structure. The idea is to partition $y$ into clusters, and shrink the components in each cluster towards a common element (attractor). Both the number of clusters and the attractor for each cluster are to be determined based on the data $y$.

As a motivating example, consider a $\theta$ in which half the components are equal to $\tau$ and the other half are equal to $-\tau$. Fig. 1(a) shows that the risk reductions of both $\hat\theta_{JS}$ and $\hat\theta_L$ diminish as $\tau$ gets larger. This is because the empirical mean $\bar y$ is close to zero, hence $\hat\theta_{JS}$ and $\hat\theta_L$ both shrink towards $\mathbf{0}$. An ideal JS-estimator would shrink the $y_i$'s corresponding to $\theta_i = \tau$ towards the attractor $\tau$, and the remaining observations towards $-\tau$. Such an estimator would give handsome gains over $\hat\theta_{ML}$ for all $\theta$ with the above structure. On the other hand, if $\theta$ is such that all its components are equal (to $\tau$, say), Lindley's estimator is an excellent choice, with significantly smaller risk than $\hat\theta_{ML}$ for all values of $\tau$ (Fig. 1(b)).

We would like an intelligent estimator that can correctly distinguish between different structures (such as the two above) and choose an appropriate attracting vector, based only on $y$. We propose such estimators in Sections 3 and 4. For reasonably large $n$, these estimators choose a good attracting subspace tailored to the structure of $\theta$, and use an approximation of the best attracting vector within the subspace.

The main contributions of our paper are as follows.

• We construct a two-cluster JS-estimator, and provide concentration results for the squared-error loss and asymptotic convergence results for its risk. Though this estimator does not dominate the ML-estimator, it is shown to provide significant risk reduction over Lindley's estimator and the regular JS-estimator when the components of $\theta$ can be approximately separated into two clusters.

• We present a hybrid JS-estimator that, for any $\theta$ and for large $n$, has risk close to the minimum of that of Lindley's estimator and the proposed two-cluster JS-estimator. Thus the hybrid estimator asymptotically dominates both the ML-estimator and Lindley's estimator, and gives significant risk reduction over the ML-estimator for a wide range of $\theta$.

• We generalize the above idea to define general multiple-cluster hybrid JS-estimators, and provide concentration and convergence results for the squared-error loss and risk, respectively.

• We provide simulation results that support the theoretical results on the loss function. The simulations indicate that the hybrid estimator gives significant risk reduction over the ML-estimator for a wide range of $\theta$, even for modest values of $n$. The empirical risk of the hybrid estimator converges rapidly to the theoretical value with growing $n$.

### 1.1 Related work

George [7, 8] proposed a "multiple shrinkage estimator", which is a convex combination of multiple subspace-based JS-estimators of the form (5). The coefficients defining the convex combination give larger weight to the estimators whose target subspaces are closer to $\theta$. Leung and Barron [9, 10] also studied similar ways of combining estimators and their risk properties. Our proposed estimators also seek to emulate the best among a class of subspace-based estimators, but there are some key differences. In [7, 8], the target subspaces are fixed a priori, possibly based on prior knowledge about where $\theta$ might lie. In the absence of such prior knowledge, it may not be possible to choose good target subspaces. This motivates the estimators proposed in this paper, which use a target subspace constructed from the data $y$. The nature of clustering in $\theta$ is inferred from $y$, and used to define a suitable subspace.

Another difference from earlier work is in how the attracting vector is determined given a target subspace $\mathcal{V}$. Rather than choosing the attracting vector as the projection of $y$ onto $\mathcal{V}$, we use an approximation of the projection of $\theta$ onto $\mathcal{V}$. This approximation is computed from $y$, and concentration inequalities are provided to guarantee the goodness of the approximation.

The risk of a JS-like estimator is typically computed using Stein's lemma [5]. However, the data-dependent subspaces we use result in estimators that are hard to analyze using this technique. We therefore use concentration inequalities to bound the loss function of the proposed estimators. Consequently, our theoretical bounds get sharper as the dimension $n$ increases, but may not be accurate for small $n$. However, even for relatively small $n$, simulations indicate that the risk reduction over the ML-estimator is significant for a wide range of $\theta$.

Noting that the shrinkage factor multiplying $y$ in (2) could be negative, Stein proposed the following positive-part JS-estimator [1]:

$$\hat\theta_{JS+} = \left[1 - \frac{(n-2)\sigma^2}{\|y\|^2}\right]_+ y, \tag{7}$$

where $[x]_+$ denotes $\max(x, 0)$. We can similarly define positive-part versions of JS-like estimators such as (5) and (6). The positive-part Lindley's estimator is given by

$$\hat\theta_{L+} = \bar{y}\mathbf{1} + \left[1 - \frac{(n-3)\sigma^2}{\|y - \bar{y}\mathbf{1}\|^2}\right]_+ (y - \bar{y}\mathbf{1}). \tag{8}$$

Baranchik [11] proved that $\hat\theta_{JS+}$ dominates $\hat\theta_{JS}$, and his result also proves that $\hat\theta_{L+}$ dominates $\hat\theta_L$. Estimators that dominate $\hat\theta_{JS+}$ are discussed in [12, 13]. Fig. 1 shows that the positive-part versions can give noticeably lower loss than the regular JS and Lindley estimators. However, for large $n$, the shrinkage factor is positive with high probability, hence the positive-part estimator is nearly always identical to the regular JS-estimator. Indeed, for large $n$, $\|y\|^2 \approx \|\theta\|^2 + n\sigma^2$, and the shrinkage factor is

$$\left(1 - \frac{(n-2)\sigma^2}{\|y\|^2}\right) \approx \left(1 - \frac{(n-2)\sigma^2}{\|\theta\|^2 + n\sigma^2}\right) > 0.$$
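A minimal sketch of the positive-part operation in (7): when $\|y\|^2$ is small the raw JS factor goes negative and is clipped to zero, so the estimate collapses to the origin; for large $n$ the factor stays positive and clipping is inactive. The inputs below are contrived for illustration:

```python
import numpy as np

def js_plus(y, sigma2):
    """Positive-part JS-estimator (7): clip the shrinkage factor at zero."""
    n = len(y)
    factor = 1.0 - (n - 2) * sigma2 / np.dot(y, y)
    return max(factor, 0.0) * y

# Small ||y||^2: the raw factor is negative, so the estimate is the zero vector.
y_small = np.full(10, 0.1)
print(np.all(js_plus(y_small, 1.0) == 0.0))

# Large n with ||y||^2 ~ ||theta||^2 + n*sigma^2: the factor stays positive.
rng = np.random.default_rng(2)
y_big = 2.0 + rng.normal(0.0, 1.0, 1000)
factor = 1.0 - (len(y_big) - 2) * 1.0 / np.dot(y_big, y_big)
print(factor > 0)  # clipping is inactive in this regime
```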

We analyze the positive-part version of the proposed hybrid estimator using concentration inequalities. Though we cannot guarantee that the hybrid estimator dominates the positive-part JS or Lindley estimators for any finite $n$, we show that for large $n$, the loss of the hybrid estimator is equal to the minimum of that of the positive-part Lindley's estimator and the cluster-based estimator with high probability (Theorems 3 and 4).

The rest of the paper is organized as follows. In Section 2, a two-cluster JS-estimator is proposed and its performance analyzed. Section 3 presents a hybrid JS-estimator along with its performance analysis. General multiple-attractor JS-estimators are discussed in Section 4, and simulation results to corroborate the theoretical analysis are provided in Section 5. The proofs of the main results are given in Section 6. Concluding remarks and possible directions for future research constitute Section 7.

### 1.2 Notation

Bold lowercase letters are used to denote vectors, and plain lowercase letters for their entries. For example, the entries of $x$ are $x_i$, $1 \leq i \leq n$. All vectors have length $n$ and are column vectors, unless otherwise mentioned. For vectors $u, v$, $\langle u, v\rangle$ denotes their Euclidean inner product. The all-zero vector and the all-one vector of length $n$ are denoted by $\mathbf{0}$ and $\mathbf{1}$, respectively. The complement of a set $A$ is denoted by $A^c$. For a finite set $A$ with real-valued elements, $\min A$ denotes the minimum of the elements in $A$. We use $\mathbf{1}\{E\}$ to denote the indicator function of an event $E$. A central chi-squared distributed random variable with $k$ degrees of freedom is denoted by $\chi^2_k$. The $Q$-function is given by $Q(x) := \int_x^\infty \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt$, and $Q^c(x) := 1 - Q(x)$. For a random variable $X$, $\mathbb{E}[X]$ denotes its expectation. For real-valued functions $f$ and $g$, the notation $f(n) = o(g(n))$ means that $\lim_{n\to\infty} f(n)/g(n) = 0$, and $f(n) = O(g(n))$ means that $f(n) \leq c\, g(n)$ for some positive constant $c$.

For a sequence of random variables $X_n$, the notations $X_n \xrightarrow{P} X$, $X_n \xrightarrow{a.s.} X$, and $X_n \xrightarrow{L^1} X$ respectively denote convergence in probability, almost sure convergence, and convergence in $L^1$ norm to the random variable $X$.

We use the following shorthand for concentration inequalities. Let $\{X_n(\theta)\}$ be a sequence of random variables. The notation $X_n(\theta) \doteq X$, where $X$ is either a random variable or a constant, means that for any $\epsilon > 0$,

$$P\big(|X_n(\theta) - X| \geq \epsilon\big) \leq K e^{-nk \frac{\min(\epsilon^2,\, 1)}{\max(\|\theta\|^2/n,\, 1)}}, \tag{9}$$

where $K$ and $k$ are positive constants that do not depend on $n$ or $\epsilon$. The exact values of $K$ and $k$ are not specified.

The shrinkage estimators we propose have the general form

$$\hat\theta = \nu + \left[1 - \frac{n\sigma^2}{\|y-\nu\|^2}\right]_+ (y - \nu).$$

For $1 \leq i \leq n$, the $i$th component $\nu_i$ of the attracting vector $\nu$ is the attractor for $y_i$ (the point towards which it is shrunk).

## 2 A two-cluster James-Stein estimator

Recall the example in Section 1 where $\theta$ has half its components equal to $\tau$, and the other half equal to $-\tau$. Ideally, we would like to shrink the $y_i$'s corresponding to the first group towards $\tau$, and the remaining points towards $-\tau$. However, without an oracle, we cannot accurately guess which point each $y_i$ should be shrunk towards. We would like to obtain an estimator that identifies separable clusters in $y$, constructs a suitable attractor for each cluster, and shrinks the $y_i$ in each cluster towards its attractor.

We start by dividing the observed data into two clusters based on a separating point, which is obtained from $y$. A natural choice for the separating point would be the empirical mean $\bar\theta := \frac{1}{n}\sum_{i=1}^n \theta_i$; since this is unknown, we use $\bar y$. Define the clusters

$$C_1 := \{y_i,\ 1 \leq i \leq n \mid y_i > \bar y\}, \qquad C_2 := \{y_i,\ 1 \leq i \leq n \mid y_i \leq \bar y\}.$$

The points in $C_1$ and $C_2$ will be shrunk towards attractors $a_1$ and $a_2$, respectively, where $a_1, a_2$ are defined in (21) later in this section. For brevity, we henceforth do not indicate the dependence of the attractors on $y$. Thus the attracting vector is

$$\nu_2 := a_1 \begin{bmatrix} \mathbf{1}\{y_1 > \bar y\} \\ \mathbf{1}\{y_2 > \bar y\} \\ \vdots \\ \mathbf{1}\{y_n > \bar y\} \end{bmatrix} + a_2 \begin{bmatrix} \mathbf{1}\{y_1 \leq \bar y\} \\ \mathbf{1}\{y_2 \leq \bar y\} \\ \vdots \\ \mathbf{1}\{y_n \leq \bar y\} \end{bmatrix}, \tag{10}$$

with $a_1$ and $a_2$ defined in (21). The proposed estimator is

$$\hat\theta_{JS2} = \nu_2 + \left[1 - \frac{n\sigma^2}{\|y-\nu_2\|^2}\right]_+ (y-\nu_2) = \nu_2 + \left[1 - \frac{\sigma^2}{g\big(\|y-\nu_2\|^2/n\big)}\right](y-\nu_2), \tag{11}$$

where the function $g$ is defined as

$$g(x) := \max(\sigma^2, x), \qquad x \in \mathbb{R}. \tag{12}$$
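The two forms in (11) are the same function: when $\|y-\nu_2\|^2/n \leq \sigma^2$, $g$ clamps the denominator to $\sigma^2$ and the shrinkage factor becomes exactly zero, matching the $[\cdot]_+$ clipping. A quick numerical check (illustrative vectors only; `shrink_clipped` and `shrink_g` are names introduced here):

```python
import numpy as np

def g(x, sigma2):
    """g(x) = max(sigma^2, x), as in (12)."""
    return max(sigma2, x)

def shrink_clipped(y, nu, sigma2):
    """First form of (11): positive-part shrinkage towards nu."""
    n, r = len(y), y - nu
    return nu + max(0.0, 1.0 - n * sigma2 / np.dot(r, r)) * r

def shrink_g(y, nu, sigma2):
    """Second form of (11): shrinkage written via g from (12)."""
    n, r = len(y), y - nu
    return nu + (1.0 - sigma2 / g(np.dot(r, r) / n, sigma2)) * r

rng = np.random.default_rng(3)
ok = True
for scale in (0.1, 1.0, 10.0):  # small scale forces the clipped regime
    y = rng.normal(0.0, scale, 50)
    nu = np.zeros(50)
    ok &= np.allclose(shrink_clipped(y, nu, 1.0), shrink_g(y, nu, 1.0))
print(ok)  # the two forms agree in both regimes
```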

The attracting vector $\nu_2$ in (10) lies in a two-dimensional subspace defined by the two orthogonal indicator vectors in (10). To derive the values of $a_1$ and $a_2$, it is useful to compare (10) to the attracting vector of Lindley's estimator in (6). Recall that Lindley's attracting vector $\bar y\mathbf{1}$ lies in the one-dimensional subspace spanned by $\mathbf{1}$. The vector lying in this subspace that is closest in Euclidean distance to $\theta$ is its projection $\bar\theta\mathbf{1}$. Since $\bar\theta$ is unknown, we use the approximation $\bar y$ to define the attracting vector $\bar y\mathbf{1}$.

Analogously, the vector in the two-dimensional subspace of (10) that is closest to $\theta$ is the projection of $\theta$ onto this subspace. Computing this projection, the desired values for $a_1, a_2$ are found to be

$$\tilde a_1 = \frac{\sum_{i=1}^n \theta_i \mathbf{1}\{y_i > \bar y\}}{\sum_{i=1}^n \mathbf{1}\{y_i > \bar y\}}, \qquad \tilde a_2 = \frac{\sum_{i=1}^n \theta_i \mathbf{1}\{y_i \leq \bar y\}}{\sum_{i=1}^n \mathbf{1}\{y_i \leq \bar y\}}. \tag{13}$$

As the $\theta_i$'s are not available, we define the attractors $a_1, a_2$ as approximations of $\tilde a_1, \tilde a_2$, obtained using the following concentration results.

###### Lemma 1.

We have

$$\frac{1}{n}\sum_{i=1}^n y_i \mathbf{1}\{y_i > \bar y\} \doteq \frac{1}{n}\sum_{i=1}^n \theta_i \mathbf{1}\{y_i > \bar y\} + \frac{\sigma}{n\sqrt{2\pi}}\sum_{i=1}^n e^{-\frac{(\bar\theta-\theta_i)^2}{2\sigma^2}}, \tag{14}$$
$$\frac{1}{n}\sum_{i=1}^n y_i \mathbf{1}\{y_i \leq \bar y\} \doteq \frac{1}{n}\sum_{i=1}^n \theta_i \mathbf{1}\{y_i \leq \bar y\} - \frac{\sigma}{n\sqrt{2\pi}}\sum_{i=1}^n e^{-\frac{(\bar\theta-\theta_i)^2}{2\sigma^2}}, \tag{15}$$
$$\frac{1}{n}\sum_{i=1}^n \theta_i \mathbf{1}\{y_i > \bar y\} \doteq \frac{1}{n}\sum_{i=1}^n \theta_i\, Q\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right), \tag{16}$$
$$\frac{1}{n}\sum_{i=1}^n \theta_i \mathbf{1}\{y_i \leq \bar y\} \doteq \frac{1}{n}\sum_{i=1}^n \theta_i\, Q^c\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right), \tag{17}$$
$$P\left(\frac{1}{n}\left|\sum_{i=1}^n \mathbf{1}\{y_i > \bar y\} - \sum_{i=1}^n Q\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right)\right| \geq \epsilon\right) \leq K e^{-nk\epsilon^2}, \tag{18}$$
$$P\left(\frac{1}{n}\left|\sum_{i=1}^n \mathbf{1}\{y_i \leq \bar y\} - \sum_{i=1}^n Q^c\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right)\right| \geq \epsilon\right) \leq K e^{-nk\epsilon^2}, \tag{19}$$

where $\bar\theta := \frac{1}{n}\sum_{i=1}^n \theta_i$. Recall from Section 1.2 that the symbol $\doteq$ is shorthand for a concentration inequality of the form (9).

The proof is given in Appendix B.1.

Using Lemma 1, we can obtain estimates for the ideal attractors in (13) provided we have an estimate for the term $\frac{\sigma}{n\sqrt{2\pi}}\sum_{i=1}^n e^{-(\bar\theta-\theta_i)^2/(2\sigma^2)}$. This is achieved via the following concentration result.

###### Lemma 2.

Fix $\delta > 0$. Then for any $\epsilon > 0$, we have

$$P\left(\left|\frac{\sigma^2}{2n\delta}\sum_{i=1}^n \mathbf{1}\{|y_i - \bar y| \leq \delta\} - \left(\frac{\sigma}{n\sqrt{2\pi}}\sum_{i=1}^n e^{-\frac{(\bar\theta-\theta_i)^2}{2\sigma^2}} + \kappa_n\delta\right)\right| \geq \epsilon\right) \leq 10\, e^{-nk\epsilon^2}, \tag{20}$$

where $k$ is a positive constant and $\kappa_n$ is a bounded constant of the kind described in Note 1 below.

The proof is given in Appendix B.2.

###### Note 1.

Henceforth in this paper, $\kappa_n$ is used to denote a generic bounded constant (whose exact value is not needed) that appears as the coefficient of $\delta$ in expressions of the form $\kappa_n\delta$. Its value may change from one expression to the next. For example, if $X_n \doteq x_n + \kappa_n'\delta$ and $Y_n \doteq y_n + \kappa_n''\delta$, then we write $X_n + Y_n \doteq x_n + y_n + \kappa_n\delta$.

Using Lemmas 1 and 2, the two attractors are defined to be

$$a_1 = \frac{\sum_{i=1}^n y_i \mathbf{1}\{y_i > \bar y\} - \frac{\sigma^2}{2\delta}\sum_{i=1}^n \mathbf{1}\{|y_i - \bar y| \leq \delta\}}{\sum_{i=1}^n \mathbf{1}\{y_i > \bar y\}}, \qquad a_2 = \frac{\sum_{i=1}^n y_i \mathbf{1}\{y_i \leq \bar y\} + \frac{\sigma^2}{2\delta}\sum_{i=1}^n \mathbf{1}\{|y_i - \bar y| \leq \delta\}}{\sum_{i=1}^n \mathbf{1}\{y_i \leq \bar y\}}. \tag{21}$$

With $\delta$ chosen to be a small positive number, this completes the specification of the attracting vector $\nu_2$ in (10), and hence the two-cluster JS-estimator in (11).
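Putting (10), (11), and (21) together, a sketch implementation of the two-cluster estimator follows. The choice $\delta = 0.3$ below is an arbitrary small constant for illustration, not a value prescribed by the paper, and `two_cluster_js` is a name introduced here:

```python
import numpy as np

def two_cluster_js(y, sigma2, delta=0.3):
    """Two-cluster JS-estimator (11): split y at ybar, build the attractors
    a1, a2 as in (21), form nu_2 as in (10), then shrink towards nu_2."""
    n = len(y)
    ybar = y.mean()
    upper = y > ybar                                   # cluster C1
    corr = (sigma2 / (2.0 * delta)) * np.sum(np.abs(y - ybar) <= delta)
    a1 = (np.sum(y[upper]) - corr) / max(np.sum(upper), 1)
    a2 = (np.sum(y[~upper]) + corr) / max(np.sum(~upper), 1)
    nu2 = np.where(upper, a1, a2)                      # attracting vector (10)
    r = y - nu2
    return nu2 + max(0.0, 1.0 - n * sigma2 / np.dot(r, r)) * r

# Demo: half the components of theta at +5, half at -5 (the motivating example).
rng = np.random.default_rng(4)
n, sigma2 = 200, 1.0
theta = np.repeat([5.0, -5.0], n // 2)
loss_2c, loss_ml = [], []
for _ in range(100):
    y = theta + rng.normal(0.0, 1.0, n)
    loss_2c.append(np.sum((two_cluster_js(y, sigma2) - theta) ** 2) / n)
    loss_ml.append(np.sum((y - theta) ** 2) / n)

print(np.mean(loss_2c) < np.mean(loss_ml))  # large gains for clustered theta
```

For this well-separated $\theta$, the attractors land near $\pm 5$ and the normalized loss falls far below the ML value of roughly $\sigma^2$.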

Note that $\nu_2$, defined by (10) and (21), is an approximation of the projection of $\theta$ onto the two-dimensional subspace $S$ spanned by the two indicator vectors in (10). We remark that $\nu_2$, which approximates the vector in $S$ that is closest to $\theta$, is distinct from the projection of $y$ onto $S$. While the analysis would be easier (there would be no terms involving $\delta$) if $\nu_2$ were chosen to be the projection of $y$ (instead of $\theta$) onto $S$, our numerical simulations suggest that this choice yields significantly higher risk. The intuition behind choosing the projection of $\theta$ onto $S$ is that if all the $y_i$ in a group are to be attracted to a common point (without any prior information), a natural choice would be the mean of the $\theta_i$ within the group, as in (13). This mean is determined by the term $\sum_i \theta_i \mathbf{1}\{y_i > \bar y\}$, which is different from $\sum_i y_i \mathbf{1}\{y_i > \bar y\}$ because

$$\mathbb{E}\left(\sum_{i=1}^n (y_i - \theta_i)\, \mathbf{1}\{y_i \geq \bar y\}\right) = \mathbb{E}\left(\sum_{i=1}^n w_i\, \mathbf{1}\{y_i \geq \bar y\}\right) \neq 0.$$

The term involving $\frac{\sigma^2}{2\delta}$ in (21) approximates this non-zero mean, i.e., the term $\frac{\sigma}{\sqrt{2\pi}}\sum_{i=1}^n e^{-(\bar\theta-\theta_i)^2/(2\sigma^2)}$.

###### Note 2.

The attracting vector $\nu_2$ depends not just on $y$ but also on $\delta$, through the two attractors $a_1$ and $a_2$. In Lemma 2, for the deviation probability in (20) to fall exponentially in $n$, $\delta$ needs to be held constant and independent of $n$. From a practical design point of view, what is needed is that $\delta$ be small while $n\delta$ remains large. Indeed, for the term $\frac{\sigma^2}{2\delta}\sum_i \mathbf{1}\{|y_i - \bar y| \leq \delta\}$ in (21) to be a reliable approximation of the quantity it estimates, it is shown in Appendix B.2, specifically in (100), that $n\delta$ needs to be large. Numerical experiments suggest that a small constant value of $\delta$ is large enough for a good approximation.

We now present the first main result of the paper.

###### Theorem 1.

The loss function of the two-cluster JS-estimator in (11) satisfies the following:

1. For any $\epsilon > 0$, and for any fixed $\delta > 0$ that is independent of $n$,

$$P\left(\left|\frac{1}{n}\|\theta - \hat\theta_{JS2}\|^2 - \left[\min\left(\beta_n,\ \frac{\beta_n\sigma^2}{\alpha_n + \sigma^2}\right) + \kappa_n\delta + o(\delta)\right]\right| \geq \epsilon\right) \leq K e^{-nk \frac{\min(\epsilon^2,\, 1)}{\max(\|\theta\|^2/n,\, 1)}}, \tag{22}$$

where $\alpha_n, \beta_n$ are given by (25) and (24) below, $\kappa_n$ is a bounded constant that is independent of $n$ and $\delta$, while $k$ is another positive constant that is independent of $n$ (for a fixed $\delta$).

2. For a sequence of $\theta$ with increasing dimension $n$, if $\|\theta\|^2/n$ remains bounded, we have

$$\lim_{n\to\infty}\left|\frac{1}{n} R(\theta, \hat\theta_{JS2}) - \left[\min\left(\beta_n,\ \frac{\beta_n\sigma^2}{\alpha_n + \sigma^2}\right) + \kappa_n\delta + o(\delta)\right]\right| = 0. \tag{23}$$

The constants $\beta_n, \alpha_n$ are given by

$$\beta_n := \frac{\|\theta\|^2}{n} - \frac{c_1^2}{n}\sum_{i=1}^n Q\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right) - \frac{c_2^2}{n}\sum_{i=1}^n Q^c\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right), \tag{24}$$
$$\alpha_n := \beta_n - \frac{2(c_1 - c_2)\,\sigma}{n\sqrt{2\pi}}\sum_{i=1}^n e^{-\frac{(\bar\theta-\theta_i)^2}{2\sigma^2}}, \tag{25}$$

where

$$c_1 := \frac{\sum_{i=1}^n \theta_i\, Q\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right)}{\sum_{i=1}^n Q\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right)}, \qquad c_2 := \frac{\sum_{i=1}^n \theta_i\, Q^c\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right)}{\sum_{i=1}^n Q^c\!\left(\frac{\bar\theta-\theta_i}{\sigma}\right)}. \tag{26}$$

The proof of the theorem is given in Section 6.2.
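The quantities in (24) and (26) are directly computable for a given $\theta$. Below is a sketch for the symmetric two-valued $\theta$ of the running example ($\tau = 2$, $\sigma = 1$ are illustrative choices; `theorem1_constants` is a name introduced here). By symmetry, $c_1 = -c_2$, and $c_1 - c_2 \geq 0$ as in (27):

```python
import math
import numpy as np

def Q(x):
    """Gaussian tail probability Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def theorem1_constants(theta, sigma):
    """c1, c2 from (26) and beta_n from (24)."""
    n = len(theta)
    tbar = theta.mean()
    q = np.array([Q((tbar - t) / sigma) for t in theta])
    qc = 1.0 - q
    c1 = float(np.dot(theta, q) / q.sum())
    c2 = float(np.dot(theta, qc) / qc.sum())
    beta = np.dot(theta, theta) / n - c1**2 * q.sum() / n - c2**2 * qc.sum() / n
    return c1, c2, beta

theta = np.repeat([2.0, -2.0], 50)   # half at +2, half at -2, so tbar = 0
c1, c2, beta = theorem1_constants(theta, 1.0)
print(c1 > 0.0 > c2, abs(c1 + c2) < 1e-9)   # symmetric clusters: c1 = -c2
print(0.0 < beta < np.dot(theta, theta) / len(theta))
```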

###### Remark 1.

In Theorem 1, $\beta_n$ represents the concentrating value for the normalized squared distance between $\theta$ and the attracting vector $\nu_2$. (It is shown in Sec. 6.2 that $\frac{1}{n}\|\theta - \nu_2\|^2$ concentrates around $\beta_n$.) Therefore, the closer $\theta$ is to the attracting subspace, the lower the normalized asymptotic risk. The term $\alpha_n + \sigma^2$ represents the concentrating value for the normalized squared distance between $y$ and $\nu_2$. (It is shown in Sec. 6.2 that $\frac{1}{n}\|y - \nu_2\|^2$ concentrates around $\alpha_n + \sigma^2$.)

###### Remark 2.

Comparing $\beta_n$ in (24) and $\alpha_n$ in (25), we note that $\alpha_n \leq \beta_n$ because

$$c_1 - c_2 = -\frac{n\sum_{i=1}^n (\theta_i - \bar\theta)\, Q\!\left(\frac{\theta_i-\bar\theta}{\sigma}\right)}{\left(\sum_{i=1}^n Q\!\left(\frac{\theta_i-\bar\theta}{\sigma}\right)\right)\left(\sum_{i=1}^n Q^c\!\left(\frac{\theta_i-\bar\theta}{\sigma}\right)\right)} \geq 0. \tag{27}$$

To see (27), observe that in the sum in the numerator, the decreasing function $Q$ assigns larger weight to the terms with $\theta_i < \bar\theta$ than to the terms with $\theta_i > \bar\theta$, so the sum is non-positive.

Furthermore, $\alpha_n \approx \beta_n$ for large $n$ if either all the components of $\theta$ are equal, or the components are well separated from $\bar\theta$. In the first case, $\theta_i = \bar\theta$ for all $i$, so $c_1 = c_2$ and hence $\alpha_n = \beta_n$. In the second case, suppose that a fixed fraction of the values $\theta_i$ equal $\tau$ and the remaining values equal $-\tau$ for some $\tau > 0$. Then, as $\tau/\sigma$ grows, the exponential terms in (25) vanish, and it can be verified that $\alpha_n \to \beta_n$. Therefore, the asymptotic normalized risk converges to $\frac{\beta_n\sigma^2}{\beta_n + \sigma^2}$ in both cases.

The proof of Theorem 1 further leads to the following corollaries.

###### Corollary 1.

The loss function of the positive-part JS-estimator in (7) satisfies the following:

1. For any $\epsilon > 0$,

$$P\left(\left|\frac{\|\theta - \hat\theta_{JS+}\|^2}{n} - \frac{\gamma_n\sigma^2}{\gamma_n + \sigma^2}\right| \geq \epsilon\right) \leq K e^{-nk\min(\epsilon^2,\, 1)},$$

where $\gamma_n := \|\theta\|^2/n$, and $K$ and $k$ are positive constants.

2. For a sequence of $\theta$ with increasing dimension $n$, if $\|\theta\|^2/n$ remains bounded, we have

$$\lim_{n\to\infty}\left|\frac{1}{n} R(\theta, \hat\theta_{JS+}) - \frac{\gamma_n\sigma^2}{\gamma_n + \sigma^2}\right| = 0.$$

Note that the positive-part Lindley's estimator in (8) is essentially a single-cluster estimator which shrinks all the points towards $\bar y$. Henceforth, we denote it by $\hat\theta_{JS1}$.

###### Corollary 2.

The loss function of the positive-part Lindley’s estimator in (8) satisfies the following:

1. For any $\epsilon > 0$,

$$P\left(\left|\frac{\|\theta - \hat\theta_{JS1}\|^2}{n} - \frac{\rho_n\sigma^2}{\rho_n + \sigma^2}\right| \geq \epsilon\right) \leq K e^{-nk\min(\epsilon^2,\, 1)},$$

where $K$ and $k$ are positive constants, and

$$\rho_n := \frac{\|\theta - \bar\theta\mathbf{1}\|^2}{n}. \tag{28}$$
2. For a sequence of $\theta$ with increasing dimension $n$, if $\|\theta\|^2/n$ remains bounded, we have

$$\lim_{n\to\infty}\left|\frac{1}{n} R(\theta, \hat\theta_{JS1}) - \frac{\rho_n\sigma^2}{\rho_n + \sigma^2}\right| = 0. \tag{29}$$
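The limit in Corollary 2 can be checked numerically: for large $n$, the normalized loss of the positive-part Lindley's estimator should concentrate near $\rho_n\sigma^2/(\rho_n + \sigma^2)$. A sketch (the test $\theta$ below is an arbitrary choice; `lindley_plus` is a name introduced here):

```python
import numpy as np

def lindley_plus(y, sigma2):
    """Positive-part Lindley's estimator (8)."""
    n = len(y)
    r = y - y.mean()
    return y.mean() + max(0.0, 1.0 - (n - 3) * sigma2 / np.dot(r, r)) * r

rng = np.random.default_rng(5)
n, sigma2 = 4000, 1.0
theta = np.linspace(-1.0, 1.0, n)                  # arbitrary test parameter
rho = np.sum((theta - theta.mean()) ** 2) / n      # rho_n from (28)
predicted = rho * sigma2 / (rho + sigma2)          # limit value in (29)
losses = [np.sum((lindley_plus(theta + rng.normal(0, 1, n), sigma2)
                  - theta) ** 2) / n for _ in range(30)]
print(abs(np.mean(losses) - predicted) < 0.05)  # loss ~ rho*s2/(rho+s2)
```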
###### Remark 3.

Statement 1 of Corollary 1, which is known in the literature [14], implies that $\hat\theta_{JS+}$ is asymptotically minimax over Euclidean balls. Indeed, if $\Theta(r)$ denotes the set of $\theta \in \mathbb{R}^n$ such that $\|\theta\|^2/n \leq r$, then Pinsker's theorem [15, Ch. 5] implies that the minimax risk over $\Theta(r)$ is asymptotically (as $n \to \infty$) equal to $\frac{n r\sigma^2}{r + \sigma^2}$.

Statement 2 of Corollary 1 and both statements of Corollary 2 are new, to the best of our knowledge. Comparing Corollaries 1 and 2, we observe that $\hat\theta_{JS1}$ asymptotically dominates $\hat\theta_{JS+}$, since $\rho_n \leq \gamma_n$ for all $\theta$, with strict inequality whenever $\bar\theta \neq 0$. Therefore the positive-part Lindley's estimator asymptotically dominates the positive-part JS-estimator.

It is well known that both $\hat\theta_{JS+}$ and $\hat\theta_{JS1}$ dominate the ML-estimator [11]. From Corollary 1, it is clear that asymptotically, the normalized risk of $\hat\theta_{JS+}$ is small when $\gamma_n$ is small, i.e., when $\theta$ is close to the origin. Similarly, from Corollary 2, the asymptotic normalized risk of $\hat\theta_{JS1}$ is small when $\rho_n$ is small, which occurs when the components of $\theta$ are all very close to the mean $\bar\theta$. It is then natural to ask whether the two-cluster estimator $\hat\theta_{JS2}$ dominates $\hat\theta_{JS1}$, and when its asymptotic normalized risk is close to zero. To answer these questions, we use the following example, shown in Fig. 2. Consider $\theta$ whose components take one of two values, $\tau$ or $-\tau$, such that $\bar\theta$ is as close to zero as possible. Hence the number of components taking the value $\tau$ is $\lfloor n/2 \rfloor$. The key asymptotic risk term $\min\big(\beta_n, \frac{\beta_n\sigma^2}{\alpha_n + \sigma^2}\big)$ in Theorem 1 is plotted as a function of $\tau$ in Fig. 2, for various choices of the remaining parameters.

Two important observations can be made from the plots. Firstly, $\min\big(\beta_n, \frac{\beta_n\sigma^2}{\alpha_n + \sigma^2}\big)$ exceeds $\sigma^2$ for certain values of $\tau$. Hence, $\hat\theta_{JS2}$ does not dominate $\hat\theta_{ML}$. Secondly, for any such $\theta$, the normalized risk of $\hat\theta_{JS2}$ goes to zero for large enough $\tau$. Note that when $\tau$ is large, both $\gamma_n$ and $\rho_n$ are large, and hence the normalized risks of both $\hat\theta_{JS+}$ and $\hat\theta_{JS1}$ are close to $\sigma^2$. So, although $\hat\theta_{JS2}$ does not dominate $\hat\theta_{ML}$, $\hat\theta_{JS+}$, or $\hat\theta_{JS1}$, there is a range of $\tau$ for which its normalized risk is much lower than that of both $\hat\theta_{JS+}$ and $\hat\theta_{JS1}$. This serves as motivation for designing a hybrid estimator that attempts to pick the better of $\hat\theta_{JS1}$ and $\hat\theta_{JS2}$ for the $\theta$ in context. This is described in the next section.

In the example of Fig. 2, it is worth examining why the two-cluster estimator performs poorly for a certain range of $\tau$, while giving significant risk reduction for large enough $\tau$. First consider an ideal case, where it is known which components of $\theta$ are equal to $\tau$ and which ones are equal to $-\tau$ (although the value of $\tau$ may not be known). In this case, we could use a James-Stein estimator of the form (5), with the target subspace being the two-dimensional subspace with basis vectors

$$u_1 := \begin{bmatrix} \mathbf{1}\{\theta_1 = \tau\} \\ \mathbf{1}\{\theta_2 = \tau\} \\ \vdots \\ \mathbf{1}\{\theta_n = \tau\} \end{bmatrix}, \qquad u_2 := \begin{bmatrix} \mathbf{1}\{\theta_1 = -\tau\} \\ \mathbf{1}\{\theta_2 = -\tau\} \\ \vdots \\ \mathbf{1}\{\theta_n = -\tau\} \end{bmatrix}.$$