Stable Estimation of a Covariance Matrix Guided by Nuclear Norm Penalties


Abstract

Estimation of covariance matrices or their inverses plays a central role in many statistical methods. For these methods to work reliably, estimated matrices must not only be invertible but also well-conditioned. In this paper we present an intuitive prior that shrinks the classic sample covariance estimator towards a stable target. We prove that our estimator is consistent and asymptotically efficient. Thus, it gracefully transitions towards the sample covariance matrix as the number of samples grows relative to the number of covariates. We also demonstrate the utility of our estimator in two standard situations – discriminant analysis and EM clustering – when the number of samples is dominated by or comparable to the number of covariates.

Keywords: Covariance estimation, Regularization, Condition number, Discriminant analysis, EM clustering

1 Introduction

Estimates of covariance matrices and their inverses play a central role in many core statistical methods, ranging from least squares regression to EM clustering. In these applications it is crucial to obtain estimates that are not just non-singular but also stable. It is well known that the sample covariance matrix

S = (1/n) ∑_{i=1}^n (y_i − ȳ)(y_i − ȳ)^T

is the maximum likelihood estimate of the population covariance Σ of a random sample y_1, …, y_n from a multivariate normal distribution. When the number of components p of each observation exceeds the sample size n, the sample covariance S is no longer invertible. Even when n is close to p, S becomes unstable in the sense that small perturbations in measurements can lead to disproportionately large fluctuations in its entries. A reliable way to combat instability is to perform penalized maximum likelihood estimation.

To motivate our choice of penalization, consider the eigenvalues of the sample covariance matrix in a simple simulation experiment. We drew independent samples from a 10-dimensional multivariate normal distribution with identity covariance. Figure 1 presents boxplots of the sorted eigenvalues of the sample covariance matrix over 100 trials for a range of sample sizes n. The boxplots descend from the largest eigenvalue on the left to the smallest eigenvalue on the right. The figure vividly illustrates the previous observation that the highest eigenvalues tend to be inflated above 1, while the lowest eigenvalues are deflated below 1 (Ledoit and Wolf, 2004, 2012). In general, if the sample size n and the number of components p approach infinity in such a way that the ratio p/n approaches a constant y in (0, 1), then the empirical distribution of the eigenvalues of S tends to the Marčenko-Pastur law (Marčenko and Pastur, 1967), which is supported on the interval [(1 − √y)^2, (1 + √y)^2]. Thus, the distortion worsens as p/n approaches 1. The obvious remedy is to pull the highest eigenvalues down and push the lowest eigenvalues up.

Figure 1: Boxplots of the sorted eigenvalues of the sample covariance matrix over 100 random trials. Here the number of components is p = 10, and a separate panel is shown for each sample size n considered.
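The qualitative behavior in Figure 1 is easy to reproduce. The following R sketch draws samples from a 10-dimensional standard normal and boxplots the sorted eigenvalues of the maximum likelihood sample covariance; the particular sample sizes below are placeholders, since the exact set used in the experiment is not shown above.

```r
## Eigenvalue distortion of the sample covariance when all population
## eigenvalues equal 1.  The sample sizes are illustrative placeholders.
set.seed(1)
p <- 10
for (n in c(25, 50, 100, 400)) {
  eigs <- replicate(100, {
    Y <- matrix(rnorm(n * p), n, p)     # rows are N(0, I_10) draws
    ## MLE covariance about the known zero mean
    eigen(crossprod(Y) / n, symmetric = TRUE, only.values = TRUE)$values
  })
  boxplot(t(eigs), main = paste("n =", n),
          xlab = "eigenvalue rank", ylab = "sorted sample eigenvalues")
  abline(h = 1, lty = 2)                # every population eigenvalue is 1
}
```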

In this paper, we introduce a novel prior that effects the desired adjustment on the sample eigenvalues. Maximum a posteriori (MAP) estimation under the prior boils down to a simple nonlinear transformation of the sample eigenvalues. In addition to proving that our estimator has desirable theoretical properties, we also demonstrate its utility in extending two fundamental statistical methods, discriminant analysis and EM clustering, to contexts where the number of samples is either on the order of or dominated by the number of parameters.

The rest of our paper is organized as follows. Section 2 discusses the history of robust estimation of structured and unstructured covariance matrices. Section 3 specifies our Bayesian prior and derives the MAP estimator under the prior. Section 4 proves that the estimator is consistent and asymptotically efficient. Section 5 reports finite sample studies comparing our MAP estimator to relevant existing estimators. Section 6 illustrates the estimator for some common tasks in statistics. Finally, Section 7 discusses limitations, generalizations, and further applications of the estimator.

2 Related Work

Structured estimation of covariance matrices can be attacked from two complementary perspectives: generalized linear models and regularization (Pourahmadi, 2011, 2013). In this work we consider the problem from the latter perspective. Regularized estimation of covariance matrices and their inverses has been a topic of intense scrutiny (Wu and Pourahmadi, 2003; Bickel and Levina, 2008), and the current literature reflects a wide spectrum of structural assumptions. For instance, banded covariance matrices make sense for time series and spatial data, where the order of the components is important. It is also helpful to impose sparsity on a covariance matrix, its inverse, or its factors in a Cholesky decomposition or other factorization (Huang et al., 2006; Rohde and Tsybakov, 2011; Cai and Zhou, 2012; Ravikumar et al., 2011; Rajaratnam et al., 2008; Khare and Rajaratnam, 2011; Fan et al., 2011; Banerjee et al., 2008; Friedman et al., 2008; Hero and Rajaratnam, 2011, 2012; Peng et al., 2009).

In the current paper, we do not assume any special structure. Our sole concern is to directly address the distortion in the eigenvalues of the sample covariance matrix. Thus, we work in the context of rotationally-invariant estimators first proposed by Stein (1975). If S = U D U^T is the spectral decomposition of S, then Stein suggests alternative estimators that replace the diagonal matrix D but keep U, changing the eigenvalues but not the eigenvectors of S. In particular, Stein (1975), Haff (1991), Ledoit and Wolf (2004), and Warton (2008) study the family

Σ̂(γ) = γ S + (1 − γ) μ I        (1)

of linear shrinkage estimators, where the weight γ lies in [0, 1] and μ is a positive scalar. The estimator (1) obviously entails the same linear shrinkage of the eigenvalues: each sample eigenvalue d_i is mapped to γ d_i + (1 − γ) μ, while the eigenvectors of S are left unchanged.

A natural choice of μ, and one taken by the popular estimator of Ledoit and Wolf (2004), is the mean of the sample eigenvalues, namely μ = tr(S)/p. Under the assumption that Σ is a multiple of the identity matrix, tr(S)/p is the maximum likelihood estimate of that multiple. With this choice, the linear estimator becomes a mixture of the covariance model with the greatest number of degrees of freedom and the simplest non-trivial model. For the rest of this paper, we will assume that μ = tr(S)/p. We also highlight the fact that the Ledoit and Wolf linear estimator, which we refer to by the acronym LW, is a notable member of the class of linear estimators, as it specifies an asymptotically optimal value of the weight γ based on the data.
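For concreteness, a minimal R sketch of the linear shrinkage family follows, assuming the convex-combination form written in (1) with μ equal to the mean sample eigenvalue; the data-driven choice of the weight made by LW is not reproduced here.

```r
## Linear shrinkage toward mu * I with mu = tr(S)/p, assuming the
## convex-combination form of equation (1); gamma in [0, 1] is user supplied.
linear_shrinkage <- function(S, gamma) {
  p  <- nrow(S)
  mu <- sum(diag(S)) / p
  gamma * S + (1 - gamma) * mu * diag(p)
}
```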

Ledoit and Wolf (2004, 2012) show that linear shrinkage works well when the ratio p/n is large or the population eigenvalues are close to one another. On the other hand, if p/n is small or the population eigenvalues are dispersed, linear shrinkage yields marginal improvements over the sample covariance. Nonlinear shrinkage estimators may present avenues for further improvement (Dey and Srinivasan, 1985; Daniels and Kass, 2001; Sheena and Gupta, 2003; Pourahmadi et al., 2007; Ledoit and Wolf, 2012; Won et al., 2012). Our shrinkage estimator is closest in spirit to the estimator of Won et al. (2012), who put a prior on the condition number of the covariance matrix.

Recall that the condition number κ(M) of a matrix M is the ratio of its largest singular value to its smallest singular value. For a symmetric matrix, the singular values are the absolute values of the eigenvalues, and for a covariance matrix they are the eigenvalues themselves. The best conditioned matrices are multiples of the identity matrix and have condition number 1. A well-conditioned covariance estimate is one whose condition number is not too large, say no more than 1000.

When n does not greatly exceed p, Figure 1 shows that the sample covariance matrix is often ill conditioned. To address this defect, Won et al. perform maximum likelihood estimation restricted to the space of positive definite matrices whose condition number does not exceed a threshold. Let L(Σ) denote the negative loglikelihood, namely

L(Σ) = (n/2) [ln det Σ + tr(Σ^{-1} S)]

up to an irrelevant additive constant. Won et al. seek a Σ that solves

minimize L(Σ)
subject to κ(Σ) ≤ κ_max,

where κ(Σ) denotes the condition number of Σ, the ratio of its largest to its smallest eigenvalue, and κ_max is a tuning parameter. Note that when κ_max is at least as large as the condition number of S, the unique solution is S itself. In practice, κ_max is determined by cross-validation. We will refer to the solution to the above problem as Won et al.'s CNR, for condition number regularized estimator.

Won et al. show that CNR has improved finite sample performance compared to linear estimators in simulations, but the greatest gains arise when only a few eigenvalues of the population covariance are much larger than the rest. The gains diminish when this does not hold. The main contribution of the estimator we describe next is that it provides superior performance in scenarios where CNR loses its competitive edge.

3 Maximum a Posteriori Estimation with a Novel Prior

Adding a penalty is equivalent to imposing a prior on the population covariance Σ. The prior we advocate is designed to steer the eigenvalues of Σ away from the extremes of 0 and infinity. Recall that the nuclear norm of a matrix M, which we denote by ||M||_*, is the sum of the singular values of M. The reasonable choice we adopt is a prior whose negative logarithm is, up to an additive constant, a convex combination of the nuclear norms of Σ and Σ^{-1} scaled by a positive strength constant λ; the convex combination is governed by a mixture constant α between 0 and 1.

This is a proper prior on the set of invertible matrices. One can demonstrate this fact by comparing the nuclear norm ||M||_* to the Frobenius norm ||M||_F, which coincides with the Euclidean norm of the vector of singular values of M. In view of the equivalence of vector norms, ||M||_* ≥ c ||M||_F for some positive constant c. Integrating the resulting inequality for the prior density over the set of invertible matrices demonstrates that the prior is proper. The normalizing constant of the prior is irrelevant in the ensuing discussion. Consider therefore minimization of the objective function formed by adding the negative logprior to the negative loglikelihood.

The minimum of this objective occurs at the posterior mode. In the limit as the strength constant λ tends to 0, the objective reduces to the negative loglikelihood. In the sequel we will refer to our MAP covariance estimate by the acronym CERNN (Covariance Estimate Regularized by Nuclear Norms).

Fortunately, three of the four terms of the objective can be expressed as functions of the eigenvalues of Σ alone. The trace contribution tr(Σ^{-1} S) presents a greater challenge. As before, let S = U D U^T denote the spectral decomposition of S with the nonnegative diagonal entries of D ordered from largest to smallest. Likewise, let Σ = V E V^T denote the spectral decomposition of Σ with the positive diagonal entries of E ordered from largest to smallest. In view of the von Neumann-Fan inequality (Mirsky, 1975), we can assert that

tr(Σ^{-1} S) ≥ tr(E^{-1} D),

with equality if and only if V = U. Consequently, we make the latter assumption and replace tr(Σ^{-1} S) by tr(E^{-1} D), using the cyclic permutation property of the trace function. Setting the partial derivative of the objective with respect to each eigenvalue of Σ equal to 0 yields a stationarity condition that is essentially quadratic in that eigenvalue. The solution to this essentially quadratic equation is

(2)

We reject the negative root as inconsistent with Σ being positive definite. For the special case of no data, all sample eigenvalues d_i equal 0, and the prior mode occurs at a multiple of the identity matrix.
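Since the displayed formulas are not reproduced above, the following R sketch assumes that, after the spectral substitution, the objective contributes the term (n/2)(log e + d/e) + (λ/2)(α e + (1 − α)/e) for each pair of a sample eigenvalue d and a transformed eigenvalue e; under that assumption the stationarity condition is the quadratic λα e² + n e − (n d + λ(1 − α)) = 0, and equation (2) corresponds to its positive root. The function names and the exact scaling of λ are ours, not necessarily the paper's.

```r
## Hypothetical CERNN eigenvalue update: positive root of the assumed
## quadratic lambda*alpha*e^2 + n*e - (n*d + lambda*(1 - alpha)) = 0.
cernn_eigenvalues <- function(d, n, lambda, alpha) {
  a  <- lambda * alpha
  b  <- n
  cc <- -(n * d + lambda * (1 - alpha))
  (-b + sqrt(b^2 - 4 * a * cc)) / (2 * a)     # the negative root is rejected
}

## Reassemble the estimate with the eigenvectors of S and shrunken eigenvalues.
cernn <- function(S, n, lambda, alpha) {
  es <- eigen(S, symmetric = TRUE)            # S = U D U^T
  e  <- cernn_eigenvalues(es$values, n, lambda, alpha)
  es$vectors %*% diag(e, nrow = length(e)) %*% t(es$vectors)
}
```

In this sketch the update tends to the sample eigenvalue as λ tends to 0 and to the constant sqrt((1 − α)/α), a multiple-of-identity prior mode, as λ tends to infinity, which is consistent with the qualitative limits described around expansion (4) below.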

In contrast, CNR shrinks the sample eigenvalues as follows

(3)

for a truncation level that is determined from the data. Thus, CNR truncates extreme sample eigenvalues and leaves the moderate ones unchanged.
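A minimal sketch consistent with this description is the clipping map below; the data-dependent choice of the truncation level, implemented in the CondReg package, is not reproduced here.

```r
## Hypothetical CNR eigenvalue map for equation (3): clip each sample
## eigenvalue to the interval [tau, kappa_max * tau].
cnr_eigenvalues <- function(d, tau, kappa_max) {
  pmin(pmax(d, tau), kappa_max * tau)
}
```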

Figure 2 compares the eigenvalue solution paths obtained by CERNN and CNR to the solution paths of the linear estimator on a set of five sample eigenvalues (13.29, 5.73, 1.51, 0.55, 0.44). At each condition number, the regularization parameters of the three methods are adjusted so that all three estimates attain that condition number. Each path starts at a sample eigenvalue and ends at the mean of the sample eigenvalues. As desired, all three methods pull the large eigenvalues down towards the mean and push the small eigenvalues up towards the mean. There are important differences, however. Compared to the linear estimator, both of the nonlinear estimators pull the larger eigenvalues down more aggressively and pull the smaller eigenvalues up less aggressively. The discrepancy in the treatment of high and low eigenvalues is more pronounced in CNR than in CERNN. We will see later in finite sample experiments that CERNN's more moderate approach usually leads to better performance.

Figure 2: A comparison of the solution paths for CERNN (left panel, solid line) and CNR (right panel, solid line) against the path for the linear estimator (both panels, dashed line) for sample eigenvalues (13.29, 5.73, 1.51, 0.55, 0.44). The dotted line indicates the mean of the sample eigenvalues.

Some insight into the limiting behavior of our estimator can be gained through asymptotic expansions. Holding all but one variable fixed in formula (2), one can demonstrate after a fair amount of algebra that

(4)

These asymptotic expansions accord with common sense. Namely, the data eventually overwhelm a fixed prior, and increasing the penalty strength for a fixed amount of data pulls the estimate of Σ toward the prior mode. The second asymptotic expansion will prove useful in selecting the strength constant λ in a practical data-dependent way.

3.1 Choosing α and λ

Let us first consider how to choose α. A natural choice would match the prior to the scale of the data. Thus, we determine α as the solution to the equation that sets the scale of the prior mode equal to the mean of the sample eigenvalues, namely

(5)

Of course, we recognize that this choice of α results in the prior mode being the mean sample eigenvalue times the identity matrix. For the remainder of the paper we will set α to the value displayed in equation (5).

We recommend different strategies for choosing λ depending on whether the covariance estimation is being performed for a supervised or an unsupervised task. Both strategies employ cross-validation. We defer describing the former strategy until we discuss an application to discriminant analysis. To choose λ in the unsupervised context, we implement cross-validation as follows. Let Y denote the observed data. In K-fold cross-validation we partition the observations, or rows of Y, into K disjoint sets. Let Y_k denote the kth subset and n_k the number of its rows, and let Σ̂_{-k}(λ) denote the estimate we obtain when we fit the data using all but the kth partition. We have indicated with our notation that our estimate depends on λ but not α, since we have fixed α at the value displayed in equation (5). For a grid of increasing regularization parameters λ_1 < λ_2 < ⋯ < λ_m, we evaluate the predictive negative loglikelihood of Σ̂_{-k}(λ_j) on the held-out kth subgroup.

We select the λ_j that minimizes the average of these predictive negative loglikelihoods over the K folds.

We want to choose the largest grid value λ_m sufficiently large to adequately explore the dynamic range of possible solutions. The second expansion in (4) gives us a rough bound on how much each of the shrunken eigenvalues deviates from the prior mode. We choose λ_m so that those deviations are no more than some small fraction of the prior mode.
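The unsupervised selection of λ can be sketched as follows, reusing the hypothetical cernn() function above. The held-out fit is scored by the Gaussian predictive negative loglikelihood up to constants; the handling of the held-out mean and the use of the unbiased cov() divisor are simplifications of ours.

```r
## K-fold cross-validation over a grid of lambda values, with alpha fixed.
cv_cernn <- function(Y, lambdas, alpha, K = 5) {
  n    <- nrow(Y)
  fold <- sample(rep_len(1:K, n))
  score <- sapply(lambdas, function(lam) {
    mean(sapply(1:K, function(k) {
      train <- Y[fold != k, , drop = FALSE]
      test  <- Y[fold == k, , drop = FALSE]
      Shat  <- cernn(cov(train), nrow(train), lam, alpha)
      Stest <- cov(test)
      ni    <- nrow(test)
      ## predictive negative loglikelihood, dropping constants
      (ni / 2) * (as.numeric(determinant(Shat, logarithm = TRUE)$modulus) +
                  sum(diag(solve(Shat, Stest))))
    }))
  })
  lambdas[which.min(score)]
}
```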

4 Consistency and Asymptotic Efficiency

In proving consistency, we will need various facts. First, suppose A and B are two symmetric matrices with ordered eigenvalues a_1 ≥ ⋯ ≥ a_p and b_1 ≥ ⋯ ≥ b_p. Then one has

(6)

This is a consequence of Fan's inequality. If the two matrices A and B are simultaneously diagonalizable, then equality holds in inequality (6). We will also need the inequalities

(7)

for nonnegative arguments. Verification will be left to the reader based on the fact that the derivatives of the function involved alternate in sign. Functions having this property are said to be completely monotonic.

Let S_n be the sample covariance matrix based on the first n sample points, with ordered eigenvalues d_{n1} through d_{np}. The sequence S_n converges almost surely to the true covariance matrix Σ, whose ordered eigenvalues we denote δ_1 through δ_p. Inequality (6) therefore implies that each d_{ni} converges almost surely to δ_i. On this basis we will argue that the transformed eigenvalues of S_n converge almost surely to the δ_i as well. To make this reasoning rigorous, we must show that the asymptotic expansion (4) is uniform as the eigenvalues d_{ni} converge to the eigenvalues δ_i. This is where the inequalities (7) come into play. Indeed, we have

(8)

The identity relating the spectral decompositions of S_n and the CERNN estimate then finishes the proof that the CERNN estimate tends almost surely to Σ.

Now consider the question of asymptotic efficiency. The scaled difference √n (S_n − Σ) tends in distribution to a multivariate normal distribution with mean 0 because the sequence of sample covariance matrices is asymptotically efficient (Ferguson, 1996). The representation

√n (Σ̂_n − Σ) = √n (S_n − Σ) + √n (Σ̂_n − S_n),

where Σ̂_n denotes the CERNN estimate based on the first n sample points, and Slutsky's theorem (Ferguson, 1996) imply that √n (Σ̂_n − Σ) tends in distribution to the same limit. In this regard note that √n (Σ̂_n − S_n) tends almost surely to 0 owing to the bounds (8) and the convergence of S_n to Σ.

CNR and linear estimators are also asymptotically very well behaved. The asymptotic properties of CNR are stated in terms of the entropy risk, which is the expectation of the entropy loss given below

(9)

CNR has asymptotically lower entropy risk than the sample covariance (Won et al., 2012). But since the sample covariance is a consistent estimator of the covariance, it follows that CNR is also a consistent estimator.

Among all linear estimators, the LW estimator of Ledoit and Wolf (2004) is asymptotically optimal with respect to a quadratic risk. To make this claim more precise, consider the optimization problem

minimize the quadratic risk of Σ*        (10)
subject to Σ* = ρ1 I + ρ2 S,

where the weights ρ1 and ρ2 are allowed to be random and can therefore depend on the data. One can think of the solution to (10) as being the best possible linear combination of the identity matrix I and the sample covariance S. Even though this solution may not be an estimator, since it is allowed to depend on the unseen population covariance Σ, Ledoit and Wolf (2004) prove that LW attains the same quadratic risk as the solution to (10) asymptotically. Their results are actually even stronger than stated here because p is allowed to increase along with n, under suitable but somewhat complicated regularity conditions that we omit. Since the sample covariance is consistent, and its quadratic risk by definition exceeds the quadratic risk of the solution to (10), it follows that LW is also a consistent estimator.

5 Finite Sample Performance

Given their similar asymptotic behavior, to better understand the relative strengths of CERNN, CNR, and the optimal linear estimator LW, we conducted a finite sample study almost identical to the one carried out by Won et al. (2012). We assessed the estimators by two commonly used criteria, namely the entropy loss (9) and a quadratic loss.
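For completeness, the two criteria can be computed as below. We assume the standard definitions used in this literature: Stein's entropy loss and a quadratic loss based on the deviation of Σ^{-1}Σ̂ from the identity. If the displayed formulas differ from these by a scale factor, the comparisons are unaffected because only loss ratios are reported.

```r
## Assumed definitions of the entropy (Stein) loss and the quadratic loss.
entropy_loss <- function(Sigma_hat, Sigma) {
  M <- solve(Sigma, Sigma_hat)                      # Sigma^{-1} %*% Sigma_hat
  sum(diag(M)) - as.numeric(determinant(M, logarithm = TRUE)$modulus) - nrow(Sigma)
}
quadratic_loss <- function(Sigma_hat, Sigma) {
  M <- solve(Sigma, Sigma_hat) - diag(nrow(Sigma))  # Sigma^{-1} Sigma_hat - I
  sum(M * M)                                        # squared Frobenius norm
}
```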

In our experiments, we simulated data in which we varied the ratio p/n and the spread of the eigenvalues of the population covariance. As noted earlier, linear shrinkage improves on the sample covariance when p/n is large or the population eigenvalues are close to one another. Won et al. report that when the eigenvalues of the population covariance are bimodal, CNR dramatically outperforms linear estimators if very few eigenvalues take on the high values. As the proportion of large eigenvalues grows, however, the discrepancy diminishes. CERNN shrinks extreme eigenvalues in a manner similar to CNR but less drastically, and it shrinks intermediate eigenvalues in the manner of linear estimators. In contrast, CNR leaves intermediate eigenvalues unchanged. Consequently, we anticipate that CERNN has the potential to improve on CNR in the latter case, where there is a need to shrink all eigenvalues, as linear estimators do, but with extra emphasis on the extreme ones, unlike linear estimators.

We simulated p-dimensional zero mean multivariate normal samples with diagonal covariances for three choices of the dimension p, including 250 and 500. The eigenvalues took on either a common high value or a common low value. For each p, the number of high-value eigenvalues ranged from a single one to 10%, 20%, 30%, and 40% of p. For each p we drew one of three sample sizes n, chosen so that the ratio p/n took on values that were roughly 1.25, 2, or 4. For each choice of dimension, sample size, and fraction of high eigenvalues, we simulated 100 data sets and computed CNR, CERNN, and LW. For each data set, we chose optimal regularization parameters for CNR and CERNN via 10-fold cross-validation. We used the R package CondReg to obtain the CNR estimate (Oh et al., 2013). LW specifies its penalization parameter directly from the data.
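One replicate of a single simulation cell can be sketched as follows, reusing the functions defined earlier. The high and low eigenvalue magnitudes and the fixed regularization parameters below are placeholders; in the study itself λ is chosen by 10-fold cross-validation and α by equation (5).

```r
## One replicate of one simulation cell (placeholder constants).
p <- 250; n <- round(p / 2)                   # ratio p/n = 2
frac_high <- 0.2
n_high <- round(frac_high * p)
evals  <- c(rep(10, n_high), rep(1, p - n_high))            # placeholder high/low values
Sigma  <- diag(evals)                                       # diagonal population covariance
Y      <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(evals))  # N(0, Sigma) draws
S      <- crossprod(scale(Y, scale = FALSE)) / n            # MLE sample covariance
Shat   <- cernn(S, n, lambda = 10, alpha = 0.5)             # placeholder lambda and alpha
c(entropy = entropy_loss(Shat, Sigma), quadratic = quadratic_loss(Shat, Sigma))
```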

To expedite comparisons, we report the average ratio of the loss of other estimators to the loss of CERNN. When this ratio is less than one, the other estimator is performing better than CERNN, and when the ratio is greater than one, CERNN is performing better. Tables 1 and 2 summarize comparisons under the quadratic loss and entropy loss respectively.

We first note that these experiments confirm what was previously observed by Won et al. (2012). Regardless of loss criterion, CNR typically outperforms linear estimators over a wide range of scenarios, but especially when few eigenvalues dominate and the ratio p/n is larger. CNR soundly outperforms CERNN, however, only in the extreme case of a single large eigenvalue. In all other scenarios, under either loss criterion, CERNN performs better. It is especially notable that CERNN performs very well in comparison to CNR in scenarios where CNR provides marginal improvement over linear estimators, namely when the fraction of high eigenvalues is highest at 40%. Recall from equation (3) that CNR only shrinks the most extreme sample eigenvalues and leaves the intermediate eigenvalues unchanged. It is not surprising then that it works best when there are very few large population eigenvalues and loses its competitive edge in the opposite circumstance. CERNN's more moderate approach of shrinking all eigenvalues, with extra emphasis on the larger ones, appears to win out when there is a more balanced mix of high and low eigenvalues.

singleton
p/n    CNR/CERNN                                LW/CERNN
4      0.42 (0.13)  0.19 (0.03)  0.12 (0.01)    12.4 (6.05)  17.9 (3.97)  33.2 (4.23)
2      0.36 (0.07)  0.23 (0.03)  0.20 (0.02)    9.28 (2.76)  18.1 (2.66)  31.7 (2.09)
1.25   0.37 (0.04)  0.29 (0.03)  0.27 (0.02)    8.87 (1.59)  16.9 (1.48)  26.7 (1.18)
10% high
p/n    CNR/CERNN                                LW/CERNN
4      1.15 (0.20)  1.80 (0.12)  2.77 (0.13)    4.91 (0.53)  8.01 (0.42)  17.0 (0.66)
2      1.18 (0.11)  1.66 (0.08)  1.84 (0.06)    4.42 (0.30)  6.85 (0.28)  12.3 (0.41)
1.25   1.46 (0.08)  1.39 (0.05)  1.81 (0.05)    4.14 (0.21)  6.08 (0.20)  10.3 (0.29)
20% high
p/n    CNR/CERNN                                LW/CERNN
4      4.28 (0.70)  8.52 (0.86)  18.5 (1.06)    8.72 (0.88)  23.5 (1.62)  78.5 (3.32)
2      1.97 (0.16)  2.51 (0.13)  5.25 (0.16)    6.10 (0.37)  12.9 (0.53)  31.1 (1.00)
1.25   1.45 (0.07)  1.95 (0.07)  4.90 (0.15)    4.98 (0.20)  9.42 (0.31)  20.9 (0.61)
30% high
p/n    CNR/CERNN                                LW/CERNN
4      11.9 (1.90)  37.8 (3.17)  128 (6.03)     16.9 (1.74)  62.6 (3.49)  244 (7.32)
2      3.97 (0.36)  6.65 (0.34)  13.6 (0.31)    10.7 (0.67)  31.4 (1.24)  95.0 (2.22)
1.25   2.15 (0.10)  3.76 (0.12)  10.6 (0.30)    7.71 (0.36)  18.6 (0.63)  51.8 (1.56)
40% high
p/n    CNR/CERNN                                LW/CERNN
4      25.6 (3.18)  83.2 (5.06)  304 (10.7)     28.3 (2.54)  105 (4.60)   407 (10.9)
2      11.0 (1.10)  28.2 (1.61)  68.9 (3.48)    20.1 (1.23)  66.9 (2.28)  234 (4.74)
1.25   3.66 (0.22)  6.39 (0.14)  18.9 (0.27)    13.5 (0.68)  38.6 (1.05)  119 (1.62)
Table 1: Average ratio of quadratic loss of CNR and LW to that of CERNN. Blocks correspond to the fraction of high population eigenvalues (a single one, 10%, 20%, 30%, or 40%), rows to the three ratios p/n considered, and the three columns within each group to the three dimensions considered. Values less than 1 indicate better performance than CERNN. Standard deviations are in parentheses.
singleton
p/n    CNR/CERNN                                LW/CERNN
4      0.83 (0.11)  0.56 (0.05)  0.40 (0.03)    5.38 (2.24)  9.20 (2.10)  18.7 (2.61)
2      0.78 (0.07)  0.59 (0.03)  0.49 (0.02)    5.24 (1.63)  11.9 (2.08)  24.6 (2.17)
1.25   0.79 (0.04)  0.66 (0.03)  0.56 (0.02)    5.75 (1.15)  12.8 (1.37)  24.4 (1.30)
10% high
p/n    CNR/CERNN                                LW/CERNN
4      1.11 (0.12)  1.34 (0.07)  1.62 (0.06)    1.43 (0.11)  1.83 (0.09)  3.19 (0.11)
2      1.04 (0.07)  1.08 (0.04)  1.16 (0.03)    1.29 (0.07)  1.34 (0.05)  1.81 (0.05)
1.25   1.13 (0.05)  0.96 (0.03)  1.26 (0.03)    1.29 (0.05)  1.20 (0.03)  1.38 (0.03)
20% high
p/n    CNR/CERNN                                LW/CERNN
4      1.90 (0.19)  2.71 (0.15)  3.74 (0.10)    2.16 (0.19)  4.16 (0.20)  8.27 (0.17)
2      1.19 (0.06)  1.20 (0.05)  1.97 (0.03)    1.73 (0.10)  2.93 (0.10)  5.57 (0.11)
1.25   0.92 (0.03)  1.03 (0.02)  2.14 (0.04)    1.42 (0.05)  2.18 (0.06)  4.05 (0.07)
30% high
p/n    CNR/CERNN                                LW/CERNN
4      2.50 (0.20)  4.04 (0.17)  6.49 (0.15)    2.69 (0.17)  5.06 (0.17)  9.06 (0.15)
2      1.50 (0.08)  1.72 (0.06)  2.39 (0.02)    2.44 (0.12)  4.58 (0.12)  8.53 (0.10)
1.25   0.96 (0.03)  1.24 (0.02)  2.64 (0.02)    2.11 (0.10)  3.78 (0.10)  7.13 (0.07)
40% high
p/n    CNR/CERNN                                LW/CERNN
4      2.80 (0.17)  4.44 (0.13)  7.29 (0.13)    2.79 (0.15)  4.91 (0.12)  8.45 (0.12)
2      2.09 (0.11)  3.04 (0.11)  4.23 (0.12)    2.81 (0.13)  4.91 (0.12)  8.51 (0.09)
1.25   1.13 (0.05)  1.33 (0.02)  2.66 (0.02)    2.67 (0.11)  4.79 (0.09)  8.64 (0.08)
Table 2: Average ratio of entropy loss of CNR and LW to that of CERNN. Blocks correspond to the fraction of high population eigenvalues (a single one, 10%, 20%, 30%, or 40%), rows to the three ratios p/n considered, and the three columns within each group to the three dimensions considered. Values less than 1 indicate better performance than CERNN. Standard deviations are in parentheses.

6 Applications

Several common statistical procedures are potential beneficiaries of shrinkage estimation of sample covariance matrices. Here we illustrate how CERNN applies to discriminant analysis and clustering.

6.1 Discriminant Analysis

The classical discriminant function for class k incorporates the class mean, the common covariance Σ, and the prior probability of class k. A new observation is assigned to the class maximizing the discriminant function. If there are K classes, then the standard estimator of the common covariance Σ is the pooled sample covariance obtained by averaging the within-class scatter around the estimated class means.

One can obviously shrink this pooled estimate to moderate its eigenvalues. In quadratic discriminant analysis, a separate covariance matrix Σ_k is assigned to each class k. These are estimated in the usual way, and eigenvalue shrinkage is likely even more beneficial than in linear discriminant analysis. Friedman (1989) advocates regularized discriminant analysis (RDA), a compromise between linear and quadratic discriminant analysis that shrinks each Σ_k toward a common Σ via a convex combination. Although Friedman also suggests shrinking toward class-specific multiples of the identity matrix, we do not consider his more complicated version here. Guo et al. (2007) shrink covariance estimates towards the identity matrix and also apply lasso shrinkage on the centroids to obtain improved classification performance in microarray studies. The main difference between CERNN and these methods is that CERNN performs nonlinear shrinkage of the sample eigenvalues.
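As an illustration of how CERNN slots into quadratic discriminant analysis, here is a minimal R sketch with class covariances regularized by the hypothetical cernn() function defined earlier. The Gaussian quadratic discriminant score is standard; the training interface and parameter handling are ours.

```r
## Quadratic discriminant score for one class:
## log(prior) - 0.5*log det(Sigma) - 0.5*(x - mu)' Sigma^{-1} (x - mu).
qda_score <- function(x, mu, Sigma, prior) {
  r <- x - mu
  as.numeric(log(prior) -
    0.5 * as.numeric(determinant(Sigma, logarithm = TRUE)$modulus) -
    0.5 * crossprod(r, solve(Sigma, r)))
}

## Fit class means, priors, and CERNN-regularized covariances; return a classifier.
qda_cernn <- function(X, y, lambda, alpha) {
  classes <- sort(unique(y))
  fits <- lapply(classes, function(k) {
    Xk <- X[y == k, , drop = FALSE]
    list(mu    = colMeans(Xk),
         Sigma = cernn(cov(Xk), nrow(Xk), lambda, alpha),
         prior = nrow(Xk) / nrow(X))
  })
  function(x) classes[which.max(sapply(fits, function(f)
    qda_score(x, f$mu, f$Sigma, f$prior)))]
}
```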

Since we are primarily interested in the case where all or most of the predictors are instrumental in grouping, we consider only Friedman's method in a comparison on three data sets from the UCI machine learning repository (Bache and Lichman, 2013). In the case of the E. coli data set, we restricted analysis to the five most abundant classes. We split each data set into training and testing sets, using 1/5 of the data for training and 4/5 for testing. Table 3 records the number of samples per group in each set. In these data-poor examples, even linear discriminant analysis is not viable, since a common sample covariance estimate will be ill-conditioned if not singular. Nonetheless, our results show that the combination of separate covariances with regularization works well. We modeled a separate covariance for each class and used 5-fold cross-validation to select regularization parameters for CERNN. We followed essentially the procedure described in Section 3.1, except that we used the misclassification rate instead of the predictive negative loglikelihood. We also used 5-fold cross-validation to select the single shrinkage parameter of RDA (Friedman, 1989). The testing errors in Table 3 demonstrate that CERNN performs well in comparison with RDA. Even when it is not as accurate, its drop-off is small.

data     samples (train)    samples (test)       CERNN    RDA
wine     13/13/10           46/58/38
seeds    14/15/13           56/55/57
ecoli    30/17/7/3/9        113/60/28/17/43
Table 3: Comparison of CERNN and RDA on three data sets from the UCI machine learning repository. The second and third columns break down the number of samples per group in the training and testing sets. The last two columns report the classification success rate in the test set.

6.2 Covariance Regularized EM Clustering

We now show how CERNN stabilizes estimation in the standard EM clustering algorithm (McLachlan and Peel, 2000). Let φ(y | μ, Σ) denote a multivariate Gaussian density with mean μ and covariance Σ. EM clustering revolves around the mixture density

f(y) = ∑_{k=1}^K π_k φ(y | μ_k, Σ_k)

with parameters consisting of the mixture weights, the cluster means, and the cluster covariances. The weights π_k are nonnegative and sum to 1. We are given independent observations from the mixture density and wish to estimate the parameters by maximizing the loglikelihood. It is well known that the loglikelihood is unbounded from above and plagued by many suboptimal local extrema (McLachlan and Peel, 2000). Hathaway (1985) proposed constraints leading to a maximum likelihood problem free of singularities and beset by fewer local extrema. Later it was shown that these constraints could be met by imposing bounds on the largest and smallest eigenvalues of the cluster covariances (Ingrassia, 2004; Ingrassia and Rocci, 2007). These findings suggest that employing our Bayesian prior to estimate the cluster covariances can also improve the search for good local optima, since it shrinks the extreme sample eigenvalues toward the sample mean eigenvalue.

If w_{ik} is the indicator of the event that observation i comes from cluster k, then the complete data loglikelihood plus logprior decomposes into a term for the mixture weights plus a penalized Gaussian term for each cluster. Straightforward application of Bayes' rule yields the conditional expectation of w_{ik} given the data and the current parameter values. These weights should be subscripted by the current iteration number, but to avoid clutter we omit the subscripts. If we set each cluster's weighted sample size, weighted mean, and weighted scatter matrix according to these weights, then the EM updates of the mixture weights and cluster means take their usual weighted forms, and the update of each cluster covariance is the CERNN eigenvalue transformation applied to the weighted scatter matrix, with the weighted sample size playing the role of n.

We now address two practical issues. First, there is the question of how to choose the mixture constant α. In the previous examples we sought a stable estimate of a single covariance matrix. Here we seek several covariance matrices whose imputed data change from iteration to iteration. In accord with our previous choice of α, we match the prior mode for each cluster to the mean eigenvalue of that cluster's current weighted scatter matrix. This choice changes dynamically as the responsibilities change. Second, it is possible for the responsibilities of a given cluster to vanish for every observation at some point as the algorithm iterates. Indeed, finite machine precision may assign them the value 0, making the updates for the cluster's mean and covariance undefined. Consequently, we only update a cluster's mean and covariance when its weighted sample size is positive. This is not a major issue, however, since this scenario only arises when no data points should be assigned to the cluster.
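The full EM iteration can be sketched compactly in R, again reusing the hypothetical cernn() function. The responsibility and weighted-moment formulas are the standard Gaussian mixture EM updates; the per-cluster covariance update implements the shrinkage described above under our assumed form of the prior, and mvtnorm::dmvnorm supplies the Gaussian density. The initialization and the fixed α are simplifications of ours.

```r
library(mvtnorm)

## EM clustering with CERNN-regularized covariance updates (illustrative sketch).
em_cernn <- function(Y, K, lambda, alpha, iters = 100) {
  n <- nrow(Y)
  km <- kmeans(Y, K, nstart = 5)            # crude start; the paper uses k-means++
  mu <- km$centers
  Sigma <- replicate(K, cov(Y), simplify = FALSE)
  pi_k <- rep(1 / K, K)
  for (it in 1:iters) {
    ## E-step: responsibilities of each cluster for each observation
    dens <- sapply(1:K, function(k) pi_k[k] * dmvnorm(Y, mu[k, ], Sigma[[k]]))
    w <- dens / rowSums(dens)
    ## M-step: weighted moments, then CERNN shrinkage of each scatter matrix
    for (k in 1:K) {
      nk <- sum(w[, k])
      if (nk > 0) {
        mu[k, ] <- colSums(w[, k] * Y) / nk
        R  <- sweep(Y, 2, mu[k, ])
        Sk <- crossprod(sqrt(w[, k]) * R) / nk
        Sigma[[k]] <- cernn(Sk, nk, lambda, alpha)
      }
      pi_k[k] <- nk / n
    }
  }
  list(weights = pi_k, means = mu, covariances = Sigma, responsibilities = w)
}
```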

Besides the work of Ingrassia and Rocci (2007), similar approaches have been employed previously. Fraley and Raftery (2007) suggest restricted parameterizations of covariance matrices. While they offer a menu of parameterizations that cover a range of degrees of freedom, each model has a fixed number of degrees of freedom. One advantage of our model is that the degrees of freedom may be tuned continuously with the single strength parameter λ.

Figure 3 shows the results of clustering with our algorithm on a simulated data set. A total of 60 data points were generated from a mixture of 10 bivariate normals, corresponding to 59 parameters in the most general case. The number of observations per cluster ranged from 3 to 11. We set the number of clusters to 10 and considered several values of λ over a wide range (0.1, 10, 100, 10000). We could choose λ in a completely data-dependent way through cross-validation, but our main concern is to stabilize the procedure, so fine-tuning is not of paramount importance, especially when doing so complicates the procedure. We ran our algorithm 500 times using random initializations with the k-means++ algorithm (Arthur and Vassilvitskii, 2007) and kept the clustering that gave the greatest value of the expected loglikelihood.

The resulting clusterings are quite good over our broad range of λ values. It is notable that for three out of the four values of λ, even clusters 2 and 10, which overlap, are somewhat distinguished. The only missteps occur at the remaining value of λ, where cluster 1 is split into two clusters and clusters 2 and 10 have been merged. The latter decision is reasonable given how much clusters 2 and 10 overlap.

Figure 3: CERNN clustering projected onto the first two principal components of the data. Ellipses depict the first two eigenvectors (and their corresponding eigenvalues) of the estimated covariances of each cluster.

7 Discussion

The initial insight of Stein (1975) has led to several methods for shrinkage estimation of a sample covariance matrix. These methods preserve the eigenvectors of the sample covariance matrix while pushing it towards a multiple of the identity matrix. Our Bayesian prior does precisely this in a nonlinear fashion. Finite sample experiments show that CERNN and CNR complement each other. CNR performs better when only a few eigenvalues of the population covariance are very large. CERNN performs better when there is a more even mix of high and low eigenvalues. Both perform at least as well as, and often better than, the simple and asymptotically optimal LW estimator.

CERNN does require a singular value decomposition (SVD), as does CNR. Although highly optimized routines for accurately computing the SVD are readily available, such calculations are not cheap. Randomized linear algebra may provide computational relief (Halko et al., 2011; Mahoney, 2011). If one can tolerate a small loss in accuracy, the SVD of a randomly sampled subset of the data or a random projection of the data can give an acceptable surrogate SVD.
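As an illustration of the random-projection surrogate mentioned above, here is a minimal randomized SVD sketch in the spirit of Halko et al. (2011); the function name, the target rank k, and the oversampling amount are our choices.

```r
## Randomized SVD: project onto a sampled range, then take an exact SVD there.
randomized_svd <- function(A, k, oversample = 10) {
  Omega <- matrix(rnorm(ncol(A) * (k + oversample)), ncol(A))   # random test matrix
  Q <- qr.Q(qr(A %*% Omega))        # orthonormal basis for the sampled range
  B <- t(Q) %*% A                   # small matrix sharing the leading spectrum
  s <- svd(B, nu = k, nv = k)
  list(u = Q %*% s$u, d = s$d[1:k], v = s$v)
}
```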

Applications extend well beyond the classical statistical methods illustrated here. For example, in gene mapping with pedigree data, a covariance matrix is typically parameterized as a mixture of three components, one of which is the global kinship coefficient matrix capturing the relatedness between individuals in the study (Lange, 2002). The kinship matrix can be estimated from a high density SNP (single nucleotide polymorphism) panel rather than calculated from possibly faulty genealogical records. Because a typical study contains thousands of individuals typed at hundreds of thousands of genetic markers, this application involves covariance matrices whose dimension runs into the thousands. The construction of networks from gene co-expression data is another obvious genetic example (Horvath, 2011). Readers working in other application areas can doubtless think of other pertinent examples.

Acknowledgments

This research was supported by NIH (United States Public Health Service) grants GM53275 and HG006139.


References

  1. Arthur, D., Vassilvitskii, S., 2007. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. SODA ’07. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035.
  2. Bache, K., Lichman, M., 2013. UCI machine learning repository.
    URL http://archive.ics.uci.edu/ml
  3. Banerjee, O., El Ghaoui, L., d’Aspremont, A., Jun. 2008. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research 9, 485–516.
  4. Bickel, P. J., Levina, E., 2008. Regularized estimation of large covariance matrices. Annals of Statistics 36 (1), 199–227.
  5. Cai, T., Zhou, H., 2012. Minimax estimation of large covariance matrices under ℓ1 norm. Statistica Sinica 22, 1319–1378.
  6. Daniels, M. J., Kass, R. E., 2001. Shrinkage estimators for covariance matrices. Biometrics 57 (4), 1173–1184.
  7. Dey, D. K., Srinivasan, C., 1985. Estimation of a covariance matrix under Stein’s loss. Annals of Statistics 13 (4), 1581–1591.
  8. Fan, J., Liao, Y., Mincheva, M., 2011. High-dimensional covariance matrix estimation in approximate factor models. Annals of Statistics 39 (6), 3320–3356.
  9. Ferguson, T. S., 1996. A course in large sample theory. CRC Texts in Statistical Science Series. Chapman and Hall.
  10. Fraley, C., Raftery, A. E., Sep. 2007. Bayesian regularization for normal mixture estimation and model-based clustering. J. Classif. 24 (2), 155–181.
  11. Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), 432–441.
  12. Friedman, J. H., 1989. Regularized discriminant analysis. Journal of the American Statistical Association 84 (405), 165–175.
  13. Guo, Y., Hastie, T., Tibshirani, R., 2007. Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8 (1), 86–100.
  14. Haff, L. R., 1991. The variational form of certain Bayes estimators. The Annals of Statistics 19 (3), 1163–1190.
  15. Halko, N., Martinsson, P. G., Tropp, J. A., 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53 (2), 217–288.
  16. Hathaway, R. J., 1985. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Annals of Statistics 13 (2), 795–800.
  17. Hero, A., Rajaratnam, B., 2011. Large-scale correlation screening. Journal of the American Statistical Association 106 (496), 1540–1552.
  18. Hero, A., Rajaratnam, B., 2012. Hub discovery in partial correlation graphs. Information Theory, IEEE Transactions on 58 (9), 6064–6078.
  19. Horvath, S., 2011. Weighted Network Analysis. Applications in Genomics and Systems Biology. Springer, New York.
  20. Huang, J. Z., Liu, N., Pourahmadi, M., Liu, L., 2006. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93 (1), 85–98.
  21. Ingrassia, S., 2004. A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical Methods and Applications 13 (2), 151–166.
  22. Ingrassia, S., Rocci, R., 2007. Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Computational Statistics & Data Analysis 51 (11), 5339–5351.
  23. Khare, K., Rajaratnam, B., 2011. Wishart distributions for decomposable covariance graph models. Annals of Statistics 39, 514–555.
  24. Lange, K., 2002. Mathematical and Statistical Methods for Genetic Analysis, 2nd Edition. Statistics for Biology and Health. Springer-Verlag, New York.
  25. Ledoit, O., Wolf, M., 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88 (2), 365–411.
  26. Ledoit, O., Wolf, M., 2012. Nonlinear shrinkage estimation of large-dimensional covariance matrices. Annals of Statistics 40 (2), 1024–1060.
  27. Mahoney, M. W., Feb. 2011. Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3 (2), 123–224.
  28. Marčenko, V. A., Pastur, L. A., 1967. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik 1 (4), 507–536.
  29. McLachlan, G., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
  30. Mirsky, L., 1975. A trace inequality of John von Neumann. Monatshefte für Mathematik 79, 303–306.
  31. Oh, S.-Y., Rajaratnam, B., Won, J.-H., 2013. CondReg: Condition Number Regularized Covariance Estimation. R package version 0.16.
  32. Peng, J., Wang, P., Zhou, N., Zhu, J., 2009. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104 (486), 735–746.
  33. Pourahmadi, M., 2011. Covariance estimation: The GLM and regularization perspectives. Statistical Science 26 (3), 369–387.
  34. Pourahmadi, M., 2013. High-Dimensional Covariance Estimation: With High-Dimensional Data. Wiley.
  35. Pourahmadi, M., Daniels, M. J., Park, T., 2007. Simultaneous modelling of the Cholesky decomposition of several covariance matrices. Journal of Multivariate Analysis 98 (3), 568–587.
  36. Rajaratnam, B., Massam, H., Carvalho, C. M., 2008. Flexible covariance estimation in graphical Gaussian models. Annals of Statistics 36 (6), 2818–2849.
  37. Ravikumar, P., Wainwright, M. J., Raskutti, G., Yu, B., 2011. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics 5, 935–980.
  38. Rohde, A., Tsybakov, A. B., 2011. Estimation of high-dimensional low-rank matrices. Annals of Statistics 39 (2), 887–930.
  39. Sheena, Y., Gupta, A. K., 2003. Estimation of the multivariate normal covariance matrix under some restrictions. Statistics & Decisions 21 (4), 327–342.
  40. Stein, C., 1975. Estimation of a covariance matrix. 39th Annual Meeting of the Institute of Mathematical Statistics, Atlanta.
  41. Warton, D. I., 2008. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association 103 (481), 340–349.
  42. Won, J.-H., Lim, J., Kim, S.-J., Rajaratnam, B., 2012. Condition-number-regularized covariance estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
  43. Wu, W. B., Pourahmadi, M., 2003. Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90 (4), 831–844.