Gaussian and Robust Kronecker Product Covariance Estimation: Existence and Uniqueness
We study the Gaussian and robust covariance estimation, assuming the true covariance matrix to be a Kronecker product of two lower dimensional square matrices. In both settings we define the estimators as solutions to the constrained maximum likelihood programs. In the robust case, we consider Tyler’s estimator defined as the maximum likelihood estimator of a certain distribution on a sphere. We develop tight sufficient conditions for the existence and uniqueness of the estimates and show that in the Gaussian scenario with the unknown mean, samples are almost surely enough to guarantee the existence and uniqueness, where and are the dimensions of the Kronecker product factors. In the robust case with the known mean, the corresponding sufficient number of samples is .
keywords:Constrained covariance estimation, robust estimation, high-dimensional estimation, Kronecker product structure.
Covariance estimation is a fundamental problem in multivariate statistical analysis. It arises in diverse applications such as signal processing, where knowledge of the covariance matrix is unavoidable in constructing optimal detectors kelly1986adaptive, genomics, where it is widely used to measure correlations between gene expression values schafer2005shrinkage; xie2003covariance; hero2012hub, and functional MRI derado2010modeling. Most of the modern algorithms analyzing social networks are based on Gaussian Graphical Models lauritzen1996graphical, where the independences between the graph nodes are completely determined by the sparsity structure of the inverse covariance matrix banerjee2008model. In empirical finance, knowledge of the covariance matrix of stock returns is a fundamental question with implications for portfolio selection and for tests of asset pricing models such as the CAPM bai2011estimating; ledoit2003improved. Application of structured covariance matrices instead of Bayesian classifiers based on Gaussian mixture densities or kernel densities proved to be very efficient for many pattern recognition tasks, among them speech recognition, machine translation and object recognition in images dahmen2000structured. In geometric functional analysis and computational geometry adamczak2010quantitative the exact estimation of covariance matrix is necessary to efficiently compute volume of a body in high dimension. The classical problems of clustering and Discriminant Analysis are entirely based on precise knowledge of covariance matrices of the involved populations friedman1989regularized, etc.
In many modern applications, data sets are very large with both large number of samples and large dimension , often with , leading to the amount of unknown parameters greatly exceeding the number of observations. This high-dimensional regime naturally calls for exploiting or assuming additional structural properties of the data to reduce the number of estimated degrees of freedom. Usually, the specific structures are chosen to be linear or affine. The most popular examples include such models as Toeplitz snyder1989use; abramovich2007time; wiesel2013time; kavcic2000matrices; asif2005block; fuhrmann1991application; roberts2000hidden; bickel2008regularized; sun2015robust; soloveychik2014tyler, group symmetric shah2012group; soloveychik2014groups, sparse banerjee2008model; ravikumar2011high; rothman2008sparse, low rank fan2008high; johnstone2009consistency; lounici2014high and many others. Non-linear structures are also quite common in engineering applications. Among them are the Kronecker product model tsiligkaridis2013convergence; dutilleul1999mle; werner2008estimation; dawid1981some, linear and sparse inverse covariance structures such as graphical models friedman2008sparse; yuan2007model and others.
In this paper we focus on the Kronecker Product (KP) structure, which has recently become an extremely popular model for a variety of applications, such as MIMO wireless communications werner2007estimation, geostatistics cressie2015statistics, genomics yin2012model, multi-task learning bonilla2007multi, face recognition zhang2010learning, recommendation systems allen2010transposable, collaborative filtering yu2009large and many others. The KP model assumes a covariance matrix to be a KP of two lower dimensional square matrices, which is denoted by , where and are and dimensional positive definite matrices, respectively. Given , its factors and can only be determined up to a positive scalar. This natural ambiguity is usually treated by fixing scaling of one of the factors as we do below.
Consider the Gaussian setting and assume we are given independent and identically distributed (i.i.d.) dimensional real vector measurements , where
Assume the mean is known, while the covariance is to be estimated. If the number of samples is not less than the ambient dimension, , the Maximum Likelihood Estimator (MLE) of the covariance parameter almost surely exists and coincides with the Sample Covariance Matrix (SCM),
When the prior knowledge suggests that the true covariance matrix is of the KP structure, it is usually more convenient to cut into columns of height each to obtain a so-called matrix normal random variable dutilleul1999mle; dawid1981some; gupta1999matrix. Following gupta1999matrix, we denote this law by
where is obtained from by the same reshaping procedure. Assume we are given i.i.d. matrix samples as in (3), and want to estimate the covariance matrix factors and . Here, the MLE solution is no longer given by an explicit formula as in (2), moreover, the resulting optimization program is non-convex due to the constraint. Luckily, there exists an alternating optimization approach, which is usually adopted dutilleul1999mle; lu2005likelihood; lu2004likelihood; werner2008estimation. This algorithm is often referred to as the Flip-Flop (FF) due to the symmetric updates of the estimates of and it produces. Below we show that the obtained constrained program becomes convex under a specific change of metric over the set of positive definite matrices, the so-called geodesic metric wiesel2012geodesic; wiesel2012unified, naturally explaining the convergence of the FF and significantly helping to further explore the optimization at hand. We refer to this iterative algorithm as the Gaussian FF (GFF) to distinguish it from another FF scheme introduced later.
In many real world applications the underlying multivariate distribution is actually non-Gaussian and robust covariance estimation methods are required. This occurs whenever the distribution of the measurements is heavy-tailed or a small proportion of the samples exhibits outlier behavior huber1964robust; maronna1976robust. Probably the most common extension of the Gaussian family of distributions allowing for treating heavy-tailed populations is the class of elliptically shaped distributions frahm2004generalized. Elliptical populations served as the basis for defining a family of the so-called covariance -estimators maronna1976robust, of which we focus on Tyler’s estimator tyler1987distribution. Given samples Tyler’s covariance matrix estimator is defined as the solution to the fixed point equation
When are i.i.d. Generalized Elliptically (GE) distributed frahm2004generalized, their shape matrix is positive definite and , Tyler’s estimator exists with probability one and is a consistent estimator of up to a positive scaling factor. The GE family includes as particular cases the generalized Gaussian, the compound Gaussian, the elliptical distributions and many others frahm2004generalized. Therefore, Tyler’s estimator has been successfully used to replace the SCM in many applications sensitive to outliers or heavy-tailed noise, such as anomaly detection in wireless sensor networks chen2011robust, antenna array processing ollila2003robust, and radar detection abramovich2007time; ollila2012complex; bandiera2010knowledge; pascal2008covariance.
It was recently demonstrated that Tyler’s estimator can be viewed as an MLE of a certain spherical distribution wiesel2012geodesic; greco2013cramer; soloveychik2014non. In spite of the fact that the obtained MLE program is not convex, it was later shown to become convex under the geodesic metric change we mentioned above wiesel2012geodesic; wiesel2012unified. Both these fundamental discoveries paved a way to the creation of a very useful natural optimization framework characterizing Tyler’s estimator, which made possible definition of structured analogs of Tyler’s estimator under geodesically convex constraints soloveychik2014groups; sun2015robust; wiesel2015structured. The present article extensively uses this new framework to study the existence and uniqueness of the KP constrained Tyler’s MLE and the convergence properties of the Robust Flip-Flop (RFF) analog of the GFF algorithm obtained from it. Another very popular in engineering applications example of a linear geodesically convex structure is the so-called group symmetry shah2012group. Interestingly, a very recent paper soloveychik2014groups utilized the aforementioned optimization methodology to thoroughly investigate the group symmetric Tyler’s estimator (STyler) and its performance benefits. It is important to note that multiple geodesically convex constraints can be efficiently superimposed when the underlying physics suggests such prior knowledge, e.g quite often in practice the KP structure is followed by group symmetries, leading to a further decrease in the number of estimated degrees of freedom.
In both Gaussian and robust cases, one of the central questions in high-dimensional environment is: What is the minimal number of samples guaranteeing the existence and uniqueness of the corresponding covariance MLE? As we have already mentioned, in the unconstrained Gaussian MLE it is known that samples are enough to guarantee the existence and uniqueness almost surely when the mean is known, and when the mean is unknown. This number is, of course, enough in the constrained case as well, however, one would expect that this threshold can be reduced due to the decrease in the estimated number of parameters. Different necessary and sufficient conditions on the number of samples in the Gaussian KP scenario were proposed by a large number of works, see dutilleul1999mle; srivastava2008models; lu2004likelihood; werner2008estimation; ros2016existence and references therein. In particular, in dutilleul1999mle it was claimed that the number of samples needed to guarantee both the existence and uniqueness of the GFF solution in the unknown mean case, equals . Later, srivastava2008models showed that, in fact, matrix valued measurements are required to guarantee the uniqueness, assuming the estimator exists. In ros2016existence the authors showed by a few simple counterexamples that both results from dutilleul1999mle and srivastava2008models are not correct. Instead, they claimed that “As yet, there do not seem to be existence results for the case ,” therefore, leaving this question open.
Unlike the Gaussian setting, in the robust scenario the mean is usually assumed to be known. To the best of our knowledge, in this case the question of the minimal number of samples needed to ensure existence and uniqueness was not properly addressed thus far, except for the trivial necessary condition stemming from the definition of RFF, as shown below.
The main goal of this paper is to present tight thresholds for the necessary and sufficient conditions guaranteeing existence and uniqueness in both Gaussian and robust cases. Namely, we show that in the Gaussian setting, when the mean is not known, if , the estimator, even if it exists, is not unique with probability one and, when , the solution exists and is unique almost surely. We also provide a discussion explaining that between these bounds the probabilities of existence and non-existence of a unique solution are positive. In the robust case with the mean known, the threshold is . More specifically, if is less than this number, no unique solution exists, while if is greater than this value, the estimator almost surely exists and is unique.
The rest of the text is organized as follows. After we introduce notations, we define the Gaussian setting and formulate the problem. Then we discuss the state of the art results concerning the necessary and sufficient conditions for the existence and uniqueness. Section 4 presents our main result, shows the main idea of the proof and demonstrates it on a simple two dimensional example. Then we treat the robust scenario and provide our main contribution there. Finally, Sections LABEL:GaussProofSec and LABEL:RobustProofSec contain all the proofs for the Gaussian and robust cases, correspondingly.
We deal with real Euclidean spaces denoted by , whose elements are columns written as lower case bolds . The standard scalar products over such spaces are denoted by and the corresponding norms by . By we denote the Euclidean space of real matrices, written as upper case bolds . stands for the linear space of symmetric matrices and - for the open cone of positive definite matrices inside it. Note that inherits from the natural structure of a Euclidean space with the Frobenius norm. denotes the identity matrix of an appropriate dimension. For two spaces and denotes their tensor product space and for two matrices denotes their Kronecker product. The spectral norm of a matrix is denoted by and its determinant by . Given a set , its boundary is denoted by . For two sets and denotes their Cartesian (direct) product. Given a subset of a linear space, denotes its span and - its cardinality. We use standard abbreviation a.s. to denote the almost sure convergence when the measure can be inferred from the context. The symbol replaces saying “is distributed identically to”.
2 Gaussian Setting and Problem Formulation
Assume we are given i.i.d. Gaussian matrix samples
where and . Denote , then, up to an additive constant and scaling, the negative log-likelihood reads as
and is defined over the set , with
Here, the matrix is identified with the positive operator acting by the rule . The scalar product on is given by . Note that can be identified with the set
where the specific normalization can be chosen arbitrarily.
Below we assume the following notational convention: when the set is viewed as a subspace of as in (7), the arguments of the negative log-likelihood are written as , however, when is identified with a subset of defined by (8), we write the arguments as , with
Below we use the same rule for other similar functions and the specific representation of the underlying set can be inferred from the way arguments are written.
Identification (8) allows us to consider elements (under proper normalization if needed) as pairs . In addition, it endows with a smooth manifold structure making a smooth function over it. The covariance MLE under the KP constraint can now be written as a solution to the following program
As in the unconstrained Gaussian case, this program decouples into minimization w.r.t. (with respect to) the unknown mean , yielding
and minimization w.r.t. to and . Note that , and denote , then the first-order optimal conditions for and read as
There does not exist a closed form analytic solution to (12), therefore, it is usually solved numerically via the so-called Flip-Flop (FF) iterative scheme dutilleul1999mle, which we call the Gaussian FF (GFF). The GFF algorithm works as follows. Starting from an initial guess for , we plug it into the right-hand side of (12) and get a new pair . After we normalize this pair to make it formally belong to , we apply the procedure to instead and so on as shown in diagram (LABEL:ff-diag),
The consecutive pairs serve as approximations to the true solution, therefore, the convergence of this sequence to the minimum of on (if it exists) constitutes one of the central topics of our paper. We start form listing the existing results on the questions at hand.
3 Existence, Uniqueness and Convergence: State of the Art
Having derived the G-CARMEL (Gaussian KRonecker product MLE) solution and obtained an iterative scheme for its calculation, our next goal is to determine the necessary and sufficient conditions for its existence and uniqueness and for the convergence of the GFF procedure. The only parameter under our control is the required number of i.i.d. normal samples, therefore, below we focus on the question: How many measurements one needs to guarantee existence, uniqueness and convergence almost surely?
Existence. We start from the sufficient conditions. It was claimed in dutilleul1999mle that samples are needed for the existence and uniqueness of the MLE solution in the Gaussian case. However, it was later shown by a counterexample ros2016existence that the uniqueness does not follow from this condition. In addition, the authors of ros2016existence write that “Moreover, it is not known whether it [this condition] guaranties existence, because it is not sufficient to show that all updates of the FF algorithm have full rank as is done in dutilleul1999mle. It could still happen that the sequence of updates converges (after infinitely many steps) to a Kronecker product that does not have a full rank with the likelihood converging to its supremum.” It is also claimed in ros2016existence that no less than samples are required to ensure the existence a.s. This number of measurements coincides with the one needed in the unconstrained case and does not explore the KP structure. Finally, the authors of ros2016existence conclude that nothing can be said regarding the existence, if the number of samples lies inside the interval .
The necessary conditions were also treated in dutilleul1999mle, where the author claims that if the estimator exists, then . This is clearly true, since if the number of samples is less than this threshold, at least one of the right-hand sides in (12) is rank deficient and cannot be invertible.
Uniqueness. As summarized in ros2016existence, the author of dutilleul1999mle claims that the G-CARMEL is unique whenever . Later, the authors of srivastava2008models stated that indeed is needed to ensure the uniqueness. Here again, ros2016existence succeeded to find counterexamples showing that both these bounds do not guarantee uniqueness. Moreover, the paper ros2016existence describes the exact parts of the proofs which seem to contain mistakes, however, the correct lower bounds on the number of required samples are not provided. In fact, to the best of our knowledge, tight sufficient conditions for the uniqueness have not been reported so far.
Convergence of the Gaussian Flip-Flop Algorithm. The last question regarding the G-CARMEL on which we focus, is the convergence of the GFF iterative scheme. In dutilleul1999mle the author establishes the convergence of the GFF technique empirically. He claims that if the limiting points of the sequences and (in his notations) do not depend on the initial point, and an additional condition on the second derivatives of the objective is satisfied at the limiting points, then these limits provide the G-CARMEL solution. If such limiting points are not uniquely determined, but rather depend on the initial guesses, they must provide local extrema of the likelihood function. Unfortunately, this empirical approach can hardly be applied in practice and does not provide a strict criterion for the convergence of the GFF.
The authors of lu2005likelihood; lu2004likelihood claim that when the number of samples is , the GFF is guarantied to converge, however they doubt if it really converges to the MLE, since the “parameter space of -separable covariance matrices is not convex”. They emphasize that for some values of the algorithm can converge to many different estimates, depending on the starting value. Finally, they conjecture that for large enough “the limit point of the GFF can safely be regarded as the unique MLE” without proving this statement.
In werner2008estimation theoretical asymptotic properties of the GFF algorithm are considered and the algorithm’s performance for small sample sizes is investigated with a simulation study.
The main contributions of the Gaussian part of our paper consist in
proving tight sufficient and necessary conditions for the a.s. existence and uniqueness of the G-CARMEL estimate,
showing that the sufficient conditions also imply convergence of the GFF iterations to the unique solution starting from any initial guess.
4 Main Results and Arguments
In this section we state our main result in the normal case, give the intuition behind the proof argument and demonstrate our technique on a simple example in a low dimension.
4.1 The Main Statement
Assume are independently sampled from a continuous distribution over and consider the problem of minimizing over , then
if , there is no unique minimum,
if , there is a unique minimum a.s.,
if , the GFF converges starting from any point of to this unique minimum a.s.
This is a direct corollary of Theorem LABEL:theorem:continuityExpVar from Section LABEL:main_th_pr_sec. ∎
The statement of the theorem is valid for any continuous distribution and is not limited to the Gaussian ones. Indeed, the claim does not assume any specific statistical model and does not provide statistical guaranties (e.g. consistency), but rather treats the questions of the existence and uniqueness of the minimum.
Note the gap between items 1) and 2) containing one (when ) or two (when ) integer points which cannot be eliminated. We discuss this phenomenon below in more detail.
4.2 Sketch of the The Proof
In this section we discuss the main building blocks of the proof of Theorem 1 leaving the technical details to Section LABEL:GaussProofSec.
Reduction to the Centered Case. Let
then minimization of over does not require optimization w.r.t. the mean parameter. The general case with the unknown expectation can be reduced to it through the following observation. Given a family of random matrices, there always exists another family of random matrices, such that
Lemma LABEL:lemma:continuityExpVar from Section LABEL:GaussProofSec shows why this is true and justifies our transition to the zero mean case. In the remainder of this section we treat the zero mean setting.
Necessary Conditions. Since we require the solution to be composed of invertible matrices and , (12) must hold at the extremum point. Note that its right-hand side is not invertible for , therefore, returning to the non-centered case and compensating for this by adding one sample, yields item 1) of Theorem 1.
Sufficient Conditions. To derive the sufficient conditions, in Section LABEL:GaussProofSec we change the parametrization of by
and introduce a specific metric over , w.r.t. which the set and the function are convex. The desired solution exists and is unique if and only if continuously tends to on the boundary as shown in Lemma LABEL:lemma:hatgProp, in which case it is also strictly convex. Theorem LABEL:theorem:continuity then demonstrates that this happens a.s. w.r.t. the distribution of when . In the next section we demonstrate the reasoning behind these claims by exploiting the case in more detail.
Convergence of the GFF. Suppose we are given a pair of matrices and use (13) to generate the sequence