
# Distributionally Robust Formulation and Model Selection for the Graphical Lasso

Pedro Cisneros-Velarde, Sang-Yun Oh, Alexander Petersen
Center for Control, Dynamical Systems and Computation,
Department of Statistics and Applied Probability,
University of California, Santa Barbara
###### Abstract

Building on a recent framework for distributionally robust optimization, we consider estimation of the inverse covariance matrix for multivariate data. We provide a novel notion of a Wasserstein ambiguity set specifically tailored to this estimation problem, leading to a tractable class of regularized estimators. Special cases include penalized likelihood estimators for Gaussian data, specifically the graphical lasso estimator. As a consequence of this formulation, the radius of the Wasserstein ambiguity set is directly related to the regularization parameter in the estimation problem. Using this relationship, the level of robustness of the estimation procedure can be shown to correspond to the level of confidence with which the ambiguity set contains a distribution with the population covariance. Furthermore, a unique feature of our formulation is that the radius can be expressed in closed-form as a function of the ordinary sample covariance matrix. Taking advantage of this finding, we develop a simple algorithm to determine a regularization parameter for graphical lasso, using only the bootstrapped sample covariance matrices, meaning that computationally expensive repeated evaluation of the graphical lasso algorithm is not necessary. Alternatively, the distributionally robust formulation can also quantify the robustness of the corresponding estimator if one uses an off-the-shelf method such as cross-validation. Finally, we numerically study the obtained regularization criterion and analyze the robustness of other automated tuning procedures used in practice.

## 1 Introduction

In statistics and machine learning, the covariance matrix $\Sigma$ of a random vector $X \in \mathbb{R}^d$ is a fundamental quantity for characterizing marginal pairwise dependencies between variables. Furthermore, the inverse covariance matrix $\Omega = \Sigma^{-1}$, with entries $\omega_{ij}$, provides information about the conditional linear dependency structure between the variables. For example, in the case that $X$ is Gaussian, $\omega_{ij} = 0$ if and only if the $i$th and $j$th variables of $X$ are conditionally independent given the rest. Such relationships are of interest in many applications such as environmental science, biology, and neuroscience (Guillot et al., 2015; Huang et al., 2010; Krumsiek et al., 2011), and have given rise to various statistical and machine learning methods for inverse covariance estimation.

Given an independent sample $X_1, \dots, X_n$, the sample covariance $A_n$, which is the maximum likelihood estimator of $\Sigma$ if $X$ is Gaussian, can be a poor estimator of $\Sigma$ unless $d/n$ is very small. Driven by the so-called high-dimensional setting $d > n$, in which $A_n$ is not invertible, regularized estimation of the precision matrix has gained significant interest (Cai et al., 2011; Edwards, 2010; Friedman et al., 2007; Khare et al., 2015; Won et al., 2012; Yuan and Lin, 2007). Such regularization procedures are useful even when $A_n$ is a stable estimate (i.e., positive definite with small condition number), since the inverse $A_n^{-1}$ is dense and will not reflect the sparsity pattern of $\Omega$.

Distributionally Robust Optimization: Let $\mathcal{S}^d$ be the set of $d \times d$ symmetric matrices, and $\mathcal{S}^d_{++} \subset \mathcal{S}^d$ be the subset of positive definite symmetric matrices. Given a loss function $l(K; X)$ for $K \in \mathcal{S}^d_{++}$ and $X \in \mathbb{R}^d$, a classical approach would be to estimate $\Omega$ by minimizing the empirical loss over $\mathcal{S}^d_{++}$, perhaps including a regularization or penalty term. The error in this estimate arises from the discrepancy between the true data-generating distribution and the observed training samples, and can be assessed by various tools such as concentration bounds or rates of convergence. In contrast, distributionally robust optimization (DRO) is a technique that explicitly incorporates uncertainty about the distribution of the $X_i$ into the estimation procedure. For an introduction to the general topic of DRO, we refer to the works (Shafieezadeh-Abadeh et al., 2015; Blanchet and Murthy, 2016; Shafieezadeh-Abadeh et al., 2019) and the references therein. In the context of inverse covariance estimation, a distributionally robust estimate of $\Omega$ is obtained by solving

$$\inf_{K \in \mathcal{S}^d_{++}} \sup_{P \in \mathcal{U}} E_P[l(K; X)], \tag{1}$$

where $\mathcal{U}$, known as an ambiguity set, is a collection of probability measures on $\mathbb{R}^d$. As pointed out in (Blanchet and Murthy, 2016), a natural choice for $\mathcal{U}$ is the neighborhood $\{P : D(P, P_0) \le \delta\}$, with $P_0$ being a chosen baseline model, $\delta > 0$ being some tolerance level which defines the uncertainty size of the ambiguity set, and $D$ being some discrepancy metric between two probability measures. In a practical setting, we have access to some samples (or data points) from the unknown distribution, and thus, a good candidate for the baseline model is the empirical measure.

Very recent work by Nguyen et al. (2018), also analyzed by Blanchet and Si (2019), used the DRO framework to construct a new regularized (dense) inverse covariance estimator. Working under the assumption that $X$ is Gaussian, the authors construct an ambiguity set of Gaussian distributions that, up to a certain tolerance level, are consistent with the observed data. Recent work on DRO in other machine learning problems has revealed explicit connections to well-known regularized estimators, specifically regularized logistic regression (Shafieezadeh-Abadeh et al., 2015) and the square-root lasso for linear models (Blanchet et al., 2016); however, such a connection to the regularized sparse inverse covariance estimators used in practice has yet to be made.

The Graphical Lasso: One of the most common methods to recover the sparsity pattern in $\Omega$ is to add an $\ell_1$-regularization term to the Gaussian likelihood function, motivated by the Gaussian Graphical Model (GGM). A sparse estimate of $\Omega$ is produced by minimizing

$$L_\lambda(K) = \frac{1}{n}\sum_{i=1}^n X_i^\top K X_i - \log|K| + \lambda \sum_{i=1}^d \sum_{j=1}^d |k_{ij}|, \tag{2}$$

where $k_{ij}$ is the $(i,j)$ entry of $K$ and $\lambda \ge 0$ is a user-specified regularization parameter (Banerjee et al., 2008; Friedman et al., 2007; Yuan and Lin, 2007). Although several algorithms exist to minimize this objective function (Friedman et al., 2007; Rolfs et al., 2012; Hsieh et al., 2014), the minimizer of (2) is often referred to as the graphical lasso estimator (Friedman et al., 2007). The first two terms of (2) are related to Stein's loss (James and Stein, 1961) when evaluated at the empirical measure, and also correspond to the negative log-likelihood up to an additive constant if $X$ is Gaussian. The performance of the graphical lasso estimator in high-dimensional settings has been investigated (Rothman et al., 2008; Jankova and van de Geer, 2018), as well as modifications and extensions that implement some notion of robustness, i.e., making it robust to outliers or relaxing the normality assumptions on the data (Khare et al., 2015; Lam and Fan, 2009; Loh and Tan, 2018; Xue and Zou, 2012; Yang and Lozano, 2015).
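As a concrete illustration, the objective (2) can be minimized with off-the-shelf software. The sketch below uses scikit-learn's `GraphicalLasso`, whose `alpha` parameter plays the role of $\lambda$; note that, unlike (2), scikit-learn penalizes only the off-diagonal entries of $K$, a common variant. The data-generating choices here are illustrative, not those of the paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Illustrative sparse precision matrix: identity plus one off-diagonal pair.
d = 5
Omega = np.eye(d)
Omega[0, 1] = Omega[1, 0] = 0.4
Sigma = np.linalg.inv(Omega)

# Zero-mean Gaussian sample with covariance Sigma.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=500)

# alpha plays the role of lambda in (2); larger alpha gives a sparser estimate.
model = GraphicalLasso(alpha=0.1).fit(X)
K_hat = model.precision_  # regularized estimate of Omega
```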

Besides its theoretical relevance, the graphical lasso and its extensions also enjoy many practical advantages. For example, it has been used as a network inference tool. In these applications, the precision matrix can indicate which nodes in a network are conditionally independent given information from remaining nodes, thus giving an indication of network functionality. This has been important in neuroscience applications when studying the inference of brain connectivity (Yang et al., 2015; Smith et al., 2011; Huang et al., 2010). Applications in gene regulatory networks and metabolomics have also been reported (Menéndez et al., 2010; Sulaimanov et al., 2018; Krumsiek et al., 2011).

The performance of the graphical lasso estimator hinges critically on the choice of $\lambda$. While there have been studies on how to properly tune $\lambda$ to obtain a consistent estimator or to establish correct detection of nonzero elements in the precision matrix (Rothman et al., 2008; Banerjee et al., 2008; Mazumder and Hastie, 2012), in practice, this selection is often made through automated methods like cross-validation.
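Such automated selection is also available off the shelf; a sketch with scikit-learn's `GraphicalLassoCV`, which selects `alpha` over a grid by held-out log-likelihood (one of several possible cross-validation criteria), on illustrative data:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))  # illustrative zero-mean data

# 5-fold cross-validation over an internally chosen grid of alphas.
cv_model = GraphicalLassoCV(cv=5).fit(X)
alpha_cv = cv_model.alpha_  # regularization level chosen by CV
```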

Contributions: In this paper, we propose a distributionally robust reformulation of the graphical lasso estimator in (2). Following Shafieezadeh-Abadeh et al. (2015); Blanchet et al. (2016); Esfahani and Kuhn (2018), we utilize the Wasserstein metric to quantify distributional uncertainty for the construction of the ambiguity set. The following points summarize our main contributions.

• We formulate a class of DRO problems for inverse covariance estimation, leading to a tractable class of $\ell_p$-norm regularized estimators. As the graphical lasso estimator (2) is a special case, this provides a new interpretation of this popular technique. This DRO formulation is made possible by a novel type of ambiguity set, defined as a collection of measures on matrices. This nontrivial adaptation is necessary because a direct generalization of other DRO approaches using vector-valued data (e.g., Shafieezadeh-Abadeh et al. (2019)) does not result in a closed-form regularization problem, and thus does not provide the desired connection to the graphical lasso.

• We use this formulation to suggest a criterion for the selection of the regularization parameter in the estimation problem in the classical regime $n > d$. This criterion follows the Robust Wasserstein Profile (RWP) inference recently introduced by Blanchet et al. (2016), which makes no assumption of normality on the data, and which we tailor to our specific problem. The proposed criterion expresses the regularization parameter as an explicit function of the sample covariance $A_n$, unlike other instances where RWP has been implemented, which rely on stochastic dominance arguments.

• We formulate a novel robust selection (RobSel) algorithm for regularization parameter choice. Focusing on the graphical lasso, we provide numerical results that compare the performance of cross-validation and our proposed algorithm for the selection of the regularization term.

The paper is organized as follows. In Section 2 we describe our main theoretical result: the distributionally robust formulation of regularized inverse covariance (log-likelihood) estimation, of which graphical lasso is a particular instance. In Section 3 we propose a criterion for choosing the regularization parameter inspired by this formulation and outline the bootstrap-based RobSel algorithm for its computation. In Section 4 we present numerical results comparing the proposed criterion of Section 3 with cross-validation. Finally, we state some concluding remarks and future research directions in Section 5. All proofs of theoretical results can be found in the supplementary material.

## 2 A Distributionally Robust Formulation of the Graphical Lasso

First, we provide preliminary details on notation. Given a matrix $B$, $b_{ij}$ denotes its $(i,j)$ entry and $\mathrm{vec}(B)$ denotes its vectorized form, which we assume to be in row-major fashion. For matrices denoted by Greek letters, entries are simply denoted by the appropriate subscripts, e.g., $\omega_{ij}$ for $\Omega$. The operator $|\cdot|$, when applied to a matrix, denotes its determinant; when applied to a scalar or a vector, it denotes the absolute value or entry-wise absolute value, respectively. The $q$-norm of a vector is denoted by $\|\cdot\|_q$. We use the symbol $\Rightarrow$ to denote convergence in distribution.

Recall that $X$ is a zero-mean random vector with covariance matrix $\Sigma$. Let $Q_0$ be the probability law of $X$ and $\Omega = \Sigma^{-1}$ be the precision matrix. Define the graphical loss function as

$$l(X; K) = X^\top K X - \log|K| = \mathrm{trace}(K X X^\top) - \log|K|. \tag{3}$$

Then $E_{Q_0}[l(X;K)]$ is a convex function of $K$ over the convex cone $\mathcal{S}^d_{++}$. Using the first-order optimality criterion, we observe that $K = \Omega$ sets the gradient $\Sigma - K^{-1}$ equal to the zero matrix (see (Boyd and Vandenberghe, 2004, Appendix A) for details on this differentiation). Hence,

$$\operatorname*{argmin}_{K \in \mathcal{S}^d_{++}} E_{Q_0}[l(X;K)] = \Omega,$$

so that (3) is a consistent loss function.

Now, if we consider an iid random sample $X_1, \dots, X_n$ with empirical measure $Q_n$, then

$$\operatorname*{argmin}_{K \in \mathcal{S}^d_{++}} E_{Q_n}[l(X;K)] = \operatorname*{argmin}_{K \in \mathcal{S}^d_{++}} \frac{1}{n}\sum_{i=1}^n l(X_i; K) = A_n^{-1},$$

with $A_n = n^{-1}\sum_{i=1}^n X_i X_i^\top$. Thus, as described in the Introduction, a natural approach would be to implement the DRO procedure outlined in Esfahani and Kuhn (2018) by building an ambiguity set based on perturbations of $Q_n$, leading to the DRO estimate given by (1). However, this approach does not convert (1) into a regularized estimation problem as desired, since the inner supremum cannot be given explicitly in closed form. For more details, see Section A in the supplementary material.
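The empirical minimizer $A_n^{-1}$ above can be verified numerically. A small sketch (dimensions and data illustrative) checks that perturbing $K = A_n^{-1}$ in a symmetric direction, while staying positive definite, never decreases the empirical graphical loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 200
X = rng.standard_normal((n, d))
A = X.T @ X / n  # sample covariance A_n for zero-mean data

def graphical_loss(K, A):
    # Empirical graphical loss: trace(K A_n) - log|K|.
    return np.trace(K @ A) - np.linalg.slogdet(K)[1]

K_star = np.linalg.inv(A)
base = graphical_loss(K_star, A)

# Small symmetric perturbations (small enough to keep K positive definite)
# never improve on the candidate minimizer A_n^{-1}.
D = rng.standard_normal((d, d))
D = (D + D.T) / 2
for eps in (0.05, -0.05):
    assert graphical_loss(K_star + eps * D, A) >= base
```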

As an alternative, let $P_0$ represent the measure of the random matrix $W = XX^\top$ on $\mathcal{S}^d$ induced by $Q_0$, and similarly let $P_n$ be the empirical measure of the sample $W_i = X_i X_i^\top$, $i = 1, \dots, n$. Redefining the graphical loss function as

$$l(W; K) = \mathrm{trace}(KW) - \log|K|, \tag{4}$$

then

$$\Omega = \operatorname*{argmin}_{K \in \mathcal{S}^d_{++}} E_{P_0}[l(W;K)], \qquad A_n^{-1} = \operatorname*{argmin}_{K \in \mathcal{S}^d_{++}} E_{P_n}[l(W;K)].$$

This observation leads to a tractable DRO formulation by constructing ambiguity sets built around the empirical measure $P_n$. The DRO formulation for inverse covariance estimation becomes

$$\min_{K \in \mathcal{S}^d_{++}} \sup_{P:\, D_c(P, P_n) \le \delta} E_P[l(W;K)]. \tag{5}$$

The ambiguity set in this formulation is specified by the collection of measures $\{P : D_c(P, P_n) \le \delta\}$, which we now describe. Given two probability distributions $P_1$ and $P_2$ on $\mathcal{S}^d$ and some transportation cost function $c$ (which we will specify below), we define the optimal transport cost between $P_1$ and $P_2$ as

$$D_c(P_1, P_2) = \inf\left\{ E_\pi[c(U,V)] \,\middle|\, \pi \in \mathcal{P}(\mathcal{S}^d \times \mathcal{S}^d),\ \pi_U = P_1,\ \pi_V = P_2 \right\}, \tag{6}$$

where $\mathcal{P}(\mathcal{S}^d \times \mathcal{S}^d)$ is the set of joint probability distributions of $(U, V)$ supported on $\mathcal{S}^d \times \mathcal{S}^d$, and $\pi_U$ and $\pi_V$ denote the marginals of $U$ and $V$ under $\pi$, respectively. In this paper, we are interested in cost functions

$$c(U, V) = \big\|\mathrm{vec}(U) - \mathrm{vec}(V)\big\|_q^{\rho}, \tag{7}$$

with $\rho \ge 1$ and $q \ge 1$. As pointed out by Blanchet et al. (2016), the resulting optimal transport cost is the Wasserstein distance of order $\rho$. Our first theoretical result demonstrates that the optimization in (5) corresponds to a class of regularized estimators under the graphical loss function (4).

###### Theorem 2.1 (DRO formulation of regularized inverse covariance estimation).

Consider the cost function in (7) for fixed $\rho \ge 1$ and $q \ge 1$. Then,

$$\min_{K \in \mathcal{S}^d_{++}} \sup_{P:\, D_c(P,P_n) \le \delta} E_P[l(W;K)] = \min_{K \in \mathcal{S}^d_{++}} \left\{ \mathrm{trace}(K A_n) - \log|K| + \delta^{1/\rho} \big\|\mathrm{vec}(K)\big\|_p \right\}, \tag{8}$$

where $p$ is the Hölder conjugate of $q$, i.e., $1/p + 1/q = 1$.

Theorem 2.1 provides a mapping between the regularization parameter $\lambda = \delta^{1/\rho}$ and the uncertainty size $\delta$ of the ambiguity set in the DRO formulation. The regularization problem then reduces to determining a good criterion for choosing $\delta$, which we explore in Section 3. Moreover, we obtain the graphical lasso formulation (2) by setting $q = \infty$ (so that $p = 1$) in (7). From (8), a smaller ambiguity set introduces less robustness into the estimation problem by reducing the weight of the regularization term. Conversely, a larger regularization term enlarges the set of distributions inside the ambiguity set, and thus increases the robustness.

###### Remark 2.2.

The ambiguity set used in (8) makes no assumption of normality on the distribution of the samples $X_i$. Thus, (8) tells us that adding an $\ell_p$ penalization on the precision matrix confers robustness with respect to the distributions that the samples may follow, which need not be Gaussian for this formulation to hold. Furthermore, the result holds independently of the relationship between $n$ and $d$.

## 3 Selection of the regularization parameter

This section follows closely the line of thought recently introduced by Blanchet et al. (2016) in the analysis of regularized estimators under the DRO formulation. Specifically, we will demonstrate that the ambiguity set represents a confidence region for $\Omega$ and use the techniques of Blanchet et al. (2016) to explicitly connect the ambiguity size $\delta$ with a confidence level. As previously stated, $E_{P_0}[l(W;K)]$ is a differentiable function on $\mathcal{S}^d_{++}$ with minimizer $\Omega$, so that

$$E_{P_0}\!\left[\frac{\partial}{\partial K} l(W;K) \Big|_{K=\Omega}\right] = \mathbf{0}_{d \times d}. \tag{9}$$

Hence, even though the loss function was inspired by log-likelihood estimation of the covariance matrix for samples of Gaussian random vectors, equation (9) holds regardless of the underlying distribution of the data. For any $K \in \mathcal{S}^d_{++}$, define the set

$$\mathcal{O}(K) = \left\{ P \in \mathcal{P}(\mathcal{S}^d) \,\middle|\, E_P\!\left[\frac{\partial}{\partial K'} l(W;K') \Big|_{K'=K}\right] = \mathbf{0}_{d \times d} \right\}, \tag{10}$$

corresponding to all probability measures with covariance $K^{-1}$, i.e., for which $K$ is an optimal loss minimization parameter; here, $\mathcal{P}(\mathcal{S}^d)$ denotes the set of all probability distributions supported on $\mathcal{S}^d$. Thus, $\mathcal{O}(\Omega)$ contains all probability measures whose covariance matrix agrees with that of $X$.

Implicitly, the Wasserstein ambiguity set is linked to the collection of precision matrices

$$\mathcal{C}_n(\delta) := \left\{ K \in \mathcal{S}^d_{++} \,\middle|\, \text{there exists } P \in \mathcal{O}(K) \cap \{P \mid D_c(P, P_n) \le \delta\} \right\} = \bigcup_{P:\, D_c(P,P_n) \le \delta} \operatorname*{argmin}_{K \in \mathcal{S}^d_{++}} E_P[l(W;K)]. \tag{11}$$

We refer to $\mathcal{C}_n(\delta)$ as the set of plausible selections for $\Omega$.

###### Lemma 3.1 (Interchangeability in the DRO formulation).

Consider the setting of Theorem 2.1. Then, for any $\delta > 0$, the following holds with probability one:

$$\inf_{K \in \mathcal{S}^d_{++}} \sup_{P:\, D_c(P,P_n) \le \delta} E_P[l(W;K)] = \sup_{P:\, D_c(P,P_n) \le \delta} \inf_{K \in \mathcal{S}^d_{++}} E_P[l(W;K)]. \tag{12}$$

Lemma 3.1 states that any estimator obtained by minimizing the left-hand side of (12) must be in $\mathcal{C}_n(\delta)$; otherwise, the right-hand side of (12) would be strictly greater than the left. Thus, in line with the goal of providing a robust estimator, the idea is to choose $\delta$ so that $\mathcal{C}_n(\delta)$ also contains the true inverse covariance matrix with high confidence.

As $P_0$ is the weak limit of $P_n$, we will eventually have $\Omega \in \mathcal{C}_n(\delta)$ with high probability for any $\delta > 0$, so that $\mathcal{C}_n(\delta)$ is a confidence region for $\Omega$. From this observation, we can choose the uncertainty size optimally by the criterion

$$\delta = \inf\{\delta > 0 \mid P(\Omega \in \mathcal{C}_n(\delta)) \ge 1 - \alpha\}, \tag{13}$$

i.e., for a specified confidence level $1 - \alpha$, we choose the smallest $\delta$ so that $\mathcal{C}_n(\delta)$ is a $(1-\alpha)$-confidence region for $\Omega$.

To continue our analysis, we make use of the so-called Robust Wasserstein Profile (RWP) function introduced by Blanchet et al. (2016),

$$R_n(K) = \inf\{D_c(P, P_n) \mid P \in \mathcal{O}(K)\} = \inf\left\{ D_c(P, P_n) \,\middle|\, E_P\!\left[\frac{\partial}{\partial K'} l(W;K') \Big|_{K'=K}\right] = \mathbf{0}_{d \times d} \right\}, \tag{14}$$

which has the geometric interpretation of being the minimum distance between the empirical distribution $P_n$ and any distribution $P$ that satisfies the optimality condition for the precision matrix $K$. Then, using the equivalence of the events $\{\Omega \in \mathcal{C}_n(\delta)\}$ and $\{R_n(\Omega) \le \delta\}$, (13) becomes equivalent to

$$\delta = \inf\{\delta > 0 \mid P(R_n(\Omega) \le \delta) \ge 1 - \alpha\}, \tag{15}$$

i.e., the optimal selection of $\delta$ is the $(1-\alpha)$-quantile of $R_n(\Omega)$. Indeed, the set $\{P \mid D_c(P, P_n) \le R_n(\Omega)\}$ is the smallest ambiguity set around the empirical measure such that there exists a distribution for which $\Omega$ is an optimal loss minimization parameter. In contrast to previously reported applications of the RWP function to linear regression and logistic regression (Blanchet et al., 2016), our problem allows for a (finite sample) closed-form expression of this function. This is due to the fact that we have recast the covariance as the mean of the random matrix $W = XX^\top$, so that the following result gives a nontrivial generalization of (Blanchet et al., 2016, Example 3).

###### Theorem 3.2 (RWP function).

Consider the cost function in (7) for fixed $\rho \ge 1$ and $q \ge 1$. For $K \in \mathcal{S}^d_{++}$, consider $R_n(K)$ as in (14). Then,

$$R_n(K) = \big\|\mathrm{vec}(A_n - K^{-1})\big\|_q^{\rho}. \tag{16}$$
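The closed form (16) is simple to evaluate in practice. A short sketch (with illustrative data; the function `rwp` is ours) confirms that $R_n$ vanishes exactly at the empirical minimizer $K = A_n^{-1}$ and is positive elsewhere:

```python
import numpy as np

def rwp(K, A_n, q=np.inf, rho=1):
    # Closed-form RWP function (16): ||vec(A_n - K^{-1})||_q^rho.
    diff = (A_n - np.linalg.inv(K)).ravel()
    return np.linalg.norm(diff, ord=q) ** rho

rng = np.random.default_rng(1)
d, n = 3, 500
X = rng.standard_normal((n, d))
A_n = X.T @ X / n  # sample covariance for zero-mean data

assert rwp(np.linalg.inv(A_n), A_n) < 1e-10   # zero at K = A_n^{-1}
assert rwp(np.eye(d), A_n) > 0                # positive elsewhere
```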

We now establish important convergence guarantees on the RWP function in the following corollary.

###### Corollary 3.3 (Asymptotic behavior of the RWP function).

Suppose that the conditions of Theorem 3.2 hold and that the entries of $W = XX^\top$ have finite variance. Let $H$ be a $d \times d$ matrix of jointly Gaussian random variables with zero mean such that $\sqrt{n}\,(A_n - \Sigma) \Rightarrow H$. Then,

$$n^{\rho/2} R_n(\Omega) \Rightarrow \big\|\mathrm{vec}(H)\big\|_q^{\rho}. \tag{17}$$
###### Proof.

By the central limit theorem, we observe that $\sqrt{n}\,(A_n - \Sigma) \Rightarrow H$, and by the continuous mapping theorem, $n^{\rho/2} R_n(\Omega) = \|\mathrm{vec}(\sqrt{n}\,(A_n - \Sigma))\|_q^\rho \Rightarrow \|\mathrm{vec}(H)\|_q^\rho$. ∎

###### Remark 3.4.

Turning our attention back to Theorem 2.1, a robust selection for the ambiguity size $\delta$, or equivalently the regularization parameter $\lambda = \delta^{1/\rho}$, as obtained from Theorem 3.2, is

$$\delta^{1/\rho} = \inf\left\{ t > 0 \,\middle|\, P\big(\|\mathrm{vec}(A_n - \Sigma)\|_q \le t\big) \ge 1 - \alpha \right\}. \tag{18}$$

As a result, this robust selection for $\delta$ results in a class of estimators, given by minimizers of the right-hand side of (8), that are invariant to the choice of $\rho$ in (7). Thus, for simplicity, we will set $\rho = 1$ in the remainder of the paper.

###### Remark 3.5.

Let $\eta_{1-\alpha}$ be the $(1-\alpha)$-quantile of the distribution of the right-hand side of (17). Then, for any fixed $\alpha$, the robust selection in (18) satisfies $\delta^{1/\rho} = n^{-1/2}\,\eta_{1-\alpha}^{1/\rho}\,(1 + o(1))$, so that the optimal decay rate of $n^{-1/2}$ for $\lambda$ is automatically chosen by the RWP function.

As solving (18) requires knowledge of $\Sigma$, we now outline the robust selection (RobSel) algorithm for data-adaptive choice of the regularization parameter in our inverse covariance estimation problem with an $\ell_p$ penalization. The special case $p = 1$ corresponds to the graphical lasso in (2), in which case we will also use the notation $\lambda = \delta^{1/\rho}$. The asymptotic result in Corollary 3.3 invokes a central limit theorem, and thus motivates approximating the RWP function through bootstrapping, which we further explain and evaluate numerically in the next section. Let $\alpha$ be a prespecified confidence level and $B$ a large integer such that $\alpha B$ is also an integer.
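The bootstrap idea can be sketched as follows: replace the unknown $\Sigma$ in (18) by $A_n$, replace $A_n$ by its bootstrap counterpart, and take the $(1-\alpha)$ empirical quantile. The function name `robsel` and the defaults `B=200`, $q = \infty$ (the graphical lasso case $p = 1$, with $\rho = 1$) are illustrative choices of ours, not the paper's prescriptions.

```python
import numpy as np

def robsel(X, alpha=0.1, B=200, q=np.inf, seed=0):
    """Bootstrap approximation of the quantile in (18): the returned
    value is a candidate regularization parameter lambda."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    A_n = X.T @ X / n  # sample covariance (zero-mean data)
    stats = np.empty(B)
    for b in range(B):
        Xb = X[rng.integers(0, n, size=n)]  # bootstrap resample of rows
        A_b = Xb.T @ Xb / n
        stats[b] = np.linalg.norm((A_b - A_n).ravel(), ord=q)
    return np.quantile(stats, 1 - alpha)

rng = np.random.default_rng(2)
X = rng.standard_normal((400, 5))
lam = robsel(X, alpha=0.1)  # candidate lambda for the graphical lasso
```

A single graphical lasso fit at the returned `lam` then completes the procedure.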

RobSel can potentially provide considerable computational savings over cross-validation in practice. Computing sample covariance matrices for each of $B$ bootstrap samples has cost $O(Bnd^2)$. On the other hand, it is known that each iteration of graphical lasso can cost $O(d^3)$ in the worst case (Mazumder and Hastie, 2012); therefore, performing a $k$-fold cross-validation to search over an $m$-point grid of regularization parameters, with each fit taking $T$ iterations of graphical lasso, would cost $O(kmTd^3)$.

## 4 Numerical results and analysis

The true precision matrix $\Omega$ used to generate simulated data was constructed as follows. First, generate the adjacency matrix of an undirected Erdős–Rényi graph with equal edge probability and without self-loops. Then, the weight of each edge is sampled uniformly from a fixed interval, and the sign of each nonzero weight is positive or negative with equal probability of 0.5. Finally, the diagonal entries of this weighted adjacency matrix are set to a common positive value and the matrix is made diagonally dominant by following a procedure described in (Peng et al., 2009), which ensures that the resulting matrix is positive definite. Throughout this numerical study section, a single randomly generated sparse matrix $\Omega$ is fixed. Using this $\Omega$, datasets of varying size $n$ were generated as independent observations from a multivariate zero-mean Gaussian distribution, i.e., $X_i \sim \mathcal{N}(\mathbf{0}, \Omega^{-1})$.
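A sketch of this simulation design follows. The edge probability, weight interval, and unit diagonal below are illustrative choices (the paper's exact constants are not reproduced here), and the diagonal-dominance step is a simple symmetric rescaling in the spirit of Peng et al. (2009), not their exact procedure.

```python
import numpy as np

def make_sparse_precision(d=10, p=0.1, w_lo=0.5, w_hi=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Erdos-Renyi support on the upper triangle (no self-loops).
    upper = np.triu(rng.random((d, d)) < p, k=1)
    W = np.zeros((d, d))
    W[upper] = rng.uniform(w_lo, w_hi, size=upper.sum())
    W[upper] *= rng.choice([-1.0, 1.0], size=upper.sum())  # random signs
    W = W + W.T  # symmetric weighted adjacency matrix
    # Rescale rows to enforce strict diagonal dominance, then symmetrize;
    # together with a unit diagonal this guarantees positive definiteness.
    row = np.abs(W).sum(axis=1)
    W = W / (1.5 * np.maximum(row, 1.0))[:, None]
    W = (W + W.T) / 2
    return W + np.eye(d)

Omega = make_sparse_precision()  # sparse, symmetric, positive definite
```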

Consider the problem of choosing the regularization parameter $\lambda$ (equivalently, the ambiguity size parameter $\delta$) to obtain graphical lasso estimates of $\Omega$ using the simulated datasets. The R package glasso from CRAN was used throughout our numerical experiments. Below, we compare two different criteria for choosing $\lambda$. The first criterion is Robust Selection (RS), which follows our proposed RobSel algorithm with $B$ sets of bootstrap samples. We present here results mainly for the regime $n > d$, but additional results in the high-dimensional regime can be found in the supplementary material. The second criterion is a $k$-fold cross-validation (CV) procedure, where the performance on the validation set is the evaluation of the graphical loss function under the empirical measure of the samples in the training set.

Recall the elements of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). We compare the model selection performance of $\lambda$ chosen by the two approaches, $\lambda_{\mathrm{RS}}$ and $\lambda_{\mathrm{CV}}$. The following comparison metrics are used:

• True positive rate (TPR) and false detection rate (FDR): TPR is the proportion of nonzero entries of $\Omega$ that are correctly identified as nonzero in the estimate $\hat{\Omega}$, and FDR is the proportion of entries identified as nonzero in $\hat{\Omega}$ that are actually zero in $\Omega$.

• Matthews correlation coefficient (MCC): the MCC summarizes all counts in the confusion matrix, in contrast to partial measures like TPR and FDR. More details about the MCC are given in supplemental subsection D.1.
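These metrics are straightforward to compute from the supports of the true and estimated precision matrices; the helper below (our own illustrative function, restricted to off-diagonal entries) is one way to do so:

```python
import numpy as np

def support_metrics(Omega_true, Omega_hat, tol=1e-8):
    # Compare off-diagonal supports of the true and estimated precision matrices.
    mask = ~np.eye(Omega_true.shape[0], dtype=bool)
    truth = np.abs(Omega_true[mask]) > tol
    est = np.abs(Omega_hat[mask]) > tol
    tp = np.sum(truth & est)
    fp = np.sum(~truth & est)
    fn = np.sum(truth & ~est)
    tn = np.sum(~truth & ~est)
    tpr = tp / max(tp + fn, 1)   # true positive rate
    fdr = fp / max(tp + fp, 1)   # false detection rate
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return tpr, fdr, mcc

Omega = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
tpr, fdr, mcc = support_metrics(Omega, Omega)  # perfect recovery
```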

In the remainder of this section, we compare the model selection performance (FDR, TPR, MCC) from our simulation results. As mentioned in Remark 3.5, supplemental subsection D.2 shows that $\lambda_{\mathrm{RS}}$ decreases as $n$ increases, as is also observed for $\lambda_{\mathrm{CV}}$. Furthermore, across the tested range of $\alpha$, the regularization parameters $\lambda_{\mathrm{RS}}$ are all larger than $\lambda_{\mathrm{CV}}$ for any $n$. Our distributionally robust representation (8) then allows us to observe that, even for small values of $\alpha$, CV always chooses a $\lambda$ that corresponds to smaller ambiguity sets than RS.

To assess the accuracy of RobSel in estimating $\lambda$ for a given $\alpha$, we approximated the right-hand side of (18) using the simulated data sets and the true covariance $\Sigma$, giving the "true" value $\lambda^*$. Figures in supplemental subsection D.3 show that the performance obtained with $\lambda_{\mathrm{RS}}$ is similar to that obtained with $\lambda^*$ for all comparison metrics. This finite sample behavior indicates that the RobSel bootstrap algorithm reliably approximates the desired robustness level corresponding to the choice of $\alpha$. These plots also indicate that the RS criterion is more conservative than CV in achieving a lower FDR across different sample sizes, due to providing larger values of $\lambda$ (see supplemental subsection D.2). More specifically, Fig. 2 shows that RS gives better performance than CV in terms of FDR even for smaller values of $n$, and this performance improves further as $n$ increases.

Moreover, the trade-off between the preference for robustness and the preference for a denser estimate of the nonzero entries in the precision matrix can be observed in terms of the Matthews correlation coefficient (MCC), as shown in Fig. 2. Higher values in the MCC curve correspond to values of $\lambda$ that yield a better classification of the entries of $\Omega$ as either zero or nonzero (see supplemental subsection D.1 for more details). We observe that CV sits to the left of the value of $\lambda$ maximizing the MCC, and its under-performance is due to overestimation of the nonzero entries of $\Omega$. On the other hand, RS sits to the right of the maximizer, and its under-performance is due to its conservative overestimation of the zero entries of $\Omega$, induced by its robust nature. It is therefore up to the experimenter to decide which method to use, depending on whether she desires to control the overestimation of zero or of nonzero entries. Remarkably, for large sample sizes $n$, RS comes much closer to the optimal MCC performance than CV, and it does so by maintaining a lower FDR than CV while increasing its TPR.

Our results from the MCC analysis and supplemental subsection D.3 also indicate that we should aim for higher values of $\alpha$ if we want performance closer to CV in terms of TPR when using the RS criterion, with the advantage of still maintaining better performance than CV in terms of FDR. In contrast, if we want more conservative results, we should aim for lower values of $\alpha$. This is a convenient property of RS: it allows the use of a single parameter $\alpha$ to adjust the importance of the regularization term. As mentioned at the end of Section 3, a practical advantage of RS is that it provides a candidate for $\lambda$ with potentially considerable computational savings over CV.

## 5 Conclusion

We provide a recharacterization of the popular graphical lasso estimator in (2) as the optimal solution to a distributionally robust optimization problem. To the best of our knowledge, this is the first work to make such a connection for sparse inverse covariance estimation. The DRO form of the estimator leads to a reinterpretation of the regularization parameter as the radius of the distributional ambiguity set, which can be chosen based on a desired level of robustness. We propose the RobSel method for the selection of the regularization parameter and compare it to cross-validation for the graphical lasso. In our numerical experiments, RobSel gives a better false detection rate than cross-validation, and, as the sample size increases, other performance metrics like the true positive rate become similar for the two methods. Moreover, RobSel is a computationally simpler procedure, notably performing the graphical lasso algorithm only once at the final step rather than repeatedly, as is necessary for cross-validation. Future work includes theoretical justification of robust selection of the regularization parameter for the graphical lasso in the high-dimensional setting.

## References

• Banerjee et al. (2008) O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485–516, 2008. ISSN 1532-4435.
• Blanchet and Murthy (2016) J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. arXiv preprint arXiv:1604.01446, 2016.
• Blanchet and Si (2019) J. Blanchet and N. Si. Optimal uncertainty size in distributionally robust inverse covariance estimation. arXiv preprint arXiv:1901.07693, 2019.
• Blanchet et al. (2016) J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. arXiv preprint arXiv:1610.05627, 2016.
• Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. ISBN 0521833787.
• Cai et al. (2011) T. Cai, W. Liu, and X. Luo. A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, jun 2011.
• Chicco (2017) D. Chicco. Ten quick tips for machine learning in computational biology. BioData Mining, 10(1):35, Dec 2017.
• Edwards (2010) D. Edwards. Introduction to Graphical Modelling. Springer-Verlag New York, 2010. ISBN 978-0-387-95054-9.
• Esfahani and Kuhn (2018) P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.
• Friedman et al. (2007) J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 12 2007. ISSN 1465-4644.
• Guillot et al. (2015) D. Guillot, B. Rajaratnam, and J. Emile-Geay. Statistical paleoclimate reconstructions via markov random fields. Ann. Appl. Stat., 9(1):324–352, 03 2015. doi: 10.1214/14-AOAS794.
• Hsieh et al. (2014) C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Quic: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15:2911–2947, 2014.
• Huang et al. (2010) S. Huang, J. Li, L. Sun, J. Ye, A. Fleisher, T. Wu, K. Chen, and E. Reiman. Learning brain connectivity of Alzheimer’s disease by sparse inverse covariance estimation. NeuroImage, 50(3):935 – 949, 2010. ISSN 1053-8119.
• Isii (1962) K. Isii. On sharpness of tchebycheff-type inequalities. Annals of the Institute of Statistical Mathematics, 14(1):185–197, 1962. doi: 10.1007/BF02868641.
• James and Stein (1961) W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 361–379, Berkeley, Calif., 1961. University of California Press.
• Jankova and van de Geer (2018) J. Jankova and S. van de Geer. Inference in high-dimensional graphical models. arXiv preprint arXiv:1801.08512, 2018.
• Khare et al. (2015) K. Khare, S.-Y. Oh, and B. Rajaratnam. A convex pseudolikelihood framework for high dimensional partial correlation estimation with convergence guarantees. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):803–825, sep 2015. doi: 10.1111/rssb.12088.
• Krumsiek et al. (2011) J. Krumsiek, K. Suhre, T. Illig, J. Adamski, and F. J. Theis. Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data. BMC Systems Biology, 5(1):21, Jan 2011. ISSN 1752-0509.
• Lam and Fan (2009) C. Lam and J. Fan. Sparsity and rates of convergence in large covariance matrix estimation. The Annals of Statistics, 37(6B):4254–4278, 2009.
• Loh and Tan (2018) P.-L. Loh and X. L. Tan. High-dimensional robust precision matrix estimation: Cellwise corruption under -contamination. Electron. J. Statist., 12(1):1429–1467, 2018. doi: 10.1214/18-EJS1427.
• Mazumder and Hastie (2012) R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res., 13(1):781–794, 2012. ISSN 1532-4435.
• Menéndez et al. (2010) P. Menéndez, Y. Kourmpetis, C. ter Braak, and F. van Eeuwijk. Gene regulatory networks from multifactorial perturbations using graphical lasso: Application to the dream4 challenge. PLOS ONE, 5(12):1–8, 12 2010.
• Nguyen et al. (2018) V. A. Nguyen, D. Kuhn, and P. M. Esfahani. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. arXiv preprint arXiv:1805.07194, 2018.
• Peng et al. (2009) J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.
• Powers (2011) D. M. Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2:37–63, 2011.
• Rolfs et al. (2012) B. Rolfs, B. Rajaratnam, D. Guillot, I. Wong, and A. Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1574–1582. Curran Associates, Inc., 2012.
• Rothman et al. (2008) A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electron. J. Statist., 2:494–515, 2008. doi: 10.1214/08-EJS176.
• Shafieezadeh-Abadeh et al. (2015) S. Shafieezadeh-Abadeh, P. M. Esfahani, and D. Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems 28, pages 1576–1584. 2015.
• Shafieezadeh-Abadeh et al. (2019) S. Shafieezadeh-Abadeh, D. Kuhn, and P. M. Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20, 2019.
• Smith et al. (2011) S. M. Smith, K. L. Miller, G. Salimi-Khorshidi, M. Webster, C. F. Beckmann, T. E. Nichols, J. D. Ramsey, and M. W. Woolrich. Network modelling methods for fMRI. NeuroImage, 54(2):875 – 891, 2011. ISSN 1053-8119.
• Steele (2004) J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge University Press, 2004.
• Sulaimanov et al. (2018) N. Sulaimanov, S. Kumar, H. Koeppl, F. Burdet, M. Pagni, and M. Ibberson. Inferring gene expression networks with hubs using a degree weighted Lasso approach. Bioinformatics, 35(6):987–994, 08 2018. ISSN 1367-4803.
• Won et al. (2012) J.-H. Won, J. Lim, S.-J. Kim, and B. Rajaratnam. Condition-number-regularized covariance estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):427–450, dec 2012.
• Xue and Zou (2012) L. Xue and H. Zou. Regularized rank-based estimation of high-dimensional paranormal graphical models. The Annals of Statistics, 40(5):2541–2571, 2012.
• Yang and Lozano (2015) E. Yang and A. C. Lozano. Robust Gaussian graphical modeling with the trimmed graphical lasso. In Advances in Neural Information Processing Systems 28, pages 2602–2610. 2015.
• Yang et al. (2015) S. Yang, Q. Sun, S. Ji, P. Wonka, I. Davidson, and J. Ye. Structural graphical lasso for learning mouse brain connectivity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
• Yuan and Lin (2007) M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.

## Appendix A Using a different ambiguity set

Recall that $X$ is a zero-mean random vector with covariance matrix $\Sigma$ and measure $Q$, and that we consider an iid random sample $X_1,\dots,X_n$, with empirical measure $Q_n$. Since we use the graphical loss function $l$ as in equation (3), we are interested in finding a tractable or closed-form expression for the optimization problem

$$\sup_{Q:\; D'_{c'}(Q,\,Q_n)\leq\delta}\mathbb{E}_Q[l(X;K)] \tag{19}$$

for some radius $\delta\geq 0$. Ideally, the solution should be connected to the graphical lasso estimator, since it is one of the most commonly used sparse inverse covariance estimators in practice. The ambiguity set in this formulation is specified by the collection of measures $\{Q:\,D'_{c'}(Q,Q_n)\leq\delta\}$, which we now describe. Given two probability distributions $Q_1$ and $Q_2$ on $\mathbb{R}^d$ and some transportation cost function $c'$ (which we will specify below), we define the optimal transport cost between $Q_1$ and $Q_2$ as

$$D'_{c'}(Q_1,Q_2)=\inf\left\{\mathbb{E}_{\pi}[c'(u,v)]\ \middle|\ \pi\in\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d),\ \pi_u=Q_1,\ \pi_v=Q_2\right\} \tag{20}$$

where $\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ is the set of joint probability distributions of $(u,v)$ supported on $\mathbb{R}^d\times\mathbb{R}^d$, and $\pi_u$ and $\pi_v$ denote the marginals of $u$ and $v$ under $\pi$, respectively. In this paper, we are interested in cost functions

$$c'(u,v)=\lVert u-v\rVert_q^{\rho}, \tag{21}$$

with $u,v\in\mathbb{R}^d$, $\rho\geq 1$, and $q\geq 1$.

Now, observe that the function $l(\cdot\,;K)$ is Borel measurable since it is continuous. We then use the duality result from Proposition 4 of (Blanchet et al., 2016, version 2) and obtain

$$\sup_{Q:\; D'_{c'}(Q,Q_n)\leq\delta}\mathbb{E}_Q[l(X;K)]=\inf_{\gamma\geq 0}\left\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{n}\left(\sup_{u\in\mathbb{R}^d}\{l(u;K)-\gamma c'(u,X_i)\}\right)\right\}. \tag{22}$$

Let $\Delta=u-X_i$. Then

$$\sup_{u\in\mathbb{R}^d}\{l(u;K)-\gamma c'(u,X_i)\}=\sup_{u\in\mathbb{R}^d}\{u^{\top}Ku-\log|K|-\gamma\lVert u-X_i\rVert_q^{\rho}\}=\sup_{\Delta\in\mathbb{R}^d}\{(\Delta+X_i)^{\top}K(\Delta+X_i)-\gamma\lVert\Delta\rVert_q^{\rho}\}-\log|K|. \tag{23}$$

Replacing this expression back in (22), it may be difficult, if not impossible, to obtain a closed-form optimization problem over $K$. Even if such a simplification were possible, it would not provide the desired connection to the graphical lasso estimator. This is why, as outlined in Section 2, we redefine the ambiguity set so as to obtain the desired closed form, expressed in Theorem 2.1, in a more transparent way.
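To see one concrete obstruction, consider $\rho=2$, $q=2$: along a top eigenvector $v$ of $K$, the objective of (23) behaves like $t^2(\lambda_{\max}(K)-\gamma)$ for $\Delta=tv$, so the inner supremum is infinite whenever $\gamma<\lambda_{\max}(K)$. A small numerical probe of this (with illustrative, made-up values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
A = rng.standard_normal((d, d))
K = A @ A.T + np.eye(d)          # a positive definite K
x = rng.standard_normal(d)       # plays the role of a sample X_i
lam_max = np.linalg.eigvalsh(K).max()
v = np.linalg.eigh(K)[1][:, -1]  # top eigenvector of K

def inner(delta, gamma, rho=2.0):
    # objective of (23): (Delta + X_i)^T K (Delta + X_i) - gamma * ||Delta||^rho
    u = delta + x
    return u @ K @ u - gamma * np.linalg.norm(delta) ** rho

# with gamma below lam_max, the objective grows without bound along t * v
vals = [inner(t * v, gamma=0.5 * lam_max) for t in (1e2, 1e3, 1e4)]
assert vals[0] < vals[1] < vals[2]  # diverging along this direction
```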

## Appendix B Proofs for the paper

### B.1 Proof of Theorem 2.1

###### Proof.

Fix a positive definite $K\in\mathcal{S}^d$. Observe that the function $l(\cdot\,;K)$ is Borel measurable since it is continuous. We then use the duality result for the DRO formulation from Proposition C.2 in the appendix of this paper and obtain

$$\sup_{P:\; D_c(P,P_n)\leq\delta}\mathbb{E}_P[l(W;K)]=\inf_{\gamma\geq 0}\left\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{n}\left(\sup_{W\in\mathcal{S}^d}\{l(W;K)-\gamma c(W,W_i)\}\right)\right\}. \tag{24}$$

Let $\Delta=W-W_i$. Then,

$$\begin{aligned}
\sup_{W\in\mathcal{S}^d}\{l(W;K)-\gamma c(W,W_i)\}&=\sup_{W\in\mathcal{S}^d}\{\operatorname{trace}(KW)-\log|K|-\gamma\lVert\operatorname{vec}(W)-\operatorname{vec}(W_i)\rVert_q^{\rho}\}\\
&=\sup_{\Delta\in\mathcal{S}^d}\{\operatorname{trace}(K(\Delta+W_i))-\gamma\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\}-\log|K|\\
&=\sup_{\Delta\in\mathcal{S}^d}\{\operatorname{trace}(K\Delta)-\gamma\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\}+\operatorname{trace}(KW_i)-\log|K|\\
&=\sup_{\Delta\in M(K)}\{\lVert\operatorname{vec}(\Delta)\rVert_q\lVert\operatorname{vec}(K)\rVert_p-\gamma\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\}+\operatorname{trace}(KW_i)-\log|K|
\end{aligned} \tag{25}$$

with $M(K)=\{\Delta\in\mathcal{S}^d:\operatorname{trace}(K\Delta)=\lVert\operatorname{vec}(\Delta)\rVert_q\lVert\operatorname{vec}(K)\rVert_p\}$, so that the fourth line follows from selecting a $\Delta$ (which exists since $K\in\mathcal{S}^d$) such that Hölder's inequality holds tightly (with $1/p+1/q=1$). In fact, Hölder's inequality holds tightly if and only if, entrywise, $|\Delta_{jk}|^q\propto|K_{jk}|^p$ with $\operatorname{sign}(\Delta_{jk})=\operatorname{sign}(K_{jk})$ (Steele, 2004, Chapter 9), even for the limiting case $q=1$, $p=\infty$. Observe that there exist multiple $\Delta$ that can satisfy Hölder's inequality tightly. As a consequence, we are still free to choose the magnitude of the $q$-norm of such $\Delta$ (and this is what we will use next).
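The tightness condition above can be checked numerically: choosing $\Delta$ with $|\Delta_{jk}|^q\propto|K_{jk}|^p$ and matching signs attains equality in Hölder's inequality. A minimal sketch with illustrative values (the matrix and exponents are made up, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3.0, 1.5                 # conjugate exponents: 1/p + 1/q = 1
K = rng.standard_normal((4, 4))
K = K + K.T                     # symmetric, as in the proof

# Tightness choice from (Steele, 2004): |Delta|^q proportional to |K|^p,
# with matching signs; here |Delta| = |K|^(p/q).
Delta = np.sign(K) * np.abs(K) ** (p / q)

lhs = np.sum(K * Delta)         # trace(K Delta) = <vec(K), vec(Delta)>
rhs = np.linalg.norm(K.ravel(), ord=p) * np.linalg.norm(Delta.ravel(), ord=q)
assert np.isclose(lhs, rhs)     # Holder's inequality holds with equality
```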

Now, the argument inside the supremum in the last line of (25) is a function of the scalar $t=\lVert\operatorname{vec}(\Delta)\rVert_q\geq 0$ alone. We have to analyze two cases.

Case 1: $\rho=1$. In this case we observe that, by setting $t=\lVert\operatorname{vec}(\Delta)\rVert_q$:

• if $\gamma\geq\lVert\operatorname{vec}(K)\rVert_p$, then $\sup_{t\geq 0}\{t(\lVert\operatorname{vec}(K)\rVert_p-\gamma)\}=0$ (in particular, if $\gamma>\lVert\operatorname{vec}(K)\rVert_p$, the optimizer is $t=0$);

• if $\gamma<\lVert\operatorname{vec}(K)\rVert_p$, then the supremum is $+\infty$;

so that, recalling (24), the outside infimum is effectively taken over $\gamma\geq\lVert\operatorname{vec}(K)\rVert_p$; and since the term $\gamma\delta$ is increasing in $\gamma$, we must have that $\gamma^*=\lVert\operatorname{vec}(K)\rVert_p$,

from which we immediately obtain (8).

Case 2: $\rho>1$. By differentiation and basic calculus (e.g., using the first and second derivative tests) we obtain that the maximizer

$$\Delta^*=\operatorname*{arg\,sup}_{\Delta\in M(K)}\{\lVert\operatorname{vec}(\Delta)\rVert_q\lVert\operatorname{vec}(K)\rVert_p-\gamma\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\}$$

is such that $\lVert\operatorname{vec}(\Delta^*)\rVert_q=\big(\lVert\operatorname{vec}(K)\rVert_p/(\gamma\rho)\big)^{1/(\rho-1)}$. Then,

$$\sup_{\Delta\in M(K)}\{\lVert\operatorname{vec}(\Delta)\rVert_q\lVert\operatorname{vec}(K)\rVert_p-\gamma\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\}=\lVert\operatorname{vec}(K)\rVert_p^{\frac{\rho}{\rho-1}}\,\frac{\rho-1}{\rho^{\frac{\rho}{\rho-1}}}\,\gamma^{-\frac{1}{\rho-1}}. \tag{26}$$

Replacing this back in (24),

$$\begin{aligned}
\sup_{P:\; D_c(P,P_n)\leq\delta}\mathbb{E}_P[l(W;K)]&=\inf_{\gamma\geq 0}\left\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{n}\left(\lVert\operatorname{vec}(K)\rVert_p^{\frac{\rho}{\rho-1}}\frac{\rho-1}{\rho^{\frac{\rho}{\rho-1}}}\gamma^{-\frac{1}{\rho-1}}+\operatorname{trace}(KW_i)-\log|K|\right)\right\}\\
&=\inf_{\gamma\geq 0}\left\{\gamma\delta+\lVert\operatorname{vec}(K)\rVert_p^{\frac{\rho}{\rho-1}}\frac{\rho-1}{\rho^{\frac{\rho}{\rho-1}}}\gamma^{-\frac{1}{\rho-1}}+\operatorname{trace}\Big(K\,\frac{1}{n}\sum_{i=1}^{n}W_i\Big)-\log|K|\right\}\\
&=\inf_{\gamma\geq 0}\left\{\gamma\delta+\lVert\operatorname{vec}(K)\rVert_p^{\frac{\rho}{\rho-1}}\frac{\rho-1}{\rho^{\frac{\rho}{\rho-1}}}\gamma^{-\frac{1}{\rho-1}}\right\}+\operatorname{trace}(KA_n)-\log|K|.
\end{aligned} \tag{27}$$

Now, we observe that the argument inside the infimum in the last line of (27) is a function of $\gamma$ that grows to infinity as $\gamma\to 0^+$ or $\gamma\to\infty$, so that the minimum is attained at some optimal $\gamma^*>0$. By using the first and second derivative tests, we obtain that the minimizer is $\gamma^*=\lVert\operatorname{vec}(K)\rVert_p/\big(\rho\,\delta^{(\rho-1)/\rho}\big)$. Then, replacing this back in (27), and the result back in (24), we finally obtain (8) after some algebraic simplification. ∎
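The final minimization over $\gamma$ can be verified numerically: the infimum of $\gamma\delta+C\gamma^{-1/(\rho-1)}$ with $C=\lVert\operatorname{vec}(K)\rVert_p^{\rho/(\rho-1)}(\rho-1)\rho^{-\rho/(\rho-1)}$ equals $\delta^{1/\rho}\lVert\operatorname{vec}(K)\rVert_p$. A hedged sketch with illustrative values ($\rho=p=2$ and a random $K$ are assumptions for the check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, p = 2.0, 2.0
delta = 0.3
K = rng.standard_normal((4, 4)); K = K + K.T
a = np.linalg.norm(K.ravel(), ord=p)            # ||vec(K)||_p

C = a ** (rho / (rho - 1)) * (rho - 1) / rho ** (rho / (rho - 1))
h = lambda g: g * delta + C * g ** (-1.0 / (rho - 1))  # objective in (27)

gamma_star = a / (rho * delta ** ((rho - 1) / rho))    # claimed minimizer
grid = np.linspace(1e-3, 10 * gamma_star, 200001)
# Both the grid minimum and the stationary point match the closed-form
# value delta^(1/rho) * ||vec(K)||_p obtained after simplification.
assert np.isclose(h(grid).min(), delta ** (1 / rho) * a, atol=1e-6)
assert np.isclose(h(gamma_star), delta ** (1 / rho) * a)
```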

### B.2 Proof of Lemma 3.1

###### Proof.

Consider the sampling measure $P_0$ of $W$ and fix $\delta>0$. We prove (12) by a direct application of Proposition 8 of (Blanchet et al., 2016, Appendix C), observing that we satisfy the three conditions for its application:

1. $g$ is convex on $\mathcal{S}^d_{++}$ and finite,

2. there exists a level $c\in\mathbb{R}$ such that the sublevel set $\{K:\,g(K)\leq c\}$ is compact and non-empty,

3. $l(W;K)$ is lower semi-continuous and convex as a function of $K$ throughout $\mathcal{S}^d_{++}$ for any $W\in\mathcal{S}^d$.

First, we observe that

$$\mathbb{E}_{P_0}[l(W;K)]=\mathbb{E}_{P_0}[\operatorname{trace}(KW)]-\log|K|=\operatorname{trace}\big(K\,\mathbb{E}_{P_0}[W]\big)-\log|K|<\infty,$$

since $\mathbb{E}_{P_0}[W]$ has finite entries. Then, using Theorem 2.1, the function

$$g(K)=\operatorname{trace}(KA_n)-\log|K|+\delta^{1/\rho}\lVert\operatorname{vec}(K)\rVert_p$$

is finite. Moreover, we also claim it is convex for all $K\in\mathcal{S}^d_{++}$. This follows from the fact that $\operatorname{trace}(KA_n)-\log|K|$ and $\lVert\operatorname{vec}(K)\rVert_p$ are two convex functions on $\mathcal{S}^d_{++}$, and from the fact that a nonnegative weighted sum of two convex functions is again convex (Boyd and Vandenberghe, 2004). This proves 1.

Now, we claim that $g$ is radially unbounded, i.e., $g(K)\to\infty$ as $\lVert\operatorname{vec}(K)\rVert_p\to\infty$. To see this, recall that $\operatorname{trace}(KA_n)-\log|K|$ is a differentiable convex function of $K$ that is minimized whenever $K=A_n^{-1}$, since $A_n$ is invertible almost surely. Then,

$$g(K)=\operatorname{trace}(KA_n)-\log|K|+\delta^{1/\rho}\lVert\operatorname{vec}(K)\rVert_p\geq d-\log|A_n^{-1}|+\delta^{1/\rho}\lVert\operatorname{vec}(K)\rVert_p,$$

from which it immediately follows that $g(K)\to\infty$ as $\lVert\operatorname{vec}(K)\rVert_p\to\infty$.
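The lower bound used here, namely that $\operatorname{trace}(KA_n)-\log|K|\geq d-\log|A_n^{-1}|=d+\log|A_n|$ for every positive definite $K$, can be spot-checked numerically. A minimal sketch with made-up positive definite matrices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
B = rng.standard_normal((d, d))
A_n = B @ B.T + np.eye(d)              # plays the role of A_n (PD, invertible)
lower = d + np.linalg.slogdet(A_n)[1]  # d + log|A_n| = d - log|A_n^{-1}|

# trace(K A_n) - log|K| is minimized at K = A_n^{-1}, with value d + log|A_n|
for _ in range(200):
    C = rng.standard_normal((d, d))
    K = C @ C.T + 1e-3 * np.eye(d)     # random positive definite K
    val = np.trace(K @ A_n) - np.linalg.slogdet(K)[1]
    assert val >= lower - 1e-9
```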

Now, again, since $g$ is also convex and continuous in $K$, we conclude that the sublevel sets $\{K:\,g(K)\leq c\}$ are compact and nonempty as long as $c\geq\inf_K g(K)$. This proves 2. Moreover, since $\operatorname{trace}(KW)-\log|K|$ is convex and continuous on $\mathcal{S}^d_{++}$, it follows that $l(W;K)$, for any $W$, is also continuous and convex on $\mathcal{S}^d_{++}$; thus 3 follows immediately. ∎

### B.3 Proof of Theorem 3.2

###### Proof.

Consider a fixed positive definite $K\in\mathcal{S}^d$. Setting up the moment function accordingly, it is clear that we satisfy the conditions for applying Proposition C.1 in the Appendix, and so we obtain

$$R_n(K)=\sup_{\Lambda\in\mathcal{S}^d}\left\{-\frac{1}{n}\sum_{i=1}^{n}\sup_{U\in\mathcal{S}^d}\left\{\operatorname{trace}\big(\Lambda^{\top}(U-K^{-1})\big)-\lVert\operatorname{vec}(U)-\operatorname{vec}(W_i)\rVert_q^{\rho}\right\}\right\}. \tag{28}$$

Now, letting $\Delta=U-W_i$, we have

$$\begin{aligned}
\sup_{U\in\mathcal{S}^d}\left\{\operatorname{trace}\big(\Lambda^{\top}(U-K^{-1})\big)-\lVert\operatorname{vec}(U)-\operatorname{vec}(W_i)\rVert_q^{\rho}\right\}&=\sup_{\Delta\in\mathcal{S}^d}\left\{\operatorname{trace}\big(\Lambda^{\top}(\Delta+W_i-K^{-1})\big)-\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\right\}\\
&=\sup_{\Delta\in\mathcal{S}^d}\left\{\operatorname{trace}(\Lambda^{\top}\Delta)-\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\right\}+\operatorname{trace}\big(\Lambda^{\top}(W_i-K^{-1})\big)\\
&=\sup_{\Delta\in M(\Lambda)}\left\{\lVert\operatorname{vec}(\Delta)\rVert_q\lVert\operatorname{vec}(\Lambda)\rVert_p-\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\right\}+\operatorname{trace}\big(\Lambda^{\top}(W_i-K^{-1})\big)
\end{aligned} \tag{29}$$

with $M(\Lambda)$ as in the proof of Theorem 2.1 (with $\Lambda$ in place of $K$), so that the third line follows from selecting a $\Delta$ such that Hölder's inequality holds tightly (with $1/p+1/q=1$), whose existence was explained in the proof of Theorem 2.1. Thus, we are still free to choose the magnitude of the $q$-norm of such $\Delta$ (and this is what we will use next).

Now, the argument inside the supremum in the last line of (29) is a function of the scalar $t=\lVert\operatorname{vec}(\Delta)\rVert_q\geq 0$ alone. We have to analyze two cases.

Case 1: $\rho=1$. In this case we observe that, by setting $t=\lVert\operatorname{vec}(\Delta)\rVert_q$:

• if $\lVert\operatorname{vec}(\Lambda)\rVert_p\leq 1$, then $\sup_{t\geq 0}\{t(\lVert\operatorname{vec}(\Lambda)\rVert_p-1)\}=0$ (in particular, if $\lVert\operatorname{vec}(\Lambda)\rVert_p<1$, the optimizer is $t=0$);

• if $\lVert\operatorname{vec}(\Lambda)\rVert_p>1$, then the supremum is $+\infty$;

so that, recalling (28), we see that if $\lVert\operatorname{vec}(\Lambda)\rVert_p>1$, then the expression inside the outer supremum equals $-\infty$. Then, we obtain that

$$\begin{aligned}
R_n(K)&=\sup_{\Lambda\in\mathcal{S}^d:\,\lVert\operatorname{vec}(\Lambda)\rVert_p\leq 1}\left\{-\frac{1}{n}\sum_{i=1}^{n}\operatorname{trace}\big(\Lambda^{\top}(W_i-K^{-1})\big)\right\}\\
&=\sup_{\Lambda\in\mathcal{S}^d:\,\lVert\operatorname{vec}(\Lambda)\rVert_p\leq 1}\left\{-\operatorname{trace}\big(\Lambda^{\top}(A_n-K^{-1})\big)\right\}\\
&=\sup_{\Lambda\in\mathcal{S}^d:\,\lVert\operatorname{vec}(\Lambda)\rVert_p\leq 1}\left\{\operatorname{vec}(\Lambda)^{\top}\operatorname{vec}(A_n-K^{-1})\right\}\\
&=\lVert\operatorname{vec}(A_n-K^{-1})\rVert_q
\end{aligned}$$

where the third line results from the fact that $\Lambda$ is a free variable so we can flip its sign, and the last line follows from the analysis of conjugate norms and the fact that $1/p+1/q=1$. We thus obtain (16).
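The last step is the dual-norm identity $\sup_{\lVert z\rVert_p\leq 1}z^{\top}y=\lVert y\rVert_q$. A quick numerical sanity check with illustrative values (the vector and exponents are assumptions for the sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 3.0, 1.5               # conjugate exponents: 1/p + 1/q = 1
y = rng.standard_normal(16)   # stands in for vec(A_n - K^{-1})

# Maximizer from the conjugate-norm analysis: z_j proportional to
# sign(y_j) |y_j|^(q-1), rescaled onto the unit p-norm ball.
z = np.sign(y) * np.abs(y) ** (q - 1)
z /= np.linalg.norm(z, ord=p)

assert np.isclose(z @ y, np.linalg.norm(y, ord=q))
# ...and no random feasible point does better:
for _ in range(1000):
    w = rng.standard_normal(16)
    w /= np.linalg.norm(w, ord=p)
    assert w @ y <= np.linalg.norm(y, ord=q) + 1e-12
```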

Case 2: $\rho>1$. By differentiation and basic calculus (e.g., using the first and second derivative tests) we obtain that the maximizer

$$\Delta^*=\operatorname*{arg\,sup}_{\Delta\in M(\Lambda)}\left\{\lVert\operatorname{vec}(\Delta)\rVert_q\lVert\operatorname{vec}(\Lambda)\rVert_p-\lVert\operatorname{vec}(\Delta)\rVert_q^{\rho}\right\}$$

is such that $\lVert\operatorname{vec}(\Delta^*)\rVert_q=\big(\lVert\operatorname{vec}(\Lambda)\rVert_p/\rho\big)^{1/(\rho-1)}$. Then, replacing this back in (28),

Again, by differentiation and basic calculus, we obtain that the maximizer $\Lambda^*$ is such that $\lVert\operatorname{vec}(\Lambda^*)\rVert_p=\rho\,\lVert\operatorname{vec}(A_n-K^{-1})\rVert_q^{\rho-1}$. Replacing this value back in our previous expression, we get that $R_n(K)=\lVert\operatorname{vec}(A_n-K^{-1})\rVert_q^{\rho}$, and thus we showed (16). ∎
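The scalar optimization behind this last step can be verified numerically: with $a=\lVert\operatorname{vec}(A_n-K^{-1})\rVert_q$ and $b=\rho/(\rho-1)$, one has $\sup_{s\geq 0}\{sa-(\rho-1)\rho^{-b}s^{b}\}=a^{\rho}$, attained at $s^*=\rho a^{\rho-1}$. A hedged sketch with illustrative values of $\rho$ and $a$ (assumptions, not from the paper):

```python
import numpy as np

rho = 2.5
a = 1.7                        # stands in for ||vec(A_n - K^{-1})||_q
b = rho / (rho - 1)
C = (rho - 1) / rho ** b       # coefficient from the Case 2 computation

f = lambda s: s * a - C * s ** b
s_star = rho * a ** (rho - 1)  # claimed maximizer ||vec(Lambda*)||_p

s = np.linspace(0.0, 4 * s_star, 400001)
# Both the grid maximum and the stationary point match a^rho = R_n(K)
assert np.isclose(f(s).max(), a ** rho, atol=1e-6)
assert np.isclose(f(s_star), a ** rho)
```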

## Appendix C Applicability of the dual representations of the RWP function and the DRO formulation

The dual representations of the RWP function and the DRO formulation for the case in which the space of probability measures is $\mathcal{P}(\mathbb{R}^d)$ are studied in (Blanchet et al., 2016). In this paper, we are interested in the case $\mathcal{P}(\mathcal{S}^d)$; in other words, we consider the samples to be in $\mathcal{S}^d$ instead of $\mathbb{R}^d$. We want to emphasize that the derivations of these dual representations rely on the dual formulation of the so-called “problem of moments”, or a specific class of “Chebyshev-type inequalities”, referenced in the work by Isii (1962). The derivation by Isii is actually more general, in the sense that it applies to more general probability spaces than the ones used in this paper and in (Blanchet et al., 2016) (in fact, it is stated for general spaces of non-negative measures).

Throughout this section, we consider an integrable function , and a lower semi-continuous function such that