AdaPtive Noisy Data Augmentation for Regularized Estimation of Undirected Graphical Models
We propose an AdaPtive Noise Augmentation (PANDA) technique to regularize the estimation and construction of undirected graphical models. PANDA iteratively optimizes the objective function given the noise augmented data until convergence to achieve regularization on model parameters. The augmented noises can be designed to achieve various regularization effects on graph estimation, such as the bridge (including lasso and ridge), elastic net, adaptive lasso, and SCAD penalization; it also realizes the group lasso and fused ridge. We examine the tail bound of the noise-augmented loss function and establish that the noise-augmented loss function and its minimizer converge almost surely to the expected penalized loss function and its minimizer, respectively. We derive the asymptotic distributions for the regularized parameters through PANDA in generalized linear models, based on which, inferences for the parameters can be obtained simultaneously with variable selection. We show the non-inferior performance of PANDA in constructing graphs of different types in simulation studies and apply PANDA to an autism spectrum disorder data to construct a mixed-node graph. We also show that the inferences based on the asymptotic distribution of regularized parameter estimates via PANDA achieve nominal or near-nominal coverage and are far more efficient, compared to some existing post-selection procedures. Computationally, PANDA can be easily programmed in software that implements (GLMs) without resorting to complicated optimization techniques.
keywords: adjacency matrix, augmented Fisher information, generalized linear model (GLM), maximum a posterior estimation, sparsity, inference
1.1 Noise Injection
Noise injection (NI) is a simple and effective regularization technique that can improve the generalization ability of statistical learning and machine learning methods. We can roughly classify the NI techniques into two types. The first type refers to additive or multiplicative noises injected into the observed data or latent variables without changing the dimension of the original data . We refer to the second type as noise augmentation which expands the dimensionality of the original data (either or increases). Both types would lead to less overfitting and smaller generalization errors of the trained models and and parameters as compared to those learned from the original data without any regularization.
NI has wide applications in regularizing and learning neural networks (NN). Matsuoka (1992) proves that injecting noises to the input layers when training NN decreases the learned NN’s sensitivity to small input perturbation. Holmstrom and Koistinen (1992) interpret NI in the input nodes from the perspective of kernel smoothing in classification and mapping problems. The best known NI technique in NN training is the multiplicative Bernoulli noise, which is shown to achieve the regularization effect (for dropout) (Srivastava et al., 2014) or the plus some sparsity regularization on model parameters in the setting of generalized linear models (GLMs) (Kang et al., 2018). Grandvalet and Boucheron (1997) and Wager et al. (2013) also show that Bernoulli and constant-variance Gaussian NI in GLMs is equivalent to the Tikhonov regularization, after taking expectation of the second order approximated loss function over the distribution of injected noises. Whiteout (Li and Liu, 2017) injects adaptive additive and multiplicative Gaussian noises in NNs, where the variance of the Gaussian noise is a function of NN parameters and contains tuning parameters that lead to a variety of regularizers, including the bridge, ridge, lasso, adaptive lasso, elastic net, and group lasso. Gal and Ghahramani (2016) develop a theoretical framework that connects dropout in deep NNs with approximate Bayesian inference in deep Gaussian processes. Noh et al. (2017) suggest that NI regularization optimizes the lower bound of the objective function marginalized over the distribution of hidden nodes.
Given the success of NI in regularizing NNs, one would conjecture that, conditional on properly designed noises, NI can be potentially useful in regularizing other types of large and complex models. That is indeed the case; but we only found a couple of cases beyond the framework of NNs. In the first case, NI is applied to the linear discriminant analysis (Skurichina and Duin, 1999), where redundant features are augmented to the observed data ( increases while remains the same), yielding similar effects as other regularization techniques. In the second case, the regularization in linear regression setting can be realized by appending a diagonal matrix (where is the tuning parametric) to the design matrix and rows of 0 to the centered outcome (Allen, 1974; Hastie et al., 2009a). Both cases also happen to be noise augmentation.
In this discussion, we explore the utility of NI, more specifically NA, in regularizing undirected graphical models (UGMs), where the injected noises are adaptive to the most updated parameters during an iterative computation procedure rather than being drawn from a fixed distribution, and can be designed to achieve various regularization effects on the model parameters.
1.2 Undirected Graphical Models (UGM)
A graphical model is a probabilistic model that expresses the conditional dependence structure among random variables in a graph. The random variables are often referred to as the nodes of the graph. If two nodes are dependent conditional on all the other nodes in the graph, then an edge is drawn between the two node; otherwise, there is no edge. The edges in a UGM have no direction. We denote a UGM by with nodes, where refers to the data observed in the nodes and is a unknown symmetric adjacency matrix (weighted or unweighted). A non-zero entry represents conditional dependence between nodes and . Construction and estimation of given is often the main goal in UGM problems.
Many real-life graphs are believed to be sparse, such as biological networks (Leclerc, 2008), meaning that the proportion of none-zero ’s in is small. In addition, data collected for estimating edges often have . Given both the practical and technical needs, regularization techniques that promote sparsity in are often employed when constructing a UGM. One popular approach is the neighborhood selection (NS) method which estimates by columnwise modeling the conditional distribution of each node given all the other nodes, leading to regression models. When the conditional distributions belong to an exponential family, the generalized linear models (GLM) can be employed, including linear regression for Gaussian nodes, logistic regression and Ising models for Bernoulli nodes (Ravikumar et al., 2010; Hofling and Tibshirani, 2009; Kuang et al., 2017; Jalali et al., 2011), and Poisson regression for count nodes (Allen and Liu, 2012). Also noted is that the nodes in a graph do not have to be of the same type; and there exist works for mixed graph models (MGMs) with nodes of mixed types (Fellinghauer et al., 2013; Yang et al., 2012, 2014). In terms of the regularization techniques that promote sparsity in the relationships among the nodes, the lasso (Meinshausen and Bühlmann, 2006), the graphical Dantzig selector (Yuan, 2010; Cai et al., 2011), the graphical scaled lasso (Sun and Zhang, 2012) and the SQRT-Lasso (Liu and Wang, 2012; Belloni et al., 2012) have been proposed, among others.
When all nodes in a UGM follow a multivariate Gaussian distribution, the UGM is referred to as the Gaussian graphical model (GGM). There exist approaches for edge estimation specifically for GGM in addition to the general NS approach mentioned above. For example, Huang et al. (2006) obtain the Cholesky decomposition (CD) of the precision matrix of the multivariate Gaussian distribution, and then apply the penalty to the elements of the triangular matrix from the CD. Levina et al. (2008) apply the adaptive banding method (Bickel and Levina, 2008) with a nested Lasso penalty on the regression coefficients of linear regressions after the CD. Liu and Xi (2015) reformulate the NS as a regularized quadratic optimization problem without directly employing the Gaussian likelihood. Another line of research focuses on estimating as a whole while ensuring its positive-definiteness (PD) and sparsity. For example, Yuan and Lin (2007) propose a penalized likelihood approach that accomplishes model selection and estimation simultaneously and also ensures the PD of the estimated . J. Friedman and Tibshirani (2008); O. Banerjee and d’Aspremont (2008); Rothman et al. (2008) propose efficient computational algorithms to implement the penalized likelihood approach. Theoretical properties of the penalized likelihood methods are developed in Ravikumar et al. (2008); Rothman et al. (2008); Lam and Fan (2009).
1.3 Our Contributions
We propose AdaPtive Noisy Data Augmentation (PANDA) - a general, novel, and effective NI technique to regularize the estimation and construction of UGMs. Denote the sample size of the observed data by , PANDA augments the observations with properly designed noise terms to achieve the desired regularization effects on model parameters. One requirement on is , which allows for the ordinary least squares (OLS) or maximum likelihood estimation (MLE) approaches to be employed to estimate the model parameters without resorting to complicated algorithms to optimize objective functions with regularizers.
To the best of our knowledge, PANDA is the first NI, more specifically, the data or noise augmentation technique for regularizing UGMs. Our overarching goal is to show that PANDA delivers non-inferior performance while enjoying learning, inferential, and computational advantages compared to the existing UGM estimation approaches. Our contributions are listed below.
By properly designing the variance of the augmented noise, PANDA can achieve various regularization effects, including bridge () () with lasso () and ridge () as special cases, elastic net (), SCAD, group lasso, and graphical ridge for single graph estimation.
PANDA can be used to construct mixed graph models, without additional complexity compared to constructing a graph with the same types of nodes.
Computation in PANDA is straightforward and only employs the OLS in linear regression and the MLE in GLMs to iteratively estimate the model parameters on the augmented data. The variance terms of the augmented noise are adaptive to the most updated parameter estimates until the algorithm converges.
We establish the Gaussian tail of the noise-augmented loss function and the almost sure convergence to its expectation as or increases, which is a penalized loss function with the targeted regularizer, providing theoretical justification for PANDA as a regularization technique and that the noise-augmented loss function is trainable for practical implementation. providing the theoretical justification for PANDA.
We connect PANDA with the Bayesian framework and show that the regularized parameter estimate in PANDA is equivalent to the “maximum a posterior” (MAP) in the Bayesian framework.
PANDA offers an alternative approach to post-selection procedures for obtaining inferences for regression coefficients from GLMs with sparsity regularization, whether the estimates are zero-valued or not. Our empirical results suggest the inferences based on PANDA are valid and more efficient compared to some existing post-selection procedures.
The rest of the paper is organized as follows. Section 2 presents several PANDA algorithms and their associated regularization effects for constructing GGM and UMG in general. Section 3 presents the Bayes interpretation for PANDA. Section 4 establishes the consistency on the noise-augmented loss function and the regularized parameter estimates, presents the Fisher information of the model parameters in augmented data and a formal test for convergence in PANDA algorithms. It also provides the asymptotic distributions for the parameter estimates via PANDA in the GLM setting. Section 5 compares PANDA to the constrained optimization approach in edge detection for several types of UGMs, and to the post-selection inferential approach in statistical inferences in GLMs. Section 6 applies PANDA to estimating the association among the attributes in a real autism spectrum disorder data set. Section 7 provides some concluding remarks and offers future research directions on PANDA.
In this section, we present PANDA to regularize the construction of GGMs and UGMs via NS (Sec 2.1, 2.1.1, 2.1.3). In the case of GGM construction, in addition to NS, PANDA can also be implemented in the context of other types of regularization than the NS framework, which will be detailed in Sec 2.2.2, 2.2.3, and 2.2.4.
2.1 Neighborhood selection (NS) via PANDA in UGM
Let be the index for the nodes in a UGM. The neighborhood selection (NS) approach, as referred to in this paper, for constructing a UGM assumes the conditional distribution of given comes from an exponential family
where if the canonical link is used (e.g., the identity link for Gaussian ; the logit link for Bernoulli ). Eqn (1) suggests that the relationship among the nodes can be recovered by running GLMs times; that is, there is no edge between nodes and in the graph if ; otherwise, the two nodes are connected with an edge. Yang et al. (2012, 2015) establish, under some regularity conditions, that the structure of a UGM can be recovered exactly via M-estimators with high probability when node-conditional distributions belong to an exponential family in Eqn (1). Regularization (e.g., sparsity regularization) is often imposed when running the node-wise GLM to estimate , followed by developing an optimization algorithms to solve for (refer to Section 1 for some existing work in this direction). When a graph contains nodes of different types (e.g. node is Gaussian and node is Bernoulli), due to the asymmetry in the regression modles on and , and would have different interpretation from a regression perspective. However, the actual magnitude of would not be important if the goal is to decide there is an edge between and or not.
PANDA estimates by first augmenting the observed data with a noisy data matrix. We recommend centerizing the observed data on each “covariate” node in in a UGM and standardizing all nodes in a GGM (or standardizing and centering the “outcome” node ) prior to the augmentation. Figure 1 depicts a schematic of the data augmentation step in PANDA for a single graph. The augmented values to is a constant and is the sample average of the outcome node ( for for GGM unless stated otherwise). The augmented observations for the covariate node () are drawn independently from a Gaussian distribution with mean 0 and variance that depends on and the tuning parameters (Eqns (2) to (7)). We refer to these distributions as the Noise Generating Distributions (NGD).
, , are tuning parameters, either user-specified or chosen by a model selection criterion such as cross validation (CV), AIC, or BIC. Different formulation of the variance term leads to different regularization effects on . Specifically, Eqn (2) leads to the bridge-type regularization which including the lasso () and ridge regression () as special cases, Eqn (3) to elastic net, Eqn (4) to adaptive lasso, and Eqn (5) to SCAD, respectively. Eqns (2) to (5) suggest that the dispersion of the noise terms varies by node: nodes associated with small-valued will be augmented with more spread out noises, and those with large-valued will be augmented with noises around zero.
In addition to Eqns (2) to (5), PANDA can also realize other types of regularization. For example, to simultaneously regularize a group of nodes that share connection patterns with the same node (e.g., genes on the same pathway, binary dummy variables created from the same categorical node), we can generate augmented noises in these nodes simultaneously from Eqn (6) to yield a group lasso-like penalty on , and from Eqn (7) to yield a fused-ridge type penalty on .
|where||entries in are for ; and 0 otherwise.|
The group-lasso regularization in Eqn (6) sets either at zero or nonzero simultaneously, whereas the fused ridge regularization in Eqn (7) promotes numerical similarity among in the same group We could also obtain a fused-lasso type of regularization on by letting () in Eqn 7. However, it does not necessarily outperform the fused ridge regularizer in terms of promoting similarity on parameter estimates. Since the fused ridge is more stable computational in the context of PANDA, we therefore focus our discussion on the fused ridge in the rest of the paper.
Eqns (2) to (7) suggest that the variance of the augmented noise depend on the unknown . When implementing PANDA in practice, we start with some initial values for and then estimate it iteratively. In each iteration, the augmented noises are drawn from the NGD with the variance constructed using the most updated . The iterative procedure continues until the convergence criterion is met.
2.1.1 PANDA-NS for GGM
Let , where is the covariance matrix, then the conditional distribution given is for , where is the -th diagonal element of , is the submatrix of with the -th row and the -th column removed, and is the -th row of with the -th element removed, and . The conditional distribution suggests the following linear model
The intercepts () can be set at 0 with centered . Let be the precision matrix and are the -th entry in ; then and for (Hastie et al., 2009b), implying that () is equivalent to . Running regressions separately in Eqn (8), with or without regularization on , does not lead to a symmetric estimate nor does it guarantee its positive definiteness. If the main goal is to determine the existence of an edge between nodes and , there are two common practices leading to a null edge between nodes and (Meinshausen and Bühlmann, 2006): the intersection rule and the union rule , with the latter resulting in less edges.
PANDA regularizes the estimation of with iterative injection of Gaussian noises drawn from the NGDs. During an iteration, in the regression with outcome node , PANDA augments centered observed data in node with , and those in node () with drawn from a NGD in Eqns (2) to (5). , the size of augmented noisy data, should be large enough so that and can be estimated with OLS by running the regression model in Eqn (8) on the augmented data.
Proposition 1 establishes that PANDA, in expectation over the distribution of the injected noise, minimizes the overall penalized SSE in the linear regression models with a penalty term on for . In other words, PANDA achieves the same global optimum as in Yuan (2010) by iteratively solving the OLS of until convergence. The proof of Proposition 1 is given in Appendix A.
Proposition 1 (regularization effect of PANDA-NS for GGM).
The loss function given the original data is the overall sum of squared errors (SSE) , and the loss function based on the augmented data is . The expectation of over the distribution of noise is
The penalty term takes different forms for different NGDs. Specifically,
when , resulting in a bridge-type penalty (the lasso and ridge-type penalties are special cases at and , respectively).
when , resulting in a elastic net-type penalty.
when , where is a
-consistent estimator of , resulting in an adaptive-lasso-type penalty.
for , resulting in a SCAD-type penalty.
when , where is the index for the groups, resulting in a group-lasso-type penalty.
Remark 1 (convergence criterion).
We provide three choices to evaluate the convergence of the PANDA algorithm: 1) eyeball the trace plots of , which is the most straightforward and often sufficient and effective; 2) use a cutoff value, say on the absolute percentage change on from two consecutive iterations: if , then we may declare convergence; 3) apply a formal statistical test on , the details of which is provided in Section 4.4. Note that due to the randomness of the augmented noises from iteration to iteration, there is always some fluctuation around for finite and . It is important to keep this in mind when evaluating convergence. For example, in the second criterion, is expected to be small upon convergence, but being arbitrarily close to 0 would be difficult to achieve with finite or . In the empirical studies in Sections 5 and 6, was on the order of upon convergence.
Remark 2 (maximum iteration ).
Remark 3 (choice of and ).
The expected regularization in Proposition 1 can be realized either by letting as in , or by letting as in under the constraint for a given . The constraint guarantees that injected noise does not over-regularize or trump the information about contained in the observed data even when is large. For example, , for the lasso-type noise, and would be treated together as one tuning parameter. In practice, we can set either or at a large number to achieve the regularization effect. Our empirical results suggest the algorithm seems to converge faster and the loss function experiences less fluctuation by using a large ( can be as small as 1 or 2) than using a large . Regarding what specific value to use on , the only requirement is so that an unique OLS can be obtained from each regression in each iteration; but a large would need less iterations to converge. Regarding the choice of , it more or less depends on ; if a large still results in noticeable fluctuation around , then a large can be used to speed up the convergence on . There are also other considerations on the choices of and in non-Gaussian UGMs and when using PANDA to obtain inferences on parameters, which are discussed in Sections 2.1.3) and 4.3), respectively.
Remark 4 (hard thresholding and choice of ).
The hard thresholding is necessary as well as justified. It is needed for setting non-significant edges at 0 because, though the estimates of the zero-valued can get arbitrarily close, the exact 0 estimate cannot be achieved computationally in PANDA. The hard thresholding is justified because of the estimation and selection consistency property of PANDA established in Section 4.1. In addition, after the convergence of the PANDA algorithm, there is still mild fluctuation around the parameter estimates, especially when or are not large. We would need a sequence of estimates on to average out the random fluctuation; and we refer to this sequence as the banked estimates, the length of which is . In the empirical studies we have conducted, is sufficient.
Remark 5 (non-convex targeted regularizers).
PANDA optimizes a convex objective function in each iteration in the regression framework once the NA step is completed even when the targeted regularizer itself is non-convex, such as the SCAD. As such, PANDA will not run into computational difficulties as experienced by non-convex optimization. That said, the final solutions for parameter estimates will depend highly on the staring values of the parameter – different starting value could lead to different local optima.
2.1.2 Connection between PANDA-NS and weighted ridge regression for GGM
Algorithm 1 shows the OLS estimator is obtained from the noise-augmented data in each iteration. Corollary 1 states that this OLS estimator is also a weighted ridge estimator. Compared to the regular ridge estimator, where the same constant is used for all the diagonal elements of , different constants are used for different diagonal elements in the weighted ridge estimator.
Corollary 1 (PANDA and weighted ridge regression).
The OLS estimator from the regression with outcome node in PANDA on the noise augmented data is equivalent to the weighted ridge estimator .
The proof is straightforward. Let . The OLS estimator on the augmented data is , leading to Corollary 1. When , . For example, if (), then . Therefore, the regularization effect varies by the magnitude of – the closer is to 0, the more regularization (shrinkage to 0) there is on the estimate .
2.1.3 PANDA for UGM with non-Gaussian nodes
When the conditional distribution of every node given the other nodes follow an exponential family, then regardless whether the nodes are of the same or mixed types, PANDA-NS can regularize the graph construction via running GLM with the canonical link functions. Proposition 2 states the expected regularization effects of PANDA in UGM. The proof is given in Appendix B.
Proposition 2 (Regularization effects of PANDA in UGMs).
Let the loss function given the observed data be (summation of negative log-likelihood functions), and the loss function given with the noise augmented data be
Apply the Taylor expansion to around and take expectation over the distribution of , we have
where and are constants independent of .
The actual form in Eqn 11 depends on the node type of and the NGD from which is drawn. Table 1 lists some examples on if the lasso-type NSG is used ( in Eqn (2)) for graphs with the same type of nodes. For examples, if all nodes follow a Bernoulli distribution given all the other nodes, then the graph is called Bernoulli graph model (BGM); similarly for EGM (Exponential), PGM (Poisson), and NBGM (Negative Binomial).
|NBGM||( is the # of failures)|
Similar to Proposition 1, the expectation of in Proposition 2 can be achieved by letting as in , or, suggested by Eqn (11), by letting with the constraint ; that is, . Between and , the latter would be preferable in that the higher-order term in Eqn (11), meaning the targeted regularizer can be achieved arbitrarily well. with fixed has no effect on the higher-order term, which can only reply on small or small , to become ignorable relative to the lower-order term , the targeted regularizer. In other words, the higher-order term, which is a function of , might bring additional regularization to on top of the targeted regularization.
To illustrate the differences between the regularization effects between letting and , we display in Figure 2 the relationships between the realized by PANDA and for several graph types, along with their empirical versions when the lasso-typed augmented noises are used (the regularization effect in EGM looks very similar to the PGM and the results from EGM are not provided). The targeted regularizer is lasso (). With ( fixed at 1, and ), the realized penalty (red lines) is identical to lasso in all four graphs; and its empirical version (the blue dots) at is very close to the analytic form except for some very mild fluctuation. The realized regularization on with while is small (orange lines) varies by graph. When is small, the distinction between and is minimal in four cases as the higher-order term is ignorable in each graph. As as increases, the regularization deviates from linearity (the target regularization) since the the higher-order residual term in Eqn (11) becomes less ignorable. Specifically, the realized regularization is sub-linear for BGM through logistic regression and for NBGM through NB regression (though not obvious), and super-linear in PGM through Poisson regression (and EGM). The only exception is GGM through linear regression where the higher-order term is analytically 0.
In Figure 3, we show how the regularized parameter estimates obtained with large vs. with large change with when the lasso-type noise is used in PANDA. Specifically, we run PANDA in simulated data in linear regression and Poisson regression, respectively, with () and (). In both cases, there are 30 predictor () and . In the linear regression, the predictors were simulated from N; in the Poisson regression, the predictors were simulated from Unif. Out of the 30 regression coefficients, 9 of them were set at 0, and the 21 nonzero coefficients ranged from 0.5 to 1. Under these settings, the trajectories of the regularized estimates for the 9 zero-valued parameters are similar for large and large in both regression; but large had a higher computation cost.
In practice, when is large, setting in a PANDA algorithm is sufficient to achieve the expected regularization effect. On the other hand, a very large will slow the computation in each iteration. Therefore, we would recommend set at a somewhat large value to yield the expected regularization effect, and then set at a small value to average out the fluctuation around the estimated parameters.
Due to space limitation, we list the computational algorithm in PANDA for constructing UGM in Algorithm S.1 in the Supplementary Materials. Most of the steps are similar to Algorithm 1 for GGM, with a few differences. First, there is no standardization of data; second, the loss function optimized in each iteration is the sum of the negative log-likelihood in Eqn (10) across the nodes; third, MLE (not OLS) is calculated from regressing on all other nodes for in each iteration. The guidelines for choosing of the algorithmic parameters (e.g., ) and evaluating the convergence as laid out in Remarks 2 to 5 also apply to the UGM algorithm.
2.2 Other regularization for GGM via PANDA
For GGM, given the connection between the graph structure and the precision matrix of the multivariate Gaussian distribution, additional approaches have been proposed to construct a GGM. We list three of these approaches that can all be realized through PANDA.
2.2.1 PANDA-SPACE for hub nodes detection in GGM
The elements in the precision matrix of a multivariate Gaussian distribution are related to the partial correlation coefficients in linear regression. Specifically, the partial correlation between node and node is ( Lemma 1 in Peng et al. (2009)). SPACE (Sparse PArtial Correlation Estimation) is an approach to select nonzero partial correlation when (Peng et al., 2009). Non-zero implies non-zero and an edge between nodes and in GGM. The biggest advantage of SPACE, compared to NS, is that not only does it identify edges, it is also efficient for identifying hub nodes. Corollary 2 shows that PANDA can realize SPACE by imposing a bridge-type penalty on . The data augmentation scheme is similar to Figure 1. In each iteration, PANDA runs linear regressions based on the noise-augmented data, obtain estimates for , and , and calculates and .
Corollary 2 (Panda-Space).
Let for , and . Then
2.2.2 PANDA-CD for GGM
The Cholesky decomposition (CD) approach refers to estimating through the LDL decomposition, a variant of the CD. Compared to the NS in Section 2.1.1, the CD approach guarantees symmetry and positive definiteness of the estimated . WLOG, let , and the corresponding negative log-likelihood is . There exists a unique LDL decomposition , such that , where and is a lower uni-triangular matrix with elements for , 0 for , and 1 for . Therefore,