Generalized Score Matching for Non-Negative Data

\name Shiqing Yu \email sqyu@uw.edu
\addr Department of Statistics
University of Washington, Seattle, WA, U.S.A.
\AND
\name Mathias Drton \email md5@uw.edu
\addr Department of Mathematical Sciences
University of Copenhagen, Copenhagen, Denmark
and
\addr Department of Statistics
University of Washington, Seattle, WA, U.S.A.
\AND
\name Ali Shojaie \email ashojaie@uw.edu
\addr Department of Biostatistics
University of Washington, Seattle, WA, U.S.A.
Abstract

A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation may be implemented using numerical integration, the approach becomes computationally intensive. The score matching method of Hyvärinen (2005) avoids direct calculation of the normalizing constant and yields closed-form estimates for exponential families of continuous distributions over $\mathbb{R}^m$. Hyvärinen (2007) extended the approach to distributions supported on the non-negative orthant, $\mathbb{R}_+^m$. In this paper, we give a generalized form of score matching for non-negative data that improves estimation efficiency. As an example, we consider a general class of pairwise interaction models. Addressing an overlooked inexistence problem, we generalize the regularized score matching method of Lin et al. (2016) and improve its theoretical guarantees for non-negative Gaussian graphical models.

\firstpageno{1}

\jmlrheading{20}{2019}{1--\pageref{LastPage}}{5/18; Revised 4/19}{4/19}{18-278}{Shiqing Yu, Mathias Drton, Ali Shojaie} \ShortHeadings{Generalized Score Matching}{Yu, Drton and Shojaie}

\editor{Aapo Hyvärinen}

{keywords}

exponential family, graphical model, positive data, score matching, sparsity

1 Introduction

Score matching was first developed in Hyvärinen (2005) for continuous distributions supported on all of $\mathbb{R}^m$. Consider such a distribution $P_0$, with density $p_0$ and support equal to $\mathbb{R}^m$. Let $\mathcal{P}$ be a family of distributions with twice continuously differentiable densities. The score matching estimator of $P_0$ using $\mathcal{P}$ as a model is the minimizer of the expected squared distance between the gradients of $\log p_0$ and a log-density from $\mathcal{P}$. So we minimize the loss $\frac{1}{2}\mathbb{E}_{P_0}\|\nabla\log p(\boldsymbol{X})-\nabla\log p_0(\boldsymbol{X})\|_2^2$ with respect to densities $p$ from $\mathcal{P}$. The loss depends on $p_0$, but integration by parts can be used to rewrite it in a form that can be approximated by averaging over the sample without knowing $p_0$. A key feature of score matching is that normalizing constants cancel in gradients of log-densities, allowing for simple treatment of models with intractable normalizing constants. For exponential families, the loss is quadratic in the canonical parameter, making optimization straightforward.

If the considered distributions are supported on a proper subset of $\mathbb{R}^m$, then the integration by parts arguments underlying the score matching estimator may fail due to discontinuities at the boundary of the support. For data supported on the non-negative orthant $\mathbb{R}_+^m$, Hyvärinen (2007) addresses this problem by modifying the loss to $\frac{1}{2}\mathbb{E}_{P_0}\|\boldsymbol{X}\odot\nabla\log p(\boldsymbol{X})-\boldsymbol{X}\odot\nabla\log p_0(\boldsymbol{X})\|_2^2$, where $\odot$ denotes entrywise multiplication. In this loss, boundary effects are dampened by multiplying gradients elementwise with the identity functions $x_j\mapsto x_j$.

In this paper, we propose generalized score matching methods that are based on elementwise multiplication with functions other than the identity. As we show, this can lead to drastically improved estimation accuracy, both theoretically and empirically. To demonstrate these advantages, we consider a family of graphical models on $\mathbb{R}_+^m$, which does not have tractable normalizing constants and hence serves as a practical example.

Graphical models specify conditional independence relations for a random vector $\boldsymbol{X}=(X_1,\dots,X_m)$ indexed by the nodes of a graph (Lauritzen, 1996). For undirected graphs, variables $X_j$ and $X_k$ are required to be conditionally independent given all other variables if there is no edge between nodes $j$ and $k$. The smallest undirected graph with this property is the conditional independence graph of $\boldsymbol{X}$. Estimation of this graph and associated interaction parameters has been a topic of continued research, as reviewed by Drton and Maathuis (2017).

Largely due to their tractability, Gaussian graphical models (GGMs) have gained great popularity. The conditional independence graph of a multivariate normal vector is determined by the inverse covariance matrix $\mathbf{K}=\boldsymbol{\Sigma}^{-1}$, also termed the concentration or precision matrix. Specifically, $X_j$ and $X_k$ are conditionally independent given all other variables if and only if the $(j,k)$-th and the $(k,j)$-th entries of $\mathbf{K}$ are both zero. This simple relation underlies a rich literature including Drton and Perlman (2004), Meinshausen and Bühlmann (2006), Yuan and Lin (2007) and Friedman et al. (2008), among others.

More recent work has provided tractable procedures also for non-Gaussian graphical models. This includes Gaussian copula models (Liu et al., 2009; Dobra and Lenkoski, 2011; Liu et al., 2012), Ising models (Ravikumar et al., 2010), other exponential family models (Chen et al., 2015; Yang et al., 2015), as well as semi- or non-parametric estimation techniques (Fellinghauer et al., 2013; Voorman et al., 2014). In this paper, we apply our method to a class of pairwise interaction models that generalizes non-negative Gaussian random variables, as recently considered by Lin et al. (2016) and Yu et al. (2016), as well as square root graphical models proposed by Inouye et al. (2016) when the sufficient statistic function is a pure power. However, our main ideas can also be applied to other classes of exponential families whose support is restricted to a rectangular set.

Our focus will be on pairwise interaction power models with probability distributions having (Lebesgue) densities proportional to

$$\exp\left(-\frac{1}{2a}\,\boldsymbol{x}^{a\top}\mathbf{K}\,\boldsymbol{x}^{a}+\boldsymbol{\eta}^{\top}\,\frac{\boldsymbol{x}^{b}-\mathbf{1}_m}{b}\right)\tag{1}$$

on $\mathbb{R}_+^m$, with powers of vectors taken elementwise. Here $a\geq0$ and $b\geq0$ are known constants, and $\mathbf{K}\in\mathbb{R}^{m\times m}$ and $\boldsymbol{\eta}\in\mathbb{R}^m$ are unknown parameters of interest. When $a=0$ (resp. $b=0$) we define $\boldsymbol{x}^{a\top}\mathbf{K}\boldsymbol{x}^{a}/(2a)\equiv(\log\boldsymbol{x})^\top\mathbf{K}(\log\boldsymbol{x})/2$ and $\boldsymbol{\eta}^\top(\boldsymbol{x}^{b}-\mathbf{1}_m)/b\equiv\boldsymbol{\eta}^\top\log\boldsymbol{x}$, with the logarithm applied element-wise. This class of models is motivated by the form of important univariate distributions for non-negative data, including gamma and truncated normal distributions. It provides a framework for pairwise interaction that is concrete yet rich enough to capture key differences in how densities may behave at the boundary of the non-negative orthant, $\mathbb{R}_+^m$. Moreover, the conditional independence graph of a random vector with distribution as in (1) is determined just as in the Gaussian case: $X_j$ and $X_k$ are conditionally independent given all other variables if and only if $\kappa_{jk}=\kappa_{kj}=0$ in the interaction matrix $\mathbf{K}=(\kappa_{jk})$. Section 5.1 gives further details on these models. We will develop estimators of $\mathbf{K}$ in (1) and the associated conditional independence graph using the proposed generalized score matching.
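To make this concrete, consider $m=1$, so that $\mathbf{K}=k>0$ and $\boldsymbol{\eta}=\eta$ are scalars; the following two specializations of (1) are our own worked illustrations. For $a=\tfrac{1}{2}$ and $b=0$,

$$p(x)\ \propto\ x^{\eta}\,e^{-kx},$$

a gamma density with shape $\eta+1$ (requiring $\eta>-1$) and rate $k$. For $a=b=1$,

$$p(x)\ \propto\ \exp\left(-\tfrac{k}{2}\,x^{2}+\eta x\right)\ \propto\ \exp\left(-\tfrac{k}{2}\big(x-\tfrac{\eta}{k}\big)^{2}\right),$$

a normal density with mean $\eta/k$ and variance $1/k$, truncated to $[0,\infty)$.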

A special case of (1) are truncated Gaussian graphical models, with $a=b=1$. Let $\boldsymbol{\mu}\in\mathbb{R}^m$, and let $\mathbf{K}\in\mathbb{R}^{m\times m}$ be a positive definite matrix. Then a non-negative random vector $\boldsymbol{X}$ follows a truncated normal distribution for mean parameter $\boldsymbol{\mu}$ and inverse covariance parameter $\mathbf{K}$, in symbols $\boldsymbol{X}\sim\mathrm{TN}(\boldsymbol{\mu},\mathbf{K})$, if it has density proportional to

$$\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^\top\mathbf{K}(\boldsymbol{x}-\boldsymbol{\mu})\right)\tag{2}$$

on $\mathbb{R}_+^m$. We refer to $\boldsymbol{\Sigma}\equiv\mathbf{K}^{-1}$ as the covariance parameter of the distribution, and note that the parameter $\boldsymbol{\eta}$ in (1) is $\mathbf{K}\boldsymbol{\mu}$. Another special case of (1) is the exponential square root graphical model in Inouye et al. (2016), where $a=b=1/2$.

Lin et al. (2016) estimate truncated GGMs based on Hyvärinen's modification, with an $\ell_1$ penalty on the entries of $\mathbf{K}$ added to the loss. However, the paper overlooks the fact that the loss can be unbounded from below in the high-dimensional setting even with an $\ell_1$ penalty, such that no minimizer may exist. Since the unpenalized loss is quadratic in the parameter to be estimated, we propose modifying it by adding small positive values to the diagonal of the positive semi-definite matrix that defines the quadratic part, in order to ensure that the loss is bounded and strongly convex and admits a unique minimizer. We apply this to the estimator for GGMs considered in Lin et al. (2016), which uses score matching on $\mathbb{R}_+^m$, and to the generalized score matching estimator for pairwise interaction power models on $\mathbb{R}_+^m$ proposed in this paper. In these cases, we show, both empirically and theoretically, that the consistency results still hold (or even improve) if the positive values added are smaller than a threshold that is readily computable.

The rest of the paper is organized as follows. Section 2 introduces score matching and our proposed generalized score matching. In Section 3, we apply generalized score matching to exponential families, with univariate truncated normal distributions as an example. Regularized generalized score matching for graphical models is formulated in Section 4. The estimators for pairwise interaction power models are derived in Section 5, while theoretical consistency results are presented in Section 6, where we treat the probabilistically most tractable case of truncated GGMs. Simulation results and applications to RNAseq data are given in Section 7. Proofs for theorems in Sections 2–6 are presented in Appendices A and B. Additional experimental results are presented in Appendix C.

1.1 Notation

Constant scalars, vectors, and functions are written in lower-case (e.g., $x$, $\boldsymbol{h}$), random scalars and vectors in upper-case (e.g., $X$, $\boldsymbol{X}$). Regular font is used for scalars (e.g., $x$, $X$), and boldface for vectors (e.g., $\boldsymbol{x}$, $\boldsymbol{X}$). Matrices are in upright bold, with constant matrices in upper-case (e.g., $\mathbf{K}$) and random matrices holding observations in lower-case (e.g., $\mathbf{x}$). Subscripts refer to entries in vectors and columns in matrices. Superscripts refer to rows in matrices. So $X_j$ is the $j$-th component of a random vector $\boldsymbol{X}$. For a data matrix $\mathbf{x}\in\mathbb{R}^{n\times m}$, each row comprising one observation of $m$ variables/features, $x_j^{(i)}$ is the $j$-th feature for the $i$-th observation. Stacking the columns of a matrix $\mathbf{A}$ gives its vectorization $\operatorname{vec}(\mathbf{A})$. For a matrix $\mathbf{A}$, $\operatorname{diag}(\mathbf{A})$ denotes its diagonal, and for a vector $\boldsymbol{a}$, $\operatorname{diag}(\boldsymbol{a})$ is the diagonal matrix with diagonal entries $\boldsymbol{a}$.

For $q\in[1,+\infty)$, the $\ell_q$-norm of a vector $\boldsymbol{v}=(v_1,\dots,v_m)^\top$ is denoted

$$\|\boldsymbol{v}\|_q\equiv\Big(\sum_{j=1}^m|v_j|^q\Big)^{1/q},$$

with $\|\boldsymbol{v}\|_\infty\equiv\max_j|v_j|$. A matrix $\mathbf{A}=(a_{jk})$ has Frobenius norm

$$\|\mathbf{A}\|_F\equiv\Big(\sum_{j,k}a_{jk}^2\Big)^{1/2}$$

and max norm $\|\mathbf{A}\|_{\max}\equiv\max_{j,k}|a_{jk}|$. Its $\ell_q$-$\ell_r$ operator norm is

$$\|\mathbf{A}\|_{q,r}\equiv\sup_{\|\boldsymbol{v}\|_q=1}\|\mathbf{A}\boldsymbol{v}\|_r,$$

with shorthand notation $\|\mathbf{A}\|_q\equiv\|\mathbf{A}\|_{q,q}$; for instance, $\|\mathbf{A}\|_\infty=\max_j\sum_k|a_{jk}|$.

For a function $f:\mathbb{R}_+^m\to\mathbb{R}$, we define $\partial_jf\equiv\partial f/\partial x_j$ as the partial derivative with respect to $x_j$, and $\nabla f\equiv(\partial_1f,\dots,\partial_mf)^\top$. For univariate functions $f_1,\dots,f_m:\mathbb{R}_+\to\mathbb{R}$, $\boldsymbol{f}\equiv(f_1,\dots,f_m)^\top$, we let $\boldsymbol{f}'\equiv(f_1',\dots,f_m')^\top$ be the vector of derivatives. Likewise $\boldsymbol{f}''$ is used for second derivatives. The symbol $\mathbf{1}_A$ denotes the indicator function of the set $A$, while $\mathbf{1}_m$ is the vector of all 1's. For $a,b\in\mathbb{R}$, $a\vee b\equiv\max(a,b)$ and $a\wedge b\equiv\min(a,b)$. A density of a distribution is always a probability density function with respect to Lebesgue measure. When it is clear from the context, $\mathbb{E}$ denotes the expectation under the true distribution $P_0$.

2 Score Matching

In this section, we review the original score matching and develop our generalized score matching estimators.

2.1 Original Score Matching

Let $\boldsymbol{X}$ be a random vector taking values in $\mathbb{R}^m$ with distribution $P_0$ and density $p_0$. Let $\mathcal{P}$ be a family of distributions of interest with twice continuously differentiable densities supported on $\mathbb{R}^m$. Suppose $P_0\in\mathcal{P}$. The score matching loss for $P\in\mathcal{P}$, with density $p$, is given by

$$J(P)=\frac{1}{2}\int_{\mathbb{R}^m}p_0(\boldsymbol{x})\,\big\|\nabla\log p(\boldsymbol{x})-\nabla\log p_0(\boldsymbol{x})\big\|_2^2\,\mathrm{d}\boldsymbol{x}.\tag{3}$$

The gradients in (3) can be thought of as gradients with respect to a hypothetical location parameter, evaluated at the origin (Hyvärinen, 2005). The loss is minimized if and only if $P=P_0$, which forms the basis for estimation of $P_0$. Importantly, since the loss depends on $p$ only through its log-gradient, it suffices to know $p$ up to a normalizing constant. Under mild conditions, (3) can be rewritten as

$$\int_{\mathbb{R}^m}p_0(\boldsymbol{x})\sum_{j=1}^m\Big[\partial_j^2\log p(\boldsymbol{x})+\frac{1}{2}\big(\partial_j\log p(\boldsymbol{x})\big)^2\Big]\,\mathrm{d}\boldsymbol{x}\tag{4}$$

plus a constant independent of $p$. The integral in (4) can be approximated by a sample average; this alleviates the need for knowing the true density $p_0$, and provides a way to estimate $P_0$.
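As a sanity check of this equivalence, consider $m=1$ and expand the square in (3). The only term involving both $p$ and $p_0$ is the cross term, which under vanishing boundary conditions satisfies

$$-\int_{\mathbb{R}}p_0(x)\,(\log p_0)'(x)\,(\log p)'(x)\,\mathrm{d}x=-\int_{\mathbb{R}}p_0'(x)\,(\log p)'(x)\,\mathrm{d}x=\int_{\mathbb{R}}p_0(x)\,(\log p)''(x)\,\mathrm{d}x,$$

where the first equality uses $p_0\,(\log p_0)'=p_0'$ and the second is integration by parts, assuming $p_0(x)(\log p)'(x)\to0$ as $x\to\pm\infty$. The term $\frac{1}{2}\int p_0\,[(\log p)']^2$ appears in (4) unchanged, and $\frac{1}{2}\int p_0\,[(\log p_0)']^2$ is the constant not depending on $p$.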

2.2 Generalized Score Matching for Non-Negative Data

When the true density $p_0$ is supported on a proper subset of $\mathbb{R}^m$, the integration by parts underlying the equivalence of (3) and (4) may fail due to discontinuity at the boundary. For distributions supported on the non-negative orthant $\mathbb{R}_+^m$, Hyvärinen (2007) addressed this issue by instead minimizing the non-negative score matching loss

$$J_+(P)=\frac{1}{2}\int_{\mathbb{R}_+^m}p_0(\boldsymbol{x})\,\big\|\boldsymbol{x}\odot\nabla\log p(\boldsymbol{x})-\boldsymbol{x}\odot\nabla\log p_0(\boldsymbol{x})\big\|_2^2\,\mathrm{d}\boldsymbol{x}.\tag{5}$$

This loss can be motivated via gradients with respect to a hypothetical scale parameter (Hyvärinen, 2007). Under mild conditions, $J_+$ can again be rewritten in terms of an expectation of a function independent of $p_0$, thus allowing one to form a sample loss.

In this work, we consider generalizing the non-negative score matching loss as follows.

{definition}

Let $\mathcal{P}_+$ be the family of distributions of interest, and assume every $P\in\mathcal{P}_+$ has a twice continuously differentiable density supported on $\mathbb{R}_+^m$. Suppose the $m$-variate random vector $\boldsymbol{X}$ has true distribution $P_0\in\mathcal{P}_+$, and let $p_0$ be its twice continuously differentiable density. Let $h_1,\dots,h_m:[0,\infty)\to[0,\infty)$ be a.s. positive functions that are absolutely continuous in every bounded sub-interval of $(0,\infty)$, and set $\boldsymbol{h}(\boldsymbol{x})\equiv(h_1(x_1),\dots,h_m(x_m))^\top$. For $P\in\mathcal{P}_+$ with density $p$, the generalized $\boldsymbol{h}$-score matching loss is

$$J_{\boldsymbol{h}}(P)=\frac{1}{2}\int_{\mathbb{R}_+^m}p_0(\boldsymbol{x})\sum_{j=1}^m h_j(x_j)\,\big[\partial_j\log p(\boldsymbol{x})-\partial_j\log p_0(\boldsymbol{x})\big]^2\,\mathrm{d}\boldsymbol{x},\tag{6}$$

where $\partial_j$ denotes the partial derivative with respect to $x_j$.

{proposition}

The distribution $P_0$ is the unique minimizer of $J_{\boldsymbol{h}}(P)$ over $P\in\mathcal{P}_+$.

{proof}

First, observe that $J_{\boldsymbol{h}}(P)\geq0$ and $J_{\boldsymbol{h}}(P_0)=0$. For uniqueness, suppose $J_{\boldsymbol{h}}(P)=0$ for some $P\in\mathcal{P}_+$. Let $p$ and $p_0$ be the respective densities. By assumption, $p_0>0$ a.s. and $h_j>0$ a.s. for all $j$. Therefore, we must have $\nabla\log p=\nabla\log p_0$ a.s., or equivalently, $\nabla(\log p-\log p_0)=\boldsymbol{0}$ almost surely in $\mathbb{R}_+^m$. Since $p$ and $p_0$ are continuous densities supported on $\mathbb{R}_+^m$, it follows that $p(\boldsymbol{x})=p_0(\boldsymbol{x})$ for all $\boldsymbol{x}\in\mathbb{R}_+^m$.

Choosing all $h_j(x_j)=x_j^2$ recovers the loss from (5). In our generalization, we will focus on using functions $h_j$ that are increasing but are bounded or grow rather slowly. This will alleviate the need to estimate higher moments, leading to better practical performance and improved theoretical guarantees.
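By way of illustration (these specific choices are our own, not prescriptions), increasing functions of this kind include

$$h_j(x)=\min(x,c)\ \text{ for a constant }c>0,\qquad h_j(x)=\frac{x}{1+x},\qquad h_j(x)=1-e^{-x},$$

all bounded, a.s. positive, and absolutely continuous on bounded sub-intervals of $(0,\infty)$, in contrast to the unbounded choice $h_j(x)=x_j^2$ underlying (5).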

We will consider the following assumptions:

(A1) $p_0(\boldsymbol{x})\,h_j(x_j)\,\partial_j\log p(\boldsymbol{x})\to0$ as $x_j\searrow0+$ and as $x_j\nearrow+\infty$, $\forall j$, $\forall p$;

(A2) $\mathbb{E}_{P_0}\big[h_j(X_j)\big(\partial_j\log p(\boldsymbol{X})\big)^2\big]<+\infty$, $\mathbb{E}_{P_0}\big|h_j'(X_j)\,\partial_j\log p(\boldsymbol{X})\big|<+\infty$, and $\mathbb{E}_{P_0}\big|h_j(X_j)\,\partial_j^2\log p(\boldsymbol{X})\big|<+\infty$, $\forall j$, $\forall p$;

where the limits in (A1) are taken with the remaining coordinates held fixed, "$\forall p$" is a shorthand for "for all $p$ being the density of some $P\in\mathcal{P}_+$", and the prime symbol denotes component-wise differentiation. While the second half of (A2) was not made explicit in Hyvärinen (2005, 2007), (A1)-(A2) were both required for integration by parts and Fubini-Tonelli to apply.

Once the forms of $p$ and $\boldsymbol{h}$ are given, sufficient conditions for Assumptions (A1)-(A2) to hold are easy to find. In particular, (A1) and (A2) are easily satisfied and verified for exponential families.

Integration by parts yields the following theorem, which shows that $J_{\boldsymbol{h}}$ from (6) is an expectation (under $P_0$) of a function that does not depend on $p_0$, similar to (4). The proof is given in Appendix A.1.

{theorem}

Under (A1) and (A2), the loss $J_{\boldsymbol{h}}(P)$ from (6) equals

$$\int_{\mathbb{R}_+^m}p_0(\boldsymbol{x})\sum_{j=1}^m\left[h_j'(x_j)\,\partial_j\log p(\boldsymbol{x})+h_j(x_j)\,\partial_j^2\log p(\boldsymbol{x})+\frac{1}{2}\,h_j(x_j)\big(\partial_j\log p(\boldsymbol{x})\big)^2\right]\mathrm{d}\boldsymbol{x}\tag{7}$$

plus a constant independent of $p$.

Given a data matrix $\mathbf{x}\in\mathbb{R}_+^{n\times m}$ with rows $\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(n)}$, we define the sample version of (7) as

$$\hat{J}_{\boldsymbol{h}}(P)=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m\left[h_j'\big(x_j^{(i)}\big)\,\partial_j\log p\big(\boldsymbol{x}^{(i)}\big)+h_j\big(x_j^{(i)}\big)\,\partial_j^2\log p\big(\boldsymbol{x}^{(i)}\big)+\frac{1}{2}\,h_j\big(x_j^{(i)}\big)\big(\partial_j\log p\big(\boldsymbol{x}^{(i)}\big)\big)^2\right].\tag{8}$$

Subsequently, for a distribution $P$ with density $p$, we let $J_{\boldsymbol{h}}(p)\equiv J_{\boldsymbol{h}}(P)$. Similarly, when a distribution $P_{\boldsymbol{\theta}}$ with density $p_{\boldsymbol{\theta}}$ is associated to a parameter vector $\boldsymbol{\theta}$, we write $J_{\boldsymbol{h}}(\boldsymbol{\theta})$. We apply similar conventions to the sample version $\hat{J}_{\boldsymbol{h}}$. We note that this type of loss is also treated in slightly different settings in Parry (2016) and Almeida and Gidas (1993).
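For illustration, the sample loss (8) is straightforward to evaluate once the derivatives of the model log-density are available. The following sketch is our own (all names are ours; NumPy assumed) and takes those derivatives as user-supplied callables:

```python
import numpy as np

def empirical_loss(x, h, h_prime, grad_logp, hess_diag_logp):
    """Sample version (8) of the generalized h-score matching loss.

    x              : (n, m) array of non-negative observations.
    h, h_prime     : callables applying h_j and h_j' column-wise to x.
    grad_logp      : callable returning the (n, m) array of
                     d log p / d x_j evaluated at each row of x.
    hess_diag_logp : callable returning the (n, m) array of
                     d^2 log p / d x_j^2 evaluated at each row of x.
    """
    grad = grad_logp(x)
    hess = hess_diag_logp(x)
    terms = h_prime(x) * grad + h(x) * hess + 0.5 * h(x) * grad**2
    return terms.sum(axis=1).mean()  # average over i of the sum over j
```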

{remark}

In the one-dimensional case, our loss can be matched to the local scoring rules of Parry et al. (2012); in their notation, (6) corresponds to a suitable choice of the functions generating the rule (c.f. Equations (39), (51), (53) and Section 10.1 therein), and Theorem 2.2 follows from this correspondence. While (A1) is equivalent to the condition implied by the boundary divergence in that paper, (A2), which we assume for invoking Fubini-Tonelli due to multi-dimensionality, is not present. On the other hand, while Parry (2016) treats the multivariate case, it does not cover the connection between our (6) and (7). Since the loss is concave but not strictly concave in the sense considered there, the results in Parry (2016) only imply that $P_0$ is a minimizer, a weaker conclusion than Proposition 2.2.

3 Exponential Families

In this section, we study the case where $\mathcal{P}_+$ is an exponential family comprising continuous distributions with support $\mathbb{R}_+^m$. More specifically, we consider densities that are indexed by the canonical parameter $\boldsymbol{\theta}\in\mathbb{R}^r$ and have the form

$$p_{\boldsymbol{\theta}}(\boldsymbol{x})=\exp\big(\boldsymbol{\theta}^\top\boldsymbol{t}(\boldsymbol{x})-\psi(\boldsymbol{\theta})\big)\,\mu(\boldsymbol{x}),\qquad\boldsymbol{x}\in\mathbb{R}_+^m,\tag{9}$$

where $\boldsymbol{t}(\boldsymbol{x})=(t_1(\boldsymbol{x}),\dots,t_r(\boldsymbol{x}))^\top$ comprises the sufficient statistics, $\psi(\boldsymbol{\theta})$ is a normalizing constant depending on $\boldsymbol{\theta}$ only, and $\mu$ is the base measure, with $\boldsymbol{t}$ and $\log\mu$ a.s. differentiable with respect to each component of $\boldsymbol{x}$. Define $\partial_j\boldsymbol{t}(\boldsymbol{x})\equiv\big(\partial_jt_1(\boldsymbol{x}),\dots,\partial_jt_r(\boldsymbol{x})\big)^\top$ and $\partial_j^2\boldsymbol{t}(\boldsymbol{x})$ analogously.

{theorem}

Under Assumptions (A1)-(A2) from Section 2.2, the empirical generalized $\boldsymbol{h}$-score matching loss (8) can be rewritten as a quadratic function in $\boldsymbol{\theta}$:

$$\hat{J}_{\boldsymbol{h}}(\boldsymbol{\theta})=\frac{1}{2}\,\boldsymbol{\theta}^\top\boldsymbol{\Gamma}(\mathbf{x})\,\boldsymbol{\theta}-\boldsymbol{g}(\mathbf{x})^\top\boldsymbol{\theta}+\mathrm{const},\tag{10}$$

where

$$\boldsymbol{\Gamma}(\mathbf{x})=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m h_j\big(x_j^{(i)}\big)\,\partial_j\boldsymbol{t}\big(\boldsymbol{x}^{(i)}\big)\,\partial_j\boldsymbol{t}\big(\boldsymbol{x}^{(i)}\big)^\top,\tag{11}$$

$$\boldsymbol{g}(\mathbf{x})=-\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m\Big[h_j'\big(x_j^{(i)}\big)\,\partial_j\boldsymbol{t}\big(\boldsymbol{x}^{(i)}\big)+h_j\big(x_j^{(i)}\big)\,\partial_j^2\boldsymbol{t}\big(\boldsymbol{x}^{(i)}\big)+h_j\big(x_j^{(i)}\big)\,\partial_j\log\mu\big(\boldsymbol{x}^{(i)}\big)\,\partial_j\boldsymbol{t}\big(\boldsymbol{x}^{(i)}\big)\Big]\tag{12}$$

are sample averages of functions of the data matrix only.

Define $\boldsymbol{\Gamma}_0\equiv\mathbb{E}_{P_0}\boldsymbol{\Gamma}(\boldsymbol{X})$, $\boldsymbol{g}_0\equiv\mathbb{E}_{P_0}\boldsymbol{g}(\boldsymbol{X})$, and $\boldsymbol{\Sigma}_0\equiv\operatorname{Var}_{P_0}\big[\boldsymbol{g}(\boldsymbol{X})-\boldsymbol{\Gamma}(\boldsymbol{X})\boldsymbol{\theta}_0\big]$, where $\boldsymbol{\Gamma}(\boldsymbol{X})$ and $\boldsymbol{g}(\boldsymbol{X})$ denote the single-observation ($n=1$) versions of (11) and (12) and $\boldsymbol{\theta}_0$ is the true parameter. {theorem} Suppose that

  • (C1)  $\boldsymbol{\Gamma}(\mathbf{x})$ is a.s. invertible, and

  • (C2)  $\boldsymbol{\Gamma}_0$, $\boldsymbol{g}_0$, and $\boldsymbol{\Sigma}_0$ exist and are entry-wise finite.

Then the minimizer of (10) is a.s. unique with closed-form solution $\hat{\boldsymbol{\theta}}\equiv\boldsymbol{\Gamma}(\mathbf{x})^{-1}\boldsymbol{g}(\mathbf{x})$. Moreover, $\sqrt{n}\,\big(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}_0\big)\rightsquigarrow\mathcal{N}_r\big(\boldsymbol{0},\,\boldsymbol{\Gamma}_0^{-1}\boldsymbol{\Sigma}_0\boldsymbol{\Gamma}_0^{-1}\big)$.

Both theorems above are proved in Appendix A.2. The first clarifies the quadratic nature of the loss, and the second provides a basis for asymptotically valid tests and confidence intervals for the parameter $\boldsymbol{\theta}_0$. Note that Condition (C1) holds if and only if, almost surely, the $nm$ vectors $h_j\big(x_j^{(i)}\big)^{1/2}\,\partial_j\boldsymbol{t}\big(\boldsymbol{x}^{(i)}\big)$, $i=1,\dots,n$, $j=1,\dots,m$, span $\mathbb{R}^r$.

The conclusion in Theorem 3 indicates that, similar to the estimator in Hyvärinen (2007) with $h_j(x_j)=x_j^2$, the closed-form solution for our generalized $\boldsymbol{h}$ allows one to consistently estimate the canonical parameter in an exponential family distribution without needing to calculate the often complicated normalizing constant or resort to numerical methods. Computational details are explicated in Section 5.3.
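A minimal sketch of the closed-form estimator follows, assuming the sufficient statistics and base measure derivatives are supplied by the user (all function names here are ours):

```python
import numpy as np

def gsm_exp_family(x, h, h_prime, grad_t, hess_diag_t, grad_log_mu):
    """Generalized h-score matching for the exponential family (9):
    builds Gamma(x) and g(x) from (11)-(12) and returns Gamma^{-1} g.

    grad_t(xi)      : (m, r) array whose j-th row is d t / d x_j at xi.
    hess_diag_t(xi) : (m, r) array whose j-th row is d^2 t / d x_j^2 at xi.
    grad_log_mu(xi) : (m,) array of d log mu / d x_j at xi.
    """
    n, m = x.shape
    r = grad_t(x[0]).shape[1]
    Gamma, g = np.zeros((r, r)), np.zeros(r)
    for xi in x:
        hj, hpj = h(xi), h_prime(xi)                  # shape (m,) each
        T1, T2, lmu = grad_t(xi), hess_diag_t(xi), grad_log_mu(xi)
        Gamma += (hj[:, None] * T1).T @ T1            # sum_j h_j dt_j dt_j^T
        g -= (hpj[:, None] * T1).sum(axis=0)          # sum_j h_j' dt_j
        g -= (hj[:, None] * T2).sum(axis=0)           # sum_j h_j d2t_j
        g -= ((hj * lmu)[:, None] * T1).sum(axis=0)   # sum_j h_j dlogmu_j dt_j
    return np.linalg.solve(Gamma / n, g / n)
```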

Below we illustrate the estimator in the case of univariate truncated normal distributions. We assume (A1)-(A2) and (C1)-(C2) throughout.

{example}

Univariate ($m=1$) truncated normal distributions for mean parameter $\mu$ and variance parameter $\sigma^2$ have density

$$p_{\mu,\sigma^2}(x)\ \propto\ \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad x\geq0.\tag{13}$$

If $\sigma$ is known but $\mu$ unknown, then writing the density in canonical form as in (9) yields the canonical parameter $\theta=\mu/\sigma^2$, with sufficient statistic $t(x)=x$ and base measure proportional to $e^{-x^2/(2\sigma^2)}$.

Given an i.i.d. sample $X_1,\dots,X_n$, the generalized $h$-score matching estimator of $\mu$ is

$$\hat{\mu}_h=\frac{\frac{1}{n}\sum_{i=1}^n\big[h(X_i)\,X_i-\sigma^2\,h'(X_i)\big]}{\frac{1}{n}\sum_{i=1}^n h(X_i)}.$$

If the expectations below are finite (for example, when $h$ and $h'$ are bounded), then

$$\sqrt{n}\,(\hat{\mu}_h-\mu)\ \rightsquigarrow\ \mathcal{N}\left(0,\ \frac{\mathbb{E}\big[\big(\sigma^2h'(X)+h(X)(\mu-X)\big)^2\big]}{\big(\mathbb{E}\,h(X)\big)^2}\right).$$

We recall that the Cramér-Rao lower bound (i.e. the lower bound on the asymptotic variance of any unbiased estimator) for estimating $\mu$ is $\sigma^4/\operatorname{Var}(X)$.

{example}

Consider the univariate truncated normal distributions from (13) in the setting where the mean parameter $\mu$ is known but the variance parameter $\sigma^2$ is unknown. In canonical form as in (9), we write the canonical parameter as $\theta=1/\sigma^2$, with sufficient statistic $t(x)=-(x-\mu)^2/2$.

Given an i.i.d. sample $X_1,\dots,X_n$, the generalized $h$-score matching estimator of $\sigma^2$ is

$$\hat{\sigma}^2_h=\frac{\frac{1}{n}\sum_{i=1}^n h(X_i)\,(X_i-\mu)^2}{\frac{1}{n}\sum_{i=1}^n\big[h(X_i)+h'(X_i)(X_i-\mu)\big]}.$$

If, in addition to the assumptions in the previous example, the relevant moments are finite, then

$$\sqrt{n}\,(\hat{\sigma}_h^2-\sigma^2)\ \rightsquigarrow\ \mathcal{N}\left(0,\ \frac{\sigma^4\,\mathbb{E}\Big[\big(h(X)\big\{(X-\mu)^2/\sigma^2-1\big\}-h'(X)(X-\mu)\big)^2\Big]}{\Big(\mathbb{E}\big[h(X)(X-\mu)^2/\sigma^2\big]\Big)^2}\right).$$

Moreover, the Cramér-Rao lower bound for estimating $\sigma^2$ is $4\sigma^8/\operatorname{Var}\big((X-\mu)^2\big)$.

{remark}

In the previous example, if $\mu=0$, then the constant function $h\equiv1$ also satisfies (A1)-(A2) and (C1)-(C2), and one recovers the sample variance about the known mean, $\frac{1}{n}\sum_{i=1}^nX_i^2$, which attains the Cramér-Rao lower bound.

In these examples, there is a benefit in using a bounded function $h$, which can be explained as follows. When $\mu$ is large relative to $\sigma$, there is effectively no truncation to the Gaussian distribution, and our method adapts to using low moments in (6), since a bounded and increasing $h$ becomes almost constant as it reaches its asymptote for $x$ large. Hence, we effectively revert to the original score matching (recall Section 2.1). In the other cases, the truncation effect is significant and our estimator uses higher moments accordingly.
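As a quick numerical illustration of the first example (our own sketch, not code from the paper; NumPy assumed), one can compare the bounded choice $h(x)=\min(x,1)$ with the unbounded choice $h(x)=x^2$ implicit in Hyvärinen (2007):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 1.0, 5000

# Sample N(mu, sigma^2) truncated to [0, inf) by rejection.
z = rng.normal(mu, sigma, size=10 * n)
x = z[z >= 0][:n]

def mu_hat(x, h, h_prime, sigma):
    # Closed form from the first example: solves
    # (1/n) sum_i [sigma^2 h'(X_i) + h(X_i)(mu - X_i)] = 0 for mu.
    return (np.sum(h(x) * x) - sigma**2 * np.sum(h_prime(x))) / np.sum(h(x))

estimates = {
    "h(x) = x^2":      mu_hat(x, lambda t: t**2, lambda t: 2 * t, sigma),
    "h(x) = min(x,1)": mu_hat(x, lambda t: np.minimum(t, 1.0),
                              lambda t: (t < 1.0).astype(float), sigma),
}
print(estimates)  # both should be close to mu = 0.5
```

Over repeated draws, the bounded choice tends to show smaller sampling variability, consistent with the asymptotic comparison in Figure 1.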

Figure 1 plots the asymptotic variance of $\hat{\mu}_h$ from the first example, with $\sigma^2$ known, for various choices of $h$. Efficiency as measured by the Cramér-Rao lower bound divided by the asymptotic variance is also shown. We see that two truncated (bounded) choices of $h$ have asymptotic variance close to the Cramér-Rao bound. This asymptotic variance is also reflective of the variance for smaller finite samples.

Figure 1: Log of asymptotic variance and efficiency with respect to the Cramér-Rao bound for $\hat{\mu}_h$ ($\sigma^2$ known).
Figure 2: Log of asymptotic variance and efficiency with respect to the Cramér-Rao bound for $\hat{\sigma}_h^2$ ($\mu$ known).

Figure 2 is the analog of Figure 1 for $\hat{\sigma}_h^2$ from the second example, with $\mu$ known. While the specifics are a bit different, the benefits of using a bounded or slowly growing $h$ are again clear. We note that when $\sigma^2$ is small, the effect of truncation to the positive part of the real line is small.

In both plots we order/color the curves based on their overall efficiency, so the same function may receive different colors in the two plots, although the same functions are presented in both. For all functions presented here, (A1)-(A2) and (C1)-(C2) are satisfied.

4 Regularized Generalized Score Matching

In high-dimensional settings, when the number of parameters to estimate may be larger than the sample size $n$, it is hard, if not impossible, to estimate the parameters consistently without turning to some form of regularization. More specifically, for exponential families, condition (C1) in Section 3 fails when $r>nm$. A popular approach is then the use of $\ell_1$ regularization to exploit possible sparsity.

Let the data matrix $\mathbf{x}\in\mathbb{R}_+^{n\times m}$ comprise $n$ i.i.d. samples from distribution $P_0$. Assume $P_0$ has density $p_0=p_{\boldsymbol{\theta}_0}$ belonging to an exponential family $\mathcal{P}_+$ as in (9), where $\boldsymbol{\theta}_0\in\mathbb{R}^r$. Adding an $\ell_1$ penalty to (10), we obtain the regularized generalized score matching loss

$$\hat{J}_{\boldsymbol{h}}(\boldsymbol{\theta})+\lambda\|\boldsymbol{\theta}\|_1=\frac{1}{2}\,\boldsymbol{\theta}^\top\boldsymbol{\Gamma}(\mathbf{x})\,\boldsymbol{\theta}-\boldsymbol{g}(\mathbf{x})^\top\boldsymbol{\theta}+\lambda\|\boldsymbol{\theta}\|_1,\tag{14}$$

as in Lin et al. (2016), with tuning parameter $\lambda>0$. The loss in (14) involves a quadratic smooth part as in the familiar lasso loss for linear regression. However, although the matrix $\boldsymbol{\Gamma}\equiv\boldsymbol{\Gamma}(\mathbf{x})$ is positive semidefinite, the regularized loss in (14) is not guaranteed to be bounded unless the tuning parameter $\lambda$ is sufficiently large, a problem that does not occur for the lasso. We note that here, and throughout, we suppress the dependence on the data for $\boldsymbol{\Gamma}$, $\boldsymbol{g}$ and derived quantities.

For a more detailed explanation, note that by (11), $\boldsymbol{\Gamma}=\frac{1}{n}\mathbf{A}^\top\mathbf{A}$ for a matrix $\mathbf{A}\in\mathbb{R}^{nm\times r}$ whose rows are the vectors $h_j\big(x_j^{(i)}\big)^{1/2}\,\partial_j\boldsymbol{t}\big(\boldsymbol{x}^{(i)}\big)^\top$. In the high-dimensional case, the rank of $\boldsymbol{\Gamma}$, or equivalently of $\mathbf{A}$, is at most $nm<r$. Hence, $\boldsymbol{\Gamma}$ is not invertible, and $\boldsymbol{g}$ does not necessarily lie in the column span of $\boldsymbol{\Gamma}$. Let $\mathcal{K}$ be the kernel of $\boldsymbol{\Gamma}$. Then there may exist $\boldsymbol{\theta}^*\in\mathcal{K}$ with $\boldsymbol{g}^\top\boldsymbol{\theta}^*\neq0$. In this case, if

$$\lambda\ <\ \max_{\boldsymbol{\theta}\in\mathcal{K},\ \|\boldsymbol{\theta}\|_1=1}\ \boldsymbol{g}^\top\boldsymbol{\theta},$$

there exists $\boldsymbol{\theta}_*\in\mathcal{K}$ with $\|\boldsymbol{\theta}_*\|_1=1$ and $\boldsymbol{g}^\top\boldsymbol{\theta}_*>\lambda$. Evaluating at $\tau\boldsymbol{\theta}_*$ for scalar $\tau>0$, the loss becomes $\tau\big(\lambda-\boldsymbol{g}^\top\boldsymbol{\theta}_*\big)$, which is negative and linear in $\tau$, and thus unbounded below. In this case no minimizer of (14) exists for small values of $\lambda$. This issue also exists for the estimators from Zhang and Zou (2014) and Liu and Luo (2015), which correspond to score matching for GGMs. We note that in the context of estimating the interaction matrix in pairwise models, $\boldsymbol{\Gamma}$ is block-diagonal with blocks of rank at most $n$; thus, the condition reduces to $n<m$, or $n<m+1$ when both $\mathbf{K}$ and $\boldsymbol{\eta}$ are estimated.

To circumvent the unboundedness problem, we add small values $\epsilon_j>0$ to the diagonal entries of $\boldsymbol{\Gamma}$, which thus become $\gamma_{jj}+\epsilon_j$, $j=1,\dots,r$. This is in the spirit of work such as Ledoit and Wolf (2004) and corresponds to an elastic net-type penalty (Zou and Hastie, 2005) with weighted $\ell_2$ penalty $\frac{1}{2}\sum_{j=1}^r\epsilon_j\theta_j^2$. After this modification, the quadratic part is positive definite, our regularized loss is strongly convex in $\boldsymbol{\theta}$, and a unique minimizer exists for all $\lambda\geq0$. For the special case of truncated GGMs, we will show that a result on consistent estimation holds if the added values are smaller than a threshold that is readily computable, for which we propose a particular choice to avoid tuning. This choice depends on the data through $\boldsymbol{\Gamma}$.

{definition}

For $\boldsymbol{\epsilon}=(\epsilon_1,\dots,\epsilon_r)^\top\geq\boldsymbol{0}$, let $\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}\equiv\boldsymbol{\Gamma}+\operatorname{diag}(\boldsymbol{\epsilon})$. The regularized generalized $\boldsymbol{h}$-score matching estimator with tuning parameter $\lambda>0$ and amplifier $\boldsymbol{\epsilon}$ is the estimator

$$\hat{\boldsymbol{\theta}}_\lambda\ \equiv\ \underset{\boldsymbol{\theta}\in\mathbb{R}^r}{\arg\min}\ \ \frac{1}{2}\,\boldsymbol{\theta}^\top\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}\,\boldsymbol{\theta}-\boldsymbol{g}^\top\boldsymbol{\theta}+\lambda\|\boldsymbol{\theta}\|_1.\tag{15}$$

In the case where $\boldsymbol{\epsilon}=\epsilon\operatorname{diag}(\boldsymbol{\Gamma})$ for some scalar $\epsilon\geq0$, we also call $\epsilon$ the multiplier. We note that $\hat{\boldsymbol{\theta}}_\lambda$ from (15) is a piecewise linear function of $\lambda$ (Lin et al., 2016).
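The minimizer in (15) can be computed by cyclic coordinate descent, as for the lasso. Below is a minimal sketch (our own, not the paper's implementation), exploiting that each coordinate update has a closed soft-thresholding form because the smooth part is $\frac{1}{2}\boldsymbol{\theta}^\top\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}\boldsymbol{\theta}-\boldsymbol{g}^\top\boldsymbol{\theta}$:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def regularized_gsm(Gamma, g, lam, eps, n_iter=500, tol=1e-8):
    """Coordinate descent for the regularized loss (15):
        0.5 * theta' (Gamma + diag(eps)) theta - g' theta + lam * ||theta||_1.
    """
    G = Gamma + np.diag(eps)
    r = len(g)
    theta = np.zeros(r)
    for _ in range(n_iter):
        max_change = 0.0
        for k in range(r):
            # Partial residual excluding coordinate k.
            rk = g[k] - G[k] @ theta + G[k, k] * theta[k]
            new = soft_threshold(rk, lam) / G[k, k]
            max_change = max(max_change, abs(new - theta[k]))
            theta[k] = new
        if max_change < tol:
            break
    return theta
```

With $\boldsymbol{\epsilon}>\boldsymbol{0}$, the diagonal entries $\gamma_{kk}+\epsilon_k$ are strictly positive and the objective is strongly convex, so the iterates converge to the unique minimizer for every $\lambda\geq0$.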

5 Score Matching for Graphical Models for Non-negative Data

In this section we apply our generalized score matching estimator to a general class of graphical models for non-negative data.

5.1 A General Framework of Pairwise Interaction Models

We consider the class of pairwise interaction power models with density introduced in (1). We recall the form of the density:

$$p(\boldsymbol{x})\ \propto\ \exp\left(-\frac{1}{2a}\,\boldsymbol{x}^{a\top}\mathbf{K}\,\boldsymbol{x}^{a}+\boldsymbol{\eta}^{\top}\,\frac{\boldsymbol{x}^{b}-\mathbf{1}_m}{b}\right),\qquad\boldsymbol{x}\in\mathbb{R}_+^m,\tag{16}$$

where $a\geq0$ and $b\geq0$ are known constants, and the interaction matrix $\mathbf{K}=(\kappa_{jk})\in\mathbb{R}^{m\times m}$ and the vector $\boldsymbol{\eta}\in\mathbb{R}^m$ are parameters. When $a=0$ or $b=0$, we use the logarithmic convention introduced for (1) and apply the logarithm element-wise. Our focus will be on the interaction matrix $\mathbf{K}$ that determines the conditional independence graph through its support, $\{(j,k):\kappa_{jk}\neq0\}$. However, unless $\boldsymbol{\eta}$ is known or assumed to be zero, we also need to estimate $\boldsymbol{\eta}$ as a nuisance parameter. In the case where we assume $\boldsymbol{\eta}=\boldsymbol{0}$ is known (i.e. the linear part is not present), we call the distribution (and the corresponding estimator) a centered distribution (estimator), in contrast to the general case, termed non-centered, where we assume $\boldsymbol{\eta}$ is nonzero or unknown.

We first give a set of sufficient conditions for the density to be valid, i.e., the right-hand side of (16) to be integrable. The proof is given in Appendix A.3.

{theorem}

Define conditions

  1. (CC1) $\mathbf{K}$ is strictly co-positive, i.e., $\boldsymbol{v}^\top\mathbf{K}\boldsymbol{v}>0$ for all $\boldsymbol{v}\in\mathbb{R}_+^m\setminus\{\boldsymbol{0}\}$;

  2. (CC2) $a\geq b>0$;

  3. (CC3) $a>0$, $b=0$, and $\eta_j>-1$ for all $j$ ($j=1,\dots,m$).

In the non-centered case, if (CC1) and one of (CC2) and (CC3) hold, then the function on the right-hand side of (16) is integrable over $\mathbb{R}_+^m$. In the centered case, (CC1) and $a>0$ are sufficient.

We emphasize that (CC1) is a weaker condition than positive definiteness. Criteria for strict co-positivity are discussed in Väliaho (1986).

5.2 Implementation for Different Models

In this section we give some implementation details for the regularized generalized $\boldsymbol{h}$-score matching estimator defined in (15), applied to the pairwise interaction models from (16). We again let $\boldsymbol{\theta}=\big(\operatorname{vec}(\mathbf{K})^\top,\boldsymbol{\eta}^\top\big)^\top$. The unregularized loss is then $\hat{J}_{\boldsymbol{h}}(\boldsymbol{\theta})=\frac{1}{2}\boldsymbol{\theta}^\top\boldsymbol{\Gamma}\boldsymbol{\theta}-\boldsymbol{g}^\top\boldsymbol{\theta}+\mathrm{const}$.

The general forms of the matrix $\boldsymbol{\Gamma}$ and the vector $\boldsymbol{g}$ in the loss were given in equations (10)–(12). Here, after grouping the $j$-th row of $\mathbf{K}$ with $\eta_j$, $\boldsymbol{\Gamma}$ is block-diagonal, with the $j$-th block given by the $\mathbf{K}$-part submatrix

$$\boldsymbol{\Gamma}_{\mathbf{K}}^{(j)}=\frac{1}{n}\sum_{i=1}^n h_j\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{2a-2}\,\boldsymbol{x}^{(i)a}\,\boldsymbol{x}^{(i)a\top},\tag{17}$$

bordered by the terms involving $\eta_j$:

$$\boldsymbol{\Gamma}^{(j)}=\begin{pmatrix}\boldsymbol{\Gamma}_{\mathbf{K}}^{(j)}&-\frac{1}{n}\sum_{i=1}^n h_j\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{a+b-2}\,\boldsymbol{x}^{(i)a}\\-\frac{1}{n}\sum_{i=1}^n h_j\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{a+b-2}\,\boldsymbol{x}^{(i)a\top}&\frac{1}{n}\sum_{i=1}^n h_j\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{2b-2}\end{pmatrix},\tag{18}$$

where $\boldsymbol{x}^{(i)a}\equiv\big(\big(x_1^{(i)}\big)^a,\dots,\big(x_m^{(i)}\big)^a\big)^\top$ denotes the elementwise power. Furthermore, $\boldsymbol{g}=\big(\operatorname{vec}(\boldsymbol{g}_{\mathbf{K}})^\top,\boldsymbol{g}_{\boldsymbol{\eta}}^\top\big)^\top$, where $\boldsymbol{g}_{\mathbf{K}}$ and $\boldsymbol{g}_{\boldsymbol{\eta}}$ correspond to the entries of $\mathbf{K}$ and $\boldsymbol{\eta}$, respectively. The $j$-th column of $\boldsymbol{g}_{\mathbf{K}}$, written as $\boldsymbol{g}_{\mathbf{K},j}$, is

$$\boldsymbol{g}_{\mathbf{K},j}=\frac{1}{n}\sum_{i=1}^n\Big\{\Big[h_j'\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{a-1}+(a-1)\,h_j\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{a-2}\Big]\,\boldsymbol{x}^{(i)a}+a\,h_j\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{2a-2}\,\boldsymbol{e}_j\Big\},$$

where $\boldsymbol{e}_j$ is the $m$-vector with 1 at the $j$-th position and 0 elsewhere, and the $j$-th entry of $\boldsymbol{g}_{\boldsymbol{\eta}}$ is

$$g_{\boldsymbol{\eta},j}=-\frac{1}{n}\sum_{i=1}^n\Big[h_j'\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{b-1}+(b-1)\,h_j\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{b-2}\Big].$$

These formulae also hold for $a=0$ and/or $b=0$, since $\boldsymbol{\Gamma}$ and $\boldsymbol{g}$ only depend on the gradient of the log-density, for which the logarithmic conventions yield the same expressions. In the centered case where we know $\boldsymbol{\eta}=\boldsymbol{0}$, we only estimate $\mathbf{K}$, and $\boldsymbol{\Gamma}$ is still block-diagonal, with the $j$-th block being the submatrix in (17), while $\boldsymbol{g}$ is just $\operatorname{vec}(\boldsymbol{g}_{\mathbf{K}})$. Since $b$ only appears in the $\boldsymbol{\eta}^\top(\boldsymbol{x}^b-\mathbf{1}_m)/b$ part of the density, the formulae only depend on $a$ in the centered case.
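As an illustration, the centered-case quantities can be assembled directly from these formulas. The sketch below is our own (all names are ours; NumPy assumed) and returns the blocks (17) and the columns $\boldsymbol{g}_{\mathbf{K},j}$ for $a>0$:

```python
import numpy as np

def gamma_g_centered(x, a, h, h_prime):
    """Blocks (17) and columns g_{K,j} for the centered model (16).

    x : (n, m) data matrix with positive entries; a > 0 assumed.
    h, h_prime : callables applied entry-wise to a 1-d array.
    Returns (blocks, g): m arrays of shape (m, m), and an (m, m)
    array whose j-th column multiplies the j-th row of K.
    """
    n, m = x.shape
    xa = x**a
    blocks, g = [], np.zeros((m, m))
    for j in range(m):
        xj = x[:, j]
        hj, hpj = h(xj), h_prime(xj)                    # (n,) each
        w = hj * xj**(2 * a - 2)                        # (n,)
        blocks.append((w[:, None] * xa).T @ xa / n)     # block (17)
        coef = hpj * xj**(a - 1) + (a - 1) * hj * xj**(a - 2)
        gj = (coef[:, None] * xa).sum(axis=0)
        gj[j] += a * w.sum()                            # the a h x^{2a-2} e_j term
        g[:, j] = gj / n
    return blocks, g
```

Adding an amplifier then amounts to `blocks[j] + np.diag(eps_j)`, which makes each block strictly positive definite.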

We emphasize that it is indeed necessary to introduce amplifiers or a multiplier in addition to the $\ell_1$ penalty. It is clear from (18) that $\operatorname{rank}\big(\boldsymbol{\Gamma}^{(j)}\big)\leq n$ (or $\operatorname{rank}\big(\boldsymbol{\Gamma}_{\mathbf{K}}^{(j)}\big)\leq n$ if centered). Thus, $\boldsymbol{\Gamma}$ is non-invertible when $n<m+1$ (or $n<m$ if centered), and $\boldsymbol{g}$ need not lie in its column span.

We claim that including amplifiers/multipliers for the $\boldsymbol{\Gamma}_{\mathbf{K}}^{(j)}$ submatrices only is sufficient for unique existence of a solution for all penalty parameters $\lambda\geq0$. To see this, consider any nonzero vector $\boldsymbol{v}\in\mathbb{R}^{m+1}$. Partition it as $\boldsymbol{v}=(\boldsymbol{v}_1^\top,v_2)^\top$ with $\boldsymbol{v}_1\in\mathbb{R}^m$. Let $\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}^{(j)}$ be our amplified version of the matrix from (18), with $\boldsymbol{\epsilon}_j>\boldsymbol{0}$ added to the diagonal of the $\boldsymbol{\Gamma}_{\mathbf{K}}^{(j)}$ part only, so

$$\boldsymbol{v}^\top\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}^{(j)}\boldsymbol{v}=\boldsymbol{v}^\top\boldsymbol{\Gamma}^{(j)}\boldsymbol{v}+\boldsymbol{v}_1^\top\operatorname{diag}(\boldsymbol{\epsilon}_j)\,\boldsymbol{v}_1.$$

As $\boldsymbol{\Gamma}^{(j)}$ itself is positive semidefinite, we find that if at least one of the first $m$ entries of $\boldsymbol{v}$ is nonzero then

$$\boldsymbol{v}^\top\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}^{(j)}\boldsymbol{v}\ \geq\ \boldsymbol{v}_1^\top\operatorname{diag}(\boldsymbol{\epsilon}_j)\,\boldsymbol{v}_1\ >\ 0.$$

If only the last entry of $\boldsymbol{v}$ is nonzero then

$$\boldsymbol{v}^\top\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}^{(j)}\boldsymbol{v}=v_2^2\cdot\frac{1}{n}\sum_{i=1}^n h_j\big(x_j^{(i)}\big)\big(x_j^{(i)}\big)^{2b-2}>0$$

almost surely; recall that the $h_j$ are a.s. positive. We conclude that $\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}^{(j)}$ (and thus the entire amplified $\boldsymbol{\Gamma}_{\boldsymbol{\epsilon}}$) is a.s. positive definite, which ensures unique existence of the loss minimizer.

Given the formulae for $\boldsymbol{\Gamma}$ and $\boldsymbol{g}$, one adds the $\ell_1$ penalty on $\boldsymbol{\theta}$ to get the regularized loss in (15). Our methodology readily accommodates two different choices of the penalty parameter for $\mathbf{K}$ and $\boldsymbol{\eta}$. This is also theoretically supported for truncated GGMs, since if the ratio of the respective values $\lambda_{\mathbf{K}}$ and $\lambda_{\boldsymbol{\eta}}$ is fixed, the proofs of the theorems in Section 6 can be easily modified by a corresponding rescaling of $\lambda$. To avoid picking two tuning parameters, one may also choose to remove the penalty on $\boldsymbol{\eta}$ altogether by profiling out $\boldsymbol{\eta}$ and solving for $\mathbf{K}$, with $\hat{\boldsymbol{\eta}}(\mathbf{K})$ the minimizer of the profiled loss.