Nonlinear Statistical Learning with Truncated Gaussian Graphical Models

# Nonlinear Statistical Learning with Truncated Gaussian Graphical Models

## Abstract

We introduce the truncated Gaussian graphical model (TGGM) as a novel framework for designing statistical models for nonlinear learning. A TGGM is a Gaussian graphical model (GGM) with a subset of variables truncated to be nonnegative. The truncated variables are assumed latent and integrated out to induce a marginal model. We show that the variables in the marginal model are non-Gaussian distributed and their expected relations are nonlinear. We use expectation-maximization to break the inference of the nonlinear model into a sequence of TGGM inference problems, each of which is efficiently solved by using the properties and numerical methods of multivariate Gaussian distributions. We use the TGGM to design models for nonlinear regression and classification, with the performances of these models demonstrated on extensive benchmark datasets and compared to state-of-the-art competing results.

## 1Introduction

Graphical models, which use graph-based visualization to represent statistical dependencies among random variables, have been widely used to construct multivariate statistical models [22]. A sophisticated model can generally represent richer statistical dependencies, but the inference can quickly become intractable as the model’s complexity increases. A simple model, on the contrary, is easy to infer, but its representational power is limited.

To balance representational versatility and inferential tractability, latent variables are often added into the graphical model to obtain a tractable joint probability distribution which, when the latent variables are integrated out, induces a complicated and expressive marginal distribution over the target variables, i.e., the variables of interest [8]. Since the complexity of these models is induced by integration, expectation-maximization (EM) [4] can be employed to facilitate inference. The approach of EM is to break the original problem of inferring the marginal distribution into a sequence of easier problems, each of which is to infer the expected logarithmic joint distribution, where the expectation is taken over the terms of latent variables in the logarithmic domain, using the information from the previous iteration of this sequential procedure. The restricted Boltzmann machine (RBM) [16] and sigmoid belief network (SBN) [28], as well as the deep networks built upon them [31], are good examples of using latent variables to enhance modeling versatility while at the same time admitting tractable statistical inference.

Gaussian graphical models (GGM) constitute a subset of graphical models that have found successful application in a diverse range of areas [18]. The popularity of GGM may partially be attributed to the abundant applications for which the data are Gaussian distributed or approximately so, and partially attributed to the attractive properties of the multivariate Gaussian distribution which facilitate inference. However, there are many applications where the data are distributed in a way that heavily deviates from Gaussianity, and the GGM may not reveal meaningful statistical dependencies underlying the data.

What is worse, adding latent variables into a GGM does not induce enhanced marginal versatility for the target variables, as it typically does in other graphical models; this is so because the marginals of a multivariate Gaussian distribution are still Gaussian. In addition, the conditional mean of given is always linear in whenever are jointly Gaussian. In this sense, a GGM is inherently a linear model no matter how many latent variables are added.

To overcome the linearity of GGMs, [6] ([6]), [15] ([15]), and [7] ([7]) proposed to apply nonlinear transformations to Gaussian hidden variables. More recently, a deep latent Gaussian model was proposed in [30], in which each hidden layer is connected to the output layer through a neural network with Gaussian noise. In these models, non-linear transforms are applied to Gaussian variables to obtain nonlinearity at the output layer. The non-linear transforms, however, destroy the nice structure of a GGM, such as the quadratic energy function and a simple conditional dependency structure, rendering it difficult to obtain analytic learning rules and, as a result, one has to resort to less efficient sampling-based inference.

In this paper, we introduce a novel approach to inducing nonlinearity in a GGM. The new approach is simple: it adds latent variables into a GGM and truncates them below zero so that they are nonnegative. We term the resulting framework as truncated Gaussian graphical model (TGGM). Although simple, the truncation leads to a remarkable result: after the truncated latent variables are integrated out, the target variables are no longer Gaussian distributed and their expected relations are no longer linear. Therefore the TGGM induces a nonlinear marginal model for the target variables, forming a striking contrast to the GGM. It should be emphasized that the approach proposed here is different from those in [32], which constrain the target (observed) variables to be nonnegative, without using latent variables; the nonnegative constraint in those approaches relaxes the convex function in the Gaussian distribution to a non-convex energy function that admits multimodal distributions.

The foremost advantage of the TGGM-induced nonlinear model over previous nonlinear models is the ease and efficiency with which inference can be performed. The advantage is attributed to the following two facts. First, as the nonlinear model is induced from a TGGM by integrating out the latent variables, EM can be used to break the inference into a sequence of TGGM inference problems. Second, as the truncation in a TGGM does not alter the quadratic energy function or the conditional dependency structure of the GGM, it is possible for a TGGM inference algorithm to utilize many well-studied properties of multivariate Gaussian distributions and the associated numerical methods [20]. A second important advantage is that the conditional dependency structure of a TGGM is uniquely encoded in the precision matrix (or inverse covariance matrix) of the corresponding GGM (before the truncation is performed). By working with the precision matrix, one can conveniently design diverse structures and construct abundant types of nonlinear statistical models to fit the application at hand.

We provide several examples of leveraging the TGGM to solve machine-learning problems. In the first, we use the TGGM to construct a nonlinear regression model that can be understood as a probabilistic version of the rectified linear unit (ReLU) neural network [12]. In the second, we solve multi-class classification by using the multinomial probit link function [1] to transform the continuous target variables of a TGGM into categorical variables. Our main focus in the first two examples is on shallow structures, with one latent (hidden) layer of nonlinear units used in a TGGM. In the third example, we consider extensions to deep structures, by modifying the blocks in the precision matrix that are related to latent truncated variables. We use EM as the primary inference method, with the variational Bayesian (VB) approximation used for multivariate truncated Gaussian distributions. The performances of the TGGM models are demonstrated on extensive benchmark datasets and compared to state-of-the-art competing results.

## 2Nonlinearity from Truncated Gaussian Graphical Models (TGGMs)

Let and respectively denote the target and latent variables of a TGGM, and we use to denote the input variable. The TGGM is defined by the following joint probability density function

where is a multivariate Gaussian density of with mean and precision matrix and is the associated truncated density defined as

where is an indicator function and is multiple integral. The marginal TGGM model is defined by

To see how the truncation affects the marginal TGGM , we rewrite equivalently as

where , , , and . From and follows

It is seen from that the target distribution induced by a TGGM consists of two multiplicative terms. The first term is a Gaussian distribution induced by the associated GGM (for which is not truncated). The second term modulates the Gaussian term into a more complicated and non-Gaussian distribution. As an example, one can verify that (Equation 4) is a skewed normal with , , , and [27].

The modeling versatility of a TGGM is primarily influenced by , the number of truncated latent variables, and , which encodes marginal dependencies of these variables. With a proper choice of and , one can construct TGGM models to solve diverse nonlinear learning tasks. The nonlinearity induced by a TGGM is seen from the expression of which, using , is found to be

where is the expectation with respect to . Due to the truncation , the expectation is a nonlinear function of . By contrast, if is not truncated, one has , which is a linear function of . Thus, a TGGM induces nonlinearity through the truncation of its latent variables.

The nonlinearity can be controlled by adjusting . For example, if we set , where is a identity matrix, we obtain

where is the -th element of and the -th row of using Matlab notations, and is the mean of the univariate truncated normal distribution . The formula of is given by [20]

where is the probability density function (PDF) of the standard normal distribution, and its cumulative distribution function (CDF). Figure ? shows as a function of , for various values of , alongside , which is the activation function used in ReLU neural networks [12]. It is seen that is a soft version of and .

## 3Nonlinear Regression with TGGMs

We begin with a nonlinear regression model constructed from a simple TGGM, in which we restrict and to diagonal matrices: and . By (Equation 6)-(Equation 7) and the arguments there, the in such a TGGM implements the output of a soft-version ReLU neural network, which has a single layer of hidden units with the activation function , and uses and as the input-to-hidden and hidden-to-output weights, respectively.

### 3.1Maximum-Likelihood (ML) Parameter Estimation

Given a training dataset consisting of the inputs (covariates) and the outputs (responses) , the log-likelihood function is

where , and

Let be an arbitrary PDF with parameters , defined on . It follows from (Equation 8)

where denotes the Kullback-Leibler (KL) distance, and, if such that , the limit values are used to lead to and . In general is parameterized differently from the TGGM; when , however, we let use the same parameterization as the TGGM so that . In this case, we drop the subscript to simply write , which, by (Equation 10), can be further written as

From (Equation 11) follows the EM algorithm. First, it is clear that . Thus, for a sequence satisfying

one deduces , where the last inequality follows from (Equation 11). By successively solving (Equation 12), starting from initial , the EM algorithm produces a sequence that monotonically increases . To ensure , one only requires . Therefore it is not necessary to solve (Equation 12) completely; rather it is sufficient to perform a single-step gradient ascent from ,

where . To find the gradient, it is helpful to write , where is the energy function and the normalization. The gradient can then be expressed as

where denotes the expectation with respect to (w.r.t.) , and the expectation w.r.t. . Specifically, the partial derivatives of w.r.t. and can respectively be derived as

where is a column vector of ones. The derivatives w.r.t. and can be derived similarly.

### 3.2ML Estimation versus Backpropagation

As mentioned earlier, for a TGGM with diagonal and , implements the output of a soft-version ReLU neural network that use (Equation 7) as the activation function at each hidden unit. This suggests one can use back-propagation (BP) to minimize the error between and the training samples of , as one does in training a standard ReLU network [12].

A popular choice of the error function used by BP is the squared error which, in the case here, is given by . Minimization of the squared error is equivalent to maximization of the likelihood under the assumption that . However, we have shown in (Equation 4) that is a non-Gaussian distribution. Therefore, BP does not maximize the likelihood of the TGGM in the rigorous sense.

To gain a deeper understanding of the relation between BP and ML estimation, we analyze the update equations of BP and compare them to those of the ML estimator. The BP performs gradient descent of the squared error, with the required partial derivatives given by

where is the Hadamard product and is a matrix of variances. Comparing to , we can see that the direction of is an approximation to that of by replacing the posterior expectations and with the corresponding prior expectations and . Hence, the ML estimator makes a more sufficient use of the available information, in the sense that it takes into account while BP does not.

To relate to , we require the lemma below.

Since the prior and the conditional are both Gaussian, the joint distribution is also Gaussian. As a result, the posterior is a Gaussian distribution with the mean given by . It then follows that , which is substituted into to yield .

As ( ?) holds for any , it is also true when , in which case the value of has little influence on the direction of ; Therefore, we can make sufficiently small such that , and consequently . Comparing the latter equation to , we see that the gradients and are different in three aspects: (i) the in is replaced by in ; (ii) a new factor arises in ; and (iii) the multiplicative constants are different. Since (iii) has no influence on the directions of the gradients, we focus on (i) and (ii). The new factor in does not depend on , so it plays no direct role in back-propagating the information from the output layer to the input layer. The only term of that contains is , which plays the primary and direct role in sending back the information from the output layer when updating the input-to-hidden weights . Since is obtained under the assumption that is generated from Gaussian latent variables , it is clear that the gradient used by BP does not fully reflect the underlying truncated characteristics of in the TGGM model. On the contrary, is the true posterior mean of under the truncation assumption.

In summary, BP uses update rules that are closely related to those of the ML estimator, but it does not fully exploit the available information in updating the TGGM parameters. In particular, BP ignores when it uses , instead of , to update ; it makes an incorrect assumption about the latent variables when it uses , instead of , to update . These somewhat defective update equations are attributed to the fact that BP makes a wrong assumption from the very beginning, i.e., BP assumes is a Gaussian distribution while the distribution is truly non-Gaussian as shown in (Equation 4). For these reasons, BP usually produces worse learning results for a TGGM than the ML estimator, and this will be discussed further in the experiments.

### 3.3Technical Details

A key step of the ML estimator is calculation of the prior and posterior expectations and in (Equation 21) and ( ?). Since is diagonal, the components in are independent; further, the training samples are assumed independent to each other. Therefore factorizes into a product of univariate truncated normal densities, , where each univariate density is associated with a single truncated variable and a particular training sample. Each of these densities has its mean and variance given by and , respectively, where is defined in , and

is the variance of the truncated normal [20]. Due to the independences, one can easily compute , with .

For the posterior expectation , it could be computed by means of numerical integration. Multivariate integrations in normal distributions have been well studied and many effective algorithms have been developed [10]. Another approach is to use the mean-field variational Bayesian (VB) method [21], which approximates the true posterior with a factorized distribution , parameterized by . The approximate posterior is found by minimizing , or maximizing the lower bound , as shown in (Equation 10).

As is independent of , , given and , the KL distance can be equivalently expressed as and each term in the sum can be minimized independently. Given , the th term of the KL distance is minimized by

where and denotes the expectation w.r.t. . From , one obtains

where , is a vector with its -th element defined as , , , is the -th row of with its -th element deleted, and is the subvector of missing the -th element.

The KL distance monotonically decreases as one cyclically computes (Equation 19) through . One shall perform enough cycles until the KL distance converges. Upon convergence, is used as the best approximation to , , and their means and variances, as given by the formulae in (Equation 7) and (Equation 17), are used to compute the posterior expectations in (Equation 21) and ( ?).

After (Equation 21)-( ?) are computed and the TGGM parameters in are improved based on the gradient ascent in (Equation 13) , one iteration of the ML estimator is completed. Given the updated , one then repeat the cycles with (Equation 19) to find the approximate posteriors and again make another update of , and so on. The complete ML estimation algorithm is summarized in Algorithm 1, where represents the number of cycles with (Equation 19) to find the best posterior distribution for each newly-updated . We can see that the complexity mainly comes from the estimation of expectation , which is .

Finally, it should be noted that the expectations require frequent calculation of the ratio of . In practice, if it is computed directly, we easily encounter two issues. First, repeated computation of the integration is a waste of time; second, when is small, e.g. , the CDF and the PDF are so tiny that a double-precision floating number can no longer represent them accurately. If we compute them with the double-precision numbers, we easily encounter the issue of . Fortunately, both these issues can be solved by using a lookup table. Specifically, we pre-compute at densely-sampled discrete values of using high-accuracy computation, such as the symbolic calculation in Matlab, and store the results in a table. When we need for any , we look for two values of that are closest to and use the interpolation of the two to estimate .

## 4Extension to Other Learning Tasks

### 4.1Nonlinear Classification

Let denote possible classes. Let be a class-dependent matrix obtained from by setting the -th column to one and deleting the -th row [24]. We define a nonlinear classifier as

The inner integral is due to the multinomial probit model [1] which transforms the TGGM’s output vector in (Equation 1) into a class label according to . Therefore, . A change of variables leads to .

The model described by can be trained by an ML estimator, similarly to the case of regression, with the main difference being the additional latent vector , which can be treated in a similar way as . The posterior is still a truncated Gaussian distribution, whose moments can be computed using the methods in Section 3. The model predicts the class label of using the rule , where and .

### 4.2Deep Learning

The TGGM defined in (Equation 1) can be viewed as a neural network, where the input, hidden, and output layers are respectively constituted by , , and , and the hidden layer has outgoing connections to the output layer and incoming connections from the input layer. The topology of the hidden layer is determined by . So far, we have focused on , which defines a single layer of hidden nodes that are not interconnected. By using a more sophisticated , we can construct a deep TGGM with two or more hidden layers and enhanced representational versatility.

As an example, we let and define . Taking into account the normalization, the distribution can be written as , where and depend on . This distribution, along with , yields a TGGM with two hidden layers. Extensions to three or more hidden layers and to classification can be constructed similarly. A deep TGGM can be learned by using EM to maximize the likelihood, wherein the derivatives of the lower bound can be represented as , as in Section 3.1.

Training a multi-hidden-layer TGGM is almost the same as training a single-hidden-layer TGGM, except for the difference in estimating the prior expectation . In the single-hidden-layer case, since is diagonal, factorizes into a set of univariate truncated normals, and therefore the expectation can be computed efficiently. In the multi-hidden-layer case, however, is a multivariate truncated normal, and thus the prior expectation is as difficult to compute as the posterior expectation. Following Section 3.3, we use mean-filed VB to approximate by factorized univariate truncated normals and estimate with the univariate distributions.

In practice, we find that starting from the VB approximation of can improve the VB approximation of , a feature similar to that observed in the contrastive divergence (CD) [16].

## 5Experiments

We report the performance of the proposed TGGM models on publicly available data sets, in comparison to competing models. In all experiments below, RMSProp [35] is applied to update the model parameters by using the current estimated gradients, with RMSprop delay set to be .

Regression

The root mean square error (RMSE), averaged over multiple trials of splitting each data set into training and testing subsets, is used as a performance measure to evaluate the TGGM against the ReLU neural network. The results reported in [14] are used as the reference to the performances of ReLU neural networks. The comparison is based on the same data and same training/testing protocols in [14], by using a consistent setting for the TGGM as follows: a single hidden layer is used in the TGGM for all data sets, with hidden nodes used for Protein Structure and Year Prediction MSD, the two largest data sets, and hidden nodes used for the other data sets.

Two methods, BP and ML estimation, are applied to train each TGGM, resulting in two versions of the TGGM for each data set, referred to TGGM-BP and TGGM-ML, respectively. For both training methods, is initialized as Gaussian random numbers, with each component a random draw from . To speed up, each gradient update uses a mini-batch of training samples, resulting in stochastic gradient search. The batch size is for the two largest data sets and for the others. For ML estimation, the number of cycles used by mean-field VB is set to , and .

The testing RMSE’s of the TGGM are summarized in Table ?, alongside the corresponding results [14] for the ReLU neural networks trained by BP and probabilistic backpropagation (PBP). BP provides a point estimate of the model parameters while PBP provides the posterior distribution. It is seen from Table ? that TGGM-BP performs slightly better than ReLU on most data sets. The gain may be attributed to the soft activation function which provides the freedom in choosing appropriate according to data’s characteristics, unlike ReLU which fixes to . The nonzero slope in for may be another contributing factor, as it has been shown in [13] that replacing the zero part of ReLU with a sloping line leads to better results. Furthermore, it can be observed that TGGM-ML outperforms TGGM-BP on most data sets. This is because ML-based training fully exploits the flexibilities provided by a probabilistic model and, as discussed in Section 3.2, is more accurate in reflecting the underlying model, in contrast to TGGM-BP which made a wrong assumption about the model at the very beginning.

We also observe that, if is set close to , the performance of TGGM-BP approaches that of the ReLU neural network. This is not surprising since approaches the ReLU activation function as . As we increase the value of , the TGGM’s performance improves gradually, until it reaches a saturating value. This is reasonable because becomes more linearly (w.r.t. ) as becomes larger, which weakens its nonlinear representational abilities. Empirically, we find that TGGM-BP performs similarly within an appropriate range of . The results in Table ? are based on , which is found to be a good setting for all data sets. The impact of on the RMSE results is illustrated in Figure 1. Note that can also be learned directly from the data, which is an interesting future work.

We found that the optimal for TGGM-ML is typically larger than that for TGGM-BP. This is perhaps because a larger provides increased flexibility in the TGGM, which can be exploited by a probabilistic inference method like expectation-maximization. As a result, we use for TGGM-ML in the experiments.

Classification

Three public benchmark data sets are considered for this task: MNIST, 20 NewsGroups, and Blog. The MNIST data set includes 60,000 training images and 10,000 testing images of handwritten digits of zero through ten. The 20 NewsGroups data sets is composed of 18,845 documents, written with a vocabulary of 2,000 words, from 20 different groups, with the data partitioned into a training set of 11,315 documents and a testing set of 7,531 documents [23]. The Blog data set contains 13,245 documents, written with 17,292 words, about the US presidential elections; the data are partitioned into 7,832 training documents and 5,413 testing documents [2]. One-hidden-layer and two-hidden-layer TGGM models are considered, with each hidden layer containing 100 or 200 nodes. Similar to the regression model, the TGGM classifier is trained by both BP and ML, with the resulting models termed as TGGM-BP and TGGM-ML, respectively.

The models are randomly initialized with Gaussian random numbers drawn from . The step-size for gradient ascent is chosen from by maximizing the accuracy on a cross-validation set. The TGGMs use a minibatch size of 500 for MNIST and 200 for the other two data sets, while the ReLU uses 100 for all data sets. Variance parameters are set to for TGGM-ML and for TGGM-BP, in both one and two-layer models. When ML estimation is applied, the number of VB cycles is initially set to 30 and then gradually decreases to 5. The data sets are also used to train and test a ReLU neural network implemented in Caffe [19], to produce the competing results for comparison. From Table ?, it can be seen that TGGM-BP generally outperforms ReLU on the two document data sets and maintains a comparable performance on the image data set, for both one- and two- hidden-layer models. It is further observed that TGGM-ML has the best performance on all three data sets, with the best performance achieved by the two-hidden-layer models on MNIST and Blog, and by the one-hidden-layer model on 20 NewsGroup.

[t!]

## 6Conclusions

We have proposed a nonlinear statistical learning framework termed TGGM. By introducing truncated latent variables into the traditional GGM, we obtain the TGGM as a non-Gaussian nonlinear model with significantly enhanced modeling ability compared to the GGM. We demonstrate that regression and classification can be realized through appropriately constructed TGGMs. With carefully designed graphical structures, deep versions of TGGMs have also been obtained. It is shown that, for regression and classification, TGGMs can be approximately viewed as a deterministic neural network with an activation function similar to ReLU. Because of this, TGGMs can be trained with BP. However, BP does not exactly maximize the likelihood of a TGGM, due to the inherent Gaussian assumption it makes. To overcome this limitation, we have developed an algorithm to correctly maximize the likelihood under the truncated Gaussian assumption. Experimental results show that the TGGM trained by BP generally performs better than the ReLU network, indicating the advantage of the new activation function. It is further shown that the TGGM trained by ML learning achieves the best performance on most data sets in consideration. It should be emphasized that the tasks considered in this paper are only specific applications of the TGGM framework under special forms of the precision matrices. In the future, we will consider TGGMs with lateral connections between hidden nodes. We may also generalize the supervised TGGM to the unsupervised case, using constructs similar to RBMs. Moreover, investigation of how the quality of uncertainty estimates affects the performance is also of interest.

## Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable and constructive comments. This research was supported in part by ARO, DARPA, DOE, NGA and ONR.

Supplementary of “Nonlinear Statistical Learning with Truncated Gaussian Graphical Models”

## ATraining TGGM for Classification

Similar to the regression model, the derivatives of can be derived as