Generalizing Energy-based Generative ConvNets from Particle Evolution Perspective



Compared with Generative Adversarial Networks (GANs), Energy-Based generative Models (EBMs) possess two appealing properties: i) they can be directly optimized without requiring an auxiliary network during learning and synthesis; ii) they can better approximate the underlying distribution of the observed data by explicitly learning potential functions. This paper studies a branch of EBMs, i.e., energy-based Generative ConvNets (GCNs), whose energy functions are defined by a bottom-up ConvNet. From the perspective of particle physics, we address the problem of unstable energy dissipation that might damage the quality of the synthesized samples during maximum likelihood learning. Specifically, we first establish a connection between the classical FRAME model [1] and a dynamic physics process, and generalize GCN as a discrete flow with a certain metric measure from the particle perspective. To address the KL-vanishing issue, we then reformulate GCN from a KL discrete flow with the KL divergence measure to a Jordan-Kinderlehrer-Otto (JKO) discrete flow with the Wasserstein distance metric and derive a Wasserstein GCN (wGCN). Based on these theoretical studies of GCN, we finally derive a Generalized GCN (GGCN) to further improve model generalization and learning capability. GGCN introduces a hidden-space mapping strategy and employs a normal distribution as the reference distribution to address the learning-bias issue. Because of MCMC sampling, GCNs still suffer from serious time consumption as the number of sampling steps increases; a trainable non-linear upsampling function and amortized learning are therefore proposed to improve learning efficiency. Our proposed GGCN is trained in a symmetrical learning manner. Besides, quantitative and qualitative experiments are conducted on several widely used face and natural image datasets. Our experimental results surpass those of existing models in both model stability and the quality of generated samples.
The source code of our work is publicly available at

FRAME models, energy-based model, generative ConvNets, KL divergence, JKO discrete flow.

1 Introduction

Unsupervised learning of complex data distributions and sample generation are being extensively researched in the machine learning and computer vision communities [2, 3, 4]. Generative modeling learns the underlying data distribution from a quantity of data observations. A well-learned generative model can synthesize data samples similar to real-world ones, and is thus regarded as a way of imitating human creativity. There are two main families of representative generative models: likelihood-free models such as Generative Adversarial Networks (GANs) [5], and likelihood-based models such as Variational AutoEncoders (VAEs) [6] and flow-based models [7].

Energy-Based Models (EBMs) [8], as remarkable likelihood-based models, present an appealing property: they capture the dependencies among variables by assigning a scalar energy to each configuration and require no probabilistic normalization. Compared with GAN, which generates samples by applying a generator and a discriminator learned adversarially, an EBM can be directly optimized without any auxiliary network; by learning with an explicit potential function, it approximates the observed data distribution more closely and hence synthesizes more authentic samples. Compared with VAEs, EBMs have the merit of optimizing the exact log-likelihood instead of its lower bound; compared with flow-based models, they have the ability to use the learned model as a prior for image modeling, e.g., by transforming it into a discriminative Convolutional Neural Network (CNN) [9] for classification.

Fig. 1: Learning evolution visualization of GCN and wGCN. The contour map with particles in this figure simulates the energy dissipation process of both algorithms from the particle perspective. The image sequences are the generating steps and selected typical results for a “spaceship” obtained by GCN and wGCN, respectively. The wGCN algorithm produces higher-quality images than GCN, which collapses at the very beginning of the sampling iteration. The dissipation process of GCN is seriously unstable when suffering from model collapse, while that of wGCN is more orderly and ends up closer to the zero-energy point.

This paper studies a well-exploited EBM, the energy-based Generative ConvNet (GCN) [10]. GCN stems from the Filters, Random fields, And Maximum Entropy (FRAME) [11] model, which was proposed long before the popularity of deep learning and aims at modeling texture contexts. GCN can be regarded as a deep variant of FRAME that incorporates deep convolutional filters to deal with more complex and high-dimensional sample synthesis. In general, an EBM utilizes Markov chain Monte Carlo (MCMC) to sample from the model, since its partition function is intractable; to further alleviate the difficulty MCMC has in traversing different modes, a multi-scale GCN (MGCN) [12] was proposed, which synthesizes samples from small and coarse grids up to a fine grid.

Nevertheless, these energy-based GCN models follow the maximum entropy principle (MEP) and are cast into a learning problem of minimizing the KL divergence between the real-world data distribution and the model distribution. With MCMC dynamic discrete sampling, this learning process indicates that the learning gradient of each step approximately follows the gradient of a KL discrete flow. In our work, we revisit GCN with this KL discrete flow from a particle evolution perspective and accordingly reformulate GCN. From the particle evolution perspective, we consider all the samples as a group of disordered dynamic particles, and the learning process can be compared to the evolution of particles along a discrete flow (driven by the KL divergence, thus known as the KL discrete flow) at each learning time step. We also provide theoretical proofs of this equivalence from two viewpoints.

However, energy-based GCN models solve an intractable energy function by step-by-step iterative MCMC sampling and updating. This solving mechanism exhibits quite unstable learning and thus tends to generate blurry and twisted images, or even images with collapsed content. As shown in Fig. 1, the GCN model generates collapsed images marked in yellow, indicating a severe model-collapse problem. Considering the synthesis process as energy dissipation, the energy states of GCN are disorderly and converge to a wrong, non-zero energy location in the contour map. From the particle perspective, it becomes intuitive that the unstable learning process is mainly caused by a KL-vanishing problem. Thus we reformulate GCN from the KL discrete flow to a Jordan-Kinderlehrer-Otto (JKO) discrete flow [13] with a Wasserstein distance measure and derive a Wasserstein GCN (wGCN). As shown in Fig. 1, the particles of our proposed wGCN evolve more orderly and stably converge to the zero-energy location.

Based on the Wasserstein JKO discrete flow, we further seamlessly generalize GCN to derive a Generalized Generative ConvNet Model (GGCN). This generalization strategy is non-trivial, rather than simply replacing deep FRAME with GCN. In GCN, on the one hand, the strong dependency on observed data limits the model's learning ability and flexibility; on the other hand, MCMC sampling limits the learning efficiency, and this becomes severe for MGCN, whose learning time grows massively with the number of sampling grids at different scales. To address these issues, our proposed GGCN extends its reference distribution by learning a hidden-space mapping and proposes a trainable non-linear upsampling function to mimic the grid-by-grid upscaling and sampling process. This avoids the degradation of synthesis quality caused by errors accumulating across grids and further improves the model's generalization ability.

Since GGCN contains several components that must be learned iteratively, we present a symmetrical learning strategy that contains two similar sub-learning processes: one learns the GGCN that synthesizes by sampling; the other learns the non-linear upsampling function together with the hidden space by generating. Regarding the efficiency limitation of MCMC-based learning of EBMs, the model still suffers from serious time consumption as the number of sampling steps increases; this is the main reason that EBMs have so far rarely been trained on large-scale datasets. Accordingly, inspired by efficient variational reasoning [14], we employ a similar amortized learning method to further shorten the sampling process of GGCN.

In summary, our contributions are as follows:

  • We reformulate classical energy-based GCN models from the traditional information-theoretic viewpoint to a particle evolution perspective. From this particle perspective, to address the KL-vanishing issue, we further generalize GCN from a KL discrete flow to a JKO discrete flow driven by the Wasserstein metric in a principled manner, and derive a wGCN model that manages to solve the instability of the energy dissipation process in GCN, owing to the better geometric characteristics of the Wasserstein distance.

  • We propose a Generalized Generative ConvNet model that inherits the mechanism of the Wasserstein JKO discrete flow formulation. To minimize the learning bias, we introduce a hidden-space mapping strategy and employ a normal distribution as the reference distribution; a trainable non-linear upsampling function and amortized sampling are employed to further improve the learning efficiency. Furthermore, by learning the hidden space, the whole model can be learned in a symmetrical manner.

  • Extensive experiments are conducted on several large-scale datasets for image generation tasks, demonstrating quantitatively and qualitatively the improved generation quality of the proposed models.

2 Related Work

Generative EBMs originate from statistical physics and explicitly define the probability distributions of the signal; in that context, they are ordinarily called Gibbs distributions [15]. With the extensive development of CNNs, which have proven to be powerful discriminators, the generative perspective on this model has recently been explored by an increasing number of studies. [16] is the first study to introduce a generative gradient for pretraining a discriminative ConvNet by a nonparametric importance sampling scheme, and [17] proposes learning FRAME using pretrained filters of a modern CNN. [10] further studies the theory of GCN in depth and shows that the model has a unique internal autoencoding representation, in which the filters in the bottom-up operation take on a new role as basis functions in the top-down representation, consistent with the FRAME model. The FRAME model and its variants described above, such as GCN and multi-scale GCN, share the same MLE-based learning mechanism that follows an analysis-by-synthesis approach: it first generates synthesized samples from the current model using Langevin dynamics (LD) and subsequently learns the parameters based on the distance between observed and synthesized samples.

A preliminary version of our work was published in [18], which also explains FRAME from the particle evolution perspective. In this paper, however, we further seamlessly generalize the GCN with the JKO discrete flow and derive our Generalized Generative ConvNet Model (GGCN). GGCN extends its reference distribution to a hidden space to alleviate the dependency on limited observed data, and proposes a trainable non-linear upsampling function together with an amortized sampling method to improve learning efficiency. Briefly, the proposed GGCN improves learning flexibility and model generalization.

Another popular branch of deep generative models consists of black-box or implicit models that map between a simple low-dimensional prior and complex high-dimensional signals via a top-down deep neural network, such as GANs, VAEs, flow-based models and their variants. These models have been remarkably successful in generating realistic images. GANs and VAEs are trained together with an assisting model: VAEs optimize the evidence lower bound, which causes them to generate images that are fairly blurry in practice, while training GANs entails instability due to the difficulty of finding the exact Nash equilibrium [19], the lack of latent-space encoders, and the difficulty of assessing overfitting or memorization [20]. On the other hand, flow-based models either cannot be trained efficiently by maximum likelihood [21], or have poor performance [7, 22].

Fig. 2: Comparison of four categories of generative models in terms of framework. The Generalized Generative ConvNets (GGCN) model at the bottom is one of our presented models.

We provide guidelines in Fig. 2 to briefly illustrate the relations among the learning frameworks of these generative models. Unlike the majority of implicit generative models that use an auxiliary network to guide the training of the generator, energy-based generative models maintain a single model that simultaneously serves as a discriminator and a generator. The latter generate samples directly from the input data rather than from a latent space, which to a certain extent ensures that the model can be efficiently trained and can produce stable synthesized results with relatively low structural complexity. Nonetheless, such a model can also be used as an auxiliary model and combined with an implicit model to facilitate training [23, 24, 14]. Technically, the learning mechanism of our new symmetrically learned GCN is more similar to cooperative learning [23], since all parameter estimation steps have one goal in common: to make the model distribution approximate the data distribution. However, considering that it is derived by generalizing a single GCN, we term this new model a generalized GCN (GGCN). GGCN also implements a generator co-existing with the GCN, but unlike GAN, in which the discriminator confronts the generator, the generator here is introduced to generalize the base GCN.

3 Revisiting GCN Model from Particle Evolution Perspective

Considering the dynamic learning of conventional GCN, we regard all the samples as highly disordered dynamic particles in the data space, which can therefore be viewed as carrying physical energy. Real data samples are located where the energy is zero; namely, they can be regarded as particles in a stable state. The learning procedure can be equivalently compared to a process in which randomly initialized particles with high energy are gradually guided to low-energy states. The force that guides the particles can be modified step by step by estimating the current states of the particles. Adding the time dimension, one can easily imagine that the dynamic particles at discrete time points drift as flows or streams. With this idea, we attempt to reformulate GCN from this particle evolution perspective.

3.1 Generative ConvNet Model Theory

Based on an energy function, GCN is defined as the exponential tilt of a Gaussian reference distribution, which is a reformulation of the MRF model [17] and can be written as the following Gibbs distribution:

$$p(x;\theta)=\frac{1}{Z(\theta)}\exp\Big\{\sum_{k=1}^{K}\theta_{k}\,h\big(F_{k}*x+b_{k}\big)\Big\}\,q(x),\qquad(1)$$

where $h$ is a nonlinear activation function for an input, $\theta_k$ is the $k$th element of the $K$-dimensional model parameter $\theta$, $w$ denotes the parameters of the filters and $b_k$ is the bias; $F_k * x$ denotes a filtered image or a feature map; $q(x)$ denotes the Gaussian white noise model with mean $0$ and variance $\sigma^2$; the exponent (together with the reference term) defines the potential function $U(x;\theta)$.

The learning process of a GCN model (similar to that of MRF models) follows iterative MLE parameter estimation, which first updates the model parameters by trying to increase the log-likelihood and subsequently samples from the current distribution using parallel MCMC chains. Assuming that $\theta_t$ is the parameter of the sampled probability distribution at time $t$, the sampling process, according to [25], does not necessarily converge at each $t$. We here provide one persistent sampler that converges globally to reduce the number of calculations, through which we can get a numerical solution of $\theta$:


where the two expectation terms are the feature responses over the real-world data distribution and the model distribution, respectively. According to [17], the expectation above is estimated with LD, which injects noise into the parameter updates. A GCN can be considered as a FRAME model with trainable deep filters, namely the parameter is estimated by maximizing the likelihood function:


where the model provides the estimated data distribution at each sampling step.
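The analysis-by-synthesis loop described above can be sketched on a one-dimensional toy model. This is a minimal illustration rather than the paper's ConvNet: we assume a single quadratic feature and a standard-normal reference, so the learned model's variance has a closed form and the Langevin sampler and MLE gradient can both be written in a few lines:

```python
import random

random.seed(0)

# Toy exponential-family "GCN": p_theta(x) ∝ exp(theta * H(x)) * q(x),
# with reference q = N(0, 1) and feature H(x) = -x^2/2, so the potential is
# U(x; theta) = (1 + theta) * x^2 / 2 and p_theta = N(0, 1/(1 + theta)).
def H(x):
    return -0.5 * x * x

def grad_U(x, theta):
    return (1.0 + theta) * x

data = [random.gauss(0.0, 0.5) for _ in range(2000)]  # "observed" data, var 0.25

theta = 0.0
eps, eta, n_langevin, n_iters = 0.1, 0.5, 30, 300
chains = [random.gauss(0.0, 1.0) for _ in range(200)]  # persistent MCMC chains

for _ in range(n_iters):
    # synthesis step: Langevin dynamics  x <- x - (eps^2/2) U'(x) + eps * z
    for _ in range(n_langevin):
        chains = [x - 0.5 * eps * eps * grad_U(x, theta)
                  + eps * random.gauss(0.0, 1.0) for x in chains]
    # analysis step: MLE gradient  E_data[H] - E_model[H]
    grad = (sum(H(x) for x in data) / len(data)
            - sum(H(x) for x in chains) / len(chains))
    theta += eta * grad

model_var = 1.0 / (1.0 + theta)  # closed-form variance of the learned model
```

Under these assumed settings the learned model variance approaches the data variance, illustrating how the sampler and the parameter update cooperate.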

Inherently, in energy-based models, this learning and sampling process characterizes an evolution in a state space associated with learning gradients. Inspired by the gradient flow derived from energy models [26], it can essentially be regarded as an evolution governed by well-established stochastic particle model equations, where the samples can be considered Brownian particles. Such gradient flows have proven useful in theoretically and numerically solving diffusion equations that model porous media or crowd evolution [27]. For GCN, the learning gradient of each step approximately follows a discrete gradient flow, which is named a discrete flow for short, and allows us to attach a precise interpretation to the conventional energy-based model, moving from traditional information and statistical theory to the particle perspective.

3.2 Revisit GCN from Particle Evolution Perspective

From a particle evolution perspective, we establish a connection between GCN and a dynamic physics process and provide a generalized formulation of GCN as a discrete flow with a certain metric measure. Specifically, GCN follows a discrete flow with the KL divergence measure, as its objective is minimizing the KL divergence between the data distribution and the model distribution; its entire optimization process can be regarded as a particle evolution.

Herein we reformulate GCN in the context of a KL discrete flow; namely, the dynamic sampling and updating equations of GCN will be derived in this section based only on discrete flow theory. The discrete flow is related to discrete probability distributions (evolution discretized in time) of finite-dimensional problems. More precisely, it represents a system of independent Brownian particles, whose positions, given by a Wiener process (or Brownian motion, a stochastic process of moving particles), satisfy the following stochastic differential equation (SDE):

$$dX_{t}=b(X_{t})\,dt+\sigma\,dW_{t},\qquad(4)$$

where $b$ is the drift term (the particle index is left out for simplicity), $\sigma$ represents the diffusion term, $W_t$ denotes the Wiener process, and the subscript $t$ denotes values at time $t$. The empirical measure of those particles is proven to approximate Eq. 1 by an implicit descent step. The so-called KL discrete flow at time $t$ consists of the KL divergence and the energy function,


The connotation of a KL discrete flow describing a system of Brownian particles with probability distributions inspires a brand-new perspective for interpreting a GCN model in the discrete state. If we regard the observed signals, whose generating function has the Markov property, as Brownian particles, we can prove Theorem 1, which implies that the LD in Eq. 2 can be deduced from a KL discrete flow for learning GCN. This theorem can be sufficiently and necessarily proved through Lemma 1. Detailed proofs of Theorem 1 and Lemma 1 are provided in Appendix A.
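The particle system governed by the SDE above can be simulated with the standard Euler-Maruyama discretization: each particle takes a drift step plus a sqrt(dt)-scaled Gaussian increment. The specific drift b(x) = -x and diffusion σ = √2 below are illustrative Ornstein-Uhlenbeck choices (not from the paper), picked because the stationary density is then the standard Gaussian:

```python
import random

random.seed(1)

# Euler-Maruyama discretization of the SDE  dX_t = b(X_t) dt + sigma dW_t.
def euler_maruyama(b, sigma, particles, dt, n_steps):
    x = list(particles)
    for _ in range(n_steps):
        x = [xi + b(xi) * dt + sigma * (dt ** 0.5) * random.gauss(0.0, 1.0)
             for xi in x]
    return x

# drift b(x) = -x with sigma = sqrt(2): the stationary density is N(0, 1)
particles = [5.0] * 1000                      # start far from equilibrium
out = euler_maruyama(lambda x: -x, 2.0 ** 0.5, particles, dt=0.01, n_steps=1000)
mean = sum(out) / len(out)
var = sum((x - mean) ** 2 for x in out) / len(out)
```

After enough steps the empirical measure of the particle cloud relaxes toward the stationary Gaussian, mirroring how the empirical measure of the learning particles approaches the model density.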

Lemma 1.

For i.i.d. particles with a common generating function that has the Markov property, the empirical measure satisfies the large deviation principle with a rate functional in the form of the KL discrete flow.

Theorem 1.

Given a base measure and a clique potential, the density of GCN in Eq. 1 can be obtained sufficiently and necessarily by solving the following constrained optimization problem:

$$\min_{p}\ \mathrm{KL}\big(p\,\|\,q\big)\quad\text{s.t.}\quad\mathbb{E}_{p}\big[H_{k}(x)\big]=\mathbb{E}_{p_{\text{data}}}\big[H_{k}(x)\big],\ k=1,\dots,K.\qquad(6)$$
Let $\theta$ be the Lagrange multiplier as integrated in [11], and ensure that the constraints are satisfied; then, optimization objective 6 can be reformulated as follows (in distinction from Eq. 5, this represents the LD):


Since the drift is determined by the potential gradient, the SDE iteration in Eq. 4 can be expressed in the Langevin form as follows:

$$X_{t+1}=X_{t}-\frac{\epsilon^{2}}{2}\,\nabla_{x}U(X_{t};\theta)+\epsilon\,Z_{t},\qquad Z_{t}\sim\mathcal{N}(0,I),\qquad(8)$$

where $Z_t$ is the Gaussian term. By Lemma 1, if we fix $\theta$, the sampling scheme in Eq. 8 approaches the KL discrete flow, and the flow will fluctuate if $\theta$ varies. The parameter $\theta$ is updated by calculating the learning gradient, which implies that $\theta$ can dynamically transform the transition map (if we consider all the learned distributions transformed versions of the reference distribution) into the desired form. The sampling process of a GCN model can be summarized as follows:


where the last term is the derivative of the initial Gaussian noise. Examining the objective function in Eq. 7 shows that there is an alternating mechanism involved in updating the samples and the parameters.

Obviously, the derived Eq. 9 for updating the parameters is exactly the same as that of MLE in GCN; thus, optimizing to minimize the KL discrete flow from this particle perspective is, as Theorem 1 proves, equivalent to the traditional information-theoretic point of view, i.e., seeking the minimum amount of transformation of the reference measure. This theoretically justifies our GCN reformulation. However, such a minimization of the KL divergence might cause the energy to unpredictably collapse to zero; namely, the learned model will reduce to producing the initial noise instead of the desired minimal modification. Hence, in this case the learned model tends to degenerate, the images synthesized from a GCN trained with the KL divergence collapse immediately, and the image quality may barely recover. We identify this as the KL-vanishing issue.
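The KL-vanishing failure mode can be made concrete with a small discrete example (purely illustrative distributions, not the paper's image model): as the model is pulled toward the reference noise, the KL term to the reference decreases monotonically and reaches exactly zero at the degenerate solution, even though that model has drifted away from the data:

```python
import math

# KL divergence between two discrete distributions (0 * log 0 treated as 0).
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q_ref = [0.25, 0.25, 0.25, 0.25]   # reference (noise) distribution
p_data = [0.70, 0.10, 0.10, 0.10]  # data distribution
# Interpolate the model from the data towards the reference:
models = [[(1 - t) * pd + t * qr for pd, qr in zip(p_data, q_ref)]
          for t in (0.0, 0.5, 1.0)]
kls = [kl(m, q_ref) for m in models]
# kls decreases and hits exactly 0 at the degenerate model p = q_ref.
```

The minimizer of the KL term alone is the reference itself, which is precisely the collapse described above.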

4 Generalizing Energy-based Generative ConvNet Model

From the particle evolution perspective discussed above, a GCN learning process can be interpreted with a KL discrete flow; in turn, the KL-vanishing problem becomes manifest and likely causes severe model collapse. To address this problem, our reformulation of GCN in Sec. 3 provides a principled manner of extending GCN with the discrete flow; namely, the KL divergence term in GCN can be substituted by an appropriate metric measure under the SDE. These results still hold for GCN or MGCN, which can be viewed as FRAME models with trainable deep filters. Accordingly, we generalize the KL discrete flow with the KL divergence measure into a JKO discrete flow with the Wasserstein distance measure in the prototype energy-based GCN model, and derive the Wasserstein GCN in Sec. 4.1. This JKO discrete flow is connected with the theory of the Fokker-Planck equation and the Wasserstein distance [13].

To further improve model generalization and learning capability, we employ a normal distribution as the hidden space rather than the original downscaled ground-truth reference distribution in multi-scale GCN; considering that the grid-by-grid upscaling functions cause bias to accumulate, a trainable non-linear upsampling strategy is proposed. Together with all the above, a generalized Generative ConvNet model (GGCN) is derived in Sec. 4.2.

4.1 Wasserastein GCN Model

Although explaining GCN via a KL approach is rational from the perspective of information-theoretic methodology, there is a risk of the KL-vanishing problem, as discussed above. The main reason is the non-convergence of the MLE parameter-updating mechanism. In this section, we solve this issue by generalizing the KL discrete flow with the KL divergence measure into a JKO discrete flow with the Wasserstein distance measure, which can be theoretically proved from the particle perspective, and incorporate the Wasserstein JKO discrete flow into GCN to derive a Wasserstein GCN model (wGCN).

JKO Discrete Flow with Wasserstein Metric

To avoid the KL-vanishing problem, we introduce the JKO discrete flow driven by the Wasserstein metric. From the particle perspective, we explain this substitution according to [28]: in a KL method, the learned measure can be closer to a given empirical measure than the same measure obtained with the Wasserstein distance, and this closer measure makes the learning transition more inclined toward wrong or collapsed locations in the energy manifold. Additionally, [29] also claims that better convergence and approximation results can be obtained since the Wasserstein metric defines a weaker topology. The conclusion holds when the time step is small (meaning a relatively long sampling process), which rationalizes the proposed method. The proof of this conclusion in the one-dimensional case is presented in [30, 31, 32].

This JKO discrete flow connects the energy model with the theory of the Fokker-Planck equation, which describes the evolution of the probability density function of particle velocity; the required metric in the case of the Fokker-Planck equation is the Wasserstein metric on probability densities [13]. The solution of the Fokker-Planck equation follows, at each instant in time, the direction of steepest descent of an associated free-energy functional, which is usually regarded as a discrete gradient flow. Thus we introduce the Fokker-Planck equation here to build the connection. This formulation reveals an appealing, and previously unexplored, relationship between the Fokker-Planck equation and the associated free-energy functional.

We begin with a redefinition of notations. Let $\mathcal{P}(\Omega)$ denote the space of Borel probability measures on a given subset $\Omega$ of the space $\mathbb{R}^{d}$. Given some sufficient statistics, a scalar and a base measure, the space of distributions satisfying a linear constraint is defined accordingly. The Wasserstein space of order $p$ is defined as the set of measures in $\mathcal{P}(\Omega)$ with finite $p$-th moment, where $\|\cdot\|_{p}$ denotes the $p$-norm on $\mathbb{R}^{d}$. $|\Omega|$ is the number of elements in the domain $\Omega$. $\nabla$ denotes the gradient, and $\nabla\cdot$ denotes the divergence operator.

Fokker-Planck Equation. Let $F(\rho)$ be an integral functional and $\delta F/\delta\rho$ denote its Euler-Lagrange first variation; then, the Fokker-Planck equation is as follows:

$$\frac{\partial\rho_{t}}{\partial t}=\nabla\cdot\Big(\rho_{t}\,\nabla\frac{\delta F(\rho_{t})}{\delta\rho_{t}}\Big).\qquad(10)$$
Compared with Eq. 5, this equation models the probability density of the particles in a continuous state [13].

Wasserstein Metric. The steepest-descent solution, or discrete gradient flow, makes sense only in the context of an appropriate metric. The required metric for the Fokker-Planck equation is the Wasserstein metric on probability densities [13]. The Benamou-Brenier form of the Wasserstein metric [33] of order 2 involves solving the following smooth optimal transport (OT) problem over any probability measures in the Wasserstein space; the solution can be derived by using the continuity equation shown in Eq. 10,

$$W_{2}^{2}(\mu_{0},\mu_{1})=\min_{(\rho,v)}\Big\{\int_{0}^{1}\!\!\int_{\Omega}\rho_{t}(x)\,\|v_{t}(x)\|^{2}\,dx\,dt\ :\ \partial_{t}\rho_{t}+\nabla\cdot(\rho_{t}v_{t})=0,\ \rho_{0}=\mu_{0},\ \rho_{1}=\mu_{1}\Big\},\qquad(11)$$
where $v_{t}$ belongs to the tangent space of the manifold (here we suppose the particles are moving through a manifold), governed by some potential and associated with the curve $\rho_{t}$.

JKO Discrete Flow. Following the initial study [13], which shows how the Fokker-Planck diffusions of distributions in Eq. 10 are recovered when minimizing entropy functionals according to the Wasserstein metric, the JKO discrete flow is used by our method to replace the initial KL divergence with the entropic Wasserstein distance. The functional of the flow is as follows:

$$\rho_{t+1}=\arg\min_{\rho}\ \frac{1}{2\tau}W_{2}^{2}(\rho_{t},\rho)+\int_{\Omega}U(x;\theta)\,\rho(x)\,dx+\beta^{-1}\int_{\Omega}\rho(x)\log\rho(x)\,dx.\qquad(12)$$
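As a numerical illustration of the entropic Wasserstein term used in the JKO step, the following sketch computes an entropy-regularized transport cost between two histograms with plain Sinkhorn iterations, a standard approximation; the grid, cost matrix, and regularization strength are illustrative choices, not values from the paper:

```python
import math

# Entropy-regularized Wasserstein cost between two histograms on a 1-D grid,
# computed with Sinkhorn iterations on the Gibbs kernel K = exp(-cost / reg).
def sinkhorn(a, b, cost, reg=0.05, n_iters=500):
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # transport plan and its total cost
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))

grid = [i / 4.0 for i in range(5)]                 # points 0, 0.25, ..., 1
cost = [[(x - y) ** 2 for y in grid] for x in grid]
a = [1.0, 0.0, 0.0, 0.0, 0.0]                      # all mass at x = 0
b = [0.0, 0.0, 0.0, 0.0, 1.0]                      # all mass at x = 1
w_ab = sinkhorn(a, b, cost)                        # squared distance ≈ 1.0
w_aa = sinkhorn(a, a, cost)                        # ≈ 0
```

Unlike the KL divergence, this transport cost reflects the geometry of the ground space (it grows with how far mass must move), which is the property the JKO flow exploits.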
Wasserstein GCN

Wasserstein GCN (wGCN) is associated with a JKO discrete flow instead of the original KL discrete flow of GCN. In the remainder of this section we further discuss this replacement and provide a new JKO discrete flow LD optimization objective.

Remark 1.

The initial Gaussian term is left out for convenience to facilitate the derivation; otherwise, the entropy in Eq. 12 would be written as a relative entropy.

By Theorem 1, the JKO discrete flow instead of the KL discrete flow can be calculated approximately, and its steady state will approach Eq. 1. Using the Wasserstein distance as a dissipation mechanism instead of the KL divergence allows regarding the diffusion of Eq. 4 as the steepest descent of the clique energy and entropy w.r.t. the Wasserstein metric. Solving such an optimization problem is identical to solving the Monge-Kantorovich mass transference problem [33].

Using the second mean value theorem for definite integrals, we can approximate the time integral in the Wasserstein term by two randomly interpolated rectangles:


where the first parameter parameterizes the time piece, and the second represents a random interpolation parameter. Eq. 13 means that the functional derivative of the objective w.r.t. the density is proportional to the following:


which is exactly the result of Proposition 8.5.6 in [34]. Assume that the potential is at least twice differentiable and treat Eq. 14 as the variational condition in Eq. 10; then, plugging Eq. 14 into the continuity equation of Eq. 10 turns the latter into a modified JKO discrete flow in the Fokker-Planck form as follows:


Then, the corresponding SDE can be written in the Euler-Maruyama form as follows:


By Remark 1, if we reconsider the initial Gaussian term, the discrete flow in Eq. 16 should be incremented accordingly.

Remark 2.

If $U$ is the energy function defined in Eq. 1, then $\nabla^{2}U=0$ almost everywhere.

The above is a direct result, since the potential defined in GCN only involves inner products, the piecewise-linear ReLU and other linear operations, so the second derivative is obviously 0 almost everywhere. Consequently, the time evolution of the density in Eq. 15 and of the samples in Eq. 16 will reduce to Eq. 10 and Eq. 8, respectively. Thus, the SDE holds by default, i.e., it retains the Langevin form while the gradient of the model parameter does not vanish.
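Remark 2 can be checked numerically on a toy one-filter potential; the form θ·ReLU(wx + b) and its constants below are made-up illustrations, not the paper's network. Away from the ReLU kink, a central finite-difference estimate of the second derivative vanishes:

```python
# A potential built from linear maps and ReLU is piecewise linear, so its
# second derivative is 0 everywhere except at the ReLU kinks.
def relu(z):
    return z if z > 0 else 0.0

def f(x, w=1.7, b=-0.3, theta=2.0):
    # toy one-filter "potential": theta * relu(w * x + b)
    return theta * relu(w * x + b)

def second_derivative(g, x, h=1e-4):
    # central finite difference for g''(x)
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / (h * h)

# away from the kink at x = -b/w ≈ 0.176 the second derivative vanishes
vals = [second_derivative(f, x) for x in (-2.0, -0.5, 1.0, 3.0)]
```

The same argument extends to compositions of convolutions (linear) and ReLU (piecewise linear), which is what the remark relies on.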

Similar to the parameterized KL flow defined in Eq. 7, we propose an analogous form for the JKO case. Using Eq. 13 and Eq. 14, the final optimization objective function can be formulated as follows:


Given the entire discussion above, the learning process of wGCN can be constructed by gradient ascent of the objective in Eq. 17. The calculation steps of the formulation are summarized in Eq. 18:


Equation 18 indicates that the gradient determined in the Wasserstein manner is augmented with soft gradient-norm constraints between the last two iterations. Such a gradient norm has the following advantages over the original iterative process (Eq. 9). First, the norm serves as the constant-speed geodesic connecting the distributions of the last two iterations in the manifold they span, which may speed up convergence. Second, it can be interpreted as a soft force counteracting the original gradient and preventing the entire learning process from stopping. Finally, in experiments we observe that it can preserve the internal structural information of the data. The details of the learning-by-synthesizing algorithm are provided in Algorithm 1.
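One hypothetical reading of the soft-constraint update can be sketched as follows; the exact penalty form and the names `lr` and `lam` are our assumptions for illustration, not the paper's verbatim Eq. 18: the plain gradient is augmented with a term penalizing the change between the last two gradients.

```python
# Hypothetical sketch: augment the plain MLE gradient with a soft term on the
# difference between the current and previous gradients (our reading of the
# "soft gradient norm constraint between the last two iterations").
def wasserstein_style_step(theta, grad, prev_grad, lr=0.1, lam=0.5):
    constrained = [g + lam * (g - pg) for g, pg in zip(grad, prev_grad)]
    return [t + lr * c for t, c in zip(theta, constrained)]

theta = [0.0, 0.0]
prev_grad = [0.0, 0.0]
grad = [1.0, -2.0]
theta = wasserstein_style_step(theta, grad, prev_grad)
# with prev_grad = 0, the step is lr * (1 + lam) * grad = [0.15, -0.3]
```

When successive gradients agree, the extra term vanishes and the step reduces to the plain update; when they disagree, the step is softly pushed away from stalling, matching the "counteracting force" intuition above.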

4.2 Generalized Generative ConvNet Model

We have reformulated the conventional energy-based GCN model from a particle evolution perspective, theoretically explained GCN as a KL discrete flow in Sec. 3, and generalized the formulation of GCN from the KL discrete flow to the JKO discrete flow driven by the Wasserstein metric in Sec. 4.1. In this section, we further generalize the GCN/wGCN with the JKO discrete flow and derive our Generalized Generative ConvNet Model (GGCN). This generalization strategy is non-trivial, rather than simply extending GCN/wGCN with the JKO discrete flow.

GCN can be regarded as a one-scale model, and its multi-scale version MGCN [12] was proposed to speed up model learning. However, MGCN is especially inefficient in training, since it stacks several GCNs on different grids, all of which have to be estimated through time-consuming MCMC sampling. Besides, even after applying the JKO discrete flow to each GCN grid, the generative capability and generalization property are still greatly limited by the linear upscaling functions and the downscaled reference distribution. This is the main reason that GCNs are rarely trained on large-scale datasets. Thus, to improve learning efficiency, we use a non-linear upsampling function to mimic the updating and sampling process of the GCNs in MGCN, which to some extent improves the model's generalization property. Next, we expand the initial reference distribution in the first grid of MGCN to a random Gaussian reference distribution (which is invalid without non-linear trainable upsampling); finally, we introduce a hidden-space learning mechanism, which further improves learning stability. The whole GGCN can be learned in a symmetrical parameter-estimation cycle.

Algorithm 1 Learning Wasserstein GCN
Require: Training data; number of Langevin steps; number of training iterations; learning rates
Ensure: Synthesized data; estimated parameters
1:  Initialize the model parameters and the synthesized samples
2:  for each training iteration do
3:     for each Langevin step do
4:        Sample the synthesized data via the Langevin form of Eq. 16
5:     end for
6:     Sample a mini-batch of training data
7:     Update the synthesized samples
8:     Update the model parameters via Eq. 18
9:  end for
(a) Pipeline comparison
(b) Symmetrical learning strategy
Fig. 3: Example pipeline of MGCN and GGCN with three grids. As shown in the upper left subfigure, a multi-scale GCN is a tiled structure of several one-scale GCNs. In GGCN, due to the introduction of hidden space mapping, the first grid can be omitted since it is equivalent to an auto-encoder (or, if we implement this part with 1×1 convolutional layers, there is no loss in performance). Our non-linear sampling improves the sampling efficiency compared with MGCN's time-consuming MCMC sampling. The right part outlines the training strategy of GGCN. and are the response losses computed by the same method but with different parts of the networks fixed. Note that every GCN learning step in GGCN uses the Wasserstein method and the amortized sampling process.

Generalized Generative ConvNet Model

MGCN is a pile of several GCNs at different scales, as shown in subfigure (a) of Fig. 3. It learns and synthesizes by training each GCN grid-by-grid with LD sampling. Therefore, in high-dimensional data synthesis, the learning inefficiency grows significantly, which also results in low synthesis quality. We thus consider an equivalent learnable upscaling structure to mimic the learning process of the middle GCN grids in MGCN. To further simplify the model and its learning procedure, we then keep only the last GCN grid of the MGCN and extend the model to learn the distribution from a hidden space.

Nonlinear Learnable Upsampling. Although the multi-grid mechanism does improve synthesis quality slightly [12], it mostly depends on hundreds of MCMC sampling steps, which is inefficient, especially when synthesizing high-dimensional data. Noting that upscaling in MGCN is simply a linear operation, we presume that substituting the middle GCN grids with layers of learnable nonlinear upsampling functions will speed up the learning procedure, since these layers mimic the sampling process by simply minimizing the error between the sampled synthesized data and the real-world data. A learning-hidden-space mechanism is further proposed (the same process as learning in GCN), which allows the whole model to be learned symmetrically.
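As an illustration of the idea (not the paper's architecture), the sketch below replaces MGCN's fixed linear upscaling with an upsampling step followed by a learnable non-linearity; a real implementation would use transposed convolutions, and the per-channel affine parameters here are assumptions.

```python
import numpy as np

def learnable_upsample(x, w, b):
    """Nearest-neighbour 2x upsampling (the fixed, linear part, as in
    MGCN) followed by a learnable per-channel affine map and a tanh
    non-linearity (the trainable part). Real implementations would use
    transposed convolutions; `w` and `b` are the toy trainable weights."""
    up = x.repeat(2, axis=-2).repeat(2, axis=-1)
    return np.tanh(w[:, None, None] * up + b[:, None, None])

x = np.ones((3, 4, 4))                 # channels x height x width
w, b = np.ones(3), np.zeros(3)         # per-channel scale and shift
y = learnable_upsample(x, w, b)        # -> shape (3, 8, 8)
```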

Hidden Space Mapping. The two one-scale GCNs at each end, as shown in Fig. 3, are not both necessary. Since learning a GCN amounts to modeling the input data distribution, if we replace the input with a random sample from a Gaussian distribution of the same dimension as the downscaled real data, then the function of the first GCN becomes modeling a given random distribution. Thus, this first GCN can be omitted, and the whole pipeline turns into the structure shown in the right subfigure (b) of Fig. 3. If we regard the original sample synthesis as a generating process, a mapping from a projection space of the real data distribution to the real distribution is learned through MCMC, and the latter process can be considered learning a generator. This resembles learning the generator in a GAN; thus, naturally, we extend the input space to a hidden space.

Given a probability distribution with density function , we take to be the nonlinear upsampling network, where represents sampled initial Gaussian noise and denotes the parameters. The given information includes only the random noise (from an unknown real-world distribution ), the trainable function and its derivative . Our learning objective is to find an optimal satisfying , such that the output matches the target probability density function . The parameters are learned by the following:


where , is the synthesized sample, is the batch size and denotes the gradient of .
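A minimal numpy sketch of one gradient step on this objective, with an assumed toy affine generator g(z) = z + theta so the gradient is available in closed form (the actual network is a deep non-linear upsampler trained by backpropagation):

```python
import numpy as np

def generator_step(theta, z_batch, x_batch, lr=0.05):
    """One gradient step minimizing the squared error between synthesized
    samples g(z) and the real batch. g(z) = z + theta is a toy affine
    generator (an assumption) so the gradient is closed-form; the real
    network is a deep non-linear upsampler trained by backpropagation."""
    synth = z_batch + theta
    grad = 2.0 * (synth - x_batch).mean(axis=0)
    return theta - lr * grad

rng = np.random.default_rng(0)
theta = np.zeros(4)
z = rng.standard_normal((16, 4))          # initial Gaussian noise batch
x = rng.standard_normal((16, 4)) + 3.0    # "real" batch from a shifted target
for _ in range(200):
    theta = generator_step(theta, z, x)
```

After enough steps, `theta` converges to the batch-mean offset between real and noise samples, i.e., the closed-form minimizer of this toy objective.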

Symmetrical Learning Strategy

As shown in subfigure (a) of Fig. 3, we simplify the learning procedure to a trade-off between learning the GCN and learning the hidden space with non-linear sampling functions. The training strategy is illustrated in subfigure (b) of Fig. 3; it is a symmetrical structure learned with the same iterative method. To be more specific, we define the hidden variable with the non-linear upsampling function as a generator, and the GCN in GGCN as a synthesizor. In fact, the whole model is still a GCN with a parameterized mapping that synthesizes by updating the hidden space; the synthesis is therefore an intermediate result of the whole model's learning and sampling. Note that the GCN part in GGCN is modified to wGCN in the final implementation. The detailed algorithm can be found in Algorithm 2. As shown in Algorithm 2, line 4, LD is still necessary, which may result in a relatively long learning process; we thus introduce amortized sampling to GGCN, denoted as GGCNAMS.
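Purely as a structural sketch of this symmetrical loop (all components below are affine toy stand-ins, and the update rules are illustrative rather than the paper's equations), the alternation between synthesizor updates and the periodic generator/hidden-space updates might be organized as:

```python
import numpy as np

def train_ggcn_sketch(data, n_iters=8, gen_every=4, rng=None):
    """Toy skeleton of the symmetrical learning strategy: the synthesizor
    is updated every iteration, while the generator and the hidden
    variables are updated once every `gen_every` iterations, both with
    the same gradient-style recipe. All models here are affine toys."""
    rng = np.random.default_rng() if rng is None else rng
    theta_syn = np.zeros(data.shape[1])     # synthesizor parameters (toy)
    theta_gen = np.zeros(data.shape[1])     # generator parameters (toy)
    z = rng.standard_normal(data.shape)     # hidden variables
    log = []
    for t in range(n_iters):
        samples = z + theta_gen                              # generate
        samples = samples - 0.1 * (samples - data.mean(0))   # one-step refine
        theta_syn += 0.1 * (data.mean(0) - samples.mean(0))  # synthesizor step
        if (t + 1) % gen_every == 0:                         # periodic updates
            theta_gen += 0.1 * (samples - z - theta_gen).mean(0)
            z -= 0.1 * ((z + theta_gen) - samples)           # latent step
            log.append("both")
        else:
            log.append("syn")
    return theta_syn, theta_gen, log

data = np.random.default_rng(0).standard_normal((10, 3))
theta_syn, theta_gen, log = train_ggcn_sketch(data, rng=np.random.default_rng(1))
```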

Learning the Hidden Space. Even if the nonlinear upsampling function is introduced and an acceleration method is applied to GCN, the whole framework still has a disadvantage of high sensitivity to random initialization, network structure, and hyperparameter selection.

We consider this to be mainly caused by a possibly incorrect hidden space initialization , i.e., the downscaled real data for the first grid of MGCN. Thus, we apply the hidden space optimization strategy proposed in [35, 36], which captures the connection between the noise vector and the signal and prevents the learning procedure from going astray. The hidden variable is updated by minimizing the reconstruction loss of the nonlinear upsampling function as follows:


where denotes the final approximate reconstruction loss. Note that optimizing either or the network parameters will not influence the Wasserstein process of updating ; see Algorithm 2.
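A minimal sketch of this latent update with an assumed linear generator g(z) = Wz, for which the reconstruction gradient is available in closed form; only z moves, mirroring the point that the synthesizor's Wasserstein update is unaffected:

```python
import numpy as np

def update_latent(z, x, W, lr=0.1):
    """One gradient step on the reconstruction loss ||g(z) - x||^2 with a
    toy linear generator g(z) = W @ z (an assumption); only z moves, so
    the Wasserstein update of the synthesizor parameters is untouched."""
    grad_z = 2.0 * W.T @ (W @ z - x)
    return z - lr * grad_z

rng = np.random.default_rng(1)
W = np.eye(4) * 0.5
x = rng.standard_normal(4)
z = np.zeros(4)
loss0 = np.sum((W @ z - x) ** 2)
for _ in range(50):
    z = update_latent(z, x, W)
loss1 = np.sum((W @ z - x) ** 2)          # reconstruction loss after updates
```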

The GGCN model has the following advantages: (1) the generator mimics the sampling process by simply minimizing the error between the sampled synthesized data and the real-world data; (2) the low-dimensional hidden space can be mapped to high-dimensional training data by this function (which also makes the model friendly to high-dimensional data synthesis); (3) higher learning efficiency is gained compared with the previous MGCN; (4) the divergence that accumulates scale-by-scale in MGCN is reduced in GGCN, thereby improving synthesis quality.

0:  Training data ; Langevin steps ; training iterations ; learning rates 
0:  Synthesized data ; estimated parameters ,,
1:  Sample
2:  for  to  do
4:     Get by Langevin sampling with respect to .
6:     (Lines 4–6 are the same as learning and in Algorithm 1)
9:  end for
Algorithm 2 Symmetrical Learning Strategy for GGCN

Amortized Langevin Dynamic Sampling. The traditional MCMC algorithm is an inefficient inference technique for approximating energy-based probabilistic models. A GGCN still needs a relatively long time to iterate to an approximation, since in Eq. 19 the samples are synthesized by several rounds of sampling in the synthesizor (a GCN). To further accelerate our model, and inspired by the amortized SVGD method proposed by [14], which is claimed to maximally reduce the KL divergence, we use an amortized dynamic sampling method to speed up the sampling process.

If the scalar in Eq. 19 is sufficiently small, the optimization of can be rewritten in least-squares form. Using the Taylor expansion, we obtain . Since , we obtain the following:


After being rewritten in the least-squares form, the optimization problem becomes as follows:


Since it involves a matrix inverse, this parameter estimation routine is inevitably computationally intensive; hence, one step of gradient descent provides a sufficient approximation. The final updating equation is as follows:

Fig. 4: Synthesized results of GCN and wGCN under various hyperparameter settings, where “Baseline” represents the models with default settings. For each image scale (64×64, 128×128), wGCN always generates better images than GCN. Furthermore, under different hyperparameter settings, wGCN still presents stable learning ability. In contrast, GCN fails to address the sample collapse and artifact problems derived from model collapse. For example, with small changes in the learning rate , the generated image quality of GCN significantly deteriorates, with severe image artifacts. Overall, the learning stability of wGCN clearly surpasses that of GCN. If we consider model collapse an unstable particle-energy-dissipation phenomenon, the Wasserstein method effectively remedies this shortcoming of GCN, since learning settings that seriously interfere with GCN synthesis have a much slighter effect on wGCN.

This updating equation defines a reconstruction approximation loss , which we apply to Eq. 20. Although this equation is a rough approximation [37], it is both efficient and effective in the implementation phase, because if is small, one step of gradient descent gets relatively close to the optimal solution. We update by following a more explicit form via gradient ascent, which is equivalent to Eq. 23. In this form, is trained to maximize instead of sampling from . The learning strategy of remains as in Eq. 18, and is learned through Eq. 3.
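The contrast between the exact least-squares solution (with its matrix inverse) and the amortized one-step shortcut can be sketched as follows; the learning rate here is an illustrative assumption:

```python
import numpy as np

def least_squares_exact(A, b):
    # Closed-form normal-equations solution; needs a matrix solve,
    # which is the computationally intensive step the text refers to.
    return np.linalg.solve(A.T @ A, A.T @ b)

def least_squares_one_step(A, b, x0, lr=0.01):
    # Amortized shortcut: a single gradient-descent step on ||Ax - b||^2,
    # avoiding the inverse entirely.
    grad = 2.0 * A.T @ (A @ x0 - b)
    return x0 - lr * grad

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
x_exact = least_squares_exact(A, b)
x_amort = least_squares_one_step(A, b, np.zeros(4))
r0 = np.sum(b ** 2)                      # residual before the step (x0 = 0)
r1 = np.sum((A @ x_amort - b) ** 2)      # residual after one cheap step
```

One cheap step already reduces the residual, which is the sense in which the text argues a single gradient step is a sufficient approximation.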

5 Experiments

In this section, we conduct extensive experimental evaluations of our proposed models on five benchmarks: CelebA [38], SUN [39], LSUN-Bedroom [40], CIFAR-10 [41], and Tiny ImageNet [42]. First, we elaborate the experimental settings in Section 5.1 and provide detailed quantitative comparison results of our proposed models in Section 5.2 and Section 5.3. Then, empirical model analysis and assessment are discussed in Section 5.4, and the advantage of our proposed models in few-sample training is demonstrated in Section 5.5. Finally, we extend our models to other tasks, i.e., image inpainting and interpolation, in Section 5.6.

5.1 Experimental Settings

Datasets. Four benchmark datasets are selected for our experimental evaluation: one face dataset (i.e., CelebA [38]) and three natural image datasets (i.e., LSUN-Bedroom [40], CIFAR-10 [41], Tiny ImageNet [42]).

  • CelebA [38] is a widely used large-scale face dataset of human celebrities, which contains over 200,000 images. We only use its face-center-cropped version.

  • LSUN-Bedroom [40] contains 100,000 images from the bedroom category. They are selected from SUN [39], a large dataset mainly for scene recognition.

  • CIFAR-10 [41] consists of 60,000 colour images in 10 categories. Each category has 6,000 images.

  • Tiny ImageNet [42] is a light version of ImageNet dataset. This dataset has 200 classes out of 1,000 classes of ImageNet and 500 images in each category.

Considering that some large-scale datasets have no category labels (e.g., CelebA and LSUN-Bedroom), we only perform unconditional training on these datasets, while datasets with label information also support conditional learning (the input vector is the concatenation of random Gaussian noise and an embedded label vector).

Evaluation metrics. Two metrics are chosen to evaluate our proposed models, i.e., the inception score (IS) [43] and the response distance.

  • Inception score is a prevalent metric for generative models, which measures the diversity and realism of generated samples. It uses the Inception v2 network [44], pretrained on ImageNet [45], to capture the classifiable properties of samples.

  • Response distance [10] is defined as follows:

    where denotes the th filter. The smaller is, the better the generated results, since , which implies that provides an approximation of the divergence between the target data distribution and the generated data distribution. Furthermore, with Eq. 2, the faster declines, the faster converges. This metric reflects energy from the particle perspective: the more energy gained, the larger the response distance.
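An illustrative implementation of the response distance, with a linear filter bank standing in for the CNN filters (the function name and shapes are assumptions):

```python
import numpy as np

def response_distance(filters, x_real, x_synth):
    """Illustrative response distance: mean absolute difference between
    the average filter responses of real and synthesized batches.
    A (k, d) linear filter bank stands in for the paper's CNN filters."""
    r_real = (x_real @ filters.T).mean(axis=0)
    r_synth = (x_synth @ filters.T).mean(axis=0)
    return np.abs(r_real - r_synth).mean()

rng = np.random.default_rng(3)
F = rng.standard_normal((5, 8))           # 5 filters over 8-dim signals
x = rng.standard_normal((32, 8))
d_same = response_distance(F, x, x)       # identical batches -> distance 0
d_diff = response_distance(F, x, x + 1.0)
```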

Learning settings. For GCN, we follow the default FRAME settings in [17] and update the CNN parameters with learning rate . For wGCN, is set to ; is 0.2 for the CelebA dataset and 0.112 for CIFAR-10. The default setting of is 0.01, is 20, the number of learning iterations is 100, the number of LD sampling steps within each learning iteration is 50, and the batch sizes and are 100. The implementation of in wGCN uses the first four convolutional layers of a pretrained VGG-16 [46], which are fixed during training.

For the proposed variants of our models (i.e., GGCN, GGCNLHS, GGCNLHSAMS and GGCNAMS), we set to 1.0, the learning rates of the generators and synthesizors to , and the batch size to 100; the number of Langevin learning steps is among . The generator and hidden variables are updated after every 4 iterations of updating the synthesizor and generating images. The generator is a ResNet50-based [47] auto-encoder, and the synthesizor is simply a ResNet50. There is also a shallower network version of the GGCNs; detailed structural information is provided in the Appendix.

The dimension of the hidden variable in GGCNs is for ImageNet and CIFAR-10 and for CelebA, and its learning rate is on ImageNet, on CelebA, and on CIFAR-10. All parameters are learned by the Adam optimizer, except that is learned by SGD. Typically there are 500 or 1,000 epochs of training in total, and a memory replay buffer similar to that in [48] is used. This image buffer serves as a one-step-ahead initialization of the next sampling step.

Fig. 5: Averaged learning curves of the response distance on the CelebA, LSUN-Bedroom, and CIFAR-10 datasets. The sooner the response-distance curves converge to zero, the more stable the training process of the respective model. All red curves, corresponding to wGCN, converge more quickly than the blue original GCN curves. The response distance reflects energy: a response distance that approaches more quickly and closely implies a better learning state and higher synthesis quality.
(a) MGCN
(b) wMGCN
Fig. 6: Comparison of synthesized results on the CelebA dataset. Images from (a) to (d) show the results of MGCN, wMGCN, GGCNLHS+AMS and GGCNAMS, respectively. From (a) to (b), we find that the Wasserstein method can eliminate the model collapse phenomenon in (a), but some twisting artifacts remain in the synthesized samples, as shown in (b). The model with amortized sampling and non-linear upsampling but without learning the hidden space shows no twisting, as in (c); however, a little blurring remains. The learning-hidden-space mechanism further reduces the blurring, as manifested in (d), where the final results are much clearer.

5.2 Evaluation on the Wasserstein metric

To address the model collapse problem, the Wasserstein distance under particle evolution conditions is introduced into conventional GCN models, deriving wGCN. To demonstrate its effectiveness, we mainly compare the proposed wGCN with GCN in this section. Extensive experiments are conducted on the CelebA, LSUN-Bedroom and CIFAR-10 datasets under various experimental settings, involving different learning rates and input image scales.

Fig. 4 shows synthesized results of GCN and wGCN under various hyperparameter settings. For each image scale (64×64 or 128×128), wGCN always generates better images than GCN. Furthermore, under different hyperparameter settings, wGCN still presents stable learning ability. In contrast, GCN fails to address the sample artifact problem resulting from model collapse. For example, as shown in the third and fourth rows of Fig. 4, with small changes of the learning rate , the generated image quality of GCN significantly deteriorates, with severe image artifacts. Clearly, the learning of wGCN is less sensitive to hyperparameter tuning, and its learning stability surpasses that of FRAME.

Similar behavior is illustrated by the averaged learning curves of GCN and wGCN in Fig. 5, where the averaged learning curves of wGCN all converge closely to on the three datasets. This implies faster and more stable learning of wGCN compared with GCN, since the smaller the response distance, the less the divergence between the real and estimated distributions. Thus, if we consider model collapse an unstable particle-energy-dissipation phenomenon, the Wasserstein method effectively remedies this shortcoming of GCN, since learning settings that seriously interfere with GCN synthesis have a much slighter effect on wGCN.

As demonstrated in Fig. 4 and Fig. 5, without the Wasserstein method the synthesized results collapse after several iterations of GCN learning. Similar model collapse also occurs for other variants of FRAME, e.g., GCN and MGCN. As shown in Fig. 6(a), most synthesized images contain gray spot areas, which are obvious indications of model collapse. Conversely, variants of our proposed wFRAME, e.g., wMGCN and GGCN, greatly improve the image generation quality, as shown in Fig. 6(b)-(d), demonstrating stable learning ability.

5.3 Comparison on the proposed GGCN

Our wGCN reformulates the conventional energy-based models (i.e., FRAME and GCN) from a particle evolution perspective, generalizes the conventional GCN approach by introducing the JKO discrete flow with the Wasserstein distance measure, and derives wFRAME variants (i.e., wMGCN, GGCN) to solve more practical problems. Detailed experimental results of these models are elaborated below. The correspondence between models and methods can be found in Table I.

TABLE I: LCN stands for learning CNN filters; JKO is the JKO discrete flow (namely the Wasserstein learning mechanism); MS represents multi-scale; NLU means applying layers of non-linear upsampling functions, which cannot coexist with MS; HS is an abbreviation of hidden space, which here means extending the input to a random space and leaving out the first grid of GCN; AMS denotes the amortized sampling method; LHS is learning the hidden space.

Experimental Results on CelebA

We provide synthesized results of MGCN, wMGCN, GGCNLHS+AMS and GGCNAMS on the CelebA dataset in Fig. 6. This experiment is an ablation study for building GGCN; it is conducted to isolate the effects of the learning mechanisms within GGCN, namely HS, NLU and LHS. The model structure gradually grows more complicated from MGCN to GGCN+AMS, where the suffix “+AMS” means with the amortized sampling method and “LHS” means without learning the hidden space. We find AMS necessary in face generation; thus, in this study, GGCN is learned with AMS. This phenomenon will be explained later. Even after time-consuming iterations of training, MGCN still suffers from the collapse issue despite careful finetuning. As demonstrated in Fig. 6(a), most synthesized images of MGCN contain gray spot areas, which clearly indicate the model collapse issue. In Fig. 6(b), our proposed wMGCN alleviates this issue by applying the JKO discrete flow, and the gray spots no longer appear in the synthesized results. However, the synthesized images of wMGCN are still blurry and distorted, and artifacts or ghosting are present, since Wasserstein learning alone cannot solve the problem of the divergence between the real data distribution and the estimated data distribution accumulating from scale to scale.

Accordingly, our proposed GGCNLHS+AMS introduces the JKO discrete flow with the Wasserstein distance measure and further improves the image quality, as shown in Fig. 6(c). Compared with Fig. 6(b), these results have fewer twisted lines, and the faces are more realistic. Considering that the twisting in the synthesis might be caused by the divergence accumulating through the different scales of MGCN, GGCNLHS+AMS leverages non-linear upsampling functions (NLU) and maps the input to a hidden space (HS), and thereby manages to conquer this problem. Compared with the results of those methods, the images produced by the proposed GGCNAMS in Fig. 6(d) are the clearest and least distorted, and are comparable with real human face images.

(a) DCGAN without label
(b) GGCN without label
(c) GGCN with label
Fig. 7: Unconditional generation of GGCN (b) and DCGAN (a) on CIFAR-10. We also provide conditional generation results of GGCN in (c). Objects in (b) can be identified more easily than those in (a). (c) maintains a similar structure and context to (b).

Inception Score

CIFAR-10 without labels
GCN 4.18±0.13
Improved GAN 4.36±0.05
wGCN (ours) 4.31±0.12
ALI [49] 5.34±0.05
wFRAME [18] 5.52±0.13
GGCNAMS (ours) 6.79±0.08

CIFAR-10 with labels
FRAME [18] 4.95±0.05
GCN 5.50±0.05
WINN-5CNNs [50] 5.58±0.05
wGCN (ours) 5.70±0.06
wFRAME [18] 6.05±0.13
SteinGAN [14] 6.35±0.05
MGCN [12] 6.57±0.05
DCGAN [12] 6.58±0.05
GGCNAMS (ours) 7.01±0.05
GGCN (ours) 7.10±0.07

TABLE II: Inception scores on CIFAR-10. The upper part shows results of unconditional learning; the lower part shows conditional learning. The state-of-the-art conditional result of Improved GAN reported in [43] is 8.09, but this result has not been reproduced so far. Among the explicit models, GGCN outperforms most methods in both settings. The inception score of the real images of CIFAR-10 is 11.24±0.11.

Experimental Results on CIFAR-10

We compare our proposed models with state-of-the-art methods in Table II. Given the available category labels of the CIFAR-10 dataset, inception scores for both conditional and unconditional learning are provided. Compared with the base model GCN, GGCN obtains gains of 2.61 and 1.55 in the unconditional and conditional settings, respectively. In comparison with state-of-the-art GAN-based methods, our proposed model still demonstrates a significant improvement in inception score in both settings; for example, GGCN gains around 0.6 over DCGAN. Notice that the inception score of wGCN is slightly lower than that of wFRAME; this might be because the filters of wFRAME and FRAME come from VGG models well pretrained on ImageNet, while in GCN and wGCN the filters are trained from scratch and may converge to relatively inferior features. Furthermore, GGCN generates higher-quality images with more realistic natural scenes than DCGAN, as demonstrated in Fig. 7(b); objects in these images, such as cars, horses, and airplanes, can easily be recognized.

(a) Unconditional generation on Tiny ImageNet
(b) Conditional generation on Tiny ImageNet
Fig. 8: Due to the latent space updating mechanism, the gap between conditional and unconditional generation decreases.

Experimental Results on Tiny ImageNet

Tiny ImageNet synthesis results are provided for both conditional and unconditional generation, visualized in Fig. 8. The capacity of ImageNet is much larger than that of CelebA and CIFAR-10, and its images contain relatively complicated objects and cluttered backgrounds; these features challenge the learning of generative models on ImageNet, so the generations on this dataset are not as satisfying as those on the former two datasets. However, GGCN manages to generate images with clear edges and textures, and some objects and scenes can even be identified, such as the fruits and birds in Fig. 8(a) and (b). Owing to the learning-hidden-space mechanism, the results of the two types of generation are visually similar.

5.4 Model Analysis and Assessment

Improvement of Training Stability

Quantitatively, the curves in Fig. 5 are calculated by averaging across the entire dataset. The wGCN method has a lower cost according to the response distance ; namely, the direct criterion on filter banks (Section 5.1) between synthesized and target samples is smaller and decreases more steadily. More specifically, our algorithm largely solves the model collapse problem of GCN, because it not only ensures the closeness of the generated samples to the “ground-truth” samples but also stabilizes the learning of the model parameters . The three plots clearly show that the quantitative measures are strongly correlated with the qualitative visualizations of generated samples. In the absence of model collapse, we obtain results that are comparable to or even better than those of GCN.

Since GGCN can be regarded as a GCN with a more complex sampling mechanism, the Wasserstein method applied to GCN is also implemented in GGCN. The results in Fig. 6(a)-(d) show that the collapse issue is completely solved in GCNs with the Wasserstein method, since gray spots no longer appear in the synthesis results of any proposed model. This implies that the original energy dissipation process is rectified to be more stable. Another stability improvement is manifested in the generalization of GCN. Comparing the results of Fig. 6(b) and (c), the context of the latter is more sensibly organized. As discussed in Section 4.2.1, the divergence, or learning error, accumulates as the learning scale increases in MGCN, and the Wasserstein method merely helps with the collapse issue. Thus, we owe this improvement to the non-linear upsampling functions and the extension to the hidden-space mechanism, which largely reduce the accumulating error by combining the estimation and modification of the hidden variable.

Improvement of Image Generation Quality

According to the performance measured by the response distance , the quality of image synthesis is improved. This measurement corresponds to the iterative learning process of both GCN and wGCN. The learning curves in Fig. 5 represent observations over the whole datasets and also imply that wGCN converges faster than GCN.

From subfigure (b) of Fig. 6, we find that even when trained with the Wasserstein learning method, most result images of MGCN suffer from unclear outline edges, totally distorted faces, or unknown spots that make the pictures extremely dirty. On the contrary, images produced by GGCN without (Fig. 6(c)) and with the learning-hidden-space mechanism (Fig. 6(d)) are reasonably blended, sensibly structured, brightly colored and less distorted. In particular, the synthesized results of GGCNAMS maintain clearer contexts than those of GGCNLHSAMS, which implies that LHS improves image quality in this respect.

On the other hand, compared with the traditional GCN or MGCN, GGCNAMS can be trained more efficiently on large datasets such as CIFAR-10 and ImageNet. Since the inception score is shown to correlate well with human judgment of the realism of images generated on the CIFAR-10 dataset, a higher inception score than other energy-based models (e.g., WINN [50]) implies to some extent that GGCN synthesizes higher-quality samples. This conclusion can also be drawn from Fig. 7, which shows that GGCN visually generates more identifiable objects than DCGAN on CIFAR-10.


(a) GGCN+AMS (1 step)
(b) GGCN (100 steps)
Fig. 9: Group (a) is synthesized by GGCN using one-step sampling (conditional generation on CIFAR-10), which is equivalent to applying amortized sampling; the results in (b) correspond to 100-step sampling. The upper group has a lower inception score than the other results.

Fig. 10: The images in the first row show the input images selected from the SUN dataset. The images below present the outputs of DCGAN, wFRAME and wGCN. The right image in the last row displays the outputs of our method. All the GCNs can synthesize real, authentic images efficiently with few samples, while even after thousands of epochs, GAN can barely fit the real data distribution and generates ambiguous results. The generator of GGCN can also generate authentic images, but the samples tend to be reconstructions of the training samples.
(a) Square
(b) Spot
(c) SpotSquare
Fig. 11: Inpainting generation examples of GGCNAMS. (a) is square-mask inpainting, (b) is the random spot mask, and (c) is the combination of the two. As shown, GGCNAMS can complete the missing space with reasonable context.

Note that the amortized sampling method involves a trade-off that may influence model performance. We consider amortized sampling a cost-effective acceleration scheme, since performance declines in some situations when it is applied. We suspect this is due to the Taylor-expansion approximation in Eq. 21. We therefore conduct a comparison experiment between GGCN and GGCNAMS on CIFAR-10 with conditional learning. The visual results are presented in Fig. 9, where group (a) shows the results of GGCNAMS and (b) those of GGCN. Although GGCNAMS can be trained efficiently, GGCN synthesizes better samples. Nevertheless, this conclusion does not hold on CelebA, where we find that GGCNAMS outperforms GGCN, as shown in Fig. 9. We conjecture this is due to the properties of the datasets: distinguishing features benefit generation on CIFAR-10 but are, to some extent, a burden for CelebA. Thus, amortized sampling is not recommended for generation on CIFAR-10.

5.5 Few Sample Training

It is generally known that recent popular generative models such as GANs are trained on massive datasets; energy-based models were initially criticized for their inefficiency in such large-scale training. However, those large-scale models mostly cannot be trained on small training sets. Few-sample training requires quick and accurate learning, a target that GCN (or other energy-based models such as FRAME) can achieve well, while DCGAN, as a representative of popular generative models, has a hard time converging on this task even after many epochs of training.

We conduct few-sample learning with wFRAME and wGCN and compare them with DCGAN. Even the most primitive energy-based models perform in this scenario; thus, wGCN and wFRAME can achieve even better results (no collapse anymore). The performance of DCGAN in modeling only a few images is shown in Fig. 10, where, for a fair comparison, we duplicate the input images to a total of 10,000 to match the training environment of DCGAN. Both the compared wGCN and wFRAME models are trained using the Wasserstein method. DCGAN's training procedure stops once it converges, but its results still look much the same. In addition, all the comparison experiments in Section 5.2 are few-sample training results.

5.6 Image Inpainting and Interpolation

To fully validate our proposed models, we further extend them to two other tasks, i.e., image inpainting and interpolation on the CelebA dataset, where they maintain superior performance.

In the inpainting experiment on GGCNAMS, the masked image is regarded as the hidden-space latent variable, mapped by the generator to the real-world data space. An upsampling generator taking such a hidden variable as input can thus be trained to complete the masked images, as in other generative models. GGCNAMS adopts a ResNet50 autoencoder as the generator, whose input is an image multiplied by a mask array. We apply three types of masks to evaluate our model: a square box, random spots, and a combination of the two. As shown in Fig. 11, the model infers reasonable and texture-level smooth content to fill the blank space. The larger the blank space, the more the generated image differs from the original one. We also evaluate our inpainting results with frequently used criteria; the figures in Table III imply that GGCNAMS can synthesize realistic faces.
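The three mask types can be sketched as follows; the square size and spot probability are illustrative choices, not the paper's settings:

```python
import numpy as np

def make_mask(h, w, kind, rng=None):
    """Build the three mask types used in the inpainting experiment:
    a centred square box ("square"), random spots ("spot"), or their
    combination ("both"). The square size and the 0.3 spot probability
    are illustrative choices, not the paper's settings."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.ones((h, w))
    if kind in ("square", "both"):
        mask[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = 0.0
    if kind in ("spot", "both"):
        mask *= (rng.random((h, w)) > 0.3).astype(float)
    return mask

img = np.ones((8, 8))
masked = img * make_mask(8, 8, "square")   # generator input = image * mask
```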

Mask        MSE     FID    SSIM   PSNR
Spot        0.0033  25.44  0.954  30.410
Square      0.0041  30.69  0.934  30.095
SpotSquare  0.0046  25.03  0.940  30.039
TABLE III: We report the MSE (mean squared error), FID (Fréchet inception distance [19]), SSIM (structural similarity index measure) and PSNR (peak signal-to-noise ratio) of GGCNAMS on the CelebA inpainting task. The experimental settings are the default CelebA synthesis settings of Section 5.3.1. All criteria show that the synthesis results for the three masks from GGCNAMS are close to real images.

In the interpolation experiment, we randomly pick two 256-dimensional hidden variable vectors, clip their values to the range of [-1, 1], and then linearly generate 10 intermediate vectors between them. In Fig.12, we display some typical interpolation results of our model. As shown, no matter how different the faces synthesized from the two picked hidden variables are, the faces generated from the intermediate vectors remain well-formed while the facial attributes change continuously. For example, in the first row, the faces at the two ends are generated from the two picked random vectors: the left is a woman facing left, while the other is a man whose face is turned slightly to the right. The faces between them form a transition from female to male and from left-facing to right-facing. This example shows a transition crossing gender and pose; we also provide transitions between different hairstyles, expressions, skin colors, etc.
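The interpolation step described above can be sketched as follows (a NumPy sketch; the generator call itself is omitted, and the variable names are our own):

```python
import numpy as np

def interpolate_latents(z1, z2, num_mid=10):
    """Clip two latent vectors to [-1, 1] and return num_mid evenly
    spaced intermediate vectors (endpoints excluded)."""
    z1, z2 = np.clip(z1, -1.0, 1.0), np.clip(z2, -1.0, 1.0)
    alphas = np.linspace(0.0, 1.0, num_mid + 2)[1:-1]  # interior points only
    return np.stack([(1.0 - a) * z1 + a * z2 for a in alphas])

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=256), rng.normal(size=256)
mids = interpolate_latents(z1, z2)  # shape (10, 256); each row is fed to the generator
```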

Fig. 12: Image interpolation results of the proposed GGCNAMS on the CelebA dataset. Every interpolation has 8 midpoints. The transfer process has been tested across genders and on different or identical face attributes, such as hair color, face pose, etc.

6 Conclusions

In this paper, we track and rederive the origin of the GCN model from the perspective of particle evolution and discover a potential factor that may lead to deteriorating quality of generated samples and instability of model training, i.e., the vanishing problem inherent in the minimization of the KL divergence. Based on this discovery, we propose the wGCN model by reformulating the KL discrete flow in the GCN model into the JKO flow and show empirically that this strategy can overcome the above-mentioned deficiencies. The Wasserstein method can also be applied to other energy-based GCN models. Additionally, to overcome the challenges of time consumption and model collapse during synthesis, we propose another model called GGCN, which is modified from an MGCN but can be regarded as a one-scale GCN. GGCN is designed and trained with a specific symmetrical learning structure. Experiments demonstrate the superiority of the proposed Wasserstein method and the GGCN models, and comparative results show that the Wasserstein method can significantly mitigate the vanishing issue of energy-based models and produce more visually appealing results. Moreover, the final GGCN model can be further accelerated through a cost-effective amortized sampling mechanism. Our investigation of the generalization of energy-based GCN models presents an appealing research topic from the particle evolution perspective and is expected to provide an impetus for present research on EBMs.

Appendix A

Theorem 1. Given a base measure and a clique potential, the density of the GCN in Eq. 1 can be obtained, sufficiently and necessarily, by solving the following constrained optimization problem:
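For reference, the standard FRAME-style maximum-entropy formulation can be written as below; the symbols here ($q$ for the base measure, $H_k$ for the clique potentials, $h_k^{\mathrm{obs}}$ for the observed statistics) are our own choice of notation and are not reproduced from the original equation:

```latex
\max_{p}\; -\int p(x)\log\frac{p(x)}{q(x)}\,dx
\quad \text{s.t.} \quad
\mathbb{E}_{p}\big[H_k(x)\big] = h_k^{\mathrm{obs}},\; k = 1,\dots,K,
\qquad\Longrightarrow\qquad
p(x;\lambda) = \frac{q(x)}{Z(\lambda)}\exp\Big(\sum_{k=1}^{K}\lambda_k H_k(x)\Big),
```

where the Gibbs-form maximizer follows from Lagrangian duality.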


Necessity can be proven using the MEP, calculating iteratively:

Sufficiency: Recalling the Markov property in Eq. 1, we can write the inner product as a sum of feature responses w.r.t. different cliques; then, the shared pattern activation can be approximated by the Dirac measure as follows:

The result coincides with the empirical measure, so the proof of sufficiency reduces to that of Lemma 1, which is completed elsewhere. ∎

Proof of Lemma 1. Under the Gaussian reference measure, a GCN model has the piecewise Gaussian property, as summarized in Proposition 1. This proposition is a reformulation of Theorem 1 and implies that different pieces can be regarded as different generated samples acting as Brownian particles.

Proposition 1.

Equation 3 is piecewise Gaussian, and on each piece the probability density can be written as follows:


where the mean is an approximate reconstruction of the sample in one part of the data space, obtained by a linear transformation involving inner products of the model parameters and a piecewise linear activation function (ReLU).
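To make the piecewise-Gaussian property concrete, here is a sketch in our own notation, assuming a Gaussian reference measure $\mathcal{N}(0, \sigma^2 I)$ and a ReLU network whose score $f(x; w)$ is linear, say $\langle a, x\rangle + b$, on each activation region:

```latex
p(x) \;\propto\; \exp\!\big(f(x;w)\big)\, e^{-\|x\|^2/(2\sigma^2)}
\;\propto\; \exp\!\Big(-\frac{\|x - \sigma^2 a\|^2}{2\sigma^2}\Big),
```

so on each piece the density is Gaussian with mean $\sigma^2 a$, a linear function of the model parameters, as stated.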

By Proposition 1, each particle (image piece) in a GCN model has a transition kernel of Gaussian form (equation 25), which describes the probability of a particle moving from one position to another within a given time. Let a fixed measure, such as the Gaussian distribution, be the initial measure of the Brownian particles at the initial time.
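In standard notation (our own symbols, assuming the usual Brownian scaling; this is not copied from the paper's equation 25), the Gaussian transition kernel of a $d$-dimensional Brownian particle is

```latex
p_t(x, y) \;=\; (4\pi t)^{-d/2}\, \exp\!\left(-\frac{\|y - x\|^2}{4t}\right),
```

which gives the density for a particle moving from $x$ to $y$ in time $t$.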

Sanov's theorem shows that the empirical measure of such transition particles satisfies a large deviation principle (LDP) with the following rate functional:
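In its standard form (our notation; $\hat{\mu}_n$ is the empirical measure of $n$ particles and $\pi$ the reference measure), Sanov's theorem reads

```latex
\mathbb{P}\big(\hat{\mu}_n \approx \mu\big) \;\asymp\; \exp\!\big(-n\,\mathrm{KL}(\mu \,\|\, \pi)\big),
```

i.e., the rate functional is the KL divergence from the reference measure.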


Specifically, each Brownian particle has internal subparticles that are independent and belong to different cliques. With the number of cliques playing the role of the sample size, Cramér's theorem states that for i.i.d. random variables with a common moment generating function, the empirical mean satisfies an LDP with rate functional given by the Legendre transformation of its logarithm,
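In standard notation (ours), with $\Lambda(\theta) = \log \mathbb{E}\, e^{\langle \theta, X_1\rangle}$ the cumulant generating function of the i.i.d. variables, this Cramér rate functional is the Legendre transform

```latex
I(x) \;=\; \Lambda^{*}(x) \;=\; \sup_{\theta}\,\big(\langle \theta, x\rangle - \Lambda(\theta)\big).
```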


Since the empirical measure is simply the empirical mean of Dirac measures, the empirical measure over all particles is as follows:

where the exponent is exactly the KL discrete flow. Thus, the empirical measure of the activation patterns of all particles satisfies an LDP in discrete time with the KL discrete flow as its rate functional.

Appendix B Network Structure Details

GGCN v0.  This network structure is the GGCN v0 used for the synthesis results in Fig.6 (c) and Fig.9 (a). There is also a ResNet50-based version, v1, with which we achieve the best results on CIFAR-10 and Tiny ImageNet; it is provided in the code.

Operation Kernel Step Size Heat map Norm Activation
Deconv Leaky ReLU
Deconv Leaky ReLU
Deconv Leaky ReLU
Deconv Leaky ReLU
Deconv Leaky ReLU
Conv Sigmoid
Conv Leaky ReLU
Conv Leaky ReLU
Conv Leaky ReLU
Conv Leaky ReLU
Conv Leaky ReLU
Deconv Leaky ReLU
Deconv Leaky ReLU
Deconv Leaky ReLU
Deconv Leaky ReLU
Deconv Leaky ReLU
Conv Sigmoid
Optimizer Adam (, )
Batch Size 100
Number of iterations 500
Leaky ReLU slope 0.02
Dropout 0.0
Weight initialization Isotropic Gaussian (, ), Constant()
TABLE IV: Detailed network structure information of GGCN v0. Network structure v0 is used to obtain all of the experimental examples shown in Fig. 6. Structure v1 is a stack of three ResNets built to attain the best performance. Note that the network structure is shared between GGCN and GGCN+LHS. For more details, please refer to the code; the link is listed in the footnote.

Yang Wu received the B.S. degree in applied mathematics from South China University of Technology, Guangzhou, China. She is currently pursuing her Ph.D. degree at the School of Data and Computer Science of Sun Yat-Sen University, advised by Prof. Liang Lin. Her current research interests include computer vision and machine learning, particularly generative modeling and embodied tasks.

Xu Cai received his B.S. degree in computer science and technology from Xidian University, Xi'an, China, and his Master's degree from Sun Yat-Sen University. He is currently pursuing his Ph.D. degree at the School of Computing at the National University of Singapore. His current research interest is mainly theoretical machine learning.

Pengxu Wei received the B.S. degree in computer science and technology from the China University of Mining and Technology, Beijing, China, in 2011 and the Ph.D. degree from University of Chinese Academy of Sciences in 2018. Her current research interests include visual object recognition, detection and latent variable learning. She is currently a research scientist at Sun Yat-sen University.

Guanbin Li is currently an associate professor at the School of Data and Computer Science, Sun Yat-sen University. He received his Ph.D. degree from the University of Hong Kong in 2016. He was a recipient of a Hong Kong Postgraduate Fellowship. His current research interests include computer vision, image processing, and deep learning. He has authored and co-authored more than 20 papers in top-tier academic journals and conferences. He serves as an area chair for the conference of VISAPP. He has served as a reviewer for numerous academic journals and conferences, such as TPAMI, TIP, TMM, TC, TNNLS, CVPR2018 and IJCAI2018.

Liang Lin is a Full Professor at Sun Yat-sen University, and the CEO of DMAI. He served as the Executive R&D Director and Distinguished Scientist of SenseTime Group from 2016 to 2018, taking charge of transferring cutting-edge technology into products. He has authored or co-authored more than 200 papers in leading academic journals and conferences (e.g., TPAMI/IJCV, CVPR/ICCV/NIPS/ICML/AAAI). He is an associate editor of IEEE Transactions on Human-Machine Systems and IET Computer Vision. He served as Area Chair for numerous conferences such as CVPR and ICCV. He is the recipient of numerous awards and honors, including the Wu Wen-Jun Artificial Intelligence Award for Natural Science, an ICCV Best Paper Nomination in 2019, the Annual Best Paper Award by Pattern Recognition (Elsevier) in 2018, the Best Paper Diamond Award at IEEE ICME 2017, a Google Faculty Award in 2012, and the Hong Kong Scholars Award in 2014. He is a Fellow of IET.


  1. S. C. Zhu, Y. Wu, and D. Mumford, “Filters, random fields and maximum entropy (frame): Towards a unified theory for texture modeling,” International Journal of Computer Vision, vol. 27, no. 2, pp. 107–126, 1998.
  2. J. Baek, G. J. McLachlan, and L. K. Flack, “Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 7, pp. 1298–1309, 2009.
  3. Y. Wei, Y. Tang, and P. D. McNicholas, “Flexible high-dimensional unsupervised learning with missing data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  4. J. Liang, J. Yang, H.-Y. Lee, K. Wang, and M.-H. Yang, “Sub-gan: An unsupervised generative model via subspaces,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 698–714.
  5. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
  6. D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” International Conference on Learning Representations (ICLR), Banff, AB, Canada, Conference Track Proceedings, 2014.
  7. L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent components estimation,” International Conference on Learning Representations(ICLR), 2015.
  8. Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial on energy-based learning,” Predicting structured data, vol. 1, no. 0, 2006.
  9. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  10. J. Xie, Y. Lu, S.-C. Zhu, and Y. Wu, “A theory of generative convnet,” in International Conference on Machine Learning, 2016, pp. 2635–2644.
  11. S. C. Zhu, Y. N. Wu, and D. Mumford, “Minimax entropy principle and its application to texture modeling,” Neural Computation, vol. 9, no. 8, pp. 1627–1660, 1997.
  12. R. Gao, Y. Lu, J. Zhou, S.-C. Zhu, and Y. Nian Wu, “Learning generative convnets via multi-grid modeling and sampling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9155–9164.
  13. R. Jordan, D. Kinderlehrer, and F. Otto, “The variational formulation of the fokker–planck equation,” SIAM journal on mathematical analysis, vol. 29, no. 1, pp. 1–17, 1998.
  14. D. Wang and Q. Liu, “Learning to draw samples: With application to amortized mle for generative adversarial learning,” arXiv preprint arXiv:1611.01722, 2016.
  15. L. D. Landau and E. M. Lifshitz, Course of theoretical physics.   Elsevier, 2013.
  16. J. Dai, Y. Lu, and Y.-N. Wu, “Generative modeling of convolutional neural networks,” International Conference on Learning Representations(ICLR), 2015.
  17. Y. Lu, S.-C. Zhu, and Y. N. Wu, “Learning frame models using cnn filters,” Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
  18. X. Cai, Y. Wu, G. Li, Z. Chen, and L. Lin, “Frame revisited: An interpretation view based on particle evolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 3256–3263.
  19. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
  20. A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” International Conference on Learning Representations(ICLR), 2016.
  21. D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improving variational inference with inverse autoregressive flow,” Advances in Neural Information Processing Systems, 2016.
  22. L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” International Conference on Learning Representations(ICLR), 2017.
  23. J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu, “Cooperative training of descriptor and generator networks,” arXiv preprint arXiv:1609.09408, 2016.
  24. T. Kim and Y. Bengio, “Deep directed generative models with energy-based probability estimation,” arXiv preprint arXiv:1606.03439, 2016.
  25. L. Younes, “Parametric inference for imperfectly observed gibbsian fields,” Probability Theory and Related Fields, vol. 82, no. 4, pp. 625–645, 1989.
  26. M. A. Peletier, “Variational modelling: Energies, gradient flows, and large deviations,” arXiv preprint arXiv:1402.1990, 2014.
  27. S. Whitaker, “Flow in porous media i: A theoretical derivation of darcy’s law,” Transport in porous media, vol. 1, no. 1, pp. 3–25, 1986.
  28. G. Montavon, K.-R. Müller, and M. Cuturi, “Wasserstein training of restricted boltzmann machines,” in Advances in Neural Information Processing Systems, 2016, pp. 3718–3726.
  29. M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
  30. S. Adams, N. Dirr, M. A. Peletier, and J. Zimmer, “From a large-deviations principle to the wasserstein gradient flow: a new micro-macro passage,” Communications in Mathematical Physics, vol. 307, no. 3, pp. 791–815, 2011.
  31. M. H. Duong, V. Laschos, and M. Renger, “Wasserstein gradient flows from large deviations of many-particle limits,” ESAIM: Control, Optimisation and Calculus of Variations, vol. 19, no. 4, pp. 1166–1188, 2013.
  32. M. Erbar, J. Maas, and D. R. M. Renger, “From large deviations to wasserstein gradient flows in multiple dimensions,” Electronic Communications in Probability, vol. 20, 2015.
  33. J.-D. Benamou and Y. Brenier, “A computational fluid mechanics solution to the monge-kantorovich mass transfer problem,” Numerische Mathematik, vol. 84, no. 3, pp. 375–393, 2000.
  34. L. Ambrosio, N. Gigli, and G. Savaré, Gradient flows: in metric spaces and in the space of probability measures.   Springer Science & Business Media, 2008.
  35. T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu, “Alternating back-propagation for generator network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
  36. P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam, “Optimizing the latent space of generative networks,” Proceedings of International Conference on Machine Learning(ICML), 2017.
  37. Q. Liu and D. Wang, “Learning deep energy models: Contrastive divergence vs. amortized mle,” arXiv preprint arXiv:1707.00797, 2017.
  38. Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
  39. J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in Computer vision and pattern recognition (CVPR), 2010, pp. 3485–3492.
  40. F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
  41. A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  42. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  43. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
  44. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
  45. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  46. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” International Conference on Learning Representations(ICLR), 2015.
  47. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  48. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  49. D. Warde-Farley and Y. Bengio, “Improving generative adversarial networks with denoising feature matching,” International Conference on Learning Representations(ICLR), 2017.
  50. K. Lee, W. Xu, F. Fan, and Z. Tu, “Wasserstein introspective neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.