Generalizing Energy-Based Generative ConvNets from a Particle Evolution Perspective
Abstract
Compared with Generative Adversarial Networks (GANs), Energy-Based generative Models (EBMs) possess two appealing properties: i) they can be directly optimized without requiring an auxiliary network during learning and synthesizing; ii) they can better approximate the underlying distribution of the observed data by explicitly learning potential functions. This paper studies a branch of EBMs, i.e., energy-based Generative ConvNets (GCNs), which minimize an energy function defined by a bottom-up ConvNet. From the perspective of particle physics, we solve the problem of unstable energy dissipation that might damage the quality of the synthesized samples during maximum likelihood learning. Specifically, we first establish a connection between the classical FRAME model [1] and a dynamic physics process, and generalize GCN as a discrete flow with a certain metric measure from the particle perspective. To address the KL-vanishing issue, we then reformulate GCN from the KL discrete flow with a KL divergence measure to a Jordan-Kinderlehrer-Otto (JKO) discrete flow with a Wasserstein distance metric and derive a Wasserstein GCN (wGCN). Based on these theoretical studies of GCN, we finally derive a Generalized GCN (GGCN) to further improve model generalization and learning capability. GGCN introduces a hidden space mapping strategy, employing a normal distribution as the reference distribution to address the learning bias issue. Owing to MCMC sampling, GCNs still suffer from a serious time-consuming issue as the number of sampling steps increases; thus, a trainable non-linear upsampling function and an amortized learning scheme are proposed to improve learning efficiency. Our proposed GGCN is trained in a symmetrical learning manner. Besides, quantitative and qualitative experiments are conducted on several widely used face and natural image datasets. Our experimental results surpass those of existing models in both model stability and the quality of generated samples.
The source code of our work is publicly available at https://github.com/uiyo/GeneralizedGCN.
1 Introduction
Unsupervised learning of complex data distributions and sample generation is being extensively researched in the machine learning and computer vision communities [2, 3, 4]. Generative modeling learns the underlying data distribution from a quantity of data observations. A well-learned generative model can synthesize data samples similar to real-world ones, an ability thus regarded as imitating human creativity. There mainly exist two types of representative generative models: likelihood-free models such as Generative Adversarial Networks (GANs) [5], and likelihood-based models such as Variational Auto-Encoders (VAEs) [6] and flow-based models [7].
Energy-Based Models (EBMs) [8], as a remarkable family of likelihood-based models, present an appealing property: they capture the dependencies among variables by associating a scalar energy with each configuration of the variables and require no probabilistic normalization. Compared with GANs, which generate samples by applying a generator and a discriminator learned adversarially, an EBM can be directly optimized without any auxiliary network; by learning an explicit potential function, it approximates the observed data distribution more closely and hence synthesizes more authentic samples. Compared with VAEs, EBMs enable optimization of the exact log-likelihood instead of its lower bound; compared with flow-based models, the learned EBM can serve as a prior for image modeling, e.g., by transforming it into a discriminative Convolutional Neural Network (CNN) [9] for classification.
This paper studies a well-exploited EBM, the energy-based Generative ConvNet (GCN) [10]. GCN stems from the Filters, Random fields, And Maximum Entropy (FRAME) [11] model, which was proposed long before the popularity of deep learning and aims at modeling texture contexts. GCN can be regarded as a deep variant of FRAME that incorporates deep convolutional filters to deal with more complex and high-dimensional sample synthesis. In general, an EBM resorts to Markov chain Monte Carlo (MCMC) sampling because of its intractable partition function; hence, to further alleviate the difficulty MCMC has in traversing different modes, a multi-scale GCN (MGCN) [12] was proposed that synthesizes samples from small, coarse grids up to a fine grid.
Nevertheless, these energy-based GCN models follow the maximum entropy principle (MEP) and are cast into a learning problem of minimizing the KL divergence between the real-world data distribution and the model distribution. With MCMC dynamic discrete sampling, this learning process of minimizing the KL divergence indicates that the learning gradient at each step approximately follows the gradient of a KL discrete flow. In our work, we revisit GCN with this KL discrete flow from a particle evolution perspective and reformulate GCN accordingly. Under this perspective, we consider all the samples as a group of disordered dynamic particles, and the learning process can be compared to the evolution of particles following a discrete flow (driven by the KL divergence, thus termed the KL discrete flow) at each learning time step. We also provide theoretical proofs of this equivalence from two viewpoints.
However, energy-based GCN models solve an intractable energy function by step-by-step iterative MCMC sampling and updating. This solving mechanism exhibits quite unstable learning and thus tends to generate blurry and twisted images, or even images with collapsed content. As shown in Fig.1, the GCN model generates collapsed images (marked in yellow), indicating a severe model collapse problem. Viewing the synthesis process as energy dissipation, the energy states of GCN are disorderly and converge to a wrong, non-zero energy location in the contour map. From the particle perspective, it becomes more intuitive that the unstable learning process is mainly caused by a KL-vanishing problem. Thus we reformulate GCN from the KL discrete flow to a Jordan-Kinderlehrer-Otto (JKO) discrete flow [13] with a Wasserstein distance measure and derive a Wasserstein GCN (wGCN). As shown in Fig.1, the particles of our proposed wGCN evolve more orderly and converge stably to the correct energy location.
Based on the Wasserstein JKO discrete flow, we further seamlessly generalize GCN to derive a Generalized Generative ConvNet Model (GGCN). This generalization strategy is non-trivial, going beyond simply replacing deep FRAME with GCN. In GCN, on one hand, the strong dependency on observed data limits the model's learning ability and flexibility; on the other hand, MCMC sampling limits learning efficiency, and this becomes severe for MGCN, whose learning time grows massively with the grids of the sampling process at different scales. To address these issues, our proposed GGCN extends the reference distribution by learning a hidden space mapping and proposes a trainable non-linear upsampling function to mimic the grid-by-grid upscaling and sampling process. It avoids the degradation of synthesis quality caused by errors accumulating across multiple grids and further improves the model's generalization ability.
Considering that the iterative learning of several components in GGCN is a pressing problem, we present a symmetrical learning strategy that contains two similar sub-learning processes: one learns the GCN, which synthesizes by sampling; the other learns the non-linear upsampling function together with the hidden space by generating. Owing to the efficiency limitation of MCMC-based learning of EBMs, the model still suffers from a serious time-consuming problem as the number of sampling steps increases. This is the main reason that, so far, EBMs have rarely been trained on large-scale datasets. Accordingly, inspired by efficient variational reasoning [14], we employ a similar amortized learning method to further shorten the sampling process of GGCN.
In summary, our contributions are as follows:

We reformulate classical energy-based GCN models from the traditional information-theoretic viewpoint to a particle evolution perspective. From this particle perspective, to address the KL-vanishing issue, we further generalize GCN from a KL discrete flow to a JKO discrete flow driven by the Wasserstein metric in a principled manner, and derive a wGCN model that manages to solve the instability of the energy dissipation process in GCN, owing to the better geometric characteristics of the Wasserstein distance.

We propose a Generalized Generative ConvNet model that inherits the mechanism of the Wasserstein JKO discrete flow formulation. To minimize the learning bias, we introduce a hidden space mapping strategy and employ a normal distribution as the reference distribution; a trainable non-linear upsampling function and amortized sampling are employed to further improve learning efficiency. Furthermore, by learning the hidden space, the whole model can be learned in a symmetrical manner.

Extensive experiments are conducted on several large-scale datasets for image generation tasks, demonstrating quantitatively and qualitatively the generation improvements of the proposed models.
2 Related Work
Generative EBMs, originating from statistical physics, explicitly define the probability distributions of the signal; in that context, they are ordinarily called Gibbs distributions [15]. With the extensive development of CNNs, which have proven to be powerful discriminators, the generative perspective on this model has recently been explored by an increasing number of studies. [16] is the first study to introduce a generative gradient for pre-training a discriminative ConvNet via a non-parametric importance sampling scheme, and [17] proposes learning FRAME using pre-trained filters of a modern CNN. [10] further studies the theory of GCN in depth and shows that the model has a unique internal auto-encoding representation in which the filters in the bottom-up operation take on a new role as the basis functions in the top-down representation, consistent with the FRAME model. The FRAME model and its variants described above, such as GCN and multi-scale GCN, share the same MLE-based learning mechanism that follows an analysis-by-synthesis approach: they first generate synthesized samples from the current model using Langevin dynamics (LD) and subsequently learn the parameters based on the distance between the observed and synthesized samples.
A preliminary version of our work was published in [18], which also explains FRAME from the particle evolution perspective. In this paper, however, we further seamlessly generalize the GCN with the JKO discrete flow and derive our Generalized Generative ConvNet Model (GGCN). GGCN extends its reference distribution to a hidden space to alleviate the dependency on limited observed data, and proposes a trainable non-linear upsampling function together with an amortized sampling method to improve learning efficiency. Briefly, the proposed GGCN improves learning flexibility and model generalization.
Another popular branch of deep generative models is black-box or implicit models that map between a simple low-dimensional prior and complex high-dimensional signals via a top-down deep neural network, such as GANs, VAEs, flow-based models and their variants. These models have been remarkably successful in generating realistic images. GANs and VAEs are trained together with an assisting model: VAEs optimize the evidence lower bound, which causes them to generate images that are fairly blurry in practice, while training GANs entails instability due to the difficulty of finding the exact Nash equilibrium [19], the lack of latent-space encoders, and the difficulty of assessing overfitting or memorization [20]. On the other hand, flow-based models either cannot be trained efficiently by maximum likelihood [21] or have poor performance [7, 22].
We provide guidelines in Fig.2 to briefly illustrate the relations in the learning framework among these generative models. Unlike the majority of implicit generative models that use an auxiliary network to guide the training of the generator, energy-based generative models maintain a single model that simultaneously serves as a discriminator and a generator. The latter models generate samples directly from the input data rather than from the latent space, which to a certain extent ensures that the model can be efficiently trained and can produce stable synthesized results with a relatively low structural complexity. Nonetheless, such a model can also be used as an auxiliary model and combined with an implicit model to facilitate training [23, 24, 14]. Technically, the learning mechanism of our new symmetrically learned GCN is similar to cooperative learning [23], since all parameter estimation steps share one goal: to make the model distribution approximate the data distribution. However, considering that it is derived by generalizing a single GCN, we term this new model a generalized GCN model (GGCN). GGCN also implements a generator coexisting with the GCN, but unlike a GAN, in which the discriminator confronts the generator, the generator introduced here serves to generalize the base GCN in GGCN.
3 Revisiting GCN Model from Particle Evolution Perspective
Considering the dynamic learning of the conventional GCN, we regard all the samples as highly disordered dynamic particles in the data space, which thus carry physical energy. Real data samples are located where the energy is zero; namely, they can be regarded as particles in a stable state. The learning procedure can be equivalently compared to a process in which randomly initialized particles with high energy are gradually led to low-energy states. The force that leads the particles can be modified step by step by estimating the current states of the particles. Taking the time dimension into consideration, one can easily imagine that the dynamic particles at discrete time points drift as flows or streams. With this thought, we attempt to reformulate GCN from this particle evolution perspective.
3.1 Generative ConvNet Model Theory
Based on an energy function, GCN is defined as the exponential tilt of a Gaussian reference distribution, which is a reformulation of the MRF [17] and can be written as the following Gibbs distribution:
(1) 
where is a nonlinear activation function for an input , is the th element of dimensional model parameter , is the parameters of filters and is the bias; denotes a filtered image or a feature map; , which denotes the Gaussian white noise model with mean and variance ; is the potential function.
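As a heavily simplified, illustrative sketch of the exponential-tilt form in Eq. 1 (not the paper's implementation), the unnormalized log-density can be evaluated as a ReLU potential over filter responses plus the Gaussian reference log-density. The flat `filters` matrix and `bias` vector below are hypothetical stand-ins for the learned convolutional filter bank:

```python
import numpy as np

def log_unnormalized_density(x, filters, bias, sigma=1.0):
    """Unnormalized log-density of a GCN-style exponential tilt:
    ReLU potential over filter responses plus the log of a Gaussian
    white-noise reference. `filters` (K, D) and `bias` (K,) are
    illustrative stand-ins for convolutional filters and biases."""
    responses = filters @ x + bias                  # filter responses
    potential = np.maximum(responses, 0.0).sum()    # ReLU potential
    log_q = -np.dot(x, x) / (2.0 * sigma ** 2)      # Gaussian reference (up to a constant)
    return potential + log_q

# Toy evaluation with two axis-aligned "filters".
filters = np.array([[1.0, 0.0], [0.0, 1.0]])
bias = np.zeros(2)
x = np.array([1.0, -2.0])
score = log_unnormalized_density(x, filters, bias)  # 1.0 + (-2.5) = -1.5
```

The partition function is intentionally omitted: as in the model itself, only unnormalized scores are needed for MCMC sampling.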
The learning process of a GCN model (similar to that of MRF models) follows iterative MLE parameter estimation, which first updates the model parameters by trying to increase the log-likelihood and subsequently samples from the current distribution using parallel MCMC chains. Assuming that is the parameter of the sampled probability distribution at time , the sampling process, according to [25], does not necessarily converge at each . We here provide one persistent sampler that converges globally to reduce the number of calculations, through which we can obtain a numerical solution of the :
(2) 
where in and are the feature responses over the real-world data distribution and the model distribution , respectively. According to [17], the above is estimated with LD, which injects noise into the parameter updates. A GCN can be considered a FRAME with trainable deep filters; namely, the parameter is estimated by maximizing the likelihood function :
(3) 
where denotes the estimated data distribution at each sampling step.
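The LD sampler invoked above can be sketched as follows. This is a minimal toy illustration (not the paper's implementation), assuming the standard Langevin form x ← x − (δ²/2)∇E(x) + δε with a hypothetical step size δ, and a quadratic energy standing in for the ConvNet energy so that the stationary distribution is known:

```python
import numpy as np

def langevin_step(x, grad_energy, delta, rng):
    """One Langevin dynamics update: drift down the energy gradient
    plus Gaussian diffusion noise of matching scale."""
    noise = rng.standard_normal(x.shape)
    return x - 0.5 * delta ** 2 * grad_energy(x) + delta * noise

# Toy quadratic energy E(x) = ||x||^2 / 2, whose stationary distribution
# under Langevin dynamics is the standard Gaussian.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000) * 5.0       # far-from-equilibrium initialization
for _ in range(2000):
    x = langevin_step(x, lambda v: v, delta=0.1, rng=rng)
```

After enough steps the particle population forgets its initialization and settles near the Gibbs distribution of the energy, which is exactly the role LD plays inside the MLE loop.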
Inherently, in energy-based models, this learning and sampling process characterizes an evolution in a state space associated with learning gradients. Inspired by the gradient flow derived from energy models [26], it can essentially be further regarded as an evolution of well-established stochastic particle model equations, where could be considered Brownian particles. Such gradient flows have proven useful in solving, theoretically and numerically, diffusion equations that model porous media or crowd evolution [27]. For GCN, the learning gradient at each step approximately follows a discrete gradient flow, named a discrete flow for short, which allows us to attach a precise interpretation to the conventional energy-based model, moving from traditional information and statistical theory to the particle perspective.
3.2 Revisiting GCN from a Particle Evolution Perspective
From a particle evolution perspective, we establish a connection between GCN and a dynamic physics process and provide a generalized formulation of GCN as a discrete flow with a certain metric measure. Specifically, GCN follows a discrete flow with a KL divergence measure, as its objective is minimizing the KL divergence between and ; its entire optimization process can be regarded as a particle evolution.
Herein we reformulate GCN in the context of a KL discrete flow; namely, the dynamic sampling and updating equations of GCN are derived in this section based solely on discrete flow theory. The discrete flow is related to discrete probability distributions (evolution discretized in time) of finite-dimensional problems. More precisely, it represents a system of independent Brownian particles , whose positions, driven by a Wiener process (or Brownian motion, a stochastic process of moving particles), satisfy the following stochastic differential equation (SDE):
(4) 
where is the drift term (the index of is left out for simplicity), represents the diffusion term, denotes the Wiener process, and the subscript denotes values at time . An empirical measure of those particles is proven to approximate Eq. 1 via an implicit descent step . is the so-called KL discrete flow at time , which consists of the KL divergence and the energy function ,
(5) 
The connotation of a KL discrete flow describing a system of Brownian particles with probability distributions inspires a brand-new perspective on interpreting a GCN model in the discrete state. If we regard the observed signals, with a generating function that has the Markov property, as Brownian particles, we can prove Theorem 1, which implies that the LD in Eq.2 can be deduced from a KL discrete flow for learning GCN. This theorem can be proved, both sufficiently and necessarily, through Lemma 1. The detailed proofs of Theorem 1 and Lemma 1 are provided in Appendix A.
Lemma 1.
For i.i.d. particles with the common generating function that has the Markov property, the empirical measure satisfies the large deviation principle with a rate functional in the form of .
Theorem 1.
Given a base measure and a clique potential , the density of GCN in Eq. 1 can be obtained sufficiently and necessarily by solving the following constrained optimization problem:
(6) 
Let be the Lagrange multiplier integrated in [11] and ensure that ; then, the optimization objective 6 can be reformulated as follows (in distinction from Eq. 5, represents the LD):
(7) 
Since , the SDE iteration of in Eq. 4 can be expressed in the Langevin form as follows:
(8) 
where is the Gaussian term. By Lemma 1, if we fix , the sampling scheme in Eq. 8 approaches the KL discrete flow , and the flow will fluctuate if varies. Parameter is updated by calculating , which implies that can dynamically transform the transition map (if we consider all the learned distributions transformed versions of the reference distribution ) into the desired form. The sampling process of a GCN model can be summarized as follows:
(9) 
where is the derivative of the initial Gaussian noise . Examining the objective function in Eq. 7 shows that an alternating mechanism is involved in updating and .
Obviously, the derived Eq. 9 for updating is exactly the same as that of MLE in GCN; thus, optimizing to minimize the KL discrete flow from this particle perspective is, as Theorem 1 proves, equivalent to the traditional information-theoretic point of view, i.e., seeking the minimum amount of transformation of the reference measure (or ). However, such a minimization of the KL divergence might cause the energy to collapse unpredictably to ; namely, the learned model may reduce to producing the initial noise instead of the desired minimum modification. Hence, if , the learned model will tend to degenerate, the images synthesized from a GCN trained with the KL divergence will collapse immediately, and the image quality may barely recover. We identify this as the KL-vanishing issue.
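The KL-vanishing effect can be illustrated numerically with one-dimensional Gaussians, for which the KL divergence is available in closed form. This toy is our own illustration, not part of the model: as the model collapses onto the reference (μ → 0), both the KL objective and its gradient vanish, so the updates stall there even if the data distribution lies elsewhere:

```python
import numpy as np

def kl_gauss(mu, sigma=1.0):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) )."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))

def kl_grad_mu(mu):
    """d/d(mu) of the KL above: the learning signal w.r.t. the mean."""
    return mu

# Both the divergence and its gradient shrink toward zero as mu -> 0:
mus = [1.0, 0.1, 0.01]
kls = [kl_gauss(m) for m in mus]        # 0.5, 0.005, 0.00005
grads = [kl_grad_mu(m) for m in mus]    # 1.0, 0.1, 0.01
```

In the GCN setting the same mechanism means the learning force toward the data can die out once the model sits near the reference measure, which is what the Wasserstein reformulation in the next section is designed to prevent.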
4 Generalizing Energybased Generative ConvNet Model
From the particle evolution perspective discussed above, a GCN learning process can be interpreted as a KL discrete flow; in turn, the KL-vanishing problem becomes manifest and likely causes severe model collapse. To address this problem, our reformulation of GCN in Sec. 3 provides a principled manner to extend GCN with the discrete flow; namely, the KL divergence term in GCN can be substituted with an appropriate metric measure under the SDE. These results still hold for GCN or MGCN, which can be viewed as FRAME with trainable deep filters. Accordingly, we generalize the KL discrete flow with a KL divergence measure into a JKO discrete flow with a Wasserstein distance measure in the prototype energy-based GCN model, and derive the Wasserstein GCN in Sec. 4.1. This JKO discrete flow is connected with the theory of the Fokker-Planck equation and the Wasserstein distance [13].
To further improve model generalization and learning capability, we employ a normal distribution as the hidden space rather than the original downscaled ground-truth reference distribution in multi-scale GCN; considering that the grid-by-grid upscaling functions cause the bias to accumulate, a trainable non-linear upsampling strategy is proposed. Together with all the above, a Generalized Generative ConvNet model (GGCN) is derived in Sec. 4.2.
4.1 Wasserstein GCN Model
Although explaining GCN via a KL approach is rational from the perspective of information-theoretic methodology, there is a risk of the KL-vanishing problem as discussed above. The main reason is the non-convergence of the MLE parameter updating mechanism. In this section, we solve this issue by generalizing the KL discrete flow with a KL divergence measure into a JKO discrete flow with a Wasserstein distance measure, which can be theoretically proved from the particle perspective, and incorporate the Wasserstein JKO discrete flow into GCN to derive a Wasserstein GCN model (wGCN).
JKO Discrete Flow with Wasserstein Metric
To avoid the KL-vanishing problem, we introduce the Wasserstein-metric-driven JKO discrete flow, annotated as . From the particle perspective, we explain this substitution according to [28]: under a KL method, the learned measure can lie closer to a given empirical measure than the same measure would under the Wasserstein distance, and this closeness makes the learning transition more inclined toward wrong or collapsed locations in the energy manifold. Additionally, [29] claims that better convergence and approximation results can be obtained since the Wasserstein metric defines a weaker topology. This conclusion holds under a suitable condition on the time step (meaning a relatively long sampling process), which rationalizes the proposed method. The proof of this conclusion in the one-dimensional case is presented in [30, 31, 32].
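A quick numerical illustration of why the Wasserstein metric remains informative where a KL-based measure degenerates: in one dimension, the Wasserstein-1 distance between equal-size empirical samples reduces to matching order statistics, and it reports how far mass must move even when the two supports are disjoint (where a naive KL estimate blows up). The snippet below is an illustrative sketch, not the paper's estimator:

```python
import numpy as np

def w1_empirical(a, b):
    """1-D Wasserstein-1 distance between equal-size empirical samples:
    the mean absolute difference of the sorted values (in 1-D the optimal
    transport plan matches order statistics)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# Two point clouds offset by a constant shift c have W1 exactly c,
# even though their densities barely overlap.
rng = np.random.default_rng(0)
a = rng.standard_normal(500)
shift = 3.0
b = a + shift
```

This finiteness under disjoint supports is one concrete face of the "weaker topology" argument: the Wasserstein gradient signal does not saturate or explode where KL does.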
This JKO discrete flow connects the energy model with the theory of the Fokker-Planck equation, which describes the evolution of the probability density function of particle velocity; the required metric in the case of the Fokker-Planck equation is the Wasserstein metric on probability densities [13]. The solution of the Fokker-Planck equation follows, at each instant in time, the direction of steepest descent of an associated free-energy functional, which is usually regarded as a discrete gradient flow. Thus, we introduce the Fokker-Planck equation here to build the connection. This formulation reveals an appealing and previously unexplored relationship between the Fokker-Planck equation and the associated free-energy functional.
We begin with a redefinition of notations. Let denote the space of Borel probability measures on any given subset of space , where , . Given some sufficient statistics , scalar and base measure , the space of distributions satisfying a linear constraint is defined as . The Wasserstein space of order is defined as , where denotes the norm on . is the number of elements in domain . denotes the gradient, and denotes the divergence operator.
Fokker-Planck Equation. Let be an integral function and denote its Euler-Lagrange first variation; then, the Fokker-Planck equations are as follows:
(10) 
Compared with Eq. 5, this function models the probability density of particles in a continuous state [13].
Wasserstein Metric. The steepest-descent solution, or discrete gradient flow, makes sense only in the context of an appropriate metric. The required metric for the Fokker-Planck equation is the Wasserstein metric on probability densities [13]. The Benamou-Brenier form of the Wasserstein metric [33] of order involves solving the following smooth OT problem over any probabilities and in . The solution can be derived by using the continuity equation shown in Eq. 10,
(11) 
where belongs to the tangent space of the manifold (here we suppose particles are moving through a manifold) governed by some potential and associated with curve .
JKO Discrete Flow. Following the initial study [13], which shows how Fokker-Planck diffusions of distributions in Eq. 10 are recovered when minimizing entropy functionals according to the Wasserstein metric , our method uses the JKO discrete flow to replace the initial KL divergence with the entropic Wasserstein distance . The function of the flow is as follows:
(12) 
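At the particle level, one JKO step of this flow can be read as a proximal (implicit-Euler) update that trades off the energy against the squared transport cost back to the previous iterate. The sketch below solves this inner minimization by plain gradient descent on a toy quadratic energy; the step size h and inner-loop settings are hypothetical, and the density-level minimization of Eq. 12 is only mirrored here at the level of individual particles:

```python
import numpy as np

def jko_particle_step(x_prev, grad_energy, h, inner_steps=200, lr=0.01):
    """Proximal (implicit-Euler) step: minimize E(x) + |x - x_prev|^2 / (2h)
    by inner gradient descent -- the particle-level analogue of one JKO
    minimization over densities."""
    x = x_prev.copy()
    for _ in range(inner_steps):
        grad = grad_energy(x) + (x - x_prev) / h   # energy pull + transport cost
        x = x - lr * grad
    return x

# For the quadratic energy E(x) = x^2 / 2 the proximal step has the
# closed form x_prev / (1 + h), which the inner solver should recover.
x0 = np.array([2.0])
h = 0.5
x1 = jko_particle_step(x0, lambda v: v, h)
```

The transport term (x − x_prev)²/(2h) is what prevents the degenerate jump to the reference measure that drives KL-vanishing: each step may move mass only a bounded Wasserstein distance.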
Wasserstein GCN
Wasserstein GCN (wGCN) is associated with a JKO discrete flow instead of the original KL discrete flow of GCN. In the remainder of this section, we further discuss this replacement and provide a new JKO discrete flow LD optimization objective.
Remark 1.
The initial Gaussian term is left out for convenience to facilitate the derivation; otherwise, entropy in Eq. 12 would be written as relative entropy .
By Theorem 1, instead of can be calculated approximately, and a steady state will approach Eq. 1. Using as a dissipation mechanism instead of allows us to regard the diffusion in Eq. 4 as the steepest descent of the clique energy and entropy w.r.t. the Wasserstein metric. Solving such an optimization problem using is identical to solving the Monge-Kantorovich mass transference problem [33].
Using the second mean value theorem for definite integrals, we can approximate the integral in (where ) by two randomly interpolated rectangles:
(13) 
where parameterizes the time piece, and represents a random interpolated parameter since is a random (). Eq. 13 means that the functional derivative of w.r.t. is proportional to the following:
(14) 
which is exactly the result of Proposition 8.5.6 in [34]. Assume that is at least twice differentiable and treat Eq. 14 as the variational condition in Eq. 10; then, plugging Eq. 14 into the continuity equation of Eq. 10 turns the latter into a modified JKO discrete flow in the Fokker-Planck form as follows:
(15) 
Then, the corresponding SDE can be written in the Euler-Maruyama form as follows:
(16) 
By Remark 1, if we reconsider the initial Gaussian term , the discrete flow of in Eq. 16 should be incremented by .
Remark 2.
If is the energy function defined in Eq. 1, then .
The above is a direct result since defined in GCN involves only inner products, the piecewise-linear ReLU and other linear operations, so the second derivative is obviously . Since , the time evolution of the density in Eq. 15 and of the sample in Eq. 16 will reduce to Eq. 10 and Eq. 8, respectively. Thus, the SDE of holds by default, i.e., it retains the Langevin form while the gradient of the model parameter does not vanish.
Similar to the parameterized KL flow defined in Eq. 7, we propose a similar form for the JKO case. Using Eq. 13 and Eq. 14, the final optimization objective function can be formulated as follows:
(17) 
Given the entire discussion above, the learning progress of wGCN can be constructed by gradient ascent of , i.e., . The calculation steps of the formulation are summarized in Eq. 18:
(18) 
Equation 18 indicates that the gradient of determined in the Wasserstein manner is augmented with a soft gradient-norm constraint between the last two iterations. Such a gradient norm has the following advantages over the original iterative process (Eq. 9). First, the norm serves as the constant-speed geodesic connecting with in the manifold spanned by and , which may speed up convergence. Second, it can be interpreted as a soft force counteracting the original gradient and preventing the entire learning process from stopping. Finally, in experiments we observe that it can preserve the internal structural information of the data. The details of the learning-by-synthesizing algorithm are provided in Algorithm 1.
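One possible reading of such an update is a gradient-ascent step augmented with a penalty on the displacement between the last two parameter iterates. Since Eq. 18 itself is not reproduced here, the exact form of the soft constraint below is an assumption, and `eta` (step size) and `gamma` (constraint weight) are hypothetical:

```python
import numpy as np

def soft_constrained_update(theta, theta_prev, grad_ll, eta, gamma):
    """Gradient ascent with a soft penalty on the displacement between the
    last two iterates. The penalty counteracts the raw gradient, acting like
    a constant-speed geodesic constraint on the parameter path. This form
    is an assumption, not the exact Eq. 18."""
    return theta + eta * (grad_ll - gamma * (theta - theta_prev))

# Toy step: the penalty reduces the effective gradient by the last displacement.
theta_new = soft_constrained_update(
    np.array([1.0]), np.array([0.5]), np.array([2.0]), eta=0.1, gamma=1.0)
# effective gradient = 2.0 - 1.0 * (1.0 - 0.5) = 1.5, so theta_new = 1.15
```

The design intent matches the prose above: the displacement term slows abrupt parameter jumps (stability) without ever zeroing the update (no stalling).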
4.2 Generalized Generative ConvNet Model
We have reformulated the conventional energy-based GCN model from a particle evolution perspective, theoretically explained GCN as a KL discrete flow in Sec. 3, and generalized the formulation of GCN from the KL discrete flow to the JKO discrete flow driven by the Wasserstein metric in Sec. 4.1. In this section, we further generalize the GCN/wGCN with the JKO discrete flow and derive our Generalized Generative ConvNet Model (GGCN). This generalization strategy is non-trivial, going beyond simply extending GCN/wGCN with the JKO discrete flow.
GCN can be regarded as a one-scale model, and its multi-scale version MGCN [12] was proposed to speed up model learning. However, MGCN is especially inefficient in training, since it stacks several GCNs on different grids, all of which have to be estimated through time-consuming MCMC sampling. Besides, even after applying the JKO discrete flow to each grid of GCN, the generative capability and generalization property are still greatly limited by the linear upscaling functions and the downscaled reference distribution. This is the main reason that GCNs are rarely trained on large-scale datasets. Thus, to improve learning efficiency, we use a non-linear upsampling function to mimic the updating and sampling process of the GCNs in MGCN, which to some extent improves the model's generalization property. Next, we expand the initial reference distribution in the first grid of MGCN to a random Gaussian reference distribution (which is invalid without non-linear trainable upsampling); finally, we introduce a hidden-space learning mechanism, which further improves learning stability. The whole GGCN can be learned in a symmetrical parameter estimation cycle.
Generalized Generative ConvNet Model
MGCN is a stack of several GCNs at different scales, as shown in subfigure (a) of Fig.3. Its learning and synthesis process proceeds by learning each GCN grid by grid with LD sampling. Therefore, in high-dimensional data synthesis, the learning inefficiency grows significantly, which also results in low synthesis quality. We thus consider an equivalent learnable upscaling structure to mimic the learning process of the middle grids of GCNs in MGCN. To further simplify the model and its learning procedure, we then reserve only the last grid of GCN in MGCN and extend the model to learn the distribution from a hidden space.
Non-linear Learnable Upsampling. Although the multi-grid mechanism indeed improves synthesis quality slightly [12], it mostly depends on hundreds of MCMC sampling steps, which brings many inefficiencies, especially when synthesizing high-dimensional data. Noting that upscaling in MGCN is simply a linear operation, we presume that substituting the several grids of GCN with layers of learnable non-linear upsampling functions will speed up the learning procedure, since it mimics the sampling process by simply minimizing the error between the sampled synthesized data and the real-world data. A hidden-space learning mechanism is further proposed (the same process as learning in GCN), which allows the whole model to be learned symmetrically.
Hidden Space Mapping. The two one-scale GCNs at each end, as shown in Fig. 3, are not both necessary. Since learning a GCN amounts to modeling the input data distribution, if we replace the input with a random sample from a Gaussian distribution of the same dimension as the downscaled real data, then the function of the first GCN becomes modeling a given random distribution. Thus this first GCN can be omitted, and the whole pipeline turns into the structure shown in subfigure (b) of Fig. 3. If we regard the original sample synthesis as a generating process, then a mapping from a projection space of the real data distribution to the real distribution is learned through MCMC, and the latter process can be considered a generator-learning process. This process resembles learning the generator in a GAN; thus we naturally expand the input space into a hidden space.
Given a probability distribution with density function , we assume to be the nonlinear upsampling network, where represents a sampled initial Gaussian noise and denotes the parameters. The given information only includes random noise (from an unknown real-world distribution ), the trainable function and its derivative . Our learning objective is to find optimal parameters such that the output matches the objective probability density function . The parameters are learned as follows:
(19) 
where , is the synthesized sample, is the batch size and denotes the gradient of .
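The objective in Eq. 19 is, in essence, a least-squares fit of the upsampling network to the MCMC-synthesized samples. A minimal sketch with a linear map standing in for the nonlinear network; all shapes and rates are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x, batch = 16, 64, 100
z = rng.standard_normal((batch, d_z))          # hidden-space inputs
x_syn = rng.standard_normal((batch, d_x))      # stand-in for sampler output

W = np.zeros((d_z, d_x))                       # linear "generator" weights
lr = 0.05
for _ in range(500):
    err = z @ W - x_syn                        # g(z) - x_syn
    W -= lr * z.T @ err / batch                # gradient of 0.5 * mean||err||^2

loss = 0.5 * np.mean(np.sum((z @ W - x_syn) ** 2, axis=1))
```

With an actual deep network, the same objective is minimized by backpropagation instead of this closed-form gradient.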
Symmetrical Learning Strategy
As shown in subfigure (a) of Fig. 3, we simplify the learning procedure to a trade-off between learning a GCN and a learning-hidden-space strategy with nonlinear upsampling functions. The training strategy is illustrated in subfigure (b) of Fig. 3; it is a symmetrical structure learned with the same iterative method. To be more specific, we define the hidden variable with the nonlinear upsampling function as a generator and the GCN in GGCN as a synthesizor. As a matter of fact, the whole model is still a GCN with a parameterized mapping that synthesizes by updating the hidden space; therefore the synthesis is an intermediate result of the whole model's learning and sampling. Note that the GCN part in GGCN is modified to wGCN in the final model for implementation. The detailed algorithm can be found in Algorithm 2. As shown in line 4 of Algorithm 2, an LD step is still necessary, which may result in a relatively long learning process; we thus introduce amortized sampling to GGCN, denoted as GGCN-AMS.
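The alternating loop can be sketched as follows. The toy score f(x; w) = ⟨w, x⟩ with a Gaussian reference (so the energy is U(x) = ||x||²/2 − ⟨w, x⟩ and the Langevin drift is w − x), the linear generator, and all step sizes are our stand-ins for the actual networks in Algorithm 2:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x_real = rng.standard_normal((100, d)) + 2.0   # observed data, mean 2 per dim
w = np.zeros(d)                                # synthesizor (GCN) parameter
z = rng.standard_normal((100, d))              # hidden variables
W_gen = np.zeros((d, d))                       # linear generator stand-in

def langevin(x, w, steps=20, delta=0.1):
    # toy energy U(x) = ||x||^2/2 - <w, x>, so the drift is (w - x)
    for _ in range(steps):
        x = x + 0.5 * delta ** 2 * (w - x) + delta * rng.standard_normal(x.shape)
    return x

for it in range(40):
    x_syn = langevin(z @ W_gen, w)             # generator output seeds the chain
    # synthesizor update: raise the score on real data, lower it on synthesis
    w += 0.05 * (x_real.mean(axis=0) - x_syn.mean(axis=0))
    if it % 4 == 3:                            # generator updated every 4 iters
        err = z @ W_gen - x_syn
        W_gen -= 0.05 * z.T @ err / len(z)
```

The moment-matching update on w is the maximum-likelihood gradient for this exponential-family toy; the generator step is the least-squares fit from Eq. 19.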
Learning the Hidden Space. Even if the nonlinear upsampling function is introduced and an acceleration method is applied to GCN, the whole framework still has a disadvantage of high sensitivity to random initialization, network structure, and hyperparameter selection.
We consider this to be mainly caused by the possibly incorrect hidden space initialization , i.e., the downscaled real data for the first grid of MGCN. Thus, we apply a hidden space optimization strategy proposed by [35, 36] that is capable of capturing the connection between the noise vector and the signal and preventing the learning procedure from going astray. Hidden variable is updated by minimizing the reconstruction loss in nonlinear upsampling function as follows:
(20) 
where denotes the final reconstruction-approximation loss. Note that optimizing either or the network parameters will not influence the Wasserstein process of updating ; see Algorithm 2.
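Holding the generator fixed, updating the hidden variable by gradient descent on the reconstruction loss can be sketched as follows; the linear generator and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 16, 64
W = rng.standard_normal((d_z, d_x)) / np.sqrt(d_z)   # frozen generator weights
x = rng.standard_normal(d_x)                          # a training example

z = np.zeros(d_z)                                     # hidden variable to learn
for _ in range(300):
    grad = (z @ W - x) @ W.T                          # d/dz of 0.5 * ||zW - x||^2
    z -= 0.1 * grad

recon = 0.5 * np.sum((z @ W - x) ** 2)
```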
The GGCN model has the following advantages: (1) the generator mimics the sampling process by simply minimizing the error between the sampled synthesized data and the real-world data; (2) the low-dimensional hidden space can be mapped into any high-dimensional training data by this function, which also makes the model friendly to high-dimensional data synthesis; (3) higher learning efficiency is gained compared with the previous MGCN; (4) the divergence that accumulates scale by scale in MGCN is reduced in GGCN, which improves synthesis quality.
Amortized Langevin Dynamic Sampling. The traditional MCMC algorithm is an inefficient inference technique for approximating energy-based probabilistic models. A GGCN still needs a relatively long time to perform the iterations that yield an approximation, since the samples in Eq. 19 are synthesized by several rounds of sampling in the synthesizor (a GCN). To further accelerate our model, and inspired by the amortized SVGD method proposed by [14], which is claimed to maximally reduce the KL divergence, we use an amortized dynamic sampling method to speed up the sampling process.
If the scalar in Eq. 19 is sufficiently small, the optimization of can be rewritten in least-squares form. Using the Taylor expansion, we obtain . Since , we obtain the following:
(21) 
After being rewritten in the leastsquares form, the optimization problem becomes as follows:
(22)  
Due to the matrix inverse operation involved, this parameter estimation routine is inevitably computationally intensive; hence, one step of gradient descent provides a sufficient approximation. The final update equation is as follows:
(23) 
This update equation defines a reconstruction-approximation loss , so we can apply this loss to Eq. 20. Although this equation is a rough approximation [37], it is both efficient and effective in the implementation phase, because if is small, one step of gradient descent gets relatively close to the optimal solution. We update by following a more explicit form via gradient ascent, since it is equivalent to Eq. 23. In this form, is trained to maximize instead of sampling from . The learning strategy of remains as in Eq. 18, and is learned through Eq. 3.
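A minimal sketch of the amortized idea: rather than running Langevin sampling, the generator parameters take one gradient step per iteration so that the output directly climbs f. Here f is a toy concave function peaking at mu, and the linear generator with bias, shapes, and rates are all our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x, batch = 16, 32, 64
mu = np.full(d_x, 3.0)                    # f(x) = -||x - mu||^2 / 2 peaks at mu
z = rng.standard_normal((batch, d_z))
W = np.zeros((d_z, d_x))                  # linear generator with bias
b = np.zeros(d_x)

for _ in range(400):
    x = z @ W + b                         # generated batch g(z)
    grad_f = mu - x                       # df/dx at the generated samples
    W += 0.05 * z.T @ grad_f / batch      # chain rule through g, ascent on f
    b += 0.05 * grad_f.mean(axis=0)

mean_out = (z @ W + b).mean(axis=0)
```

The generated batch converges toward the mode of f without any inner MCMC loop, which is the speed-up the amortized scheme trades approximation quality for.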
5 Experiments
In this section, we conduct extensive experimental evaluations of our proposed models on four benchmarks: CelebA [38], LSUN-Bedroom [40], CIFAR-10 [41], and Tiny ImageNet [42]. We first elaborate the experimental settings in Section 5.1 and provide detailed quantitative comparison results of our proposed models in Section 5.2 and Section 5.3. Then, empirical model analysis and assessment are discussed in Section 5.4, and the advantage of our proposed models in few-sample training is demonstrated in Section 5.5. Finally, we extend our models to other tasks, i.e., image inpainting and interpolation, in Section 5.6.
5.1 Experimental Settings
Datasets. Four benchmark datasets are selected for our experimental evaluation: one face dataset (i.e., CelebA [38]) and three natural image datasets (i.e., LSUNBedroom [40], CIFAR10 [41], Tiny ImageNet [42]).

CelebA [38] is a widely used large-scale face dataset of human celebrities, which contains over 200,000 images of more than 10,000 celebrity identities. We only use its face-center-cropped version.

CIFAR-10 [41] consists of 60,000 color images in 10 categories, with 6,000 images per category.

Tiny ImageNet [42] is a lightweight version of the ImageNet dataset. It covers 200 of the 1,000 ImageNet classes, with 500 images in each category.
Considering that some large-scale datasets have no category labels (such as CelebA and LSUN-Bedroom), we only perform unconditional training on these datasets, while datasets with label information can also be used for conditional learning (the input vector is the concatenation of random Gaussian noise and an embedded label vector).
Evaluation metrics. Two metrics are chosen to evaluate our proposed models, i.e., the Inception Score (IS) [43] and the response distance.

Response distance [10] is defined as follows:
where denotes the th filter. The smaller is, the better the generated results, since , which implies that provides an approximation of the divergence between the target data distribution and the generated data distribution. Furthermore, with Eq. 2, the faster declines, the faster the model converges. This metric reflects energy from the particle perspective: the more energy gained, the larger the response distance.
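Our reading of the response-distance metric can be sketched as follows; the toy filter bank here is a stand-in for the ConvNet filters in [10]:

```python
import numpy as np

def response_distance(filters, x_obs, x_syn):
    """Average absolute gap between mean filter responses on the two batches."""
    gaps = [abs(f(x_obs).mean() - f(x_syn).mean()) for f in filters]
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
x_obs = rng.standard_normal((100, 32))
x_same = rng.standard_normal((100, 32))        # drawn from the same law
x_far = rng.standard_normal((100, 32)) + 5.0   # shifted "bad" synthesis

# two toy filters: per-sample mean response and mean ReLU response
filters = [lambda x: x.mean(axis=1), lambda x: np.maximum(x, 0).mean(axis=1)]
d_near = response_distance(filters, x_obs, x_same)
d_far = response_distance(filters, x_obs, x_far)
```

A batch from the target law scores near zero, while a shifted batch scores large, matching the metric's intended behavior.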
Learning settings. For GCN, we follow the default FRAME settings in [17] and update the CNN parameters with learning rate . For wGCN, is set to ; is 0.2 for CelebA and 0.112 for the CIFAR-10 dataset. The default setting of is 0.01, is 20, the number of learning iterations is 100, the number of LD sampling steps within each learning iteration is 50, and the batch size and are 100. The implementation of in wGCN uses the first four convolutional layers of a pretrained VGG16 [46], which are fixed during training.
For the proposed variants of our models (i.e., GGCN, GGCN-LHS, GGCN-LHS-AMS and GGCN-AMS), we set to 1.0, the learning rate of the generators and synthesizors to , and the batch size to 100; the number of Langevin learning iterations is among . The corresponding generator and hidden variables are updated after every 4 iterations of updating the synthesizor and generating images. The generator is a ResNet-50 [47] based autoencoder, and the synthesizor is simply a ResNet-50. There is also a shallower network version for GGCNs; detailed structural information is provided in the Appendix.
The dimension of the hidden variable in GGCNs is for ImageNet and CIFAR-10 and for CelebA, and its learning rate is on ImageNet, on CelebA, and on CIFAR-10. All the parameters are learned by the Adam optimizer, except that is learned by SGD. There are usually 500 or 1,000 epochs of training in total, and a memory replay buffer similar to that in [48] is used. This image buffer serves as a one-step-advance initialization of the next sampling step.
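The replay buffer can be sketched as follows; the interface and pre-filling with noise are our assumptions in the spirit of [48]:

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size store of past synthesized samples, pre-filled with noise."""

    def __init__(self, capacity, shape, rng):
        self.rng = rng
        self.capacity = capacity
        self.data = rng.standard_normal((capacity, *shape))

    def sample(self, n):
        """Draw n stored samples to initialize the next Langevin chains."""
        idx = self.rng.integers(0, self.capacity, size=n)
        return self.data[idx].copy()

    def push(self, batch):
        """Overwrite random slots with freshly updated chain states."""
        idx = self.rng.integers(0, self.capacity, size=len(batch))
        self.data[idx] = batch

rng = np.random.default_rng(0)
buf = ReplayBuffer(capacity=1000, shape=(8, 8), rng=rng)
init = buf.sample(16)          # chain initialization for the sampler
buf.push(init + 0.1)           # store the updated chain states
```

Resuming chains from buffered states gives each new sampling step a warm start, which is the "one-step advance" initialization described above.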
5.2 Evaluation on the Wasserstein metric
To address the model collapse problem, the Wasserstein distance under particle evolution conditions is introduced into conventional GCN models, yielding wGCN. To demonstrate its effectiveness, we mainly compare the proposed wGCN with GCN in this section. Extensive experiments are conducted on the CelebA, LSUN-Bedroom and CIFAR-10 datasets under various experimental settings, involving different learning rates and different input image scales.
In Fig. 4, few-sample synthesis results of GCN and wGCN under various hyperparameter settings are shown. For each image scale (64×64 or 128×128), wGCN always generates better images than GCN. Furthermore, under different hyperparameter settings, wGCN still presents stable learning ability. On the contrary, GCN fails to address the sample artifact problem resulting from model collapse. For example, as shown in the third and fourth rows of Fig. 4, with small changes of the learning rate , the generated image quality of GCN significantly deteriorates with severe artifacts. Obviously, the learning of wGCN is less sensitive to hyperparameter tuning, and its learning stability surpasses that of FRAME.
A similar pattern is also illustrated by the averaged learning curves of GCN and wGCN in Fig. 5. It is observed that the averaged learning curves of wGCN all converge closely to on the three datasets, which implies faster and more stable learning of wGCN compared with GCN, given that the smaller the response distance, the smaller the divergence between the real distribution and the estimated distribution. Thus, if we regard model collapse as an unstable particle energy dissipation phenomenon, the Wasserstein method effectively remedies this shortcoming of GCN, since learning strategies that seriously interfere with GCN synthesis have much slighter effects on wGCN.
As demonstrated in Fig. 4 and Fig. 5, without the Wasserstein method, the synthesized results would certainly collapse after several iterations of GCN learning. Similarly, this model collapse also occurs for other variants of FRAME, e.g., GCN and MGCN. As shown in Fig. 6(a), most synthesized images contain gray spot areas, which are obvious indications of model collapse. Conversely, variants of our proposed wFRAME, e.g., wMGCN and GGCN, greatly improve the image generation quality, as shown in Fig. 6(b)-(d), demonstrating stable learning ability.
5.3 Comparisons of the proposed GGCN
Our wGCN reformulates the conventional energy-based models (i.e., FRAME and GCN) from the particle evolution perspective, further generalizes the conventional GCN approach by introducing the JKO discrete flow with the Wasserstein distance measure, and derives wFRAME variants (i.e., wMGCN, GGCN) to solve more practical problems. Detailed experimental results of these models are elaborated in the following. The correspondence between models and methods is given in Table I.
Model  LCN  JKO  MS  NLU  HS  LHS 

FRAME  
wFRAME  
GCN  
wGCN  
MGCN  
wMGCN  
GGCN 
Experimental Results on CelebA
We provide synthesized results of MGCN, wMGCN, GGCN-LHS+AMS and GGCN-AMS on the CelebA dataset in Fig. 6. This experiment is an ablation study for building GGCN, conducted to locate the effect of the learning mechanisms within GGCN, i.e., HS, NLU and LHS. The model structure gradually becomes more complicated from MGCN to GGCN+AMS, where the suffix "+AMS" means with the amortized sampling method and "-LHS" means without learning the hidden space. We find AMS necessary in face generation, so in this study GGCN is learned with AMS; this phenomenon will be explained later. Even after time-consuming iterations of training, MGCN still suffers from the collapse issue despite careful fine-tuning. As demonstrated in Fig. 6(a), most synthesized images of MGCN contain gray spot areas, which clearly indicate the model collapse issue. In Fig. 6(b), it is observed that our proposed wMGCN alleviates this issue by applying the JKO discrete flow, and the gray spots no longer appear in the synthesized results. However, the synthesized images of wMGCN are still blurry and distorted, with artifacts or ghosting present, since Wasserstein learning alone cannot solve the problem of the divergence between the real data distribution and the estimated data distribution accumulating from scale to scale.
Accordingly, our proposed GGCN-LHS+AMS introduces the JKO discrete flow with the Wasserstein distance measure and further improves the image quality, as shown in Fig. 6(c). Compared with Fig. 6(b), these results have fewer twisted lines, and the faces are more realistic. Considering that the twisting in the synthesis might be caused by the divergence accumulating through the different scales of MGCN, GGCN-LHS+AMS leverages nonlinear upsampling functions (NLU), maps the input to a hidden space (HS), and manages to conquer this problem. Compared with the results of those methods, the images produced by the proposed GGCN-AMS in Fig. 6(d) are the clearest and least distorted, and are comparable with real human face images.
Method  Inception Score

CIFAR-10 without labels
FRAME [18]  4.28±0.05
GCN  4.18±0.13
Improved GAN  4.36±0.05
wGCN (ours)  4.31±0.12
ALI [49]  5.34±0.05
wFRAME [18]  5.52±0.13
DCGAN [37]  6.16±0.07
GGCN-AMS (ours)  6.79±0.08

CIFAR-10 with labels
FRAME [18]  4.95±0.05
GCN  5.50±0.05
WINN-5CNNs [50]  5.58±0.05
wGCN (ours)  5.70±0.06
wFRAME [18]  6.05±0.13
SteinGAN [14]  6.35±0.05
MGCN [12]  6.57±0.05
DCGAN [12]  6.58±0.05
GGCN-AMS (ours)  7.01±0.05
GGCN (ours)  7.10±0.07
Experimental Results on CIFAR-10
We compare our proposed models with state-of-the-art methods in Table II. Considering the available category labels of the CIFAR-10 dataset, inception scores are reported for both conditional and unconditional learning. Compared with the base model GCN, GGCN obtains gains of 2.61 and 1.55 in the unconditional and conditional settings, respectively. In comparison with state-of-the-art GAN-based methods, our proposal still demonstrates a significant improvement in inception score in the unconditional/conditional settings; for example, GGCN gains around 0.6 over DCGAN. Notice that the inception score of wGCN is slightly lower than that of wFRAME; this might be because the filters applied in wFRAME and FRAME come from VGG models well pretrained on ImageNet, while in GCN and wGCN the filters are trained from scratch and may converge to relatively inferior features. Further, GGCN generates higher-quality images with more realistic natural scenes than DCGAN, as demonstrated in Fig. 7(b). In the images generated by GGCN, objects such as cars, horses, and airplanes are easily recognized.
Experimental Results on Tiny ImageNet
Tiny ImageNet synthesis results are also provided for conditional and unconditional generation. Visualizations of both types of generation are shown in Fig. 8. The capacity of ImageNet is much larger than those of the CelebA and CIFAR-10 datasets, and its images contain relatively complicated objects and unclear backgrounds; these features challenge the learning of generative models on ImageNet, so the generations on this dataset are not as satisfying as on the former two large-scale datasets. However, GGCN manages to generate images with clear edges and textures, and some objects and scenes can even be identified, such as the fruits and birds in Fig. 8(a) and (b). Owing to the learning-hidden-space mechanism, the results of these two types of generation are visually similar.
5.4 Model Analysis and Assessment
Improvement of Training Stability
Quantitatively, the curves in Fig. 5 are calculated by averaging across the entire dataset. The wGCN method has a lower cost according to the response distance ; namely, the direct measure of filter-bank responses defined in Section 5.1, computed between synthesized and target samples, is smaller and decreases more steadily. More specifically, our algorithm has largely solved the model collapse problem of GCN, because it not only ensures the closeness of the generated samples to the "ground-truth" samples but also stabilizes the learning phase of the model parameters . The three plots clearly show that the quantitative measures are strongly correlated with the qualitative visualizations of generated samples. In the absence of model collapse, we obtain results that are comparable to or even better than those of GCN.
Since GGCN can be regarded as a GCN with a more complex sampling mechanism, the Wasserstein method applied to GCN is also implemented in GGCN. The results in Fig. 6(a)-(d) show that the collapse issue is completely solved in GCNs with the Wasserstein method, since no gray spots appear in the synthesis results of any proposed model. This implies that the original energy dissipation process is rectified to be more stable. Another stability improvement is manifested in the generalization of GCN: comparing the results of Fig. 6(b) and (c), the context of the latter is more sensibly organized. As discussed in Section 4.2.1, the divergence, or learning error, accumulates as the learning scale increases in MGCN, and the Wasserstein method merely helps with the collapse issue. Thus we owe this improvement to the nonlinear upsampling functions and the hidden-space extension, which largely reduce the accumulated error by combining the estimation and modification of the hidden variable.
Improvement of Image Generation Quality
The quality of image synthesis is also improved according to our performance measured by the response distance . This measurement corresponds to the iterative learning process of both GCN and wGCN. The learning curves presented in Fig. 5 represent observations of synthesis over the entire datasets. The curves also imply that wGCN converges faster than GCN.
From subfigure (b) of Fig. 6, we find that even when the Wasserstein learning method is applied in training, most resulting images of MGCN suffer from unclear outline edges, totally distorted faces, or unknown spots that make the pictures extremely dirty. On the contrary, images produced by GGCN-LHS+AMS (Fig. 6(c)) and by the variant with the learning-hidden-space mechanism (Fig. 6(d)) are reasonably composed, sensibly structured, brightly colored and less distorted. In particular, the synthesized results of GGCN-AMS maintain clearer contexts than those of GGCN-LHS+AMS, which implies that LHS can improve image quality in this respect.
On the other hand, compared with the traditional GCN or MGCN, GGCN-AMS can be trained more efficiently on large datasets such as CIFAR-10 and ImageNet. Since the inception score has been shown to correlate well with human judgments of the realism of images generated on the CIFAR-10 dataset, a higher inception score than those of other energy-based models (e.g., WINN [50]) to some extent implies that the new GGCN produces higher-quality syntheses. This conclusion can also be drawn from Fig. 7, which shows that GGCN visually generates more identifiable objects than DCGAN on CIFAR-10.
Discussion
Note that the amortized sampling method involves a trade-off that may influence model performance. We consider amortized sampling a cost-effective acceleration scheme, since performance declines in some situations when it is applied. We suspect this is due to the Taylor-expansion approximation in Eq. 21. We therefore conduct a comparison experiment between GGCN and GGCN-AMS on CIFAR-10 with conditional learning. The visual results are presented in Fig. 9, where group (a) shows the results of GGCN-AMS and group (b) those of GGCN. Although GGCN-AMS can be trained efficiently, GGCN synthesizes better. Nevertheless, this conclusion does not hold on CelebA, where we found that GGCN-AMS outperforms GGCN, as shown in Fig. 9. We conjecture this is due to the properties of the datasets: emphasizing distinguishing features benefits generation on CIFAR-10 but is to some extent a burden for CelebA. Thus amortized sampling is not recommended for generation on CIFAR-10.
5.5 Few Sample Training
Recent popular generative models such as GANs are generally trained on massive datasets, and energy-based models were initially criticized for their inefficiency in such large-scale training. However, those large-scale models mostly cannot be trained on small training sets. Few-sample training requires quick and accurate learning, which GCN (or another energy-based model such as FRAME) achieves well, while DCGAN, a representative popular generative model, struggles to converge on this task even after many epochs of training.
We conduct few-sample learning with wFRAME and wGCN, compared against DCGAN. The most primitive energy-based models already perform well in this scenario, and our wGCN and wFRAME achieve even better results (no collapse anymore). The performance of DCGAN in modeling only a few images is shown in Fig. 10; for a fair comparison, we duplicate the input images several times to a total of 10,000 to match the training environment of DCGAN. Both compared wGCN and wFRAME models are trained using the Wasserstein method. The DCGAN training procedure stops when it converges, but its results still look much alike. Besides, all the comparison experiments in Section 5.2 are few-sample training results.
5.6 Image Inpainting and Interpolation
To fully validate our proposed models, we further extend them to two more tasks, i.e., image inpainting and interpolation on the CelebA dataset, and they maintain superior performance in both experiments.
In the inpainting experiment on GGCN-AMS, the masked image is regarded as the hidden-space latent variable mapped by the generator to the real-world data space. An upsampling generator that takes a random hidden variable as input can thus be trained to complete the masked images, as in other generative models. GGCN-AMS adopts a ResNet-50 autoencoder as the generator, whose input is an image multiplied by a mask array. We apply three types of masks to evaluate our model: a square box, random spots, and a combination of the two. As shown in Fig. 11, the model infers reasonable and texture-level smooth contents to fill the blank space. The larger the blank space, the more the generated image differs from the original one. We also evaluate our inpainting results with frequently used criteria; the figures in Table III imply that GGCN-AMS can synthesize realistic faces.
Mask Type  MSE  FID  SSIM  PSNR 

Spot  0.0033  25.44  0.954  30.410 
Square  0.0041  30.69  0.934  30.095 
Spot+Square  0.0046  25.03  0.940  30.039 
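The masked-input setup described above can be sketched as follows; the mask sizes and spot density are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))                              # stand-in face image

square = np.ones((64, 64))
square[16:48, 16:48] = 0.0                              # square-box mask
spots = (rng.random((64, 64)) > 0.3).astype(float)      # random-spot mask
combo = square * spots                                  # combination of the two

masked = img * combo          # autoencoder input; the target is the full img
```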
In the interpolation experiment, we randomly pick two hidden variable vectors, denoted by , with 256 dimensions, clip their values to the range [-1, 1], and then linearly generate 10 intermediate vectors using the equation . In Fig. 12, we display some typical interpolation results of our model. As shown, no matter how different the faces synthesized from the two picked hidden variables are, the faces generated from the intermediate vectors keep their form while the facial properties change continuously. For example, in the first row, the faces at the two ends are generated from the two picked random vectors: the left is a woman facing left, while the other is a man facing slightly right. The faces between them form a transition from female to male and from facing left to facing right. This example shows a transition crossing gender and direction; we also provide transitions between different hairstyles, expressions, skin colors, etc.
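The interpolation procedure reads as follows in code; we take the clipping range to be [-1, 1], and the generator call on each intermediate vector is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = np.clip(rng.standard_normal(256), -1.0, 1.0)   # first picked latent
z1 = np.clip(rng.standard_normal(256), -1.0, 1.0)   # second picked latent

# 10 intermediate points z_t = (1 - t) * z0 + t * z1, excluding the endpoints
ts = np.linspace(0.0, 1.0, 12)[1:-1]
path = np.stack([(1 - t) * z0 + t * z1 for t in ts])
```

Each row of `path` would be fed through the generator to produce one face of the transition sequence.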
6 Conclusions
In this paper, we track and rederive the origin of the GCN model from the perspective of particle evolution and discover potential factors that may lead to deteriorating quality of generated samples and instability of model training, i.e., the vanishing problem inherent in minimizing the KL divergence. Based on this discovery, we propose the wGCN model by reformulating the KL discrete flow in the GCN model into the JKO flow and show empirically that this strategy can overcome the above-mentioned deficiencies. The Wasserstein method can be applied to other energy-based GCN models. Additionally, to overcome the challenges of time consumption and model collapse during synthesis, we propose another model, GGCN, which is modified from MGCN but can be regarded as a one-scale GCN. GGCN is designed and trained with a specific symmetrical learning structure. Experiments demonstrate the superiority of the proposed Wasserstein method and GGCN models, and the comparative results show that the Wasserstein method can significantly mitigate the vanishing issue of energy-based models and produce more visually appealing results. Moreover, the final GGCN model can be further accelerated through a cost-effective amortized sampling mechanism. Our investigation of the generalization of energy-based GCN models presents an appealing research topic from the particle evolution perspective and is expected to provide an impetus for present research on EBMs.
Appendix A
Theorem 1. Given a base measure and a clique potential , the density of the GCN in Eq. 1 can be obtained sufficiently and necessarily by solving the following constrained optimization problem:
(24) 
Proof.
Necessity can be proven using MEP by calculating iteratively:
Sufficiency: Recalling the Markov property in Eq. 1, we can write the inner product as a sum of feature responses w.r.t. different cliques; then, the shared pattern activation can be approximated by the Dirac measure as follows:
The result coincides with the empirical measure of , so the proof of sufficiency reduces to that of Lemma 1, which is completed elsewhere. ∎
Proof of Lemma 1. Under the Gaussian reference measure, a GCN model has the piecewise Gaussian property, as summarized in Proposition 1. This proposition is a reformulation of Theorem 1 and implies that the different pieces can be regarded as different generated samples acting as Brownian particles.
Proposition 1.
Equation 3 is piecewise Gaussian, and on each piece the probability density can be written as follows:
(25) 
where is an approximate reconstruction of in one part of the data space by a linear transformation involving inner products of model parameters and a piecewise linear activation function (ReLU).
By Proposition 1, each particle (image piece) in a GCN model has a transition kernel of Gaussian form (Eq. 25), which describes the probability of a particle moving from to in time . Let a fixed measure, such as the Gaussian distribution, be the initial measure of the Brownian particles at time .
Sanov's theorem shows that the empirical measure of such transition particles satisfies a large deviation principle (LDP) with rate functional , i.e.,
(26) 
Specifically, each Brownian particle has internal sub-particles that are independent and belong to different cliques. Let denote the number of cliques; Cramér's theorem states that for i.i.d. random variables with a common generating function , the empirical mean satisfies an LDP with rate functional given by the Legendre transformation of ,
(27) 
Since the empirical measure of is simply the empirical mean of the Dirac measure, i.e., , the empirical measure over all particles is shown to be as follows:
where the exponent is exactly the KL discrete flow . Thus, the empirical measure of the activation patterns of all particles satisfies LDP with rate functional in discrete time.
Appendix B Network Structure Details
GGCN v0. This network structure is the GGCN v0 used for synthesizing the results in Fig. 6(c) and Fig. 9(a). There is also a ResNet-50-based version, v1, with which we achieve our best results on CIFAR-10 and Tiny ImageNet; its details will be provided with the code.
Operation  Kernel  Step Size  Heat map  Norm  Activation 
– input  
Deconv  Leaky ReLU  
Deconv  Leaky ReLU  
Deconv  Leaky ReLU  
Deconv  Leaky ReLU  
Deconv  Leaky ReLU  
Conv  Sigmoid  
– input  
Conv  Leaky ReLU  
Conv  Leaky ReLU  
Conv  Leaky ReLU  
Conv  Leaky ReLU  
Conv  Leaky ReLU  
– input  
Deconv  Leaky ReLU  
Deconv  Leaky ReLU  
Deconv  Leaky ReLU  
Deconv  Leaky ReLU  
Deconv  Leaky ReLU  
Conv  Sigmoid  
Optimizer  Adam (, )  
Batch Size  100  
Number of iterations  500  
Leaky ReLU slope  0.02  
Dropout  0.0  
Weight initialization  Isotropic Gaussian (, ), Constant() 
Yang Wu received the B.S. degree of applied mathematics from South China University of Technology, Guangzhou, China. She is currently pursuing her Ph.D. degree at the School of Data and Computer Science of Sun YatSen University, advised by Prof. Liang Lin. Her current research interests include computer vision and machine learning, particularly in generative modeling and embodied tasks. 
Xu Cai received his B.S. degree of computer science and technology from Xidian University, Xi'an, China, and Master degree from Sun Yat-Sen University. He is currently pursuing his Ph.D. degree at the School of Computing at National University of Singapore. His current research interest is mainly theoretical machine learning. 
Pengxu Wei received the B.S. degree in computer science and technology from the China University of Mining and Technology, Beijing, China, in 2011 and the Ph.D. degree from University of Chinese Academy of Sciences in 2018. Her current research interests include visual object recognition, detection and latent variable learning. She is currently a research scientist at Sun Yatsen University. 
Guanbin Li is currently an associate professor at the School of Data and Computer Science, Sun Yat-sen University. He received his Ph.D. degree from the University of Hong Kong in 2016. He was a recipient of a Hong Kong Postgraduate Fellowship. His current research interests include computer vision, image processing, and deep learning. He has authored and coauthored more than 20 papers in top-tier academic journals and conferences. He serves as an area chair for the conference VISAPP. He has served as a reviewer for numerous academic journals and conferences, such as TPAMI, TIP, TMM, TC, TNNLS, CVPR 2018 and IJCAI 2018. 
Liang Lin is a Full Professor at Sun Yat-sen University, and the CEO of DMAI. He served as the Executive R&D Director and Distinguished Scientist of SenseTime Group from 2016 to 2018, taking charge of transferring cutting-edge technology into products. He has authored or coauthored more than 200 papers in leading academic journals and conferences (e.g., TPAMI/IJCV, CVPR/ICCV/NIPS/ICML/AAAI). He is an associate editor of IEEE Transactions on Human-Machine Systems and IET Computer Vision. He served as Area Chair for numerous conferences such as CVPR and ICCV. He is the recipient of numerous awards and honors, including the Wu Wen-Jun Artificial Intelligence Award for Natural Science, an ICCV Best Paper Nomination in 2019, the Annual Best Paper Award by Pattern Recognition (Elsevier) in 2018, the Best Paper Diamond Award at IEEE ICME 2017, a Google Faculty Award in 2012, and the Hong Kong Scholars Award in 2014. He is a Fellow of IET. 
References
 S. C. Zhu, Y. Wu, and D. Mumford, “Filters, random fields and maximum entropy (frame): Towards a unified theory for texture modeling,” International Journal of Computer Vision, vol. 27, no. 2, pp. 107–126, 1998.
 J. Baek, G. J. McLachlan, and L. K. Flack, “Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1298–1309, 2009.
 Y. Wei, Y. Tang, and P. D. McNicholas, “Flexible high-dimensional unsupervised learning with missing data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 J. Liang, J. Yang, H.-Y. Lee, K. Wang, and M.-H. Yang, “SubGAN: An unsupervised generative model via subspaces,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 698–714.
 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” International Conference on Learning Representations (ICLR), Banff, AB, Canada, Conference Track Proceedings, 2014.
 L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” International Conference on Learning Representations (ICLR), 2015.
 Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial on energy-based learning,” Predicting Structured Data, vol. 1, no. 0, 2006.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 J. Xie, Y. Lu, S.-C. Zhu, and Y. Wu, “A theory of generative ConvNet,” in International Conference on Machine Learning, 2016, pp. 2635–2644.
 S. C. Zhu, Y. N. Wu, and D. Mumford, “Minimax entropy principle and its application to texture modeling,” Neural Computation, vol. 9, no. 8, pp. 1627–1660, 1997.
 R. Gao, Y. Lu, J. Zhou, S.-C. Zhu, and Y. N. Wu, “Learning generative ConvNets via multi-grid modeling and sampling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9155–9164.
 R. Jordan, D. Kinderlehrer, and F. Otto, “The variational formulation of the Fokker–Planck equation,” SIAM Journal on Mathematical Analysis, vol. 29, no. 1, pp. 1–17, 1998.
 D. Wang and Q. Liu, “Learning to draw samples: With application to amortized mle for generative adversarial learning,” arXiv preprint arXiv:1611.01722, 2016.
 L. D. Landau and E. M. Lifshitz, Course of theoretical physics. Elsevier, 2013.
 J. Dai, Y. Lu, and Y.-N. Wu, “Generative modeling of convolutional neural networks,” International Conference on Learning Representations (ICLR), 2015.
 Y. Lu, S.-C. Zhu, and Y. N. Wu, “Learning FRAME models using CNN filters,” Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
 X. Cai, Y. Wu, G. Li, Z. Chen, and L. Lin, “Frame revisited: An interpretation view based on particle evolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 3256–3263.
 M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two timescale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
 A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” International Conference on Learning Representations (ICLR), 2016.
 D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improving variational inference with inverse autoregressive flow,” Advances in Neural Information Processing Systems, 2016.
 L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real NVP,” International Conference on Learning Representations (ICLR), 2017.
 J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu, “Cooperative training of descriptor and generator networks,” arXiv preprint arXiv:1609.09408, 2016.
 T. Kim and Y. Bengio, “Deep directed generative models with energy-based probability estimation,” arXiv preprint arXiv:1606.03439, 2016.
 L. Younes, “Parametric inference for imperfectly observed gibbsian fields,” Probability Theory and Related Fields, vol. 82, no. 4, pp. 625–645, 1989.
 M. A. Peletier, “Variational modelling: Energies, gradient flows, and large deviations,” arXiv preprint arXiv:1402.1990, 2014.
 S. Whitaker, “Flow in porous media I: A theoretical derivation of Darcy’s law,” Transport in Porous Media, vol. 1, no. 1, pp. 3–25, 1986.
 G. Montavon, K.-R. Müller, and M. Cuturi, “Wasserstein training of restricted Boltzmann machines,” in Advances in Neural Information Processing Systems, 2016, pp. 3718–3726.
 M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
 S. Adams, N. Dirr, M. A. Peletier, and J. Zimmer, “From a large-deviations principle to the Wasserstein gradient flow: A new micro-macro passage,” Communications in Mathematical Physics, vol. 307, no. 3, pp. 791–815, 2011.
 M. H. Duong, V. Laschos, and M. Renger, “Wasserstein gradient flows from large deviations of many-particle limits,” ESAIM: Control, Optimisation and Calculus of Variations, vol. 19, no. 4, pp. 1166–1188, 2013.
 M. Erbar, J. Maas, and M. Renger, “From large deviations to Wasserstein gradient flows in multiple dimensions,” Electronic Communications in Probability, vol. 20, 2015.
 J.-D. Benamou and Y. Brenier, “A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem,” Numerische Mathematik, vol. 84, no. 3, pp. 375–393, 2000.
 L. Ambrosio, N. Gigli, and G. Savaré, Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
 T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu, “Alternating back-propagation for generator network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
 P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam, “Optimizing the latent space of generative networks,” Proceedings of the International Conference on Machine Learning (ICML), 2017.
 Q. Liu and D. Wang, “Learning deep energy models: Contrastive divergence vs. amortized mle,” arXiv preprint arXiv:1707.00797, 2017.
 Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
 J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3485–3492.
 F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, “LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
 A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
 O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
 T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
 C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
 K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” International Conference on Learning Representations (ICLR), 2015.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 D. Warde-Farley and Y. Bengio, “Improving generative adversarial networks with denoising feature matching,” International Conference on Learning Representations (ICLR), 2017.
 K. Lee, W. Xu, F. Fan, and Z. Tu, “Wasserstein introspective neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.