# MMGAN: Generative Adversarial Networks for Multi-Modal Distributions

Teodora Pandeva and Matthias Schubert
Institute for Informatics
LMU Munich
Teodora.Pandeva@campus.lmu.de and schubert@dbs.ifi.lmu.de
###### Abstract

Over the past years, Generative Adversarial Networks (GANs) have shown remarkable generation performance, especially in image synthesis. Unfortunately, they are also known for having an unstable training process and might lose parts of the data distribution for heterogeneous input data. In this paper, we propose a novel GAN extension for multi-modal distribution learning (MMGAN). In our approach, we model the latent space as a Gaussian mixture model whose number of clusters corresponds to the number of disconnected data manifolds in the observation space, and we include a clustering network that relates each data manifold to one Gaussian cluster. As a result, training becomes more stable. Moreover, MMGAN allows for clustering real data according to the learned data manifold in the latent space. In a series of benchmark experiments, we illustrate that MMGAN outperforms competitive state-of-the-art models in terms of clustering performance.

## Introduction

Generative Adversarial Nets (GANs) [Goodfellow et al. 2014] are state-of-the-art deep generative models that are primarily designed to model data distributions. Compared to other generative models, GANs stand out for generating higher-quality data. Despite their notable success, GANs still suffer from unsolved problems, and there is ongoing research to further improve their performance and make training more stable. For instance, the implicit nature of GANs does not allow inference on the latent space. Although many methods exist that deal with this shortcoming, most of them lack interpretability of the estimated posterior distribution. Moreover, GANs model the latent space as a simple unimodal distribution, ignoring the often more complicated implicit structure of the learned data distribution. However, for many data sets, a union of disjoint manifolds (or clusters) fits the implicit structure of the input data more naturally. For example, digit data can be interpreted as samples from a disjoint union of manifolds, one for each digit. Khayatkhoei, Singh, and Elgammal [2018] showed that the quality of generated data suffers when the generator attempts to cover all data manifolds in the data space with a single manifold in the latent space. This can lead to mode dropping, i.e. one or more submanifolds of the real data are not covered by the generator. It has been proven that local convergence of GAN training can be sustained when the real and fake data distributions are close to a Nash equilibrium [Nagarajan and Kolter 2017]. Thus, ignoring the multi-modal nature of a data set might lead to oscillating generator parameters that do not converge to the real distribution.

In this paper, we introduce GANs for learning multi-modal distributions. The resulting architecture, named MMGAN, adopts a dynamic disconnected structure of the latent space, which is distributed according to a Gaussian mixture model. More precisely, by introducing an extra network into the GAN structure, the resulting framework aims to find a disconnected data representation in the latent space, such that each data mode or cluster in the observation space is related to a single cluster in the latent space. This stabilizes the training process and yields a better data representation. Furthermore, we can perform inference on the real data to predict the most likely cluster in the latent space and thus categorize the data with respect to its implicit structure. We provide a universal approximation theorem assuring the existence of a generator with the MMGAN functionality in the spirit of [Cybenko 1989, Hornik 1991].

## Related Work

There is a great variety of GAN architectures that explore the ability of the latent space to produce realistic data. Most of them can be referred to as hybrid VAE-GAN methods, which bridge the gap between Variational Autoencoders (VAEs) [Kingma and Welling 2013] and GANs. All of them use a third encoder network, which maps a data object to an instance from the latent space. For example, Makhzani et al. [2015] proposed the Adversarial Autoencoder (AAE), which is an autoencoder for performing inference. AAE is composed of three networks: encoder, decoder and discriminator. The latter is trained to distinguish encoded noise from prior noise, which is drawn from an arbitrary noise distribution. Although this model can be extended to learn a discrete data representation in an unsupervised fashion, it does not consider the true data distribution. Another VAE-GAN hybrid is ClusterGAN [Mukherjee et al. 2018], which is essentially InfoGAN [Chen et al. 2016] followed by k-means post-clustering on the encoded latent codes. Although InfoGAN has shown remarkable generation and clustering performance by semantically disentangling the latent space, we argue that the encodings are not suitable for k-means clustering, which tends to discover spherical patterns in the data.

A further approach for gaining more insight into the structure of the noise is to model the latent space directly by imposing assumptions on the prior distribution. For instance, GM-GAN [Ben-Yosef and Weinshall 2018] and DeLiGAN [Gurumurthy, Sarvadevabhatla, and Babu 2017] adopt a Gaussian mixture model for the latent space distribution, where the means and standard deviations are learnable parameters. However, these models provide neither a direct inference framework nor an interpretation of the learned latent components.

The Gaussian Mixture VAE (GMVAE) is adapted for unsupervised clustering tasks, as its latent space takes the form of a Gaussian mixture model [Dilokthanakul et al. 2016]. Thus, it is the explicit counterpart of MMGAN. However, being based on the VAE framework, GMVAE has shown some shortcomings, including the VAE's tendency to produce blurry images and the (strong) restriction of the encoder output distribution.

GANs consist of two networks which are opposed to one another in a game [Goodfellow et al. 2014]. The first one, $G$, is a generator, which captures the data distribution and tries to produce realistic data. It receives as input noise $z$ sampled from the latent space, which is deterministically mapped to a data point in the observation space. The second player, $D$, is called the discriminator. It measures how realistic the input data is, i.e. $D(x)$ for some input $x$ is in general a score, e.g. the probability, measuring whether $x$ comes from the real distribution. Thus, $D$ is trained via supervised learning on data labeled as real or fake and tries to correctly distinguish a real object from a fake one. In contrast, $G$ aims to fool the discriminator by producing data that resembles the real data as closely as possible.

In this setting, the models $G$ and $D$ are neural networks with fixed structures. Hence, learning takes place over the network parameters, which are denoted by $\theta_G$ and $\theta_D$, respectively, and which lie in real spaces whose dimensions depend on the network architectures. For simplicity, throughout this work, we write $G$ and $D$ without their parameter subscripts unless explicitly mentioned. Here, the unknown true data distribution, defined on the observation space, is denoted by $P_r$, and $P_z$ is the input noise distribution, defined on the latent space. Given a noise instance $z \sim P_z$, the generator produces a fake data point $x_f = G(z)$, which is a sample of the unspecified distribution $P_f$ induced by $G$. This distribution is the implicit approximation of $P_r$ and determines how $x_f$ is related to $z$.

The standard GAN optimization problem (SGAN) [Goodfellow et al. 2014] is defined by

$$\min_{\theta_G}\max_{\theta_D}\; \mathbb{E}_{x_r\sim P_r}\big[\log D(x_r)\big] + \mathbb{E}_{z\sim P_z}\big[\log\big(1 - D(G(z))\big)\big].$$

Jolicoeur-Martineau [2018] gives a theoretical and empirical analysis showing that the SGAN training behavior contradicts the theoretical results derived by [Goodfellow et al. 2014]: the probability of real data being real should decrease during training while the probability of fake data being real increases, which is not fulfilled by SGAN. To improve training stability, relativistic objective functions are proposed [Jolicoeur-Martineau 2018]. A representative of this class is the RSGAN, defined in the initial paper. The corresponding optimization problem is given by

$$\hat\theta_D = \arg\min_{\theta_D} -\mathbb{E}_{x_r\sim P_r,\, x_f\sim P_f}\big[\log\big(s(C(x_r) - C(x_f))\big)\big],$$

$$\hat\theta_G = \arg\min_{\theta_G} -\mathbb{E}_{x_r\sim P_r,\, x_f\sim P_f}\big[\log\big(s(C(x_f) - C(x_r))\big)\big],$$

where $C$ is the critic of the discriminator, which is defined by the non-transformed output of $D$ [Arjovsky, Chintala, and Bottou 2017], i.e. $D(x) = s(C(x))$, and $s$ is the sigmoid function. The discriminator objective is the negative expected log-probability that a real data object is more realistic than a fake one. A further example of the relativistic approach is the relativistic average GAN (RaSGAN), which compares a real object to the average fake one and vice versa. We provide a modified version of RaSGAN in the next section. Notably, this family of objective functions allows a direct comparison between pairs of fake and real data objects. This is a key feature which is utilized in the MMGAN structure.
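As an illustration of the RSGAN objective, the following minimal sketch estimates both losses from paired critic outputs. It assumes the critic values $C(x_r)$ and $C(x_f)$ are already available as plain Python floats; the function names are hypothetical and not taken from the paper.

```python
import math


def log_sigmoid(t):
    """Numerically stable log(s(t)), where s is the sigmoid function."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def rsgan_losses(critic_real, critic_fake):
    """Monte Carlo estimate of the RSGAN losses from paired critic outputs.

    critic_real[i] stands for C(x_r) and critic_fake[i] for C(x_f) of the
    i-th (real, fake) pair. Returns (discriminator_loss, generator_loss).
    """
    n = len(critic_real)
    pairs = list(zip(critic_real, critic_fake))
    d_loss = -sum(log_sigmoid(r - f) for r, f in pairs) / n
    g_loss = -sum(log_sigmoid(f - r) for r, f in pairs) / n
    return d_loss, g_loss
```

When the critic cannot separate real from fake objects (equal critic values), both losses equal $\log 2$; the two losses move in opposite directions as the gap between the critic values grows.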

## Multi-Modal GANs (MMGAN)

MMGAN samples noise from a Gaussian mixture model. In particular, we sample a cluster $y$ from a cluster distribution and then draw the noise $\tilde z$ from the Gaussian corresponding to the sampled cluster. Like any GAN, MMGAN feeds this noise into a generator ($G$) and employs a discriminator ($D$) to guide $G$ to generate realistic objects. In addition, MMGAN employs an encoder network ($E$) to predict the cluster of data objects. Its output is a probability distribution over the clusters, which is denoted by $p_E(y|x)$ for $y \in \{1,\dots,K\}$. Thus, the encoder should reproduce the cluster from which a fake instance was sampled and predict the most likely cluster for real images. An overview of the architecture is depicted in Figure 1.

In the following, we describe our architecture in more detail. To force MMGAN to cluster data, we model the latent space as a mixture of Gaussians with a uniform prior over the clusters. Thus, our goal is to find a representation of each cluster in terms of mean and covariance. We restrict the covariance matrix to have equal diagonal entries and $0$ everywhere else, i.e. to be of the form $\sigma^2 I_d$ where $\sigma > 0$. Therefore, we define the pairing $(\mu_k, \sigma_k)$ with $\mu_k \in \mathbb{R}^d$ and $\sigma_k > 0$ to be the mean and standard deviation of the $k$th cluster, where $\mu_k$ and $\sigma_k$ are learnable parameters for all $k = 1,\dots,K$. Both parameters are represented by dense layers located right before the core generator, denoted by $\mu$ and $\sigma$, respectively. Both networks receive a one-hot encoded cluster $y$ as input and output $\mu_k$ and $\sigma_k$ for the entry $k$ of $y$ that equals $1$, i.e. $\mu(y) = \mu_k$ and $\sigma(y) = \sigma_k$. Thus, for the cluster-related noise $\tilde z$, we obtain $\tilde z = \mu(y) + \sigma(y)\, z$ with $z \sim N(0, I_d)$ using the reparameterization trick [Kingma and Welling 2013]. Afterwards, $\tilde z$ is fed into the core generator. The exact procedure is illustrated in Figure 2. To formalize the whole procedure, we describe the generative model as:

$$y \sim \mathrm{Cat}\big(K, \tfrac{1}{K}\big), \qquad \tilde z \mid y \sim N\big(\mu(y), \sigma(y)^2 I_d\big), \qquad x_f = G(\tilde z).$$
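The first two sampling steps of this generative model can be sketched as follows. This is a minimal illustration, assuming the cluster means and standard deviations are given as plain lists; the generator itself is omitted and the function name `sample_latent` is hypothetical.

```python
import random


def sample_latent(mu, sigma, rng):
    """Draw (y, z_tilde) with y ~ Cat(K, 1/K) and z_tilde | y ~ N(mu[y], sigma[y]^2 I_d).

    mu:    list of K mean vectors (each a list of d floats)
    sigma: list of K scalar standard deviations
    rng:   a random.Random instance
    """
    num_clusters = len(mu)
    y = rng.randrange(num_clusters)            # uniform prior over the clusters
    z = [rng.gauss(0.0, 1.0) for _ in mu[y]]   # standard Gaussian noise, z ~ N(0, I_d)
    # Reparameterization: shift and scale the standard noise by the cluster parameters.
    z_tilde = [m + sigma[y] * z_i for m, z_i in zip(mu[y], z)]
    return y, z_tilde
```

A fake object would then be obtained by passing `z_tilde` through the core generator. Because the noise enters only through the shift-and-scale step, gradients can flow to the cluster parameters when the same computation is written in an automatic differentiation framework.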

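Inference on the latent space can be done by directly computing the posterior for a latent point $z$ and an observation $x$, i.e.

$$p(z \mid x) = \sum_{k=1}^{K} p_E(y = k \mid x)\, N\big(z \mid \mu_k, \sigma_k^2 I_d\big),$$

where $N(\cdot \mid \mu_k, \sigma_k^2 I_d)$ denotes the probability density function of the Gaussian distribution $N(\mu_k, \sigma_k^2 I_d)$.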
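The posterior above is a Gaussian mixture whose weights come from the encoder. A minimal sketch of its evaluation, assuming the encoder probabilities for a fixed observation are already given as a list (all names are hypothetical):

```python
import math


def isotropic_gaussian_logpdf(z, mean, std):
    """Log density of N(mean, std^2 I_d) evaluated at the point z."""
    d = len(z)
    squared_dist = sum((z_i - m_i) ** 2 for z_i, m_i in zip(z, mean))
    return -0.5 * d * math.log(2.0 * math.pi * std ** 2) - squared_dist / (2.0 * std ** 2)


def latent_posterior_density(z, encoder_probs, means, stds):
    """Mixture density sum_k p_E(y=k|x) * N(z | mu_k, sigma_k^2 I_d).

    encoder_probs[k] stands for the encoder output p_E(y=k|x) of one observation x.
    """
    return sum(
        p * math.exp(isotropic_gaussian_logpdf(z, mean, std))
        for p, mean, std in zip(encoder_probs, means, stds)
    )
```

When the encoder is confident about a single cluster, the posterior collapses to the Gaussian of that cluster.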

### MMGAN Training

To explain the training of MMGANs, we start with the generation of training batches. First, we sample real objects $x_r$ from the unknown distribution $P_r$. These are fed into the encoder $E$, and the resulting output is transformed into a cluster $y$, which refers to a Gaussian cluster in the mixture model. The encoded $y$, together with randomly sampled standard Gaussian noise $z$, serves as input to the generator $G$. Thus, fake objects $x_f$ are produced from the Gaussian cluster corresponding to the sampled real data $x_r$. The resulting pairings $(x_r, x_f)$ are fed into the discriminator $D$. We argue that this pairing system improves both the data generation process and the clustering performance. We refer to this training batch as B1.

In addition, to train the encoder to correctly assign clusters to fake objects, we generate a second type of training batch, B2, which is composed of fake observations $x_f$ labeled by their corresponding clusters $y$. For this batch, the clusters are randomly drawn samples from the categorical distribution $\mathrm{Cat}(K, \tfrac{1}{K})$.
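The construction of the two batch types can be sketched as follows. This is an illustration under stated assumptions: `encoder` and `generator` are hypothetical callables standing in for $E$ and $G$, and the encoder output is turned into a cluster by taking the most probable one, which is one possible reading of "transformed into a cluster".

```python
import random


def make_batches(real_batch, encoder, generator, num_clusters, latent_dim, rng):
    """Build the paired batch B1 and the labeled fake batch B2.

    encoder(x)      -> list of num_clusters probabilities (stand-in for E)
    generator(z, y) -> fake object (stand-in for G)
    """
    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

    # B1: each real object is paired with a fake from the cluster predicted for it.
    b1 = []
    for x_r in real_batch:
        probs = encoder(x_r)
        y = max(range(num_clusters), key=lambda k: probs[k])  # assumption: argmax
        b1.append((x_r, generator(noise(), y)))

    # B2: fakes labeled with clusters drawn uniformly at random.
    b2 = []
    for _ in real_batch:
        y = rng.randrange(num_clusters)
        b2.append((generator(noise(), y), y))
    return b1, b2
```

B1 feeds the adversarial loss, while B2 supplies labeled fakes for the encoder's cross-entropy term.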

The exact optimization problem for training MMGANs is given as follows:

$$\min_{\theta_D}\max_{\theta_G,\theta_E}\; V(D,G,E) + \alpha\, \mathbb{E}_{y\sim \mathrm{Cat}(K,\frac{1}{K}),\; z\sim N(0,I_d),\; x_f = G(z,y)}\big[\log p_E(y \mid x_f)\big],$$

where $V$ refers to an adversarial loss depending on the three networks, and $p_E(y|x)$ is the encoder output posterior of cluster $y$ for a given observation $x$. Note that $V$ is trained on the first type of batch, B1, and thus depends on $E$ to generate the fake instances $x_f$. The second term is the cross-entropy loss for the encoder output, weighted by a hyperparameter $\alpha$. This term is trained on the second type of batch, B2, to make sure that each cluster is sufficiently represented.

The training steps are shown in Algorithm 1. To emphasize the dependence on the network parameters, Algorithm 1 writes them out explicitly in its notation. Here, the chosen adversarial loss is RSGAN. Moreover, we use Adam [Kingma and Ba 2015] for parameter learning (see Algorithm 1).

We choose $V$ to be a relativistic objective since we aim to measure the similarity between real and fake objects belonging to the same cluster. If we assume that $E$ clusters data objects in a meaningful way, we expect that the discriminator will find it more difficult to distinguish a real object from a fake one from the same mode. Thus, we argue that the discriminator will not reach optimality very fast, which leads to a more stable GAN training behavior [Arjovsky and Bottou 2017].

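In addition to using the standard RSGAN loss, we propose an extension of the RaSGAN [Jolicoeur-Martineau 2018], which is a cluster-wise comparison between a real data object and the average fake one or vice versa, i.e. $\hat\theta_D$ and $(\hat\theta_G, \hat\theta_E)$ are optimal solutions of the optimization problems

$$\hat\theta_D = \arg\min_{\theta_D} -\mathbb{E}_{x_r\sim P_r}\big[\log \hat D(x_r)\big] - \mathbb{E}_{z\sim P_z,\; y\sim \mathrm{Cat}(K,\frac{1}{K})}\big[\log\big(1 - \hat D(G(z,y))\big)\big],$$

$$\hat\theta_G, \hat\theta_E = \arg\min_{\theta_G,\theta_E} -\mathbb{E}_{z\sim P_z,\; y\sim \mathrm{Cat}(K,\frac{1}{K})}\big[\log \hat D(G(z,y))\big] - \mathbb{E}_{x_r\sim P_r}\big[\log\big(1 - \hat D(x_r)\big)\big],$$

where

$$\hat D(x_r) = s\Big(C(x_r) - \mathbb{E}_{z\sim P_z}\big[C\big(G(z, E(x_r))\big)\big]\Big),$$

$$\hat D(G(z,y)) = s\Big(C(G(z,y)) - \mathbb{E}_{x_r\sim P_r}\big[\delta_{E(x_r)}(y)\cdot C(x_r)\big]\Big),$$

and $\delta_{E(x_r)}(y)$ is the Dirac delta function, which equals $1$ if $E$ assigns $x_r$ the highest probability of being in cluster $y$, and $0$ otherwise. We name the resulting objective conditional RaSGAN (cRaSGAN).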
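The two relativistic scores of cRaSGAN can be estimated on a paired batch as in the following minimal sketch. It assumes that the $i$-th fake object was generated from the cluster $E(x_r)$ predicted for the $i$-th real object, so that both share the label `clusters[i]`, and it replaces the expectations by batch averages; the function names are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(t):
    """Numerically stable sigmoid s(t)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def crasgan_scores(critic_real, critic_fake, clusters):
    """Batch estimates of D_hat(x_r) and D_hat(G(z, y)) for a paired batch.

    critic_real[i], critic_fake[i]: critic values C(x_r), C(x_f) of the i-th pair
    clusters[i]: cluster shared by the i-th real object and its paired fake
    """
    n = len(clusters)
    fake_sum, fake_count, real_sum = defaultdict(float), defaultdict(int), defaultdict(float)
    for c_r, c_f, y in zip(critic_real, critic_fake, clusters):
        fake_sum[y] += c_f
        fake_count[y] += 1
        real_sum[y] += c_r
    # Real score: compare to the average critic value of fakes from the same cluster.
    d_hat_real = [
        sigmoid(c_r - fake_sum[y] / fake_count[y])
        for c_r, y in zip(critic_real, clusters)
    ]
    # Fake score: the indicator-weighted expectation over ALL real objects,
    # hence the division by the full batch size n as in the formula above.
    d_hat_fake = [
        sigmoid(c_f - real_sum[y] / n)
        for c_f, y in zip(critic_fake, clusters)
    ]
    return d_hat_real, d_hat_fake
```

The two score lists would then be plugged into the logarithmic terms of the cRaSGAN objectives.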

### Universal Approximation Theorem for the Latent Space Assumption

In the following, we guarantee the existence of a fully connected neural network that maps a collection of Gaussians to the disjoint data space such that the resulting network recovers the initial data distribution up to a constant $\epsilon$. This result is closely related to the Universal Approximation Theorem of [Cybenko 1989, Hornik 1991]. A similar theory for the uniform distribution has recently been developed by [Khrulkov and Oseledets 2019].

Here, we are interested in smooth (infinitely differentiable) functions $f$ that surjectively map the support of a Gaussian distribution to a connected manifold, as defined in [Jost 2008]. In real-life applications, we choose the dimensionality of the latent space to be very large, due to the high dimensionality of the observation space. The Gaussian Annulus theorem (see [Blum, Hopcroft, and Kannan 2015]) suggests that for a large enough dimension $d$, the mass of a Gaussian with zero mean and identity covariance matrix is concentrated around the periphery of a ball with radius $\sqrt{d}$ and thus approximates a sphere. The theory developed below is adapted to high-dimensional spherical latent spaces, since these can easily be extended to high-dimensional Gaussians.
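The concentration stated by the Gaussian Annulus theorem can be checked numerically. The following minimal sketch draws samples from a standard Gaussian and compares their Euclidean norms to $\sqrt{d}$, the radius in the standard statement of the theorem; the dimension and the sample size are arbitrary choices for illustration.

```python
import math
import random


def gaussian_norms(dim, num_samples, rng):
    """Euclidean norms of num_samples draws from N(0, I_dim)."""
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(num_samples)
    ]


rng = random.Random(0)
dim = 400
norms = gaussian_norms(dim, 200, rng)
# Ratios close to 1 indicate that the samples lie in a thin shell of radius sqrt(dim).
ratio_min = min(norms) / math.sqrt(dim)
ratio_max = max(norms) / math.sqrt(dim)
print(ratio_min, ratio_max)
```

In high dimensions the printed ratios stay close to 1, which is the sense in which a Gaussian latent space behaves like a sphere.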

###### Lemma 1.

Let for be a compact connected -dimensional manifold. Then there exists a smooth map

$$f: S^d \to \mathbb{R}^p,$$

such that where .

###### Proof.

We use [Khrulkov and Oseledets 2019, Theorem 5.1], which implies the existence of a surjective smooth map from the closed unit ball centered at the origin onto the manifold. Now, we construct a smooth surjective function from the sphere onto this ball such that the resulting composition $f$ of the two maps fulfills the above stated requirements.

Consider the projection of the sphere onto the unit ball. This map is smooth because it is linear, i.e. it can be represented in matrix form. It is also surjective because each point of the ball has a preimage under the projection that is contained in the sphere.

Thus, the map $f$ is smooth and surjective since it is a composition of smooth and surjective maps. ∎

Let be the support of a -dimensional Gaussian of the form . Hence, the map where is defined as in Lemma 1, is smooth and since

According to the Universal Approximation Theorem by [Cybenko 1989, Hornik 1991], the map $f$ defined in Lemma 1 can be approximated arbitrarily well by a fully connected neural network. This observation is translated into the more general case of disconnected manifolds in Theorem 1. In this setting, the approximation error is measured by means of the Hausdorff distance [Munkres 2017, p. 280], which is defined by

$$d_H(X, Y) = \max\Big\{\sup_{x\in X}\inf_{y\in Y} m(x,y),\; \sup_{y\in Y}\inf_{x\in X} m(x,y)\Big\},$$

where $m$ is a well-defined metric on the space containing $X$ and $Y$. Thus, we aim to find a network $G$ for which the Hausdorff distance between the generated set and the data manifold is small.
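For finite point sets, the suprema and infima in the Hausdorff distance become maxima and minima. A minimal sketch, assuming the Euclidean distance as the metric $m$ and finite samples in place of the sets $X$ and $Y$:

```python
import math


def hausdorff_distance(points_x, points_y):
    """Hausdorff distance between two finite, non-empty point sets.

    Each point is a sequence of floats; the Euclidean distance plays the role of m.
    """
    def directed(source, target):
        # Largest distance from a source point to its nearest target point.
        return max(min(math.dist(a, b) for b in target) for a in source)

    return max(directed(points_x, points_y), directed(points_y, points_x))
```

The distance is symmetric and is zero exactly when both finite sets coincide; a single outlying point in either set is enough to make it large.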

###### Theorem 1.

Let $X = \bigcup_{i=1}^{K} X_i$ be a disconnected union of $K$ compact connected manifolds $X_i$ of equal dimension. Then for every $\epsilon > 0$ and every nonconstant, bounded, continuous activation function, there exists a fully connected neural network $G$ with this activation function such that the following is fulfilled.

There exists a collection of disjoint compact annuli $\{S_i\}_{i=1}^{K}$ such that for all $i = 1,\dots,K$

$$d_H\big(G(S_i), X_i\big) < \epsilon.$$

###### Proof.

The collection $\{S_i\}_{i=1}^{K}$ is constructed explicitly. Consider a cube and the set of all its vertices $v^{(i)}$. We arbitrarily choose $K$ vertices $v^{(1)},\dots,v^{(K)}$ and initialize $K$ Gaussians, the $i$-th one with mean $v^{(i)}$ and a prescribed covariance, for $i = 1,\dots,K$. Now, by means of the Gaussian Annulus Theorem [Hopcroft and Kannan 2013], the required sets $S_i$ are defined as annuli centered at $v^{(i)}$, where every $S_i$ contains the support of the $i$-th Gaussian up to a small fraction. Thus, the initialized annuli form a union of disjoint compact connected manifolds.

Lemma 1 conveys that for every given manifold $X_i$ there exists a smooth surjective function $f_i$ such that $f_i(S_i) = X_i$. Thus, we obtain a collection of functions $\{f_i\}_{i=1}^{K}$. Let $f$ be of the form $f = \sum_{i=1}^{K} f_i \mathbf{1}_{S_i}$, where $\mathbf{1}_{S_i}$ is the indicator function of $S_i$, i.e. $\mathbf{1}_{S_i}(x) = 1$ for $x \in S_i$ and $0$ otherwise. Next, we construct a function $f_\rho$ which is an approximation of $f$ via smoothing by convolution with a suitable function, e.g. the standard mollifier, defined by:

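$$\eta_\rho: \mathbb{R}^{d+1} \to \mathbb{R}, \qquad \eta_\rho(x) = \begin{cases} c_\rho \exp\Big(-\dfrac{1}{1 - \|x\|_2^2/\rho^2}\Big), & \|x\|_2 < \rho,\\[2mm] 0, & \text{otherwise,} \end{cases}$$

where $\rho > 0$ and the constant $c_\rho$ is chosen such that

$$\int_{\mathbb{R}^{d+1}} \eta_\rho(x)\, dx = 1,$$

and the support of $\eta_\rho$ is the closed ball of radius $\rho$ around the origin.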

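Let $\Omega = \bigcup_{i=1}^{K} (S_i \setminus \partial S_i)$. Hence, $\Omega$ is a union of open bounded sets. Thus, the resulting convolved function $f_\rho$, defined by

$$f_\rho(x) = (f * \eta_\rho)(x) = \int_{\Omega} \eta_\rho(y)\, f(x-y)\, dy = \sum_{i=1}^{K} \int_{S_i\setminus\partial S_i} \eta_\rho(y)\, f_i(x-y)\, \mathbf{1}_{S_i}(x-y)\, dy,$$

is continuous (e.g. [Heil 2019]) and it holds

$$\mathrm{supp}(f_\rho) \subset \overline{\bigcup_{i\le K} S_i + B^{d+1}_\rho(0)} = \bigcup_{i\le K} \overline{S_i + B^{d+1}_\rho(0)} =: S,$$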

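where $B^{d+1}_\rho(0)$ denotes the ball of radius $\rho$ around the origin. We choose $\rho$ such that the compact sets $\overline{S_i + B^{d+1}_\rho(0)}$ become separable, in the sense that they are pairwise disjoint for all $i \neq j$. To see that this is possible, let $s \in S_i$ and $b \in B^{d+1}_\rho(0)$ with $\|v^{(i)} - (s+b)\|_2 < 2\sqrt{d}$. Recall that the means $v^{(i)}$ for all $i$ are the vertices of the cube defined above. Moreover, for $j \neq i$,

$$4\sqrt{d} \le \|v^{(i)} - v^{(j)}\|_2 \le \|v^{(i)} - (s+b)\|_2 + \|v^{(j)} - (s+b)\|_2 < 2\sqrt{d} + \|v^{(j)} - (s+b)\|_2.$$

It follows that $\|v^{(j)} - (s+b)\|_2 > 2\sqrt{d}$ and, therefore, $s + b$ is not contained in the set belonging to the $j$-th cluster.

Now, define a cube that contains $S$. Thus, the function $f_\rho$ restricted to this cube fulfills the requirements of [Khrulkov and Oseledets 2019, Theorem 5.1]. Therefore, for all $\epsilon > 0$ a neural network $G$ exists such that for all $i = 1,\dots,K$, $d_H(G(S_i), X_i) < \epsilon$. ∎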

Theorem 1 gives theoretical guarantees for the existence of a generator $G$ and a disconnected latent space such that $G$ approximates the real data manifolds with small error. However, this holds only if the dimension of the data manifolds is known, which is generally impossible to estimate in real-life applications. Nevertheless, this result and the discussion above provide a sound justification of the latent space choice.

### Cluster Initialization

To generate a collection of disjoint Gaussian clusters in a high-dimensional space for initializing the model, we propose the following heuristic. It is based on the idea of the annuli construction suggested in the proof of Theorem 1. For this reason, consider the $d$-dimensional cube in the latent space, whose number of vertices equals $2^d$, and let $V$ with $|V| = 2^d$ be the set of all its vertices. We randomly sample a subset of $V$ of size $K$ and use its elements to initialize the means of the Gaussian clusters. Here, we assume that the number of clusters does not exceed the number of vertices, i.e. $K \le 2^d$. All the standard deviations are initialized with values of at most 1. Moreover, to avoid narrow Gaussian clusters with very small $\sigma_k$ for $k = 1,\dots,K$, we set a lower bound for the standard deviations.
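The initialization heuristic can be sketched as follows. The cube used here has vertices with coordinates $\pm$`half_width`, and the standard deviations are drawn uniformly between a lower bound `sigma_min` and 1; both parameters and the uniform draw are assumptions made for illustration, since the text does not fix them.

```python
import random


def init_cluster_means(num_clusters, latent_dim, rng, half_width=1.0):
    """Sample num_clusters distinct vertices of the cube {-half_width, +half_width}^latent_dim."""
    if num_clusters > 2 ** latent_dim:
        raise ValueError("number of clusters exceeds the number of cube vertices")
    chosen = set()
    # Rejection sampling avoids enumerating all 2^latent_dim vertices.
    while len(chosen) < num_clusters:
        vertex = tuple(rng.choice((-half_width, half_width)) for _ in range(latent_dim))
        chosen.add(vertex)
    return [list(vertex) for vertex in chosen]


def init_cluster_stds(num_clusters, rng, sigma_min=0.1):
    """Draw standard deviations in [sigma_min, 1]; sigma_min is a hypothetical lower bound."""
    return [rng.uniform(sigma_min, 1.0) for _ in range(num_clusters)]
```

In high dimensions the rejection step almost never triggers, because two random vertices of the cube coincide only with probability $2^{-d}$.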

## Experiments

We fix the network structure and parameter initialization and use benchmark data to achieve a fair comparison between our new approach and the existing models it is compared to (GMVAE [Dilokthanakul et al. 2016], AAE [Makhzani et al. 2015], ClusterGAN [Mukherjee et al. 2018]). All models are trained with the Adam optimization method [Kingma and Ba 2015]. The hyperparameter $\alpha$ is kept fixed for all experiments. The MMGAN and ClusterGAN generator and discriminator use the same architectures as AAE's decoder and encoder. Moreover, the MMGAN/ClusterGAN discriminator and encoder have the same structure. For GMVAE we used an existing implementation. We evaluated our method on one synthetic and three gray-scale data sets (Moon [Pedregosa et al. 2011], MNIST [LeCun and Cortes 2010], Fashion MNIST [Xiao, Rasul, and Vollgraf 2017] and Coil-20 [Nene, Nayar, and Murase 1996]). For synthetic data, all model components have two dense layers with ReLU activation functions, while for gray-scale images, the MMGAN, ClusterGAN and AAE networks have three CNN layers. The activation function used for the MMGAN/ClusterGAN discriminator and encoder is LeakyReLU.

To study the functionality of MMGAN, we conduct several experiments in which the model is trained on benchmark datasets and compared to three other competitive models. Table 1 provides an overview of the numerical results regarding the clustering performance on test data. We can conclude that MMGAN, especially the cRaSGAN-based variant, outperforms the competitors in terms of all evaluation measures used: normalized mutual information (NMI), adjusted Rand index (ARI), and purity (ACC).
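Of the three measures, purity is the simplest to state. The following minimal sketch assumes the common definition, in which every predicted cluster is credited with its most frequent true label; the exact evaluation code behind Table 1 is not shown in the paper, so this is an illustration only.

```python
from collections import Counter, defaultdict


def purity(predicted_clusters, true_labels):
    """Fraction of objects that carry the majority true label of their predicted cluster."""
    label_counts = defaultdict(Counter)
    for cluster, label in zip(predicted_clusters, true_labels):
        label_counts[cluster][label] += 1
    # Each cluster contributes the count of its most frequent true label.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(true_labels)
```

Purity is invariant to a renaming of the predicted clusters, which suits the unsupervised setting where cluster indices carry no meaning.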

Figure 3 illustrates the generated output of SGAN-, RSGAN- and cRaSGAN-based MMGANs trained on the MNIST data. From this experiment, we conclude that the relativistic approach seems to be more stable than the SGAN one. For instance, it can be seen in Figure 3(a) that the third cluster collapses and that two clusters generate the same images, as illustrated by the corresponding columns of Figure 3(a).

In Figure 4, the heatmaps visualize the cosine similarity between the cluster means, which is defined by $\cos(\mu_i, \mu_j) = \frac{\langle \mu_i, \mu_j\rangle}{\|\mu_i\|_2\, \|\mu_j\|_2}$ for two means $\mu_i$ and $\mu_j$, where $i, j \in \{1,\dots,K\}$. The measure is bounded in $[-1, 1]$. Two clusters are close to each other when the cosine similarity is around $1$. We also keep in mind that in high-dimensional spaces two randomly sampled vectors are nearly orthogonal with high probability. As we have pointed out earlier, two clusters in the SGAN-based model generate the same mode. It could therefore be expected that their cluster means form a small angle. However, Figure 4(a) does not support this hypothesis: the computed cosine measure does not indicate a small angle. The other two heatmaps (see Figures 4(b) and 4(c)) also do not reveal any pattern between the similarity measurements and the generated cluster output. For example, some pairs of digits are often associated with one cluster (see Figure 3(b)). However, according to Figures 4(b) and 4(c), the corresponding cluster means are not very similar. We therefore conclude that the obtained clusters do not allude to further structure in the latent space, i.e. two latent clusters whose generated outputs resemble each other need not be similar in the latent space.
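The pairwise similarities behind such a heatmap can be computed as in the following minimal sketch, assuming the learned cluster means are available as non-zero vectors given as lists of floats (the function names are hypothetical):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(means):
    """K x K matrix of pairwise cosine similarities between the cluster means."""
    return [[cosine_similarity(u, v) for v in means] for u in means]
```

Values near 1 indicate cluster means pointing in nearly the same direction, values near 0 indicate nearly orthogonal means, and the diagonal is 1 up to floating-point rounding.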

In the next experiment, we examine the effect of the pairings strategy on the MMGAN encoder performance. For each dataset (MNIST, Fashion MNIST, Coil-20) five MMGANs are trained using the cRaSGAN adversarial loss. For each dataset, the trained models are evaluated with respect to the clustering measures (NMI, ARI, ACC), which are summarized by their mean and standard deviations, shown in Table 2.

Analogously to the setting above, we train MMGANs without using the pairing strategy, i.e. the input noise is sampled randomly from the Gaussian mixture model. Thus, the formed pairings do not necessarily refer to the same cluster. Table 2 provides the clustering summary statistics for this type of model, as well.

Considering Table 2, we can conclude that for both the MNIST and Fashion MNIST data the two types of models show similar results in terms of clustering performance. Moreover, in the case of the Coil-20 dataset, our proposed MMGAN framework with pairings outperforms the variant without them. This observation is also supported by Figure 5, which illustrates the discriminator loss over the training iterations for both types of models. It can be concluded that the random strategy drives the discriminator loss toward its optimum faster than the pairing strategy.

In the experiment depicted in Figure 6, we trained two MMGANs with different initializations of the cluster means. The first one (see Figure 6(a)) employs the heuristic explained above, while the second one uses the same starting value for each cluster mean. In both experiments, the standard deviations have the same starting values. The latent codes used for generating the data points are fixed over all iterations. Interestingly, a clustered structure can be recognized in the generated data right after the first iteration. It can also be seen that the generator manages to reconstruct the initial data distribution and that the encoder successfully performs the unsupervised clustering task by achieving maximal evaluation scores. The second row of Figure 6(a) shows the latent space parameter learning over time. It can be observed that the cluster-specific standard deviations decrease. Figure 6(b), similarly to Figure 6(a), shows the generator behavior over time, yet for an overlapping cluster initialization. It indicates that the resulting encoder does not match the prior labeling. Moreover, the sample quality deteriorates compared to the first MMGAN.

## Conclusions

This paper introduced a new model from the VAE-GAN hybrid family, which is adapted for both inference learning and approximating real data distributions with disconnected support. By presenting theoretical and empirical results, we have justified the specific structure of our model. We observe that MMGAN performs well in the generative modeling task and successfully clusters the real data in the latent space with respect to the labels in the datasets used. In the conducted experiments, MMGAN outperformed the other state-of-the-art models related to this field in terms of clustering performance.

## References
