Fluid Flow Mass Transport for Generative Networks
Generative Adversarial Networks have been shown to be powerful tools for generating content, and as a result they have been intensively studied in recent years. Training these networks requires simultaneously minimizing over the generator and maximizing over the discriminator, leading to a difficult saddle point problem that is slow to converge and often unstable. Motivated by techniques for the registration of point clouds and by the fluid flow formulation of mass transport, we investigate a new formulation that is based on strict minimization, without the need for maximization. This formulation views the problem as a matching problem rather than an adversarial one, which allows us to converge quickly and to obtain meaningful metrics along the optimization path.
Generative networks have been intensively studied in recent years, yielding impressive results in different fields (see, for example, Salimans et al. (2016); Karras et al. (2017); Brock et al. (2018); Karras et al. (2018); Zhu et al. (2017) and references therein). The most common technique used to train a generative network is to formulate the problem as a Generative Adversarial Network (GAN). Nonetheless, GANs are notoriously difficult to train: in many cases they do not converge, converge to undesirable points, or suffer from problems such as mode collapse.
Our goal here is to better understand the generative network problem and to investigate a new formulation and numerical optimization techniques that do not suffer from similar shortcomings. To be more specific, we consider a data set made up of two sets of vectors: template vectors (organized as a matrix) and reference vectors. The goal of our network is to find a transformation that generates vectors from the reference space out of vectors from the template space, where a set of parameters controls the transformation. For simplicity, we write our generator as

y = g(x, θ),   (1.1)

where x is a template vector, y the corresponding generated vector, and θ the parameters.
Equation (1.1) defines a generator that depends on the template data and on some unknown parameters to be learned in the training process. We will use a deep residual network to approximate the generator, with the network weights playing the role of the unknown parameters.
In order to find the parameters of the generator we need to minimize a loss that captures how well the generator works. Assume first that we have correspondence between the template and reference data points, that is, each template vector is paired with a known reference vector. In other words, our training set consists of paired input and output vectors. In this case we can find the parameters by simply minimizing the sum of squared differences (possibly adding a regularization term such as weight decay).
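A minimal sketch of this paired case, assuming synthetic data and a hypothetical one-layer residual generator g(x) = x + W x (the names W_true and lam are illustrative, not from the paper): with correspondence available, the parameters follow from a ridge-regularized least squares problem with a closed-form solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.normal(size=(n, d))                  # template vectors
W_true = np.array([[0.5, -0.3], [0.2, 0.4]])
Y = X + X @ W_true.T                         # paired reference vectors

lam = 1e-6                                   # weight decay (regularization)
# minimize sum_j ||x_j + W x_j - y_j||^2 + lam ||W||^2 in closed form:
# the residual targets are R = Y - X, so W solves a ridge regression.
R = Y - X
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ R).T

mse = np.mean((X + X @ W.T - Y) ** 2)
print("recovered W:\n", W, "\nMSE:", mse)
```

The closed form is only available because the sketch uses a linear residual layer; for a deep generator the same sum-of-squares loss would be minimized iteratively.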
However, such correspondence is not always available and therefore, a different approach is needed if we are to estimate the generator parameters.
The most commonly used approach for dealing with the lack of correspondence is Generative Adversarial Networks (GANs) and, more recently, Wasserstein GANs (WGANs) Arjovsky et al. (2017). In these approaches one trains a discriminator network, or a critic in the case of WGANs, that assigns to a vector a score reflecting how likely it is to belong to the reference space. In the original formulation the discriminator outputs a probability, while in more recent work on WGANs it outputs a real number. The discriminator/critic depends on its own parameters, to be estimated from the data in the training process. One then defines the loss function
In the case of a simple GAN we have that

ℓ(θ, w) = − E_{y ∼ reference} [ log σ(s(y, w)) ] − E_{x ∼ template} [ log(1 − σ(s(g(x, θ), w))) ],

where the function σ is a soft max (sigmoid) that converts the score s to a probability, θ are the generator parameters and w the discriminator parameters. In the case of WGANs a simpler expression is obtained by using the score directly, setting σ to the identity and dropping the logarithms, and requiring some extra regularity (a Lipschitz constraint) on the score function.
Minimizing the loss with respect to the discriminator parameters teaches the discriminator to detect the "fake" vectors produced by the generator, while maximizing it with respect to the generator parameters "fools" the discriminator, thus generating vectors that are similar to vectors drawn from the reference space. Training GANs is therefore a minimax problem, where the loss is minimized with respect to one set of parameters and maximized with respect to the other. Minimax problems are very difficult to solve and are typically unstable. Furthermore, the solution is based on gradient descent/ascent, which is known to be slow, especially for saddle point problems Nocedal and Wright (1999); this can be demonstrated by solving the following simple quadratic problem.
The descent/ascent algorithm used for the solution of the problem reads

w_{k+1} = w_k − μ ∇_w ℓ(θ_k, w_k),    θ_{k+1} = θ_k + μ ∇_θ ℓ(θ_k, w_k),

where μ is the step size.
The convergence path of the method is plotted in Figure 1.
It is evident that the algorithm takes an inefficient path to reach its destination. This inefficient path is a known property of the method, and avoiding it requires much more complex algorithms that, for example, eliminate one of the unknowns, at least locally (see the discussion in Nocedal and Wright (1999)). While such approaches have been derived for problems in fluid dynamics and constrained optimization, they are much more difficult to derive for deep learning due to the nonlinearity of the learning problem.
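The behavior above can be reproduced on the bilinear saddle l(x, y) = x·y, a hypothetical stand-in for the quadratic example (the step size mu is an arbitrary choice): simultaneous descent in x and ascent in y multiplies the distance to the saddle point (0, 0) by sqrt(1 + mu^2) > 1 at every step, so the iterates spiral outward instead of converging.

```python
import numpy as np

mu = 0.1                     # step size
x, y = 1.0, 1.0
radii = [np.hypot(x, y)]     # distance to the saddle point (0, 0)
for _ in range(100):
    gx, gy = y, x            # grad_x l = y, grad_y l = x
    x, y = x - mu * gx, y + mu * gy   # descent in x, ascent in y
    radii.append(np.hypot(x, y))

print("initial radius:", radii[0], "final radius:", radii[-1])
```

A short calculation shows the growth is exact: the new iterate (x − μy, y + μx) has squared norm (1 + μ²)(x² + y²), so the radius grows by a fixed factor per step.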
Besides the slow convergence of the algorithm, the simple GAN approach has a number of known fundamental problems. It has been shown in Zhang et al. (2017) that a deep network can classify vectors even with random labels. This implies that, given sufficient capacity, the discriminator can always separate the generated vectors from the reference vectors, even when the two distributions match, so the loss provides no real metric with which to stop the process (see further discussion in Arjovsky and Bottou (2017)). This problem also leads to mode collapse, since it is possible to reach a saddle point of the loss at which the generator covers only a few modes of the reference distribution; such a local solution is of little use. Therefore, it is fairly well known that the discriminator needs to be heavily regularized in order to effectively train the generator Gulrajani et al. (2017), for example by using a very small learning rate or, in the case of WGANs, by weight clipping. A number of improvements have been proposed for GANs and WGANs; however, these techniques still involve a minimax problem that can be difficult to solve.
Another common property of all GANs is that they minimize some distance between the probability distributions associated with the two vector spaces. While a simple GAN minimizes the JS-divergence, the Wasserstein GAN minimizes the Wasserstein distance. Minimizing a distance between probabilities makes sense when the probabilities are known exactly. However, as we discuss in the next section, such a process is not reasonable when the probabilities are merely estimated, that is, sampled and therefore noisy. Since both the template and reference spaces are only sampled, we have only noisy estimates of the probabilities, and this has to be taken into consideration when solving the problem.
In this paper we therefore propose a different point of view that allows us to solve the problem without the need for a minimax formulation. Our point of view stems from the following observation. Assume for a moment that the template and reference vectors live in two or three dimensions. Then the problem described above is nothing but the registration of point clouds under a non-rigid transformation (i.e., not an affine transformation). Such problems have been addressed in computer graphics and image registration for decades (see, for example, Eckart et al. (2018); Myronenko and Song (2010) and references therein), yielding successful software packages, numerical analysis and computational treatment. This observation is demonstrated in the following example, which we use throughout the paper.
Assume that the template space is defined by vectors in two dimensions, each drawn from a simple Gaussian distribution. To generate the reference space we use a known generator that is a simple resnet with fully connected layers,

x_{j+1} = x_j + K_{2,j} σ(K_{1,j} x_j),   j = 0, …, N − 1.

Here σ is the activation function; we choose the matrices at random and save them for later use. The original points, as well as points from the same distribution that are transformed using this simple resnet, are plotted in Figure 2. Finding the transformation parameters (the matrices) is akin to registering the red and blue points when no correspondence map is given.
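The running example can be sketched as follows; the depth, width, scaling of the random matrices and the tanh activation are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, layers = 1000, 2, 3

X = rng.normal(size=(n, d))                       # template cloud (Gaussian)
# random fully connected residual layers; weights are saved for later use
K1 = [0.5 * rng.normal(size=(d, d)) for _ in range(layers)]
K2 = [0.5 * rng.normal(size=(d, d)) for _ in range(layers)]

Y = X.copy()
for K1j, K2j in zip(K1, K2):
    Y = Y + np.tanh(Y @ K1j.T) @ K2j.T            # residual layer

print(Y.shape)                                    # reference cloud
```

Plotting X and Y on top of each other gives two overlapping point clouds with no known correspondence, which is exactly the registration setting described above.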
We therefore propose to extend the ideas behind point cloud registration to higher dimensions, adapting them to generative models. Furthermore, recent techniques for such registration are strongly connected to the fluid flow formulation of optimal mass transport. As we explore next, there is a strong connection between our formulation and the Wasserstein GAN; similar connections have been proposed in Lei et al. (2019). Our approach can be viewed as a discretization of the fluid flow formulation of the optimal mass transport problem proposed in Benamou and Brenier (2003) and further discussed in Haber and Horesh (2014); Ryu et al. (2017). It has some commonalities with normalizing flow generators Kingma and Dhariwal (2018), with two main differences. First, our flow becomes regular due to the addition of a regularization term that explicitly keeps the flow well behaved; second, and more importantly, our formulation solves the correspondence problem between the two data sets. The fluid formulation is known to be easier to solve and less nonlinear than the standard approach and (at least in 3D) leads to better and faster algorithms. We name our algorithm Mass Transport Generative Networks (MTGN), since it is based on mass transport, which tries to match distributions, rather than on adversarial networks. Similar to the fluid flow formulation of the OMT problem, our algorithm leads to a simple minimization problem and does not require the solution of a minimax problem.
The rest of this paper is organized as follows. In Section 2 we lay out the foundation of the idea used to solve the problem, including discretization and numerical optimization. In Section 3 we demonstrate the idea on a very simple example that helps us gain insight into the method. In Section 4 we perform numerical experiments in higher dimensions and discuss how to use the method effectively. Finally, in Section 5 we summarize the paper and suggest future work.
2 Generative Models, Mass Transport and Point Cloud Registration
In this section we discuss our approach for the solution of the problem and its connection to point cloud registration and to the fluid flow formulation of optimal mass transport.
2.1 Generative Models and Fluid Flow Mass Transport
We start by associating the template and reference spaces with two probability density functions, ρ_0 and ρ_1. The goal is to find a transformation such that the density ρ_0 is transformed into the density ρ_1, minimizing some distance between them (see the review paper Evans (1989) for details). An L2 distance leads to the Monge–Kantorovich problem, but it is possible to use different distances to obtain different transformations Burger et al. (2013). The computation of such a transformation has been addressed by a vast amount of literature; solution techniques range from linear programming to the solution of the notoriously nonlinear Monge–Ampère equation Evans (1989). However, in a seminal paper Benamou and Brenier (2003), it was shown that the problem can be formulated as minimizing the energy of a simple flow,

min_{ρ, v}  E(ρ, v) = ∫_0^1 ∫ ρ(x, t) |v(x, t)|² dx dt
subject to  ρ_t + ∇·(ρ v) = 0,   ρ(·, 0) = ρ_0,   ρ(·, 1) = ρ_1.

Here E is the total energy, which depends on the velocity v and the density ρ of the flow. The idea was extended in Chen et al. (2017) to solve the problem with different distances on vector spaces. The problem is also commonly solved in fields such as computational flow of fluids in porous media, where different energies and more complex transport equations are considered Sarma et al. (2007). A simple modification, which we use here, is to relax the final-time constraint ρ(·, 1) = ρ_1 and to demand that it hold only approximately. This formulation is better suited to the case where we have only noisy realizations of the distributions, and where a perfect fit may lead to overfitting. The relaxation leads to the optimization problem

min_{ρ, v}  D(ρ(·, 1), ρ_1) + α E(ρ, v)
subject to  ρ_t + ∇·(ρ v) = 0,   ρ(·, 0) = ρ_0,

where D is a distance between densities and α a regularization parameter.
Here, the first term, the distance between the transported density and the reference density, can be viewed as a data misfit, while the second term, the transport energy, can be thought of as regularization. As in all learning problems, the regularization parameter needs to be chosen, usually by using some cross validation set. Such formulations are commonly solved in applied inverse transport problems Dean and Chen (2011).
Here we see that our formulation, equation 2.4a, differs from both WGAN and normalizing flow: it allows a tradeoff between the regularity of the transformation, as expressed in equation 2.6, and the fit to the data, as expressed in equation 2.5. Neither WGAN nor normalizing flow offers such a choice.
2.2 Discretization of the Regularization Term
The optimization problem equation 2.4 is defined in continuous space, and in order to solve it we need to discretize it. The work in Benamou and Brenier (2003); Haber and Horesh (2014) used an Eulerian framework, discretizing both the density and the velocity on a grid in 2D and 3D. Such a formulation is not suitable for our problems, as their dimensionality can be much higher. In this case a Lagrangian approach based on sampling is suitable, although some care must be taken so that the discrete problem is faithful to the continuous one. To this end, the flow equation 2.3c is approximated by placing particles, of equal mass for now, at locations sampled from the template distribution, and letting them flow according to the ODE

dx_j/dt = v(x_j(t), t, θ),   j = 1, …, n.   (2.2)
Here v is the velocity field, which depends on the parameters θ. If we use a forward Euler discretization in time, equation (2.2) is nothing but a resnet that transforms particles from the original distribution ρ_0 to the final distribution ρ_1, which is sampled at the reference points. It is important to stress that other discretizations in time may be more suitable for the problem. Using the point mass approximation to estimate the density, the regularization part of the energy can be approximated in a straightforward way as

E ≈ δt Σ_i Σ_j |v(x_j(t_i), t_i, θ)|²,

where δt is the time interval used to discretize the ODE (2.2). We see that the OMT energy is simply a sum of squared activations over all particles and layers. Other energies can be used as well and can sometimes lead to more regular transportation maps Burger et al. (2013).
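A minimal sketch of this Lagrangian discretization, assuming a hypothetical tanh velocity field shared across time steps (the matrices K1, K2 and all sizes are illustrative): particles are advanced by forward Euler while the transport energy is accumulated as the time step times the sum of squared velocities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, steps = 200, 2, 10
dt = 1.0 / steps

K1 = 0.1 * rng.normal(size=(d, d))
K2 = 0.1 * rng.normal(size=(d, d))

def velocity(x):
    # illustrative parameterized velocity field v(x; theta)
    return np.tanh(x @ K1.T) @ K2.T

X = rng.normal(size=(n, d))         # particles at t = 0
energy = 0.0
for _ in range(steps):
    v = velocity(X)
    energy += dt * np.sum(v ** 2)   # discretized transport (OMT) energy
    X = X + dt * v                  # forward Euler step = one resnet layer

print("transport energy:", energy)
```

In a full implementation the velocity parameters would differ per layer and be trained by minimizing the misfit plus this energy; here they are fixed to keep the sketch short.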
2.3 Discretizing the Misfit and Point Cloud Registration
Estimating the first term in the objective function, the misfit, requires further discussion. Assume that we have chosen some parameters and pushed the particles forward. The main problem is how to compare the transported distribution, sampled at the pushed-forward particle locations, with the reference distribution, sampled at the reference points.
In a standard Lagrangian framework one usually assumes correspondence between the particles of the different distributions; however, this is not the case here. Since we have unpaired data, there is no way to know which transported particle corresponds to which reference point. This is exactly the problem solved when registering two point clouds to each other, and we thus discuss the connection between our approach and point cloud registration.
One approach to measuring the difference between two point clouds is the closest point match. This is the basis for the Iterative Closest Point (ICP) algorithm Besl and McKay (1992) that is commonly used to solve the problem. Nonetheless, the ICP algorithm tends to converge only locally, and we therefore turn to other algorithms that usually exhibit better properties.
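For concreteness, a sketch of the closest-point (Chamfer-type) misfit that underlies ICP; the function name chamfer and the symmetric form are our choices, not the paper's. Each point is matched to its nearest neighbor in the other cloud and the squared distances are averaged in both directions.

```python
import numpy as np

def chamfer(A, B):
    """Symmetric closest-point misfit between clouds A (n x d) and B (m x d)."""
    # pairwise squared distances via broadcasting, shape (n, m)
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 2))
print(chamfer(A, A))        # identical clouds: zero misfit
print(chamfer(A, A + 5.0))  # shifted cloud: large misfit
```

ICP alternates between recomputing these nearest-neighbor matches and minimizing over the transformation, which is why it can get stuck in local minima when the initial alignment is poor.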
Following the work of Myronenko and Song (2010), we use the idea of coherent point drift for the solution of the problem. To this end, we use a Gaussian mixture model to estimate the distribution of each data set. We define the approximations to the two densities as

ρ̂_0(x) = (1/n) Σ_j G(x − x_j, σ)   and   ρ̂_1(x) = (1/m) Σ_k G(x − y_k, σ),

where x_j are the transported particles, y_k are the reference points, and G(·, σ) is a Gaussian kernel of width σ.
The integral in equation 2.5 can now be written in terms of these smoothed densities.
Finally, we approximate the misfit by evaluating the smoothed densities at the sampled points of both clouds, in a symmetric fashion, obtaining a computable distance between the two point clouds.
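A hedged sketch of how such a Gaussian-mixture misfit can be computed; the normalization and the exact symmetric distance below are illustrative choices, not necessarily the paper's. Each cloud is smoothed into a mixture of isotropic Gaussians of width sigma, and the two densities are compared at the sample points of both clouds.

```python
import numpy as np

def gmm_density(points, centers, sigma):
    """Evaluate (1/m) sum_k N(p; c_k, sigma^2 I) at each point p."""
    d2 = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    norm = (2 * np.pi * sigma ** 2) ** (points.shape[1] / 2)
    return np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1) / norm

def gmm_misfit(A, B, sigma):
    """Symmetric squared density difference, sampled on both clouds."""
    pts = np.vstack([A, B])
    return np.mean((gmm_density(pts, A, sigma) - gmm_density(pts, B, sigma)) ** 2)

rng = np.random.default_rng(2)
A = rng.normal(size=(200, 2))
B = rng.normal(size=(200, 2))             # same distribution, new samples
print(gmm_misfit(A, B, sigma=1.0))        # small: distributions match
print(gmm_misfit(A, B + 4.0, sigma=1.0))  # large: distributions far apart
```

Unlike the closest-point misfit, this quantity is smooth in the particle positions, which is what makes it usable inside gradient-based training.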
2.4 Numerical Optimization
The optimization problem equation 2.4 can be solved using any standard optimization technique; however, a number of points require special attention. First, the batch size for both the template and the reference sets cannot be too small, because we are trying to match probabilities that are approximated by particles. For example, a batch consisting of a single vector is very unlikely to represent the probability density. Better approximations can be obtained by using a different approximation to the distribution, for example a small number of Gaussians, but this is not explored here. A second point is the choice of the kernel width σ in the estimation of the probability. When the distributions are far apart it is best to pick a rather large σ. Such a choice yields a "low resolution" approximation to the density, that is, only the low frequencies of the densities are captured. As the fit improves, we decrease σ and recover more details of the densities. This coarse-to-fine principle is very well known in image registration Modersitzki (2004).
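The coarse-to-fine choice of the kernel width can be illustrated on a toy registration problem; here only a translation is optimized, gradients are taken by finite differences, and all constants (the sigma schedule, step size, cloud size, the Gaussian-kernel objective) are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(150, 2))
B = A + np.array([3.0, -2.0])             # target: recover this shift

def misfit(t, sigma):
    # negative Gaussian-kernel correlation between A + t and B
    d2 = np.sum(((A + t)[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return -np.mean(np.exp(-d2 / (2 * sigma ** 2)))

t = np.zeros(2)
for sigma in [8.0, 4.0, 2.0, 1.0, 0.5]:   # annealed kernel width
    for _ in range(200):
        g = np.zeros(2)
        for i in range(2):                # central finite-difference gradient
            e = np.zeros(2); e[i] = 1e-4
            g[i] = (misfit(t + e, sigma) - misfit(t - e, sigma)) / 2e-4
        t -= sigma ** 2 * g               # step scaled with the kernel width
print("recovered shift:", t)
```

With a large sigma the objective is smooth and pulls the clouds together from far away; the smaller values then sharpen the alignment, mirroring the multi-resolution strategy used in image registration.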
3 Numerical Experiments on Synthetic Data
In this section we perform numerical experiments using synthetic data. The goals of these experiments are twofold. First, experimenting in 2D allows us to plot the distributions and obtain some much needed intuition. Second, synthetic experiments allow us to quantitatively test the results, as we can always compute the true correspondence for a new data point.
Returning to Example 1.2, we use the data generated with some chosen parameters. We train the generator to estimate the transformation and obtain convergence in 8 epochs. The optimization path is plotted in Figure 3. We also trained a standard GAN Zhu et al. (2017) to achieve the same goal. The GAN converged much more slowly and to a visually less pleasing solution, which can qualitatively be assessed to be of lower accuracy.
Figure panels: epochs 1–8 (our method) and epochs 2000, 4000, 6000 and 8000 (GAN).
One of the advantages of synthetic experiments is that we have the "true" transformation and can therefore quantitatively validate our results. To this end, we choose a new set of random points and push them with the estimated parameters to generate samples and their associated approximate distribution. We also generate the "true" samples by pushing the same points with the true parameters. We then compute the mean squared error between the two sets of samples.
For the experiment at hand, the error obtained with our method was substantially smaller than the error obtained with the GAN. This implies that, using our training, the network managed to learn the transformation rather well. Unfortunately, this quantitative measure can only be obtained when the transformation is known, which is why we believe that such simple tests are important.
4 Numerical Experiments in Higher Dimensions
In order to match higher dimensional vectors and distributions we slightly modify the architecture of the problem. Rather than working on the template and reference spaces directly, we use a feature extractor to obtain a latent space for each of them. Such spaces can be formed, for example, by training an auto-encoder and then using the encoded space as the latent space. We then register the points of the template latent space to the points of the reference latent space. In the experiments reported here we used the MNIST data set and a simple encoder, similar to Kingma and Welling (2019), to learn the latent space of the images. We then use our framework to obtain the transformation that maps a template vector, sampled from a Gaussian distribution, to the latent space. In our experiments the size of the latent space was only 32, which seems to be sufficient to represent the MNIST images. We use a simple ResNet50 network with a single convolutional layer at every step. We run our network for a fixed number of epochs with a constant learning rate. Better results are obtained if we change the kernel width σ throughout the optimization: we start with a very large value of 50 and slowly decrease it, dividing it by a constant factor at regular intervals, so that the final value of σ yields a rather local support. The convergence curve for our method is plotted in Figure 4.
Convergence is generally monotone; the misfit grows only when we decrease σ, changing the problem to a more challenging one.
Our results are presented in Figure 5.
Although not all images look real, a very large number of them (by visual inspection) look as if they could have been taken from the reference set. Unlike the previous experiment, where we had a quantitative measure of success, here we have to rely on visual inspection.
5 Conclusions and Future Work
In this work we have introduced a new approach to generative networks. Rather than viewing the problem as "fooling" an adversary, which leads to a minimax problem, we view it as a matching problem in which the correspondence between points is unknown. This enables us to formulate the problem strictly as a minimization, using the theory of optimal mass transport, which is designed to match probabilities, coupled with a numerical implementation based on particles and point cloud registration.
When comparing our approach to typical GANs in low dimensions, where it is possible to construct examples with known solutions, it is evident that our algorithm is superior both in the number of iterations to convergence and under visual inspection. Although we have shown only preliminary results in higher dimensions, we believe that our approach is more appropriate for the problem, and we will pursue variations of it in the future. Indeed, is it not better to find a match, that is commonalities, rather than to be an adversary?
- Towards principled methods for training generative adversarial networks. arXiv:1701.04862. Cited by: §1.
- Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 214–223. Cited by: §1.
- A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. SIAM J. Math. Analysis 35, pp. 61–97. Cited by: §1, §2.1, §2.2.
- A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), pp. 239–256. Cited by: §2.3.
- Large scale GAN training for high fidelity natural image synthesis. CoRR abs/1809.11096. Cited by: §1.
- A hyperelastic regularization energy for image registration. SIAM Journal on Scientific Computing 35, pp. B132–B148. Cited by: §2.1, §2.2.
- An efficient algorithm for matrix-valued and vector-valued optimal mass transport. CoRR abs/1706.08841. Cited by: §2.1.
- Recent progress on reservoir history matching: a review. Computational Geosciences 15, pp. 185–221. Cited by: §2.1.
- Fast and accurate point cloud registration using trees of Gaussian mixtures. arXiv:1807.02587. Cited by: §1.
- Partial differential equations and Monge–Kantorovich mass transfer. Lecture notes. Cited by: §2.1.
- Improved training of Wasserstein GANs. arXiv:1704.00028. Cited by: §1.
- Efficient numerical methods for the solution of the Monge–Kantorovich optimal transport map. SIAM J. on Scientific Computing, pp. 36–49. Cited by: §1, §2.2.
- Progressive growing of GANs for improved quality, stability, and variation. CoRR abs/1710.10196. Cited by: §1.
- A style-based generator architecture for generative adversarial networks. CoRR abs/1812.04948. Cited by: §1.
- Glow: generative flow with invertible 1x1 convolutions. arXiv:1807.03039. Cited by: §1.
- An introduction to variational autoencoders. arXiv:1906.02691. Cited by: §4.
- A geometric view of optimal transportation and generative model. Computer Aided Geometric Design 68, pp. 1–21. Cited by: §1.
- Numerical methods for image registration. Oxford University Press. Cited by: §2.4.
- Point set registration: coherent point drift. IEEE Trans. Pattern Anal. Mach. Intell. 32(12), pp. 2262–2275. Cited by: §1, §2.3.
- Numerical optimization. Springer, New York. Cited by: Example 1.1, §1.
- Vector and matrix optimal mass transport: theory, algorithm, and applications. arXiv:1712.10279. Cited by: §1.
- Improved techniques for training GANs. CoRR abs/1606.03498. Cited by: §1.
- A new approach to automatic history matching using kernel PCA. SPE Reservoir Simulation Symposium, Houston, Texas. Cited by: §2.1.
- Understanding deep learning requires rethinking generalization. Cited by: §1.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:1703.10593. Cited by: §1, §3.