A Geometric View of Optimal Transportation and Generative Model

Na Lei Dalian University of Technology, Dalian, China. Email: nalei@dlut.edu.cn    Kehua Su Wuhan University, Wuhan, China. Email: skh@whu.edu.cn    Li Cui Beijing Normal University, Beijing, China. Email: licui@bnu.edu.cn    Shing-Tung Yau Harvard University, Boston, US. Email: yau@math.harvard.edu    David Xianfeng Gu Stony Brook University, New York, US. Email: gu@cs.stonybrook.edu.
Abstract

In this work, we show the intrinsic relations between optimal transportation and convex geometry, especially the variational approach to solving the Alexandrov problem: constructing a convex polytope with prescribed face normals and volumes. This leads to a geometric interpretation of generative models and to a novel framework for them.

Using the optimal transportation view of the GAN model, we show that the discriminator computes the Kantorovich potential, while the generator calculates the transportation map. For a large class of transportation costs, the Kantorovich potential gives the optimal transportation map by a closed-form formula. Therefore, it is sufficient to optimize the discriminator alone. This shows that the adversarial competition can be avoided and the computational architecture can be simplified.

Preliminary experimental results show the geometric method outperforms WGAN for approximating probability measures with multiple clusters in low dimensional space.

1 Introduction

GAN model

Generative Adversarial Networks (GANs) [10] aim at learning a mapping from a simple distribution to a given distribution. A GAN model consists of a generator $G$ and a discriminator $D$, both represented as deep networks. The generator captures the data distribution and generates samples, while the discriminator estimates the probability that a sample came from the training data rather than from $G$. The generator and the discriminator are trained simultaneously. The competition drives both of them to improve their performance until the generated samples are indistinguishable from the genuine data samples. At the Nash equilibrium [50], the distribution generated by $G$ equals the real data distribution. GANs have several advantages: they can automatically generate samples and thus reduce the amount of real data samples required; furthermore, GANs do not need an explicit expression of the distribution of the given data.

Recently, GANs have received an explosive amount of attention. For example, GANs have been widely applied to numerous computer vision tasks such as image inpainting [35, 49, 28], image super resolution [26, 19], semantic segmentation [52, 31], object detection [37, 27, 47], video prediction [32, 46], image translation [20, 51, 7, 29], 3D vision [48, 34], face editing [25, 30, 36, 39, 6, 40, 18], etc. In the machine learning field, GANs have also been applied to semi-supervised learning [33, 24, 38], clustering [41], cross domain learning [42, 22], and ensemble learning [43].

Figure 1: Wasserstein Generative Adversarial Networks (W-GAN) framework.

Optimal Transportation View

Recently, optimal mass transportation theory has been applied to improve GANs. The Wasserstein distance has been adopted by GANs as the loss function for the discriminator, as in WGAN [3], WGAN-GP [13] and RWGAN [14]. Even when the supports of the two distributions have no overlap, the Wasserstein distance still provides a suitable gradient for the generator to update.

Figure 1 shows the optimal mass transportation point of view of WGAN [3]. The ambient image space is $\mathcal{X}$, and the real data distribution is $\nu$. The latent space $\mathcal{Z}$ has much lower dimension. The generator $G$ can be treated as a mapping from the latent space to the sample space, $g_\theta:\mathcal{Z}\to\mathcal{X}$, realized by a deep network with parameter $\theta$. Let $\zeta$ be a fixed distribution on the latent space, such as a uniform or Gaussian distribution. The generator $G$ pushes $\zeta$ forward to a distribution $\mu_\theta = g_{\theta\#}\zeta$ in the ambient space $\mathcal{X}$. The discriminator $D$ computes the distance between $\mu_\theta$ and $\nu$, in general using the Wasserstein distance $W_c(\mu_\theta,\nu)$. Computing the Wasserstein distance is equivalent to finding the so-called Kantorovich potential function $\varphi_\xi$, which is carried out by another deep network with parameter $\xi$. Therefore, $G$ improves the "decoding" map $g_\theta$ to approximate $\nu$ by $g_{\theta\#}\zeta$; $D$ improves $\varphi_\xi$ to increase the approximation accuracy of the Wasserstein distance. The generator and the discriminator are trained alternately, until the competition reaches an equilibrium.

In summary, the generative model has a natural connection with the optimal mass transportation (OMT) theory:

  1. In the generator $G$, the generating map $g_\theta$ in GAN is equivalent to the optimal transportation map in OMT;

  2. In the discriminator $D$, the metric between distributions is equivalent to the Kantorovich potential $\varphi_\xi$;

  3. The alternating training process of W-GAN is the min-max optimization of expectations:

    $$\min_\theta \max_\xi \; E_{z\sim\zeta}\big(\varphi_\xi(g_\theta(z))\big) + E_{y\sim\nu}\big(\varphi_\xi^c(y)\big).$$

    The deep nets of $D$ and $G$ perform the maximization and the minimization respectively.

Figure 2: The GAN model, OMT theory and convex geometry have intrinsic relations.

Geometric Interpretation

Optimal mass transportation theory has intrinsic connections with convex geometry. A special OMT problem is equivalent to the Alexandrov theory in convex geometry: finding the optimal transportation map with $L^2$ cost is equivalent to constructing a convex polytope with user prescribed normals and face volumes. The geometric view leads to a practical algorithm, which finds the generating map by a convex optimization. Furthermore, the optimization can be carried out using Newton’s method with explicit geometric meaning. The geometric interpretation also gives the direct relation between the transportation map for $G$ and the Kantorovich potential for $D$.

These concepts can be explained in the plain language of computational geometry [8]:

  1. the Kantorovich potential corresponds to the power distance;

  2. the optimal transportation map represents the mapping from the power diagram to the power centers, each power cell is mapped to the corresponding site.

Imaginary Adversary

In the current work, we use optimal mass transportation theory to show that, by carefully designing the model and choosing special distance functions $c$, the generator map $g_\theta$ and the discriminator function $\varphi_\xi$ (Kantorovich potential) are equivalent: one can be deduced from the other by a simple closed formula. Therefore, once the Kantorovich potential reaches the optimum, the generator map can be obtained directly without training. One of the deep neural nets for $G$ or $D$ is redundant, and one of the training processes is wasteful. The competition between the generator and the discriminator is unnecessary. In a word, the adversary is imaginary.

Contributions

The major contributions of the current work are as follows:

  1. Give an explicit geometric interpretation of optimal mass transportation map, and apply it for generative model;

  2. Prove in Theorem 3.7 that if the cost function is $c(x,y)=h(x-y)$, where $h$ is a strictly convex function, then once the optimal discriminator is obtained, the generator can be written down in an explicit formula. In this situation, the competition between the discriminator and the generator is unnecessary and the computational architecture can be simplified;

  3. Propose a novel framework for generative model, which uses geometric construction of the optimal mass transportation map;

  4. Conduct preliminary experiments for the proof of concepts.

Organization

The article is organized as follows: section 2 explains the optimal transportation view of WGAN in detail; section 3 lists the main theory of OMT; section 4 gives a detailed exposition of the Minkowski and Alexandrov theorems in convex geometry and their close relation with power diagram theory in computational geometry, together with an explicit computational algorithm for solving Alexandrov’s problem; section 5 analyzes the semi-discrete optimal transportation problem and connects the Alexandrov problem with the optimal transportation map; section 6 proposes a novel geometric generative model, which applies the geometric OMT map to the generative model; section 7 reports preliminary experiments conducted as a proof of concept. The work concludes in section 8.

2 Optimal Transportation View of GAN

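In this section, the GAN model is interpreted from the optimal transportation point of view. We show that the discriminator mainly looks for the Kantorovich potential.

Let $\mathcal{X}$ be the (ambient) image space, and $\mathcal{P}(\mathcal{X})$ be the Wasserstein space of all probability measures on $\mathcal{X}$. Assume the data distribution is $\nu\in\mathcal{P}(\mathcal{X})$, represented as an empirical distribution

$$\nu := \frac{1}{n}\sum_{j=1}^n \delta_{y_j}, \tag{1}$$

where $y_j\in\mathcal{X}$, $j=1,\dots,n$, are data samples. A generative model produces a parametric family of probability distributions $\mu_\theta$, $\theta\in\Theta$; a Minimum Kantorovich Estimator for $\theta$ is defined as any solution to the problem

$$\min_\theta W_c(\mu_\theta,\nu),$$

where $W_c$ is the Wasserstein cost on $\mathcal{P}(\mathcal{X})$ for some ground cost function $c:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$,

$$W_c(\mu,\nu) = \min_{\gamma\in\mathcal{P}(\mathcal{X}\times\mathcal{X})}\left\{\int_{\mathcal{X}\times\mathcal{X}} c(x,y)\,d\gamma(x,y) \;\middle|\; P_{x\#}\gamma=\mu,\ P_{y\#}\gamma=\nu\right\} \tag{2}$$

where $P_x(x,y)=x$ and $P_y(x,y)=y$ are projectors, and $P_{x\#}$ and $P_{y\#}$ are marginalization operators. In a generative model, the image samples are encoded to a low dimensional latent space (or feature space) $\mathcal{Z}$. Let $\zeta$ be a fixed distribution supported on $\mathcal{Z}$. A WGAN produces a parametric mapping $g_\theta:\mathcal{Z}\to\mathcal{X}$, which is treated as a "decoding" map from the latent space $\mathcal{Z}$ to the original image space $\mathcal{X}$. $g_\theta$ pushes $\zeta$ forward to $\mu_\theta$, $\mu_\theta = g_{\theta\#}\zeta$. The Minimum Kantorovich Estimator in WGAN is formulated as

$$\min_\theta E(\theta) := W_c(g_{\theta\#}\zeta,\nu).$$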

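According to the optimal transportation theory, the Kantorovich problem has a dual formulation

$$E(\theta) = \max_{\varphi,\psi}\left\{\int_{\mathcal{Z}}\varphi(g_\theta(z))\,d\zeta(z) + \int_{\mathcal{X}}\psi(y)\,d\nu(y) \;\middle|\; \varphi(x)+\psi(y)\le c(x,y)\right\} \tag{3}$$

The gradient of the dual energy with respect to $\theta$ can be written as

$$\nabla E(\theta) = \int_{\mathcal{Z}} [\partial_\theta g_\theta(z)]^T\,\nabla\varphi^*(g_\theta(z))\,d\zeta(z),$$

where $\varphi^*$ is the optimal Kantorovich potential. In practice, $\psi$ can be replaced by the c-transform of $\varphi$, defined as

$$\varphi^c(y) := \inf_x\,\big(c(x,y)-\varphi(x)\big).$$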

The function is called the Kantorovich potential. Since is discrete, one can replace the continuous potential by a discrete vector and impose . The optimization over can then be achieved using stochastic gradient descent, as in  [9].

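In WGAN [3], the dual problem Eqn. 3 is solved by approximating the Kantorovich potential $\varphi$ by the so-called "adversarial" map $\varphi_\xi:\mathcal{X}\to\mathbb{R}$, where $\varphi_\xi$ is represented by a discriminative deep network with parameter $\xi$. This leads to the Wasserstein-GAN problem

$$\min_\theta\max_\xi \int_{\mathcal{Z}}\varphi_\xi\circ g_\theta(z)\,d\zeta(z) + \frac{1}{n}\sum_{j=1}^n \varphi_\xi^c(y_j). \tag{4}$$

The generator produces $g_\theta$ and the discriminator estimates $\varphi_\xi$; by simultaneous training, the competition reaches the equilibrium. In WGAN [3], $c(x,y)=|x-y|$, so the c-transform of $\varphi_\xi$ equals $-\varphi_\xi$, subject to $\varphi_\xi$ being a 1-Lipschitz function. This is used to replace $\varphi_\xi^c$ by $-\varphi_\xi$ in Eqn. 4, with a deep network made of ReLU units whose Lipschitz constant is upper-bounded by 1.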

3 Optimal Mass Transport Theory

In this section, we review the classical optimal mass transportation theory. Theorem 3.7 shows the intrinsic relation between the Wasserstein distance (Kantorovich potential) and the optimal transportation map (Brenier potential). This demonstrates that once the optimal discriminator is known, the optimal generator is automatically obtained, so the game between the discriminator and the generator is unnecessary.

The problem of finding a map that minimizes the inter-domain transportation cost while preserving measure quantities was first studied by Monge [4] in the 18th century. Let $X$ and $Y$ be two metric spaces with probability measures $\mu$ and $\nu$ respectively. Assume $X$ and $Y$ have equal total measure

$$\int_X d\mu = \int_Y d\nu.$$

Definition 3.1 (Measure-Preserving Map)

A map $T:X\to Y$ is measure preserving if for any measurable set $B\subset Y$,

$$\mu(T^{-1}(B)) = \nu(B). \tag{5}$$

If this condition is satisfied, $\nu$ is said to be the push-forward of $\mu$ by $T$, and we write $\nu = T_\#\mu$.

If the mapping $T$ is differentiable, then the measure-preserving condition can be formulated as the following Jacobian equation, $\mu(x)\,dx = \nu(T(x))\,dT(x)$,

$$\det(DT(x)) = \frac{\mu(x)}{\nu\circ T(x)}. \tag{6}$$

Let us denote the transportation cost for sending $x\in X$ to $y\in Y$ by $c(x,y)$, then the total transportation cost is given by

$$\mathcal{C}(T) := \int_X c(x,T(x))\,d\mu(x). \tag{7}$$

Problem 3.2 (Monge’s Optimal Mass Transport [4])

Given a transportation cost function $c:X\times Y\to\mathbb{R}$, find the measure preserving map $T:X\to Y$ that minimizes the total transportation cost

$$T^* = \arg\min_{T_\#\mu=\nu}\int_X c(x,T(x))\,d\mu(x). \tag{8}$$

The minimal total transportation cost is called the Wasserstein distance between the two measures $\mu$ and $\nu$.
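To make Monge's problem concrete, here is a minimal sketch for two small empirical measures on the line with equal point masses, where every measure-preserving map is a bijection between the samples. It compares an exhaustive search with the monotone (sorted) matching, which is known to be optimal in one dimension for convex costs. The sample values and the function names `total_cost`, `brute_force_monge` and `monotone_map` are illustrative assumptions, not taken from the paper.

```python
from itertools import permutations


def total_cost(xs, ys, perm, cost):
    """Average cost of the map sending xs[i] to ys[perm[i]] (equal point masses)."""
    return sum(cost(xs[i], ys[j]) for i, j in enumerate(perm)) / len(xs)


def brute_force_monge(xs, ys, cost):
    """Try every bijection; only feasible for a handful of points."""
    return min(permutations(range(len(ys))),
               key=lambda perm: total_cost(xs, ys, perm, cost))


def monotone_map(xs, ys):
    """Send the i-th smallest source point to the i-th smallest target point."""
    order_x = sorted(range(len(xs)), key=lambda i: xs[i])
    order_y = sorted(range(len(ys)), key=lambda j: ys[j])
    perm = [0] * len(xs)
    for i, j in zip(order_x, order_y):
        perm[i] = j
    return tuple(perm)


def quadratic(x, y):
    return 0.5 * (x - y) ** 2


xs = [0.9, 0.1, 0.4, 0.7]      # illustrative source samples
ys = [2.0, -1.0, 0.5, 1.2]     # illustrative target samples
best = brute_force_monge(xs, ys, quadratic)
mono = monotone_map(xs, ys)
print(total_cost(xs, ys, best, quadratic), total_cost(xs, ys, mono, quadratic))
```

The exhaustive search grows factorially with the number of samples, which is one motivation for the relaxations and variational methods discussed in the rest of this section.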

3.1 Kantorovich’s Approach

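In the 1940s, Kantorovich introduced the relaxation of Monge’s problem [21]. Any strategy for sending $\mu$ onto $\nu$ can be represented by a joint measure $\rho$ on $X\times Y$, such that for all measurable sets $A\subset X$ and $B\subset Y$

$$\rho(A\times Y) = \mu(A),\qquad \rho(X\times B) = \nu(B). \tag{9}$$

$\rho$ is called a transportation plan, which represents the share to be moved from $x$ to $y$. We denote the projections to $X$ and $Y$ as $\pi_x$ and $\pi_y$ respectively, then $\pi_{x\#}\rho=\mu$ and $\pi_{y\#}\rho=\nu$. The total cost of the transportation plan $\rho$ is

$$\mathcal{C}(\rho) := \int_{X\times Y} c(x,y)\,d\rho(x,y). \tag{10}$$

The Monge-Kantorovich problem consists in finding the $\rho$, among all the suitable transportation plans, minimizing $\mathcal{C}(\rho)$ in Eqn. 10,

$$\rho^* := \arg\min_\rho\left\{\int_{X\times Y} c(x,y)\,d\rho(x,y) \;\middle|\; \pi_{x\#}\rho=\mu,\ \pi_{y\#}\rho=\nu\right\}. \tag{11}$$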
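The relaxation matters because a transportation plan may split mass, which a map cannot do. The following sketch, with made-up atoms and a hypothetical 1-by-2 plan matrix, checks the marginal conditions and evaluates the discrete analogue of the total cost for a case where no Monge map exists: a single source atom must feed two target atoms.

```python
def marginals(plan):
    """Row sums (mass leaving each x_i) and column sums (mass reaching each y_j)."""
    rows = [sum(row) for row in plan]
    cols = [sum(col) for col in zip(*plan)]
    return rows, cols


def plan_cost(plan, xs, ys, cost):
    """Discrete version of the total cost: sum over i, j of c(x_i, y_j) * rho_ij."""
    return sum(plan[i][j] * cost(xs[i], ys[j])
               for i in range(len(xs)) for j in range(len(ys)))


xs, mu = [0.0], [1.0]                 # a single source atom carrying all the mass
ys, nu = [-1.0, 1.0], [0.5, 0.5]      # two target atoms with half the mass each
plan = [[0.5, 0.5]]                   # split the source mass in half

rows, cols = marginals(plan)
cost = plan_cost(plan, xs, ys, lambda x, y: 0.5 * (x - y) ** 2)
print(rows, cols, cost)
```

Here the plan is the only feasible one, so it is trivially optimal; in general the minimization over plans is a linear program.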

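3.2 Kantorovich Dual Formulation

Because Eqn. 11 is a linear program, it has a dual formulation, known as the Kantorovich problem [45]:

$$\max_{\varphi,\psi}\left\{\int_X\varphi(x)\,d\mu(x)+\int_Y\psi(y)\,d\nu(y) \;\middle|\; \varphi(x)+\psi(y)\le c(x,y)\right\} \tag{12}$$

where $\varphi:X\to\mathbb{R}$ and $\psi:Y\to\mathbb{R}$ are real functions defined on $X$ and $Y$. Equivalently, we can replace $\psi$ by the c-transform of $\varphi$.

Definition 3.3 (c-transform)

Given a real function $\varphi:X\to\mathbb{R}$, the c-transform of $\varphi$ is defined by

$$\varphi^c(y) = \inf_{x\in X}\,\big(c(x,y)-\varphi(x)\big).$$

Then the Kantorovich problem can be reformulated as the following dual problem:

$$W_c(\mu,\nu) = \max_\varphi\left\{\int_X\varphi(x)\,d\mu(x) + \int_Y\varphi^c(y)\,d\nu(y)\right\} \tag{13}$$

where $\varphi$ is called the Kantorovich potential.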

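For the $L^1$ transportation cost $c(x,y)=|x-y|$ in $\mathbb{R}^n$, if the Kantorovich potential $\varphi$ is 1-Lipschitz, then its c-transform has the special relation $\varphi^c=-\varphi$. The Wasserstein distance is given by

$$W_c(\mu,\nu) = \max_{\varphi\ \text{1-Lipschitz}}\left\{\int_X\varphi\,d\mu - \int_Y\varphi\,d\nu\right\}. \tag{14}$$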
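The relation between the c-transform and 1-Lipschitz potentials for the cost $|x-y|$ can be checked numerically. The sketch below evaluates a discrete c-transform on a one-dimensional grid (the grid and both test potentials are arbitrary illustrative choices) and compares it with the negated potential, once for a potential with Lipschitz constant 0.7 and once for a potential with Lipschitz constant 2.

```python
def c_transform(phi, xs, ys, cost):
    """phi^c(y) = min over grid points x of c(x, y) - phi(x)."""
    return [min(cost(x, y) - p for x, p in zip(xs, phi)) for y in ys]


def l1_cost(x, y):
    return abs(x - y)


grid = [i / 10 for i in range(-20, 21)]

phi_lip = [0.7 * abs(x - 0.3) for x in grid]      # Lipschitz constant 0.7 <= 1
phi_steep = [2.0 * abs(x) for x in grid]          # Lipschitz constant 2 > 1

# Largest deviation between phi^c and -phi over the grid.
gap_lip = max(abs(c + p) for c, p in
              zip(c_transform(phi_lip, grid, grid, l1_cost), phi_lip))
gap_steep = max(abs(c + p) for c, p in
                zip(c_transform(phi_steep, grid, grid, l1_cost), phi_steep))
print(gap_lip, gap_steep)
```

The first gap vanishes, while the second does not, which illustrates why the Lipschitz constraint has to be enforced on the discriminator network when its c-transform is replaced by its negative.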

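For the $L^2$ transportation cost $c(x,y)=\frac{1}{2}|x-y|^2$ in $\mathbb{R}^n$, the c-transform and the classical Legendre transform have a special relation.

Definition 3.4

Given a function $\varphi:\mathbb{R}^n\to\mathbb{R}$, its Legendre transform is defined as

$$\varphi^*(y) := \sup_x\,\big(\langle x,y\rangle - \varphi(x)\big). \tag{15}$$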

Intuitively, the Legendre transform has the following form:

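We can show that the following relation holds when $c(x,y)=\frac{1}{2}|x-y|^2$,

$$\frac{1}{2}|y|^2 - \varphi^c(y) = \left(\frac{1}{2}|x|^2-\varphi(x)\right)^*. \tag{16}$$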
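The relation between the c-transform for the quadratic cost and the Legendre transform can be verified on a grid, where both the infimum and the supremum become finite minima and maxima over the same points. This is a small numerical sketch under the assumption that the cost is $c(x,y)=\frac{1}{2}|x-y|^2$; the test potential `phi` is an arbitrary choice.

```python
import math


def legendre(f, xs, ys):
    """Discrete Legendre transform: f*(y) = max over x of (x*y - f(x))."""
    return [max(x * y - fx for x, fx in zip(xs, f)) for y in ys]


def c_transform_l2(phi, xs, ys):
    """c-transform for c(x, y) = |x - y|^2 / 2: min over x of (c(x, y) - phi(x))."""
    return [min(0.5 * (x - y) ** 2 - p for x, p in zip(xs, phi)) for y in ys]


grid = [i / 20 for i in range(-40, 41)]
phi = [math.sin(3.0 * x) for x in grid]            # arbitrary test potential

# Left side: |y|^2 / 2 - phi^c(y).  Right side: Legendre dual of |x|^2 / 2 - phi.
lhs = [0.5 * y * y - pc for y, pc in zip(grid, c_transform_l2(phi, grid, grid))]
rhs = legendre([0.5 * x * x - p for x, p in zip(grid, phi)], grid, grid)
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_gap)
```

The two sides agree up to rounding because expanding $\frac{1}{2}|x-y|^2$ turns the minimum defining the c-transform into the maximum defining the Legendre transform.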

3.3 Brenier’s Approach

At the end of the 1980s, Brenier [5] discovered the intrinsic connection between the optimal mass transport map and convex geometry (see also, for instance, [44], Theorem 2.12(ii) and Theorem 2.32).

Suppose $f:\Omega\to\mathbb{R}$ is a $C^2$ convex function, so that its Hessian matrix is positive semi-definite. Its gradient map $\nabla f:\Omega\to\mathbb{R}^n$ is defined as $x\mapsto\nabla f(x)$.

Theorem 3.5 (Brenier [5])

Suppose $X$ and $Y$ are the Euclidean space $\mathbb{R}^n$, and the transportation cost is the quadratic Euclidean distance $c(x,y)=|x-y|^2$. If $\mu$ is absolutely continuous and $\mu$ and $\nu$ have finite second order moments, then there exists a convex function $f:X\to\mathbb{R}$ whose gradient map $\nabla f$ gives the solution to Monge’s problem, where $f$ is called Brenier’s potential. Furthermore, the optimal mass transportation map is unique.

This theorem converts Monge’s problem to solving the following Monge-Ampère partial differential equation:

$$\det\left(\frac{\partial^2 f(x)}{\partial x_i\,\partial x_j}\right) = \frac{\mu(x)}{\nu\circ\nabla f(x)}. \tag{17}$$

The function $f$ is called the Brenier potential. Brenier also proved the polar factorization theorem.

Theorem 3.6 (Brenier Factorization [5])

Suppose $X$ and $Y$ are the Euclidean space $\mathbb{R}^n$, and $\varphi:X\to Y$ is measure preserving, $\varphi_\#\mu=\nu$. Then there exists a convex function $f:X\to\mathbb{R}$, such that

$$\varphi = \nabla f\circ s,$$

where $s:X\to X$ preserves the measure $\mu$, $s_\#\mu=\mu$. Furthermore, this factorization is unique.

Based on the generalized Brenier theorem we can obtain the following theorem.

Theorem 3.7 (Generator-Discriminator Equivalence)

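Given $\mu$ and $\nu$ on a compact domain $\Omega\subset\mathbb{R}^n$, there exists an optimal transport plan $\gamma$ for the cost $c(x,y)=h(x-y)$ with $h$ strictly convex. It is unique and of the form $(id,T)_\#\mu$, provided $\mu$ is absolutely continuous and $\partial\Omega$ is negligible. Moreover, there exists a Kantorovich potential $\varphi$, and $T$ can be represented as

$$T(x) = x - (\nabla h)^{-1}(\nabla\varphi(x)).$$

Proof: Assume $\gamma$ is the joint probability satisfying the conditions $P_{x\#}\gamma=\mu$, $P_{y\#}\gamma=\nu$, and $(x_0,y_0)$ is a point in the support of $\gamma$. By definition $\varphi^c(y_0)=\inf_x\,\big(c(x,y_0)-\varphi(x)\big)$, and the infimum is attained at $x_0$, hence

$$\nabla\varphi(x_0) = \nabla_x c(x_0,y_0) = \nabla h(x_0-y_0).$$

Because $h$ is strictly convex, $\nabla h$ is invertible,

$$x_0-y_0 = (\nabla h)^{-1}(\nabla\varphi(x_0)),$$

hence $y_0 = x_0-(\nabla h)^{-1}(\nabla\varphi(x_0))$.

When $c(x,y)=\frac{1}{2}|x-y|^2$, we have

$$T(x) = x-\nabla\varphi(x) = \nabla\left(\frac{|x|^2}{2}-\varphi(x)\right) = \nabla u(x).$$

In this case, the Brenier potential $u$ and the Kantorovich potential $\varphi$ are related by

$$u(x) = \frac{|x|^2}{2}-\varphi(x). \tag{18}$$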
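As a hedged illustration of Theorem 3.7, consider a constructed one-dimensional example that is not taken from the paper. It assumes the formula of the theorem reads $T(x)=x-(\nabla h)^{-1}(\nabla\varphi(x))$, takes $\mu$ uniform on $[0,1]$ and $\nu$ uniform on $[a,a+1]$, and uses the cost $h(z)=|z|^p/p$ with $p>1$. The monotone map between these two measures is the translation $T(x)=x+a$, and the computation below checks that the Kantorovich potential reproduces it.

```latex
% Worked example (illustrative assumption): translation on the line.
% mu = uniform on [0,1], nu = uniform on [a,a+1], c(x,y) = h(x-y), h(z) = |z|^p / p, p > 1.
\begin{align*}
  T(x) &= x + a, \\
  \nabla h(z) &= \operatorname{sign}(z)\,|z|^{p-1}, \qquad
  (\nabla h)^{-1}(w) = \operatorname{sign}(w)\,|w|^{1/(p-1)}, \\
  \nabla\varphi(x) &= \nabla h\bigl(x - T(x)\bigr) = \nabla h(-a)
                    = -\operatorname{sign}(a)\,|a|^{p-1}, \\
  x - (\nabla h)^{-1}\bigl(\nabla\varphi(x)\bigr)
       &= x + \operatorname{sign}(a)\,|a| = x + a = T(x). \\
\intertext{For $p = 2$ this gives}
  \varphi(x) &= -a x, \qquad
  u(x) = \tfrac{1}{2} x^2 - \varphi(x) = \tfrac{1}{2} x^2 + a x, \qquad
  \nabla u(x) = x + a .
\end{align*}
```

In the quadratic case the Brenier potential $u$ is convex and its gradient is exactly the translation, so the generator map is recovered from the potential without any further optimization.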

4 Convex Geometry

(a) Minkowski theorem (b) Alexandrov theorem
Figure 3: Minkowski and Alexandrov theorems for convex polytopes with prescribed normals and areas.

This section introduces the Minkowski and Alexandrov problems in convex geometry, which can be described by the Monge-Ampère equation as well. This intrinsic connection gives a geometric interpretation of the optimal mass transportation map with $L^2$ transportation cost.

4.1 Alexandrov’s Theorem

Minkowski proved the existence and uniqueness of a convex polytope with user prescribed face normals and areas.

Theorem 4.1 (Minkowski)

Suppose $n_1,\dots,n_k$ are unit vectors which span $\mathbb{R}^n$ and $\nu_1,\dots,\nu_k>0$ so that $\sum_{i=1}^k\nu_i n_i=0$. There exists a compact convex polytope $P\subset\mathbb{R}^n$ with exactly $k$ codimension-1 faces $F_1,\dots,F_k$ so that $n_i$ is the outward normal vector to $F_i$ and the volume of $F_i$ is $\nu_i$. Furthermore, such $P$ is unique up to parallel translation.

Minkowski’s proof is variational and suggests an algorithm to find the polytope. Minkowski theorem for unbounded convex polytopes was considered and solved by A.D. Alexandrov and his student A. Pogorelov. In his book on convex polyhedra [2], Alexandrov proved the following fundamental theorem (Theorem 7.3.2 and theorem 6.4.2)

Theorem 4.2 (Alexandrov[2])

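Suppose $\Omega$ is a compact convex polytope with non-empty interior in $\mathbb{R}^n$, $n_1,\dots,n_k\subset\mathbb{R}^{n+1}$ are distinct $k$ unit vectors whose $(n+1)$-th coordinates are negative, and $\nu_1,\dots,\nu_k>0$ so that $\sum_{i=1}^k\nu_i=\mathrm{vol}(\Omega)$. Then there exists a convex polytope $P\subset\mathbb{R}^{n+1}$ with exactly $k$ codimension-1 faces $F_1,\dots,F_k$, so that $n_i$ is the normal vector to $F_i$ and the intersection between $\Omega$ and the projection of $F_i$ has volume $\nu_i$. Furthermore, such $P$ is unique up to vertical translation.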

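Alexandrov’s proof is based on algebraic topology and is non-constructive. Gu et al. [12] gave a variational proof for the generalized Alexandrov theorem stated in terms of convex functions.

Given $y_1,\dots,y_k\in\mathbb{R}^n$ and $h=(h_1,\dots,h_k)\in\mathbb{R}^k$, the piecewise linear convex function is defined as

$$u_h(x) = \max_i\{\langle x,y_i\rangle+h_i\}.$$

The graph of $u_h$ is a convex polytope in $\mathbb{R}^{n+1}$, and its projection induces a cell decomposition of $\mathbb{R}^n$. Each cell is a closed convex polytope,

$$W_i(h) = \{x\in\mathbb{R}^n \mid \langle x,y_i\rangle+h_i \ge \langle x,y_j\rangle+h_j,\ \forall j\}.$$

Some cells may be empty or unbounded. Given a probability measure $\mu$ defined on $\Omega$, the volume of $W_i(h)$ is defined as

$$w_i(h) := \mu(W_i(h)\cap\Omega) = \int_{W_i(h)\cap\Omega}d\mu.$$

Theorem 4.3 (Gu-Luo-Sun-Yau [12])

Let $\Omega$ be a compact convex domain in $\mathbb{R}^n$, $\{y_1,\dots,y_k\}$ be a set of distinct points in $\mathbb{R}^n$ and $\mu$ a probability measure on $\Omega$. Then for any $\nu_1,\dots,\nu_k>0$ with $\sum_{i=1}^k\nu_i=1$, there exists $h=(h_1,\dots,h_k)\in\mathbb{R}^k$, unique up to adding a constant $(c,\dots,c)$, so that $w_i(h)=\nu_i$ for all $i$. The vectors $h$ are exactly the maximum points of the concave function

$$E(h) = \sum_{i=1}^k h_i\nu_i - \int_0^h\sum_{i=1}^k w_i(\eta)\,d\eta_i \tag{19}$$

on the open convex set

$$H = \{h\in\mathbb{R}^k \mid w_i(h)>0,\ \forall i\}.$$

Furthermore, $\nabla u_h$ minimizes the quadratic cost

$$\frac{1}{2}\int_\Omega |x-T(x)|^2\,d\mu(x)$$

among all transport maps $T_\#\mu=\nu$, where $\nu$ is the Dirac measure $\nu=\sum_{i=1}^k\nu_i\delta_{y_i}$.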

For the convenience of discussion, we define the Alexandrov’s potential as follows:

Definition 4.4 (Alexandrov Potential)

Under the above condition, the convex function

$$u_h(x) = \max_i\{\langle x,y_i\rangle+h_i\} \tag{20}$$

is called the Alexandrov potential.

Figure 4: Geometric Interpretation to Optimal Transport Map: Brenier potential , Legendre dual , optimal transportation map , power diagram , weighted Delaunay triangulation .
Figure 5: Power diagram (blue) and its dual weighted Delaunay triangulation (black), the power weight equal to the square of radius (red circle).

4.2 Power Diagram

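Alexandrov’s theorem is closely related to the conventional power diagram. We can use a power diagram algorithm to solve Alexandrov’s problem.

Definition 4.5 (power distance)

Given a point $y_i\in\mathbb{R}^n$ with a power weight $\psi_i$, the power distance is given by

$$\mathrm{pow}(x,y_i) = |x-y_i|^2-\psi_i.$$

Definition 4.6 (power diagram)

Given weighted points $(y_1,\psi_1),\dots,(y_k,\psi_k)$, the power diagram is the cell decomposition of $\mathbb{R}^n$, denoted as $\mathcal{V}(\psi)$,

$$\mathbb{R}^n = \bigcup_{i=1}^k W_i(\psi),$$

where each cell is a convex polytope

$$W_i(\psi) = \{x\in\mathbb{R}^n \mid \mathrm{pow}(x,y_i)\le\mathrm{pow}(x,y_j),\ \forall j\}.$$

The weighted Delaunay triangulation, denoted as $\mathcal{T}(\psi)$, is the Poincaré dual to the power diagram: if $W_i(\psi)\cap W_j(\psi)\neq\emptyset$, then there is an edge connecting $y_i$ and $y_j$ in the weighted Delaunay triangulation.

Note that $\mathrm{pow}(x,y_i)\le\mathrm{pow}(x,y_j)$ is equivalent to

$$\langle x,y_i\rangle+\frac{1}{2}(\psi_i-|y_i|^2) \ge \langle x,y_j\rangle+\frac{1}{2}(\psi_j-|y_j|^2);$$

let

$$h_i = \frac{1}{2}(\psi_i-|y_i|^2), \tag{21}$$

then we construct the convex function

$$u_h(x) = \max_i\{\langle x,y_i\rangle+h_i\}. \tag{22}$$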
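The correspondence between power cells and the facets of the upper envelope can be checked numerically. The sketch below assumes the convention $\mathrm{pow}(x,y_i)=|x-y_i|^2-\psi_i$ and heights $h_i=\frac{1}{2}(\psi_i-|y_i|^2)$; the sites, weights and query points are random illustrative data, and the function names are hypothetical.

```python
import random


def power_distance(x, y, weight):
    """pow(x, y) = |x - y|^2 - weight (assumed convention)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) - weight


def power_cell(x, sites, weights):
    """Index of the site whose power distance to x is smallest."""
    return min(range(len(sites)),
               key=lambda i: power_distance(x, sites[i], weights[i]))


def envelope_cell(x, sites, weights):
    """Index of the plane <x, y_i> + h_i on top, with h_i = (w_i - |y_i|^2) / 2."""
    def plane(i):
        h_i = 0.5 * (weights[i] - sum(b * b for b in sites[i]))
        return sum(a * b for a, b in zip(x, sites[i])) + h_i
    return max(range(len(sites)), key=plane)


rng = random.Random(0)
sites = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
weights = [rng.uniform(0.0, 0.3) for _ in range(6)]
queries = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
agree = sum(power_cell(q, sites, weights) == envelope_cell(q, sites, weights)
            for q in queries)
print(agree, "of", len(queries), "queries land in the same cell")
```

Both lookups partition the plane identically, so a power diagram routine from computational geometry can be reused to evaluate the cells of the piecewise linear convex function.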

4.3 Convex Optimization

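Now, we can use the power diagram to explain the gradient and the Hessian of the energy Eqn. 19. By definition,

$$\nabla E(h) = (\nu_1-w_1(h),\ \nu_2-w_2(h),\ \dots,\ \nu_k-w_k(h))^T. \tag{23}$$

The Hessian matrix is given by the power diagram and its dual weighted Delaunay triangulation. For adjacent cells $W_i(h)$ and $W_j(h)$ in the power diagram,

$$\frac{\partial^2E(h)}{\partial h_i\,\partial h_j} = -\frac{\partial w_i(h)}{\partial h_j} = \frac{\mu(W_i(h)\cap W_j(h)\cap\Omega)}{|y_j-y_i|}. \tag{24}$$

Suppose edge $e_{ij}$ is in the weighted Delaunay triangulation, connecting $y_i$ and $y_j$. It has a unique dual cell $D_{ij}$ in the power diagram, then

$$\frac{\partial^2E(h)}{\partial h_i\,\partial h_j} = \frac{\mu(D_{ij})}{\mathrm{vol}(e_{ij})},$$

the volume ratio between the dual cells. The diagonal element in the Hessian is

$$\frac{\partial^2E(h)}{\partial h_i^2} = -\sum_{j\neq i}\frac{\partial^2E(h)}{\partial h_i\,\partial h_j}. \tag{25}$$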

Therefore, in order to solve Alexandrov’s problem to construct the convex polytope with user prescribed normal and face volume, we can optimize the energy in Eqn. 19 using classical Newton’s method directly.
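The optimization can be sketched in one dimension, where the cells of $u_h$ are intervals and their measures are available in closed form. The example below is a simplified stand-in rather than the Newton solver described above: it uses plain gradient ascent, assuming the gradient of the energy has components $\nu_i-w_i(h)$, with $\mu$ taken to be the uniform measure on $[0,1]$. The sites, target masses, step size and iteration count are illustrative assumptions.

```python
def cell_widths(h, ys, a=0.0, b=1.0):
    """Lengths of W_i = {x in [a, b] : x*y_i + h_i >= x*y_j + h_j for all j}."""
    widths = []
    for i in range(len(ys)):
        left, right = a, b
        for j in range(len(ys)):
            if j == i:
                continue
            # Lines i and j cross at this x; line i is on top on one side of it.
            crossing = (h[j] - h[i]) / (ys[i] - ys[j])
            if ys[i] > ys[j]:
                left = max(left, crossing)
            else:
                right = min(right, crossing)
        widths.append(max(0.0, right - left))
    return widths


def solve_heights(ys, nu, step=0.05, iters=2000):
    """Gradient ascent on the concave energy; dE/dh_i = nu_i - w_i(h)."""
    h = [-0.5 * y * y for y in ys]          # start from the Voronoi diagram
    for _ in range(iters):
        w = cell_widths(h, ys)
        h = [hi + step * (ni - wi) for hi, ni, wi in zip(h, nu, w)]
    return h


ys = [0.2, 0.5, 0.9]        # illustrative target sites (must be distinct)
nu = [0.5, 0.3, 0.2]        # illustrative target masses, summing to 1
h = solve_heights(ys, nu)
print(cell_widths(h, ys))
```

Newton's method would replace the fixed step with a solve against the Hessian, which in this one-dimensional setting is tridiagonal, and would typically need far fewer iterations.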

Let us observe the convex function $u_h^*$, the Legendre dual of $u_h$: its graph is the convex hull $\mathcal{C}(h)$. The discrete Hessian determinant of $u_h^*$ assigns to each vertex $v$ of $\mathcal{C}(h)$ the volume of the convex hull of the gradients of $u_h^*$ at the top-dimensional cells adjacent to $v$. Therefore, solving Alexandrov’s problem is equivalent to solving a discrete Monge-Ampère equation.

5 Semi-discrete Optimal Mass Transport

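In this section, we solve the semi-discrete optimal transportation problem from a geometric point of view. This special case is useful in practice.

Suppose $\mu$ has compact support on $X$, and assume $\Omega=\mathrm{supp}\,\mu$ is a convex domain in $X$,

$$\Omega = \mathrm{supp}\,\mu = \overline{\{x\in X \mid \mu(x)>0\}}.$$

The space $Y$ is discretized to $Y=\{y_1,y_2,\dots,y_k\}$ with Dirac measure $\nu=\sum_{j=1}^k\nu_j\delta(y-y_j)$. The total masses are equal,

$$\int_\Omega d\mu(x) = \sum_{i=1}^k\nu_i.$$

5.1 Kantorovich Dual Approach

We define the discrete Kantorovich potential $\varphi:Y\to\mathbb{R}$, $\varphi(y_j)=\varphi_j$, then

$$\int_Y\varphi\,d\nu = \sum_{j=1}^k\varphi_j\nu_j. \tag{26}$$

The c-transform of $\varphi$ is given by

$$\varphi^c(x) = \min_{1\le j\le k}\{c(x,y_j)-\varphi_j\}. \tag{27}$$

This induces a cell decomposition of $X$,

$$X = \bigcup_{i=1}^k W_i(\varphi),$$

where each cell is given by

$$W_i(\varphi) = \{x\in X \mid c(x,y_i)-\varphi_i \le c(x,y_j)-\varphi_j,\ \forall 1\le j\le k\}.$$

According to the dual formulation of the Wasserstein distance Eqn. 13 and the integration Eqn. 26, we define the energy

$$E(\varphi) = \int_X\varphi^c\,d\mu + \int_Y\varphi\,d\nu,$$

then obtain the formula

$$E(\varphi) = \sum_{i=1}^k\varphi_i\,\big(\nu_i-w_i(\varphi)\big) + \sum_{j=1}^k\int_{W_j(\varphi)}c(x,y_j)\,d\mu, \tag{28}$$

where $w_i(\varphi)$ is the measure of the cell $W_i(\varphi)$,

$$w_i(\varphi) = \mu(W_i(\varphi)) = \int_{W_i(\varphi)}d\mu(x). \tag{29}$$

Then the Wasserstein distance between $\mu$ and $\nu$ equals

$$W_c(\mu,\nu) = \max_\varphi E(\varphi).$$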

5.2 Brenier’s Approach

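Kantorovich’s dual approach is for general cost functions. When the cost function is the $L^2$ distance $c(x,y)=|x-y|^2$, we can apply Brenier’s approach directly.

We define a height vector $h=(h_1,h_2,\dots,h_k)\in\mathbb{R}^k$, consisting of $k$ real numbers. For each $y_i\in Y$, we construct a hyperplane defined on $X$, $\pi_i(h):\langle x,y_i\rangle+h_i$. We define the Brenier potential function as

$$u_h(x) = \max_{i=1}^k\{\langle x,y_i\rangle+h_i\}, \tag{30}$$

then $u_h(x)$ is a convex function. The graph of $u_h(x)$ is an infinite convex polyhedron with supporting planes $\pi_i(h)$. The projection of the graph induces a polygonal partition of $\Omega$,

$$\Omega = \bigcup_{i=1}^k W_i(h), \tag{31}$$

where each cell $W_i(h)$ is the projection of a facet of the graph of $u_h$ onto $\Omega$,

$$W_i(h) = \{x\in X \mid \nabla u_h(x)=y_i\}\cap\Omega. \tag{32}$$

The measure of $W_i(h)$ is given by

$$w_i(h) = \int_{W_i(h)}d\mu. \tag{33}$$

The convex function $u_h$ on each cell $W_i(h)$ is a linear function $\pi_i(h)$, therefore, the gradient map

$$\nabla u_h: W_i(h)\to y_i,\quad i=1,2,\dots,k \tag{34}$$

maps each $W_i(h)$ to a single point $y_i$. According to Alexandrov’s theorem and the Gu-Luo-Sun-Yau theorem, we obtain the following corollary: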

Corollary 5.1

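Let $\Omega$ be a compact convex domain in $\mathbb{R}^n$, $\{y_1,\dots,y_k\}$ be a set of distinct points in $\mathbb{R}^n$ and $\mu$ a probability measure on $\Omega$. Then for any $\nu=\sum_{i=1}^k\nu_i\delta_{y_i}$, with $\sum_{i=1}^k\nu_i=\mu(\Omega)$, there exists $h=(h_1,h_2,\dots,h_k)\in\mathbb{R}^k$, unique up to adding a constant $(c,c,\dots,c)$, so that $w_i(h)=\nu_i$ for all $i$. The vectors $h$ are exactly the maximum points of the concave function

$$E(h) = \sum_{i=1}^k h_i\nu_i - \int_0^h\sum_{i=1}^k w_i(\eta)\,d\eta_i. \tag{35}$$

Furthermore, $\nabla u_h$ minimizes the quadratic cost

$$\int_\Omega |x-T(x)|^2\,d\mu$$

among all transport maps $T_\#\mu=\nu$.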

5.3 Equivalence

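For the $L^2$ cost, we have introduced two approaches: Kantorovich’s dual approach and Brenier’s approach. In the following, we show that these two approaches are equivalent.

In Kantorovich’s dual approach, finding the optimal mass transportation is equivalent to maximizing the following energy:

$$E_D(\varphi) = \sum_{i=1}^k\varphi_i\,\big(\nu_i-w_i(\varphi)\big) + \sum_{j=1}^k\int_{W_j(\varphi)}c(x,y_j)\,d\mu.$$

In Brenier’s approach, finding the optimal transportation map boils down to maximizing

$$E_B(h) = \sum_{i=1}^k h_i\nu_i - \int_0^h\sum_{i=1}^k w_i(\eta)\,d\eta_i.$$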

Lemma 5.2

Let be a compact convex domain in , be a set of distinct points in . Given a probability measure on , , with . If , then

and

Figure 6: Variation of the volume of top-dimensional cells.

Proof: Consider the power cell

is equivalent to

therefore .

Let the transportation cost be defined as

Suppose we infinitesimally change to , then we define

Then , also . For each , , then , hence

This shows , hence

The Legendre dual of is

Hence

On the other hand, , ,

We put everything together

where and are two constants.

This shows that Kantorovich’s dual approach and Brenier’s approach are equivalent. At the optimal point, , therefore equals the transportation cost . Furthermore, the Brenier potential is

where is given by the power weight . The Kantorovich’s potential is the power distance

hence at the optimum, the Brenier potential and the Kantorovich potential are related by

(36)
Figure 7: The framework for Geometric Generative Model.

6 Geometric Generative Model

In this section, we propose a novel generative framework, which combines the discriminator and the generator together. The model decouples the following two processes:

  1. Encoding/decoding process: this step maps the samples between the image space and the latent (feature) space by using deep neural networks, the encoding map is denoted as , the decoding map is . This step achieves the dimension reduction.

  2. Probability measure transformation process: this step transforms a fixed distribution to any given distribution . The mapping is denoted as , . This step can either use a conventional deep neural network or use explicit geometric/numerical methods.

There are many existing methods to accomplish the encoding/decoding process, such as VAE model [23], therefore we focus on the second step.

As shown in Fig. 7, given an empirical distribution in the original ambient space , the support of is a sub-manifold . The encoding map transforms the support manifold to the latent (or feature) space , and pushes forward the empirical distribution to defined on the latent space

(37)

where .

Let be a fixed measure on the latent space, we would like to find an optimal transportation map , such that . This is equivalent to find the Brenier potential

Note that, can be easily represented by linear combinations and ReLUs. The height parameter can be obtained by optimizing the energy Eqn. 19

The optimal transportation map . This can be carried out as a power diagram with weighted points , where

The relation between the Kantorovich potential and the Brenier potential is

The Wasserstein distance can be explicitly given by

We use to denote the decoding map. Finally, the composition transforms in the latent space to the original empirical distribution in the image space .
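The generation step of this framework can be sketched in a one-dimensional latent space. The example below assumes the Brenier potential has the form $u_h(x)=\max_i\{x\,y_i+h_i\}$ and builds it from the identity $\max(a,b)=a+\mathrm{ReLU}(b-a)$, which is one way a ReLU network can represent it. The latent samples `ys`, the heights `h` and the function names are hypothetical; the heights were chosen by hand so that the cells of a uniform measure on $[0,1]$ have lengths 0.5, 0.3 and 0.2.

```python
def relu(t):
    return t if t > 0.0 else 0.0


def potential_via_relu(x, ys, h):
    """u_h(x) = max_i (x*y_i + h_i), built from max(a, b) = a + relu(b - a)."""
    planes = [x * y + hi for y, hi in zip(ys, h)]
    u = planes[0]
    for p in planes[1:]:
        u = u + relu(p - u)
    return u


def transport(x, ys, h):
    """T(x) = gradient of u_h at x: the site whose plane attains the maximum."""
    top = max(range(len(ys)), key=lambda i: x * ys[i] + h[i])
    return ys[top]


# Illustrative 1D latent setting: uniform measure on [0, 1], three target samples.
ys = [0.2, 0.5, 0.9]
h = [0.0, -0.15, -0.47]     # hand-picked heights: planes cross at x = 0.5 and x = 0.8

latent = [(k + 0.5) / 1000 for k in range(1000)]     # evenly spread latent codes
generated = [transport(z, ys, h) for z in latent]
counts = {y: generated.count(y) for y in ys}
print(counts)
```

Each latent code is sent to exactly one sample, and the share of codes reaching a sample matches the length of its cell, so no generator training is needed once the heights are known.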

7 Experiments

In order to demonstrate in principle the potential of our proposed method, we have designed and conducted the preliminary experiments.

7.1 Comparison with WGAN

(a) initial stage (b) after iterations
(c) after iterations (d) final stage, after iterations
Figure 8: WGAN learns the Gaussian mixture distribution.
(a) Brenier potential (b) Optimal transportation map : power diagram cell
Figure 9: Geometric model learns the Gaussian mixture distribution .

In the first experiment, we use Wasserstein Generative Adversarial Networks (WGANs) [3] to learn the mixed Gaussian distribution as shown in Fig. 8.

Dataset

The distribution of data is described by a point cloud on a plane. We sample data points as real data from two Gaussian distributions, , , where and , and . The latent space is a square on the plane , the input distribution is the uniform distribution on . We use a generator to generate data from to approximate the data distribution . We generate samples in total.

Network Structure

The structure of the discriminator is 2-layer ( FC)-ReLU-( FC) network, where FC denotes the fully connected layer. The number of inputs is and the number of outputs is . The number of nodes of the hidden layer is .

The structure of the generator is a 6-layer ( FC)-ReLU-( FC)-ReLU-( FC)-ReLU-( FC)-ReLU-( FC)-ReLU-( FC) network. The number of inputs is and the number of outputs is . The number of nodes of all the hidden layers is .

Parameter Setting

For WGAN, we clip all the weights to . We use RMSprop [16] as the optimizer for both the discriminator and the generator. The learning rates of both the discriminator and the generator are set to .

Deep learning framework and hardware

We use PyTorch [1] as our deep learning framework. Since the toy dataset is small, we run the experiments on CPU. We perform the experiments on a cluster with cores and RAM. However, for this toy data, the running code only consumes core with less than RAM, which means that it can run on a personal computer.

Results analysis

In Fig. 8, the blue points represent the real data distribution and the orange points represent the generated distribution. The left frame shows the initial stage, the right frame illustrates the stage after 1000 iterations. It appears that WGAN cannot capture the Gaussian mixture distribution: the generated data tend to lie in the middle of the two Gaussians. One reason is the well known mode collapse problem in GAN, meaning that if the data distribution has multiple clusters or the data is distributed on multiple isolated manifolds, then it is hard for the generator to learn all the modes well. Although a couple of methods have been proposed to deal with this problem [15, 17], these methods require the number of clusters as input, and determining that number is still an open problem in the machine learning community.

Geometric OMT

Figure 9 shows the geometric method to solve the same problem. The left frame shows the Brenier potential , namely the upper envelope, which projects to the power diagram on a unit disk , . The right frame shows the discrete optimal transportation map , which maps each cell to a sample , the cell and the sample have the same color. All the cells have the same area, this demonstrates that pushes the uniform distribution to the exact empirical distribution .

The samples are generated according to the same Gaussian mixture distribution, therefore there are two clusters. This does not cause any difficulty for the geometric method. In the left frame, we can see that the upper envelope has a sharp ridge and the gradients point to the two clusters. Hence, the geometric method outperforms the WGAN model in the current experiment.

7.2 Geometric Method

(a) Supporting manifold <