Geometric Understanding of Deep Learning
Abstract
Deep learning is the mainstream technique for many machine learning tasks, including image recognition, machine translation, and speech recognition. It has outperformed conventional methods in various fields and achieved great success. Unfortunately, how it works remains poorly understood, so laying down a theoretical foundation for deep learning is of central importance.
In this work, we give a geometric view of deep learning: we show that the fundamental principle underlying its success is the manifold structure in data. Namely, natural high-dimensional data concentrates close to a low-dimensional manifold, and deep learning learns the manifold and the probability distribution on it.
We further introduce the concept of the rectified linear complexity of a deep neural network, which measures its learning capability, and the rectified linear complexity of an embedded manifold, which describes how difficult it is to learn. We then show that for any deep neural network with a fixed architecture, there exists a manifold that cannot be learned by the network. Using empirical evidence, we also demonstrate that the learning accuracy of state-of-the-art autoencoders is reasonably good but leaves considerable room for improvement. Finally, we propose applying optimal mass transportation theory to control the probability distribution in the latent space.
1 Introduction
Deep learning is the mainstream technique for many machine learning tasks, including image recognition, machine translation, and speech recognition [12]. It has outperformed conventional methods in various fields and achieved great success. Unfortunately, how it works remains poorly understood, so laying down a theoretical foundation for deep learning is of central importance.
We believe that the fundamental principle explaining the success of deep learning is the manifold structure in the data. This is the well-accepted manifold assumption: natural high-dimensional data concentrates close to a nonlinear low-dimensional manifold.
Manifold Representation
The main focus of various deep learning methods is to learn the manifold structure from real data and obtain a parametric representation of the manifold. In general, there is a probability distribution $\mu$ in the ambient space $\mathcal{X}$, and the support of $\mu$ is a low-dimensional manifold $\Sigma \subset \mathcal{X}$. For example, an autoencoder learns the encoding map $\varphi_\theta: \mathcal{X} \to \mathcal{F}$ and the decoding map $\psi_\theta: \mathcal{F} \to \mathcal{X}$, where $\mathcal{F}$ is the latent space. The parametric representation of the input manifold $\Sigma$ is given by the decoding map $\psi_\theta$. The reconstructed manifold $\tilde{\Sigma} = \psi_\theta \circ \varphi_\theta(\Sigma)$ approximates the input manifold. Furthermore, the DNN also learns and controls the distribution $(\varphi_\theta)_*\mu$ induced by the encoder on the latent space. Once the parametric manifold structure is obtained, it can be used in various applications, such as randomly generating a sample on $\tilde{\Sigma}$ as a generative model. Image denoising can be reinterpreted geometrically as projecting a noisy sample onto $\tilde{\Sigma}$, which represents the clean image manifold; the closest point on $\tilde{\Sigma}$ gives the denoised image.
Learning Capability
An autoencoder implemented by a ReLU DNN offers a piecewise linear function space, and the manifold structure can be learned by optimizing special loss functions. We introduce the concept of the rectified linear complexity of a DNN, which is the upper bound on the number of pieces of all the functions representable by the DNN and gives a measure of the learning capability of the DNN. On the other hand, the piecewise linear encoding map $\varphi_\theta$ defined on the ambient space is required to be a homeomorphism from $\Sigma$ to a domain in $\mathcal{F}$. This requirement imposes strong topological constraints on the input manifold $\Sigma$. We introduce another concept, the rectified linear complexity of an embedded manifold $(\mathcal{X}, \Sigma)$, which is the minimal number of pieces of a PL encoding map and measures how difficult the manifold is to encode by a DNN. By comparing the complexities of the DNN and the manifold, we can verify whether the DNN can learn the manifold in principle. Furthermore, we show that for any DNN with a fixed architecture, there exists an embedded manifold that cannot be encoded by the DNN.
Approximation Accuracy
The reconstructed manifold approximates the manifold in the data. The approximation accuracy is analyzed by empirical experiments on low-dimensional surfaces. Current autoencoders accomplish the learning tasks reasonably well but leave considerable room for improvement: subtle geometric details are lost during the encoding/decoding process, and the mappings are not globally homeomorphic.
Latent Probability Distribution Control
The distribution $(\varphi_\theta)_*\mu$ induced by the encoding map can be controlled by designing special loss functions to modify the encoding map $\varphi_\theta$. We also propose to use optimal mass transportation theory to find the optimal transportation map defined on the latent space, which transforms simple distributions, such as Gaussian or uniform, to $(\varphi_\theta)_*\mu$. Compared with the conventional WGAN model, this method replaces the black box by an explicit mathematical construction, and avoids the competition between the generator and the discriminator.
1.1 Contributions
This work proposes a geometric framework for understanding autoencoders and general deep neural networks, and explains the main theoretical reason for the great success of deep learning: the manifold structure hidden in data. The work introduces the concept of the rectified linear complexity of a ReLU DNN to measure its learning capability, and the rectified linear complexity of an embedded manifold to describe the encoding difficulty. By applying these complexities, it is shown that for any DNN with a fixed architecture, there is a manifold too complicated to be encoded by the DNN. The work also shows that the approximation accuracy of current autoencoders is reasonable but still low, leaving much room for improvement. Finally, the work proposes applying an optimal mass transportation map to control the distribution on the latent space.
1.2 Organization
The current work is organized in the following way: section 2 briefly reviews the literature of autoencoders; section 3 explains the manifold representation; section 4 quantifies the learning capability of a DNN and the learning difficulty for a manifold; section 5 empirically analyzes the approximation accuracy of the reconstructed manifold; section 6 proposes to control the probability measure induced by the encoder using optimal mass transportation theory.
2 Previous Works
The literature on autoencoders is vast; in the following we only briefly review the most closely related works as representatives.
Traditional Autoencoders (AE)
The traditional autoencoder (AE) framework first appeared in [2], where it was proposed for dimensionality reduction; [2] uses a linear autoencoder and compares it with PCA. With the same purpose, [14] proposed a deep autoencoder architecture, where the encoder and the decoder are multi-layer deep networks. Because deep networks are non-convex, they easily converge to poor local optima when the weights are randomly initialized. To solve this problem, [14] used restricted Boltzmann machines (RBMs) to pre-train the model layer by layer before fine-tuning. Later, [4] used traditional AEs to pre-train each layer and obtained similar results.
Sparse Encoders
The traditional AE uses a bottleneck structure: the width of the middle layer is less than that of the input layer. The sparse autoencoder (SAE) was introduced in [10]; it uses an over-complete latent space, that is, the middle layer is wider than the input layer. Further sparse autoencoders were proposed in [19, 21, 20].
Denoising Autoencoder (DAE)
[30, 29] proposed the denoising autoencoder (DAE) to improve robustness to corrupted input. DAEs add regularization on the inputs in order to reconstruct a “repaired” input from a corrupted version. Stacked denoising autoencoders (SDAEs) are constructed by stacking multiple layers of DAEs, where each layer is pre-trained by DAEs. The DAE/SDAE is suitable for denoising purposes, such as speech recognition [9, 9], removing music from speech [33], medical image denoising [11], and super-resolution [7].
Contractive Autoencoders (CAEs)
[24] proposed contractive autoencoders (CAEs) to achieve robustness by minimizing the first-order variation, the Jacobian. The concept of contraction ratio is introduced, which is similar to the Lipschitz constant. In order to learn the low-dimensional structure of the input data, the penalty on the reconstruction error encourages the contraction ratios in the tangential directions of the manifold to be close to $1$, and in the directions orthogonal to the manifold to be close to $0$. Their experiments showed that the learned representations performed as well as DAEs on classification problems and that their contraction properties are similar. Following this work, [23] proposed the higher-order CAE, which adds an additional penalty on all higher derivatives.
Generative Model
Autoencoders can be transformed into a generative model by sampling in the latent space and then decoding the samples to obtain new data. [30] first implemented this idea by applying Bernoulli sampling to AEs and DAEs. [5] used Gibbs sampling to alternately sample between the input space and the latent space, and turned DAEs into generative models. They also proved that the generated distribution is consistent with the distribution of the dataset. [22] proposed a generative model by sampling from CAEs; they used the information of the Jacobian to sample around the latent space.
The variational autoencoder (VAE) [15] uses a probabilistic perspective to interpret autoencoders. Suppose the real data distribution is $\mu$ in $\mathcal{X}$; the encoding map $\varphi_\theta: \mathcal{X} \to \mathcal{F}$ pushes $\mu$ forward to a distribution $(\varphi_\theta)_*\mu$ in the latent space $\mathcal{F}$. The VAE optimizes $\varphi_\theta$ such that $(\varphi_\theta)_*\mu$ is normally distributed in the latent space.
Following the big success of GANs, [17] proposed adversarial autoencoders (AAEs), which use GANs to minimize the discrepancy between the push-forward distribution and the desired distribution in the latent space.
3 Manifold Structure
We believe that the fundamental principle explaining the success of deep learning is the manifold structure in the data, namely that natural high-dimensional data concentrates close to a nonlinear low-dimensional manifold.
The goal of deep learning is to learn the manifold structure in data and the probability distribution associated with the manifold.
3.1 Concepts and Notations
The concepts related to manifold are from differential geometry, and have been translated to the machine learning language.
Definition 3.1 (Manifold).
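An $n$-dimensional manifold $\Sigma$ is a topological space covered by a set of open sets, $\Sigma \subset \bigcup_\alpha U_\alpha$. For each open set $U_\alpha$, there is a homeomorphism $\varphi_\alpha: U_\alpha \to \mathbb{R}^n$; the pair $(U_\alpha, \varphi_\alpha)$ forms a chart. The union of charts forms an atlas $\mathcal{A} = \{(U_\alpha, \varphi_\alpha)\}$. If $U_\alpha \cap U_\beta \neq \emptyset$, then the chart transition map is given by $\varphi_{\alpha\beta}: \varphi_\alpha(U_\alpha \cap U_\beta) \to \varphi_\beta(U_\alpha \cap U_\beta)$, $\varphi_{\alpha\beta} := \varphi_\beta \circ \varphi_\alpha^{-1}$.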
Figure panels: (a) input manifold; (b) latent representation; (c) reconstructed manifold; (d) cell decomposition; (e) induced latent space cell decomposition; (f) cell decomposition.
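As shown in Fig. 1, suppose $\mathcal{X}$ is the ambient space and $\mu$ is a probability distribution defined on $\mathcal{X}$, represented as a density function $\mu: \mathcal{X} \to \mathbb{R}_{\ge 0}$. The support of $\mu$,
$$\Sigma(\mu) := \{x \in \mathcal{X} \mid \mu(x) > 0\},$$
is a low-dimensional manifold. $(U_\alpha, \varphi_\alpha)$ is a local chart, $\varphi_\alpha: U_\alpha \to \mathcal{F}$ is called an encoding map, and the parameter domain $\mathcal{F}$ is called the latent space or feature space. A point $x \in \Sigma$ is called a sample, and its parameter $\varphi_\alpha(x)$ is called the code or feature of $x$. The inverse map $\psi_\alpha := \varphi_\alpha^{-1}: \mathcal{F} \to \Sigma$ is called the decoding map. Locally, $\psi_\alpha: \mathcal{F} \to \Sigma$ gives a local parametric representation of the manifold.
Furthermore, the encoding map $\varphi_\alpha: U_\alpha \to \mathcal{F}$ induces a push-forward probability measure $(\varphi_\alpha)_*\mu$ defined on the latent space $\mathcal{F}$: for any measurable set $B \subset \mathcal{F}$,
$$(\varphi_\alpha)_*\mu(B) := \mu\left(\varphi_\alpha^{-1}(B)\right).$$
The goal of deep learning is to learn the encoding map $\varphi_\alpha$, the decoding map $\psi_\alpha$, the parametric representation of the manifold $\psi_\alpha: \mathcal{F} \to \Sigma$, and furthermore the push-forward probability $(\varphi_\alpha)_*\mu$. In the following, we explain how an autoencoder learns the manifold and the distribution.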
3.2 Manifold Learned by an Autoencoder
Autoencoders are commonly used for unsupervised learning [3]; they have been applied to compression, denoising, pre-training and so on. At an abstract level, an autoencoder learns the low-dimensional structure of data and represents it as a parametric polyhedral manifold, namely a piecewise linear (PL) map from the latent space (parameter domain) to the ambient space, whose image is a manifold. The autoencoder then uses the polyhedral manifold as an approximation of the manifold in the data for various applications. At the implementation level, an autoencoder partitions the manifold into pieces (by decomposing the ambient space into cells) and approximates each piece by a hyperplane, as shown in Fig. 2.
Architecturally, an autoencoder is a feedforward, non-recurrent neural network whose output layer has the same number of nodes as the input layer, with the purpose of reconstructing its own inputs. In general, a bottleneck layer is added for the purpose of dimensionality reduction. The input space is the ambient space, and the output space is also the ambient space. The output space of the bottleneck layer is the latent space.
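$$(\mathcal{X}, x),\ \mu,\ \Sigma \;\xrightarrow{\ \varphi\ }\; (\mathcal{F}, z),\ D \;\xrightarrow{\ \psi\ }\; (\mathcal{X}, \tilde{x}),\ \tilde{\Sigma}, \qquad \text{with composite } \psi \circ \varphi: (\mathcal{X}, x) \to (\mathcal{X}, \tilde{x}).$$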
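An autoencoder always consists of two parts, the encoder and the decoder. The encoder takes a sample $x \in \mathcal{X}$ and maps it to $z \in \mathcal{F}$, $z = \varphi(x)$; the image $z$ is usually referred to as the latent representation of $x$. The encoder $\varphi: \mathcal{X} \to \mathcal{F}$ maps $\Sigma$ to its latent representation $D = \varphi(\Sigma)$ homeomorphically. After that, the decoder $\psi: \mathcal{F} \to \mathcal{X}$ maps $z$ to the reconstruction $\tilde{x}$, which has the same shape as $x$: $\tilde{x} = \psi(z) = \psi \circ \varphi(x)$. Autoencoders are trained to minimise the reconstruction error:
$$\varphi, \psi = \arg\min_{\varphi, \psi} \int_{\mathcal{X}} \mathcal{L}\left(x, \psi \circ \varphi(x)\right)\, d\mu(x),$$
where $\mathcal{L}(\cdot, \cdot)$ is the loss function, such as the squared error. The reconstructed manifold $\tilde{\Sigma} = \psi \circ \varphi(\Sigma)$ is used as an approximation of $\Sigma$.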
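In practice, both the encoder and the decoder are implemented as ReLU DNNs, parameterized by $\theta$. Let $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(k)}\}$ be the training data set, $X \subset \Sigma$; the autoencoder optimizes the following loss function:
$$\min_\theta \mathcal{L}(\theta) = \min_\theta \frac{1}{k} \sum_{i=1}^{k} \left\| x^{(i)} - \psi_\theta \circ \varphi_\theta(x^{(i)}) \right\|^2.$$
Both the encoder $\varphi_\theta$ and the decoder $\psi_\theta$ are piecewise linear mappings. The encoder $\varphi_\theta$ induces a cell decomposition $\mathcal{D}(\varphi_\theta)$ of the ambient space,
$$\mathcal{D}(\varphi_\theta): \quad \mathcal{X} = \bigcup_\alpha U_\theta^\alpha,$$
where each $U_\theta^\alpha$ is a convex polyhedron and the restriction of $\varphi_\theta$ to it is an affine map. Similarly, the piecewise linear map $\psi_\theta \circ \varphi_\theta$ induces a polyhedral cell decomposition $\mathcal{D}(\psi_\theta \circ \varphi_\theta)$, which is a refinement (subdivision) of $\mathcal{D}(\varphi_\theta)$. The reconstructed polyhedral manifold has a parametric representation $\psi_\theta: \mathcal{F} \to \mathcal{X}$, which approximates the manifold $\Sigma$ in the data.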
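The empirical reconstruction loss of a ReLU autoencoder can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the layer widths (3, 16, 2), the random untrained weights, the samples on the unit sphere, and the helper names `init_layers`, `forward` and `reconstruction_loss` are hypothetical and are not the architecture or data used in the experiments of this paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_layers(widths, rng):
    """Random (W, b) pairs for consecutive layer widths (illustrative, untrained)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(widths[:-1], widths[1:])]

def forward(layers, x):
    """Apply ReLU after every layer except the last, which stays affine."""
    for W, b in layers[:-1]:
        x = relu(x @ W.T + b)
    W, b = layers[-1]
    return x @ W.T + b

def reconstruction_loss(enc, dec, X):
    """Mean over samples of the squared distance |x - psi(phi(x))|^2."""
    Z = forward(enc, X)        # latent codes z = phi(x)
    X_rec = forward(dec, Z)    # reconstructions psi(z)
    loss = float(np.mean(np.sum((X - X_rec) ** 2, axis=1)))
    return loss, Z, X_rec

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # assumed data: points on the unit sphere
enc = init_layers([3, 16, 2], rng)              # ambient dimension 3 -> latent dimension 2
dec = init_layers([2, 16, 3], rng)              # latent dimension 2 -> ambient dimension 3
loss, Z, X_rec = reconstruction_loss(enc, dec, X)
print(loss)
```

Training would minimise `loss` over the weights; the sketch only evaluates it once to make the composition of encoder and decoder concrete.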
Fig. 2 shows an example that demonstrates the learning results of an autoencoder. The ambient space $\mathcal{X}$ is $\mathbb{R}^3$, and the manifold $\Sigma$ is the buddha surface shown in frame (a). The latent space is $\mathbb{R}^2$; the encoding map $\varphi_\theta$ parameterizes the input manifold onto a domain $D \subset \mathcal{F}$, as shown in frame (b). The decoding map $\psi_\theta$ reconstructs the surface as a piecewise linear surface $\tilde{\Sigma}$, as shown in frame (c). In the ideal situation, the composition of the encoder and decoder, $\psi_\theta \circ \varphi_\theta$, should equal the identity map, and the reconstruction $\tilde{\Sigma}$ should coincide with the input $\Sigma$. In reality, the reconstruction $\tilde{\Sigma}$ is only a piecewise linear approximation of $\Sigma$. By examining Fig. 2, we can see that the encoding map is not homeomorphic everywhere. The mapping is degenerate near the finger and mouth regions, where multiple points on $\Sigma$ are mapped onto the same point in the latent space. The reconstructed surface $\tilde{\Sigma}$ crudely approximates the original input surface $\Sigma$, and some subtle geometric features are lost.
Fig. 2 also shows the cell decomposition $\mathcal{D}(\varphi_\theta)$ induced by the encoding map and the cell decomposition $\mathcal{D}(\psi_\theta \circ \varphi_\theta)$ induced by the reconstruction map. It can be seen that $\mathcal{D}(\psi_\theta \circ \varphi_\theta)$ subdivides $\mathcal{D}(\varphi_\theta)$.
3.3 Direct Applications
Once the neural network has learned a manifold $\Sigma$, it can be used in many applications.
Generative Model
Suppose $\mathcal{X}$ is the space of all color images, where each point represents an image. We can define a probability measure $\mu$, which represents the probability for an image to represent a human face. The shape of a human face is determined by a finite number of genes. A facial photo is determined by the geometry of the face, the lighting, the camera parameters and so on. Therefore, it is sensible to assume that all human facial photos are concentrated around a finite-dimensional manifold, which we call the human facial photo manifold $\Sigma$.
By using many real human facial photos, we can train an autoencoder to learn the human facial photo manifold. The learning process produces a decoding map $\psi_\theta: \mathcal{F} \to \tilde{\Sigma}$, namely a parametric representation of the reconstructed manifold. We randomly generate a parameter $z \in \mathcal{F}$ (white noise); then $\psi_\theta(z) \in \tilde{\Sigma}$ gives a human facial image. This can be applied as a generative model for generating human facial photos.
Denoising
Traditional image denoising performs a Fourier transform of the noisy input image, filters out the high-frequency components, and applies the inverse Fourier transform to obtain the denoised image. This method is general and independent of the content of the image.
In deep learning, image denoising can be reinterpreted as geometric projection, as shown in Fig. 3. Suppose we perform human facial image denoising. The clean human facial photo manifold is $\Sigma$; the noisy facial image $\tilde{p}$ is not in $\Sigma$ but close to $\Sigma$. We project $\tilde{p}$ to $\Sigma$; the closest point to $\tilde{p}$ on $\Sigma$ is $p$, and $p$ is the denoised image.
In practice, suppose a noisy facial image $x$ is given. We train an autoencoder to obtain a manifold of clean facial images, represented as $\psi_\theta: \mathcal{F} \to \mathcal{X}$, and an encoding map $\varphi_\theta: \mathcal{X} \to \mathcal{F}$. We then encode the noisy image, $z = \varphi_\theta(x)$, and $\psi_\theta$ maps $z$ to the reconstructed manifold, $\tilde{x} = \psi_\theta(z)$. The result $\tilde{x}$ is the denoised image. Fig. 4 shows the projection of several outliers onto the buddha surface using an autoencoder.
Figure panels: (a) projection onto the human facial image manifold; (b) projection onto the cat facial image manifold.
We apply this method to human facial image denoising, as shown in Fig. 5. In frame (a) we project the noisy image onto the human facial image manifold and obtain a good denoising result; in frame (b) we use the cat facial image manifold, and the results are meaningless. This shows that the deep learning method depends heavily on the underlying manifold, which is specific to the problem. Hence the deep learning based method is not as universal as the conventional ones.
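The geometric-projection view of denoising can be illustrated without training any network. The sketch below is a toy stand-in, not the autoencoder pipeline described above: the "clean manifold" is assumed to be the unit circle in the plane, it is represented by dense samples, and the projection is a brute-force nearest-point search. The names `sample_circle` and `project_to_manifold` are hypothetical.

```python
import numpy as np

def sample_circle(k):
    """Dense samples of the assumed clean manifold: the unit circle in R^2."""
    t = np.linspace(0.0, 2.0 * np.pi, k, endpoint=False)
    return np.stack([np.cos(t), np.sin(t)], axis=1)

def project_to_manifold(samples, p_noisy):
    """Return the sample closest to p_noisy (discrete closest-point projection)."""
    d2 = np.sum((samples - p_noisy) ** 2, axis=1)
    return samples[np.argmin(d2)]

manifold = sample_circle(3600)
p_noisy = np.array([1.3, 0.0])          # a point near, but not on, the circle
p_clean = project_to_manifold(manifold, p_noisy)
print(p_clean)
```

Projecting the same noisy point onto a different manifold would return a different, unrelated point, which mirrors the observation that the result depends on which manifold has been learned.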
4 Learning Capability
Figure panels: (a) input manifold; (b) latent representation; (c) reconstructed manifold; (d) cell decomposition; (e) cell decomposition; (f) level set.
4.1 Main Ideas
Fig. 6 shows another example, an Archimedean spiral curve embedded in $\mathbb{R}^2$. The curve equation is given by $\rho(\theta) = (a + b\theta)e^{i\theta}$, where $a, b > 0$ are constants and $\theta \in (0, T]$. For a relatively small range $T$, the encoder successfully maps the curve onto a straight line segment, and the decoder reconstructs a piecewise linear curve with good approximation quality. When we extend the spiral curve by enlarging $T$, then at some threshold the autoencoder with the same architecture fails to encode it.
The central problems we want to answer are as follows:
1. How can we determine the bound on the encoding or representation capability of an autoencoder with a fixed ReLU DNN architecture?
2. How can we describe and compute the complexity of a manifold embedded in the ambient space that is to be encoded?
3. How can we verify whether an embedded manifold can be encoded by a ReLU DNN autoencoder?
For the first problem, our solutions are based on the geometric intuition of the piecewise linear nature of the encoder/decoder maps. By examining Fig. 2 and Fig. 6, we can see that the mappings $\varphi_\theta$ and $\psi_\theta \circ \varphi_\theta$ induce polyhedral cell decompositions of the ambient space $\mathcal{X}$, $\mathcal{D}(\varphi_\theta)$ and $\mathcal{D}(\psi_\theta \circ \varphi_\theta)$ respectively. The number of cells offers a measure of the representation capabilities of these maps, and the upper bound on the number of cells describes the limit of the encoding capability of $\varphi_\theta$. We call this upper bound the rectified linear complexity of the autoencoder. The rectified linear complexity can be deduced from the architecture of the encoder network, as claimed in our Theorem 4.5.
For the second problem, we introduce a similar concept for the embedded manifold. The encoder map $\varphi_\theta$ has a very strong geometric requirement: suppose $U$ is a cell in $\mathcal{D}(\varphi_\theta)$; then $\varphi_\theta: U \to \mathcal{F}$ is an affine map to the latent space, and its restriction to $U \cap \Sigma$ is a homeomorphism $\varphi_\theta: U \cap \Sigma \to \varphi_\theta(U \cap \Sigma)$. In order to satisfy the two stringent requirements for the encoding map, namely piecewise linearity on the ambient space and the local homeomorphism, the number of cells of the decomposition of $\mathcal{X}$ (and of $\Sigma$) must be greater than a lower bound. Similarly, we call this lower bound the rectified linear complexity of the pair of the manifold and the ambient space, $(\mathcal{X}, \Sigma)$. The rectified linear complexity can be derived from the geometry of $\Sigma$ and its embedding in $\mathcal{X}$. Our Theorem 4.12 gives a criterion to verify whether a manifold can be rectified by a linear map.
For the third problem, we can compare the rectified linear complexity of the manifold and that of the autoencoder. If the RL complexity of the autoencoder is less than that of the manifold, then the autoencoder cannot encode the manifold. Specifically, we show that for any autoencoder with a fixed architecture, there exists an embedded manifold which cannot be encoded by it.
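4.2 ReLU Deep Neural Networks
We extend the ReLU activation function to vectors $x \in \mathbb{R}^n$ through entrywise operation:
$$\sigma(x) = \left(\max\{0, x_1\}, \max\{0, x_2\}, \ldots, \max\{0, x_n\}\right).$$
For any $(m, n) \in \mathbb{N}$, let $\mathcal{A}^n_m$ and $\mathcal{L}^n_m$ denote the class of affine and linear transformations from $\mathbb{R}^m \to \mathbb{R}^n$, respectively.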
Definition 4.1 (ReLU DNN).
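For any number of hidden layers $k \in \mathbb{N}$ and input and output dimensions $w_0, w_{k+1} \in \mathbb{N}$, a $\mathbb{R}^{w_0} \to \mathbb{R}^{w_{k+1}}$ ReLU DNN is given by specifying a sequence of $k$ natural numbers $w_1, w_2, \ldots, w_k$ representing the widths of the hidden layers, a set of $k$ affine transformations $T_i: \mathbb{R}^{w_{i-1}} \to \mathbb{R}^{w_i}$ for $i = 1, \ldots, k$, and a linear transformation $T_{k+1}: \mathbb{R}^{w_k} \to \mathbb{R}^{w_{k+1}}$, corresponding to the weights of the hidden layers. Such a ReLU DNN is called a $(k+1)$-layer ReLU DNN, and is said to have $k$ hidden layers; it is denoted as $N(w_0, w_1, \ldots, w_k, w_{k+1})$.
The mapping $\varphi_\theta: \mathbb{R}^{w_0} \to \mathbb{R}^{w_{k+1}}$ represented by this ReLU DNN is
$$\varphi_\theta = T_{k+1} \circ \sigma \circ T_k \circ \cdots \circ T_2 \circ \sigma \circ T_1, \qquad (1)$$
where $\circ$ denotes mapping composition and $\theta$ represents all the weight and bias parameters. The depth of the ReLU DNN is $k+1$, the width is $\max\{w_1, \ldots, w_k\}$, and the size is $w_1 + w_2 + \cdots + w_k$.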
Definition 4.2 (PL Mapping).
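A mapping $\varphi: \mathbb{R}^n \to \mathbb{R}^m$ is a piecewise linear mapping if there exists a finite set of polyhedra whose union is $\mathbb{R}^n$, and $\varphi$ is affine linear over each polyhedron. The number of pieces of $\varphi$ is the number of maximal connected subsets of $\mathbb{R}^n$ over which $\varphi$ is affine linear, denoted as $\mathcal{N}(\varphi)$. We call $\mathcal{N}(\varphi)$ the rectified linear complexity of $\varphi$.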
Definition 4.3 (Rectified Linear Complexity of a ReLU DNN).
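Given a ReLU DNN $N(w_0, \ldots, w_{k+1})$, its rectified linear complexity is the upper bound of the rectified linear complexities of all PL functions $\varphi_\theta$ represented by $N$,
$$\mathcal{N}(N) := \max_\theta \mathcal{N}(\varphi_\theta).$$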
Lemma 4.4.
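The maximum number of parts one can get when cutting the $d$-dimensional space $\mathbb{R}^d$ with $n$ hyperplanes is denoted as $\mathcal{C}(d, n)$; then
$$\mathcal{C}(d, n) = \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{d}. \qquad (2)$$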
Proof.
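Suppose $n$ hyperplanes cut $\mathbb{R}^d$ into $\mathcal{C}(d, n)$ cells, each of which is a convex polyhedron. Let the $(n+1)$-th hyperplane be $\pi$. The first $n$ hyperplanes intersect $\pi$ and partition $\pi$ into $\mathcal{C}(d-1, n)$ cells, and each cell on $\pi$ partitions a polyhedron in $\mathbb{R}^d$ into $2$ cells, hence we get the formula
$$\mathcal{C}(d, n+1) = \mathcal{C}(d, n) + \mathcal{C}(d-1, n).$$
It is obvious that $\mathcal{C}(2, 1) = 2$, and formula (2) can be easily obtained by induction. ∎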
Theorem 4.5 (Rectified Linear Complexity of a ReLU DNN).
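Given a ReLU DNN $N(w_0, \ldots, w_{k+1})$, representing PL mappings $\varphi_\theta: \mathbb{R}^{w_0} \to \mathbb{R}^{w_{k+1}}$ with $k$ hidden layers of widths $\{w_i\}_{i=1}^{k}$, the rectified linear complexity of $N$ has the upper bound
$$\mathcal{N}(N) \le \prod_{i=1}^{k+1} \mathcal{C}(w_{i-1}, w_i). \qquad (3)$$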
Proof.
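The $i$-th hidden layer computes the mapping $T_i: \mathbb{R}^{w_{i-1}} \to \mathbb{R}^{w_i}$. Each neuron represents a hyperplane in $\mathbb{R}^{w_{i-1}}$, and the $w_i$ hyperplanes partition the whole space into at most $\mathcal{C}(w_{i-1}, w_i)$ polyhedra.
The first layer partitions $\mathbb{R}^{w_0}$ into at most $\mathcal{C}(w_0, w_1)$ cells; the second layer further subdivides the cell decomposition, and each cell is subdivided into at most $\mathcal{C}(w_1, w_2)$ polyhedra, hence two layers partition the source space into at most $\mathcal{C}(w_0, w_1)\,\mathcal{C}(w_1, w_2)$ cells. By induction, one obtains the upper bound on $\mathcal{N}(N)$ described by inequality (3). ∎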
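The counting formula for hyperplane arrangements and the resulting product bound are easy to evaluate numerically. The following sketch assumes the standard reading of the lemma, $\mathcal{C}(d, n) = \sum_{i=0}^{d} \binom{n}{i}$, and multiplies these counts over consecutive layer widths; the function names `max_cells` and `rl_complexity_bound` and the example widths are hypothetical.

```python
from math import comb, prod

def max_cells(d, n):
    """C(d, n): maximal number of cells obtained by cutting R^d with n hyperplanes."""
    return sum(comb(n, i) for i in range(d + 1))

def rl_complexity_bound(widths):
    """Product of C(w_{i-1}, w_i) over consecutive entries of (w_0, ..., w_{k+1})."""
    return prod(max_cells(w_prev, w) for w_prev, w in zip(widths[:-1], widths[1:]))

# Three lines cut the plane into at most 7 regions.
print(max_cells(2, 3))
# Bound for a network with widths (w_0, w_1, w_2) = (2, 3, 1): C(2, 3) * C(3, 1).
print(rl_complexity_bound([2, 3, 1]))
```

The bound grows quickly with depth and width, but it is finite for every fixed architecture, which is the property the later existence argument relies on.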
4.3 Cell Decomposition
The PL mappings induce cell decompositions of both the ambient space $\mathcal{X}$ and the latent space $\mathcal{F}$. The number of cells is closely related to the rectified linear complexity.
Fix the encoding map $\varphi_\theta$. Let the set of all neurons in the network be denoted by $\mathcal{S}$, and the set of all its subsets by $2^{\mathcal{S}}$.
Definition 4.6 (Activated Path).
Given a point $x \in \mathcal{X}$, the activated path of $x$ consists of all the neurons activated when $\varphi_\theta(x)$ is evaluated, and is denoted as $\rho(x)$. The activated path thus defines a set-valued function $\rho: \mathcal{X} \to 2^{\mathcal{S}}$.
Definition 4.7 (Cell Decomposition).
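Fix an encoding map $\varphi_\theta$ represented by a ReLU DNN. Two data points $x_1, x_2 \in \mathcal{X}$ are equivalent, denoted as $x_1 \sim x_2$, if they share the same activated path, $\rho(x_1) = \rho(x_2)$. The equivalence relation partitions the ambient space $\mathcal{X}$ into cells,
$$\mathcal{D}(\varphi_\theta): \quad \mathcal{X} = \bigcup_\alpha U_\alpha,$$
where each equivalence class corresponds to a cell: $x_1, x_2 \in U_\alpha$ if and only if $x_1 \sim x_2$. $\mathcal{D}(\varphi_\theta)$ is called the cell decomposition induced by the encoding map $\varphi_\theta$.
Furthermore, $\varphi_\theta$ maps the cell decomposition $\mathcal{D}(\varphi_\theta)$ in the ambient space to a cell decomposition in the latent space. Similarly, the composition of the encoding and decoding maps also produces a cell decomposition, denoted as $\mathcal{D}(\psi_\theta \circ \varphi_\theta)$, which subdivides $\mathcal{D}(\varphi_\theta)$. The bottom row of Fig. 2 shows these cell decompositions.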
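The activated path and the induced cell decomposition can be explored numerically. The sketch below is illustrative only: it uses a small randomly initialised network with hidden widths (4, 3) on a 2-dimensional input, records which hidden neurons have positive pre-activation, and counts the distinct patterns met on a grid. The grid can miss small cells, so the count is a lower estimate of the true number of cells. The helper name `activation_pattern` is hypothetical.

```python
import numpy as np

def activation_pattern(layers, x):
    """Activated path of x, encoded as a tuple of 0/1 flags over all hidden neurons."""
    pattern = []
    for W, b in layers:
        pre = W @ x + b
        pattern.extend((pre > 0).astype(int).tolist())
        x = np.maximum(pre, 0.0)   # ReLU output feeds the next hidden layer
    return tuple(pattern)

rng = np.random.default_rng(0)
# Two hidden layers: R^2 -> R^4 -> R^3 (random weights, assumed for illustration).
layers = [(rng.standard_normal((4, 2)), rng.standard_normal(4)),
          (rng.standard_normal((3, 4)), rng.standard_normal(3))]

grid = np.linspace(-3.0, 3.0, 120)
cells = {activation_pattern(layers, np.array([u, v])) for u in grid for v in grid}
num_cells_seen = len(cells)
print(num_cells_seen)
```

Points with the same pattern lie in the same cell, on which the network restricts to a single affine map; the observed count cannot exceed the product of hyperplane-arrangement counts for the two hidden layers.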
4.4 Learning Difficulty
Definition 4.8 (Linear Rectifiable Manifold).
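Suppose $\Sigma$ is an $m$-dimensional manifold embedded in $\mathbb{R}^n$. We say $\Sigma$ is linear rectifiable if there exists an affine map $\varphi: \mathbb{R}^n \to \mathbb{R}^m$ such that the restriction of $\varphi$ to $\Sigma$, $\varphi|_\Sigma: \Sigma \to \varphi(\Sigma) \subset \mathbb{R}^m$, is homeomorphic. $\varphi$ is called the corresponding rectified linear map of $\Sigma$.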
Definition 4.9 (Linear Rectifiable Atlas).
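Suppose $\Sigma$ is an $m$-dimensional manifold embedded in $\mathbb{R}^n$, and $\mathcal{A} = \{(U_\alpha, \varphi_\alpha)\}$ is an atlas of $\Sigma$. If each chart $(U_\alpha, \varphi_\alpha)$ is linear rectifiable, with $\varphi_\alpha: U_\alpha \to \mathbb{R}^m$ the rectified linear map of $U_\alpha$, then the atlas is called a linear rectifiable atlas of $\Sigma$.
Given a compact manifold $\Sigma$ and its atlas $\mathcal{A}$, one can select a finite number of local charts $\tilde{\mathcal{A}} = \{(U_i, \varphi_i)\}_{i=1}^{n}$ such that $\tilde{\mathcal{A}}$ still covers $\Sigma$. The number of charts of an atlas is denoted as $|\mathcal{A}|$.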
Definition 4.10 (Rectified Linear Complexity of a Manifold).
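Suppose $\Sigma$ is an $m$-dimensional manifold embedded in $\mathbb{R}^n$. The rectified linear complexity of $\Sigma$ is denoted as $\mathcal{N}(\mathbb{R}^n, \Sigma)$ and defined as
$$\mathcal{N}(\mathbb{R}^n, \Sigma) := \min\left\{ |\mathcal{A}| \;\middle|\; \mathcal{A} \text{ is a linear rectifiable atlas of } \Sigma \right\}. \qquad (4)$$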
4.5 Learnable Condition
Definition 4.11 (Encoding Map).
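Suppose $\Sigma$ is an $m$-dimensional manifold embedded in $\mathbb{R}^n$. A continuous mapping $\varphi: \mathbb{R}^n \to \mathbb{R}^m$ is called an encoding map of $(\mathbb{R}^n, \Sigma)$ if, restricted to $\Sigma$, $\varphi|_\Sigma: \Sigma \to \varphi(\Sigma) \subset \mathbb{R}^m$ is homeomorphic.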
Theorem 4.12.
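Suppose a ReLU DNN $N(w_0, \ldots, w_{k+1})$ represents a PL mapping $\varphi_\theta: \mathbb{R}^n \to \mathbb{R}^m$, and $\Sigma$ is an $m$-dimensional manifold embedded in $\mathbb{R}^n$. If $\varphi_\theta$ is an encoding mapping of $(\mathbb{R}^n, \Sigma)$, then the rectified linear complexity of $N$ is no less than the rectified linear complexity of $(\mathbb{R}^n, \Sigma)$,
$$\mathcal{N}(\mathbb{R}^n, \Sigma) \le \mathcal{N}(\varphi_\theta) \le \mathcal{N}(N).$$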
Proof.
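The ReLU DNN computes the PL mapping $\varphi_\theta$. Suppose the corresponding cell decomposition of $\mathbb{R}^n$ is
$$\mathcal{D}(\varphi_\theta): \quad \mathbb{R}^n = \bigcup_{i=1}^{k} U_i,$$
where each $U_i$ is a convex polyhedron and $k \le \mathcal{N}(\varphi_\theta)$. If $\varphi_\theta$ is an encoding map of $\Sigma$, then
$$\mathcal{A} := \left\{ (D_i, \varphi_\theta|_{D_i}) \;\middle|\; D_i := U_i \cap \Sigma \neq \emptyset \right\}$$
forms a linear rectifiable atlas of $\Sigma$. Hence, from the definitions of the rectified linear complexity of a ReLU DNN and of a manifold, we obtain
$$\mathcal{N}(\mathbb{R}^n, \Sigma) \le |\mathcal{A}| \le \mathcal{N}(\varphi_\theta) \le \mathcal{N}(N).$$
∎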
The encoding map $\varphi_\theta$ is required to be homeomorphic, which adds strong topological constraints to the manifold $\Sigma$. For example, if $\Sigma$ is a surface and $\mathcal{F}$ is $\mathbb{R}^2$, then $\Sigma$ must be a genus zero surface with boundaries. In general, assume $\varphi_\theta(\Sigma)$ is a simply connected domain in $\mathbb{R}^m$; then $\Sigma$ must be an $m$-dimensional topological disk. The topological constraint implies that an autoencoder can only learn manifolds with simple topologies, or a local chart of the whole manifold.
On the other hand, the geometry and the embedding of $\Sigma$ determine the linear rectifiability of $(\mathbb{R}^n, \Sigma)$.
Lemma 4.13.
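Suppose an $n$-dimensional manifold $\Sigma$ is embedded in $\mathbb{R}^{n+1}$, and consider
$$\Sigma \xrightarrow{\ G\ } \mathbb{S}^n \xrightarrow{\ p\ } \mathbb{RP}^n,$$
where $G$ is the Gauss map, $\mathbb{RP}^n$ is the real projective space, and the projection $p$ maps antipodal points to the same point. If $p \circ G(\Sigma)$ covers the whole $\mathbb{RP}^n$, then $\Sigma$ is not linear rectifiable.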
Proof.
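Given any unit vector $v \in \mathbb{S}^n$, all the unit vectors orthogonal to $v$ form a sphere $\mathbb{S}^{n-1}(v)$, and $p(\mathbb{S}^{n-1}(v)) \subset p \circ G(\Sigma)$. Therefore there is a point $q \in \Sigma$ such that $v$ is in the tangent space at $q$. The line through $q$ in direction $v$ is tangent to $\Sigma$; by shifting the line by an infinitesimal amount, the line intersects $\Sigma$ at two points. This shows that there is no linear mapping that projects $\Sigma$ along $v$ homeomorphically onto its image. Because $v$ is arbitrary, $\Sigma$ is not linear rectifiable. ∎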
Figure panels: (a) linear rectifiable; (b) non-linear-rectifiable; (c) Peano curve; (d) Peano curve.
Theorem 4.14.
Given any ReLU deep neural network $N(w_0, w_1, \ldots, w_k, w_{k+1})$, there is a manifold $\Sigma$ embedded in $\mathbb{R}^{w_0}$ such that $\Sigma$ cannot be encoded by $N$.
Proof.
First, we prove the simplest case, a curve in the plane. We can construct space-filling Peano curves, as shown in Fig. 7. Suppose $C_1$ is shown in the left frame; we make copies of $C_1$ and, by translation, rotation, reconnection and scaling, construct $C_2$, as shown in the right frame. Similarly, we can construct all the $C_n$'s. The red square shows one unit, and the number of units of $C_n$ grows without bound as $n$ increases. Each unit is not rectifiable, therefore the rectified linear complexity of $C_n$ is at least its number of units.
We can choose $n$ big enough that the number of units of $C_n$ exceeds $\mathcal{N}(N)$; then $C_n$ cannot be encoded by $N$.
Similarly, in higher-dimensional ambient spaces we can construct Peano curves that fill the space and cannot be encoded by $N$. The Peano curve construction can be generalized to higher-dimensional manifolds by direct product with unit intervals. ∎
5 Approximation Accuracy
Figure panels: input manifold; latent representation; reconstructed manifold; reconstructed manifold.
In the ideal situation, the composition of the encoding and decoding maps, $\psi_\theta \circ \varphi_\theta$, should equal the identity on the manifold $\Sigma$. In practice, the reconstructed manifold $\tilde{\Sigma}$ is only an approximation of $\Sigma$. Fig. 8 shows a human facial surface encoded and decoded by an autoencoder. Although the encoding/decoding maps are homeomorphic, the approximation of $\Sigma$ by $\tilde{\Sigma}$ loses geometric details.
Uniform Sampling
The buddha surface $\Sigma$ is conformally mapped onto the planar unit disk using the Ricci flow method [32]; the image is shown in Fig. 11 frame (a). Then, by composing with an optimal mass transportation map computed using the algorithm in [25], one obtains an area-preserving mapping from $\Sigma$ to the disk; the image is shown in the left frame of Fig. 9. We then uniformly sample the planar disk and pull the samples back onto $\Sigma$ by the inverse of the area-preserving mapping. Because the mapping is area-preserving and the samples are uniformly distributed on the disk, the pulled-back samples are uniformly distributed on $\Sigma$, as shown in the right frame of Fig. 9.
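The disk-sampling step can be sketched as follows. This covers only the uniform sampling of the planar unit disk; pulling the samples back to the surface requires the area-preserving map from [32] and [25], which is not reproduced here. The function name `sample_unit_disk`, the sample count and the random seed are assumptions for illustration.

```python
import numpy as np

def sample_unit_disk(k, rng):
    """Draw k points uniformly (with respect to area) from the planar unit disk."""
    r = np.sqrt(rng.uniform(0.0, 1.0, size=k))   # square root corrects for area growing like r^2
    t = rng.uniform(0.0, 2.0 * np.pi, size=k)
    return np.stack([r * np.cos(t), r * np.sin(t)], axis=1)

rng = np.random.default_rng(0)
Z = sample_unit_disk(20000, rng)

# Sanity check: the disk of radius 0.5 holds one quarter of the total area,
# so roughly one quarter of the samples should fall inside it.
inner_fraction = float(np.mean(np.sum(Z ** 2, axis=1) <= 0.25))
print(inner_fraction)
```

Sampling the radius uniformly without the square root would over-sample the centre of the disk, and the pulled-back samples would then no longer be uniform on the surface.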
Figure panels: (a) front view; (b) left view; (c) back view.
Figure panels: (a) conformal mapping; (b) LSCM; (c) autoencoding map.
Figure panels: (a) right view; (b) front view; (c) back view.
Figure panels: (a) right view; (b) front view; (c) back view.
Encoding Map
Fig. 11 shows the image of the encoding map, $\varphi_\theta(\Sigma)$. Frames (a) and (b) are the results computed using conventional conformal surface parameterizations based on Ricci flow [32] and least squares conformal mapping (LSCM) [16], which are guaranteed to be homeomorphic. Frame (c) shows the result obtained by the autoencoder. By carefully examining the finger and mouth regions, we can see that the mapping is not homeomorphic: there are flipped triangles in these regions. We tried to eliminate the flipped triangles by increasing the widths of the latent layers, but could not obtain a homeomorphism.
Reconstructed Manifold
The autoencoder represents a piecewise linear map $\psi_\theta \circ \varphi_\theta$, which induces a cell decomposition $\mathcal{D}(\psi_\theta \circ \varphi_\theta)$ (as shown in Fig. 2 frame (f)); restricted to each cell, the mapping is affine. Fig. 12 shows the reconstructed buddha $\tilde{\Sigma}$, which is a polyhedral surface. The global shape is captured by the reconstructed manifold, but local geometric features are lost.
Cell Decomposition
Fig. 13 shows the cell decomposition of the ambient space induced by the encoding map, $\mathcal{D}(\varphi_\theta)$; the first and third frames show cut views. Different cells are color-coded differently. All the cells are convex polyhedral cells. Fig. 2 frame (e) shows the cell decomposition in the latent space, and Fig. 14 shows the cell structure on the reconstructed manifold. It can be observed that each layer in the network subdivides the cell decomposition produced by the previous layers, hence the cell decompositions become finer and finer. Many cells are close to degenerate and could be pruned to simplify the network while preserving the performance.
Approximation Accuracy
The subtle and complicated local geometric features, such as the fingers and the facial geometry, indicate that the buddha surface has high rectified linear complexity, so global homeomorphism is difficult to achieve.
Fig. 15 shows the reconstruction results obtained by autoencoders with different architectures (widths of latent layers) and sizes of training sample. The encoding map is implemented by a ReLU network whose input dimension is always $3$ and whose bottleneck layer has width $2$. The network for the decoding map shares a similar architecture. Increasing the number of nodes and training samples improves the approximation accuracy only marginally.
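Figure panels: (a), (b), (c), each labeled by its number of training samples.
Figure panels: (a), (b) labeled by number of samples; (c) reconstructed using the novel method.
Figure panels: (a) original hand model; (b) labeled by number of samples; (c) reconstructed result by the novel method.
Figure panels: (a) original model; (b) labeled by number of samples; (c) reconstructed result by the novel method.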
Comparison
Fig. 15 shows the reconstructed polyhedral surface using a conventional autoencoder. We have developed a novel method which greatly improves the approximation accuracy, as shown in Fig. 16. The novel method produces polyhedral surfaces with detailed geometric features; numerical measurements show that the approximation accuracy is improved by one order of magnitude. Fig. 17 and Fig. 18 show the reconstructed manifolds of a hand surface and the Stanford bunny surface, respectively, by the autoencoder (b) and the novel method (c); the quality improvement is clear. We will give the details of the novel method in our next paper.
6 Control Induced Measure
In generative models, such as VAE [15] or GAN [1], the probability measure in the latent space induced by the encoding map is controlled to be a simple distribution, such as Gaussian or uniform. In the generating process, we then sample from the simple distribution in the latent space and use the decoding map to produce a sample in the ambient space.
Optimal Mass Transportation
The optimal transportation theory can be found in Villani’s classical books [27][28]. Suppose $\mu$ is the induced probability measure in the latent space with a convex support $\Omega$, and $\nu$ is the simple distribution, e.g. the uniform distribution on $\Omega$. A mapping $T:\Omega\to\Omega$ is measure-preserving if $T_\#\mu=\nu$. Given the transportation cost between two points $c:\Omega\times\Omega\to\mathbb{R}$, the transportation cost of $T$ is defined as
$$E(T) := \int_\Omega c(x, T(x))\, d\mu(x).$$
The Wasserstein distance between $\mu$ and $\nu$ is defined as
$$W(\mu,\nu) := \inf_{T_\#\mu=\nu} E(T).$$
The measure-preserving map $T$ that minimizes the transportation cost is called the optimal mass transportation map.
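For two discrete measures with the same number of equally weighted points, a measure-preserving map is a bijection between the point sets, so the optimal mass transportation map can be found by exhaustive search. The following is a minimal sketch under that assumption; the point sets, the quadratic cost, and the function names are illustrative, and the brute-force search is only feasible for a handful of points.

```python
from itertools import permutations

def transport_cost(source, target, perm, cost):
    """Cost of the map sending source[i] to target[perm[i]], each point with mass 1/n."""
    return sum(cost(source[i], target[j]) for i, j in enumerate(perm)) / len(source)

def optimal_map(source, target, cost):
    """Search all bijections and return the cheapest one with its cost."""
    best = min(permutations(range(len(target))),
               key=lambda perm: transport_cost(source, target, perm, cost))
    return best, transport_cost(source, target, best, cost)

def squared_distance(x, y):
    """Quadratic cost c(x, y) = |x - y|^2 / 2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))

# The target is the source square translated by (2, 0), listed in a shuffled order.
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
target = [(2.0, 1.0), (2.0, 0.0), (3.0, 1.0), (3.0, 0.0)]
perm, cost_value = optimal_map(source, target, squared_distance)
print(perm, cost_value)
```

In this example the search recovers the translation, which maps every source point to its shifted copy. Practical solvers replace the exhaustive search with linear programming or with the variational approach cited in the text.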
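Kantorovich proved that the Wasserstein distance between two probability measures $\mu$ and $\nu$ on $\Omega$ can be represented as
$$W(\mu,\nu) = \max_{f} \int_\Omega f\, d\mu + \int_\Omega f^c\, d\nu,$$
where $f:\Omega\to\mathbb{R}$ is called the Kantorovich potential, and its c-transform with respect to the cost $c$ is
$$f^c(y) := \inf_{x\in\Omega} \bigl( c(x,y) - f(x) \bigr).$$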
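In WGAN, the generator computes the decoding map $g_\theta$ from the latent space to the ambient space, and the discriminator computes the Wasserstein distance between the generated distribution $(g_\theta)_\#\nu$, the push-forward of the simple latent distribution $\nu$, and the real data distribution $\mu_{\mathrm{data}}$. If the cost function is chosen to be the $L^1$ norm, $c(x,y)=|x-y|$, and the Kantorovich potential $f$ is 1-Lipschitz, then its c-transform satisfies $f^c=-f$. The discriminator computes the Kantorovich potential and the generator computes the optimal mass transportation map, hence WGAN can be modeled as the optimization
$$\min_\theta \max_{f} \int f(g_\theta(z))\, d\nu(z) - \int f(x)\, d\mu_{\mathrm{data}}(x),$$
where $f$ ranges over 1-Lipschitz functions.
The competition between the discriminator and the generator leads to the solution.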
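The identity for 1-Lipschitz potentials under the $L^1$ cost can be checked in a few lines. The sketch below assumes the c-transform convention $f^c(y) = \inf_x \bigl(c(x,y) - f(x)\bigr)$; with a different sign convention the statement changes accordingly.

```latex
\begin{align*}
f^c(y) &= \inf_{x} \bigl( |x-y| - f(x) \bigr), \\
|x-y| - f(x) &\ge -f(y) \quad \text{for all } x, \text{ since } f(x) - f(y) \le |x-y|, \\
|y-y| - f(y) &= -f(y) \quad \text{(the bound is attained at } x = y\text{)}, \\
\Rightarrow\; f^c(y) &= -f(y).
\end{align*}
```

Substituting $f^c=-f$ into the dual representation turns the two integrals into a difference of expectations of the same function, which is the form the discriminator maximizes.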
If we choose the cost function to be the $L^2$ norm, $c(x,y)=\frac{1}{2}|x-y|^2$, then the computation can be greatly simplified. Brenier’s theorem [6] claims that there exists a convex function $u:\Omega\to\mathbb{R}$, the so-called Brenier potential, such that its gradient map $\nabla u:\Omega\to\Omega$ gives the optimal mass transportation map. The Brenier potential satisfies the Monge-Ampère equation
$$\det\left(\frac{\partial^2 u}{\partial x_i\,\partial x_j}\right)(x) = \frac{\mu(x)}{\nu(\nabla u(x))},$$
where $\mu$ and $\nu$ denote the densities of the source and target distributions.
Geometrically, the Monge-Ampère equation can be understood as solving the Alexandrov problem: finding a convex surface with prescribed Gaussian curvature. A practical algorithm based on a variational principle can be found in [13]. The Brenier potential $u$ and the Kantorovich potential $f$ are related by the closed form
$$u(x) = \frac{1}{2}|x|^2 - f(x). \qquad (5)$$
Eqn. (5) shows the following: the generator computes the optimal transportation map $\nabla u$, and the discriminator computes the Wasserstein distance by finding the Kantorovich potential $f$; $u$ and $f$ can be converted to each other. Hence the competition between the generator and the discriminator is unnecessary, and the two deep neural networks for the generator and the discriminator are redundant.
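The relation between the two potentials can be derived from the optimality condition of the dual problem. The following sketch assumes that the Kantorovich potential $f$ is differentiable, that the c-transform is $f^c(y) = \inf_x \bigl(c(x,y) - f(x)\bigr)$, and that $T$ denotes the optimal map for the cost $\frac{1}{2}|x-y|^2$.

```latex
\begin{align*}
f(x) + f^c(y) &\le \tfrac12 |x-y|^2 \quad \text{for all } x, y,
  \text{ with equality when } y = T(x), \\
\nabla f(x) &= \nabla_x \tfrac12 |x - y|^2 \Big|_{y=T(x)} = x - T(x), \\
T(x) &= x - \nabla f(x) = \nabla \Bigl( \tfrac12 |x|^2 - f(x) \Bigr) = \nabla u(x).
\end{align*}
```

The second line is the first-order condition for $x$ to minimize $\frac{1}{2}|x-y|^2 - f(x)$ at $y = T(x)$; the third line identifies the function in parentheses with the Brenier potential $u$.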
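Autoencoder-OMT model
As shown in Fig. 19, we can use an autoencoder to realize the encoder and the decoder, and use OMT in the latent space to realize the probability transformation $T$, such that $T$ transforms between the simple distribution and the distribution induced by the encoder in the latent space.
We call this model the OMT-autoencoder.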
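The three components of this pipeline can be sketched on a toy problem with a one-dimensional latent space. Everything below is an assumption made for illustration: the data lie on the parabola y = x^2, the encoder and decoder are written by hand instead of being trained, and the transport map goes from the uniform distribution on [0, 1] to the empirical latent distribution. In one dimension the monotone rearrangement (the quantile function) is the optimal transportation map for the quadratic cost, so no optimization is needed; this is not the authors' implementation.

```python
import random

def encode(point):
    """Hypothetical encoder: project a point on the parabola to its x-coordinate."""
    return point[0]

def decode(code):
    """Hypothetical decoder: lift a latent code back onto the parabola y = x^2."""
    return (code, code * code)

def make_transport_map(latent_codes):
    """Return T: [0, 1] -> latent space, the empirical quantile function.

    In one dimension this monotone map is the optimal transportation map
    from the uniform distribution to the empirical latent distribution.
    """
    ordered = sorted(latent_codes)
    n = len(ordered)

    def transport(u):
        # Piecewise-linear interpolation between consecutive sorted codes.
        position = min(max(u, 0.0), 1.0) * (n - 1)
        lower = int(position)
        upper = min(lower + 1, n - 1)
        weight = position - lower
        return (1 - weight) * ordered[lower] + weight * ordered[upper]

    return transport

rng = random.Random(0)
# Training data: points on the parabola whose x-coordinates are far from uniform.
data = [decode(rng.betavariate(2.0, 5.0)) for _ in range(1000)]
codes = [encode(p) for p in data]
transport = make_transport_map(codes)

# Generation: sample the simple distribution, transport it, then decode.
generated = [decode(transport(rng.random())) for _ in range(1000)]
```

The generated points lie on the parabola and follow the latent distribution of the training data, without any adversarial training. In higher-dimensional latent spaces the sorting step has to be replaced by a genuine optimal transport solver.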
Figure panels: (a) real digits; (b) VAE; (c) WGAN; (d) AE-OMT.
Fig. 6 shows the experiments on the MNIST data set. The digits generated by OMT-AE have better quality than those generated by VAE and WGAN. Fig. (6) shows human facial images on the CelebA data set. The images generated by OMT-AE look better than those produced by VAE.
Figure panels: (a) VAE; (d) AE-OMT.
7 Conclusion
This work gives a geometric understanding of autoencoders and general deep neural networks. The underlying principle is the manifold structure hidden in data, which accounts for the great success of deep learning. Autoencoders learn the manifold structure and construct a parametric representation. The concept of rectified linear complexity is introduced for both DNNs and manifolds; it describes the fundamental learning limitation of a DNN and the difficulty of learning a manifold. By applying the concept of complexity, it is shown that for any DNN with fixed architecture, there is a manifold too complicated to be encoded by the DNN. Experiments on surfaces show that the approximation accuracy can be improved. By applying optimal mass transportation theory, the probability distribution in the latent space can be fully controlled in a more understandable and more efficient way.
In the future, we will develop finer estimates for the complexities of deep neural networks and embedding manifolds, and generalize the geometric framework to other deep learning models.
Acknowledgement
The authors thank our students Yang Guo, Dongsheng An, Jingyao Ke, and Huidong Liu for all the experimental results, and our collaborators Feng Luo, Kefeng Liu, and Dimitris Samaras for the helpful discussions.
References
[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[2] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Netw., 2(1):53–58, January 1989.
[3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013.
[4] Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Léon Bottou, Olivier Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.
[5] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS’13, pages 899–907, USA, 2013. Curran Associates Inc.
[6] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Comm. Pure Appl. Math., 44(4):375–417, 1991.
[7] Chakravarty R. Alla Chaitanya, Anton S. Kaplanyan, Christoph Schied, Marco Salvi, Aaron Lefohn, Derek Nowrouzezahrai, and Timo Aila. Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Trans. Graph., 36(4):98:1–98:12, July 2017.
[8] J. Deng, Z. Zhang, E. Marchi, and B. Schuller. Sparse autoencoder-based feature transfer learning for speech emotion recognition. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pages 511–516, Sept 2013.
[9] X. Feng, Y. Zhang, and J. Glass. Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1759–1763, May 2014.
[10] Peter Földiák and Malcolm P. Young. Sparse coding in the primate cortex. In The Handbook of Brain Theory and Neural Networks, pages 895–898. MIT Press, Cambridge, MA, USA, 1998.
[11] L. Gondara. Medical image denoising using convolutional denoising autoencoders. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 241–246, Dec 2016.
[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[13] Xianfeng Gu, Feng Luo, Jian Sun, and Shing-Tung Yau. Variational principles for Minkowski type problems, discrete optimal transport, and discrete Monge-Ampère equations. Asian Journal of Mathematics (AJM), 20(2):383–398, 2016.
[14] Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[15] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
[16] B. Lévy, S. Petitjean, N. Ray, and J. Maillot. Least squares conformal maps for automatic texture generation. ACM Trans. on Graphics (SIGGRAPH), 21(3):362–371, 2002.
[17] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.
[18] Andrew Ng. Sparse autoencoder. CS294A Lecture Notes, December 2011.
[19] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[20] Marc’Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, pages 1185–1192, USA, 2007. Curran Associates Inc.
[21] Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, pages 1137–1144, Cambridge, MA, USA, 2006. MIT Press.
[22] Salah Rifai, Yoshua Bengio, Yann N. Dauphin, and Pascal Vincent. A generative process for sampling contractive auto-encoders. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, pages 1811–1818, USA, 2012. Omnipress.
[23] Salah Rifai, Grégoire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, Yann Dauphin, and Xavier Glorot. Higher order contractive auto-encoder. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, ECML PKDD’11, pages 645–660, Berlin, Heidelberg, 2011. Springer-Verlag.
[24] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, pages 833–840, USA, 2011. Omnipress.
[25] Zhengyu Su, Yalin Wang, Rui Shi, Wei Zeng, Jian Sun, Feng Luo, and Xianfeng Gu. Optimal mass transport for shape matching and comparison. IEEE Trans. Pattern Anal. Mach. Intell., 37(11):2246–2259, 2015.
[26] C. Tao, H. Pan, Y. Li, and Z. Zou. Unsupervised spectral-spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geoscience and Remote Sensing Letters, 12(12):2438–2442, Dec 2015.
[27] Cédric Villani. Topics in Optimal Transportation. Number 58 in Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2003.
[28] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[29] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1096–1103, New York, NY, USA, 2008. ACM.
[30] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, December 2010.
[31] Jun Xu, Lei Xiang, Qingshan Liu, Hannah Gilmore, Jianzhong Wu, Jinghai Tang, and Anant Madabhushi. Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Transactions on Medical Imaging, 35(1):119–130, 2016.
[32] Wei Zeng, Dimitris Samaras, and Xianfeng David Gu. Ricci flow for 3D shape analysis. IEEE Trans. Pattern Anal. Mach. Intell., 32(4):662–677, 2010.
[33] M. Zhao, D. Wang, Z. Zhang, and X. Zhang. Music removal by convolutional denoising autoencoder in speech recognition. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 338–341, Dec 2015.