Markov-Lipschitz Deep Learning
Abstract
We propose a novel framework, called Markov-Lipschitz deep learning (MLDL), to tackle geometric deterioration caused by collapse, twisting, or crossing in vector-based neural network transformations for manifold-based representation learning and manifold data generation. A prior constraint, called locally isometric smoothness (LIS), is imposed across layers and encoded into a Markov random field (MRF)-Gibbs distribution. This leads to the best possible solutions for local geometry preservation and robustness as measured by locally geometric distortion and locally bi-Lipschitz continuity. Consequently, the layer-wise vector transformations are enhanced into well-behaved, LIS-constrained metric homeomorphisms. Extensive experiments, comparisons, and an ablation study demonstrate significant advantages of MLDL for manifold learning and manifold data generation. MLDL is general enough to enhance any vector transformation-based networks. The code is available at https://github.com/westlakecairi/MarkovLipschitzDeepLearning.
1 Introduction
Manifold learning aims to perform nonlinear dimensionality reduction (NLDR), mapping from the input data space to a latent space so that the Euclidean metric can be used to facilitate pattern analysis, while manifold data generation performs the inverse from the learned latent space. A large body of literature exists on manifold learning for NLDR, including classic methods such as ISOMAP [25] and LLE (locally linear embedding) [24], more recent developments [8, 13, 31, 6, 10, 18], and popular visualization methods such as t-SNE [17]. The problem can be considered from the viewpoints of geometry [5] and topology [27, 21]. The manifold assumption [19, 9] is the basis for NLDR, and preserving local geometric structure is the key to its success.
Table 1: Functional comparison with other popular methods.

                                             MLDL  AE/TopoAE  MLLE  ISOMAP  t-SNE
Manifold learning without decoder             Yes      No      Yes    Yes    Yes
Learned NLDR applicable to test data          Yes      Yes     No     No     No
Able to generate data of learned manifolds    Yes      No      No     No     No
In this paper, we develop a novel deep learning framework, called Markov-Lipschitz deep learning (MLDL), for manifold learning-based NLDR, representation learning, and data generation tasks. The motivations are the following: (1) We believe that existing layer-wise vector transformations of neural networks can be enhanced into metric homeomorphisms by imposing on them a constraint, which we call locally isometric smoothness (LIS). The resulting LIS-constrained homeomorphisms are continuous and bijective mappings, and, due to the geometry-preserving property of LIS, are kept from collapse, twisting, or crossing, thus improving generalization, stability, and robustness. (2) Existing manifold learning methods are based on local geometric structures of data samples, and thus may be modeled by conditional probabilities of Markov random fields (MRFs) locally and an MRF-Gibbs distribution globally.
The paper combines the above two features into the MLDL framework by implementing the LIS constraint through locally bi-Lipschitz continuity and encoding it into the energy (loss) function of an MRF-Gibbs distribution. The functional features of MLDL are summarized in Table 1 in comparison with other popular methods. The proposed MLDL has advantages over other AE-based methods in terms of manifold learning without a decoder and generating new data from the learned manifold. These merits are ascribed to its distinctive property of bijection and invertibility endowed by the LIS constraint. The main contributions, to the best of our knowledge, are summarized below:

Proposing the MLDL framework that (i) imposes the prior LIS constraint across network layers, (ii) constrains a neural network as a cascade of homeomorphic transformations, and (iii) encodes the constraint into an MRF-Gibbs prior distribution. This results in MLDL-based neural networks optimized in terms of not only local geometry preservation measured by geometric distortion but also homeomorphic regularity measured by the locally bi-Lipschitz constant.

Proposing two instances of MLDL-based neural networks: the Markov-Lipschitz Encoder (ML-Enc) for manifold learning and the ML AutoEncoder (ML-AE). The decoder part (ML-Dec) of the ML-AE helps regularize manifold learning of the ML-Enc and also acts as a manifold data generator.

Proposing an auxiliary term for MLDL training. It assists graduated optimization and prevents MLDL from falling into bad local optima (failure to unfold the manifold into a plane in the latent space).

Providing extensive experiments, with self- and comparative evaluations and an ablation study, which demonstrate significant advantages of MLDL over existing methods in terms of locally geometric distortion and locally bi-Lipschitz continuity.
Related work. Besides the above-mentioned literature, related work also includes the following: Markov random field (MRF) concepts [11, 16] are used for establishing a connection between neighborhood-based loss (energy) functions and the global MRF-Gibbs distribution. Lipschitz continuity has been used to improve neural networks in generalization [3, 1], robustness against adversarial attacks [20, 28, 7], and generative modeling [32, 23]. Algorithms for estimating Lipschitz constants are developed in [26, 12, 20, 30, 15].
Organization. In the following, Section 2 introduces the MLDL network structure and related preliminaries, Section 3 presents the key ideas of MLDL and the two MLDL neural networks, and Section 4 presents extensive experiments. More details of the experiments can be found in the Appendix.
2 MLDL Network Structure
The structure of MLDL networks is illustrated in Fig. 1. The ML-Encoder, composed of a cascade of locally homeomorphic transformations, is aimed at manifold learning and NLDR, and the corresponding ML-Decoder is for manifold reconstruction and data generation. The LIS constraint is imposed across layers and encoded into the energy (loss) function of an MRF-Gibbs distribution. This leads to good properties in local geometry preservation and locally bi-Lipschitz continuity. This section introduces preliminaries and describes the MLDL concepts of local homeomorphisms and the LIS constraint in connection to MLDL.
2.1 MRFs on neural networks
Markov random fields (MRFs) can be used for manifold learning-based NLDR and manifold data generation at the data sample level because the modeling therein is done with respect to some neighborhood system. On the other hand, at the neural network level, layer-wise interaction and performance of a neural network can be modeled by Markov or relational graphs (e.g. [2, 29]). This section introduces the basic notion of MRFs for modeling relationships between data at each layer and interactions between layers.
Data samples on manifold. Let $X = \{x_1, \ldots, x_M\}$ be a set of samples on a manifold $\mathcal{M}_X$. We aim to learn a latent representation $Z$ of $X$ based on the local structure of $X$. For this we need to define a metric $d$ and a neighborhood system $\mathcal{N} = \{\mathcal{N}_i\}$. In this paper, we use the Euclidean metric for $d$ and define $\mathcal{N}_i$ as the $k$-NN of $x_i$, and when we write a pair $(i, j)$, it is restricted by $j \in \mathcal{N}_i$ unless specified otherwise. Also, note that the concept of $\mathcal{N}$ is used for training an MLDL network only and is not needed for testing. When $\mathcal{M}_X$ is Riemannian, its tangent subspace at any $x \in \mathcal{M}_X$ is locally isomorphic to a Euclidean space, and this concept is the basis for manifold learning.
MRF modeling of manifold data samples. Manifold learning algorithms usually work with a neighborhood system; namely, the direct influence of every sample point $x_i$ on the other points is limited to its neighbors $x_j$ for $j \in \mathcal{N}_i$ only (condition 1). As long as the neighboring relationship is defined, the conditional distribution $p(x_i \mid x_j, j \neq i)$ is positive (condition 2). The collection of random variables $X$ is an MRF when these two conditions are satisfied, which is generally the case for manifold learning algorithms.
Furthermore, $X$ is an MRF with respect to $\mathcal{N}$ if and only if its joint distribution is a Gibbs distribution with respect to $\mathcal{N}$ [4]. A Gibbs distribution (GD) of $X$ is featured by two things: (1) it belongs to the exponential family; (2) its energy function is defined on cliques. A clique consists of either a single node $\{i\}$ or a pair of neighboring nodes $\{i, j\}$. Let us denote the respective sets by $C_1$ and $C_2$, and the collection of all cliques by $C = C_1 \cup C_2$ if $C_1$ is also taken into consideration. A GD of $X$ with respect to $\mathcal{N}$ takes the following form

$$P(X) = \frac{1}{Z} \exp\!\left(-\frac{E(X)}{T}\right), \quad \text{where} \quad E(X) = \sum_{c \in C} V_c(X) \quad (1)$$

is the energy composed of clique potentials $V_c(X)$, $Z$ is the normalizing partition function, and $T$ is a global parameter called the temperature. The MRF-Gibbs equivalence theorem enables the formulation of loss functions in a principled way based on probability and statistics, rather than heuristically, thus laying a rational foundation for optimization-based manifold learning. The LIS loss and other loss functions to be formulated in this paper are all based on clique potential-based energy functions as in Equ. (1).
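As a concrete illustration of the clique-potential energy of Equ. (1), the following sketch computes a pairwise energy over a k-NN neighborhood system and the corresponding unnormalized Gibbs probability (function and variable names are illustrative, not from the paper's code; the partition function is omitted since it is intractable in general):

```python
import numpy as np

def gibbs_energy(X, neighbors, potential):
    """Energy E(X) as a sum of pairwise clique potentials over a
    neighborhood system, as in Equ. (1). `potential` maps a pair of
    samples to a scalar clique potential V(x_i, x_j)."""
    E = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            if i < j:  # count each undirected clique {i, j} once
                E += potential(X[i], X[j])
    return E

def gibbs_prob_unnormalized(X, neighbors, potential, T=1.0):
    # Gibbs distribution P(X) proportional to exp(-E(X)/T);
    # the partition function Z is omitted here.
    return np.exp(-gibbs_energy(X, neighbors, potential) / T)
```

For instance, with a squared-distance potential, the energy is simply the sum of squared neighbor distances.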
MRF modeling of neural network layers. We extend the concepts of the neighborhood system and clique potentials to impose between-layer constraints. Consider an $L$-layered manifold-transforming neural network, such as the ML-Encoder in Fig. 1. We have $(Z^{(l)}, d^{(l)})$ as the metric space for layer $l$. Let $Z^{(0)} = X$ be the input data, which is fixed, while all the other $Z^{(l)}$ (for $l \geq 1$) are part of the transformed data or solution. Between a pair of layers $l$ and $l'$ can exist undirected links which we call super-cliques. Let $S$ be the set of all super-cliques, and $S_2 \subseteq S$ be the subset of the pairwise super-cliques under consideration. Underlying $S$ is a neighborhood system which is independent of the input data. The collection $\{Z^{(l)}\}$ is an MRF with respect to this neighborhood system, and its global joint probability can be modeled by a clique potential-based Gibbs distribution.
2.2 Local homeomorphisms
Transformation between graphs. Manifold learning from $X$ is associated with a graph whose edges can be defined from the distance matrix $D_X$. The objective of manifold learning or NLDR is to find a local homeomorphism
$$\Phi: X \to Z$$
transforming $X$ from the input space to the latent space $Z$ with $\dim(Z) < \dim(X)$. In this work, $\Phi$ is realized by the ML-Enc, which is required to best satisfy the LIS constraint. The reverse process, manifold data generation realized by the ML-Dec, is also a local homeomorphism
$$\Phi^{-1}: Z \to X.$$
Cascade of local homeomorphisms. While $\Phi$ (and its inverse) can be highly nonlinear and complex, we decompose it into a cascade of less nonlinear, locally isometric homeomorphisms $\Phi = \Phi_L \circ \cdots \circ \Phi_1$. This can be done using an $L$-layer neural network (e.g. the ML-Enc) of the following form
$$Z^{(0)} \xrightarrow{\Phi_1} Z^{(1)} \xrightarrow{\Phi_2} \cdots \xrightarrow{\Phi_L} Z^{(L)}$$
in which $Z^{(0)} = X$ is the input data, $\Phi_l$ the nonlinear transformation at layer $l$, $Z^{(l)}$ the output of $\Phi_l$, $\mathcal{N}^{(l)}$ the neighborhood system, and $D^{(l)}$ the distance matrix given metric $d^{(l)}$. The layer-wise transformation can be written as
$$Z^{(l)} = \Phi_l(Z^{(l-1)}) = \sigma(W^{(l)} Z^{(l-1)})$$
in which $W^{(l)}$ is the weight matrix of the neural network to be learned, the distance matrix $D^{(l)}$ is updated after each iteration, and the product $W^{(l)} Z^{(l-1)}$ is followed by a nonlinear activation $\sigma$.
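The cascade of layer-wise transformations can be sketched as follows, a minimal NumPy illustration assuming the standard form (weight multiplication followed by a LeakyReLU activation, as used in the experiments); the function names are illustrative:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # LeakyReLU activation, the nonlinearity used in the experiments
    return np.where(x > 0, x, slope * x)

def forward_cascade(X, weights):
    """Cascade of layer-wise transformations, mirroring the ML-Enc
    decomposition of Phi into Phi_L o ... o Phi_1. Returns the outputs
    of every layer so that cross-layer (super-clique) losses can be
    computed between any pair of layers."""
    Zs = [X]  # layer 0 holds the fixed input data
    for W in weights:
        Zs.append(leaky_relu(Zs[-1] @ W))
    return Zs
```

Keeping every intermediate output is what makes the cross-layer LIS constraint of the next section straightforward to evaluate.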
Such a decomposition is made possible by the property that the tangent space of a Riemannian manifold is locally isomorphic to a simple Euclidean metric space. The layer-wise neural network unfolds $X$ in the input space, stage by stage, into $Z^{(L)}$ in a latent space so as to best preserve the local geometric structure of $X$. In other words, given $Z^{(l-1)}$, $\Phi_l$ transforms $Z^{(l-1)}$ onto $Z^{(l)}$, finally resulting in the embedding $Z = Z^{(L)}$.
Effective homeomorphisms. Although the actual neural transformations are from one layer to the next, between any two layers $l$ and $l' > l$ there is an effective compositional homeomorphism $\Phi^{(l, l')} = \Phi_{l'} \circ \cdots \circ \Phi_{l+1}$. The LIS constraint can be imposed between any pair of layers $l$ and $l'$ to constrain $\Phi^{(l, l')}$, and eventually the overall $\Phi$.
Evolution in graph structure. The graphs, as represented by $D^{(l)}$, evolve from layer to layer as the learning continues, except for the input layer. The neighborhood structures are allowed to change from $\mathcal{N}^{(l)}$ to $\mathcal{N}^{(l')}$ to admit geometric deformations caused by the nonlinearity of $\Phi^{(l, l')}$. To account for such changes across layers, we define the set union $C^{(l, l')} = C^{(l)} \cup C^{(l')}$ as the set of pairwise cliques to be considered in formulating losses for $\Phi^{(l, l')}$.
3 Markov-Lipschitz Neural Networks
3.1 Local isometry and Lipschitz continuity
The core of MLDL is the imposition of the prior LIS constraint across layers (shown as orange-colored arcs and dashed lines in Fig. 1). MLDL requires that the homeomorphism $\Phi^{(l, l')}$ satisfy the LIS constraint across layers; that is, the distances (or some other metrics) be preserved locally, for $(i, j) \in C^{(l, l')}$, as far as possible, so as to optimize homeomorphic regularity between metric spaces at different network layers. Such a $\Phi^{(l, l')}$ can be learned by minimizing the following energy functional

$$E(\Phi^{(l, l')}) = \sum_{(i, j) \in C^{(l, l')}} \left| d^{(l)}(z_i^{(l)}, z_j^{(l)}) - d^{(l')}(z_i^{(l')}, z_j^{(l')}) \right| \quad (2)$$
$E(\Phi^{(l, l')})$ measures the overall geometric distortion and homeomorphic (ir)regularity of the neural transformation and reaches the lower bound of $0$ when the local isometry constraint is strictly satisfied. $\Phi^{(l, l')}$ is said to be locally Lipschitz if for all $(i, j) \in C^{(l, l')}$ there exists $K \geq 0$ with
$$d^{(l')}(z_i^{(l')}, z_j^{(l')}) \leq K\, d^{(l)}(z_i^{(l)}, z_j^{(l)}).$$
This requires that $\Phi^{(l, l')}$ does not "collapse", i.e. $z_i^{(l')} \neq z_j^{(l')}$ for $z_i^{(l)} \neq z_j^{(l)}$. A Lipschitz mapping or homeomorphism with a smaller $K$ tends to generalize better [3, 1], be more robust to adversarial attacks [28, 7], and be more stable in generative model learning [32, 23]. A mapping is locally bi-Lipschitz if for all $(i, j) \in C^{(l, l')}$ there exists $K \geq 1$ with

$$\frac{1}{K}\, d^{(l)}(z_i^{(l)}, z_j^{(l)}) \leq d^{(l')}(z_i^{(l')}, z_j^{(l')}) \leq K\, d^{(l)}(z_i^{(l)}, z_j^{(l)}). \quad (3)$$
The best possible bi-Lipschitz constant is $K = 1$, and the closer to 1 the better. A good bi-Lipschitz homeomorphism not only well preserves the local geometric structure but also improves the stability and robustness of the resulting MLDL neural networks, as will be validated by our extensive experiments.
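A local bi-Lipschitz constant in the sense of Equ. (3) can be estimated per neighborhood as follows; this is an illustrative sketch (names are not from the paper's code) whose per-point minimum and maximum correspond to the Min/Max metrics reported later:

```python
import numpy as np

def local_bilipschitz(X, Z, neighbors, eps=1e-12):
    """Estimate the local bi-Lipschitz constant K of a mapping X -> Z
    over a k-NN neighborhood system: the smallest K with
    (1/K) d(x_i, x_j) <= d(z_i, z_j) <= K d(x_i, x_j).
    Returns one constant per point."""
    Ks = []
    for i, nbrs in enumerate(neighbors):
        ratios = []
        for j in nbrs:
            dx = np.linalg.norm(X[i] - X[j]) + eps
            dz = np.linalg.norm(Z[i] - Z[j]) + eps
            r = dz / dx
            ratios.append(max(r, 1.0 / r))  # bound both directions
        Ks.append(max(ratios))
    return np.array(Ks)
```

An isometric mapping yields constants of exactly 1; any expansion or contraction of a neighbor distance pushes the constant above 1.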
3.2 Markov-Lipschitz encoder (ML-Enc)
ML-Enc for Manifold Learning. The ML-Enc, unlike other AEs, can learn an NLDR transformation from the input $X$ to the latent code $Z$ by using the LIS constraint only, without the need for a reconstruction loss. Its loss consists of the LIS loss plus a transit, auxiliary push-away loss
$$\mathcal{L}_{\mathrm{Enc}} = \mathcal{L}_{\mathrm{LIS}} + \nu\, \mathcal{L}_{\mathrm{push}}$$
in which the weight $\nu$ starts from a positive value at the beginning of manifold learning, so that the auxiliary term is effective, and gradually decreases to $0$, so that only the target $\mathcal{L}_{\mathrm{LIS}}$ takes effect finally.
LIS loss. The prior LIS constraint enforces the classic isometry, that of Equ. (2) and (3), across layers, as a key prior on the manifold learning and NLDR task. It is encoded into cross-layer super-clique potentials which are then summed over all super-cliques into the energy function of an MRF-Gibbs distribution. The clique potentials due to the LIS constraint are defined as
$$V^{(l, l')}_{ij}(Z) = \left| d^{(l)}(z_i^{(l)}, z_j^{(l)}) - d^{(l')}(z_i^{(l')}, z_j^{(l')}) \right|$$
where $(l, l')$ are between-layer super-cliques and $(i, j)$ are between-sample cliques. Note that $Z^{(0)} = X$ is the given input data and is fixed, whereas $Z^{(l)}$ for $l \geq 1$ are part of the solution and can be rewritten as $Z^{(l)} = \Phi^{(0, l)}(X)$. The potential actually imposes a constraint on the solution $\{Z^{(l)}\}$. It corresponds to the energy function in the prior MRF-Gibbs distribution, regardless of the input data $X$.
Summing up the potentials gives rise to the energy function

$$\mathcal{L}_{\mathrm{LIS}}(W) = \sum_{(l, l')} \alpha^{(l, l')} \sum_{(i, j) \in C^{(l, l')}} \left| d^{(l)}(z_i^{(l)}, z_j^{(l)}) - d^{(l')}(z_i^{(l')}, z_j^{(l')}) \right| \quad (4)$$

where $C^{(l, l')}$ is the union of the two pairwise clique sets as defined before and $\alpha^{(l, l')}$ are weights. Here the LIS loss $\mathcal{L}_{\mathrm{LIS}}$, corresponding to Equ. (2), is expressed as a function of the network weights $W$ because, given the network architecture, $Z^{(l)}$ is determined by $W$. The weights $\alpha^{(l, l')}$ determine how the LIS constraint is imposed across layers. Wishing that, as a layer $l$ goes closer and closer to the deepest latent layer $L$, the Euclidean metric makes more and more sense, we evaluate three schemes (decreasing, increasing, and constant $\alpha^{(0, l)}$ as the link goes deeper) for linking between the input layer and each subsequent ML-Enc layer.
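A minimal sketch of the cross-layer LIS loss follows; the function and argument names are illustrative, and the per-layer-pair weights stand in for whichever linking scheme (decreasing, increasing, or constant) is chosen:

```python
import numpy as np

def lis_loss(Zs, layer_pairs, neighbor_pairs, weights=None):
    """Cross-layer LIS loss in the spirit of Equ. (4): a weighted sum,
    over super-cliques (layer pairs) and neighboring sample pairs, of
    the local distance distortion between the two layers.
    `Zs[l]` is the output of layer l; `neighbor_pairs` is the union of
    the k-NN pair sets of the two layers."""
    loss = 0.0
    for (l, lp) in layer_pairs:
        w = 1.0 if weights is None else weights[(l, lp)]
        for (i, j) in neighbor_pairs:
            d_l = np.linalg.norm(Zs[l][i] - Zs[l][j])
            d_lp = np.linalg.norm(Zs[lp][i] - Zs[lp][j])
            loss += w * abs(d_l - d_lp)
    return loss
```

The loss is zero exactly when every neighbor distance is preserved between the linked layers, i.e. when the transformation is locally isometric.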
Push-away loss. The push-away loss is defined over pairs $(i, j)$ that are non-neighbors at layer $l$, i.e. $j \notin \mathcal{N}_i^{(l)}$, yet satisfy $d^{(l')}(z_i^{(l')}, z_j^{(l')}) < B$, in which $\mathbf{1}[\cdot]$ denotes the indicator function selecting such pairs and $B$ is a bound. This term is aimed to help "unfold" nonlinear manifolds by exerting a spring force to push away from each other those pairs which are non-neighbors at layer $l$ but nearby (distance smaller than $B$) at layer $l'$.
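The push-away term can be sketched as follows; the hinge form (bound minus distance) is an assumed penalty, one plausible realization of the "spring force" described above, and the function name is illustrative:

```python
import numpy as np

def push_away_loss(Z_lat, non_neighbor_pairs, bound):
    """Auxiliary push-away loss: for pairs (i, j) that are NOT
    neighbors at the source layer but end up closer than `bound`
    at the target layer, penalize the remaining gap so that
    gradient descent pushes them apart. The caller supplies the
    non-neighbor pairs determined at the source layer."""
    loss = 0.0
    for (i, j) in non_neighbor_pairs:
        d = np.linalg.norm(Z_lat[i] - Z_lat[j])
        if d < bound:              # indicator: only nearby non-neighbors
            loss += bound - d      # hinge-style penalty (assumed form)
    return loss
```

In training, this term is multiplied by the continuation weight that decays to zero, so it only shapes the early phase of unfolding.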
3.3 Markov-Lipschitz AutoEncoder (ML-AE)
The ML-AE has two purposes: (1) helping further regularize ML-Enc-based manifold learning with the ML-Dec and reconstruction losses, and (2) enabling data generation from the learned manifold. The ML-AE is constructed by appending the ML decoder (ML-Dec) to the ML-Enc, implementing the inverse of the ML-Enc, with reconstruction losses imposed. The ML-Dec is symmetric to the ML-Enc in its structure (see Fig. 1). Nonetheless, an asymmetric decoder is also acceptable.
The LIS loss for the ML-Dec can be defined in a similar way to Equ. (4). The LIS constraint may also be imposed between the corresponding layers of the ML-Enc and ML-Dec. The total reconstruction loss is the sum of the individual ones between the corresponding layers (shown as dashed lines in Fig. 1)

$$\mathcal{L}_{\mathrm{rec}} = \sum_{l} \gamma^{(l)} \left\| Z^{(l)} - \hat{Z}^{(l)} \right\|^2 \quad (5)$$

where $\gamma^{(l)}$ are weights and $\hat{Z}^{(l)}$ is the decoder layer corresponding to $Z^{(l)}$. The total ML-AE loss is then the sum of the ML-Enc loss, the ML-Dec LIS loss, and the reconstruction loss of Equ. (5). Once trained, the ML decoder can be used to generate new data of the learned manifold.
4 Experiments
The purpose of the experiments is to evaluate the ML-Enc and ML-AE in their ability to preserve local geometry of manifolds and to achieve good stability and robustness in terms of relevant evaluation metrics. While this section presents numerical evaluation of the major comparisons, visualization and more numerical results can be found in the Appendix.
Four datasets are used: (i) Swiss Roll (3D) and (ii) S-Curve (3D), generated by the sklearn library [22]; (iii) MNIST (784D); and (iv) Spheres (101D) [21]. Seven methods for manifold learning are compared: ML-Enc (ours), HLLE [8], MLLE and LTSA [31], ISOMAP [25], LLE [24], and t-SNE [17]. Four autoencoder methods are compared for manifold learning, reconstruction, and manifold data generation: ML-AE (ours), AE [13], VAE [14], and TopoAE [21].
The Euclidean distance metric is used for all layers ($l = 0, \ldots, L$) and is normalized by the dimensionality. The $k$-NN scheme (or $\epsilon$-ball) is used to define neighborhood systems. The learning rate is set to 0.001, and the batch size is set to the number of samples. LeakyReLU is used as the activation function. Continuation: $\nu$ starts from an initial value at epoch 500 and decreases linearly to the minimum of 0 at epoch 1000.
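The continuation schedule for the push-away weight can be sketched as (a minimal helper with an illustrative name, following the description above: constant until the start epoch, then linear decay to zero):

```python
def continuation_nu(epoch, nu0, start=500, end=1000):
    """Linear continuation schedule for the push-away weight nu:
    constant at nu0 until `start`, then decaying linearly to 0 at
    `end`, after which only the target LIS loss takes effect."""
    if epoch <= start:
        return nu0
    if epoch >= end:
        return 0.0
    return nu0 * (end - epoch) / (end - start)
```

This is the graduated-optimization device that lets the auxiliary term unfold the manifold early on without biasing the final isometric solution.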
Hyperparameters. For Swiss Roll and S-Curve, the network structure is [3, 100, 100, 100, 3, 2] (working with a range of 0.2 to 0.3). For MNIST, the network structure is [784, 1000, 500, 250, 100, 2]. For Spheres5500+5500, the network structure is [101, 50, 25, 2]. For Spheres10000, the network structure is [101, 90, 80, 70, 2], with the other hyperparameters the same as for Spheres5500+5500. The implementation is based on the PyTorch library running on Ubuntu 18.04 on an NVIDIA V100 GPU. The code is available at https://github.com/westlakecairi/MarkovLipschitzDeepLearning
Evaluation metrics include the following. (1) The number of successes (#Succ) is the number of successes (in unfolding the manifold) out of 10 solutions from random initialization (with 10 random seeds). (2) Local KL divergence (LKL) measures the difference between distributions of local distances in the input and latent spaces. (3) Averaged relative rank change (ARRC), (4) Trustworthiness (Trust) and (4') Continuity (Cont) measure how well neighboring relationships are preserved between two spaces (layers). (5) Locally geometric distortion (LGD) measures how much corresponding distances between neighboring points differ in two metric spaces. (6) Mean projection error (MPE) measures the "coplanarity" of 2D embeddings in a higher-dimensional space (in the following, the 3D layer before the 2D latent layer). (7) Minimum (Min) and Maximum (Max) are the extreme values of the local bi-Lipschitz constant of Equ. (3) computed over all neighborhoods. (8) Mean reconstruction error (MRE) measures the difference between the input and output of autoencoders. Of the above, the MPE (or "coplanarity") and Min and Max are used for the first time as evaluation metrics for manifold learning. Their exact definitions are given in Appendix A.1. Every set of experiments is run 10 times with the 10 random seeds, and the results are averaged into the final performance metric. When a run is unsuccessful, the numerical averages are not very meaningful, so the numbers are shown in gray in the following tables.
Table 2: Comparison of manifold learning methods on the Swiss Roll (800 points) data.

          #Succ   LKL     ARRC      Trust   Cont    LGD      Min    Max      MPE
ML-Enc     10    0.0184  0.000414  0.9999  0.9985  0.00385  1.00   2.14     0.0718
MLLE        6    0.1251  0.030702  0.9455  0.9844  0.04534  7.37   238.74   0.1709
HLLE        6    0.1297  0.034619  0.9388  0.9859  0.04542  7.44   218.38   0.0978
LTSA        6    0.1296  0.034933  0.9385  0.9859  0.04542  7.44   215.93   0.0964
ISOMAP      6    0.0234  0.009650  0.9827  0.9950  0.02376  1.11   34.35    0.0429
t-SNE       0    0.0450  0.006108  0.9987  0.9843  3.40665  11.1   1097.62  0.1071
LLE         0    0.1775  0.014249  0.9753  0.9895  0.04671  6.17   451.58   0.1400
TopoAE      0    0.0349  0.022174  0.9661  0.9884  0.13294  1.27   189.95   0.1307
Table 3: Success rates (#Succ out of 10) under varying numbers of samples and noise levels.

                    #Samples                     Noise level
          700  800  1000  1500  2000   0.05  0.10  0.15  0.20  0.25  0.30
ML-Enc     10   10    10    10    10     10    10    10     9     8     8
MLLE        2    7     7    10    10      7     7     6     6     5     4
HLLE        2    6     7    10    10      7     7     6     6     5     3
LTSA        2    6     7    10    10      7     7     6     6     5     3
ISOMAP      2    6     7    10    10      7     7     6     6     5     4
4.1 ML-Enc for manifold learning and NLDR
NLDR quality. Table 2 compares the ML-Enc with 7 other methods (TopoAE also included) in 9 evaluation metrics, using the Swiss Roll (800 points) manifold data (the higher #Succ, Trust and Cont, the better; the lower the other metrics, the better). Results with the S-Curve are given in Appendix A.2. While the MPE is calculated at the 3D layer, the other metrics are calculated between the input and the latent layer. The results demonstrate that the ML-Enc outperforms all the other methods in all the evaluation metrics, particularly significantly in terms of the isometry-related (LGD, ARRC, Trust and Cont) and Lipschitz-related (Min and Max) metrics.
Robustness to sparsity and noise. Table 3 evaluates the success rates of 5 manifold learning methods (t-SNE and LLE not included because they had zero successes) in their ability to unfold the manifold, with robustness to varying numbers of samples (700, 800, 1000, 1500, 2000) and standard deviations of added noise. The corresponding evaluation metrics are provided in Appendix A.3. These two sets of experiments demonstrate that the ML-Enc achieves the highest success rate, the best performance metrics, and the best robustness to data sparsity and noise.
Generalization to unseen data. The ML-Enc trained with 800 points can generalize well to unfold unseen samples of the learned manifold. The test is done as follows: First, a set of 8000 points of the Swiss Roll manifold is generated; the data set is then modified by removing, from the generated 8000 points, the shape of a diamond, square, pentagram, or five-ring, respectively, creating 4 test sets. Each point of a test set is transformed independently by the trained ML-Enc to obtain an embedding (shown in Appendix A.4). We can see that the unseen manifold data sets are well unfolded by the ML-Enc and the removed shapes are kept very well, illustrating that the learned ML-Enc has a good ability to generalize to unseen data. Since the LLE-based, LTSA, and ISOMAP algorithms do not possess such a generalization ability, the ML-Enc is compared with the encoder parts of the AE-based algorithms. Unfortunately, AE and VAE failed altogether for the Swiss Roll data sets.
4.2 ML-AE for manifold generation
Manifold reconstruction. This set of experiments compares the ML-AE with AE [13], VAE [14], and TopoAE [21]. Table 4 compares the 9 quality metrics (each value being the average over the 10 runs) for the 4 autoencoders with the Swiss Roll (800 points) data. The MPE is the average of the MPEs at the two 3D layers, and the other metrics are calculated between the input and output layers of the AEs. While the other 3 autoencoders fail to unfold the manifold data sets (hence their metrics do not make much sense), the ML-AE produces good quality results, especially in terms of the isometry- and Lipschitz-related metrics. The resulting metrics also suggest that Max could be used as an indicator of success/failure in manifold unfolding.
Table 4: Comparison of autoencoders on the Swiss Roll (800 points) data.

         #Succ   LKL      ARRC     Cont    LGD      Min    Max      MPE      MRE
ML-AE     10    0.00165  0.00070  0.9998  0.00514  1.01   2.54     0.04309  0.01846
AE         0    0.11537  0.13589  0.9339  0.03069  1.82   5985.74  0.01519  0.40685
VAE        0    0.23253  0.49784  0.5039  0.04000  1.49   5290.55  0.01977  0.78104
TopoAE     0    0.05793  0.04891  0.9762  0.09651  1.10   228.11   0.12049  0.56013
Manifold data generation. The ML-Dec part of the trained ML-AE can be used to generate new data of the learned manifold, mapping from random samples in the latent space to points in the ambient input space. The generated manifold data points, shown in Appendix A.5, are well-behaved due to the bi-Lipschitz continuity. This also suggests that it is possible to construct invertible, bijective mappings between spaces of different dimensionalities using the proposed MLDL method.
4.3 Results with highdimensional datasets
Having evaluated toy datasets with perceivable structure, the following presents the results with the MNIST dataset in 784D and the Spheres dataset in 101D space, obtained by using the ML-Enc and ML-AE in comparison with others, shown in Table 5 (and Figures A1, A2 and A3 in Appendix A.2). For the MNIST dataset of 10 digits, a subset of 8000 points is randomly chosen for training and another non-overlapping subset of 8000 points for testing. The ML-Enc is used for NLDR by manifold learning. The results are shown in the first part of Table 5.
Table 5: Results on the MNIST and Spheres5500+5500 datasets.

                    LKL    ARRC   Trust  Cont   Kmin   Kmax     LGD
MNIST
  ISOMAP           0.428  0.003  0.994  0.993  6.395  23.62    2.629
  MLLE             0.227  0.392  0.609  0.554  1.773  7.4E+11  0.180
  t-SNE            0.120  0.005  0.994  0.988  1.137  14385.0  0.192
  TopoAE           0.174  0.006  0.999  0.993  3.200  7.012    0.298
  ML-Enc (ours)    0.324  0.004  0.991  0.998  1.006  1.273    0.006
Spheres5500+5500
  ISOMAP           0.082  0.091  0.881  0.933  1.241  194.5    0.251
  MLLE             0.233  0.095  0.880  0.924  1.693  862.1    0.041
  t-SNE            0.079  0.046  0.924  0.968  1.526  690.5    6.814
  TopoAE           0.103  0.091  0.883  0.940  1.195  77.5     0.507
  ML-Enc (ours)    0.007  0.095  0.879  0.931  1.023  40.9     0.021
Table 6: Ablation study with the Swiss Roll (800 points) data: (A) the LIS loss and (B) the push-away loss, for the ML-Enc (upper rows, MRE not applicable) and the ML-AE (lower rows).

      #Succ   LKL      ARRC     Cont    LGD       Min     Max      MPE      MRE
AB     10    0.01842  0.00041  0.9998  0.00385   1.00    2.14     0.04309  N/A
A       0    0.06257  0.09208  0.9810  0.00434   1.00    5.76     0.00574  N/A
B       0    0.05213  0.09407  0.9806  27.76862  177.41  1089.54  0.00374  N/A
AB     10    0.00165  0.00070  0.9998  0.00514   1.01    2.54     0.04309  0.01846
A       0    0.07579  0.09127  0.9810  0.02776   1.05    11.34    0.00574  0.24499
B       0    0.06916  0.08457  0.9806  0.01896   1.05    125.69   0.00373  0.25894
The Spheres10000 dataset, proposed by TopoAE [21], is composed of 1 big sphere enclosing 10 small ones in 101D space. Spheres5500+5500 differs in that its big sphere consists of only 500 samples (whereas that in Spheres10000 has 5000); the data is so sparse that the smallest within-sphere distance on the big sphere can be larger than that between the big sphere and some small ones. The results are shown in the lower parts of Table 5 and in Appendix A.2. From Table 5 (and Appendix A.2), we can see the following:

For the MNIST dataset, the ML-Enc achieves overall the best performance metric values and possesses an ability to generalize to unseen test data, whereas the other compared methods cannot. Visualization-wise, however, t-SNE delivers the most appealing result.

For the Spheres dataset, both the ML-AE and the ML-Enc (without a decoder) perform better than the SOTA TopoAE, and the ML-AE generalizes better than the ML-Enc.

The ML-AE and the ML-Enc handle sparsity better and generalize better than the others.
Overall, the MLDL framework has demonstrated its superiority over the compared methods in terms of Lipschitz constant and isometryrelated properties, realizing its promise.
4.4 Ablation study
This section evaluates the effects of the two loss terms of the ML-AE on the 9 performance metrics, with the Swiss Roll (800 points) data: (A) the LIS loss and (B) the push-away loss. Table 6 shows the results when the LIS loss is applied between layers 0 and L (ML-Enc) and between layers 0 and 0' (ML-AE). The conclusions are: (1) the LIS loss (A) is the most important for achieving the excellent results; (2) the push-away term (B), which is applied with a decreasing weight diminishing on convergence, helps unfold manifolds, especially with challenging input. Overall, "AB" is the best combination to make the algorithms work. See more in Appendix A.6.
5 Conclusions
The proposed MLDL framework imposes the cross-layer LIS prior constraint to optimize neural transformations in terms of local geometry preservation and local bi-Lipschitz constants. Extensive experiments with manifold learning, reconstruction, and generation consistently demonstrate that the MLDL networks (ML-Enc and ML-AE) well preserve local geometry for manifold learning and data generation and achieve excellent local bi-Lipschitz constants, advancing deep learning techniques. The main ideas of MLDL are general and effective and are potentially applicable to a wide range of neural networks for improving representation learning, data generation, and network stability and robustness. Future work includes the following: (1) while LIS preserves the Euclidean distance, nonlinear forms of metric preservation will be explored to allow more flexibility for more complicated nonlinear tasks; (2) developing invertible, bijective neural network mappings using the LIS constraint with bi-Lipschitz continuity; (3) extending the unsupervised version of MLDL to self-supervised, semi-supervised, and supervised tasks; (4) further formulating MLDL so that cross-layer link hyperparameters become part of the learnable parameters.
Acknowledgments
We would like to acknowledge funding support from Westlake University and the Bright Dream Joint Institute for Intelligent Robotics, and thank Zicheng Liu, Zhangyang Gao, Haitao Lin, and Yiming Qiao for their assistance in processing experimental results. Thanks are also given to Zhiming Zhou for his helpful comments and suggestions for improving the manuscript.
Appendix
A.1 Definitions of performance metrics
We adopted most of the performance metrics used in TopoAE [21] because they are suitable for evaluating geometry- and topology-based manifold learning and data generation. We restrict the cross-layer related metrics to those concerning two metric spaces, namely the input space (denoted $X$) and the latent space (denoted $Z$), because the other compared algorithms do not use other cross-layer constraints. The following notations are used in the definitions:

$d_{ij}^X$: the pairwise distance in space $X$ (i.e. the input space);

$d_{ij}^Z$: the pairwise distance in space $Z$ (i.e. the latent space);

$\mathcal{N}_i^X$: the set of indices to the $k$ nearest neighbors ($k$-NN) of $x_i$;

$\mathcal{N}_i^Z$: the set of indices to the $k$-NN of $z_i$;

$r_{ij}^X$: the closeness rank of $x_j$ in the $k$-NN of $x_i$;

$r_{ij}^Z$: the closeness rank of $z_j$ in the $k$-NN of $z_i$.
The evaluation metrics are defined below:

#Succ is the number of times, out of 10 runs each with a random seed, that a manifold learning method has successfully unfolded the 3D input manifold (the Swiss Roll or S-Curve) into a 2D planar embedding without twists, tearing, or other defects.

LKL (Local KL divergence) measures the discrepancy between distributions of local distances in two spaces, defined as a KL divergence between the normalized similarity distributions over each neighborhood, where the "similarity" (a nonlinear function of the distance) between points $i$ and $j$ is computed with a locality parameter $\sigma$.
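An illustrative implementation of the LKL metric follows, assuming a Gaussian-kernel similarity (the exact similarity function used in the paper is not reproduced here; this form and the helper name are assumptions):

```python
import numpy as np

def local_kl(D_in, D_lat, neighbors, sigma=1.0):
    """Local KL divergence (LKL) between distributions of local
    distances in two spaces. Per-point Gaussian similarities (an
    assumed kernel) are normalized over the k-NN and compared with
    KL divergence, then averaged over all points."""
    eps = 1e-12
    kl = 0.0
    for i, nbrs in enumerate(neighbors):
        p = np.exp(-np.array([D_in[i, j] for j in nbrs]) ** 2 / sigma)
        q = np.exp(-np.array([D_lat[i, j] for j in nbrs]) ** 2 / sigma)
        p = p / (p.sum() + eps)
        q = q / (q.sum() + eps)
        kl += np.sum(p * np.log((p + eps) / (q + eps)))
    return kl / len(neighbors)
```

Identical local distance distributions yield a value of zero; any distortion of the local distance profile increases the divergence.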

ARRC (Averaged relative rank change) measures the average change in neighbor ranking between two spaces (layers) $X$ and $Z$, averaged over neighborhood sizes from a lower bound $k_1$ to an upper bound $k_2$ of the $k$-NN, with the rank changes scaled by a normalizing term.

Trust (Trustworthiness) measures how well the nearest neighbors of a point are preserved when going from space $X$ to space $Z$, averaged over neighborhood sizes between the bounds $k_1$ and $k_2$ of the number of nearest neighbors.

Cont (Continuity) is the counterpart of Trust with the two spaces swapped (from space $Z$ to space $X$).

LGD (Locally geometric distortion) measures how much corresponding distances between neighboring points differ in two metric spaces and is the primary metric for isometry: it averages the differences between corresponding neighbor distances over all neighborhoods, where the neighborhood size is that used in MLDL neural network training.
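A minimal sketch of LGD over a set of neighbor pairs (the squared-difference form and the function name are assumptions; an absolute-difference variant would be equally plausible):

```python
import numpy as np

def lgd(Z_a, Z_b, neighbor_pairs):
    """Locally geometric distortion: mean squared difference between
    corresponding neighbor distances in two metric spaces; zero
    exactly when the mapping is locally isometric on the pairs."""
    diffs = []
    for (i, j) in neighbor_pairs:
        d_a = np.linalg.norm(Z_a[i] - Z_a[j])
        d_b = np.linalg.norm(Z_b[i] - Z_b[j])
        diffs.append((d_a - d_b) ** 2)
    return float(np.mean(diffs))
```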

MPE (Mean projection error) measures the "coplanarity" of a set of 3D points (in this work, the 3D layer before the final 2D latent layer). The least-squares 2D plane in the 3D space is fitted from the 3D points, the points are projected onto the fitted plane, and the MPE is defined as the mean distance between the points and their projections.
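The MPE computation can be sketched via a total-least-squares plane fit (an illustrative helper, not the paper's code):

```python
import numpy as np

def mpe(points):
    """Mean projection error: fit a least-squares 2D plane to 3D
    points via the centroid and the smallest principal direction,
    then average the distances of the points to that plane."""
    P = np.asarray(points, dtype=float)
    c = P.mean(axis=0)
    # SVD of the centered points: the last right-singular vector is
    # the normal of the best-fit (total least squares) plane.
    _, _, Vt = np.linalg.svd(P - c)
    normal = Vt[-1]
    residuals = np.abs((P - c) @ normal)
    return float(residuals.mean())
```

Perfectly coplanar points give an MPE of zero, so the metric directly quantifies how flat the 3D layer's embedding is.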

K_min and K_max are the minimum and maximum, over all points, of the local bi-Lipschitz constant of the homeomorphism between layers l and l', with respect to the given neighborhood system: for each point i, the local bi-Lipschitz constant is the largest two-sided distance ratio max(d^l'(i, j)/d^l(i, j), d^l(i, j)/d^l'(i, j)) over its neighbors j in N_i, where N_i is the k-NN neighborhood used in defining the LIS constraint.
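A sketch of computing the local bi-Lipschitz constants; taking the neighborhoods as k-NN in the first space is an assumption of this sketch:

```python
import numpy as np

def local_bilipschitz(X, Z, k=5):
    """For each point, the largest two-sided distance ratio over its
    k-NN; returns (K_min, K_max) over all points (a sketch)."""
    Dx = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    Dz = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    consts = []
    for i in range(len(X)):
        nbrs = np.argsort(Dx[i])[1:k + 1]       # k-NN of i in the first space
        ratio = Dz[i, nbrs] / Dx[i, nbrs]
        consts.append(max(ratio.max(), (1.0 / ratio).max()))
    return min(consts), max(consts)
```

An exact isometry gives K_min = K_max = 1; a collapse (two distinct inputs mapped to one point) drives K_max toward infinity.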

MRE (Mean reconstruction error) measures the difference between the input and output of an autoencoder, as usually defined. More generally, an MRE may also be defined to measure the difference between a pair of corresponding data in the multilayer encoder and decoder.
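A minimal sketch of MRE, taken here as the mean per-sample Euclidean distance between input and reconstruction (one common choice; the paper's exact norm may differ):

```python
import numpy as np

def mre(X, X_rec):
    """Mean reconstruction error between autoencoder input and output,
    as the mean per-sample Euclidean distance (an assumed form)."""
    X, X_rec = np.asarray(X, dtype=float), np.asarray(X_rec, dtype=float)
    return float(np.mean(np.linalg.norm(X - X_rec, axis=1)))
```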
While the meanings of MRE and KL divergence are well known, those of the other metrics are explained as follows:

#Succ is the primary metric measuring the success rate of unfolding a manifold. Without successful unfolding, the other metrics would not make sense. This metric is based on manual observation (see examples in A.8 at the end of this document).

RRE, Trust, and Cont all measure changes in neighboring relationships across layers. Of the three, we consider Cont the most appropriate because it emphasizes the neighboring relationships in the target latent embedding space.

LGD is the primary metric measuring the degree to which the LIS constraint is violated. However, it is unable to detect folding.

K_min and K_max are the key metrics for local bi-Lipschitz continuity (K_max >= K_min >= 1). The closer they are to 1, the better the network homeomorphism preserves the isometry, the more stable the network is in training, and the more robust it is against adversarial attacks.

K_max can effectively identify collapse in the latent space: if the mapping sends two distinct input samples to an identical point (collapse), K_max blows up (toward infinity), whereas LKL is not sensitive to such collapse.
Every set of experiments is run 10 times, each with a data set generated using a different random seed, and every final metric shown is the average of the 10 results. When a run fails to unfold the input manifold data, the resulting averaged metrics are not very meaningful, so such numbers are shown in gray in the following tables of evaluation metrics.
A.2 ML-Enc manifold learning
Table A1 compares the ML-Enc with the other algorithms on the 8 evaluation metrics for the S-Curve. Because LLE and t-SNE have zero success rate in unfolding the manifold, their metrics are not meaningful and should be considered invalid, even though t-SNE achieved the lowest MPE value, which is due to its collapsing to a small cluster. The results show that the ML-Enc performs significantly better than the others on all metrics except MPE.
          #Succ  LKL      RRE       Cont      LGD       K_min    K_max   MPE
ML-Enc    10     0.00737  0.000054  0.999994  0.000402  1.0017   1.17    0.0585
MLLE      10     0.00931  0.008893  0.989029  0.045243  7.2454   32.43   0.1722
HLLE      10     0.00971  0.008866  0.989082  0.045243  7.2341   31.69   0.0804
LTSA      10     0.00971  0.008866  0.989082  0.045243  7.2341   31.69   0.0948
ISOMAP    10     0.00931  0.001762  0.999233  0.023359  1.1148   13.36   0.0173
LLE       0      0.05462  0.008055  0.991993  0.046917  8.4081   143.31  0.1445
t-SNE     0      0.01486  0.002071  0.997830  1.36791   14.6310  311.24  0.0412
Having evaluated toy datasets with perceivable structure, the following presents the results on the MNIST dataset in the 784D space and the Spheres dataset in the 101D space, obtained using the ML-Enc and ML-AE in comparison with the others. For the MNIST dataset of 10 digits, two non-overlapping subsets of 8000 points each are randomly chosen for training and testing. The ML-Enc is used for NLDR by manifold learning. The results are shown in Figs. A1–A3.
A.3 Robustness to sparsity and noise
Results for different sample sizes are presented in Tables A2–A5, and performance metrics under different noise levels are presented in Tables A6–A9. Here, four metrics, Trust, LGD, K_min and K_max, are selected to compare the ML-Enc with the others.
Table A2: Trust vs. sample size

        700     800     1000    1500    2000
ML-Enc  0.9969  0.9984  0.9977  0.9999  0.9999
MLLE    0.9760  0.9844  0.9891  0.9947  0.9960
HLLE    0.9816  0.9851  0.9901  0.9947  0.9960
LTSA    0.9816  0.9859  0.9900  0.9947  0.9960
ISOMAP  0.9913  0.9950  0.9977  0.9995  0.9996

Table A3: LGD vs. sample size

        700     800     1000    1500    2000
ML-Enc  0.0167  0.0040  0.0048  0.0016  0.0016
MLLE    0.0480  0.0453  0.0407  0.0335  0.0293
HLLE    0.0483  0.0454  0.0408  0.0335  0.0293
LTSA    0.0483  0.0454  0.0408  0.0335  0.0293
ISOMAP  0.0261  0.0238  0.0208  0.0161  0.0139

Table A4: K_min vs. sample size

        700     800     1000    1500    2000
ML-Enc  1.0059  1.0049  1.0167  1.0070  1.0063
MLLE    6.4093  7.3723  8.5129  10.150  50.581
HLLE    6.7227  7.4430  8.4837  10.149  48.563
LTSA    6.7226  7.4428  8.4848  10.149  48.526
ISOMAP  1.1069  1.1075  1.1026  1.0968  8.3342

Table A5: K_max vs. sample size

        700     800     1000    1500    2000
ML-Enc  2.9917  1.7239  18.306  1.9931  2.1216
MLLE    360.03  240.62  442.48  43.901  50.581
HLLE    326.59  216.43  301.56  42.356  48.563
LTSA    314.81  215.84  296.37  42.358  48.526
ISOMAP  31.299  34.354  27.030  13.247  8.3342
Table A6: Trust vs. noise level

        0.05   0.1    0.15   0.2    0.25   0.30
ML-Enc  0.998  0.998  0.998  0.998  0.997  0.997
MLLE    0.986  0.986  0.984  0.984  0.983  0.981
HLLE    0.979  0.979  0.980  0.982  0.981  0.984
LTSA    0.979  0.979  0.980  0.982  0.981  0.984
ISOMAP  0.995  0.995  0.995  0.995  0.994  0.993

Table A7: LGD vs. noise level

        0.05   0.1    0.15   0.2    0.25   0.30
ML-Enc  0.004  0.004  0.004  0.005  0.006  0.006
MLLE    0.045  0.045  0.045  0.046  0.046  0.046
HLLE    0.045  0.045  0.046  0.046  0.047  0.047
LTSA    0.046  0.046  0.046  0.046  0.046  0.047
ISOMAP  0.023  0.023  0.023  0.023  0.023  0.024

Table A8: K_min vs. noise level

        0.05   0.1    0.15   0.2    0.25   0.30
ML-Enc  1.005  1.005  1.006  1.006  1.009  1.011
MLLE    7.523  7.735  7.693  7.779  7.799  7.852
HLLE    6.977  7.018  6.987  7.233  8.016  8.328
LTSA    7.691  7.874  7.406  7.268  7.479  8.328
ISOMAP  1.114  1.114  1.125  1.117  1.104  1.097

Table A9: K_max vs. noise level

        0.05   0.1    0.15   0.2    0.25   0.30
ML-Enc  4.150  5.450  5.314  6.098  7.660  14.52
MLLE    304.8  181.1  256.5  416.2  272.5  299.8
HLLE    508.9  898.3  5997   8221   1162   714.1
LTSA    508.9  898.3  5997   8221   1162   714.1
ISOMAP  25.49  16.39  17.71  25.38  35.14  41.42
A.4 Generalization to unseen data
Fig. A4 demonstrates that a learned ML-Enc can generalize well to unseen data, unfolding a modified version of the same manifold to the corresponding version of the embedding. The ML-Enc model is trained with a Swiss Roll (800 points) dataset. The test is done as follows. First, a set of 8000 points of the Swiss Roll manifold is generated; then four test sets are created by removing from those 8000 points the shape of a diamond, square, pentagram, or five-ring, respectively. Each point of a test set is transformed independently by the trained ML-Enc to obtain an embedding. Each of the resulting embeddings shows that the unseen manifold data are well unfolded and the removed shapes are well preserved, illustrating that the learned ML-Enc has a good ability to generalize to unseen data. Since the LLE-based algorithms, LTSA, and ISOMAP do not possess such a generalization ability, the ML-Enc is compared with the encoder parts of the AE-based algorithms; unfortunately, AE and VAE failed altogether on the Swiss Roll data sets.
A.5 ML-AE for manifold data generation
Fig. LABEL:fig:manifolddatageneration shows manifold data reconstruction and generation using the ML-AE, cases for which AE and VAE both failed to learn the unfolding. In the learning phase, the ML-AE performs manifold learning for NLDR and then reconstruction: it takes the training data (a) in the ambient space as input, outputs the embedding (b) in the learned latent space, and then reconstructs the data (c) back in the ambient space. In the generation phase, the ML-Dec takes random samples (d) in the latent space as input and maps them to the manifold (e) in the ambient data space.
A.6 More ablation studies on the ML-AE loss terms
Three more sets of ablation experiments are provided, further demonstrating that the use of the LIS constraint in MLDL accomplishes its promises. The first set evaluates different cross-layer weight schemes for the ML-Enc, based on the 5-layer network architecture (3-100-100-100-3-2) presented in Section 4.1. The following 6 cross-layer weight schemes (defined by which cross-layer weights are nonzero) are evaluated:

M1: a nonzero weight between the input and latent layers only;

M2: nonzero weights between each pair of adjacent layers, the weight increasing as the layers go deeper;

M3: nonzero weights between the latent layer and each of the other layers, the weight increasing as the other layer goes deeper;

M4: nonzero weights between the input layer and each of the other layers, the weight increasing as the other layer goes deeper;

M5: nonzero weights between the input layer and each of the other layers, the weights being equal for all layers;

M6: nonzero weights between the input layer and each of the other layers, the weight decreasing as the other layer goes deeper.
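To make the weighting concrete, a hedged NumPy sketch of how such cross-layer weights might combine per-pair isometry penalties; the neighbor system (taken from layer 0), the squared-difference form, and the function names are assumptions of this sketch, not the paper's exact LIS loss:

```python
import numpy as np

def pairwise(A):
    # Euclidean distance matrix for one layer's activations
    return np.linalg.norm(A[:, None] - A[None, :], axis=-1)

def lis_penalty(acts, weights, k=5):
    """Weighted cross-layer isometry penalty: for each weighted layer
    pair (l, l'), sum squared differences of neighbor distances.
    `acts` is a list of per-layer activation arrays; `weights` maps a
    layer pair (l, l') to its cross-layer weight."""
    D = [pairwise(a) for a in acts]
    nbrs = np.argsort(D[0], axis=1)[:, 1:k + 1]   # k-NN taken at layer 0
    total = 0.0
    for (l, lp), w in weights.items():
        for i in range(len(acts[0])):
            diff = D[l][i, nbrs[i]] - D[lp][i, nbrs[i]]
            total += w * np.sum(diff ** 2)
    return total
```

Under this reading, scheme M1 corresponds to a single nonzero entry in `weights` for the (input, latent) pair, while M2–M6 distribute nonzero entries over the pairs described above.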
Secondly, it is interesting to evaluate the ML-Enc in its ability to preserve the local geometric structure not only at the latent layer but also at the intermediate layers. To see this, we provide the metrics calculated between layer 0 and layer l for l = 1, 3, 5. The ablation results are shown in Table A10. Schemes M1 and M4 have a 100% success rate over the 10 runs, M5 and M6 have some successes, whereas M2 and M3 have none. Of M1 and M4, the latter seems better overall.
Layers  Method  #Succ  LKL     RRE       Cont    LGD      K_min  K_max    MPE
0-1     M1      10     0.0026  0.000792  0.9997  0.03081  2.29   3.29     0.0718
        M2      0      0.0062  0.001156  0.9993  0.02981  2.06   3.45     0.0068
        M3      0      0.0038  0.001781  0.9990  0.02761  1.86   3.39     0.0093
        M4      10     0.0011  0.000555  0.9998  0.01494  1.31   1.72     0.0135
        M5      5      0.0006  0.000220  0.9998  0.00199  1.00   1.18     0.0576
        M6      2      0.0003  0.000084  0.9999  0.00081  1.00   1.08     0.0566
0-3     M1      10     0.0235  0.001534  0.9980  0.03541  2.28   5.98     0.0718
        M2      0      0.0667  0.047127  0.9620  0.03642  1.98   43.83    0.0068
        M3      0      0.0292  0.053970  0.9802  0.02235  1.09   49.98    0.0093
        M4      10     0.0135  0.000516  0.9986  0.00443  1.00   1.57     0.0135
        M5      5      0.0024  0.000182  0.9998  0.00110  1.00   1.14     0.0576
        M6      2      0.0031  0.000216  0.9998  0.00100  1.00   1.14     0.0566
0-5     M1      10     0.0184  0.000414  0.9998  0.00385  1.00   2.14     0.0718
        M2      0      0.0655  0.102837  0.9499  0.03631  1.71   1152.42  0.0068
        M3      0      0.0488  0.105081  0.9661  0.02166  1.09   994.12   0.0093
        M4      10     0.0184  0.000440  0.9984  0.00400  1.00   1.73     0.0135
        M5      5      0.0327  0.022078  0.9969  0.00630  1.01   3.68     0.0576
        M6      2      0.0454  0.030612  0.9942  0.01321  1.05   8.13     0.0566
Thirdly, we provide ablation experiments with different corresponding-layer weight schemes for the ML-AE, based on the ML-AE architecture (3-100-100-100-3-2-3-100-100-100-100-3) presented in Section 4.3, where the LIS constraint is imposed between layers 0 and 5. A corresponding-layer scheme is determined by the nonzero weights between corresponding layers of the ML-Enc and ML-Dec. The following 4 weight schemes are evaluated:

M7: no LIS constraints between corresponding layers (the baseline);

M8: the corresponding-layer weights for the LIS constraint increase as the layer number becomes bigger;

M9: all corresponding-layer weights for the LIS constraint are equal;

M10: the corresponding-layer weights for the LIS constraint decrease as the layer number becomes bigger.
The performance metrics are shown in Table A11. The results demonstrate that M10 outperforms the other three schemes on all metrics except K_max, on which M8 is the best. Compared with the increasing and equal weight schemes (M8 and M9), decreasing the corresponding-layer weight for the LIS constraint as the layer number becomes bigger yields the greater performance gain, since the closer the data is to the input layer of the encoder, the more authentic and reliable it is.
     #Succ  LKL      RRE      Cont    LGD      K_min  K_max  MPE      MRE
M7   10     0.00165  0.00070  0.9998  0.00514  1.017  2.540  0.04309  0.01846
M8   10     0.00142  0.00065  0.9998  0.00480  1.017  2.324  0.04288  0.01810
M9   10     0.00168  0.00070  0.9998  0.00520  1.018  2.543  0.04279  0.01788
M10  10     0.00121  0.00063  0.9998  0.00459  1.017  2.377  0.04269  0.01762
References
(2018) Sorting out Lipschitz function approximation. arXiv preprint arXiv:1811.05381.
(2018) Markov chain neural networks. CoRR abs/1805.00784.
(2017) Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498.
(1974) Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B 36, pp. 192–236.
(2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), pp. 18–42.
(2009) Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association 104(485), pp. 209–219.
(2019) Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918.
(2003) Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences 100(10), pp. 5591–5596.
(2016) Testing the manifold hypothesis. Journal of the American Mathematical Society 29(4), pp. 983–1049.
(2008) Iterative non-linear dimensionality reduction with manifold sculpting. In Advances in Neural Information Processing Systems 20, pp. 513–520.
(1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6), pp. 721–741.
(2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
(2006) Reducing the dimensionality of data with neural networks. Science 313(5786), pp. 504–507.
(2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
(2020) Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations.
(1995) Markov Random Field Modeling in Computer Vision. Springer.
(2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), pp. 2579–2605.
(2016) Nearly isometric embedding by relaxation. In Advances in Neural Information Processing Systems 29, pp. 2631–2639.
(2002) Laplacian eigenmaps for dimensionality reduction and data representation. Technical Report, University of Chicago.
(2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.
(2020) Topological autoencoders. In Proceedings of the 37th International Conference on Machine Learning (ICML).
(2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
(2019) Loss-sensitive generative adversarial networks on Lipschitz densities. International Journal of Computer Vision, pp. 1–23.
(2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), pp. 2323–2326.
(2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), pp. 2319–2323.
(2018) Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems 31, pp. 3835–3844.
(2016) Topological data analysis. arXiv e-prints.
(2018) Evaluating the robustness of neural networks: an extreme value theory approach. arXiv preprint arXiv:1801.10578.
(2020) Graph structure of neural networks. arXiv e-prints.
(2018) Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems, pp. 4939–4948.
(2007) MLLE: modified locally linear embedding using multiple weights. In Advances in Neural Information Processing Systems, pp. 1593–1600.
(2019) Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687.