Markov-Lipschitz Deep Learning
We propose a novel framework, called Markov-Lipschitz deep learning (MLDL), to tackle the geometric deterioration caused by collapse, twisting, or crossing in vector-based neural network transformations for manifold-based representation learning and manifold data generation. A prior constraint, called locally isometric smoothness (LIS), is imposed across layers and encoded into a Markov random field (MRF)-Gibbs distribution. This leads to the best possible solutions for local geometry preservation and robustness, as measured by locally geometric distortion and locally bi-Lipschitz continuity. Consequently, the layer-wise vector transformations are enhanced into well-behaved, LIS-constrained metric homeomorphisms. Extensive experiments, comparisons, and an ablation study demonstrate significant advantages of MLDL for manifold learning and manifold data generation. MLDL is general enough to enhance any vector transformation-based network. The code is available at https://github.com/westlake-cairi/Markov-Lipschitz-Deep-Learning.
Manifold learning aims to perform nonlinear dimensionality reduction (NLDR), mapping from the input data space to a latent space so that the Euclidean metric can be used to facilitate pattern analysis; manifold data generation performs the inverse mapping from the learned latent space. A large body of literature exists on manifold learning for NLDR, including classics such as ISOMAP and LLE (locally linear embedding), more recent developments [8, 13, 31, 6, 10, 18], and popular visualization methods such as t-SNE. The problem can be considered from the viewpoints of geometry and topology [27, 21]. The manifold assumption [19, 9] is the basis for NLDR, and preserving local geometric structure is the key to its success.
| Functional feature | | | | | |
|---|---|---|---|---|---|
| Manifold learning without decoder | Yes | No | Yes | Yes | Yes |
| Learned NLDR applicable to test data | Yes | Yes | No | No | No |
| Able to generate data of learned manifolds | Yes | No | No | No | No |
In this paper, we develop a novel deep learning framework, called Markov-Lipschitz deep learning (MLDL), for manifold learning-based NLDR, representation learning and data generation tasks. The motivations are the following: (1) We believe that the existing layer-wise vector transformations of neural networks can be enhanced into metric homeomorphisms by imposing on them a constraint, which we call locally isometric smoothness (LIS). The resulting LIS-constrained homeomorphisms are continuous, bijective mappings and, owing to the geometry-preserving property of LIS, avoid collapse, twisting, or crossing, thus improving generalization, stability, and robustness. (2) Existing manifold learning methods are based on local geometric structures of data samples and thus may be modeled by conditional probabilities of Markov random fields (MRFs) locally and an MRF-Gibbs distribution globally.
The paper combines the above two features into the MLDL framework by implementing the LIS constraint through locally bi-Lipschitz continuity and encoding it into the energy (loss) function of an MRF-Gibbs distribution. The functional features of MLDL are summarized in Table 1 in comparison with other popular methods. The proposed MLDL has advantages over other AE-based methods in terms of manifold learning without a decoder and generating new data from the learned manifold. These merits are ascribed to its distinctive property of bijection and invertibility endowed by the LIS constraint. The main contributions, to the best of our knowledge, are summarized below:
Proposing the MLDL framework that (i) imposes the prior LIS constraint across network layers, (ii) constrains a neural network as a cascade of homeomorphic transformations, and (iii) encodes the constraint into an MRF-Gibbs prior distribution. This results in MLDL-based neural networks optimized in terms of not only local geometry preservation measured by geometric distortion but also homeomorphic regularity measured by locally bi-Lipschitz constant.
Proposing two instances of MLDL-based neural networks: Markov-Lipschitz Encoder (ML-Enc) for manifold learning and ML AutoEncoder (ML-AE). The decoder part (ML-Dec) of the ML-AE helps regularize manifold learning of ML-Enc and also acts as a manifold data generator.
Proposing an auxiliary term for MLDL training. It assists graduated optimization and prevents MLDL from falling into bad local optima (failure in unfolding the manifold into a plane in latent space).
Providing extensive experiments, with self- and comparative evaluations and ablation study, which demonstrate significant advantages of MLDL over existing methods in terms of locally geometric distortion and locally bi-Lipschitz continuity.
Related work. Besides the above-mentioned literature, related work also includes the following: Markov random field (MRF) concepts [11, 16] are used for establishing a connection between neighborhood-based loss (energy) functions and the global MRF-Gibbs distribution. Lipschitz continuity has been used to improve neural networks in generalization [3, 1], robustness against adversarial attacks [20, 28, 7], and generative modeling [32, 23]. Algorithms for estimation of Lipschitz constants are developed in [26, 12, 20, 30, 15].
Organization. In the following, Section 2 introduces the MLDL network structure and related preliminaries, Section 3 presents the key ideas of MLDL and the two MLDL neural networks, and Section 4 presents extensive experiments. More details of the experiments can be found in the Appendix.
2 MLDL Network Structure
The structure of MLDL networks is illustrated in Fig. 1. The ML-Encoder, composed of a cascade of locally homeomorphic transformations, is aimed at manifold learning and NLDR, and the corresponding ML-Decoder is for manifold reconstruction and data generation. The LIS constraint is imposed across layers and encoded into the energy (loss) function of an MRF-Gibbs distribution. These lead to good properties in local geometry preservation and locally bi-Lipschitz continuity. This section introduces preliminaries and describes the MLDL concepts of local homeomorphisms and the LIS constraint in connection to MLDL.
2.1 MRFs on neural networks
Markov random fields (MRFs) can be used for manifold learning-based NLDR and manifold data generation at the data sample level because the modeling therein is done with respect to some neighborhood system. On the other hand, at the neural network level, the layer-wise interaction and performance of a neural network can be modeled by Markov or relational graphs (e.g. [2, 29]). This section introduces the basic notion of MRFs for modeling relationships between data at each layer and interactions between layers.
Data samples on manifold. Let X = {x_1, ..., x_M} be a set of samples on a manifold M_X. We aim to learn a latent representation Z of X from M_X based on the local structure of X. For this we need to define a metric d and a neighborhood system N = {N_i}. In this paper, we use the Euclidean metric for d and define N_i as the k-NN of x_i, and when we write a pair (i, j), it is restricted by j ∈ N_i unless specified otherwise. Also, note that the neighborhood system N is used for training an MLDL network only and is not needed for testing. When M_X is Riemannian, its tangent subspace at any x ∈ M_X is locally isomorphic to a Euclidean space, and this property is the basis for manifold learning.
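To make the neighborhood construction concrete, the following minimal sketch (our own illustration, not the released code; the helper name `knn_neighborhoods` is hypothetical) builds the Euclidean distance matrix and the k-NN neighborhood system from a sample matrix:

```python
import numpy as np

def knn_neighborhoods(X, k):
    """Pairwise Euclidean distances and the k-NN neighborhood system.

    X: (M, n) array of samples.  Returns (D, N) where D[i, j] is the
    distance between samples i and j and N[i] is the index set of the
    k nearest neighbors of sample i (excluding i itself).
    """
    diff = X[:, None, :] - X[None, :, :]      # (M, M, n) pairwise differences
    D = np.sqrt((diff ** 2).sum(-1))          # (M, M) distance matrix
    order = np.argsort(D, axis=1)             # each row sorted by distance
    N = [set(row[1:k + 1]) for row in order]  # drop self (distance 0)
    return D, N
```

The sets N[i] are what the pair restriction (i, j) above refers to during training.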
MRF modeling of manifold data samples. Manifold learning algorithms usually work with a neighborhood system; namely, the direct influence of every sample point x_i on the other points is limited to its neighbors x_j for j ∈ N_i only (condition 1). As long as the neighboring relationship is defined, the conditional distribution p(x_i | x_j, j ∈ N_i) is positive (condition 2). The collection of random variables X is an MRF when these two conditions are satisfied, which is generally the case for manifold learning algorithms.
Furthermore, X is an MRF with respect to N if and only if its joint distribution is a Gibbs distribution with respect to N. A Gibbs distribution (GD) of X is characterized by two things: (1) it belongs to the exponential family; (2) its energy function is defined on cliques. A clique consists of either a single node {i} or a pair of neighboring nodes {i, j}; denote the respective sets by C_1 and C_2, and the collection of all cliques by C = C_1 ∪ C_2 if both are taken into consideration. A GD of X with respect to N is of the following form:

P(X) = Z^{-1} exp(-U(X) / T),    (1)

where U(X) = Σ_{c ∈ C} V_c(X) is the energy composed of clique potentials V_c(X), Z is the normalizing constant (partition function), and T is a global parameter called the temperature. The MRF-Gibbs equivalence theorem enables the formulation of loss functions in a principled way based on probability and statistics, rather than heuristically, thus laying a rational foundation for optimization-based manifold learning. The LIS loss and the other loss functions formulated in this paper are all based on clique potential-based energy functions as in Equ. (1).
MRF modeling of neural network layers. We extend the concepts of the neighborhood system and clique potentials to impose between-layer constraints. Consider an L-layered manifold transforming neural network, such as the ML-Encoder in Fig. 1, with (X^(l), d^(l)) as the metric space for layer l. Let X = {X^(0), X^(1), ..., X^(L)}, in which X^(0) is the input data and fixed, and all the other X^(l) (for l ≥ 1) are part of the transformed data or solution. Between a pair of layers l and l' can exist an (undirected) link which we call a super-clique. Let S be the set of all super-cliques and S' ⊆ S the subset of pairwise super-cliques under consideration. Underlying S is a neighborhood system which is independent of the input data. X is an MRF with respect to this neighborhood system, and its global joint probability can be modeled by a clique potential-based Gibbs distribution.
2.2 Local homeomorphisms
Transformation between graphs. Manifold learning from X is associated with a graph G = (X, E) whose edge set E can be defined from the distance matrix D_X. The objective of manifold learning or NLDR is to find a local homeomorphism

Φ : X → Z,

transforming X from the input space to the latent space Z with dim(Z) < dim(X). In this work, Φ is realized by the ML-Enc, which is required to best satisfy the LIS constraint. The reverse process, manifold data generation realized by the ML-Dec, is also a local homeomorphism

Φ^{-1} : Z → X.
Cascade of local homeomorphisms. While Φ (and its inverse) can be highly nonlinear and complex, we decompose it into a cascade of less nonlinear, locally isometric homeomorphisms Φ = Φ^(L) ∘ ... ∘ Φ^(1). This can be done using an L-layer neural network (e.g. the ML-Enc) of the following form:

X^(l) = Φ^(l)(X^(l-1)),  l = 1, ..., L,

in which X^(0) is the input data, Φ^(l) the nonlinear transformation at layer l, X^(l) the output of Φ^(l), N^(l) the corresponding neighborhood system, and D^(l) the distance matrix given metric d^(l). The layer-wise transformation can be written as

X^(l) = σ(W^(l) X^(l-1)),

in which W^(l) is the weight matrix of the neural network to be learned, the distance matrix D^(l) is updated after iterations, and the matrix product is followed by a nonlinear activation σ.
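The cascade can be sketched in a few lines of numpy (an illustrative toy, not the paper's PyTorch implementation): each layer multiplies by a weight matrix and applies a LeakyReLU-style activation, and all intermediate outputs are kept so that cross-layer distance matrices can be compared.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """LeakyReLU activation applied elementwise."""
    return np.where(z > 0, z, alpha * z)

def forward_cascade(X, weights):
    """Run X (M, n0) through the cascade X^(l) = sigma(X^(l-1) @ W^(l)),
    returning the outputs of every layer (including the input X^(0))."""
    outputs = [X]
    for W in weights:
        outputs.append(leaky_relu(outputs[-1] @ W))
    return outputs

def distance_matrix(X):
    """Euclidean distance matrix of the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))
```

Keeping every layer's output is what makes cross-layer constraints (between any pair of layers) straightforward to evaluate.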
Such a decomposition is made possible by the property that the tangent space of a Riemannian manifold is locally isomorphic to a simple Euclidean metric space. The layer-wise neural network unfolds X^(0) in the input space, stage by stage, into X^(L) in the latent space so as to best preserve the local geometric structure of X^(0). In other words, given X^(l-1), Φ^(l) transforms X^(l-1) onto X^(l), finally resulting in the embedding Z = X^(L).
Effective homeomorphisms. Although the actual neural transformations are from one layer to the next, between any two layers l and l' (l < l') is an effective compositional homeomorphism Φ^(l,l') = Φ^(l') ∘ ... ∘ Φ^(l+1). The LIS constraint can be imposed between any pair of layers l and l' to constrain Φ^(l,l'), and eventually the overall Φ.
Evolution in graph structure. The graphs, as represented by D^(l), evolve from layer to layer as learning continues, except for the input layer. The neighborhood structures are allowed to change from N^(l) to N^(l') to admit geometric deformations caused by the nonlinearity of Φ^(l,l'). To account for such changes across layers, we define the set union N^(l,l') = N^(l) ∪ N^(l') as the set of pairwise cliques to be considered in formulating losses for Φ^(l,l').
3 Markov-Lipschitz Neural Networks
3.1 Local isometry and Lipschitz continuity
The core of MLDL is the imposition of the prior LIS constraint across layers (orange-colored arcs and dashed lines in Fig. 1). MLDL requires that the homeomorphism Φ^(l,l') satisfy the LIS constraint across layers; that is, the distances (or some other metrics) be preserved locally, as far as possible, for all pairs (i, j) ∈ N^(l,l'), so as to optimize homeomorphic regularity between metric spaces at different network layers. Such a Φ^(l,l') can be learned by minimizing the following energy functional:

E(Φ^(l,l')) = Σ_{(i,j) ∈ N^(l,l')} | d^(l)(x_i^(l), x_j^(l)) − d^(l')(x_i^(l'), x_j^(l')) |,    (2)

which measures the overall geometric distortion and homeomorphic (ir)regularity of the neural transformation and reaches the lower bound of 0 when the local isometry constraint is strictly satisfied. Φ is said to be locally Lipschitz if for every neighboring pair (i, j) there exists a constant k ≥ 0 with

d^(l')(Φ(x_i), Φ(x_j)) ≤ k d^(l)(x_i, x_j).
This requires that Φ not “collapse”, i.e. Φ(x_i) ≠ Φ(x_j) for x_i ≠ x_j. A Lipschitz mapping or homeomorphism with a smaller constant k tends to generalize better [3, 1], be more robust to adversarial attacks [28, 7], and be more stable in generative model learning [32, 23]. A mapping Φ is locally bi-Lipschitz if for every neighboring pair (i, j) there exists a constant K ≥ 1 with

(1/K) d^(l)(x_i, x_j) ≤ d^(l')(Φ(x_i), Φ(x_j)) ≤ K d^(l)(x_i, x_j).    (3)
The best possible bi-Lipschitz constant is K = 1, and the closer K is to 1 the better. A good bi-Lipschitz homeomorphism not only preserves the local geometric structure well but also improves the stability and robustness of the resulting MLDL neural networks, as will be validated by our extensive experiments.
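As an illustration of how a local bi-Lipschitz constant in the spirit of Equ. (3) can be estimated from data (our own sketch; the paper's exact estimator may differ), one takes, for each neighboring pair, the larger of the two distance ratios, then the maximum over each neighborhood, and finally the extremes over all neighborhoods:

```python
import numpy as np

def local_bilipschitz(DX, DZ, neighbors, eps=1e-12):
    """Minimum and maximum local bi-Lipschitz constants.

    DX, DZ: (M, M) distance matrices in two metric spaces; neighbors[i]
    is an iterable of neighbor indices of sample i.  K = 1 everywhere
    corresponds to a local isometry.
    """
    K = []
    for i, Ni in enumerate(neighbors):
        ratios = []
        for j in Ni:
            r = DZ[i, j] / max(DX[i, j], eps)
            ratios.append(max(r, 1.0 / max(r, eps)))  # larger of the two ratios
        K.append(max(ratios))
    return min(K), max(K)  # K-min and K-max over all neighborhoods
```

A blow-up of the returned maximum signals collapse or tearing somewhere in the mapping.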
3.2 Markov-Lipschitz encoder (ML-Enc)
ML-Enc for Manifold Learning. The ML-Enc, unlike the other AEs, can learn an NLDR transformation from the input to the latent code by using the LIS constraint only, without the need for a reconstruction loss. Its loss consists of a LIS loss plus a transient, auxiliary push-away loss:

L_Enc = L_LIS + ν L_push,

in which the weight ν starts from a positive value at the beginning of manifold learning, so that the auxiliary term L_push is effective, and gradually decreases to 0, so that only the target L_LIS takes effect finally.
LIS loss. The prior LIS constraint enforces the classic isometry of Equ. (2) and (3) across layers, as a key prior on the manifold learning and NLDR task. It is encoded into cross-layer super-clique potentials, which are then summed over all super-cliques into the energy function of an MRF-Gibbs distribution. The clique potentials due to the LIS constraint are defined as

V^(l,l')(i, j) = | d^(l)(x_i^(l), x_j^(l)) − d^(l')(x_i^(l'), x_j^(l')) |,

where (l, l') ∈ S' are between-layer super-cliques and (i, j) ∈ N^(l,l') are between-sample cliques. Note that X^(0) is the given input data and is fixed, whereas the X^(l) for l ≥ 1 are part of the solution and can be rewritten as functions of the network weights W. The LIS energy thus actually imposes a constraint on the solution W. It corresponds to the energy function in the prior MRF-Gibbs distribution, regardless of the input data X^(0).
Summing up the potentials gives rise to the energy function

L_LIS = Σ_{(l,l') ∈ S'} α^(l,l') Σ_{(i,j) ∈ N^(l,l')} V^(l,l')(i, j),    (4)

where N^(l,l') is the union of the two pairwise clique sets as defined before and the α^(l,l') are weights. Here the LIS loss L_LIS, corresponding to Equ. (2), is expressed as a function of W because, given the network architecture, X^(l) = X^(l)(W). The weights α^(l,l') determine how the LIS constraint is imposed across layers. Expecting that the Euclidean metric makes more and more sense as the layer goes closer to the deepest latent layer L, we evaluate three schemes, with decreasing, increasing, and constant weights as the link goes deeper, for the scheme of linking the input layer with each subsequent ML-Enc layer.
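A minimal sketch of the LIS loss for a single layer pair, assuming absolute differences of neighbor distances as the clique potential (the released implementation may use a different potential and per-pair weighting):

```python
def lis_loss(DX, DZ, cliques):
    """LIS (local isometry) loss over one layer pair.

    DX, DZ: distance matrices (indexable as D[i][j]) in the two layers;
    cliques: iterable of (i, j) neighbor pairs.  The loss is zero exactly
    when the mapping is a local isometry on those pairs.
    """
    return sum(abs(DX[i][j] - DZ[i][j]) for i, j in cliques)
```

Summing such terms over several layer pairs, each with its own weight, gives a cross-layer loss of the form of Equ. (4).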
Push-away loss. The push-away loss is defined as

L_push = − Σ_{(i,j): j ∉ N_i^(l)} 1( d^(l')(x_i^(l'), x_j^(l')) < B ) d^(l')(x_i^(l'), x_j^(l')),

in which 1(·) is the indicator function and B is a bound. This term is aimed to help “unfold” nonlinear manifolds by exerting a spring force to push away from each other those pairs which are non-neighbors at layer l but nearby (distance smaller than B) at layer l'.
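The push-away term can be sketched as a hinge-style penalty on non-neighbor pairs that fall within the bound B in the deeper layer; this is an assumed form for illustration, not necessarily the exact expression used in the paper:

```python
def push_away_loss(DZ, non_neighbor_pairs, bound):
    """Penalize non-neighbor pairs whose latent distance is below `bound`.

    DZ: distance matrix (indexable as DZ[i][j]) at the deeper layer;
    non_neighbor_pairs: (i, j) pairs that are NOT neighbors at the
    shallower layer.  Minimizing this pushes such pairs apart, which
    helps unfold the manifold early in training.
    """
    return sum(bound - DZ[i][j]
               for i, j in non_neighbor_pairs if DZ[i][j] < bound)
```

Pairs already farther apart than the bound contribute nothing, so the force vanishes once the manifold is unfolded.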
3.3 Markov-Lipschitz AutoEncoder (ML-AE)
The ML-AE has two purposes: (1) helping further regularize ML-Enc based manifold learning with the ML-Dec and reconstruction losses, and (2) enabling manifold data generation of the learned manifold. The ML-AE is constructed by appending the ML decoder (ML-Dec) to the ML-Enc, implementing the inverse of the ML-Enc, with reconstruction losses imposed. The ML-AE is symmetric to the ML-Enc in its structure (see Fig. 1). Nonetheless, an asymmetric decoder is also acceptable.
The LIS loss for the ML-Dec can be defined in a similar way to Equ. (4). The LIS constraint may also be imposed between the corresponding layers of the ML-Enc and ML-Dec. The total reconstruction loss is the sum of the individual ones between the corresponding layers (shown as dashed lines in Fig. 1):

L_rec = Σ_l γ^(l) || X^(l) − X̂^(l) ||²,

where X̂^(l) denotes the decoder layer corresponding to encoder layer l and the γ^(l) are weights. The total ML-AE loss is then

L_AE = L_Enc + L_rec.
Once trained, the ML decoder can be used to generate new data of the learned manifold.
4 Experiments
The purpose of the experiments is to evaluate the ML-Enc and ML-AE in their ability to preserve the local geometry of manifolds and to achieve good stability and robustness in terms of relevant evaluation metrics. While this section presents the numerical evaluation of the major comparisons, visualization and more numerical results can be found in the Appendix.
Four datasets are used: (i) Swiss Roll (3-D) and (ii) S-Curve (3-D), generated by the sklearn library, (iii) MNIST (784-D), and (iv) Spheres (101-D). Seven methods for manifold learning are compared: ML-Enc (ours), HLLE, MLLE, LTSA, ISOMAP, LLE and t-SNE. Four autoencoder methods are compared for manifold learning, reconstruction and manifold data generation: ML-AE (ours), AE, VAE, and TopoAE.
The Euclidean distance metric is used for all layers, with distances normalized by the dimensionality of the respective layer. The k-NN scheme (or ε-ball) is used to define neighborhood systems. The learning rate is set to 0.001, and the batch size is set to the number of samples. LeakyReLU is used as the activation function. Continuation: the push-away weight ν starts from an initial value at epoch 500 and decreases linearly to the minimum 0 at epoch 1000.
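The continuation schedule for ν can be written as a simple function of the epoch (the epoch boundaries 500 and 1000 are as stated above; the initial value here is a placeholder):

```python
def continuation_weight(epoch, start_epoch=500, end_epoch=1000, initial=1.0):
    """Push-away weight schedule: constant at `initial` until start_epoch,
    linear decay to 0 at end_epoch, and 0 afterwards."""
    if epoch <= start_epoch:
        return initial
    if epoch >= end_epoch:
        return 0.0
    return initial * (end_epoch - epoch) / (end_epoch - start_epoch)
```

Multiplying the push-away loss by this weight lets the auxiliary force act early and vanish by the end of training.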
Hyperparameters. For Swiss Roll and S-Curve, the network structure is [3, 100, 100, 100, 3, 2] (working with a range of 0.2–0.3). For MNIST, the network structure is [784, 1000, 500, 250, 100, 2]. For Spheres5500+5500, the network structure is [101, 50, 25, 2]. For Spheres10000, the network structure is [101, 90, 80, 70, 2]; the other hyperparameters are the same as for Spheres5500+5500. The implementation is based on the PyTorch library running on Ubuntu 18.04 on an NVIDIA V100 GPU. The code is available at https://github.com/westlake-cairi/Markov-Lipschitz-Deep-Learning
Evaluation metrics include the following. (1) The number of successes (#Succ) is the number of successes (in unfolding the manifold) out of 10 solutions from random initialization with random seeds. (2) Local KL divergence (L-KL) measures the difference between distributions of local distances in the input and latent spaces. (3) Averaged relative rank change (ARRC), (4) trustworthiness (Trust) and (5) continuity (Cont) measure how well neighboring relationships are preserved between two spaces (layers). (6) Locally geometric distortion (LGD) measures how much corresponding distances between neighboring points differ in two metric spaces. (7) Mean projection error (MPE) measures the “coplanarity” of 2D embeddings in a high-D space (in the following, the 3D layer before the 2D latent layer). (8) The minimum (K-min) and maximum (K-max) of the local bi-Lipschitz constant of Equ. (3) are computed over all neighborhoods. (9) Mean reconstruction error (MRE) measures the difference between the input and output of autoencoders. Of the above, the MPE (or “coplanarity”) and K-min and K-max are used for the first time as evaluation metrics for manifold learning. Their exact definitions are given in Appendix A.1. Every set of experiments is run 10 times with the 10 random seeds, and the results are averaged into the final performance metric. When a run is unsuccessful, the numerical averages are not very meaningful, so the numbers are shown in gray in the corresponding tables.
4.1 ML-Enc for manifold learning and NLDR
NLDR quality. Table 2 compares the ML-Enc with 7 other methods (TopoAE also included) on 9 evaluation metrics, using the Swiss Roll (800 points) manifold data (the higher #Succ, Trust and Cont, the better; the lower the other metrics, the better). Results with the S-Curve are given in Appendix A.2. While the MPE is calculated at the 3D layer, the other metrics are calculated between the input and the latent layer. The results demonstrate that the ML-Enc outperforms all the other methods on all the evaluation metrics, most significantly in terms of the isometry-related (LGD, ARRC, Trust and Cont) and Lipschitz-related (K-min and K-max) metrics.
Robustness to sparsity and noise. Table 3 evaluates the success rates of 5 manifold learning methods (t-SNE and LLE not included because they had zero success) in their ability to unfold the manifold and robustness to varying numbers of samples (700, 800, 1000, 1500, 2000) and standard deviation of noise . The corresponding evaluation metrics are provided in Appendix A.3. These two sets of experiments demonstrate that the ML-Enc achieves the highest success rate, the best performance metrics, and the best robustness to data sparsity and noise.
Generalization to unseen data. The ML-Enc trained with 800 points can generalize well to unfold unseen samples of the learned manifold. The test is done as follows: first, a set of 8000 points of the Swiss Roll manifold is generated; the data set is then modified by removing from it the shape of a diamond, square, pentagram, or five-ring, respectively, creating 4 test sets. Each point of a test set is transformed independently by the trained ML-Enc to obtain an embedding (shown in Appendix A.4). We can see that the unseen manifold data sets are well unfolded by the ML-Enc and the removed shapes are kept very well, illustrating that the learned ML-Enc generalizes well to unseen data. Since the LLE-based, LTSA, and ISOMAP algorithms do not possess such a generalization ability, the ML-Enc is compared with the encoder parts of the AE-based algorithms. Unfortunately, AE and VAE failed altogether on the Swiss Roll data sets.
4.2 ML-AE for manifold generation
Manifold reconstruction. This set of experiments compares the ML-AE with AE, VAE, and TopoAE. Table 4 compares the 9 quality metrics (each value being the average over the 10 runs) for the 4 autoencoders with the Swiss Roll (800 points) data. The MPE is the average of the MPEs at the corresponding encoder and decoder layers, and the other metrics are calculated between the input and output layers of the AEs. While the other 3 autoencoders fail to unfold the manifold data sets (hence their metrics do not make much sense), the ML-AE produces good quality results, especially in terms of the isometry- and Lipschitz-related metrics. The resulting metrics also suggest that K-max could be used as an indicator of success or failure in manifold unfolding.
Manifold data generation. The ML-Dec part of the trained ML-AE can be used to generate new data of the learned manifold, mapping from random samples in the latent space to points in the ambient input space. The generated manifold data points, shown in Appendix A.5, are well-behaved due to the bi-Lipschitz continuity. It also suggests that it is possible to construct invertible, bijective mappings between spaces of different dimensionalities using the proposed MLDL method.
4.3 Results with high-dimensional datasets
Having evaluated toy datasets with perceivable structure, this section presents the results with the MNIST dataset in 784-D and the Spheres dataset in 101-D space, obtained by using the ML-Enc and ML-AE in comparison with others, shown in Table 5 (and Figures A1, A2 and A3 in Appendix A.2). For the MNIST dataset of 10 digits, a subset of 8000 points is randomly chosen for training and another non-overlapping subset of 8000 points for testing. The ML-Enc is used for NLDR by manifold learning. The results are shown in the first part of Table 5.
The Spheres10000 dataset, proposed by TopoAE , is composed of 1 big sphere enclosing 10 small ones in 101-D space. Spheres5500+5500 differs in that its big sphere consists of only 500 samples (whereas that in Spheres10000 has 5000) – the data is so sparse that the smallest within-sphere distance on the big sphere can be larger than that between the big sphere and some small ones. The results are shown in the lower parts of Table 5 and in Appendix A.2. From Table 5 (and Appendix A.2), we can see the following:
For the MNIST dataset, the ML-Enc achieves overall the best in terms of the performance metric values and possesses an ability to generalize to unseen test data whereas the other compared methods cannot. Visualization-wise, however, t-SNE delivers the most appealing result.
For the Spheres dataset, both the ML-AE and the ML-Enc (without a decoder) perform better than the SOTA TopoAE and the ML-AE generalizes better than the ML-Enc.
The ML-AE and the ML-Enc handle sparsity better and generalize better than the others.
Overall, the MLDL framework has demonstrated its superiority over the compared methods in terms of Lipschitz constant and isometry-related properties, realizing its promise.
4.4 Ablation study
This evaluates the effects of the two loss terms of the ML-AE on the 9 performance metrics, with the Swiss Roll (800 points) data: (A) the LIS loss and (B) the push-away loss. Table 6 shows the results when the LIS loss is applied between layers 0 and L and between layers 0 and 0'. The conclusions are: (1) the LIS loss (A) is the most important for achieving excellent results; (2) the push-away term (B), which is applied with a decreasing weight that diminishes on convergence, helps unfold manifolds, especially with challenging input. Overall, “AB” is the best combination to make the algorithms work. See more in Appendix A.6.
5 Conclusion
The proposed MLDL framework imposes the cross-layer LIS prior constraint to optimize neural transformations in terms of local geometry preservation and local bi-Lipschitz constants. Extensive experiments with manifold learning, reconstruction, and generation consistently demonstrate that the MLDL networks (ML-Enc and ML-AE) preserve local geometry well for manifold learning and data generation and achieve excellent local bi-Lipschitz constants, advancing deep learning techniques. The main ideas of MLDL are general and effective and are potentially applicable to a wide range of neural networks for improving representation learning, data generation, and network stability and robustness. Future work includes the following: (1) while LIS preserves the Euclidean distance, nonlinear forms of metric preservation will be explored to allow more flexibility for more complicated nonlinear tasks; (2) developing invertible, bijective neural network mappings using the LIS constraint with bi-Lipschitz continuity; (3) extending the unsupervised version of MLDL to self-supervised, semi-supervised and supervised tasks; (4) further formulating MLDL so that cross-layer link hyperparameters become part of the learnable parameters.
We would like to acknowledge funding support from the Westlake University and Bright Dream Joint Institute for Intelligent Robotics and thank Zicheng Liu, Zhangyang Gao, Haitao Lin, and Yiming Qiao for their assistance in processing experimental results. Thanks are also given to Zhiming Zhou for his helpful comments and suggestions for improving the manuscript.
A.1 Definitions of performance metrics
We adopted most of the performance metrics used in TopoAE because they are suitable for evaluating geometry- and topology-based manifold learning and data generation. We restrict the cross-layer related metrics to those concerning two metric spaces, namely the input space X and the latent space Z, because the other compared algorithms do not use other cross-layer constraints. The following notation is used in the definitions:
d_X(i, j): the pairwise distance between samples i and j in the input space X;
d_Z(i, j): the pairwise distance between samples i and j in the latent space Z;
N_i^X: the set of indices to the k-nearest neighbors (k-NN) of x_i;
N_i^Z: the set of indices to the k-NN of z_i;
r_X(i, j): the closeness rank of x_j in the k-NN of x_i;
r_Z(i, j): the closeness rank of z_j in the k-NN of z_i.
The evaluation metrics are defined below:
#Succ is the number of times, out of 10 runs each with a random seed, where a manifold learning method has successfully unfolded the 3D manifold (the Swiss Roll or S-Curve) in the input to a 2D planar embedding without twists, tearing, or other defects.
L-KL (local KL divergence) measures the discrepancy between distributions of local distances in two spaces, defined as

L-KL = (1/M) Σ_i Σ_{j ∈ N_i} p_ij log(p_ij / q_ij),

where p_ij (resp. q_ij) is the “similarity” (a nonlinear function of the distance) between samples i and j in the input (resp. latent) space, defined as

p_ij = exp(−d_X(i, j)² / σ) / Σ_{j' ∈ N_i} exp(−d_X(i, j')² / σ),

and analogously for q_ij, where σ is the locality parameter.
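An illustrative implementation under these assumptions (Gaussian similarities normalized over each neighborhood; our sketch, not the paper's code):

```python
import numpy as np

def local_kl(DX, DZ, neighbors, sigma=1.0):
    """Local KL divergence between neighborhood similarity distributions.

    DX, DZ: (M, M) distance matrices in the input and latent spaces;
    neighbors[i]: list of neighbor indices of sample i; sigma: locality
    parameter.  Returns the mean per-neighborhood KL divergence.
    """
    kl = 0.0
    for i, Ni in enumerate(neighbors):
        Ni = list(Ni)
        p = np.exp(-DX[i, Ni] ** 2 / sigma); p /= p.sum()  # input-space similarities
        q = np.exp(-DZ[i, Ni] ** 2 / sigma); q /= q.sum()  # latent-space similarities
        kl += float(np.sum(p * np.log(p / q)))
    return kl / len(neighbors)
```

The value is zero when the two similarity distributions agree on every neighborhood and positive otherwise.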
ARRC (averaged relative rank change) measures the average change in neighbor ranking between two spaces (layers):

ARRC = (1 / (k_max − k_min + 1)) Σ_{k = k_min}^{k_max} (1 / H) Σ_i Σ_{j ∈ N_i^Z(k)} | r_X(i, j) − r_Z(i, j) |,

where k_min and k_max are the lower and upper bounds of the k-NN sizes and H is the normalizing term, the maximum attainable total rank change.
Trust (trustworthiness) measures how well the nearest neighbors of a point are preserved when going from the input space to the latent space:

Trust = (1 / (k_max − k_min + 1)) Σ_{k = k_min}^{k_max} [ 1 − (2 / (M k (2M − 3k − 1))) Σ_i Σ_{j ∈ N_i^Z(k) \ N_i^X(k)} ( r_X(i, j) − k ) ],

where k_min and k_max are the bounds of the number of nearest neighbors, so Trust is averaged over different k-NN sizes.
Cont (continuity) is the counterpart of Trust in the opposite direction (from the latent space back to the input space), obtained by swapping the roles of the two spaces:

Cont = (1 / (k_max − k_min + 1)) Σ_{k = k_min}^{k_max} [ 1 − (2 / (M k (2M − 3k − 1))) Σ_i Σ_{j ∈ N_i^X(k) \ N_i^Z(k)} ( r_Z(i, j) − k ) ].
LGD (locally geometric distortion) measures how much corresponding distances between neighboring points differ in two metric spaces and is the primary metric for isometry, defined as

LGD = (1 / (M k)) Σ_i Σ_{j ∈ N_i} | d_Z(i, j) − d_X(i, j) |²,

where k is the neighborhood size used in MLDL neural network training.
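Under the same notation, LGD can be computed directly from the two distance matrices (a sketch assuming the squared-difference form; the exact normalization may differ):

```python
def lgd(DX, DZ, cliques):
    """Locally geometric distortion: mean squared difference between
    corresponding neighbor distances in two metric spaces.

    DX, DZ: distance matrices (indexable as D[i][j]);
    cliques: iterable of (i, j) neighbor pairs.
    """
    diffs = [(DX[i][j] - DZ[i][j]) ** 2 for i, j in cliques]
    return sum(diffs) / len(diffs)
```

A value of zero means the mapping is exactly isometric on the given neighbor pairs.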
MPE (mean projection error) measures the “coplanarity” of a set of 3D points (in this work, the 3D layer before the final 2D latent layer). The least-squares 2D plane in the 3D space is fitted to the 3D points {p_i}, which are then projected onto the fitted plane as {p̂_i}. The MPE is defined as

MPE = (1/M) Σ_i || p_i − p̂_i ||.
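The MPE can be computed via a PCA-style plane fit: the least-squares plane normal is the direction of least variance, and the error is the mean distance of the points to that plane (our own sketch of the computation):

```python
import numpy as np

def mean_projection_error(P):
    """Mean distance of 3D points P (M, 3) to their least-squares plane.

    The plane passes through the centroid; its normal is the right
    singular vector associated with the smallest singular value.
    """
    c = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - c)
    normal = Vt[-1]                       # direction of least variance
    return float(np.abs((P - c) @ normal).mean())
```

For exactly coplanar points the value is (numerically) zero, and it grows as the embedding bends out of a plane.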
K-min and K-max are the minimum and maximum of the local bi-Lipschitz constant for the homeomorphism between layers, with respect to the given neighborhood system:

K_i = max_{j ∈ N_i} max( d_Z(i, j) / d_X(i, j), d_X(i, j) / d_Z(i, j) ),

where N_i is the k-NN neighborhood used in defining the neighborhood system; K-min = min_i K_i and K-max = max_i K_i.
MRE (Mean reconstruction error) measures the difference between the input and output of an autoencoder, as usually defined. More generally, an MRE may also be defined to measure the difference between a pair of corresponding data in the multi-layer encoder and decoder.
While the meanings of MRE and KL-Divergence are well known, those of the other metrics are explained as follows:
#Succ is the primary metric measuring the success rate of unfolding a manifold. Without successful unfolding, the other metrics would not make sense. This metric is based on manual observation (see examples in A.8 at the end of this document).
ARRC, Trust and Cont all measure changes in neighboring relationships across layers. We think that, of these three, Cont is more appropriate than the other two because it emphasizes the neighboring relationship in the target latent embedding space.
LGD is the primary metric measuring the degree to which the LIS constraint is violated. However, it is unable to detect folding.
K-min and K-max are the key metrics for local bi-Lipschitz continuity, with K-max ≥ K-min ≥ 1. The closer to 1 they are, the better the network homeomorphism preserves the isometry, the more stable the network is in training, and the more robust it is against adversarial attacks.
K-max can effectively identify collapse in the latent space: if the mapping maps two distinct input samples to an identical point (collapse), K-max becomes huge (approaching infinity), whereas L-KL is not sensitive to such collapse.
Every set of experiments is run 10 times, each with a data set generated using a different random seed. Every final metric shown is the average of the 10 results. When a run is unsuccessful in unfolding the input manifold data, the resulting averaged metrics are not very meaningful, so the numbers are shown in gray in the tables of evaluation metrics.
A.2 ML-Enc manifold learning
Table A1 compares the ML-Enc with the other algorithms on the 8 evaluation metrics for the S-Curve. Because LLE and t-SNE failed to unfold the manifold (zero success rate), their metrics should be considered invalid, even though t-SNE achieved the lowest MPE value due to its collapsing to a small cluster. The results show that the ML-Enc performs significantly better than the others on all the metrics except MPE.
Having evaluated toy datasets with perceivable structure, the following presents the results on the MNIST dataset in the 784-D space and the Spheres dataset in the 101-D space, obtained using the ML-Enc and ML-AE in comparison with the others. For the MNIST dataset of 10 digits, a subset of 8000 points is randomly chosen for training and a disjoint subset of 8000 points for testing. The ML-Enc is used for NLDR by manifold learning. The results are shown in Figs. A1-A3.
A.3 Robustness to sparsity and noise
Results for different sample sizes are presented in Tables A2-A5. Performance metrics under different noise levels are presented in Tables A6-A9. Here, four metrics, Trust, LGD, K-min, and K-max, are selected to compare the ML-Enc with the others.
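The sparsity and noise sweeps can be set up as follows with scikit-learn's Swiss Roll generator. The sample sizes and noise levels below are illustrative assumptions, not the paper's exact values.

```python
from sklearn.datasets import make_swiss_roll

# Assumed sweep values, for illustration only.
sizes = [500, 1000, 2000, 4000, 8000]
noise_levels = [0.0, 0.25, 0.5, 1.0]

# Sparsity sweep: vary the sample size at zero noise.
sparse_sets = {n: make_swiss_roll(n_samples=n, random_state=0)[0] for n in sizes}

# Noise sweep: fix the sample size, vary the Gaussian noise level.
noisy_sets = {s: make_swiss_roll(n_samples=800, noise=s, random_state=0)[0]
              for s in noise_levels}
```

Each data set is then fed to the ML-Enc and the baselines, and the four metrics are computed on the resulting embeddings.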
A.4 Generalization to unseen data
Fig. A4 demonstrates that a learned ML-Enc can generalize well to unseen data, unfolding a modified version of the same manifold into the corresponding version of the embedding. The ML-Enc model is trained with a Swiss Roll (800 points) dataset. The test is done as follows. First, a set of 8000 points of the Swiss Roll manifold is generated; from these 8000 points, the shape of a diamond, square, pentagram, or five-ring is removed, respectively, creating 4 test sets. Each point of a test set is transformed independently by the trained ML-Enc to obtain an embedding. We can see from each of the resulting embeddings that the unseen manifold data sets are well unfolded by the ML-Enc and the removed shapes are preserved very well, illustrating that the learned ML-Enc generalizes well to unseen data. Since the LLE-based, LTSA, and ISOMAP algorithms do not possess such a generalization ability, the ML-Enc is compared with the encoder parts of the AE-based algorithms. Unfortunately, AE and VAE failed altogether on the Swiss Roll data sets.
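The construction of such a test set can be sketched as follows; the cut-out region is a square in the unrolled coordinates (roll angle, height), and its bounds here are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# Generate 8000 Swiss Roll points, then remove a square region in the
# unrolled coordinates (roll parameter t, height).  Bounds are assumed.
X, t = make_swiss_roll(n_samples=8000, random_state=0)
height = X[:, 1]
hole = (np.abs(t - t.mean()) < 1.0) & (np.abs(height - height.mean()) < 3.0)
X_test = X[~hole]                              # manifold with a square hole

# Each test point would then be mapped independently by the trained
# encoder:  Z_test = ml_enc(X_test)   # ml_enc: trained ML-Enc (not shown)
```

Because each point is transformed independently, preservation of the hole's shape in the embedding directly reflects how well the learned map generalizes beyond the training samples.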
A.5 ML-AE for manifold data generation
Fig. LABEL:fig:manifolddatageneration shows manifold data reconstruction and generation using the ML-AE, for which AE and VAE both failed to learn the unfolding. In the learning phase, the ML-AE performs manifold learning for NLDR and then reconstruction: it takes (a) the training data in the ambient space as input, outputs the embedding (b) in the learned latent space, and then reconstructs the data (c) back in the ambient space. In the generation phase, the ML-Dec takes random samples (d) in the latent space as input and maps them to the manifold (e) in the ambient data space.
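The generation phase described above can be sketched as follows, with an untrained random MLP standing in for the trained ML-Dec. Everything here is illustrative: the stand-in weights, the latent bounding-box sampling, and the sample counts are assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained ML-Dec: a random 2 -> 100 -> 3 MLP.
W1, b1 = rng.normal(size=(2, 100)), np.zeros(100)
W2, b2 = rng.normal(size=(100, 3)), np.zeros(3)
def ml_dec(z):
    return np.maximum(z @ W1 + b1, 0.0) @ W2 + b2

# Generation phase: draw random latent samples inside the bounding box
# of the learned embedding, then map them through the decoder.
z_train = rng.random((800, 2))                 # placeholder 2-D embeddings
lo, hi = z_train.min(axis=0), z_train.max(axis=0)
z_new = lo + (hi - lo) * rng.random((1000, 2))
x_gen = ml_dec(z_new)                          # generated ambient samples
```

With the actual trained ML-Dec, the decoded points land on the learned manifold because the decoder is a LIS-constrained homeomorphism from the latent space.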
A.6 More ablation studies on the ML-AE loss terms
Three more sets of ablation experiments are provided, further demonstrating that the LIS constraint in MLDL delivers on its promises. The first set evaluates different cross-layer weight schemes for the ML-Enc, based on the 5-layer network architecture (3-100-100-100-3-2) presented in Section 4.1. The following 6 cross-layer weight schemes (nonzero weights in ) are evaluated:
M1: between the input and latent layers only;
M2: between each pair of adjacent layers, the weight increasing as the layers go deeper;
M3: between the latent layer and each of the other layers, the weight increasing as the other layer goes deeper;
M4: between the input layer and each of the other layers, the weight increasing as the other layer goes deeper;
M5: between the input layer and each of the other layers, the weight being equal for all layers;
M6: between the input layer and each of the other layers, the weight decreasing as the other layer goes deeper.
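The cross-layer penalty these schemes weight can be sketched as follows. This is a simplified numpy sketch assuming an L1 isometry penalty over a fixed neighbor list; `lis_loss` and its arguments are hypothetical names, not the paper's code.

```python
import numpy as np

def lis_loss(acts, weights, neighbors):
    # Cross-layer LIS penalty (sketch): for each weighted layer pair
    # (l, lp), accumulate |d_l(i, j) - d_lp(i, j)| over neighbor pairs,
    # where d_l is the Euclidean distance at layer l.
    total = 0.0
    for (l, lp), mu in weights.items():
        for i, j in neighbors:
            d_l = np.linalg.norm(acts[l][i] - acts[l][j])
            d_lp = np.linalg.norm(acts[lp][i] - acts[lp][j])
            total += mu * abs(d_l - d_lp)
    return total

# The input-to-latent-only scheme: a single nonzero weight between the
# input (layer 0) and the latent layer (layer 5); the value 1.0 and the
# toy activations are illustrative.
acts = {0: np.random.rand(6, 3), 5: np.random.rand(6, 2)}
loss = lis_loss(acts, {(0, 5): 1.0}, neighbors=[(0, 1), (2, 3), (4, 5)])
```

A weight scheme is then just a choice of which `(l, lp)` entries of `weights` are nonzero and how their values vary with depth.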
Secondly, it is interesting to evaluate the ML-Enc's ability to preserve the local geometric structure not only at the latent layer but also at intermediate layers. To see this aspect, we provide the metrics calculated between layer and layer . The ablation results are shown in Table A10. Schemes M1 and M4 achieve a 100% success rate over the 10 runs, M5 and M6 have some successes, whereas M2 and M3 have none. Of M1 and M4, the latter seems better overall.
Thirdly, we provide ablation experiments with different corresponding-layer weight schemes for the ML-AE, based on the ML-AE architecture (3-100-100-100-3-2-3-100-100-100-100-3) presented in Section 4.3, where the LIS is imposed between layers 0 and 5. A corresponding-layer scheme is determined by the nonzero weights between the corresponding layers of the ML-Enc and ML-Dec. The following 4 weight schemes are evaluated:
M7: no LIS constraints between corresponding layers (baseline);
M8: the corresponding-layer weight for the LIS constraint increases as the layer number becomes bigger;
M9: all the corresponding-layer weights for the LIS constraint are equal;
M10: the corresponding-layer weight for the LIS constraint decreases as the layer number becomes bigger.
The performance metrics are shown in Table A11, where the metric numbers are calculated between layers and . The results demonstrate that M10 outperforms the other three schemes in all metrics except K-max. Compared with the incremental and equal weight schemes (M8 and M9), decreasing the corresponding-layer weight for the LIS constraint as the layer number becomes bigger yields a greater performance gain, since the closer the data is to the input layer of the encoder, the more authentic and reliable it is.
- (2018) Sorting out Lipschitz function approximation. arXiv preprint arXiv:1811.05381. Cited by: §1, §3.1.
- (2018) Markov chain neural networks. CoRR abs/1805.00784. Cited by: §2.1.
- (2017-06) Spectrally-normalized margin bounds for neural networks. arXiv e-prints, pp. arXiv:1706.08498. Cited by: §1, §3.1.
- (1974) “Spatial interaction and the statistical analysis of lattice systems” (with discussions). Journal of the Royal Statistical Society, Series B 36, pp. 192–236. Cited by: §2.1.
- (2017-07) Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
- (2009) Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association 104 (485), pp. 209–219. Cited by: §1.
- (2019-02) Certified Adversarial Robustness via Randomized Smoothing. arXiv e-prints, pp. arXiv:1902.02918. Cited by: §1, §3.1.
- (2003) Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences 100 (10), pp. 5591–5596. Cited by: §1, §4.
- (2016) Testing the manifold hypothesis. Journal of the American Mathematical Society 29 (4), pp. 983–1049. Cited by: §1.
- (2008) Iterative non-linear dimensionality reduction with manifold sculpting. In Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer and S. T. Roweis (Eds.), pp. 513–520. Cited by: §1.
- (1984-11) “Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images”. IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (6), pp. 721–741. Cited by: §1.
- (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §1.
- (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507. Cited by: §1, §4.2, §4.
- (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.2, §4.
- (2020) Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations, Cited by: §1.
- (1995) Markov random field modeling in computer vision. Springer. Cited by: §1.
- (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §1, §4.
- (2016) Nearly isometric embedding by relaxation. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett (Eds.), pp. 2631–2639. Cited by: §1.
- (2002-01) Laplacian eigenmaps for dimensionality reduction and data representation. Technical Report, University of Chicago. Cited by: §1.
- (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, pp. 5767–5777. Cited by: §1.
- (2020) Topological autoencoders. In Proceedings of the 37th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research. Cited by: §1, §4.2, §4.3, §4, A.1 Definitions of performance metrics.
- (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.
- (2019) Loss-sensitive generative adversarial networks on lipschitz densities. International Journal of Computer Vision, pp. 1–23. Cited by: §1, §3.1.
- (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: §1, §4.
- (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §1, §4.
- (2018) Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 3835–3844. Cited by: §1.
- (2016-09) Topological Data Analysis. arXiv e-prints. Cited by: §1.
- (2018-01) Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach. arXiv e-prints, pp. arXiv:1801.10578. Cited by: §1, §3.1.
- (2020-07) Graph Structure of Neural Networks. arXiv e-prints. Cited by: §2.1.
- (2018) Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems, pp. 4939–4948. Cited by: §1.
- (2007) MLLE: modified locally linear embedding using multiple weights. In Advances in Neural Information Processing systems, pp. 1593–1600. Cited by: §1, §4.
- (2019) Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687. Cited by: §1, §3.1.