Geometric Disentanglement by Random Convex Polytopes
Abstract
Finding and analyzing meaningful representations of data is the purpose of machine learning. The idea of representation learning is to extract representations from the data itself, e.g., by utilizing deep neural networks. In this work, we examine representation learning from a geometric perspective. In particular, we focus on the convexity of classes and clusters as a natural and desirable representation property, for which robust and scalable measures are still lacking. To address this, we propose a new approach, called Random Polytope Descriptor, that provides a convex description of data points based on the construction of random convex polytopes. This ties in with current methods for statistical disentanglement. We demonstrate the use of our technique on well-known deep learning methods for representation learning. Specifically, we find that popular regularization variants such as the Variational Autoencoder can destroy crucial information that is relevant for tasks such as out-of-distribution detection.
1 Introduction
Finding a meaningful data representation that captures the information that is relevant for some task or problem lies at the very core of any data-driven discipline such as statistics or machine learning. In contrast to hand-crafted data features, representation learning [4] aims to learn such representations from the data itself. The concept of representation learning is also one of the fundamental principles of deep learning [33, 47], which set a new state of the art in recent years in numerous domains such as computer vision, speech recognition, or natural language processing. Here, representation functions are given by multi-layered neural networks that are parameterized by weights which are jointly optimized on some learning objective.
What makes a representation meaningful and which general principles are useful for representation learning is being actively discussed [4, 15, 53, 35, 54]. One prominent line of research considers the statistical nature of disentanglement [46, 5, 4, 11, 30, 35, 36], which is rooted in the idea that a learned representation should "separate the distinct, informative factors of variation in the data" [35]. While this is a valuable notion, other important properties of representation learning, such as smoothness, the hierarchy of concepts, temporal and spatial coherence, or the preservation of natural clusters, are also desired [4] and have been studied less. In this work, we focus on the convexity of classes and clusters in the image of deep neural network mappings and take a geometric perspective on representation learning.
Convexity is a very natural and fundamental representation property in machine learning in general. For instance, the separation of classes via multiple hyperplanes or half-spaces in supervised classification, such as in Support Vector Machines (SVMs) [49] in the Reproducing Kernel Hilbert Space (RKHS) via the hinge loss, or in deep neural networks (DNNs) [18] in the output space via the cross-entropy loss, forms polyhedra that are convex and generally unbounded. Moreover, many well-known unsupervised learning methods make explicit convexity assumptions on the support of the data representation, such as Voronoi cells in k-means clustering or ellipsoids in Gaussian Mixture Models (GMMs). It is also worth noting that convexity is implicit in the Gaussian prior assumptions of popular deep generative models such as Variational Autoencoders (VAEs) [26, 41, 53], Generative Adversarial Networks (GANs) [19, 13, 14], or Normalizing Flows [42, 28, 27]. Finally, one-class classification methods for support estimation, such as the One-Class SVM [48] or (Deep) Support Vector Data Description (SVDD) [52, 44], also rely on convexity assumptions on the data representation via maximum-margin hyperplane and minimum enclosing hypersphere descriptions, respectively.
Although a convex representation of classes and clusters is such a natural and commonly desired property, measuring convexity in a robust and efficient way in high-dimensional spaces is a challenging and fairly non-trivial problem. An approach that aims to solve the convex hull problem of a set of data points exactly, even in an idealized setting, is prohibitively expensive in high dimensions. Moreover, such an exact convex hull description would not be robust towards outliers and would thus be rather undesirable in terms of generalization.
In this work, we propose a scalable and robust method called Random Polytope Descriptor (RPD) for evaluating convexity in representation learning. Our method is based on concepts from convex geometry and constructs a polytope (a piecewise-linear, bounded convex body) around the training data in representation space. Since polytopes by themselves may also suffer from a combinatorial explosion, we construct our descriptor from random convex polytopes instead, which also makes it more robust. Finally, using the proximities to such polytopes allows us to judge the geometric disentanglement of representations, i.e., how well classes and clusters are separated into convex bodies. Our main contributions are the following:

- We propose the Random Polytope Descriptor (RPD), a method based on the construction of random convex polytopes for evaluating convexity in representation learning, and we theoretically prove its scalability.
- We demonstrate the usefulness of RPD in experiments on well-known supervised as well as unsupervised deep learning methods for representation learning.
- We find that popular regularization variants of deep autoencoders, such as the Variational Autoencoder, can destroy crucial geometric information that is relevant for out-of-distribution detection.
2 Related Work
How to evaluate the quality and usefulness of a representation is a difficult question to answer in general and the subject of ongoing research [4, 15, 53, 35, 54]. If only one single task is of interest, a straightforward approach of course would be to evaluate the performance of a representation on some measure that is meaningful to the task at hand. Considering the supervised classification task, for instance, one might evaluate the quality of a model and representation by using the accuracy measure, i.e., by how well the representation separates some test data via hyperplanes in agreement with the respective ground-truth labels. However, even in such a well-defined standard task as classification, other representation properties such as the robustness against out-of-distribution samples [38, 20, 31, 34, 44] or adversarial attacks [50, 9, 6, 8], as well as interpretability aspects [32, 45], might be desirable and thus relevant representation quality criteria. This matter becomes even more challenging in the unsupervised or self-supervised setting, where the aim is to learn more generic data representations that prove useful for a variety of downstream tasks (e.g., multi-task or transfer learning) [55, 43, 51].
Bengio et al. [4] have collected and formulated some well-known generic criteria, so-called generic priors for representation learning. These include smoothness (if $x \approx y$, then also $f(x) \approx f(y)$), the hierarchy of concepts (e.g., for images going from pixel to object features), semi-supervised learning (representations for supervised and unsupervised tasks should align), temporal and spatial coherence (small variations across time and space should result in similar representations), the preservation of natural clusters (data generated from the same categorical variable should have similar representations), and statistical disentanglement (a representation should separate the distinct, informative factors of variation in the data). Different variants of the Variational Autoencoder (VAE) [26, 41, 21, 30, 25, 10], which especially emphasize statistical disentanglement, are currently considered to be the state of the art for unsupervised representation learning [53, 35]. These approaches are closely related to earlier works on (non-linear) independent component analysis (ICA), which study the problem of recovering statistically independent (latent) components of a signal [12, 3, 22, 2, 24].
As mentioned above, the convexity of classes and natural clusters is another fundamental representation property that is often only expressed implicitly through certain model assumptions. The basic property that linear interpolations between data representations should result in representations from the same class or cluster naturally ties into the generic priors above. This also follows the simplicity principle from Bengio et al. [4] that “in good representations, the factors are related through simple, typically linear dependencies.”
3 Methods
The applicability of our method depends on the general clustering assumption: that the training data is composed of different unobserved groups (classes), and that the joint density of the observed variables is a mixture of class-specific densities. This assumption might be exemplified by assuming that the data come from a mixture of multivariate normal distributions, which is a natural assumption in data analysis. Moreover, we assume that the superlevel sets of the density function for each of the summands in the mixture form a convex set, except maybe for regions of very low probability.
The method is designed to work well on embeddings into a feature space where each of the classes is embedded into mutually disjoint convex sets. Such an assumption is at the foundation of classical methods, e.g., SVMs and kernel methods. However, these methods do not take the finiteness of the training data into account and simply partition the whole feature space among the classes, i.e., most classes are usually assigned unbounded regions. While such a feature might seem desirable, as it affords generalization properties for models, it also increases the surface for adversarial and out-of-distribution attacks. Here we attempt to remedy this limitation by explicitly partitioning the feature space into compact convex regions and an unbounded (non-convex) "out-of-distribution" set.
The general idea is to use the convex hull $\operatorname{conv}(S)$ of the points from a given class $S$ and the distance to $\operatorname{conv}(S)$ as a dissimilarity score. As mentioned earlier, both the convex hull computation and the distance computation from a polytope are prohibitively expensive in higher dimensions; therefore we replace both of these steps by approximations. In this section we introduce the definition of an (approximate) dual bounding body of $S$ and show how such a polytope can be turned into a descriptor separating the points in the set from its exterior. Instead of computing the actual distance, we use the substitute of a scaling distance (a piecewise-linear scaling of the Euclidean distance) from a well-defined central point of the polytope.
3.1 An idealized setting
We consider $\ell$ distinct sets $S_1, \dots, S_\ell$, where $S_i$ is the set of samples of class $i$, considered in feature space $\mathbb{R}^d$. Associated with these classes are the convex sets $C_i = \operatorname{conv}(S_i)$.
If all pairwise intersections are empty, i.e., $C_i \cap C_j = \emptyset$ for $i \neq j$, then these convex sets provide perfect descriptors, provided that we have some oracle which can test whether a new sample is contained in one of the sets $C_i$ or in their common complement.
In the applications below each class will be finite, whence the sets $C_i$ form convex polytopes; cf. [56] for general background. We will return to the more general setting later; but for now we will stick to finite sets and polytopes. In this case an oracle for checking the containment $x \in C_i$ is given by linear programming. While linear programs can be solved in (weakly) polynomial time, they are still somewhat expensive in high dimensions. If many samples need to be checked, it is thus desirable to convert each set $C_i$ into a description in terms of linear inequalities. A minimal encoding of this kind is given by the facets of the polytope $C_i$. Then the containment can be decided in $O(fd)$ time, where $f$ is the number of facets of $C_i$ and $d$ is the dimension. We assume that each set $S_i$ affinely spans $\mathbb{R}^d$, whence $C_i$ is full-dimensional, and thus its facet description is unique.
Converting $S_i$ into the facets of $C_i$ is the convex hull problem. It is worth noting that the dual problem of computing the vertices of a polytope given in terms of finitely many linear inequalities (assuming that it is bounded) is equivalent to the convex hull problem by means of cone polarity. If we let $n$ be the cardinality of $S_i$, then McMullen's upper bound theorem [37] says that the number of facets of $C_i$ is at most

$$\binom{n - \lceil d/2 \rceil}{n - d} + \binom{n - 1 - \lfloor d/2 \rfloor}{n - d}. \tag{1}$$
That bound is actually tight as can be seen, e.g., from the cyclic polytopes. In view of cone polarity the same (tight) estimate holds for the number of vertices if the number of inequalities is given. This means that the number of vertices and the number of facets of a polytope may differ by several orders of magnitude.
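As a concrete check of the bound in (1), the following sketch (the function name is ours, not from the paper) evaluates McMullen's maximal facet count for small parameters, including a case where the bound is attained exactly.

```python
from math import comb

def mcmullen_max_facets(n, d):
    """Upper bound (1) on the number of facets of a d-polytope with n
    vertices; attained by the cyclic polytopes."""
    return comb(n - (d + 1) // 2, n - d) + comb(n - 1 - d // 2, n - d)

# A pentagon (n = 5, d = 2) has exactly 5 facets (edges), matching the bound.
print(mcmullen_max_facets(5, 2))   # 5
# In d = 3 the bound reduces to the familiar 2n - 4 for simplicial polytopes.
print(mcmullen_max_facets(10, 3))  # 16
print(mcmullen_max_facets(10, 4))  # 35
```

Evaluating the bound for larger $d$ illustrates the combinatorial explosion mentioned above: the facet count grows like $n^{\lfloor d/2 \rfloor}$.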
This sketch of a classification algorithm is too naive to be useful in practice. Yet it is instructive to keep this in mind as a guiding principle. We identify three issues.

1. The idealized setup does not deal with outliers, i.e., it is generally not robust.
2. Computing the convex hull of each class can be prohibitively expensive.
3. The classes may not be well represented by their convex hulls.
Our main algorithmic contribution is a concept for addressing all three issues simultaneously. For empirical data on convex hull computations see [1].
3.2 Random polytope descriptors
As our main contribution we propose a certain convex polytope as an approximate description of some learned representation. Its algorithmic efficiency and its robustness to the presence of outliers can be tuned by a pair of parameters. Again we assume that the set $S \subset \mathbb{R}^d$ is finite.
Definition 1.
Let $U = \{u_1, \dots, u_m\} \subset \mathbb{R}^d$ be a set of $m$ unit vectors chosen uniformly at random, and let $k$ be a positive integer. The random polytope descriptor $\mathrm{RPD}_{m,k}(S)$ is the polyhedron

$$\mathrm{RPD}_{m,k}(S) \;=\; \bigl\{\, x \in \mathbb{R}^d \;:\; \langle u_i, x \rangle \le \tau_k\bigl(\{\langle u_i, s\rangle : s \in S\}\bigr) \text{ for } i = 1, \dots, m \,\bigr\}, \tag{2}$$

where $\tau_k$ denotes the $k$-th largest element of the set.

Tacitly we will assume that $\mathrm{RPD}_{m,k}(S)$ is bounded. This is the case if and only if $U$ is positively spanning. It is worth noting that the random polytope descriptors form a variation of dual bounding bodies; cf. Section A.

We briefly sketch what it means to compute $\mathrm{RPD}_{m,k}(S)$, and which cost this incurs. First we need $m$ random unit vectors; cf. [29, 3.4.1.E.6]. There is no reason to be overly exact here; so it is adequate to assume that the coefficients of the random vectors are constantly bounded. Given $U$, we then evaluate $m \cdot |S|$ scalar products to obtain the H-description (2) of $\mathrm{RPD}_{m,k}(S)$, resulting in a total complexity of $O(m \cdot |S|)$. Throughout we take the number $d$ of latent dimensions as a constant.

In the context of unsupervised anomaly detection (see Section 4) it is often beneficial to discard a fixed amount of data from $S$, i.e., label them as anomalous. The function $\tau_k$ serves as a "threshold" to eliminate the most extreme points from the training set: per direction, the $k-1$ largest projections are cut off. For (semi-)supervised learning there may be natural candidate functions to replace $\tau_k$ to define variants of $\mathrm{RPD}_{m,k}(S)$, e.g., based on density estimation, weights, or simply the label. However, for simplicity we stick to unsupervised anomaly detection.
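A minimal NumPy sketch of this construction (the helper names are ours, not from the paper): draw random unit directions, record for each the $k$-th largest scalar product with the training set, and test membership against the resulting inequalities.

```python
import numpy as np

def rpd_fit(S, m, k, rng):
    """H-description of the random polytope descriptor of the point set S:
    m random unit directions u_i with thresholds b_i set to the k-th
    largest scalar product <u_i, s> over s in S."""
    U = rng.standard_normal((m, S.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # normalize rows to unit length
    b = np.sort(S @ U.T, axis=0)[-k, :]            # k-th largest per direction
    return U, b

def rpd_contains(U, b, x, tol=1e-9):
    """Membership test: check the m linear inequalities <u_i, x> <= b_i."""
    return bool(np.all(U @ x <= b + tol))

rng = np.random.default_rng(0)
S = rng.random((200, 2))                  # 200 toy training points in the unit square
U, b = rpd_fit(S, m=50, k=1, rng=rng)
# With k = 1 no training point is cut off, so all of S lies in the descriptor;
# a faraway point violates at least one inequality with overwhelming probability.
```

The fitting cost is dominated by the matrix product `S @ U.T`, matching the $O(m \cdot |S|)$ complexity stated above.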
3.3 Scaling distance and anomaly scores
Let $P \subset \mathbb{R}^d$ be a full-dimensional convex polytope with a distinguished interior point $c$. The scaling distance of $x \in \mathbb{R}^d$ to $P$ with respect to $c$ is the quantity

$$\operatorname{sdist}_c(x, P) \;=\; \inf \bigl\{\, \lambda \ge 0 \;:\; x \in c + \lambda\,(P - c) \,\bigr\}. \tag{3}$$

This is the smallest number $\lambda$ such that $P$, inflated by a factor of $\lambda$ from the central point $c$, contains $x$. The point $x$ is contained in $P$ if and only if $\operatorname{sdist}_c(x, P) \le 1$. If $P$ is a descriptor, e.g., as in Definition 1, this may serve as a dissimilarity (anomaly) score, where points further from $P$ are assigned higher scores. Given $x$, $c$ and an H-description of $P$ in terms of $m$ linear inequalities, we can compute $\operatorname{sdist}_c(x, P)$ in $O(m)$ time, for $d$ constant, by evaluating $m$ scalar products.
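Given an H-description $Ux \le b$ and an interior point $c$, the infimum in (3) has a closed form: rearranging $\langle u_i, c + (x - c)/\lambda \rangle \le b_i$ gives $\lambda \ge \langle u_i, x - c\rangle / (b_i - \langle u_i, c\rangle)$, so the scaling distance is the largest of these ratios. A sketch (the function name is ours):

```python
import numpy as np

def scaling_distance(U, b, c, x):
    """Smallest lambda >= 0 with x in c + lambda * (P - c), where
    P = {y : U y <= b} and c is an interior point (so b - U c > 0)."""
    slack = b - U @ c                 # strictly positive since c is interior
    ratios = (U @ (x - c)) / slack
    return max(0.0, float(np.max(ratios)))

# The square [-1, 1]^2 with center c = 0, described by outward unit normals.
U = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
c = np.zeros(2)
print(scaling_distance(U, b, c, np.array([2.0, 0.0])))  # 2.0 (outside)
print(scaling_distance(U, b, c, np.array([0.5, 0.0])))  # 0.5 (inside)
print(scaling_distance(U, b, c, np.array([1.0, 1.0])))  # 1.0 (a vertex)
```

Only the $m$ scalar products in `U @ (x - c)` depend on the query point, matching the $O(m)$ evaluation cost.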
There are several natural candidates for the central point of an arbitrary polytope $P$, which differ with respect to their computational complexity when $P$ is given in terms of inequalities. For instance, the centroid is the center of gravity of $P$; this is hard even to approximate, cf. [39]. The vertex barycenter is the average of the vertices of $P$; this is hard to compute exactly, cf. [17]; further results in loc. cit. make it unlikely that there is an efficient approximation procedure either. The Chebyshev central point is the center of the largest sphere inscribed in $P$; this can be computed in polynomial time by linear programming, cf. [16] and [40]. Note that the latter worst-case complexity assumes that the bit representation of each inequality in the H-description of $P$ is constantly bounded. For the special case of $\mathrm{RPD}_{m,k}(S)$, where $S$ is a normally distributed random sample, all three of the central points discussed are arbitrarily close to the origin, provided that $|S|$ is sufficiently large; cf. Fig. 1.
Let $S = S_1 \cup \dots \cup S_\ell$ be the training data, partitioned into $\ell$ classes. For the estimated distance of a point $x$ to the combined support of all classes we take the minimum

$$\widehat{d}(x) \;=\; \min_{1 \le i \le \ell} \operatorname{sdist}_{c_i}(x, P_i) \tag{4}$$

of the scaling distances, where $P_i = \mathrm{RPD}_{m,k}(S_i)$, and $c_i$ is the corresponding Chebyshev central point. Note that $\widehat{d}(x)$ is a random variable, as the construction of the polytopes is not deterministic. If the estimated distance to the combined support is larger than one, then evaluating (4) also assigns a class which is closest to $x$, namely the index at which the minimum is attained. For generic $x$ this is unique. It follows from our complexity analysis above that estimating the distance to the combined support takes $O(\ell m)$ time, for $d$ and $k$ fixed.
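The minimum in (4) is a short loop over per-class H-descriptions. In the sketch below (names ours) each descriptor is a triple of directions, offsets, and an explicitly supplied central point, standing in for the Chebyshev center for brevity.

```python
import numpy as np

def support_distance(descriptors, x):
    """Minimum of scaling distances over per-class descriptors (U, b, c),
    together with the index of the closest class."""
    def sdist(U, b, c):
        return max(0.0, float(np.max((U @ (x - c)) / (b - U @ c))))
    scores = [sdist(U, b, c) for (U, b, c) in descriptors]
    i = int(np.argmin(scores))
    return scores[i], i

# Two axis-aligned unit "class" squares centered at (0, 0) and (5, 0).
U = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
box = lambda cx: (U, 1.0 + U @ np.array([cx, 0.0]), np.array([cx, 0.0]))
descriptors = [box(0.0), box(5.0)]

dist, cls = support_distance(descriptors, np.array([4.0, 0.0]))
# dist == 1.0 (on the boundary of the second box), cls == 1
```

The cost is one scaling-distance evaluation per class, i.e. $O(\ell m)$ scalar products in total, as in the complexity statement above.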
3.4 Convex separation
Since we do not use any a priori knowledge on the latent representation that we work with, it may happen that the random polytope descriptors of two (or more) classes intersect non-trivially. Again let $S = S_1 \cup \dots \cup S_\ell$ be partitioned into $\ell$ classes, and let $P_i = \mathrm{RPD}_{m,k}(S_i)$ be the random polytope descriptors, for some fixed values of $m$ and $k$. Then we can define the confusion coefficient of classes $i$ and $j$ as the fraction

$$\gamma_{ij} \;=\; \frac{\#\{\, x \in T \;:\; x \in P_i \cap P_j \,\}}{\#T} \tag{5}$$

of the test data $T$ lying in the intersection; i.e., a low value indicates good separation. This may be seen as a Monte Carlo style approximation of the integral of the probability density function of the test data over the intersection $P_i \cap P_j$. Again by evaluating scalar products, computing one confusion coefficient is in $O(m \cdot \#T)$.
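Counting test points in the intersection needs only the inequality checks of the two H-descriptions (a sketch; names ours):

```python
import numpy as np

def confusion_coefficient(desc_i, desc_j, X_test, tol=1e-9):
    """Fraction of test points lying in the intersection of two
    descriptors, each given as an H-description (U, b)."""
    def inside(U, b):
        return np.all(X_test @ U.T <= b + tol, axis=1)
    (Ui, bi), (Uj, bj) = desc_i, desc_j
    return float(np.mean(inside(Ui, bi) & inside(Uj, bj)))

# Two overlapping axis-aligned squares, [-1, 1]^2 and [0, 2]^2.
U = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
P1 = (U, np.array([1.0, 1.0, 1.0, 1.0]))
P2 = (U, np.array([2.0, 0.0, 2.0, 0.0]))
X = np.array([[-0.5, 0.5], [0.5, 0.5], [1.5, 0.5]])
print(confusion_coefficient(P1, P2, X))  # 1/3: only (0.5, 0.5) lies in both
```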
4 Experiments
We conduct experiments on the MNIST and FMNIST datasets to assess the feasibility of our method. Both datasets comprise ten classes each, i.e., $\ell = 10$. It should be stressed that we are primarily interested in the overall reliability of the classification process rather than in individual performance scores.
We train standard AE and VAE networks to embed the MNIST and FMNIST datasets into a $d$-dimensional latent space for various choices of $d$. The autoencoder architectures are of LeNet type, with an encoder that has two convolutional layers with max-pooling, followed by two fully connected layers that map to an encoding of $d$ dimensions. The decoder is constructed symmetrically, where we replace convolution and max-pooling with deconvolution and upsampling, respectively. We use batch normalization [23] and leaky ReLU activation functions in these networks. For training, we take the entire training datasets, minimizing the reconstruction error (AE) or the reconstruction error regularized by the volume of the support of the latent distribution (VAE). None of the networks is given any explicit convexity prior.
It is worth noting that constructing the H-representation (2) of the random polytope descriptor is very fast on modern hardware.
4.1 Outofdistribution detection
Our method suggests to consider the random polytope descriptors as a model for the supports of the individual class distributions in the training data. Here we check how well this approximation works to detect out-of-distribution samples.
Our experimental setup is as follows. For each of the ten classes of the FMNIST training data we constructed a random polytope descriptor. Since this is a supervised learning scenario, it is natural to choose a small threshold parameter $k$. In further experiments (see Section 4.2 and specifically Fig. 9), increasing the number $m$ of hyperplanes beyond a certain point (relative to the dimension $d$ of the latent representation) had little to no influence on the performance, which guided our fixed choice of $m$. Then we evaluated the estimated distance to the support (4) for test data from the FMNIST and MNIST datasets. The results are depicted in Fig. 3.
Comparison of AE and VAE networks.
The minimal scaling distances for samples from FMNIST stay well below $1$ for both networks; recall from (3) that this signals containment in one of the classes. So this behavior is expected, and a first hint at a reasonable performance of random polytope descriptors for anomaly detection; cf. Section 4.2 for details.
However, we observe a very clear distinction between the AE and VAE networks with respect to out-of-distribution samples. With the AE network trained on the FMNIST dataset, out-of-distribution MNIST test data are assigned significantly larger dissimilarity scores. In sharp contrast, the VAE network embedding of the MNIST test set exhibits almost complete overlap with the support of the FMNIST training distribution. This means that the regularization step, which distinguishes the VAE from the AE, destroys information which may be exploited to detect out-of-distribution attacks. We will come back to this when we look at anomaly scores in Section 4.2.
It is worth noting that the FMNIST scores are not concentrated around a single value (as one would observe, e.g., in the case of a Gaussian distribution). Instead they seem to be distributed near a sphere of some radius increasing with the dimension. For instance, for the VAE networks both the mean and the variance of the scores increase with the latent dimension.
Convex separation in classical networks.
We picked ten different classes from the CIFAR100 dataset which, to intuitive human understanding, should be relatively close to each other. We evaluated the convex separation of pretrained, state-of-the-art image recognition networks (AlexNet, ImageNet, and ResNet18) to form the "confusion matrix" presented below. Note that high confusion means that no linear separating hyperplane will provide a good distinction between the classes.
4.2 Anomaly detection
As a second benchmark for the disentanglement of the learned representation we used an anomaly detection test. For each MNIST and FMNIST class an unsupervised anomaly detection scenario was created as follows. The training data consisted of all training data points from the given class, "enriched" by a constant fraction of data points chosen uniformly at random from the other classes. This noisy data is then used to create a random polytope descriptor, and its accuracy against the entire test data is evaluated in the AUC metric (area under the receiver operating characteristic curve). This experiment may be understood as a stochastic approximation of the intersections $P_i \cap P_j$. Indeed, if two random polytope descriptors exhibit an intersection witnessed in the test data, the data points in the intersection will be counted as false negatives in the anomaly detection task.
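This experiment can be sketched end to end on synthetic data (all names and the synthetic setup are ours, not the paper's): fit a descriptor on contaminated training data with $k > 1$ to trim the extremes, score test points by scaling distance from a robust stand-in central point, and compute the AUC by the rank (Mann-Whitney) formula.

```python
import numpy as np

rng = np.random.default_rng(1)

def rpd_fit(S, m, k):
    """H-description with thresholds at the k-th largest scalar product,
    so the k - 1 most extreme points per direction are cut off."""
    U = rng.standard_normal((m, S.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    b = np.sort(S @ U.T, axis=0)[-k, :]
    return U, b

def scaling_scores(U, b, c, X):
    """Scaling distance of each row of X from the central point c."""
    return np.max((X - c) @ U.T / (b - U @ c), axis=1)

def auc(anomalous, normal):
    """Mann-Whitney form of the area under the ROC curve."""
    diff = anomalous[:, None] - normal[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# One synthetic "class" contaminated with 5% points from elsewhere.
inliers = rng.standard_normal((500, 8))
contamination = rng.standard_normal((25, 8)) + 10.0
S = np.vstack([inliers, contamination])

U, b = rpd_fit(S, m=100, k=26)      # k > 1 trims the contaminated extremes
c = np.median(S, axis=0)            # robust stand-in for a central point
test_normal = scaling_scores(U, b, c, rng.standard_normal((200, 8)))
test_anom = scaling_scores(U, b, c, rng.standard_normal((200, 8)) + 10.0)
print(auc(test_anom, test_normal))  # close to 1.0 on this toy setup
```

The coordinate-wise median replaces the Chebyshev central point here purely for brevity; it is robust against the contamination, which is what the experiment requires.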
All experiments have been conducted with a fixed choice of the remaining parameters and a grid of values for the latent dimension $d$ and the number of hyperplanes $m$. The results are shown in Figs. 6, 7, 8 and 9.
Comparison of AE and VAE networks.
When fixing $m$, the number of hyperplanes, we can see in Figs. 6 and 7 a monotonic increase of the scores with the dimension of the latent space. This aligns with the common understanding that the higher the dimension, the better the reconstruction, and hence the more faithful the representation of the data.
When fixing $d$, the dimension, we can observe the dependence of the AUC score on the number $m$ of hyperplanes defining an RPD. The scores increase monotonically with $m$, both for the MNIST (Fig. 8) and the FMNIST (Fig. 9) datasets. This agrees with our intuition: with a larger number of hyperplanes, the approximate bounding body tightens its approximation of the true convex hull. It seems that raising the number of hyperplanes beyond a certain point does not lead to significant improvements in performance.
For the MNIST dataset it seems that combining our descriptors with the VAE is able to separate the clusters already in low dimensions. Provided that the assumption on the convexity of clusters is satisfied, the deterioration of the performance for larger $d$ (for a fixed $m$) is expected, as in higher dimensions a larger number of hyperplanes is needed for a similar quality of approximation. On the other hand, the AE in general performs much worse in this scenario, but its performance does not deteriorate. The difference in the behavior of the AE and VAE networks suggests that the deterioration of performance in higher dimensions may be a consequence of overfitting by the VAE networks.
A similar experiment on FMNIST, an arguably more complex dataset, shows (Fig. 7) no deterioration of performance with increasing dimension, and an overall better performance of the VAE networks.
A possible explanation of the phenomenon we observe with VAE networks is as follows. Since VAE networks try to pack the data close to the origin, all classes necessarily lie in close proximity to each other. It is possible that, while the clusters attain simple (convex) shapes in low dimensions, the additional space in higher dimensions results in complex shapes of the embedded clusters, which might be an indicator of overfitting.
The sustained performance of the VAE networks on the FMNIST dataset might be an indicator that the intrinsic dimension of the dataset is higher than what has been studied here. On the practical side of network design, one may argue that the regularization in the VAE network is too strong for the MNIST dataset.
4.3 Contracting property of autoencoders
Autoencoder networks can be seen as a pair of networks $(\mathrm{enc}, \mathrm{dec})$, with the composition $\mathrm{dec} \circ \mathrm{enc}$ optimized to be close to the identity (on the support of the training distribution). One may ask about the image of the reversed composition $\mathrm{enc} \circ \mathrm{dec}$, which follows the path of adversarial learning.
In particular we are interested in measuring how far the composition $\mathrm{enc} \circ \mathrm{dec}$ is from the identity, and whether the image of the map is closer to the support of the training data. In the second case we may say that the network has contracting properties. Since the RPDs provide a natural "calibration" of the distance, we can sample vertices of an RPD at random (which all have scaling distance $1$) and observe the distribution of the scaling distances of the images of those vertices under $\mathrm{enc} \circ \mathrm{dec}$.
Here we compare the behavior of RPDs with a classical $k$-means descriptor (based on Voronoi centers); see Fig. 2. We start with a sample of vertices of an RPD for a fixed class (and hence of scaling distance to that class equal to $1$) and apply the map $\mathrm{enc} \circ \mathrm{dec}$ to them. For comparison we also produce the histograms for the (scaled) Euclidean distances to the Voronoi center of a class. These experiments are depicted in Fig. 5. Notice that while both networks exhibit strongly contracting behavior, for the VAE network it is especially pronounced. The simplest explanation is the network's nearly constant response on the entire Voronoi cell.
Acknowledgments
We thank Klaus-Robert Müller for fruitful discussions. MJ received partial support from Deutsche Forschungsgemeinschaft (EXC 2046 "MATH+", SFB-TRR 195 "Symbolic Tools in Mathematics and their Application", and GRK 2434 "Facets of Complexity"). MK was supported by Deutsche Forschungsgemeinschaft (EXC 2046 "MATH+", Project EF1-3) and by the National Science Centre, Poland, grant 2017/26/D/ST1/00103. LR acknowledges support by the German Federal Ministry of Education and Research (BMBF) in the project ALICE III (01IS18049B).
Appendix A Dual Bounding Bodies
A first key idea is to replace convex hulls by suitable dual bounding bodies, which form outer approximations. To explain the concept it suffices to consider a single class. Let $S$ be a set of points in $\mathbb{R}^d$, and let $U \subset \mathbb{R}^d$ be a finite set of directions.
Definition 2.
The dual bounding body of $S$ with respect to $U$ is the polyhedron

$$\mathrm{DBB}_U(S) \;=\; \bigl\{\, x \in \mathbb{R}^d \;:\; \langle u, x \rangle \le \sup_{s \in S} \langle u, s \rangle \text{ for all } u \in U \,\bigr\}. \tag{6}$$

By construction $\mathrm{DBB}_U(S)$ is convex, and it contains $S$; so it contains $\operatorname{conv}(S)$, too. If $S$ is finite then the supremum in (6) is a maximum that is attained for some $s \in S$, which only depends on $u$. In that case $\mathrm{DBB}_U(S) = \operatorname{conv}(S)$ if and only if $U$ contains all facet normals of $\operatorname{conv}(S)$. Checking if $x$ is contained in $\mathrm{DBB}_U(S)$ takes $O(\#U)$ time for $d$ constant.
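A sketch of Definition 2 in code (names ours): for the axis directions $\pm e_i$ the dual bounding body is exactly the coordinate bounding box of $S$, which makes the outer-approximation property easy to see.

```python
import numpy as np

def dual_bounding_body(S, U):
    """H-description of DBB_U(S): one half-space <u, x> <= max_{s in S} <u, s>
    per direction u (the rows of U)."""
    return U, np.max(S @ U.T, axis=0)

def contains(U, b, x, tol=1e-9):
    return bool(np.all(U @ x <= b + tol))

rng = np.random.default_rng(2)
S = rng.random((100, 3))                   # points in the unit cube
axes = np.vstack([np.eye(3), -np.eye(3)])  # positively spanning directions
U, b = dual_bounding_body(S, axes)
# For the axis directions, b consists of the componentwise maxima of S and the
# negated componentwise minima, i.e. DBB_U(S) is the bounding box of S:
# an outer approximation that contains every point of S (and conv(S)).
```

Enlarging $U$ with further directions only tightens the body toward $\operatorname{conv}(S)$, in line with the facet-normal criterion above.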
In the sequel we will assume that both $S$ and $U$ are finite, and that $U$ is positively spanning, i.e., $\operatorname{conv}(U)$ contains the origin in its interior. As $U$ is finite, the latter property is satisfied if and only if the polyhedron $\mathrm{DBB}_U(S)$ is bounded, i.e., a polytope.
Remark 1.
If $U$ is finite, positively spanning and contained in the unit sphere $\mathbb{S}^{d-1}$, then $\mathrm{DBB}_U(\mathbb{S}^{d-1})$ is the polar dual of the polytope $\operatorname{conv}(U)$. The facets of $\mathrm{DBB}_U(\mathbb{S}^{d-1})$ are tangent to $\mathbb{S}^{d-1}$, and $\sup_{s \in \mathbb{S}^{d-1}} \langle u, s \rangle = 1$ for all $u \in U$.
Appendix B Random Polytopes
The following observation will turn out to be practically relevant. It says that our random polytope descriptors are combinatorially benign, in contrast to the worst case scenario described by the upper bound theorem (1).
Theorem 3.
Let $S$ be normally distributed with mean zero (and constant variance). For arbitrary $m$ and $k$ the expected number of vertices of $\mathrm{RPD}_{m,k}(S)$ is in the order of $m$, for $d$ considered constant.
Proof.
The normal distribution is rotationally invariant, and so is the uniform choice of the directions $U$. Consequently, the construction of $\mathrm{RPD}_{m,k}(S)$ follows the rotation-symmetry model of Borgwardt [7], and this yields the claim. ∎
We use the stochastic approach to random polytopes of [7]. There, a random model of polytopes is defined by intersecting half-spaces whose unit normal directions are chosen uniformly over the unit sphere. Assume that the resulting polytope is close to the unit sphere in Hausdorff distance; this may be read as saying that the height of the largest spherical cap cut from the sphere by a facet of the polytope is small. The probability of this event can be controlled in terms of a set of directions chosen uniformly at random and the area of a spherical cap of the given height, normalized so that the total area of the sphere is the same regardless of the dimension.
Theorem 4.
Let $\varepsilon > 0$, let a polytope $P$ be given, and let $V$ denote the set of its vertices. Denote by $U$ a set of directions chosen uniformly at random, and by $\delta$ the maximal distance over all vertices of $P$ to the unit sphere; note that all facets of $P$ are then contained in a $d$-dimensional annulus around the origin. The mean of randomly chosen vertices of $P$ is at most $\varepsilon$ from the barycenter of $V$ with high probability, provided that sufficiently many vertices are sampled relative to $\delta$, the Hausdorff distance of $P$ to the sphere $\mathbb{S}^{d-1}$.
Footnotes
 Timings given for an i5-6200U laptop processor.
References
 (2017) Computing convex hulls and counting integer points with \polymake. Math. Program. Comput. 9 (1), pp. 1–38. External Links: Document, Link, MathReview Entry Cited by: §3.1.
 (2002) Kernel independent component analysis. Journal of Machine Learning Research 3 (Jul), pp. 1–48. Cited by: §2.
 (1995) An informationmaximization approach to blind separation and blind deconvolution. Neural Computation 7 (6), pp. 1129–1159. Cited by: §2.
 (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828. Cited by: §1, §1, §2, §2, §2.
 (2007) Scaling learning algorithms towards ai. LargeScale Kernel Machines 34 (5), pp. 1–41. Cited by: §1.
 (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331. Cited by: §2.
 (1987) The simplex method. Algorithms and Combinatorics, Vol. 1, SpringerVerlag, Berlin. Note: A probabilistic analysis External Links: ISBN 3540170960, Document, Link, MathReview (Jürgen Köhler) Cited by: Appendix B, Appendix B.
 (2019) On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705. Cited by: §2.
 (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57. Cited by: §2.
 (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2.
 (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180. Cited by: §1.
 (1994) Independent component analysis, a new concept?. Signal processing 36 (3), pp. 287–314. Cited by: §2.
 (2017) Adversarial feature learning. In International Conference on Learning Representations, Cited by: §1.
 (2017) Adversarially learned inference. In International Conference on Learning Representations, Cited by: §1.
 (2018) A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, Cited by: §1, §2.
 (1982) Optimal scaling of balls and polyhedra. Math. Programming 23 (2), pp. 138–147. ISSN 0025-5610. Cited by: §3.3.
 (2009) Complexity of approximating the vertex centroid of a polyhedron. In Algorithms and Computation, Lecture Notes in Comput. Sci., Vol. 5878, pp. 413–422. Cited by: §3.3.
 (2016) Deep learning. MIT Press. Cited by: §1.
 (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1.
 (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations. Cited by: §2.
 (2017) β-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations. Cited by: §2.
 (1999) Nonlinear independent component analysis: existence and uniqueness results. Neural Networks 12 (3), pp. 429–439. Cited by: §2.
 (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §4.
 (2020) Variational autoencoders and nonlinear ICA: a unifying framework. In International Conference on Artificial Intelligence and Statistics. Cited by: §2.
 (2018) Disentangling by factorising. In International Conference on Machine Learning, pp. 2649–2658. Cited by: §2.
 (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations. Cited by: §1, §2.
 (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §1.
 (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751. Cited by: §1.
 (1998) The art of computer programming. Vol. 2: Seminumerical algorithms, third edition. Addison-Wesley, Reading, MA. ISBN 0-201-89684-2. Cited by: §3.2.
 (2017) Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations. Cited by: §1, §2.
 (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §2.
 (2019) Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications 10, pp. 1096. Cited by: §2.
 (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
 (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §2.
 (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124. Cited by: §1, §2, §2.
 (2019) Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pp. 4402–4412. Cited by: §1.
 (1970) The maximum numbers of faces of a convex polytope. Mathematika 17, pp. 179–184. ISSN 0025-5793. Cited by: §3.1.
 (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition, pp. 427–436. Cited by: §2.
 (2007) Approximating the centroid is hard. In Computational Geometry (SCG’07), pp. 302–305. Cited by: §3.3.
 (1988) A polynomial-time algorithm, based on Newton’s method, for linear programming. Math. Programming 40 (1, Ser. A), pp. 59–93. ISSN 0025-5610. Cited by: §3.3.
 (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, Vol. 32, pp. 1278–1286. Cited by: §1, §2.
 (2015) Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. Cited by: §1.
 (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §2.
 (2018) Deep one-class classification. In International Conference on Machine Learning, Vol. 80, pp. 4390–4399. Cited by: §1, §2.
 (2020) Toward interpretable machine learning: transparent deep neural networks and beyond. arXiv preprint arXiv:2003.07631. Cited by: §2.
 (1992) Learning factorial codes by predictability minimization. Neural Computation 4 (6), pp. 863–879. Cited by: §1.
 (2015) Deep learning in neural networks: an overview. Neural Networks 61, pp. 85–117. Cited by: §1.
 (2001) Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471. Cited by: §1.
 (2002) Learning with kernels. MIT Press. Cited by: §1.
 (2014) Intriguing properties of neural networks. In International Conference on Learning Representations. Cited by: §2.
 (2018) A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pp. 270–279. Cited by: §2.
 (2004) Support vector data description. Machine Learning 54 (1), pp. 45–66. Cited by: §1.
 (2018) Recent advances in autoencoder-based representation learning. In 3rd Workshop on Bayesian Deep Learning (NeurIPS 2018). Cited by: §1, §1, §2, §2.
 (2019) Are disentangled representations helpful for abstract visual reasoning?. In Advances in Neural Information Processing Systems, pp. 14222–14235. Cited by: §1, §2.
 (2016) A survey of transfer learning. Journal of Big Data 3 (1), pp. 9. Cited by: §2.
 (1995) Lectures on polytopes. Graduate Texts in Mathematics, Vol. 152, Springer-Verlag, New York. ISBN 0-387-94365-X. Cited by: §3.1.