Autoencoders are widely used for unsupervised learning and as a regularization scheme in semi-supervised learning. However, theoretical understanding of their generalization properties and of the manner in which they can assist supervised learning has been lacking. We utilize recent advances in the theory of deep learning generalization, together with a novel reconstruction loss, to provide generalization bounds for autoencoders. To the best of our knowledge, this is the first such bound. We further show that, under appropriate assumptions, an autoencoder with good generalization properties can improve any semi-supervised learning scheme. We support our theoretical results with empirical demonstrations.
Generalization Bounds For Unsupervised and Semi-Supervised Learning With Autoencoders
Editor: Under Review for COLT 2019
Keywords: Autoencoders, generalization, unsupervised learning, semi-supervised learning
An autoencoder (AE) (Hinton and Salakhutdinov (2006)) is a type of feedforward neural network that aims to reconstruct its own input through a narrow bottleneck. It typically comprises two parts: an encoder, which maps the input to some representation of it (the code), and a decoder, which maps the code back to the input space. Usually, the code is of a smaller dimension than the input. The network is trained to find encoder and decoder functions such that some reconstruction loss is minimized. A typical choice is the square loss. In recent years, autoencoders have emerged as a standard tool for both unsupervised and semi-supervised learning (SSL) (Vincent et al. (2010), Zhuang et al. (2015), Ghifary et al. (2016), Bousmalis et al. (2016), Epstein et al. (2018)). Unfortunately, as is frequently the case with deep learning approaches, the empirical practice has not been matched by parallel advances in theory. That is, unsupervised learning with autoencoders has not been able to benefit from the recent bounds for supervised deep learning (Bartlett et al. (2017), Neyshabur et al. (2018), Arora et al. (2018), Golowich et al. (2018)). In SSL, the discrepancy is more severe still. There is a fundamental tension between the goals of a supervised learner and those of an AE. One might say that an AE “wants to remember everything a classifier wants to forget”. Indeed, given two slightly different images of the same digit, a classifier would like to ignore all differences, aiming instead to see them as similar objects. An AE, on the other hand, aims at reconstructing precisely those nuances (e.g. width, location in the image, style) that do not matter for the classification. Some works (Rifai et al. (2011)) have attempted to encourage AEs to weigh classification-relevant features more heavily, but this still dodges the basic question: why might an AE be useful for supervised learning?
We address the gaps above. First, we introduce a margin-based reconstruction loss which allows a natural adaptation of existing generalization bounds for autoencoders, and show that a bounded loss in that sense implies a bounded loss in the standard metric. Second, we demonstrate a mechanism by which a well-performing autoencoder is likely to assist in SSL; namely, it allows one to reduce dimension while preserving the input structure. More formally, we show that if the input distribution satisfies a certain clustering assumption, then the encoder part of an autoencoder with a small generalization error maps most of the input to a low-dimensional distribution that itself satisfies a reasonably-good clustering assumption. Finally, we extend Singh et al. (2008) by showing conditions under which any supervised learner can benefit from AE-enabled SSL.
The remainder of the paper is organized as follows. In Section 2, we review some prior work on AEs and SSL. In Section 3, we survey some recent generalization bounds for supervised learning with deep networks, and the characterization by Singh et al. (2008) of conditions under which SSL is guaranteed to be beneficial. In Section 4, we introduce our proposed reconstruction loss and use it to obtain generalization bounds for AEs. In Section 5, we apply these bounds to show that if an AE generalizes well, its encoder is limited in its ability to shrink the distances between most input pairs, and discuss what this implies for semi-supervised learning and for the results of Singh et al. (2008). In Section 6, we explore our bounds empirically. Finally, we discuss possible implications of this work and some future research directions.
The main contributions of the present work are the following. (i) We adapt recent margin-based generalization bounds for feedforward networks to autoencoders via a novel loss. (ii) We tie good AE reconstruction performance to a non-contractiveness property of the encoder component. (iii) We show that this implies the ability to trade off separation margins between input clusters for reduced dimension, which is beneficial for semi-supervised learning.
2 Related Work
A great deal of work has been devoted to dimensionality reduction since the introduction of (linear) principal component analysis (PCA) in 1901 by Karl Pearson. Nonlinear manifold-based methods, introduced in Roweis and Saul (2000) and Tenenbaum et al. (2000), were followed by work on AEs (Hinton and Salakhutdinov (2006)) that led to significantly improved results for practical problems. Later work by Vincent et al. (2010) introduced a denoising-based criterion for training AEs and demonstrated its improved representation quality compared to a reconstruction-based criterion, contributing to better classification performance on subsequent supervised learning tasks. Further details, and a survey of AEs, can be found in Bengio et al. (2013).
Subsequent work directly addressed the SSL setting. Several papers demonstrated the empirical utility of SSL (Rifai et al. (2011); Ranzato and Szummer (2008); Rasmus et al. (2015); Weston et al. (2012)). Within the related transfer learning setting, Zhuang et al. (2015) showed how to improve learning performance by combining two types of encoders: the first based on an unsupervised embedding from the source and target domains, and the second based on the labels available from the source data. Ghifary et al. (2016) suggested a joint encoder for both classification of labeled data and reconstruction of unlabeled data, thereby maintaining both types of information and enhancing performance in the face of scarce labels. Bousmalis et al. (2016) and Epstein et al. (2018) use AEs for SSL and semi-supervised transfer learning by explicitly learning to separate representations into private and shared components.
Within the framework of statistical learning theory, several recent papers have significantly improved previous generalization bounds for deep networks by incorporating more refined attributes of the network structure, aiming to explain the paradoxical effect of improved performance while over-training the network. Using covering number techniques, Bartlett et al. (2017) provide margin based bounds that relate generalization error to the network’s Lipschitz constant and matrix norms of the weights. Neyshabur et al. (2018) establish similar matrix-norm-based margin bounds using a PAC-Bayes approach. Arora et al. (2018) present compression-based results by compressing the weights of a well performing network and bounding the error of the compressed network. Finally, Golowich et al. (2018) are able, under certain (restrictive) assumptions on matrix norms, to achieve generalization bounds that are completely independent of the network size.
The value of SSL has been subject to much debate. Rigollet (2007) provides a mathematical framework for the intuitive cluster assumption of Seeger (2000), and shows that for unlabeled data to be beneficial, some clustering criterion is required (specifically, that the data consists of separated clusters with identical labels within each cluster). Based on a density level set approach, that work proves fast rates of convergence in the SSL setting. Lafferty and Wasserman (2007) and Niyogi (2008) study SSL within a minimax framework. The former shows that, under the so-called manifold assumption, optimal minimax rates of convergence may be achieved, while the latter demonstrates a separation between two classes of problems: when the structure of the data manifold is known, fast rates can be achieved, while without such knowledge, convergence cannot be guaranteed. More directly related to our work, and building on the clustering assumption, Singh et al. (2008) identify situations in which semi-supervised learning can improve upon supervised learning. Unfortunately, their SSL bound suffers from the curse of dimensionality, and so depends exponentially on the dimension. Our work can be seen as allowing a trade-off between the clustering separation and the dimension, suggesting how to improve the bounds in Singh et al. (2008).
van Rooyen and Williamson (2015) provide a principled approach to feature representation, and characterize the relation between the information retained by features about the input, and the loss incurred by a classifier based on these features. They suggest the application of their results to SSL, but do not provide explicit conditions or generalization bounds in this setting. Recently, Le et al. (2018) provide such bounds for semi-supervised learning with linear AEs and a joint reconstruction/classification loss, using uniform stability arguments. They also provide empirical results that show that nonlinear AEs can indeed contribute to supervised learning. While their results are close in spirit to ours, we are more concerned with understanding how the structure of the data affects SSL, and, in particular, characterizing when, and to what extent, clustering of the input contributes to performance through unsupervised learning. Moreover, we rely on bounds that are specific to neural networks, rather than on the looser stability based bounds.
3.1 Generalization for Feed-forward Networks
Let be an input space, an output space, and a distribution on . Throughout the paper, we shall assume that and that . Denote by the marginal distribution on the inputs. The -th entry of a vector is denoted by . For a collection of matrices , denote by the -layer feedforward network , where is a non-linearity. We shall focus on the ReLU function . Let us recall a few recent results (Bartlett et al. (2017), Neyshabur et al. (2018), Arora et al. (2018)) concerning supervised -way classification with feedforward networks. The output of a network is a vector . For a pair , define the (supervised) -margin loss as
That is, the loss is only if the correct, -th output entry is not only the largest one, but the second-largest entry is at least away. Given a sample of size , denote by and the corresponding empirical and expected function losses
Note that is the standard classification loss.
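As a rough illustration of the margin loss described above, the following sketch computes an empirical margin loss from a matrix of network outputs. The function name `margin_loss`, the symbol `gamma` for the margin, and the exact form of the comparison (the correct-class score must exceed every other score by more than `gamma`) are assumptions made for this example, following the usual formulation in the margin-bound literature; they are not taken verbatim from the text.

```python
import numpy as np

def margin_loss(outputs, labels, gamma):
    """Fraction of samples whose correct-class score fails to exceed every
    other score by more than gamma (assumed form of the margin loss)."""
    outputs = np.asarray(outputs, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(outputs.shape[0])
    correct = outputs[rows, labels]
    # Mask the correct entry so the maximum runs over the other classes only.
    others = outputs.copy()
    others[rows, labels] = -np.inf
    runner_up = others.max(axis=1)
    return float(np.mean(correct <= gamma + runner_up))

# Toy scores for three samples and three classes (made-up numbers).
scores = np.array([[2.0, 0.5, 0.1],
                   [1.0, 0.9, 0.2],
                   [0.3, 0.2, 1.5]])
labels = np.array([0, 0, 1])
print(margin_loss(scores, labels, gamma=0.0))  # plain classification error
print(margin_loss(scores, labels, gamma=0.5))  # stricter: requires a 0.5 margin
```

With a margin of zero the function reduces to the ordinary misclassification rate, and increasing the margin can only increase the loss.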
All three aforementioned papers can be considered to suggest the same type of claim,
where is a generalization term depending on the network parameters, failure probability , sample size and margin . We shall use the bound appearing in Neyshabur et al. (2018), as it is the simplest to state (the bound in Bartlett et al. (2017) is strictly tighter and allows for non-linearities other than ReLU, however), but the similar results appearing in the other two papers can be adapted for our purpose as well.
3.2 Semi-Supervised Learning - Now It Helps Now It Doesn’t
We briefly review the necessary background from Singh et al. (2008), stating the results and terminology in a somewhat simplified manner. First, we define the clustering assumption they are working under. Suppose the input distribution is a finite mixture of smooth component densities with disjoint supports. Suppose further that each is bounded away from zero and supported on a unique compact and connected set with smooth boundaries:
where are -dimensional Lipschitz functions. Finally, assume the target label is constant on each (inputs with equal labels need not form a single cluster; indeed, typically they form a number of separate clusters). Then we say satisfies the clustering assumption with cluster-margin (the notation in the original is ; we have changed it to avoid confusion with the -margin loss) if each two clusters are at least apart. More formally, for , let
Then the cluster-margin is simply . The support sets of the components in are called the decision sets of . Denote by the set of all hypotheses .
A clairvoyant supervised learner is a function mapping labeled training sets of size to hypotheses in , with perfect knowledge of the decision sets of . A semi-supervised learner is a function mapping unlabeled training sets of size and labeled training sets of size to hypotheses in .
The following theorem (a slightly weaker version of Corollary 1 in Singh et al. (2008)) asserts that under suitable conditions, semi-supervised learning can always perform as well as any clairvoyant learner.
(Singh et al. (2008)) Let satisfy the clustering assumption with cluster-margin . Assume L is a bounded loss. Denote by the excess loss of a learner , where is the infimum loss over all possible learners. Suppose there exists a clairvoyant learner for which
Then there exists a semi-supervised learner such that if , then
The constant does not depend on , or .
Note the exponential dependence on the input dimension in Prop. 3. Mapping the input to a significantly lower dimension without decreasing too much is beneficial for the bound.
4 Generalization Bounds for Autoencoders
Let us now turn to autoencoders and their generalization properties. We introduce a novel entry-wise -margin reconstruction loss and state a generalization bound for this loss. Furthermore, we show that such a bound implies a bound for the standard loss as well.
For simplicity, we consider . (All the definitions and results from here on can be extended straightforwardly to support finer input resolution, that is, to allow input values on a discrete grid for some integer . The forms of the bound in Prop. 6 and Thm. 9 do not change with , though the -margin loss of any given autoencoder might. The value in the definition of in Sec. 5 is replaced by .) We consider feed-forward fully-connected networks with output entries in . (The restriction of the output to can be achieved by applying a sigmoid to the output, with the beneficial side effect of dividing the network Lipschitz constant by 4, as the Lipschitz constant of the sigmoid function is 1/4.) Given a sample and a network , the reconstructed output is , though we will sometimes abuse the notation and simply write or . Note that while the inputs are binary, the prediction for each entry can be an intermediate value. An autoencoder network is a composition of an encoder and a decoder , both fully-connected feedforward networks.
For a margin , we define the -margin loss to be the average fraction of entries that were not reconstructed with a confidence of at least . That is,
where is the indicator function.
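The entry-wise margin reconstruction loss can be sketched in a few lines. The precise confidence criterion below is an assumption made for illustration: with binary inputs and outputs in [0, 1], an entry is counted as reconstructed with confidence `gamma` when its absolute error is below `1/2 - gamma`. The name `margin_reconstruction_loss` is likewise hypothetical.

```python
import numpy as np

def margin_reconstruction_loss(x, x_hat, gamma):
    """Fraction of entries NOT reconstructed with confidence gamma.

    Assumption (for illustration only): an entry is confidently
    reconstructed when |x_hat_i - x_i| < 1/2 - gamma, with 0 <= gamma < 1/2.
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    failed = np.abs(x_hat - x) >= 0.5 - gamma
    return float(failed.mean())

# One binary input with four entries and a made-up reconstruction.
x = np.array([[1, 0, 1, 0]])
x_hat = np.array([[0.9, 0.2, 0.6, 0.7]])
print(margin_reconstruction_loss(x, x_hat, gamma=0.0))
print(margin_reconstruction_loss(x, x_hat, gamma=0.2))
```

Under this assumed criterion, a margin of zero counts only the entries that land on the wrong side of 1/2, while a larger margin also penalizes entries that are correct but not confidently so.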
Note that the loss is bounded between and . The corresponding expected loss and empirical loss on samples are denoted and , respectively:
For any positive , with probability at least over a training set of size , for any of depth and a constant depending only on the maximal input norm and on the structure of , we have
A common measure of the reconstruction performance of an AE is the squared-error loss
We would like to be able to bound the generalization error in terms of this loss as well. Fortunately, we are able to bound by a function of . Let
Let be an input and its reconstruction. Suppose that is at most . Then is at most .
Indeed, at most entries are reconstructed with accuracy less than . They contribute at most to . The remaining entries contribute at most to , for a total loss at most .
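The counting argument above can be checked numerically. The sketch below assumes, for illustration, that an entry is poorly reconstructed when its absolute error is at least `1/2 - gamma`, that the squared error is the unnormalized sum over all `d` entries, and that inputs are binary with outputs in [0, 1]. Under those assumptions, poorly reconstructed entries contribute at most 1 each and the remaining ones less than `(1/2 - gamma)^2` each. The helper `squared_error_bound` and the synthetic data are hypothetical.

```python
import numpy as np

def squared_error_bound(eps, gamma, d):
    # eps * d poorly reconstructed entries contribute at most 1 each;
    # the other (1 - eps) * d entries contribute less than (1/2 - gamma)^2 each.
    return d * (eps + (1.0 - eps) * (0.5 - gamma) ** 2)

rng = np.random.default_rng(0)
d, gamma = 784, 0.2

# Synthetic binary input and a noisy reconstruction clipped to [0, 1].
x = rng.integers(0, 2, size=d).astype(float)
x_hat = np.clip(x + rng.normal(0.0, 0.3, size=d), 0.0, 1.0)

eps = float(np.mean(np.abs(x_hat - x) >= 0.5 - gamma))  # margin loss of this sample
sq_err = float(np.sum((x_hat - x) ** 2))
bound = squared_error_bound(eps, gamma, d)
print(eps, sq_err, bound, sq_err <= bound)
```

The inequality holds for any input and reconstruction satisfying the stated assumptions, not only for this random draw, since it follows entry by entry.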
By linearity of expectation, an expected -margin loss implies a squared-error loss at most . Similarly, by the Jensen inequality, the expected error
is bounded from above by .
Suppose further that the reconstruction errors of the entries are distributed symmetrically around the average of the possible values. That is, that the distance from the corresponding input is, on average, for the entries with poor reconstruction, and for the remaining entries. Then
The empirical results in Fig. 2 suggest that this symmetric error assumption is reasonable.
For any positive , with probability at least over a training set of size , for any of depth and network-related constant independent of , we have
4.1 Proof of Theorem 6
We follow the strategy appearing in Neyshabur et al. (2018).
First, let us state the result in greater detail. Let be an autoencoder with weights and ReLU non-linearities. Let
be the spectral and Frobenius norms of a -dimensional matrix . Let be the maximum norm of an input, the depth of , the upper bound on the number of output units in each layer.
(Detailed version of Thm. 6) For any and any , with probability at least over a training set of size , for any , we have
The proof consists of three steps. Firstly, we show that a small perturbation of the weight matrices implies a small perturbation of the network output (Lemma 11, Lemma 2 in Neyshabur et al. (2018)). Secondly, for perturbations of the network parameters such that the network output does not change much relative to , we state a PAC-Bayesian bound controlling by means of (Lemma 12, analogous to Lemma 1 in Neyshabur et al. (2018)). Finally, we use Lemma 11 to calculate the maximal amount of perturbation that satisfies the conditions of Lemma 12. This level of perturbation, substituted into the PAC-Bayesian bound, yields the theorem.
(Perturbation bound) Let be a perturbation such that . Then for any input ,
Recall that the Kullback-Leibler divergence between two distributions and is
(PAC-Bayesian bound) Let be an autoencoder, a data-independent distribution on the parameters. Then for any , w.p. at least , for any random perturbation s.t. , we have
Consider the weights .
By the homogeneity of ReLU, .
and . We can, therefore, assume that all weights are normalized and
for all .
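The normalization step relies on the positive homogeneity of ReLU: rescaling the layers by positive factors whose product is one leaves the network function unchanged. The sketch below illustrates this with the balancing used by Neyshabur et al. (2018), where every layer is rescaled to have the same spectral norm. The bias-free architecture, the layer sizes, and the helper `relu_net` are assumptions for this example; the sigmoid output layer mentioned earlier is ignored here.

```python
import numpy as np

def relu_net(weights, x):
    """Bias-free feedforward ReLU network, no non-linearity after the last layer."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

rng = np.random.default_rng(0)
sizes = [8, 6, 3, 6, 8]  # hypothetical autoencoder layer widths
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

# Rescale every layer to the geometric mean of the spectral norms.
spectral = [np.linalg.norm(W, 2) for W in weights]
beta = float(np.prod(spectral) ** (1.0 / len(weights)))
balanced = [W * (beta / s) for W, s in zip(weights, spectral)]

x = rng.normal(size=sizes[0])
out_original = relu_net(weights, x)
out_balanced = relu_net(balanced, x)
print(np.allclose(out_original, out_balanced))
print([float(np.linalg.norm(W, 2)) for W in balanced])
```

Because the scaling factors are positive and multiply to one, both the output and the product of the spectral norms are preserved, which is why the weights can be assumed balanced without loss of generality.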
Consider and a prior distribution of the same form. By Thm. 4.1 in Tropp (2012), with probability at least 1/2,
By Lemma 11, for an appropriate ,
For such a , satisfies the condition of Lemma 12. We can now bound the term in the PAC-Bayesian bound for the chosen and (we have skipped over a nuance necessary to ensure that the prior is data-independent; see the end of the proof of Theorem 1 in Neyshabur et al. (2018) for the details):
Note that the upper bound on given in Eq. 22 depends on the dimensions of . Thm. 10 and its proof simply use , the largest output unit number of any layer. Assuming that the layer sizes decrease exponentially approaching the bottleneck (see, e.g., Hinton and Salakhutdinov (2006)), there is some room for tightening the bound.
5 Autoencoders and Semi-Supervised Learning
In this section we show that, under appropriate assumptions, a sufficiently good autoencoder can contribute to the advantage of SSL over any supervised learning scheme. Specifically, we consider the following strategy: first train the AE on the unlabeled data, and then apply the bound in Prop. 3 to the code, that is, to the output of the encoder (see Fig. 1). We stress that we do not propose this strategy as an optimal empirical approach. Indeed, training to minimize both reconstruction and supervised losses simultaneously has been established as a more successful approach in practice (e.g., Bousmalis et al. (2016), Epstein et al. (2018)). However, the scheme we are considering allows for a theoretical treatment and for an explanation of the relationship between the autoencoder performance and its contribution to semi-supervised learning.
We need some further notation in order to state our main result in this section. Denote by (we will occasionally omit or and simply write ) the subset , that is, the inputs for which the reconstruction error deviates from by at most . Note that by the Markov inequality, for ,
or in other words, the measure of is at least . Note that this allows us to trade off the measure of for the tightness of (see Fig. 2). Observe, too, that by Thm. 9, as and . Thus, as the generalization error of vanishes, so does the set of “bad” inputs. Denote by the distribution induced by on , and by the corresponding distribution on .
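The Markov-inequality step can be illustrated generically: for any non-negative per-sample loss, the fraction of inputs whose loss is at least `t` times the mean is at most `1/t`. The sketch below uses synthetic losses and a multiplicative threshold `t`; both are assumptions for this example and do not reproduce the exact parametrization of the set of well-reconstructed inputs used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample reconstruction losses in [0, 1].
losses = rng.beta(1.0, 9.0, size=10_000)
mean_loss = float(losses.mean())

for t in (2.0, 5.0, 10.0):
    # Empirical measure of the "bad" set versus the Markov guarantee 1/t.
    bad_fraction = float(np.mean(losses >= t * mean_loss))
    print(t, bad_fraction, 1.0 / t)
```

The guarantee is typically far from tight, which is in line with the empirical observation that most test samples are well reconstructed even for small thresholds.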
Assume that the input distribution satisfies the clustering assumption with margin . Let be an autoencoder with expected reconstruction loss , bottleneck dimension and decoder Lipschitz constant (the decoder is a Lipschitz function; indeed, is at most the product of the spectral norms of the weight matrices in , though that is typically a very loose bound; see Arora et al. (2018) for a discussion of the behavior of ). Then for any , satisfies the clustering assumption with cluster-margin at least
Furthermore, suppose there exists a clairvoyant learner for which
Then there exists a semi-supervised learner such that if , then
The constant does not depend on , or .
Now, consider two input points . Applying a standard “ epsilon” argument,
In particular, for at least apart, is at least .
We have established that if an autoencoder generalizes well, it does not bring two input points in too close together. Recall that is the Lipschitz constant of the decoder. If, for any points , is at least some , then cannot be less than . This implies that maps clusters at least apart to clusters at least apart. In other words, if , the input distribution, satisfies the clustering assumption with margin , then its restriction does as well, and , the output distribution of , satisfies the clustering assumption with cluster-margin . Applying Prop. 3 to the completes the proof.
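The non-contractiveness argument can be written out as a short derivation. The symbols below are assumptions introduced for this sketch and need not match the paper's notation: $h$ is the encoder, $g$ the decoder with Lipschitz constant $\Lambda$, and $r$ an upper bound on the reconstruction distance $\|g(h(x)) - x\|$ for the two well-reconstructed inputs $x, x'$ under consideration.

```latex
\begin{align*}
\|x - x'\|
  &\le \|x - g(h(x))\| + \|g(h(x)) - g(h(x'))\| + \|g(h(x')) - x'\|
     && \text{(triangle inequality)} \\
  &\le 2r + \Lambda \, \|h(x) - h(x')\|
     && \text{(reconstruction bound, Lipschitz decoder)} \\
\Longrightarrow \quad
\|h(x) - h(x')\| &\ge \frac{\|x - x'\| - 2r}{\Lambda}.
\end{align*}
```

In this form, two inputs from different clusters that are far apart relative to $2r$ cannot be mapped close together by the encoder, and the attainable separation in the code space scales inversely with the decoder's Lipschitz constant.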
All experiments were implemented in Keras (Chollet (2015)) over TensorFlow (Abadi et al. (2015)). We use two digit image datasets for our experiments. The MNIST dataset (LeCun et al. (1998)) is a collection of 70,000 grayscale images of hand-written digits, split into 60,000 training and 10,000 test samples. The SVHN dataset (Netzer et al. (2011)) is a collection of 99,289 RGB images of house-number digits cropped from street-level photographs, split into 73,257 training and 26,032 test samples. We have converted the SVHN samples into grayscale. First, we provide evidence for the generalization bound in Thm. 6. For each dataset, we train an autoencoder on an increasing fraction of the training set, and plot the bound (divided by the constant) vs. the empirically observed test error (Fig. 2). The margin values we use are . We see that, for both datasets, the bound correlates well with the test error. Moreover, the plot trends suggest an asymptotic convergence of the bound to the test error.
Next, we examine the control over as a function of that Corollary 8 provides. Fig. 2 plots the empirical average reconstruction error over the test set vs the predicted bound. The worst-case bound correlates well with the empirical error, but it is overly pessimistic by a factor of approximately 3. The average-case bound in Eq. 15 is closer to the observed error, loose only by a factor of approximately .
The proof of Thm. 14 requires a restriction to , the set of samples with small reconstruction error. A reasonable concern is that such a restriction rejects a large fraction of the inputs. While Eq. 25 provides some guarantees on the size of for negligible -s, Fig. 2 shows that, already for values small relative to , most test samples are in .
Finally, in Table 1 we examine the various quantities appearing in Thm. 14. The first and second rows use a small and a large autoencoder, respectively, trained on MNIST. The third row uses an autoencoder trained on SVHN. The first column, , describes the change in dimensions from the AE input to the bottleneck. The second column, , gives the estimated cluster-margin of the input distribution. is the estimated cluster-margin of the bottleneck distribution. is the estimated decoder Lipschitz constant. Finally, the fifth column gives the extent to which the term in Eq. 28 improves due to the dimension reduction, for the corresponding value of . We can see that the change in the cluster-margin is roughly inverse to (though, for the given training sets, was not small enough for Eq. 26 to yield positive values of ).
[Table 1: quantities appearing in Thm. 14, with rows for MNIST (small network), MNIST (large network), and SVHN.]
We have adapted existing generalization bounds for feedforward networks, together with a novel reconstruction loss, to obtain a generalization bound for autoencoders. To the best of our knowledge, this is the first such bound. We went on to tie the good reconstruction performance of an autoencoder to a non-contractiveness property of the encoder component. This property, in turn, implies the ability to trade off cluster-margins between input clusters for reduced dimension, which is beneficial for semi-supervised learning. Empirical evidence supports our theoretical results.
The bound we have obtained concerns only the gap between the empirical and expected losses. It neither guarantees the existence of an autoencoder achieving a negligible empirical error nor explains why such networks seem to exist in practice, particularly for images. We believe that the answer has to do with the properties of natural images - that typical image datasets satisfy a manifold hypothesis. That is, they lie on, or near, a low-dimensional manifold that is mapped to a higher dimension where they are observed. Assuming the mapping is invertible, and both the mapping and its inverse can be approximated well by sufficiently expressive networks, this does imply the existence of a good autoencoder for the dataset. Such considerations lead us to believe that a good generative model for the data (possibly along the lines of Ho et al. (2018)) could shed further light on unsupervised and semi-supervised learning with autoencoders.
An interesting and worthwhile extension of our work would be to consider more practical approaches to SSL, specifically those that combine supervised and unsupervised losses through shared layers, as is often done in practice. Such approaches have been shown to be effective both in SSL and transfer learning, and the present approach could shed theoretical light on their success.
We thank Ron Amit for numerous useful suggestions and corrections.
- Abadi et al. (2015) Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. CoRR abs/1802.05296, 2018.
- Bartlett et al. (2017) Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 6240-6249, 2017.
- Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- Bousmalis et al. (2016) Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016.
- Chollet (2015) François Chollet. keras. https://github.com/fchollet/keras, 2015.
- Epstein et al. (2018) Baruch Epstein, Ron Meir, and Tomer Michaeli. Joint autoencoders: a flexible meta-learning framework. ECML 2018, 2018.
- Ghifary et al. (2016) Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV, pages 597–613. Springer, 2016.
- Golowich et al. (2018) Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. COLT 2018, 2018.
- Hinton and Salakhutdinov (2006) G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- Ho et al. (2018) Nhat Ho, Tan Nguyen, Ankit Patel, Anima Anandkumar, Michael I. Jordan, and Richard G. Baraniuk. Neural rendering model: Joint generation and prediction for semi-supervised learning. CoRR abs/1811.02657, 2018.
- Lafferty and Wasserman (2007) J. Lafferty and L. Wasserman. Statistical analysis of semi-supervised regression. NIPS 2007, 2007.
- Le et al. (2018) Lei Le, Andrew Patterson, and Martha White. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. NIPS 2018, 2018.
- LeCun et al. (1998) Y. LeCun et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
- Netzer et al. (2011) Yuval Netzer et al. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- Neyshabur et al. (2018) Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. ICLR 2018, 2018.
- Niyogi (2008) P. Niyogi. Manifold regularization and semi-supervised learning: Some theoretical analyses. Technical Report TR-2008-01, Computer Science Department, University of Chicago, 2008.
- Ranzato and Szummer (2008) Marc’Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th international conference on Machine learning, pages 792–799. ACM, 2008.
- Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
- Rifai et al. (2011) Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. ICML 2011, 2011.
- Rigollet (2007) P. Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. JMLR, 8:1369–1392, 2007.
- Roweis and Saul (2000) Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
- Seeger (2000) M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute for ANC, Edinburgh, UK, 2000.
- Singh et al. (2008) Aarti Singh, Robert D. Nowak, and Xiaojin Zhu. Unlabeled data: Now it helps, now it doesn't. NIPS 2008, 2008.
- Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
- Tropp (2012) Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
- van Rooyen and Williamson (2015) Brendan van Rooyen and Robert C. Williamson. A theory of feature learning. CoRR abs/1504.00083, 2015.
- Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR 2010, 2010.
- Weston et al. (2012) Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
- Zhuang et al. (2015) Fuzhen Zhuang, Xiaohu Cheng, Ping Luo, Sinno Jialin Pan, and Qing H. Supervised representation learning: Transfer learning with deep autoencoders. IJCAI 2015, 2015.