On denoising autoencoders trained to minimise binary cross-entropy
Denoising autoencoders (DAEs) are powerful deep learning models used for feature extraction, data generation and network pre-training. DAEs consist of an encoder and a decoder, which may be trained simultaneously to minimise a loss function between an input and the reconstruction of a corrupted version of that input. Two loss functions are commonly used for training autoencoders: the mean-squared error (MSE) and the binary cross-entropy (BCE). When training autoencoders on image data, BCE is a natural choice of loss function, since pixel values may be normalised to take values in $[0, 1]$ and the decoder model may be designed to generate samples that take values in $[0, 1]$. We show theoretically that DAEs trained to minimise BCE may be used to take gradient steps in the data space towards regions of high probability under the data-generating distribution. Previously this had only been shown for DAEs trained using MSE. As a consequence of the theory, iterative application of a trained DAE moves a data sample from regions of low probability to regions of higher probability under the data-generating distribution. Firstly, we validate the theory by showing that novel data samples, consistent with the training data, may be synthesised when the initial data samples are random noise. Secondly, we motivate the theory by showing that initial data samples synthesised via other methods may be improved via iterative application of a trained DAE to those initial samples.
Autoencoders are a class of neural network models that are conceptually simple, yet provide a powerful means of learning useful representations of observed data through unsupervised learning bengio2009learning (). Trained autoencoders can be used in downstream tasks, either through extracting the learned representations as input for other algorithms lange2012autonomous (), or by fine-tuning the existing model for other tasks vincent2008extracting (). Additionally, recent work has led to the development of generative autoencoders, i.e., autoencoders that both learn useful representations and allow the synthesis of novel data samples that are consistent with the training data kingma2013auto (); rezende2014stochastic (); makhzani2015adversarial ().
Autoencoders consist of two models, an encoder and a decoder, arranged in series. The two models may be trained simultaneously to minimise a reconstruction loss between the input to the encoder, and the output of the decoder. A common variant of the autoencoder is the denoising autoencoder (DAE), which is trained to recover clean versions of corrupted input samples vincent2008extracting (); vincent2010stacked ().
Formally, a DAE with encoder, , decoder, and corruption process, , is trained to minimise:
where are data samples whose underlying distribution is . Parameters and are learned. There may be additional regularisation terms in the loss function.
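As a minimal sketch of this objective, the following Python evaluates the denoising loss for a single sample, using BCE as the loss between the clean input and the reconstruction of its corrupted version. The names `bce`, `corrupt` and `denoising_loss`, the additive Gaussian corruption, and the toy encoder and decoder are illustrative assumptions rather than the models used in this paper.

```python
import math
import random

def bce(x, y, eps=1e-12):
    """Mean binary cross-entropy between targets x and predictions y, both in [0, 1]."""
    total = 0.0
    for xi, yi in zip(x, y):
        yi = min(max(yi, eps), 1.0 - eps)  # clamp predictions to avoid log(0)
        total -= xi * math.log(yi) + (1.0 - xi) * math.log(1.0 - yi)
    return total / len(x)

def corrupt(x, sigma, rng):
    """Additive Gaussian corruption (an assumed choice of corruption process)."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def denoising_loss(x, encoder, decoder, sigma, rng):
    """Loss between the clean input and the reconstruction of its corrupted version."""
    x_tilde = corrupt(x, sigma, rng)
    return bce(x, decoder(encoder(x_tilde)))

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical stand-ins for trained networks.
encoder = lambda x: [4.0 * (xi - 0.5) for xi in x]  # toy "code"
decoder = lambda z: [sigmoid(zi) for zi in z]       # outputs lie in (0, 1)

rng = random.Random(0)
x = [0.0, 0.25, 0.5, 1.0]
loss = denoising_loss(x, encoder, decoder, sigma=0.1, rng=rng)
```

In practice the expectation over data samples and corruptions would be approximated by averaging this quantity over minibatches.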
It has been shown alain2014regularized () that the optimal reconstruction function learned by a DAE trained to minimise the mean-squared reconstruction error can be used to approximate the gradient of the log-likelihood of the data-generating distribution with respect to the data: for small corruption noise, the difference between the reconstruction and its input is proportional to this gradient. This theoretical result has been a driving force behind many publications nguyen2016plug (); rasmus2015semi (); huszar2017variational () (see Section 5 for more details).
However, the mean-squared error (MSE) is only one of two common loss functions used when training autoencoders. The alternative is especially relevant when the data consist of images whose pixels take values in $[0, 1]$, so that the pixel intensity can be thought of as the probability of a pixel being “on”. In this situation, it is possible to use the binary cross-entropy (BCE) loss, which naturally applies to random variables that may be “on” or “off”.
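To illustrate why BCE remains a sensible reconstruction loss for real-valued intensities, the short Python sketch below evaluates both per-pixel losses on a grid of candidate predictions for a single target intensity. The function names and the target value 0.3 are illustrative assumptions.

```python
import math

def bce_pixel(x, y):
    """Binary cross-entropy for one target intensity x and prediction y in (0, 1)."""
    return -(x * math.log(y) + (1.0 - x) * math.log(1.0 - y))

def mse_pixel(x, y):
    """Squared error for one target intensity x and prediction y."""
    return (x - y) ** 2

x = 0.3                                        # a target pixel intensity
grid = [i / 1000.0 for i in range(1, 1000)]    # candidate predictions in (0, 1)
best_bce = min(grid, key=lambda y: bce_pixel(x, y))
best_mse = min(grid, key=lambda y: mse_pixel(x, y))
bce_at_optimum = bce_pixel(x, best_bce)        # equals the entropy of x, not zero
```

Both losses are minimised by predicting the target intensity itself. Unlike MSE, the minimum value of BCE is the entropy of the target rather than zero, which shifts the loss by a constant but does not change the minimiser.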
In this paper, we extend the theory presented by Alain and Bengio alain2014regularized () (Section 3) and present empirical results to validate the theory in a practical setting (Section 4.2). Furthermore, we present an application (Section 4.3), where by applying the theory we may improve samples synthesised from both variational autoencoders kingma2013auto (); rezende2014stochastic () and adversarial autoencoders makhzani2015adversarial () trained with the denoising criterion.
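Alain and Bengio alain2014regularized () show that if a regularised DAE is trained to minimise the reconstruction loss:
$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{p(x)}\,\mathbb{E}_{\epsilon}\left[\lVert r(\tilde{x}) - x \rVert_2^2\right] \qquad (2)$$
where $\tilde{x} = x + \epsilon$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, then in the limit as $\sigma \to 0$ the optimal reconstruction function, $r^*$, is given by alain2014regularized ():
$$r^*(x) = x + \sigma^2 \frac{\partial \log p(x)}{\partial x} + o(\sigma^2) \qquad (3)$$
Equation (3) is equivalent to a gradient ascent step in data space, $x$, moving towards regions of higher likelihood, where the step size is given by $\sigma^2$.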
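The relationship between optimal denoising and a gradient step can be checked in closed form for a toy case. The Python sketch below assumes one-dimensional Gaussian data with mean `mu` and variance `s2`, corrupted by additive Gaussian noise of variance `sigma2`; for this case the optimal denoiser is the posterior mean, and the gradient of the log density is `(mu - x) / s2`. The function names and numerical values are illustrative assumptions.

```python
def optimal_denoiser(x_tilde, mu, s2, sigma2):
    """Posterior mean E[x | x_tilde] when x ~ N(mu, s2) and x_tilde = x + N(0, sigma2)."""
    return x_tilde + sigma2 / (s2 + sigma2) * (mu - x_tilde)

def gradient_step(x, mu, s2, sigma2):
    """x + sigma^2 * d/dx log p(x) for the Gaussian density p = N(mu, s2)."""
    return x + sigma2 * (mu - x) / s2

mu, s2, x = 2.0, 1.0, 0.5
gaps = []
for sigma in (0.4, 0.2, 0.1):
    sigma2 = sigma ** 2
    gap = abs(optimal_denoiser(x, mu, s2, sigma2) - gradient_step(x, mu, s2, sigma2))
    gaps.append(gap / sigma2)  # an o(sigma^2) remainder means this ratio tends to 0
```

The ratio stored in `gaps` shrinks as the corruption noise decreases, which is what an $o(\sigma^2)$ remainder requires in this toy setting.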
Alain and Bengio alain2014regularized () only showed that Equation (3) held for the MSE loss function given in Equation (2). Here we show that Equation (3) also holds when BCE is used for the loss function rather than MSE.
Consider a reconstruction model, , trained to minimise:
where is the BCE loss; this expands to:
Letting and shortening to for ease of notation:
The optimal reconstruction model, is obtained by solving the following equation:
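As a sketch of how this stationarity condition can be solved, write $q(\tilde{x} \mid x)$ for the density of the corruption process and treat each pixel independently; these symbols are notational assumptions made for this sketch. Setting the pointwise derivative of the expected BCE with respect to $r(\tilde{x})$ to zero gives:

```latex
\begin{align*}
0 &= \frac{\partial}{\partial r(\tilde{x})} \int p(x)\, q(\tilde{x} \mid x)
     \Bigl[ -x \log r(\tilde{x}) - (1 - x) \log\bigl(1 - r(\tilde{x})\bigr) \Bigr] \,\mathrm{d}x \\
  &= -\int p(x)\, q(\tilde{x} \mid x)
     \left[ \frac{x}{r(\tilde{x})} - \frac{1 - x}{1 - r(\tilde{x})} \right] \mathrm{d}x \\
  &= -\frac{1}{r(\tilde{x})\bigl(1 - r(\tilde{x})\bigr)}
     \int p(x)\, q(\tilde{x} \mid x)\, \bigl(x - r(\tilde{x})\bigr) \,\mathrm{d}x \\
\Longrightarrow \quad
r^{*}(\tilde{x}) &= \frac{\int p(x)\, q(\tilde{x} \mid x)\, x \,\mathrm{d}x}
                        {\int p(x)\, q(\tilde{x} \mid x) \,\mathrm{d}x}
                 = \mathbb{E}\left[ x \mid \tilde{x} \right].
\end{align*}
```

Under these assumptions the BCE-optimal reconstruction is the posterior mean of the clean input given the corrupted input. This is the same minimiser as for the MSE loss, which is why the result of Alain and Bengio alain2014regularized () carries over to BCE.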
The purpose of our experiments is twofold. To validate the theory presented in Section 3, we show that it is possible to synthesise novel data samples that are consistent with the training data, starting from random noise. We then provide a practical motivation, by showing that samples synthesised via generative autoencoders kingma2013auto (); rezende2014stochastic (); makhzani2015adversarial () using existing sampling methods can be improved via iterative application of a trained denoising generative autoencoder to those initial samples.
For our experiments we train denoising variants of two state-of-the-art generative autoencoder models—the variational autoencoder (VAE) kingma2013auto (); rezende2014stochastic () and the adversarial autoencoder (AAE) makhzani2015adversarial ()—for the two reasons specified above. Firstly, we use these models in order to validate Equation (3) (which we have previously shown is a consequence of Equation (9)). Secondly, we demonstrate how our sampling method, as detailed in Equation (10), may be used to improve the perceptual quality of data samples synthesised via other methods.
Generative autoencoders are autoencoders which are trained with a reconstruction loss, as well as an extra regularisation loss which encourages the distribution of the encoded samples to conform to a specified prior distribution; for simplicity this is often chosen to be the multivariate normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and we use this standard choice in our own experiments. For details of the derivation and implementation of these regularisation losses, we refer readers to the original literature kingma2013auto (); makhzani2015adversarial ().
We turn these models into denoising variants, namely denoising variational autoencoders (DVAEs) im2015denoising () and denoising adversarial autoencoders (DAAEs), simply by replacing the reconstruction loss with a denoising loss. Following Alain and Bengio alain2014regularized (), we use an additive Gaussian noise corruption process. Each of our denoising generative autoencoder models is trained until convergence, such that the learned reconstruction function is close to the optimal reconstruction function.
In line with the recommendations of Radford et al. radford2015unsupervised (), we use strided convolutional layers in the encoders and fractionally-strided convolutions in the decoders. Each layer is followed by batch normalisation ioffe2015batch () and a ReLU nonlinearity, except for the final layer of each decoder, whose outputs are restricted to $(0, 1)$ by a sigmoid function. The architectures of the encoders and decoders match the discriminator and generator networks of Radford et al. radford2015unsupervised (), respectively. For the adversarial model in the DAAE, we use a fully-connected network with dropout and leaky ReLU nonlinearities. When we perform sampling using our trained generative autoencoders, we utilise minibatch statistics for the batch normalisation layers, as this can help compensate for inputs that are far from the training distribution.
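For the variational autoencoder with a diagonal Gaussian encoder and a standard normal prior, the regularisation loss is the Kullback-Leibler divergence between the encoder distribution and the prior, which has a well-known closed form kingma2013auto (). The Python sketch below evaluates it; the function name and the log-variance parameterisation are illustrative assumptions.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

# Encoder distribution equal to the prior: the penalty vanishes.
kl_match = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Unit variances but shifted means: the penalty is half the squared norm of the mean.
kl_shifted = kl_to_standard_normal([1.0, -1.0], [0.0, 0.0])
```

The adversarial autoencoder replaces this closed-form penalty with a learned discriminator, so no analogous formula is sketched for it here.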
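The effect of using minibatch statistics at sampling time can be illustrated with a small Python sketch. It normalises a set of activations once with hypothetical running statistics gathered during training and once with the statistics of the minibatch itself; all values and function names are illustrative assumptions.

```python
import math

def batch_norm(values, mean, var, eps=1e-5):
    """Normalise activations with the given mean and variance (no learned scale or shift)."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def minibatch_stats(values):
    """Mean and (biased) variance of the current minibatch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Hypothetical running statistics accumulated during training.
running_mean, running_var = 0.0, 1.0

# Activations produced by inputs far from the training distribution, e.g. random noise.
activations = [9.0, 10.0, 11.0, 10.5, 9.5]

with_running = batch_norm(activations, running_mean, running_var)
with_minibatch = batch_norm(activations, *minibatch_stats(activations))
```

With the running statistics the normalised activations stay far from zero, whereas with minibatch statistics they are re-centred, so later layers receive inputs in the range they saw during training.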
For training we use the CelebA dataset, which consists of two hundred thousand aligned and cropped images of celebrity faces liu2015deep (). It is widely used for the qualitative evaluation of generative models as it contains a large amount of high quality image data, and the use of human faces makes it easier for unrealistic qualities of synthesised samples to be spotted. We preprocess the images by cropping and resizing them to px, and then train our models using the Adam optimiser (using hyperparameters , and ) kingma2014adam () on the training split of the dataset for 20 epochs. All of our experiments were carried out using the Torch library collobert2011torch7 ().
4.2 Sampling From Noise
In Section 3 we showed that utilising an optimally trained reconstruction model is equivalent to a gradient step in the direction of more likely data samples. We can validate this empirically by iteratively applying a trained reconstruction model to an initial image, where each pixel is sampled independently from some random distribution. The trained reconstruction model is applied such that:
where . According to Equation (3), each iteration should produce image samples that are more likely under the data-generating distribution. Qualitatively, this means that several iterations of applying the reconstruction model to random noise results in progressively more realistic faces.
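The iterative procedure can be sketched in a few lines of Python, assuming the update takes the form $x_{t+1} = r(x_t)$ for a reconstruction function $r$. Here a closed-form denoiser for one-dimensional Gaussian data stands in for a trained DAE; the constants and function names are illustrative assumptions.

```python
import math
import random

# Toy data mean and variance, and corruption variance (assumed values).
MU, S2, SIGMA2 = 0.7, 0.01, 0.0025

def reconstruct(x):
    """Optimal denoiser for 1-D Gaussian data; stands in for a trained DAE."""
    return x + SIGMA2 / (S2 + SIGMA2) * (MU - x)

def log_density(x):
    """Log density of the toy data distribution N(MU, S2)."""
    return -0.5 * (x - MU) ** 2 / S2 - 0.5 * math.log(2.0 * math.pi * S2)

def iterate(x0, reconstruct_fn, n_steps):
    """Repeatedly apply the reconstruction function: x_{t+1} = r(x_t)."""
    chain = [x0]
    for _ in range(n_steps):
        chain.append(reconstruct_fn(chain[-1]))
    return chain

rng = random.Random(0)
x0 = rng.uniform(0.0, 1.0)  # one "pixel" of uniform noise
chain = iterate(x0, reconstruct, n_steps=20)
log_probs = [log_density(x) for x in chain]
```

In this toy setting every iteration increases the log density of the sample, mirroring the qualitative behaviour expected for images.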
The results of our sampling procedure, given by Equation (10), are shown in Figure 1. The initial samples are noise drawn from a uniform distribution, and the subsequent samples appear progressively more like faces. A few reconstruction iterations are sufficient to produce face-like images, and several further iterations result in a significant improvement in perceptual quality. These results show that iterative application of Equation (10) moves samples from regions of low probability under the data-generating distribution to regions of higher probability.
If we consider that the support of the distribution of valid face images lies on manifolds of lower dimension than the data space arjovsky2017towards (), then the gradient for samples far from the manifold may be very weak. One potential solution to this problem is to add a small amount of noise before each iteration of Equation (10), which has the effect of smoothing out the (log) probability distribution, making it easier to take gradient steps. We demonstrate the result of applying this technique in Figure 2, where we show the results of applying Equation (10) for many iterations.
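A minimal Python sketch of this noisy variant is given here, assuming fresh Gaussian noise is added to the current sample before every application of the reconstruction function. The toy reconstruction function, the noise level and the names used are illustrative assumptions; the toy only illustrates the control flow, not the smoothing effect on real image data.

```python
import random

def iterate_with_noise(x0, reconstruct_fn, n_steps, noise_std, rng):
    """Apply x <- r(x + noise) repeatedly, drawing fresh Gaussian noise at every step."""
    x = list(x0)
    for _ in range(n_steps):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        x = reconstruct_fn(noisy)
    return x

# Hypothetical high-probability point and a toy reconstruction function pulling towards it.
TEMPLATE = [0.2, 0.8]
toy_reconstruct = lambda v: [vi + 0.2 * (ti - vi) for vi, ti in zip(v, TEMPLATE)]

rng = random.Random(0)
x0 = [rng.uniform(0.0, 1.0) for _ in TEMPLATE]  # start from uniform noise
x_final = iterate_with_noise(x0, toy_reconstruct, n_steps=50, noise_std=0.01, rng=rng)
```

Because noise is re-injected at every step, the chain settles into a small neighbourhood of the high-probability point rather than converging to it exactly.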
4.3 Improving Initial Samples
In this section we show how the iterative sampling process described by Equation (10) can improve samples drawn from a DVAE or DAAE by first drawing samples in the standard way kingma2013auto (), and then applying iterative encoding and decoding to move the sample towards more likely regions of the data-generating distribution.
Normally, samples are drawn from a generative autoencoder by drawing a latent encoding, from the chosen prior distribution, and then passing it through the trained decoder. We refer to a data sample drawn in this way as . In order to improve these data samples we propose iteratively applying a trained reconstruction function (Equation (10)) in order to move the sample towards regions of higher probability under the data-generating distribution. The process is detailed in Equation (11), where we initialise the process with the conventional prior distribution .
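The procedure can be sketched in Python as follows, assuming a standard normal prior and an update of the form $x_{t+1} = r(x_t)$ starting from the decoded prior sample. The function names, the toy decoder and the toy reconstruction function are illustrative assumptions; with a trained model, the reconstruction function would encode and then decode the current sample.

```python
import math
import random

def sample_prior(latent_dim, rng):
    """Draw a latent encoding from a standard normal prior (the choice assumed here)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

def improve_sample(decoder, reconstruct_fn, latent_dim, n_steps, rng):
    """Decode a prior sample, then refine it by iterating the reconstruction function."""
    z = sample_prior(latent_dim, rng)
    x = decoder(z)             # conventional initial sample
    chain = [x]
    for _ in range(n_steps):
        x = reconstruct_fn(x)  # e.g. decode(encode(x)) for a trained model
        chain.append(x)
    return z, chain

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical stand-ins for a trained decoder and reconstruction function.
toy_decoder = lambda z: [sigmoid(zi) for zi in z]
TARGET = [0.9, 0.1, 0.5]
toy_reconstruct = lambda x: [xi + 0.5 * (ti - xi) for xi, ti in zip(x, TARGET)]

z, chain = improve_sample(toy_decoder, toy_reconstruct,
                          latent_dim=3, n_steps=5, rng=random.Random(1))
```

The first element of `chain` is the conventional sample, and each later element is one further refinement step.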
The results of this procedure are shown in Figure 3. Although the quality of the initial samples varies greatly, we usually observe an improvement with several steps of the sampling procedure. This improvement is more noticeable when the initial samples are worse (see Figure 3(d, e, f)).
5 Relation to Previous Work
We consider our work in relation to previous work where denoising autoencoders have been used to learn both homogeneous transition operators of Markov chains nguyen2016plug (); bengio2013generalized (); bengio2014deep () and non-homogeneous transition operators salimans2015markov (); bachman2015variational ().
Bengio et al. bengio2013generalized (); bengio2014deep () construct the transition operator of a Markov chain whose stationary distribution is the data-generating distribution. The transition operator is implemented by corrupting a data sample and reconstructing it, where the reconstruction function is a trained DAE. Bengio et al. bengio2013generalized (); bengio2014deep () initialise their sampling process with a sample that is consistent with the training data. Applied to the MNIST lecun1998gradient () dataset of handwritten digits, their iterative sampling process produces a series of samples from different modes in the data-generating distribution. For example, starting with a 9, the process may generate 7’s, 6’s and 0’s, as well as samples that do not correspond to digits. Their experiments demonstrate that iterative sampling leads to convergence to a distribution, rather than a fixed point. Note that in the work of Bengio et al. bengio2013generalized (); bengio2014deep (), transitions between modes are made possible by the addition of sufficient noise before each reconstruction step.
In contrast to the work of Bengio et al. bengio2013generalized (); bengio2014deep (), our work focuses on moving towards a single data point in a distribution rather than moving between all modes in the data-generating distribution. Specifically, in our work, the addition of noise between iterations was used to encourage a smoother data space in which to take steps (contrast Figures 1 and 2), whereas in Bengio et al. bengio2013generalized (); bengio2014deep () the corruption process was required to allow successful transition between modes. Note that by applying the iterative process in Equation (10), starting from different initial random noise inputs, we are able to generate samples from different modes in the data-generating distribution.
Interestingly, Bengio et al. bengio2014deep (); bengio2013generalized () found that their sampling process often led to a series of nonsensical samples when transitioning between modes; they introduced the “walkback” algorithm, which involves updating the reconstruction model while sampling from it at the same time. The need for such a process indicates the drawbacks of attempting to transition between modes of a data distribution, especially when that data distribution consists of images.
When considering the specific application of DAEs to images, it may make more sense to consider a gradient method that searches for a single, highly likely point in a distribution, rather than aiming to capture the whole distribution starting from a single sample. This is because image samples lie on manifolds which are often of lower dimension than the data space arjovsky2017towards (). This means that manifolds are unlikely to intersect arjovsky2017towards (), making it difficult to move from one mode or manifold to another while generating samples that are consistent with the training data.
Other related work includes that of Nguyen et al. nguyen2015deep (), who define a Markov chain with a transition operation defined as:
where and are hyperparameters. Nguyen et al. approximate and choose the step size to be , rather than allowing to be the step size, as we do with our transition operator in Equation (10). Nguyen et al. nguyen2015deep () train their DAE models using MSE because of the properties shown by Alain and Bengio alain2014regularized () for this specific cost function. Given our results, their models could be constructed using a BCE loss, which may be more appropriate for image data.
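The difference between the two step-size choices can be sketched in Python, assuming the transition operator takes the form of a gradient step plus Gaussian noise, with the gradient of the log density approximated by $(r(x) - x)/\sigma^2$. The parameter names `step` and `noise_std`, the toy reconstruction function and all numerical values are illustrative assumptions.

```python
import random

def score_estimate(x, reconstruct_fn, sigma2):
    """Approximate d/dx log p(x) by (r(x) - x) / sigma^2."""
    return [(ri - xi) / sigma2 for ri, xi in zip(reconstruct_fn(x), x)]

def noisy_gradient_step(x, reconstruct_fn, sigma2, step, noise_std, rng):
    """One transition: a gradient step of size `step` plus Gaussian noise."""
    g = score_estimate(x, reconstruct_fn, sigma2)
    return [xi + step * gi + rng.gauss(0.0, noise_std) for xi, gi in zip(x, g)]

# Hypothetical reconstruction function that pulls samples towards 0.5.
toy_reconstruct = lambda x: [xi + 0.2 * (0.5 - xi) for xi in x]

rng = random.Random(0)
x = [0.1, 0.9]
sigma2 = 0.01

# With step == sigma2 and no added noise, the transition reduces to x_{t+1} = r(x_t).
same_as_dae = noisy_gradient_step(x, toy_reconstruct, sigma2,
                                  step=sigma2, noise_std=0.0, rng=rng)
# A separately chosen step size gives a smaller or larger move along the same direction.
smaller_move = noisy_gradient_step(x, toy_reconstruct, sigma2,
                                   step=0.5 * sigma2, noise_std=0.0, rng=rng)
```

Setting the step size equal to the corruption variance and removing the noise recovers plain iterative reconstruction, which is the special case used for Equation (10).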
There are several other studies in which Markov chain sampling has been applied both during training and during sample synthesis salimans2015markov (); bachman2015variational () (as well as the “walkback” algorithm of Bengio et al. bengio2013generalized ()). However, we emphasise that our work has focused on applying Markov chains only during sample synthesis and not during training. Application during training may be another avenue for future work.
Rather than trying to synthesise samples from all modes in the data-generating distribution bengio2013generalized (), our aim is to move towards more likely data samples. This may involve moving along manifolds towards more likely regions, leading to small changes in identity, pose and even gender. Further, since we are taking gradient steps of fixed size, $\sigma^2$ (Equation (3)), it is possible that if $\sigma^2$ is not sufficiently small we may “overshoot” a local maximum and start gradient ascent towards a different, higher local maximum (see Figure 4). This explains why there may be changes in identity, pose or gender between samples.
To produce a variety of samples, visiting different modes in the data-generating distribution, we may initialise the sampling process with samples from different locations in the data-generating distribution.
Building on previous work alain2014regularized (), we show that denoising autoencoders trained to minimise the binary cross-entropy loss may be used to approximate the gradient of the log-density of the data-generating distribution with respect to data samples. As a result, the sampling process, detailed in Equation (10), can be applied to any kind of autoencoder trained with a binary cross-entropy denoising loss. Empirically, we validate our findings by showing that it is possible to synthesise novel samples consistent with the data-generating distribution starting from random noise. In addition, we provide a practical application, demonstrating that it is possible to improve the perceptual quality of initial samples drawn from denoising variants of variational kingma2013auto (); rezende2014stochastic () and adversarial makhzani2015adversarial () autoencoders via our proposed sampling procedure.
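The overshoot effect can be reproduced on a toy problem. The Python sketch below runs fixed-step gradient ascent on the log density of a one-dimensional mixture with a narrow, lower mode at 0 and a broad, higher mode at 2; the mixture parameters, starting point and step sizes are illustrative assumptions chosen to make the effect visible.

```python
import math

# Toy 1-D mixture: (weight, mean, std) for a narrow lower mode and a broad higher mode.
COMPONENTS = [(0.05, 0.0, 0.1), (0.95, 2.0, 0.5)]

def density(x):
    """Mixture density p(x)."""
    return sum(w / (s * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * ((x - m) / s) ** 2)
               for w, m, s in COMPONENTS)

def grad_log_density(x):
    """d/dx log p(x); the shared 1/sqrt(2*pi) factor cancels in the ratio."""
    weighted = [(w / s * math.exp(-0.5 * ((x - m) / s) ** 2), m, s)
                for w, m, s in COMPONENTS]
    total = sum(p for p, _, _ in weighted)
    return sum(p * (m - x) / s ** 2 for p, m, s in weighted) / total

def ascend(x, step, n_steps=200):
    """Gradient ascent on log p(x) with a fixed step size."""
    for _ in range(n_steps):
        x = x + step * grad_log_density(x)
    return x

x_small = ascend(-0.15, step=0.005)  # stays on the nearby mode at 0
x_large = ascend(-0.15, step=0.1)    # first step overshoots into the basin of the mode at 2
```

With the small step the sample converges to the nearby local maximum, while the larger step jumps over it and ends at the higher maximum, analogous to a change of identity between successive image samples.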
Acknowledgements
We would like to acknowledge the EPSRC for funding through a Doctoral Training studentship and the support of the EPSRC CDT in Neurotechnology.
References
- (1) Y. Bengio, Learning deep architectures for AI, Foundations and Trends® in Machine Learning 2 (1) (2009) 1–127.
- (2) S. Lange, M. Riedmiller, A. Voigtlander, Autonomous reinforcement learning on raw visual input data in a real world application, in: IEEE International Joint Conference on Neural Networks, 2012, pp. 1–8.
- (3) P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: International Conference on Machine Learning, 2008, pp. 1096–1103.
- (4) D. P. Kingma, M. Welling, Auto-encoding variational Bayes, in: International Conference on Learning Representations, 2014.
- (5) D. J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, in: International Conference on Machine Learning, 2014.
- (6) A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, Adversarial autoencoders, arXiv preprint arXiv:1511.05644.
- (7) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research 11 (Dec) (2010) 3371–3408.
- (8) G. Alain, Y. Bengio, What regularized auto-encoders learn from the data-generating distribution, The Journal of Machine Learning Research 15 (1) (2014) 3563–3593.
- (9) A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, J. Clune, Plug & play generative networks: Conditional iterative generation of images in latent space, arXiv preprint arXiv:1612.00005.
- (10) A. Rasmus, M. Berglund, M. Honkala, H. Valpola, T. Raiko, Semi-supervised learning with ladder networks, in: Advances in Neural Information Processing Systems, 2015, pp. 3546–3554.
- (11) F. Huszár, Variational inference using implicit distributions, arXiv preprint arXiv:1702.08235.
- (12) D. J. Im, S. Ahn, R. Memisevic, Y. Bengio, Denoising criterion for variational auto-encoding framework, arXiv preprint arXiv:1511.06406.
- (13) A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: International Conference on Learning Representations, 2016.
- (14) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, 2015, pp. 448–456.
- (15) Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
- (16) D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, 2015.
- (17) R. Collobert, K. Kavukcuoglu, C. Farabet, Torch7: A matlab-like environment for machine learning, in: BigLearn, Advances in Neural Information Processing Systems Workshop, no. EPFL-CONF-192376, 2011.
- (18) M. Arjovsky, L. Bottou, Towards principled methods for training generative adversarial networks, arXiv preprint arXiv:1701.04862.
- (19) Y. Bengio, L. Yao, G. Alain, P. Vincent, Generalized denoising auto-encoders as generative models, in: Advances in Neural Information Processing Systems, 2013, pp. 899–907.
- (20) Y. Bengio, E. Thibodeau-Laufer, G. Alain, J. Yosinski, Deep generative stochastic networks trainable by backprop, in: International Conference on Machine Learning, Vol. 32, 2014.
- (21) T. Salimans, D. Kingma, M. Welling, Markov chain Monte Carlo and variational inference: Bridging the gap, in: International Conference on Machine Learning, 2015, pp. 1218–1226.
- (22) P. Bachman, D. Precup, Variational generative stochastic networks with collaborative shaping, in: International Conference on Machine Learning, 2015, pp. 1964–1972.
- (23) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
- (24) A. Nguyen, J. Yosinski, J. Clune, Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.