Improving VAEs’ Robustness to Adversarial Attack



Variational autoencoders (VAEs) have recently been shown to be vulnerable to adversarial attacks, wherein they are fooled into reconstructing a chosen target image. However, how to defend against such attacks remains an open problem. We make significant advances in addressing this issue by introducing methods for producing adversarially robust VAEs. Namely, we first demonstrate that methods used to obtain disentangled latent representations produce VAEs that are more robust to these attacks. However, this robustness comes at the cost of reducing the quality of the reconstructions. We therefore further introduce a new hierarchical VAE, the Seatbelt-VAE, which can produce high-fidelity autoencoders that are also adversarially robust. We confirm the empirical capabilities of the Seatbelt-VAE on several different datasets and with current state-of-the-art VAE adversarial attack schemes.




1 Introduction



Figure 1: Latent-space adversarial attacks on CelebA for different models. Here we start with the image of Hugh Jackman in the top left and introduce an adversary that tries to produce reconstructions that look like Anna Wintour, as per the top right. This is done by applying a distortion (third column) to the original image to produce an adversarial input (second column). We can see that the adversarial reconstruction for the Vanilla VAE looks substantially like Wintour, indicating a successful attack. Adding a regularisation term by using the $\beta$-TCVAE produces an adversarial reconstruction that does not look like Wintour, but it is also far from a successful reconstruction. Our proposed model, the Seatbelt-VAE, is sufficiently hard to attack that the output under attack still looks like Jackman, not Wintour. Note that, in addition to this robustness, the Seatbelt-VAE further provides a clearer reconstruction for the original image.

Variational autoencoders (VAEs) are a powerful approach to learning deep generative models and probabilistic autoencoders (Kingma and Welling, 2014; Rezende et al., 2014).

However, recent work has shown that they are vulnerable to adversarial attacks (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018; Kos et al., 2018), wherein an adversary attempts to fool the VAE into producing reconstructions similar to a chosen target by adding distortions to the original input, as shown in Figure 1. In particular, these papers have shown that effective attacks can be made by finding local perturbations of the original input that produce latent-space representations which are similar to that of the adversary’s target. This kind of attack can be harmful in applications where the encoder’s output is used downstream, as in Xu et al. (2017); Kusner et al. (2017); Theis et al. (2017); Townsend et al. (2019); Ha and Schmidhuber (2018); Higgins et al. (2017b). Furthermore, VAEs are often themselves used as a mechanism for protecting classifiers from adversarial attack (Schott et al., 2019; Ghosh et al., 2019). As such, ensuring VAEs are robust to adversarial attack is an important endeavor.

Despite these vulnerabilities, little progress has been made in the literature on how to defend VAEs from such adversarial attacks. The aim of this paper is thus to investigate and introduce possible strategies for defense. Moreover, we seek to find ways to defend VAEs in a manner that maintains reconstruction performance.

Our first contribution towards this aim is to show that regularising the variational objective (i.e. the ELBO) during training can lead to more robust VAEs. Specifically, we leverage ideas from the disentanglement literature (Mathieu et al., 2019) to improve VAEs’ robustness by learning simpler and smoother representations that are less vulnerable to attack. In particular, we show that the total correlation (TC) term used by Kim and Mnih (2018); Chen et al. (2018); Esmaeili et al. (2019) to encourage independence between the dimensions of the learned latent representations also serves as an effective regulariser for learning robust VAEs.

Though a clear improvement over the standard VAE, a severe drawback of this approach is that the gains in robustness are coupled with drops in the reconstruction performance, due to the increased regularisation. Furthermore, we find that the achievable robustness with this approach can be limited (see Figure 1) and thus potentially insufficient for particularly sensitive tasks.

To address this, we introduce a new TC-regularised hierarchical VAE: the Seatbelt-VAE. By using a richer latent space representation than the standard VAE, the Seatbelt-VAE can learn deep generative models which are not only even more robust to adversarial attacks than those just using TC regularisation, but which also achieve this while providing reconstructions that are comparable to, and often even better than, those of the standard VAE.

To summarize, our key contributions are:

  • Providing insights into what makes VAEs vulnerable to attack and how we might go about defending them.

  • Unearthing new connections between disentanglement and robustness to adversarial attack.

  • A demonstration that regularised VAEs, trained with an up-weighted total correlation, are significantly more robust to adversarial attacks than vanilla VAEs.

  • Introducing a regularised hierarchical VAE, the Seatbelt-VAE, that provides further robustness to adversarial attack while providing improved reconstructions.

2 Background

2.1 Variational Autoencoders

Variational autoencoders (VAEs) are a deep extension of factor analysis suitable for high-dimensional data like images (Kingma and Welling, 2014; Rezende et al., 2014). They introduce a joint distribution over data $x$ and latent variables $z$: $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$, where $p_\theta(x \mid z)$ is an appropriate distribution given the form of the data, the parameters of which are represented by deep nets with parameters $\theta$, and $p(z) = \mathcal{N}(0, I)$ is a common choice for the prior. As exact inference is intractable, one performs amortised stochastic variational inference by introducing an inference network for the latent variables, $q_\phi(z \mid x)$, which often also takes the form of a Gaussian, $\mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$. We can then perform gradient ascent on the evidence lower bound (ELBO)

$$\mathcal{L}(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

w.r.t. both $\theta$ and $\phi$, using the reparameterisation trick to take gradients through Monte Carlo samples from $q_\phi(z \mid x)$.
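The ELBO and the reparameterisation trick can be illustrated with a minimal sketch. It assumes a diagonal-Gaussian encoder and a standard normal prior, so that the KL term has a closed form. The functions `toy_encode` and `toy_log_likelihood` are hypothetical stand-ins for the inference and generative networks and are not taken from the paper.

```python
import math
import random


def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def elbo_estimate(x, encode, log_likelihood, rng):
    """Single-sample ELBO estimate: log p(x|z) - KL(q(z|x) || p(z))."""
    mu, log_var = encode(x)
    z = reparameterise(mu, log_var, rng)
    return log_likelihood(x, z) - gaussian_kl_to_standard_normal(mu, log_var)


def toy_encode(x):
    # Hypothetical encoder: mean is a scaled copy of x, fixed log-variance.
    return [0.5 * xi for xi in x], [-1.0 for _ in x]


def toy_log_likelihood(x, z):
    # Hypothetical decoder: unit-variance Gaussian centred on 2 * z.
    return sum(
        -0.5 * (xi - 2.0 * zi) ** 2 - 0.5 * math.log(2.0 * math.pi)
        for xi, zi in zip(x, z)
    )


rng = random.Random(0)
value = elbo_estimate([0.3, -1.2], toy_encode, toy_log_likelihood, rng)
```

In a real model the two toy functions would be neural networks, and an automatic differentiation library would take gradients of this estimate with respect to their parameters.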

2.2 Attacks on VAEs

In an adversarial attack, an agent tries to manipulate the behaviour of some machine learning model towards a goal of their choosing, such as fooling a classifier into misclassifying an image by adding a small perturbation (Akhtar and Mian, 2018; Gilmer et al., 2018). For many deep learning models, very small changes in the input, of little importance to the human eye, can produce large changes in the model’s output.

Attacks on VAEs have been proposed by Tabacof et al. (2016); Gondim-Ribeiro et al. (2018); Kos et al. (2018). Here the adversary looks to apply small input distortions that produce reconstructions close to a target adversarial image. An example of this is shown in Figure 1, where a successful attack is performed on a standard VAE to turn Hugh Jackman into Anna Wintour.

The current most effective mode of attack on VAEs is known as a latent space attack (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018; Kos et al., 2018). This aims to find a distorted image $x^* = x + d$ such that its posterior $q_\phi(z \mid x^*)$ is close to that of the agent’s chosen target image, $q_\phi(z \mid x^t)$. This, in turn, implies that the likelihood $p_\theta(x^t \mid z)$ is high when conditioned on draws from the encoding of the adversarial example. It is particularly important to be robust to this attack if one is concerned with using the encoder network of a VAE as part of a downstream task.

For a VAE with a single stochastic layer, the latent-space adversarial objective introduced by Tabacof et al. (2016); Gondim-Ribeiro et al. (2018) is

$$\Delta(x, d, x^t; \lambda) = \mathrm{KL}\left(q_\phi(z \mid x + d) \,\|\, q_\phi(z \mid x^t)\right) + \lambda \|d\|_2, \tag{1}$$

where we are penalising the $L_2$ norm of $d$ so as to aim for attacks that change the image less. We can then simply optimise to find a good distortion $d$.
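As a rough illustration of this objective, the sketch below evaluates the attack loss for a diagonal-Gaussian encoder, for which the KL between two posteriors is available in closed form. The encoder `toy_encode` and the two-pixel "images" are hypothetical and chosen only so that the values are easy to check by hand. An adversary would minimise this quantity over the distortion with a gradient-based optimiser.

```python
import math


def gaussian_kl(mu_a, log_var_a, mu_b, log_var_b):
    """KL(N(mu_a, diag(exp(log_var_a))) || N(mu_b, diag(exp(log_var_b))))."""
    total = 0.0
    for ma, la, mb, lb in zip(mu_a, log_var_a, mu_b, log_var_b):
        total += 0.5 * (lb - la + (math.exp(la) + (ma - mb) ** 2) / math.exp(lb) - 1.0)
    return total


def attack_loss(x, d, x_target, encode, lam):
    """KL(q(z|x+d) || q(z|x_target)) + lam * ||d||_2."""
    x_adv = [xi + di for xi, di in zip(x, d)]
    mu_adv, lv_adv = encode(x_adv)
    mu_tgt, lv_tgt = encode(x_target)
    d_norm = math.sqrt(sum(di * di for di in d))
    return gaussian_kl(mu_adv, lv_adv, mu_tgt, lv_tgt) + lam * d_norm


def toy_encode(x):
    # Hypothetical two-pixel encoder with unit posterior variances.
    return [x[0] + x[1], x[0] - x[1]], [0.0, 0.0]


x, x_target, lam = [0.0, 0.0], [1.0, 1.0], 0.1
loss_no_distortion = attack_loss(x, [0.0, 0.0], x_target, toy_encode, lam)
loss_full_distortion = attack_loss(x, [1.0, 1.0], x_target, toy_encode, lam)
```

With no distortion the loss is the pure KL between the two posteriors. With the distortion that maps the input exactly onto the target, the KL vanishes and only the norm penalty remains, which shows how the penalty weight trades off stealth against matching the target encoding.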

3 Defending VAEs

Given these approaches to attacking VAEs, the critical question is now how to defend them. This problem was not considered by these prior works.

To solve it, we first need to consider the question: what makes VAEs vulnerable to adversarial attacks? We argue that two key factors dictate whether we can perform a successful attack on a VAE: a) whether we can induce significant changes in the encoding distribution $q_\phi(z \mid x)$ through only small changes in the data $x$, and b) whether we can induce significant changes in the reconstructed images through only small changes to the latents $z$. The first of these relates to the smoothness of the encoder mapping, the latter to the smoothness of the decoder mapping.

Consider, for the sake of argument, the case where the encoder-decoder process is almost completely noiseless. Here successful reconstruction places no direct pressure for similar encodings to correspond to similar images: given sufficiently powerful networks, we can have an embedding where very small changes to $z$ imply very large changes to the reconstructed image, because there is no ambiguity in the “correct” encoding of a particular datapoint. In essence, we can have a lookup-table style behaviour, where nearby realisations of $z$ do not necessarily relate to each other and very different images can have very similar encodings.

Such a system will now be very vulnerable to adversarial attacks: small changes to the image can lead to large changes in the encoding, and small changes to the encoding can lead to large changes in the reconstruction. Our autoencoder will also tend to overfit and have gaps in the aggregate posterior $q_\phi(z) = \frac{1}{N} \sum_{n=1}^{N} q_\phi(z \mid x_n)$, as each $q_\phi(z \mid x_n)$ will be tightly peaked. This can then easily be exploited by an adversary.

We postulate two possible ways to avoid this undesirable behaviour. Firstly, we could try and directly regulate the networks used by the encoder and decoder to limit the capacity of the system to have small differences in images induce large differences in latents. Secondly, we can try to regulate the level of noise in the encoding to indirectly force a smoothness in the embedding. Having a noisy encoding creates uncertainty in the latent that gives rise to a particular image, forcing similar latents to correspond to similar images. In other words, we can avoid the aforementioned vulnerabilities by either ensuring our encode–decode process is sufficiently simple, or sufficiently noisy. The fact that the VAE is vulnerable to adversarial attack suggests that its standard setup does not sufficiently encourage either of these to provide an adequate defence. Introducing additional regularisation to enforce simplicity or noisiness thus provides an intriguing prospect for defending them.

Though in principle direct regularisation of the networks (e.g. through regularisation of their weights) might be a viable defence in a number of scenarios, we will, in this paper, instead focus on indirect regularisation approaches as discussed in the next section. The reason for this is that steering the macroscopic behaviour of the networks through low-level regularisations can be difficult to do and, in particular, difficult to calibrate.

3.1 Disentanglement and Robustness

Recent research into disentangling VAEs (Higgins et al., 2017a; Siddharth et al., 2017; Kim and Mnih, 2018; Chen et al., 2018; Esmaeili et al., 2019; Mathieu et al., 2019) and the information bottleneck (Alemi et al., 2017, 2018) has looked to regularise the ELBO with the hope of providing more interpretable or simpler embeddings. This hints at an interesting link to robustness and raises the question: can we use methods for encouraging disentanglement to also encourage robustness?

Of particular relevance is the recent work of Mathieu et al. (2019). They introduce the notion of overlap in the embedding of a VAE and show how controlling it is critical to achieving smooth and meaningful latent representations. Overlap encapsulates both the level of uncertainty in the encoding process and also a locality of this uncertainty: to learn a smooth representation we not only need our encoder distribution to have an appropriate level of entropy, we also want the different possible encodings to be similar to each other, rather than spread out through the space.

Mathieu et al. (2019) further show that the success of many methods for disentanglement, and in particular the $\beta$-VAE (Higgins et al., 2017a), is rooted precisely in controlling this level of overlap. Controlling overlap is exactly what we need to carry out in our second suggested approach to defending VAEs. We therefore propose to train more robust VAEs by using the same ELBO regularisers as employed by disentanglement methods.

A further link between disentanglement and robustness is that disentangled representations may often also be both simpler and more human-interpretable. For example, if we were hypothetically able to learn an embedding for CelebA where one of the latent variables has a clear and smooth correspondence with skin tone, then it is likely to be difficult to conduct an adversarial attack to produce an image with a different skin tone without making substantial changes to this latent. Thus, disentangled representations might be more robust not only because they induce simpler and smoother mappings through regularisation: if they also encourage human-interpretable features, this should make it more difficult to conduct successful attacks from the perspective of human-perceived changes to the reconstruction. After all, the definition of a successful attack is rooted in what features of an image are perceived as important to a human observer: successful attacks are those which change the qualitative nature of the reconstruction, not those which induce the largest change in individual pixels. As such, there are strong links between disentanglement and robustness through common ideas of what it means to manipulate a datapoint such as an image.

3.2 Regularising for Robustness

There are a number of different disentanglement methods that one might consider using to train robust VAEs. Perhaps the simplest would be to use a $\beta$-VAE (Higgins et al., 2017a), wherein we up-weight the $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$ term in the VAE’s ELBO by a factor $\beta > 1$. Indeed this is the disentanglement approach that has been shown to most directly relate to overlap, with the value of $\beta$ transpiring to be directly linked to the entropy of the encoder (Mathieu et al., 2019).

However, the $\beta$-VAE is known to only provide disentanglement at the expense of substantial reductions in reconstruction quality, as the data likelihood term has, in effect, been down-weighted (Kim and Mnih, 2018; Chen et al., 2018; Mathieu et al., 2019). Furthermore, the level of disentanglement it can achieve is lower than that of more recent methods (Kim and Mnih, 2018; Chen et al., 2018).

Because of these shortfalls, we instead propose to regularise through penalisation of a total-correlation (TC) term as per Kim and Mnih (2018); Chen et al. (2018). This looks to directly force independence across the different latent dimensions in the aggregate posterior $q_\phi(z)$, such that the distribution of the data in the latent space (i.e. where we draw a datapoint at random and then pass it through the encoder) factorises across dimensions. As we are up-weighting the total correlation by $\beta$, we refer to this as the $\beta$-TCVAE as per Chen et al. (2018). This approach has been shown to provide improved disentanglement over the $\beta$-VAE, while also having a smaller deleterious effect on reconstruction quality.

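To be more precise, the TC-decomposition of the VAE objective presented in Hoffman and Johnson (2016); Makhzani et al. (2016); Kim and Mnih (2018); Chen et al. (2018); Esmaeili et al. (2019) reveals an explicit TC term of the variational posterior $q_\phi(z)$. The Factor and $\beta$-TCVAEs up-weight this term to produce the variational objective (with $\beta > 1$):

$$\mathcal{L}_{\beta\mathrm{TC}}(\theta, \phi; \beta, \mathcal{D}) = \mathbb{E}_{p_\mathcal{D}(x)\, q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - I_q(x; z) - \beta\, \mathrm{KL}\Big(q_\phi(z) \,\Big\|\, \prod_j q_\phi(z_j)\Big) - \sum_j \mathrm{KL}\left(q_\phi(z_j) \,\|\, p(z_j)\right), \tag{2}$$

where $p_\mathcal{D}(x)$ is the empirical data distribution, $I_q(x; z)$ is the mutual information between data and latents under the encoder, $q_\phi(z) = \mathbb{E}_{p_\mathcal{D}(x)}\left[q_\phi(z \mid x)\right]$ is the aggregate posterior, $j$ indexes over dimensions, and

$$\mathrm{KL}\Big(q_\phi(z) \,\Big\|\, \prod_j q_\phi(z_j)\Big) = \mathbb{E}_{q_\phi(z)}\Big[\log q_\phi(z) - \sum_j \log q_\phi(z_j)\Big]$$

is the TC term.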
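To build intuition for what the TC term measures, consider a case where it has a closed form. For a Gaussian with covariance matrix $\Sigma$, the total correlation is $\frac{1}{2}\left(\sum_j \log \Sigma_{jj} - \log \det \Sigma\right)$, which in two dimensions depends only on the correlation coefficient. The sketch below is an illustrative special case, not the estimator used for training, since the aggregate posterior of a VAE is not Gaussian in general.

```python
import math


def gaussian_total_correlation_2d(var_1, var_2, cov_12):
    """TC of a bivariate Gaussian: 0.5 * (log var_1 + log var_2 - log det Sigma)."""
    det = var_1 * var_2 - cov_12 ** 2
    if var_1 <= 0.0 or var_2 <= 0.0 or det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    return 0.5 * (math.log(var_1) + math.log(var_2) - math.log(det))


tc_independent = gaussian_total_correlation_2d(1.0, 1.0, 0.0)
tc_correlated = gaussian_total_correlation_2d(1.0, 1.0, 0.5)
```

Independent dimensions give a total correlation of zero, while any correlation between them makes it strictly positive, which is why penalising this quantity pushes the aggregate posterior towards a factorised form.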

Chen et al. (2018); Esmaeili et al. (2019) give a differentiable, stochastic approximation to $\mathbb{E}_{q_\phi(z)}\left[\log q_\phi(z)\right]$, making this decomposition usable as a training objective with stochastic gradient descent. However, this is a biased estimator: it is a nested expectation, for which unbiased, finite-variance estimators do not generally exist (Rainforth et al., 2018). As a result, large batch sizes are needed to obtain the desired behaviour; for small batch sizes its practical behaviour mimics that of the $\beta$-VAE (Mathieu et al., 2019, Appendix C).
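A minimal sketch of a minibatch-based TC estimate is given below. It follows the general form of minibatch-weighted sampling as commonly described for single-layer models with diagonal-Gaussian encoders: the aggregate posterior density at each sample is approximated by summing the encoder densities of all batch elements and normalising by the product of dataset size and batch size. The function and argument names are hypothetical, and this is an illustration under those assumptions rather than the authors' implementation.

```python
import math


def log_normal(z, mu, log_var):
    """Log-density of a univariate Gaussian N(mu, exp(log_var)) at z."""
    return -0.5 * (math.log(2.0 * math.pi) + log_var + (z - mu) ** 2 / math.exp(log_var))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mws_total_correlation(z_batch, mu_batch, log_var_batch, dataset_size):
    """Estimate TC from a minibatch.

    z_batch[i] is a sample from q(z | x_i); mu_batch[j] and log_var_batch[j]
    are the encoder parameters of q(z | x_j) for every element of the batch.
    """
    batch_size = len(z_batch)
    dims = len(z_batch[0])
    log_norm = math.log(dataset_size * batch_size)
    total = 0.0
    for z in z_batch:
        # per_dim[k][j] = log q(z_k | x_j) for dimension k and batch element j.
        per_dim = [
            [log_normal(z[k], mu_batch[j][k], log_var_batch[j][k]) for j in range(batch_size)]
            for k in range(dims)
        ]
        joint_terms = [sum(per_dim[k][j] for k in range(dims)) for j in range(batch_size)]
        log_q_joint = log_sum_exp(joint_terms) - log_norm
        log_q_marginals = sum(log_sum_exp(per_dim[k]) - log_norm for k in range(dims))
        total += log_q_joint - log_q_marginals
    return total / batch_size


# Hypothetical batch of two 2-dimensional samples from a dataset of 100 points.
tc_estimate = mws_total_correlation(
    z_batch=[[0.1, -0.3], [1.2, 0.8]],
    mu_batch=[[0.0, 0.0], [1.0, 1.0]],
    log_var_batch=[[0.0, 0.0], [0.0, 0.0]],
    dataset_size=100,
)
```

Because the logarithm sits outside the sum over the batch, the estimate is biased for any finite batch, which matches the remark above that large batches are needed.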

3.3 Adversarial Attacks on TC-Penalised VAEs

We now consider attacking these TC-penalised VAEs and demonstrate one of the key contributions of the paper: that empirically this form of regularisation makes adversarial attacks on VAEs via their latent space harder to carry out.

To do this, we first train them under the $\beta$-TCVAE objective (i.e. Eq (2)), jointly optimising $\theta$ and $\phi$ for a given $\beta$. Once trained, we then attack the models using the methods outlined in Section 2.2. Namely, we use an adversary that tries to find a distortion $d$ to the input which minimises the attack loss $\Delta$ as per Eq (1).

One possible metric for how successful such attacks have been is the value of the attack loss $\Delta$ that the adversary reaches. If the latent space distributions for the target input and for the distorted input match exactly, then the KL term in $\Delta$ vanishes and the model has been completely fooled: reconstructions from samples from the attacked posterior would be indistinguishable from those from the target posterior. Meanwhile, the larger the converged value of the attack loss, the less similar these distributions are and thus the more different the reconstructed image is to the adversarial target image.

We carry out these attacks for dSprites (Matthey et al., 2017), Chairs (Aubry et al., 2014) and 3D faces (Paysan et al., 2009), for a range of $\beta$ and $\lambda$ values. We pick values of $\lambda$ following the methodology in Tabacof et al. (2016); Gondim-Ribeiro et al. (2018), and use L-BFGS-B for gradient descent (Byrd et al., 1995). We also tried varying the dimensionality of the model's latent space, but found it had little effect on the effectiveness of the attack.

Figure 5: Attacker’s achieved loss $\Delta$ (i.e. Eq (1)) for (a) dSprites, (b) Chairs and (c) 3D Faces, for $\beta$-TCVAEs with different $\beta$ values. Higher losses are better. Shading corresponds to the 95% CI over the variation across attacks, performed for 10 images for each model configuration and taking 50 geometrically distributed values of $\lambda$.

In Figure 5 we show the effect on the attack loss $\Delta$ of varying $\beta$, averaged over different original input-target pairs and over different values of $\lambda$. Note that the plot is logarithmic in the values of the loss. We see a clear pattern for each dataset: the loss values reached by the adversary increase as we increase $\beta$ from the standard VAE (i.e. $\beta = 1$). This analysis is also borne out by visual inspection of the effectiveness of these attacks, as shown in Figure 1 and in a number of other example attacks for different datasets, $\beta$ and $\lambda$ in Appendix I. We will return to give further experimental results in Section 5.

An interesting point of note in Figure 5 is that in many cases the achievable adversarial loss actually starts to decrease again if $\beta$ is set too large. This is analogous to having too large an overlap when training for disentanglement as per Mathieu et al. (2019), or an overly restrictive information bottleneck (Alemi et al., 2017). The effect can be explained by thinking about what happens in the limit $\beta \to \infty$. Here there is no pressure in the objective to produce good reconstructions, so the encoder simply focuses on matching the prior regardless of the input. The prior does not use any information from the input, so the KL term in $\Delta$ becomes small for all possible distortions, even $d = 0$. For large but finite values of $\beta$ there will still be pressure to produce good reconstructions, but this will be dominated by the TC term, which is most easily minimised by simply encoding to the prior.

4 The Seatbelt-VAE

We are now armed with the fact that penalising the total correlation in the ELBO leads to more robust VAEs. However, this TC-penalisation in single-layer VAEs comes at the expense of model reconstruction quality (Chen et al., 2018). Our aim is to develop a model that is robust to adversarial attack while mitigating this trade-off between robustness and sample quality.

To achieve this, we now consider instead using hierarchical VAEs (Rezende et al., 2014; Sønderby et al., 2016; Zhao et al., 2017; Maaløe et al., 2019). These are known for their superior modelling capabilities and more accurate reconstructions. As these gains stem from using more complex hierarchical latent spaces, rather than less noisy encoders, this suggests they may be able to produce better reconstructions and generative capabilities while also remaining robust to adversarial attacks when appropriately regularised.

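The simplest hierarchical extension of conditional stochastic variables in the generative model is the Deep Latent Gaussian Model (DLGM) of Rezende et al. (2014). Here the forward model factorises as a chain

$$p_\theta(x, \vec{z}) = p_\theta(x \mid z^1) \left[\prod_{i=1}^{L-1} p_\theta(z^i \mid z^{i+1})\right] p(z^L),$$

where $\vec{z} = (z^1, \dots, z^L)$, each $p_\theta(z^i \mid z^{i+1})$ is a Gaussian distribution with mean and variance parameterised by deep nets, while $p(z^L)$ is an isotropic Gaussian.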

Unfortunately, we found that naively applying TC-penalisation to DLGM-style VAEs did not confer the improved robustness we observed in single-layer VAEs. We postulate that this observed weakness is inherent to the structure of chain factorisation in the generative model: this structure means that the data-likelihood depends solely on $z^1$, the bottom-most latent variable, and attackers need only manipulate $z^1$ to produce a successful attack.

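To account for this, we instead propose a generative model in which the likelihood $p_\theta(x \mid \vec{z})$ depends on all the latent variables in the chain $\vec{z} = (z^1, \dots, z^L)$, rather than just the bottom layer $z^1$. This leads to the following factorisation of the generative structure (which shares some similarity with that of BIVA (Maaløe et al., 2019)):

$$p_\theta(x, \vec{z}) = p_\theta(x \mid \vec{z}) \left[\prod_{i=1}^{L-1} p_\theta(z^i \mid z^{i+1})\right] p(z^L).$$
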
To construct the ELBO, we must further introduce an inference network $q_\phi(\vec{z} \mid x)$. For simplicity, and because it gives effective empirical performance, we use a chain factorisation for this as per Rezende et al. (2014):

$$q_\phi(\vec{z} \mid x) = q_\phi(z^1 \mid x) \prod_{i=1}^{L-1} q_\phi(z^{i+1} \mid z^i),$$

where each conditional distribution takes the form of a Gaussian. Note that, marginalising out intermediate layers, we see that $q_\phi(z^L \mid x)$ is a non-Gaussian, highly flexible distribution. A summary of the dependency structure for the generative and inference networks is shown in Figure 8.
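The difference between the two generative structures can be sketched in a few lines. The code below samples a bottom-up inference chain of Gaussian conditionals and then builds the input to the decoder in two ways: from the bottom layer only, as in a DLGM-style likelihood, or from every layer, as in the structure proposed here. Combining the layers by concatenation and the `toy_layer` conditional are assumptions made for illustration; this section does not specify how the likelihood network consumes the layers.

```python
import math
import random


def sample_inference_chain(x, layers, rng):
    """Sample z^1 ~ q(z^1 | x), then z^{i+1} ~ q(z^{i+1} | z^i); return [z^1, ..., z^L]."""
    parent, chain = x, []
    for layer in layers:
        mu, log_var = layer(parent)
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
        chain.append(z)
        parent = z
    return chain


def dlgm_decoder_input(chain):
    """DLGM-style likelihood: conditions on the bottom layer z^1 only."""
    return list(chain[0])


def seatbelt_decoder_input(chain):
    """Likelihood conditioned on every layer, combined here by concatenation (an assumption)."""
    return [value for z in chain for value in z]


def toy_layer(parent):
    # Hypothetical Gaussian conditional: mean halves the parent, fixed small variance.
    return [0.5 * p for p in parent], [-2.0 for _ in parent]


rng = random.Random(0)
chain = sample_inference_chain([1.0, -1.0], [toy_layer, toy_layer, toy_layer], rng)
bottom_only = dlgm_decoder_input(chain)
all_layers = seatbelt_decoder_input(chain)
```

With the bottom-only input, an adversary that controls the first latent controls everything the decoder sees, whereas with the all-layer input every latent in the chain influences the reconstruction.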




Figure 8: Seatbelt-VAE. Shaded lines indicate $\beta$-TC factorisation in a given node.

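To defend this model against adversarial attack, we further introduce a TC regularisation term as per the last section. We refer to the resulting model as the Seatbelt-VAE due to the protection it confers against adversarial attack. Because we find that, empirically, models of this type struggle to converge when TC-penalisation is applied to either the bottom-most layer or every layer, the Seatbelt-VAE only applies a TC-penalisation to the topmost latent variable $z^L$. In other words, following the Factor and $\beta$-TCVAEs, we up-weight the term for $z^L$ of the same form as

$$\mathrm{KL}\Big(q_\phi(z) \,\Big\|\, \prod_j q_\phi(z_j)\Big)$$

in Eq (2) to give

$$\mathcal{L}_{\mathrm{Seatbelt}}(\theta, \phi; \beta, \mathcal{D}) = \mathbb{E}_{p_\mathcal{D}(x)\, q_\phi(\vec{z} \mid x)}\left[\log \frac{p_\theta(x, \vec{z})}{q_\phi(\vec{z} \mid x)}\right] - (\beta - 1)\, \mathrm{KL}\Big(q_\phi(z^L) \,\Big\|\, \prod_j q_\phi(z^L_j)\Big), \tag{6}$$

where $p_\mathcal{D}(x)$ is the empirical data distribution, $q_\phi(z^L) = \mathbb{E}_{p_\mathcal{D}(x)}\left[q_\phi(z^L \mid x)\right]$ is the aggregate posterior over the top layer, and $j$ indexes over the coordinates in $z^L$.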

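Similar to Kim and Mnih (2018) and Chen et al. (2018), we can, in fact, reach Eq (6) by exposing this total-correlation term through an explicit decomposition of the KL (see Appendix C for a derivation). Specifically, now considering the ELBO for the whole dataset and using $\mathbb{E}_{p_\mathcal{D}(x)}$ to indicate the empirical average over the data, we have:

$$\begin{aligned} \mathcal{L}_{\mathrm{Seatbelt}}(\theta, \phi; \beta, \mathcal{D}) ={}& \mathbb{E}_{p_\mathcal{D}(x)\, q_\phi(\vec{z} \mid x)}\Big[\log p_\theta(x \mid \vec{z}) - \sum_{i=1}^{L-1} \log \frac{q_\phi(z^i \mid z^{i-1})}{p_\theta(z^i \mid z^{i+1})}\Big] - I_q(z^{L-1}; z^L) \\ &- \beta\, \mathrm{KL}\Big(q_\phi(z^L) \,\Big\|\, \prod_j q_\phi(z^L_j)\Big) - \sum_j \mathrm{KL}\left(q_\phi(z^L_j) \,\|\, p(z^L_j)\right), \end{aligned} \tag{7}$$

where $z^0 = x$ and $i$ indexes over the latent variables in the hierarchical chain. We see that, when $L = 1$, the Seatbelt-VAE reduces to a $\beta$-TCVAE, and for $\beta = 1$ it produces a DLGM with our augmented likelihood function.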
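The key step in this decomposition can be written out explicitly. The following is a sketch of the standard argument, assuming a prior that factorises across dimensions, $p(z^L) = \prod_j p(z^L_j)$, and writing $q_\phi(z^{L-1}, z^L) = q_\phi(z^{L-1})\, q_\phi(z^L \mid z^{L-1})$ for the aggregate joint over the top two layers (with $z^0 = x$ when $L = 1$). It is intended as a reading aid and does not replace the derivation in Appendix C.

```latex
\begin{aligned}
\mathbb{E}_{q_\phi(z^{L-1})}\Big[\mathrm{KL}\big(q_\phi(z^L \mid z^{L-1}) \,\|\, p(z^L)\big)\Big]
&= \mathbb{E}_{q_\phi(z^{L-1}, z^L)}\left[\log \frac{q_\phi(z^L \mid z^{L-1})}{q_\phi(z^L)}\right]
   + \mathrm{KL}\big(q_\phi(z^L) \,\|\, p(z^L)\big) \\
&= I_q(z^{L-1}; z^L)
   + \mathrm{KL}\Big(q_\phi(z^L) \,\Big\|\, \prod_j q_\phi(z^L_j)\Big)
   + \sum_j \mathrm{KL}\big(q_\phi(z^L_j) \,\|\, p(z^L_j)\big).
\end{aligned}
```

The first line adds and subtracts $\log q_\phi(z^L)$ inside the expectation; the second adds and subtracts $\sum_j \log q_\phi(z^L_j)$ and uses the factorised prior. Weighting the middle term by $\beta$ instead of 1 gives the TC-penalised objective.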

(a) Faces (b) Faces (c) Chairs (d) Chairs
Figure 13: Numerically measuring the robustness of Seatbelt-VAEs ($L = 4$) and $\beta$-TCVAEs for different values of $\beta$. Note that the $\beta$-TCVAE with $\beta = 1$ corresponds to the standard VAE. Sub-figures (a) and (c) show the negative adversarial likelihood $-\log p_\theta(x^t \mid z^*)$ of a target image $x^t$ given an attacked latent representation $z^*$. Larger values of $-\log p_\theta(x^t \mid z^*)$ correspond to less successful adversarial attacks. Sub-figures (b) and (d) instead show the adversarial loss $\Delta$, where higher values also indicate robustness to attacks. Note that these loss plots have a logarithmic axis. Shading in the $\beta$-TCVAE results corresponds to the 95% CI over variation for 10 images for each model configuration, taking 50 geometrically distributed values of $\lambda$. For the Seatbelt-VAE, we instead fix $L = 4$ and the remaining model settings, such that the shading corresponds to the 95% CI over variation for 10 images and the 50 values of $\lambda$. As we go to the largest values of $\beta$ for both Chairs and 3D Faces, $\Delta$ grows markedly and $-\log p_\theta(x^t \mid z^*)$ doubles for the Seatbelt-VAE. $\beta$-TCVAEs do not experience such a large uptick in adversarial loss and negative adversarial likelihood. These results tell us that the Seatbelt-VAE can offer very strong protection from the adversarial attacks studied. See Appendix H.4 for heatmaps detailing these metrics for a range of Seatbelt depths (i.e. varying $L$).

As with the $\beta$-TCVAE, training using stochastic gradient ascent with minibatches of the data is complicated by the presence of aggregate posteriors which depend on the entire dataset. To deal with this, we derive a minibatch estimator that generalises to disentangled hierarchical VAEs the Minibatch-Weighted-Sampling estimator proposed in Chen et al. (2018); Esmaeili et al. (2019) in the context of $\beta$-TCVAEs. As discussed in Section 3.2, this estimator is inherently biased for finite dataset sizes, such that large batch sizes are required to provide a good estimate of the TC. See Appendix D for further details.

4.1 Attacking the Seatbelt-VAE

In the Seatbelt-VAE the likelihood over data is conditioned on all layers, so manipulations to any layer have the potential to be significant. We focus on simultaneously attacking all layers of the Seatbelt-VAE, noting that, as shown in the Appendix, this is more effective than just targeting the top or base layers individually. Hence our adversarial objective for the Seatbelt-VAE is based on the following generalisation of that introduced in Tabacof et al. (2016); Gondim-Ribeiro et al. (2018) to attack all the layers at the same time:

$$\Delta_{\mathrm{Seatbelt}}(x, d, x^t; \lambda) = \sum_{i=1}^{L} \mathrm{KL}\left(q_\phi(z^i \mid x + d) \,\|\, q_\phi(z^i \mid x^t)\right) + \lambda \|d\|_2.$$


5 Experiments

We now demonstrate that Seatbelt-VAEs confer superior robustness to $\beta$-TCVAEs and standard VAEs, while preserving the ability to reconstruct inputs effectively. Through this, we demonstrate that Seatbelt-VAEs are a powerful tool for learning robust deep generative models.

5.1 Methods

We first expand on our experiments in Section 3.3 and perform a battery of adversarial attacks on each of the introduced models. We randomly sample 10 input-target pairs for each dataset. As in Tabacof et al. (2016); Gondim-Ribeiro et al. (2018), for each image pair we consider 50 different, geometrically distributed values of $\lambda$. Thus each model undergoes 500 attacks for each attack mode. As before, we used L-BFGS-B for gradient descent (Byrd et al., 1995). We perform these experiments on Chairs (Aubry et al., 2014), 3D faces (Paysan et al., 2009), and CelebA (Liu et al., 2015). Additional results for dSprites (Higgins et al., 2017a) can be found in Appendices H, I, and J. We used the same encoder and decoder architectures as Chen et al. (2018) for each dataset. Details of neural network architectures and training are given in Appendix E.
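The attack grid described above can be sketched as follows. The helper `geometric_grid` is hypothetical, and the endpoints passed to it are placeholders chosen only for illustration, since the actual range of penalty values is not reproduced here.

```python
def geometric_grid(low, high, count):
    """Return `count` values from low to high with a constant ratio between neighbours."""
    if count < 2 or low <= 0.0 or high <= 0.0:
        raise ValueError("need count >= 2 and positive endpoints")
    ratio = (high / low) ** (1.0 / (count - 1))
    return [low * ratio ** k for k in range(count)]


# Placeholder endpoints, not the values used in the experiments.
lambdas = geometric_grid(1e-3, 1e3, 50)

image_pairs = 10
# Every input-target pair is attacked once per penalty value.
attacks_per_model = image_pairs * len(lambdas)
```

Sweeping the penalty geometrically covers several orders of magnitude with few runs, which probes both nearly unconstrained distortions and distortions that are heavily penalised.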

We evaluate the effectiveness of adversarial attacks using the attack objective $\Delta$ as before, along with $-\log p_\theta(x^t \mid z^*)$, the negative likelihood of the target image ($x^t$) given the embedding generated by the adversary ($z^*$). As with $\Delta$, higher values of this metric denote a less successful attack.

(a) Chairs ELBO (b) 3D Faces ELBO (c) Reconstructions
Figure 17: Plots showing the effect of varying $\beta$ on the reconstructions of TC-penalised models. In sub-figures (a) and (b) we plot the final ELBO of the two TC-penalised models trained on the Chairs and 3D faces datasets, but calculated without the additional penalisation that was applied during training [Eqs (2) and (7) respectively]. Shading corresponds to the 95% CI over the variation across model configurations for the $\beta$-TCVAE and the Seatbelt-VAE. As $\beta$ increases, the ELBO degrades to a much lesser degree for the Seatbelt-VAE than for the $\beta$-TCVAE. Sub-figure (c) serves as a visual confirmation of these results. The top row shows CelebA input data. The bottom row, showing the reconstructions from a Seatbelt-VAE, clearly maintains facial identity better than those from a $\beta$-TCVAE in the middle row. Many of the individuals’ finer facial features are lost by the $\beta$-TCVAE but are maintained by the Seatbelt-VAE. Combined, these plots show that the resilience of the Seatbelt-VAE's reconstruction quality to increasing $\beta$ is visually perceptible as well as measurable. Substantial additional visual results are given in the Appendix.


(a) $\beta$-TCVAE (b) Seatbelt-VAE
Figure 20: Robustness of the $\beta$-TCVAE and the Seatbelt-VAE to noising of inputs. Here we add Gaussian noise to datapoints drawn from the CelebA dataset and feed them into the encoder to create a noisy embedding. We then evaluate a model’s ability to decode this noisy embedding to the original non-noised data by measuring the distribution of the likelihood of the original data given the noisy embedding, for which higher values indicate better denoising. We show these likelihood values as density plots for the $\beta$-TCVAE in (a) and for the Seatbelt-VAE in (b). Note the axis scalings are different for each. We see that, for both models, using $\beta > 1$ produces autoencoders that are better at denoising their inputs. Namely, the mean of the density shifts dramatically to higher values for $\beta > 1$ relative to $\beta = 1$. In other words, for both these models, the likelihood of the dataset in the noisy setting is much closer to that of the non-noisy dataset when $\beta > 1$. Also note that this likelihood is generally higher for the Seatbelt-VAE than for the $\beta$-TCVAE.

5.2 Visual Appraisal of Attacks

We first visually appraise the effectiveness of attacks on vanilla VAEs, $\beta$-TCVAEs and Seatbelt-VAEs. As mentioned in Section 1, Figure 1 shows the results of latent space attacks on three models trained on CelebA. It is apparent that the $\beta$-TCVAE provides additional resilience to the attacks compared with the standard VAE. Furthermore, this figure shows that the Seatbelt-VAE was sufficiently robust to almost completely thwart the adversary, producing an adversarial reconstruction that still resembles the original input. Moreover, this was achieved while still producing a clearer non-adversarial reconstruction than either the VAE or the $\beta$-TCVAE. See Appendix I for more examples.

One might expect that adversarial attacks targeting a single generative factor underpinning the data would be easier for the attacker. However, we find that TC-penalised models protect effectively against these attacks as well. For instance, see Appendix I.1 for plots showing an attacker attempting to rotate a dSprites heart.

5.3 Numerical Appraisal of Robustness

Having ascertained perceptually that the Seatbelt-VAE offers the strongest protection against adversarial attack, we now demonstrate this quantitatively. Figure 13 shows $-\log p_\theta(x^t \mid z^*)$ and $\Delta$ over a range of datasets and values of $\beta$ for Seatbelt-VAEs with $L = 4$ and for $\beta$-TCVAEs. This figure demonstrates that the combination of depth and high TC-penalisation offers the best protection against adversarial attacks, and that the Seatbelt extension confers much greater protection than a single-layer $\beta$-TCVAE.

In Appendix H.1, we also calculate the $L_2$ distance between target images and adversarial outputs and show that the loss of effectiveness of adversarial attacks is not due to the degradation of reconstruction quality from increasing $\beta$. We also include results in Appendix H for “output” attacks (Gondim-Ribeiro et al., 2018), which we find to be generally less effective. Here, the attacker directly tries to reduce the $L_2$ distance between the reconstructed output and the target image. Again, TC-penalised models, and in particular Seatbelt-VAEs, outperformed standard VAEs.

5.4 ELBO and Reconstruction Quality

Though Seatbelt-VAEs offer better protection against adversarial attack than β-TCVAEs, we also motivate their utility by way of their reconstruction quality. In Figure 17 we plot the final ELBO of the two TC-penalised models, but calculated without the additional penalisation that was applied during training. We further show the effect of depth and TC-penalisation on reconstructions of CelebA. Both these plots show that Seatbelt-VAEs’ reconstructions are more resilient to increasing β than β-TCVAEs’. This resilience is both visually perceptible and measurable.
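To make concrete what evaluating the ELBO "without the additional penalisation" can mean, the following sketch writes out the standard decomposition for a single-layer β-TCVAE. It assumes a factorised prior p(z) = ∏_j p(z_j) and uses q(z) for the aggregate posterior; the notation is an assumption and may differ from the equations earlier in the paper, and the hierarchical terms of the Seatbelt-VAE objective are not shown.

```latex
% Training objective with TC weight \beta (assumed notation)
\mathcal{L}_{\beta}
  = \mathbb{E}_{p(x)}\,\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
  - I_q(x; z)
  - \beta\,\mathrm{TC}\big(q(z)\big)
  - \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big),
\qquad
\mathrm{TC}\big(q(z)\big) = \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big).

% Setting \beta = 1 recovers the unpenalised ELBO averaged over the data
\mathcal{L}_{\beta = 1}
  = \mathbb{E}_{p(x)}\Big[\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
  - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\Big].
```

Under these assumptions, comparing models trained with different β on the β = 1 quantity puts them on a common scale, because the extra (β − 1) TC term used in training no longer contributes to the reported number.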

5.5 Noised Input Data

We finish by testing robustness to unstructured attacks, where we add noise to the inputs and evaluate each model’s ability to reconstruct the original, that is, its ability to denoise inputs. See Figure 20 for an illustration of the denoising properties of TC-penalised models trained on the CelebA dataset. This ability to denoise may partially explain these models’ robustness to more structured attacks.
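The evaluation protocol described here can be sketched as follows: corrupt a clean input with noise, reconstruct the noisy input, and compare both against the clean original. This is a minimal sketch under stated assumptions. The model is a hypothetical toy linear projection rather than a trained VAE, and the Gaussian noise, its scale `noise_std` and the function names are illustrative choices, not the settings used in the experiments.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sq_dist(a, b):
    """Squared L2 distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Hypothetical toy "autoencoder": linear encoder E with orthonormal rows,
# and its transpose as the decoder (a stand-in, not a VAE).
E = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5]]
E_T = [list(col) for col in zip(*E)]

def reconstruct(x):
    """Encode then decode: x -> E^T E x."""
    return matvec(E_T, matvec(E, x))

def denoising_errors(x, noise_std, rng):
    """Return (error of noisy input, error of its reconstruction) w.r.t. x."""
    noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
    return sq_dist(noisy, x), sq_dist(reconstruct(noisy), x)

rng = random.Random(0)           # fixed seed so the run is repeatable
x_clean = [1.0, 0.0, 1.0, 0.0]   # lies in the subspace the toy model keeps
input_err, recon_err = denoising_errors(x_clean, noise_std=0.3, rng=rng)
print(input_err, recon_err)
```

In this toy setting the reconstruction is never further from the clean input than the noisy input is, because the projection discards the noise components outside the retained subspace. For a real model, the same comparison would be averaged over many inputs and noise draws.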

6 Conclusion

We have shown that VAEs can be rendered more robust both to adversarial attacks and to noising of the inputs by adopting a TC-penalisation in the evidence lower bound. This increase in robustness can be strengthened even further by using our proposed hierarchical VAE, the Seatbelt-VAE, which uses a carefully chosen generative structure where the likelihood makes use of all the latent variables.

Designing robust VAEs is becoming pressing as they are increasingly deployed as subcomponents in larger pipelines. As we have shown, methods typically used for disentangling, motivated by their ability to provide interpretable representations, also confer robustness to VAEs. Studying the beneficial effects of these methods is starting to come to the fore of research into VAEs (Kumar and Poole, 2019). We hope this work sparks further interest in the interplay between disentangling, regularisation, and model robustness.


  1. Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey. IEEE Access 6, pp. 14410–14430.
  2. Deep Variational Information Bottleneck. In ICLR.
  3. Fixing a Broken ELBO. In ICML.
  4. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3762–3769.
  5. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. 16 (5), pp. 1190–1208.
  6. Isolating Sources of Disentanglement in Variational Autoencoders. In NeurIPS.
  7. Structured Disentangled Representations. In AISTATS.
  8. Resisting Adversarial Attacks Using Gaussian Mixture Variational Autoencoders. In AAAI.
  9. Motivating the Rules of the Game for Adversarial Example Research. CoRR.
  10. Adversarial Attacks on Variational Autoencoders. CoRR.
  11. World Models. In NeurIPS.
  12. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR.
  13. DARLA: Improving Zero-Shot Transfer in Reinforcement Learning. In ICML.
  14. ELBO surgery: yet another way to carve up the variational evidence lower bound. In NeurIPS.
  15. Disentangling by Factorising. In NeurIPS.
  16. Auto-encoding Variational Bayes. In ICLR.
  17. Adversarial Examples for Generative Models. In IEEE Security and Privacy Workshops, pp. 36–42.
  18. On Implicit Regularization in β-VAE. In NeurIPS Bayesian Deep Learning Workshop.
  19. Grammar Variational Autoencoder. In ICML.
  20. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV).
  21. BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling. In NeurIPS.
  22. Adversarial Autoencoders. In ICLR.
  23. Disentangling Disentanglement in Variational Autoencoders. In ICML.
  24. dSprites: Disentanglement testing Sprites dataset.
  25. A 3D face model for pose and illumination invariant face recognition. In 6th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2009, pp. 296–301.
  26. On nesting Monte Carlo estimators. In ICML.
  27. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML.
  28. Toward the First Adversarially Robust Neural Network Model on MNIST. In ICLR.
  29. Learning disentangled representations with semi-supervised deep generative models. In NeurIPS.
  30. Ladder Variational Autoencoders. In NeurIPS.
  31. Adversarial Images for Variational Autoencoders. In NIPS Workshop on Adversarial Training.
  32. Lossy Image Compression with Compressive Autoencoders. In ICLR.
  33. Practical Lossless Compression with Latent Variables using Bits Back Coding. In ICLR.
  34. Variational Autoencoder for Semi-supervised Text Classification. In AAAI, pp. 3358–3364.
  35. Learning Hierarchical Features from Generative Models. In ICML.