Adversarial Dropout for Supervised and Semi-Supervised Learning
Abstract
Recently, training with adversarial examples, which are generated by adding a small but worst-case perturbation to input examples, has improved the generalization performance of neural networks. In contrast to perturbing individual inputs to enhance generality, this paper introduces adversarial dropout, a minimal set of dropouts that maximizes the divergence between 1) the training supervision and 2) the outputs of the network with the dropouts. The identified adversarial dropouts are used to automatically reconfigure the neural network during training, and we demonstrate that simultaneous training on the original and the reconfigured network improves the generalization performance of supervised and semi-supervised learning tasks on MNIST, SVHN, and CIFAR-10. We analyzed the trained model to find the reasons for the improvement and found that adversarial dropout increases the sparsity of neural networks more than standard dropout does. Finally, we also proved that adversarial dropout is a regularization term with a rank-valued hyperparameter, which differs from the continuous-valued parameter that usually specifies the strength of regularization.
Introduction
Deep neural networks (DNNs) have demonstrated significant improvements on benchmark performance in a wide range of applications. As neural networks become deeper, model complexity also increases quickly, and this complexity makes DNNs prone to overfitting the training data set. Several techniques [Hinton et al.2012, Poole, Sohl-Dickstein, and Ganguli2014, Bishop1995b, Lasserre, Bishop, and Minka2006] have emerged over the past years to address this challenge, and dropout has become one of the dominant methods due to its simplicity and effectiveness [Hinton et al.2012, Srivastava et al.2014].
Dropout randomly disconnects neural units during training as a method to prevent feature co-adaptation [Baldi and Sadowski2013, Wager, Wang, and Liang2013, Wang and Manning2013, Li, Gong, and Yang2016]. The earlier work by Hinton et al. (2012) and Srivastava et al. (2014) interpreted dropout as an extreme form of model combination, a.k.a. a model ensemble, in which the combined networks share most of their parameters. They proposed learning the model combination by minimizing an expected loss of models perturbed by dropout. They also pointed out that the output of dropout is the geometric mean of the outputs from the model ensemble with the shared parameters. Extending the weight-sharing perspective, several studies [Baldi and Sadowski2013, Chen et al.2014, Jain et al.2015] analyzed the ensemble effects of dropout.
The recent work by Laine and Aila (2016) enhanced the ensemble effect of dropout by adding self-ensembling terms. The self-ensembling term is constructed as a divergence between two neural networks sampled through dropout. By minimizing the divergence, the sampled networks learn from each other, and this practice is similar to the working mechanism of the ladder network [Rasmus et al.2015], which builds a connection between an unsupervised and a supervised neural network. Our method also follows the principles of self-ensembling, but we apply the adversarial training concept to the sampling of neural network structures through dropout.
While the community was developing dropout, adversarial training became another focus of research. Szegedy et al. (2013) showed that a neural network can be vulnerable to a very small perturbation in the training data set if the noise direction is sensitive to the model's label assignment $y$ given $x$, even when the perturbation is so small that human eyes cannot discern the difference. They showed empirically that robustly training models against adversarial perturbation is effective in reducing test errors. However, their method of identifying adversarial perturbations contains a computationally expensive inner loop. To address this cost, Goodfellow et al. (2014) suggested an approximation method, through the linearization of the loss function, that is free from the loop. Adversarial training can be conducted on supervised learning because the adversarial direction can be defined when true labels are known. Miyato et al. (2015) proposed a virtual adversarial direction to apply adversarial training to semi-supervised learning, where the true label may not be available. So far, the adversarial perturbation has been defined as a unit vector of additive noise imposed on the input or the embedding spaces [Szegedy et al.2013, Goodfellow, Shlens, and Szegedy2014, Miyato et al.2015].
Our proposed method, adversarial dropout, can be viewed from both the dropout and the adversarial training perspectives. Adversarial dropout can be interpreted as dropout masks whose direction is optimized adversarially to the model's label assignment. However, adversarial dropout differs from traditional adversarial training with additive perturbation: adversarial dropout induces a sparse structure in the neural network, while additive perturbation does not directly change the structure of the network.
Figure 1 describes the proposed loss function construction of adversarial dropout compared to 1) the recent dropout model, which is the Π model [Laine and Aila2016], and 2) adversarial training [Goodfellow, Shlens, and Szegedy2014, Miyato et al.2015]. When we compare adversarial dropout to the Π model, both divergence terms are similarly computed from two different dropped networks, but adversarial dropout uses an optimized dropped network to adopt the concept of adversarial training. When we compare adversarial dropout to adversarial training, the divergence term of adversarial training is computed from one network structure with two training examples: clean and adversarial examples. In contrast, the divergence term of adversarial dropout is defined with two network structures: a randomly dropped network and an adversarially dropped network.
Our experiments demonstrated that 1) adversarial dropout improves the performance on MNIST supervised learning compared to the dropout suggested by the Π model, and 2) adversarial dropout showed state-of-the-art performance on the semi-supervised learning task on SVHN and CIFAR-10 when compared with the most recent techniques of dropout and adversarial training. Following the performance comparison, we visualize the neural network structure obtained with adversarial dropout to illustrate that adversarial dropout yields a sparser structure than standard dropout. Finally, we theoretically show a distinctive characteristic of adversarial dropout: it specifies the strength of the regularization effect by a rank-valued parameter, while adversarial training specifies the strength with a conventional continuous-valued scale.
Preliminaries
Before introducing adversarial dropout, we briefly introduce stochastic noise layers for deep neural networks. Afterwards, we review adversarial training and temporal ensembling, or the Π model, because the two methods are closely related to adversarial dropout.
Noise Layers
Corrupting training data with noise is a well-known method to stabilize prediction [Bishop1995a, Maaten et al.2013, Wager, Wang, and Liang2013]. This section describes two types of noise injection techniques: additive Gaussian noise and dropout noise.
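Let $h^{(l)}$ denote the $l$-th hidden variables in a neural network; this layer can be replaced with a noisy version $\tilde{h}^{(l)}$. The noise can take the following forms.

Additive Gaussian noise: $\tilde{h}^{(l)} = h^{(l)} + \gamma$, where $\gamma \sim \mathcal{N}(0, \sigma^2 I)$ with the parameter $\sigma^2$ restricting the degree of noise.

Dropout noise: $\tilde{h}^{(l)} = h^{(l)} \odot \epsilon$, where $\odot$ is the elementwise product of two vectors, and the elements of the noise vector are $\epsilon_i \sim \mathrm{Bernoulli}(1 - p)$ with the parameter $p$. Put simply, $\tilde{h}_i^{(l)} = 0$ with probability $p$ and $\tilde{h}_i^{(l)} = h_i^{(l)}$ with probability $1 - p$.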
Both additive Gaussian noise and dropout noise are regularization techniques that improve the generalization of the trained model, but they have different properties. Additive Gaussian noise increases the margin of decision boundaries, while dropout noise pushes a model toward sparsity [Srivastava et al.2014]. These noise layers can be easily included in a deep neural network. For example, a dropout layer can be placed between two convolutional layers. Similarly, a layer of additive Gaussian noise can be placed on the input layer.
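The two noise layers described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, `p` is taken to be the probability of dropping a unit, and no rescaling of the kept units is applied.

```python
import random

def additive_gaussian_noise(h, sigma, rng):
    """Return h + gamma, where each gamma_i is drawn from N(0, sigma^2)."""
    return [h_i + rng.gauss(0.0, sigma) for h_i in h]

def dropout_noise(h, p, rng):
    """Return (h * eps, eps), where eps_i = 0 with probability p and 1 otherwise."""
    eps = [0 if rng.random() < p else 1 for _ in h]
    return [h_i * e_i for h_i, e_i in zip(h, eps)], eps

rng = random.Random(0)                     # fixed seed for reproducibility
h = [0.5, -1.2, 3.0, 0.0, 2.1]             # toy hidden activations
noisy = additive_gaussian_noise(h, sigma=0.1, rng=rng)
dropped, eps = dropout_noise(h, p=0.5, rng=rng)
```

The mask `eps` is returned alongside the masked activations because the later sections treat the mask itself as the object being optimized.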
Self-Ensembling Model
The recently reported self-ensembling (SE) model [Laine and Aila2016], or Π model, constructs a loss function that minimizes the divergence between two outputs from two sampled dropout neural networks with the same input stimulus. Their suggested regularization term can be interpreted as the following:
$$ \mathcal{L}_{SE}(x; \theta) := D[f_\theta(x, \epsilon^1), f_\theta(x, \epsilon^2)] \tag{1} $$
where $\epsilon^1$ and $\epsilon^2$ are randomly sampled dropout noises in a neural network $f_\theta$, whose parameters are $\theta$. Also, $D[y, y']$ is a nonnegative function that represents the distance between two output vectors, $y$ and $y'$. For example, $D$ can be the cross-entropy function, $D[y, y'] = -\sum_i y_i \log y'_i$, where $y$ and $y'$ are vectors whose $i$-th elements represent the probability of the $i$-th class. Because the divergence is calculated between the outputs of two differently dropped structures and needs no label, this regularization can be used for semi-supervised learning. The Π model is based on the principle of the Γ model, which is the ladder network by Rasmus et al. (2015). Our proposed method, adversarial dropout, can be seen as a special case of the Π model when one dropout neural network is adversarially sampled.
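As a rough illustration of the self-ensembling term, the sketch below evaluates the divergence between two outputs of the same small network under two independently sampled dropout masks. The network (one ReLU hidden layer, softmax output), the quadratic-error divergence, and all names are assumptions made for this example, not the architecture used in the experiments.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, W1, W2, eps):
    """Network output with dropout mask eps applied to the hidden layer."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    h = [hi * ei for hi, ei in zip(h, eps)]
    return softmax([sum(w * hi for w, hi in zip(row, h)) for row in W2])

def quadratic_error(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sample_mask(size, p_drop, rng):
    return [0 if rng.random() < p_drop else 1 for _ in range(size)]

def pi_model_term(x, W1, W2, p_drop, rng):
    """Divergence between two outputs obtained with two random dropout masks."""
    eps1 = sample_mask(len(W1), p_drop, rng)
    eps2 = sample_mask(len(W1), p_drop, rng)
    return quadratic_error(forward(x, W1, W2, eps1), forward(x, W1, W2, eps2))

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 4 hidden units
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 2 classes
x = [0.5, -0.3, 0.8]
term = pi_model_term(x, W1, W2, p_drop=0.5, rng=rng)
```

No label appears anywhere in `pi_model_term`, which is why a term of this kind can be evaluated on unlabeled data.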
Adversarial Training
Adversarial dropout follows the training mechanism of adversarial training, so we briefly introduce a generalized formulation of adversarial training. The basic concept of adversarial training (AT) is the incorporation of adversarial examples into the training process. The additional loss function that includes adversarial examples [Szegedy et al.2013, Goodfellow, Shlens, and Szegedy2014, Miyato et al.2015] can be defined in a generalized form:
$$ \mathcal{L}_{AT}(x, y; \theta, \delta) := D[g(x, y, \theta), f_\theta(x + \gamma^{adv})], \quad \gamma^{adv} := \arg\max_{\gamma;\, \|\gamma\|_\infty \le \delta} D[g(x, y, \theta), f_\theta(x + \gamma)] \tag{2} $$
Here, $\theta$ is a set of model parameters, and $\delta$ is a hyperparameter controlling the intensity of the adversarial perturbation $\gamma^{adv}$. The function $f_\theta(x)$ is the output distribution of the neural network to be learned. Adversarial training can be diversified by varying the definition of $g(x, y, \theta)$, as follows.

Adversarial training (AT) [Goodfellow, Shlens, and Szegedy2014, Kurakin, Goodfellow, and Bengio2016] defines $g(x, y, \theta)$ as $g(y)$, ignoring $x$ and $\theta$, so $g(y)$ is a one-hot encoding vector of $y$.

Virtual adversarial training (VAT) [Miyato et al.2015, Miyato, Dai, and Goodfellow2016] defines $g(x, y, \theta)$ as $f_{\hat{\theta}}(x)$, where $\hat{\theta}$ is the current estimated parameter. This training method does not use any information from $y$ in the adversarial part of the loss function, which enables the adversarial part to be used as a regularization term for semi-supervised learning.
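To make the supervised (AT) case concrete, the following sketch builds an adversarial example with the fast gradient sign method for a binary logistic regression model, where the input gradient has a closed form. The model, the names, and the toy numbers are assumptions chosen for illustration; VAT would replace the label `y` with the model's own current prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(x, y, w, b):
    """Cross entropy between the label y in {0, 1} and the model output."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def fgsm_example(x, y, w, b, delta):
    """Move each input coordinate by delta in the sign of the loss gradient."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad_x = [(p - y) * wi for wi in w]          # d loss / d x for this model
    sign = lambda v: (v > 0) - (v < 0)
    return [xi + delta * sign(g) for xi, g in zip(x, grad_x)]

x, y = [0.2, -0.4, 1.0], 1
w, b = [1.5, -2.0, 0.5], 0.1
x_adv = fgsm_example(x, y, w, b, delta=0.1)
clean_loss = logistic_loss(x, y, w, b)
adv_loss = logistic_loss(x_adv, y, w, b)         # used as the extra AT loss term
```

For this convex loss the perturbed input never has a lower loss than the clean one, which is the property the adversarial term relies on.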
Method
This section presents our adversarial dropout that combines the ideas of adversarial training and dropout. First, we formally define the adversarial dropout. Second, we propose a training algorithm to find the instantiations of adversarial dropouts with a fast approximation method.
General Expression of Adversarial Dropout
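We now propose adversarial dropout (AdD), an adversarial training method that chooses the dropout condition to which the model's label assignment is most sensitive. We use $f_\theta(x, \epsilon)$ as the output distribution of a neural network with a dropout mask $\epsilon$. The additional loss function that incorporates adversarial dropout is described below.
$$ \mathcal{L}_{AdD}(x, y, \epsilon^s; \theta, \delta) := D[g(x, y, \theta), f_\theta(x, \epsilon^{adv})], \quad \epsilon^{adv} := \arg\max_{\epsilon;\, \|\epsilon^s - \epsilon\|_2 \le \delta H} D[g(x, y, \theta), f_\theta(x, \epsilon)] \tag{3} $$
Here, $D[\cdot, \cdot]$ indicates a divergence function; $g(x, y, \theta)$ represents an adversarial target function that can be diversified by its definition; $\epsilon^{adv}$ is an adversarial dropout mask under the function $f_\theta$, where $\theta$ is a set of model parameters; $\epsilon^s$ is a sampled random dropout mask instance; $\delta$ is a hyperparameter controlling the intensity of the noise; and $H$ is the dropout layer dimension.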
We introduce the boundary condition, $\|\epsilon^s - \epsilon\|_2 \le \delta H$, which restricts the number of differences between the two dropout conditions. An adversarial dropout mask should differ only minimally from the random dropout mask. Without this constraint, the network with adversarial dropout may become a neural network layer without connections. By restricting the adversarial dropout to stay close to the random dropout, we avoid finding such a degenerate layer, which does not support back-propagation. The distance between two dropout conditions could also be measured with the graph edit distance or the Jaccard distance; in the supplementary material, we proved that both can be abstracted as Euclidean distances between two vectors.
In the general form of adversarial training, the key point is the existence of the linear perturbation $\gamma^{adv}$. We can interpret the input with the adversarial perturbation as the adversarial noise input $\tilde{x}^{adv} = x + \gamma^{adv}$. From this perspective, the authors of adversarial training limited the adversarial direction to the space of the additive Gaussian noise $\tilde{x} = x + \gamma^0$, where $\gamma^0$ is a sampled Gaussian noise on the input layer. In contrast, adversarial dropout can be considered as a noise space generated by masking hidden units, $\tilde{h}^{adv} = h \odot \epsilon^{adv}$, where $h$ denotes the hidden units and $\epsilon^{adv}$ is an adversarially selected dropout condition. If we regard adversarial training as a Gaussian additive perturbation on the input, the perturbation is linear in nature, but adversarial dropout can be a nonlinear perturbation if it is imposed on multiple layers.
Supervised Adversarial Dropout
Supervised adversarial dropout (SAdD) defines $g(x, y, \theta)$ as $g(y)$, so $g(y)$ is a one-hot vector of $y$, as in a typical neural network. The divergence term from Formula 3 can be converted as follows:
$$ \mathcal{L}_{SAdD}(x, y, \epsilon^s; \theta, \delta) := D[g(y), f_\theta(x, \epsilon^{adv})], \quad \epsilon^{adv} := \arg\max_{\epsilon;\, \|\epsilon^s - \epsilon\|_2 \le \delta H} D[g(y), f_\theta(x, \epsilon)] \tag{4} $$
In this case, the divergence term can be seen as the pure loss function for supervised learning with dropout regularization. However, $\epsilon^{adv}$ is selected to maximize the divergence between the true information and the output of the dropout network, so $\epsilon^{adv}$ eventually becomes a mask on the most contributing features. This adversarial mask gives a learning opportunity to neurons, so-called dead filters, that were considered to be less informative.
Virtual Adversarial Dropout
Virtual adversarial dropout (VAdD) defines $g(x, y, \theta) = f_\theta(x, \epsilon^s)$. This allows the loss function to be used as a regularization term for semi-supervised learning. The divergence term in Formula 3 can be represented as below:
$$ \mathcal{L}_{VAdD}(x, y, \epsilon^s; \theta, \delta) := D[f_\theta(x, \epsilon^s), f_\theta(x, \epsilon^{adv})], \quad \epsilon^{adv} := \arg\max_{\epsilon;\, \|\epsilon^s - \epsilon\|_2 \le \delta H} D[f_\theta(x, \epsilon^s), f_\theta(x, \epsilon)] \tag{5} $$
VAdD is a special case of a self-ensembling model with two dropouts: 1) a dropout, $\epsilon^s$, sampled from a random distribution with a hyperparameter, and 2) a dropout, $\epsilon^{adv}$, composed to maximize the divergence function of the learner, which is the concept of the noise injection from virtual adversarial training. The two dropouts create a regularization as in virtual adversarial training, and the inference procedure optimizes the parameters to reduce the divergence between the random dropout and the adversarial dropout. This optimization triggers the self-ensemble learning of [Laine and Aila2016]. However, adversarial dropout differs from the previous self-ensembling because one dropout is induced by the adversarial setting, not by random sampling.
Learning with Adversarial Dropout
The full objective function for learning with adversarial dropout is given by
$$ \mathcal{L}(y, x, \epsilon^s; \theta) + \lambda\, \mathcal{L}_{AdD}(x, y, \epsilon^s; \theta, \delta) \tag{6} $$
where $\mathcal{L}(y, x, \epsilon^s; \theta)$ is the negative log-likelihood for $y$ given $x$ under the sampled dropout instance $\epsilon^s$. There are two scalar-valued hyperparameters: (1) a trade-off parameter, $\lambda$, controlling the impact of the proposed regularization term, and (2) the constraint, $\delta$, specifying the intensity of adversarial dropout.
Combining Adversarial Dropout and Adversarial Training
Adversarial training and adversarial dropout are not mutually exclusive training methods. A neural network can be trained by imposing the input perturbation with Gaussian additive noise and by enabling the adversarially chosen dropouts simultaneously. Formula 7 specifies the loss function that uses adversarial dropout and adversarial training together.
$$ \mathcal{L}(y, x, \epsilon^s; \theta) + \lambda_1\, \mathcal{L}_{AdD}(x, y, \epsilon^s; \theta, \delta_1) + \lambda_2\, \mathcal{L}_{AT}(x, y; \theta, \delta_2) \tag{7} $$
where $\lambda_1$ and $\lambda_2$ are trade-off parameters controlling the impact of the regularization terms.
Fast Approximation Method for Finding Adversarial Dropout Condition
Once the adversarial dropout, $\epsilon^{adv}$, is identified, the evaluation of $\mathcal{L}_{AdD}$ simply becomes the computation of the loss and the divergence functions. However, the inference of $\epsilon^{adv}$ is difficult for three reasons. First, we cannot obtain a closed-form solution for the exact adversarial noise value, $\epsilon^{adv}$. Second, the feasible space for $\epsilon^{adv}$ is restricted by $\|\epsilon^s - \epsilon^{adv}\|_2 \le \delta H$, which becomes a constraint in the optimization. Third, $\epsilon^{adv}$ is a binary-valued vector rather than a continuous-valued vector because $\epsilon^{adv}$ indicates the activation of neurons. This discrete nature requires an optimization technique such as integer programming.
To mitigate this difficulty, we approximated the objective function, $\mathcal{L}_{AdD}$, with a first-order Taylor expansion by relaxing the domain space of $\epsilon^{adv}$. This Taylor expansion of the objective function was used in the earlier works on adversarial training [Goodfellow, Shlens, and Szegedy2014, Miyato et al.2015]. After the approximation, we found an adversarial dropout condition by solving an integer programming problem.
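To define a neural network with a dropout layer, we separate the output function into two neural sub-networks, $f_\theta(x, \epsilon) = f^{upper}_{\theta_1}(h(x) \odot \epsilon)$, where $f^{upper}_{\theta_1}$ is the part of the network above the dropout layer and $h(x) = f^{under}_{\theta_2}(x)$ is the part below it. Our objective is to optimize an adversarial dropout noise $\epsilon^{adv}$ by maximizing the following divergence function under the constraint $\|\epsilon^s - \epsilon^{adv}\|_2 \le \delta H$:
$$ D(x, \epsilon; \theta, \epsilon^s) = D[g(x, y, \theta, \epsilon^s), f^{upper}_{\theta_1}(h(x) \odot \epsilon)] \tag{8} $$
where $\epsilon^s$ is a sampled dropout mask, and $\theta$ is a parameter of the neural network model. We approximate the above divergence function with a first-order Taylor expansion by relaxing the domain space of $\epsilon$ from the multiple binary spaces, $\{0, 1\}^H$, to the real-valued spaces, $[0, 1]^H$. This conversion is a common step in integer programming research [Hemmecke et al.2010]:
$$ D(x, \epsilon; \theta, \epsilon^s) \approx D(x, \epsilon^0; \theta, \epsilon^s) + (\epsilon - \epsilon^0)^\top J(x, \epsilon^0) \tag{9} $$
where $J(x, \epsilon^0)$ is the Jacobian vector given by $J(x, \epsilon^0) := \nabla_\epsilon D(x, \epsilon; \theta, \epsilon^s)|_{\epsilon = \epsilon^0}$ when $\epsilon^0 = \mathbf{1}$ indicates no noise injection. The above Taylor expansion provides a linearized optimization objective controlled by $\epsilon$. Therefore, we reorganized the Taylor expansion with respect to $\epsilon$ as below:
$$ D(x, \epsilon; \theta, \epsilon^s) \propto \sum_i \epsilon_i J_i(x, \epsilon^0) \tag{10} $$
where $J_i(x, \epsilon^0)$ is the $i$-th element of $J(x, \epsilon^0)$. Since we cannot proceed further with the given formula, we introduce an alternative Jacobian formula that further specifies the dropout mechanism by $\odot$ and $h(x)$, as below.
$$ J(x, \epsilon^0) \approx h(x) \odot \nabla_{h(x)} D(x, \epsilon^0; \theta, \epsilon^s) \tag{11} $$
where $h(x)$ is the output vector of the part of the network below the adversarial dropout layer.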
The control variable, $\epsilon$, is a binary vector whose elements are either one or zero. Under this approximate divergence, finding a maximal point of $\epsilon$ can be viewed as the 0/1 knapsack problem [Kellerer, Pferschy, and Pisinger2004], which is one of the most popular integer programming problems.
To find $\epsilon^{adv}$ under the constraint, we propose Algorithm 1, based on dynamic programming for the 0/1 knapsack problem. In the algorithm, $\epsilon^{adv}$ is initialized with $\epsilon^s$, and its elements are changed in order of how much each change increases the objective divergence, until $\|\epsilon^s - \epsilon^{adv}\|_2$ reaches the boundary $\delta H$ or no further change increases the divergence. After running the algorithm, we obtain the $\epsilon^{adv}$ that maximizes the divergence under the constraint, and we evaluate the loss function $\mathcal{L}_{AdD}$.
Note that the point around which the Taylor expansion is taken is not $\epsilon^s$ but $\epsilon^0$. In the case of virtual adversarial dropout, whose divergence is formed as $D[f_\theta(x, \epsilon^s), f_\theta(x, \epsilon)]$, $\epsilon^s$ is the minimal point, where the gradient is zero because the random and the optimized dropouts yield identical distributions. This zero gradient would make the approximation of the divergence term zero. To avoid the zero gradient, we set the expansion point of the Taylor expansion to $\epsilon^0$.
This zero-gradient situation does not occur when the model function, $f_\theta$, contains additional stochastic layers, because $f_\theta(x, \epsilon^s, \rho^1) \neq f_\theta(x, \epsilon^s, \rho^2)$ when $\rho^1$ and $\rho^2$ are independently sampled noises from the other stochastic layers.
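Algorithm 1 itself is not reproduced in this text, so the following is only a greedy sketch in its spirit, assuming the Jacobian of the linearized divergence is already available as a list. Flipping unit i changes the linearized objective by J_i when the unit is switched on and by -J_i when it is switched off; because every flip adds the same amount to the squared distance between the masks, taking the flips with the largest positive gain first is enough for this simplified knapsack. The function name and the toy values are hypothetical.

```python
import math

def adversarial_dropout_mask(eps_s, jac, delta):
    """Greedy search for a mask close to eps_s that raises sum_i eps_i * jac_i."""
    budget = delta * len(eps_s)            # bound on the Euclidean distance
    gains = []
    for i, (e, j) in enumerate(zip(eps_s, jac)):
        gain = j if e == 0 else -j         # effect of flipping unit i
        if gain > 0:
            gains.append((gain, i))
    gains.sort(reverse=True)               # most harmful flips first
    eps_adv = list(eps_s)
    flips = 0
    for gain, i in gains:
        # for binary masks the distance equals sqrt(number of flips)
        if math.sqrt(flips + 1) > budget:
            break
        eps_adv[i] = 1 - eps_adv[i]
        flips += 1
    return eps_adv

eps_s = [1, 1, 0, 0, 1, 0]                       # sampled random mask
jac = [-0.5, 0.2, 0.9, -0.1, -0.05, 0.3]         # toy Jacobian values
eps_adv = adversarial_dropout_mask(eps_s, jac, delta=0.25)
```

With `delta=0.25` and six units the distance bound is 1.5, so at most two units can flip; the sketch switches unit 2 on and unit 0 off, which are the two changes with the largest gain.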
Error rate (%) by number of labels
Method                 1,000          All (60,000)
Plain (only dropout)   2.99 ± 0.23    0.53 ± 0.03
AT                     -              0.51 ± 0.03
VAT                    1.35 ± 0.14    0.50 ± 0.01
Π model                1.00 ± 0.08    0.50 ± 0.02
SAdD                   -              0.46 ± 0.01
VAdD (KL)              0.99 ± 0.07    0.47 ± 0.01
VAdD (QE)              0.99 ± 0.09    0.46 ± 0.02
Experiments
This section evaluates the empirical performance of adversarial dropout for supervised and semi-supervised classification tasks on three benchmark datasets: MNIST, SVHN, and CIFAR-10. In every presented task, we compared adversarial dropout, the Π model, and adversarial training. We also performed additional experiments to analyze the sparsity of adversarial dropout.
Supervised and Semi-Supervised Learning on MNIST
In the first set of experiments, we benchmark our method on the MNIST dataset [LeCun et al.1998], which consists of 70,000 handwritten digit images of size 28 × 28, where 60,000 images are used for training and the rest for testing.
Our basic structure is a convolutional neural network (CNN) containing three convolutional layers, with 32, 64, and 128 filters, respectively, and three max-pooling layers. Adversarial dropout was applied only on the final hidden layer. The structure details and the hyperparameters are described in Appendix B.1.
We conducted both supervised and semi-supervised learning to compare the performance with standard dropout, the Π model, and adversarial training models that use linear perturbations on the input space. The supervised learning used 60,000 instances for training with full labels. The semi-supervised learning used 1,000 randomly selected instances with their labels and 59,000 instances with only their input images. Table 1 shows the test error rates including the baseline models. Over all experiment settings, SAdD and VAdD further reduce the error rate from the Π model, which had the best performance among the baseline models. In the table, KL and QE indicate the Kullback-Leibler divergence and the quadratic error, respectively, as the choice of divergence function.
Supervised and Semi-Supervised Learning on SVHN and CIFAR-10
Test error rate (%) by dataset and number of labels
Method                                                  SVHN 1,000     SVHN 73,257 (All)   CIFAR-10 4,000   CIFAR-10 50,000 (All)
Π model [Laine and Aila2016]                            4.82           2.54                12.36            5.56
Tem. ensembling [Laine and Aila2016]                    4.42           2.74                12.16            5.60
Sajjadi et al. [Sajjadi, Javanmardi, and Tasdizen2016]  -              -                   11.29            -
VAT [Miyato et al.2017]                                 3.86           -                   10.55            5.81
Π model (our implementation)                            4.35 ± 0.04    2.53 ± 0.05         12.62 ± 0.29     5.77 ± 0.11
VAT (our implementation)                                3.74 ± 0.09    2.69 ± 0.04         11.96 ± 0.10     5.65 ± 0.17
SAdD                                                    -              2.46 ± 0.05         -                5.46 ± 0.16
VAdD (KL)                                               4.16 ± 0.08    2.31 ± 0.01         11.68 ± 0.19     5.27 ± 0.10
VAdD (QE)                                               4.26 ± 0.14    2.37 ± 0.03         11.32 ± 0.11     5.24 ± 0.12
VAdD (KL) + VAT                                         3.55 ± 0.05    2.23 ± 0.03         10.07 ± 0.11     4.40 ± 0.12
VAdD (QE) + VAT                                         3.55 ± 0.07    2.34 ± 0.05         9.22 ± 0.10      4.73 ± 0.04
We evaluated the supervised and the semi-supervised tasks on the SVHN [Netzer et al.2011] and the CIFAR-10 [Krizhevsky and Hinton2009] datasets, which consist of 32 × 32 color images in ten classes. For these experiments, we used the large CNN of [Laine and Aila2016, Miyato et al.2017]. The details of the structure and the settings are described in Appendix B.2.
Table 2 shows the reported performances of the close family of CNN-based classifiers for supervised and semi-supervised learning. We did not consider recently advanced architectures, such as ResNet [He et al.2016] and DenseNet [Huang et al.2016], because we intend to compare the performance gain from dropout and other training techniques.
In the supervised learning tasks using all labeled training data, adversarial dropout models achieved the top performance compared to the results from the baseline models, such as the Π model and VAT, on both datasets. Applying adversarial dropout and adversarial training together improved the performance further.
Additionally, we conducted experiments on semi-supervised learning with randomly selected labeled data and unlabeled images. In SVHN, 1,000 labeled and 72,257 unlabeled data were used for training. In CIFAR-10, 4,000 labeled and 46,000 unlabeled data were used. Table 2 lists the performance of the semi-supervised learning models, and our implementations with both VAdD and VAT achieved the top performance compared to the results from [Sajjadi, Javanmardi, and Tasdizen2016].
Our experiments demonstrate that VAT and VAdD are complementary. When applying VAT and VAdD together by simply adding their divergence terms to the loss function (see Formula 7), we achieved state-of-the-art performance on semi-supervised learning on both datasets: 3.55% test error on SVHN, and 10.04% and 9.22% test error on CIFAR-10. Additionally, VAdD alone achieved a better performance than the self-ensemble model (Π model). This indicates that considering an adversarial perturbation on dropout layers enhances the self-ensemble effect.
Effect on Features and Sparsity from Adversarial Dropout
Dropout prevents co-adaptation between the units in a neural network and decreases the dependency between hidden units [Srivastava et al.2014]. To compare adversarial dropout and standard dropout, we analyzed the co-adaptations by visualizing features of autoencoders on the MNIST dataset. The autoencoder consists of one hidden layer with 256 units and the ReLU activation. When training the autoencoder with standard dropout, we calculated the reconstruction error between the input data and the output layer as the loss function used to update the weights. For the autoencoder with adversarial dropout, the adversarial dropout error was added to this loss, with the two adversarial dropout hyperparameters set to 0.2 and 0.3. The trained autoencoders showed similar reconstruction errors on the test dataset.
Figure 2 shows the visualized features from the autoencoders. The visualization reveals two differences: 1) adversarial dropout prevents the learned weight matrix from containing black boxes, or dead filters, which may be all zero for many different inputs, and 2) adversarial dropout tends to standardize other features, except for localized features viewed as black dots, while standard dropout tends to ignore the neighborhoods of the localized features. These observations show that adversarial dropout standardizes the other features while preserving the characteristics of localized features from standard dropout. This could be the main reason for the better generalization performance.
An important side effect of standard dropout is the sparse activation of the hidden units [Hinton et al.2012]. To analyze the sparse activations under adversarial dropout, we compared the activation values of the autoencoder models with no dropout, dropout, and adversarial dropout on the MNIST test dataset. A sparse model should have only a few highly activated units, and the average activation of any unit across data instances should be low [Hinton et al.2012]. Figure 3 plots the distribution of the activation values and their means across the test dataset. We found that adversarial dropout has fewer highly activated units than the others. Moreover, the mean activation values of adversarial dropout were the lowest. These results indicate that adversarial dropout improves the sparsity of the model more than standard dropout does.
Discussion
Previous studies showed that adversarial noise injection is an effective regularizer [Goodfellow, Shlens, and Szegedy2014]. To investigate the different properties of adversarial dropout, we explore a very simple case: applying adversarial training and adversarial dropout to linear regression.
Linear Regression with Adversarial Training
Let $x_i \in \mathbb{R}^D$ be a data point and $y_i \in \mathbb{R}$ be a target, where $i = 1, \dots, N$. The objective of linear regression is to find $w \in \mathbb{R}^D$ that minimizes $l(w) = \sum_i \|y_i - x_i^\top w\|^2$.
To express adversarial examples, we denote $\tilde{x}_i = x_i + \delta\, \mathrm{sign}(\nabla_{x_i} l(w))$ as the adversarial example of $x_i$, obtained with the fast gradient sign method (FGSM) [Goodfellow, Shlens, and Szegedy2014], where $\delta$ is a control parameter representing the degree of adversarial noise. With the adversarial examples, the objective function of adversarial training can be written as follows:
$$ l_{AT}(w) = \sum_i \|y_i - (x_i + \delta\, \mathrm{sign}(\nabla_{x_i} l(w)))^\top w\|^2 \tag{12} $$
Isolating the terms with $\delta$ as the additive noise turns the above equation into the formula below.
$$ l_{AT}(w) = l(w) + \sum_i \delta\, \|\nabla_{x_i} l(w)\|_1 + \delta^2\, w^\top \Gamma_{AT}\, w \tag{13} $$
where $\Gamma_{AT} = \sum_i \mathrm{sign}(\nabla_{x_i} l(w))\, \mathrm{sign}(\nabla_{x_i} l(w))^\top$. The second term is an $L_1$ regularization scaled by the degree of the adversarial noise, $\delta$, at each data point. The third term is an $L_2$ regularization with $\Gamma_{AT}$, which scales $w$ by the gradient direction differences over all data points. The penalty terms are closely related to the hyperparameter $\delta$. When $\delta$ approaches zero, the regularization terms disappear because the inputs are no longer adversarial examples. For a large $\delta$, the regularization constant grows larger than the original loss function, and learning becomes infeasible. Previous studies showed that the adversarial objective function based on FGSM is an effective regularizer. Our analysis shows that training a linear regression with adversarial examples yields the two regularization terms of the above equation.
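The split of the adversarial training objective into the original loss plus two penalty terms can be checked numerically. The sketch below assumes a squared loss and FGSM examples, computes the objective on the perturbed inputs directly, and compares it with the sum of the base loss, an L1 term (delta times the L1 norm of each input gradient), and an L2 term (delta squared times the squared inner product of the sign vector with w). The function name and the random toy data are assumptions for illustration.

```python
import random

def sign(v):
    return (v > 0) - (v < 0)

def at_objective_two_ways(X, y, w, delta):
    """Return (direct objective on FGSM examples, base + L1 + L2 decomposition)."""
    direct = base = l1 = l2 = 0.0
    for x_i, y_i in zip(X, y):
        r = y_i - sum(a * b for a, b in zip(x_i, w))       # residual
        grad = [-2.0 * r * w_j for w_j in w]               # d loss / d x_i
        s = [sign(g) for g in grad]
        x_adv = [a + delta * s_j for a, s_j in zip(x_i, s)]
        direct += (y_i - sum(a * b for a, b in zip(x_adv, w))) ** 2
        base += r ** 2
        l1 += delta * sum(abs(g) for g in grad)
        l2 += delta ** 2 * sum(s_j * w_j for s_j, w_j in zip(s, w)) ** 2
    return direct, base + l1 + l2

rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
y = [rng.uniform(-1, 1) for _ in range(5)]
w = [rng.uniform(-1, 1) for _ in range(4)]
direct, decomposed = at_objective_two_ways(X, y, w, delta=0.1)
```

Both penalty terms are nonnegative and scale with the single value `delta`, which is the sense in which the strength of this regularizer is set by one continuous parameter.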
Linear Regression with Adversarial Dropout
Now, we turn to the case of applying adversarial dropout to a linear regression. To represent the adversarial dropout, we denote as the adversarially dropped input of where with the hyperparameter, , controlling the degree of the adversarial dropout. For simplification, we used one vector as the sampled dropout, , of the adversarial dropout. If we apply Algorithm 1, the adversarial dropout can be defined as follows:
(14) 
where is the lowest element of . This solution satisfies the constraint, . With this adversarial dropout condition, the objective function of the adversarial dropout can be defined as the belows:
(15) 
When we isolate the terms with , the above equation is translated into the below formula.
(16) 
where and . The second term is the regularization of the largest loss changes from the features of each data point. The third term is the regularization with . These two penalty terms are related with the hyperparameter controlling the degree of the adversarial dropout, because the indicates the number of elements of the set . When becomes zero, the two penalty terms disappears because there will be no dropout by the constraint on .
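A similar numerical check can be sketched for adversarial dropout on the inputs of a linear regression. The assumptions here are loud ones: the sampled mask is all ones, each data point may drop at most `k` features (a hypothetical name for the rank-valued budget), and the dropped features are those with the most negative score x_ij times the input gradient, which is what the linearized objective selects. Under these assumptions the loss on the dropped inputs equals the base loss plus an L1 term over the dropped features plus a squared term.

```python
import random

def add_objective_two_ways(X, y, w, k):
    """Return (direct loss on adversarially dropped inputs, decomposition)."""
    direct = decomposed = 0.0
    for x_i, y_i in zip(X, y):
        r = y_i - sum(a * b for a, b in zip(x_i, w))        # residual
        # score_j = x_ij * d loss / d x_ij, with d loss / d x_ij = -2 r w_j
        score = [a * (-2.0 * r * b) for a, b in zip(x_i, w)]
        order = sorted(range(len(w)), key=lambda j: score[j])
        dropped = [j for j in order[:k] if score[j] < 0]    # only harmful drops
        eps = [0 if j in dropped else 1 for j in range(len(w))]
        direct += (y_i - sum(e * a * b for e, a, b in zip(eps, x_i, w))) ** 2
        l1 = sum(abs(score[j]) for j in dropped)
        l2 = sum(x_i[j] * w[j] for j in dropped) ** 2
        decomposed += r ** 2 + l1 + l2
    return direct, decomposed

rng = random.Random(1)
X = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(5)]
y = [rng.uniform(-1, 1) for _ in range(5)]
w = [rng.uniform(-1, 1) for _ in range(6)]
direct, decomposed = add_objective_two_ways(X, y, w, k=2)
base, _ = add_objective_two_ways(X, y, w, k=0)
```

Unlike the additive-noise case, the size of each penalty depends on the feature values of the individual data point, and a data point with no harmful feature to drop contributes no penalty at all.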
There are two differences between adversarial dropout and adversarial training. First, the regularization terms of adversarial dropout depend on the scale of the features of each data point. In the $L_1$ regularization, the gradients of the loss function are rescaled by the data points. In the $L_2$ regularization, the data points affect the scales of the weight costs. In contrast, the penalty terms of adversarial training depend on the degree of adversarial noise, $\delta$, which is a static term across the instances because $\delta$ is a single-valued hyperparameter given in the training process. Second, the penalty terms of adversarial dropout are selectively activated by the degree of the loss changes, while the penalty terms of adversarial training are always activated.
Conclusion
The key point of our paper is combining the ideas of adversarial training and dropout. The existing methods of adversarial training control a linear perturbation with additive properties only on the input layer. In contrast, we combined the concept of the perturbation with the dropout properties on hidden layers. An adversarially dropped structure becomes a poor ensemble model for the label assignment even when very few nodes are changed. However, by learning the model with the poor structure, the model avoids overfitting to a few effective features. The experiments showed that the generalization performance is improved by applying our adversarial dropout. Additionally, our approach achieved state-of-the-art performances of 3.55% on SVHN and 9.22% on CIFAR-10 by applying VAdD and VAT together for semi-supervised learning.
References
 [Baldi and Sadowski2013] Baldi, P., and Sadowski, P. J. 2013. Understanding dropout. In Advances in Neural Information Processing Systems. 2814–2822.
 [Bishop1995a] Bishop, C. M. 1995a. Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1):108–116.
 [Bishop1995b] Bishop, C. M. 1995b. Regularization and complexity control in feedforward networks.
 [Chen et al.2014] Chen, N.; Zhu, J.; Chen, J.; and Zhang, B. 2014. Dropout training for support vector machines. arXiv preprint arXiv:1404.4171.
 [Goodfellow, Shlens, and Szegedy2014] Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. In European Conference on Computer Vision, 630–645. Springer.
 [Hemmecke et al.2010] Hemmecke, R.; Köppe, M.; Lee, J.; and Weismantel, R. 2010. Nonlinear integer programming. In 50 Years of Integer Programming 1958–2008. Springer. 561–618.
 [Hinton et al.2012] Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
 [Huang et al.2016] Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2016. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.
 [Jain et al.2015] Jain, P.; Kulkarni, V.; Thakurta, A.; and Williams, O. 2015. To drop or not to drop: Robustness, consistency and differential privacy properties of dropout. arXiv preprint arXiv:1503.02031.
 [Kellerer, Pferschy, and Pisinger2004] Kellerer, H.; Pferschy, U.; and Pisinger, D. 2004. Introduction to NP-completeness of knapsack problems. In Knapsack problems. Springer. 483–493.
 [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Krizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
 [Kurakin, Goodfellow, and Bengio2016] Kurakin, A.; Goodfellow, I.; and Bengio, S. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
 [Laine and Aila2016] Laine, S., and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
 [Lasserre, Bishop, and Minka2006] Lasserre, J. A.; Bishop, C. M.; and Minka, T. P. 2006. Principled hybrids of generative and discriminative models. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, 87–94. IEEE.
 [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [Lee et al.2015] Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; and Tu, Z. 2015. Deeply-supervised nets. In Artificial Intelligence and Statistics, 562–570.
 [Li, Gong, and Yang2016] Li, Z.; Gong, B.; and Yang, T. 2016. Improved dropout for shallow and deep learning. In Advances in Neural Information Processing Systems, 2523–2531.
 [Lin, Chen, and Yan2013] Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. arXiv preprint arXiv:1312.4400.
 [Maaten et al.2013] Maaten, L.; Chen, M.; Tyree, S.; and Weinberger, K. Q. 2013. Learning with marginalized corrupted features. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 410–418.
 [Miyato et al.2015] Miyato, T.; Maeda, S.-i.; Koyama, M.; Nakae, K.; and Ishii, S. 2015. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677.
 [Miyato et al.2017] Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2017. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976.
 [Miyato, Dai, and Goodfellow2016] Miyato, T.; Dai, A. M.; and Goodfellow, I. 2016. Virtual adversarial training for semi-supervised text classification. stat 1050:25.
 [Netzer et al.2011] Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, 5.
 [Poole, SohlDickstein, and Ganguli2014] Poole, B.; Sohl-Dickstein, J.; and Ganguli, S. 2014. Analyzing noise in autoencoders and deep networks. arXiv preprint arXiv:1406.1831.
 [Rasmus et al.2015] Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; and Raiko, T. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 3546–3554.
 [Sajjadi, Javanmardi, and Tasdizen2016] Sajjadi, M.; Javanmardi, M.; and Tasdizen, T. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, 1163–1171.
 [Salimans and Kingma2016] Salimans, T., and Kingma, D. P. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 901–909.
 [Salimans et al.2016] Salimans, T.; Goodfellow, I. J.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. CoRR abs/1606.03498.
 [Sanfeliu and Fu1983] Sanfeliu, A., and Fu, K.-S. 1983. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics (3):353–362.
 [Springenberg et al.2014] Springenberg, J. T.; Dosovitskiy, A.; Brox, T.; and Riedmiller, M. 2014. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
 [Springenberg2015] Springenberg, J. T. 2015. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390.
 [Srivastava et al.2014] Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
 [Srivastava, Greff, and Schmidhuber2015] Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Highway networks. arXiv preprint arXiv:1505.00387.
 [Szegedy et al.2013] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
 [Wager, Wang, and Liang2013] Wager, S.; Wang, S.; and Liang, P. S. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, 351–359.
 [Wang and Manning2013] Wang, S., and Manning, C. 2013. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 118–126.
Appendix A. Distance between Two Dropout Conditions
In this section, we describe how the boundary condition is derived from the constraints . We applied two distance metrics, the graph edit distance (GED) and the Jaccard distance (JD), and proved that restricting the upper bounds of the two metrics is the same as limiting the Euclidean distance.
(17) 
The following subsections present the propositions.
A.1. Graph Edit Distance
When we consider a neural network as a graph, we can apply the graph edit distance [Sanfeliu and Fu1983] to measure the relative difference between two networks under dropout, and , generated by the dropout masks and . The following is the definition of the graph edit distance (GED) between two networks.
(18) 
where denotes the set of edit paths transforming into , is the cost of each graph edit operation , and is the number of edit operations required to change to . For simplicity, we considered only edge insertion and deletion operations, and their costs are all equal to . When a hidden node (vertex) is dropped, the cost of the GED is , where is the number of lower-layer nodes and is the number of upper-layer nodes. If a hidden node (vertex) is revived, the change of the GED is likewise . This leads to the following proposition.
Proposition 1
Given two networks and , generated by two dropout masks and , and all graph edit costs equal to , the graph edit distance between the two dropout masks can be interpreted as:
(19) 
Because and are binary masks, their Euclidean distance determines the number of nodes that are dropped in one mask but not in the other.
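As a rough illustration of Proposition 1, the sketch below counts edge edits directly for a single fully connected hidden layer and compares the count with the squared Euclidean distance between the two masks. The layer sizes, the unit edit cost, and the helper names `edge_set` and `graph_edit_distance` are assumptions made for this example; they are not taken from the original implementation.

```python
import numpy as np

def edge_set(mask, n_lower, n_upper):
    """Edges present when hidden node h is active (mask[h] == 1).

    Assumes fully connected layers: each active hidden node is linked
    to every lower-layer node and to every upper-layer node.
    """
    edges = set()
    for h in np.flatnonzero(mask):
        edges.update(("lower", l, int(h)) for l in range(n_lower))
        edges.update(("upper", int(h), u) for u in range(n_upper))
    return edges

def graph_edit_distance(mask_a, mask_b, n_lower, n_upper, cost=1.0):
    """GED restricted to edge insertions and deletions of equal cost."""
    e_a = edge_set(mask_a, n_lower, n_upper)
    e_b = edge_set(mask_b, n_lower, n_upper)
    return cost * len(e_a ^ e_b)  # symmetric difference = edits needed

rng = np.random.default_rng(0)
n_lower, n_hidden, n_upper = 5, 8, 3
mask_a = rng.integers(0, 2, size=n_hidden)
mask_b = rng.integers(0, 2, size=n_hidden)

ged = graph_edit_distance(mask_a, mask_b, n_lower, n_upper)
# For binary masks, the squared Euclidean distance counts differing nodes.
sq_euclid = float(np.sum((mask_a - mask_b) ** 2))
print(ged, (n_lower + n_upper) * sq_euclid)
```

Under these assumptions the two printed values coincide, which is the proportionality between the GED and the mask distance that the proposition relies on.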
A.2. Jaccard Distance
When we consider a dropout condition as a set of selected hidden nodes, we can apply the Jaccard distance to measure the difference between two dropout masks, and . The following equation is the definition of the Jaccard distance:
(20) 
Since and are binary vectors, can be converted to and can be viewed as . This leads to the following proposition.
Proposition 2
Given two dropout masks and , which are binary vectors, the Jaccard distance between them can be defined as:
(21) 
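The relation between the set view and the vector view of the Jaccard distance can be checked numerically. The sketch below uses the standard set definition, (|A ∪ B| − |A ∩ B|) / |A ∪ B|, applied to the sets of active nodes, and an equivalent expression on binary vectors. The function names and the convention of returning 0 for two empty sets are assumptions for this example, and the vector expression is not necessarily written in the same form as the proposition.

```python
import numpy as np

def jaccard_distance_sets(mask_a, mask_b):
    """Jaccard distance between the sets of active (kept) nodes."""
    a = set(np.flatnonzero(mask_a).tolist())
    b = set(np.flatnonzero(mask_b).tolist())
    union = a | b
    if not union:  # both masks drop everything: treat the distance as 0
        return 0.0
    return (len(union) - len(a & b)) / len(union)

def jaccard_distance_vectors(mask_a, mask_b):
    """The same quantity computed with vector operations on binary masks."""
    mask_a = np.asarray(mask_a, dtype=float)
    mask_b = np.asarray(mask_b, dtype=float)
    diff = np.sum((mask_a - mask_b) ** 2)                   # |A u B| - |A n B|
    union = mask_a.sum() + mask_b.sum() - mask_a @ mask_b   # |A u B|
    return 0.0 if union == 0 else float(diff / union)

mask_a = np.array([1, 1, 0, 1, 0, 1])
mask_b = np.array([1, 0, 0, 1, 1, 1])
jd_sets = jaccard_distance_sets(mask_a, mask_b)
jd_vecs = jaccard_distance_vectors(mask_a, mask_b)
print(jd_sets, jd_vecs)
```

In this example the masks differ in two positions and share a union of five active nodes, so both functions give 2/5. The numerator is the squared Euclidean distance between the masks, which is why bounding the Jaccard distance also bounds the Euclidean distance.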
Appendix B. Detailed Experiment Setup
This section describes the network architectures and settings for the experimental results in this paper. The TensorFlow implementations for reproducing these results can be obtained from https://github.com/sungraepark/AdversarialDropout.
B.1. MNIST : Convolutional Neural Networks
Name  Description 

input  28 X 28 image 
conv1  32 filters, 1 x 1, pad=’same’, ReLU 
pool1  Maxpool 2 x 2 pixels 
drop1  Dropout, 
conv2  64 filters, 1 x 1, pad=’same’, ReLU 
pool2  Maxpool 2 x 2 pixels 
drop2  Dropout, 
conv3  128 filters, 1 x 1, pad=’same’, ReLU 
pool3  Maxpool 2 x 2 pixels 
adt  Adversarial dropout, , 
dense1  Fully connected 2048 → 625 
dense2  Fully connected 625 → 10 
output  Softmax 
The MNIST dataset [LeCun et al.1998] consists of 70,000 handwritten digit images of size 28 x 28, where 60,000 images are used for training and the rest for testing. The CNN architecture is described in Table 1. All networks were trained using Adam [Kingma and Ba2014] with a learning rate of 0.001 and momentum parameters of and . In all implementations, we trained the model for 100 epochs with a minibatch size of 128.
For the constraint of adversarial dropout, we set , which indicates 10 () adversarial changes from the randomly selected dropout mask. In all training runs, we ramped up the tradeoff parameter, , for the proposed regularization term, . During the first 30 epochs, we used a Gaussian ramp-up curve , where advances linearly from zero to one during the ramp-up period. The maximum values of are 1.0 for VAdD (KL) and VAT, and 30.0 for VAdD (QE) and the Π model.
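The ramp-up of the tradeoff parameter can be sketched as a small scheduling function. The exact curve is not legible in this text, so the form exp(−5(1 − T)²) used below is an assumption taken from the ramp-up described by [Laine and Aila2016]. The function name `gaussian_rampup` and its arguments are hypothetical.

```python
import math

def gaussian_rampup(epoch, rampup_epochs=30, max_weight=1.0):
    """Ramp a weight from near zero up to max_weight over rampup_epochs.

    Assumed curve: exp(-5 * (1 - T)^2), where T advances linearly from
    0 to 1 during the ramp-up period and stays at 1 afterwards.
    """
    if rampup_epochs <= 0:
        return max_weight
    t = min(max(epoch / rampup_epochs, 0.0), 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)

# Example: a weight with maximum 30.0 ramped up over the first 30 epochs.
weights = [gaussian_rampup(e, rampup_epochs=30, max_weight=30.0)
           for e in (0, 15, 30, 60)]
print(weights)
```

With this curve the weight starts at a small positive value rather than exactly zero, then rises smoothly and saturates at the maximum once the ramp-up period ends.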
B.2. SVHN and CIFAR10 : Supervised and Semi-supervised learning
Name  Description 

input  32 X 32 RGB image 
noise  Additive Gaussian noise 
conv1a  128 filters, 3 x 3, pad=’same’, LReLU() 
conv1b  128 filters, 3 x 3, pad=’same’, LReLU() 
conv1c  128 filters, 3 x 3, pad=’same’, LReLU() 
pool1  Maxpool 2 x 2 pixels 
drop1  Dropout, 
conv2a  256 filters, 3 x 3, pad=’same’, LReLU() 
conv2b  256 filters, 3 x 3, pad=’same’, LReLU() 
conv2c  256 filters, 3 x 3, pad=’same’, LReLU() 
pool2  Maxpool 2 x 2 pixels 
conv3a  512 filters, 3 x 3, pad=’valid’, LReLU() 
conv3b  256 filters, 1 x 1, LReLU() 
conv3c  128 filters, 1 x 1, LReLU() 
pool3  Global average pool (6 x 6 → 1 x 1) pixels 
add  Adversarial dropout, , 
dense  Fully connected 128 → 10 
output  Softmax 
Both datasets, SVHN [Netzer et al.2011] and CIFAR10 [Krizhevsky and Hinton2009], consist of 32 x 32 colour images in ten classes. For these experiments, we used the CNN of [Laine and Aila2016, Miyato et al.2017], described in Table 2. In all layers, we applied batch normalization for SVHN and mean-only batch normalization [Salimans and Kingma2016] for CIFAR10, with momentum 0.999. All networks were trained using Adam [Kingma and Ba2014] with the momentum parameters of and , and a maximum learning rate of 0.003. We ramped up the learning rate during the first 80 epochs using a Gaussian ramp-up curve , where advances linearly from zero to one during the ramp-up period. Additionally, we annealed the learning rate to zero and the Adam parameter, , to 0.5 during the last 50 epochs. The total number of epochs is set to 300. These training settings are the same as in [Laine and Aila2016].
For adversarial dropout, we set the maximum value of the regularization component weight, , to 1.0 for VAdD (KL) and 25.0 for VAdD (QE). We also ramped up the weight using the Gaussian ramp-up curve during the first 80 epochs. Additionally, we set to 0.05 and the dropout probability to 1.0, which means dropping 6 units among the full hidden units. We set the minibatch size to 100 for supervised learning, and to 32 labeled and 128 unlabeled examples for semi-supervised learning.
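The count of 6 dropped units follows from the boundary value 0.05 and the 128 hidden units entering the adversarial dropout layer in Table 2. A minimal sketch of this arithmetic, assuming the boundary is the product of the two rounded down to an integer (the function name is hypothetical):

```python
import math

def max_mask_changes(delta, n_hidden):
    """Largest number of units whose dropout state may be changed.

    Assumes the budget is delta * n_hidden, rounded down to an integer.
    """
    return math.floor(delta * n_hidden)

# 0.05 * 128 = 6.4, so at most 6 units can be changed.
print(max_mask_changes(0.05, 128))
```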
Appendix C. Definition of Notation
In this section, we describe the notation used throughout this paper.
Notation  Description

An input of a neural network
A true label
A set of parameters of a neural network
A noise vector of an additive Gaussian noise layer
A binary noise vector of a dropout layer
A hyperparameter controlling the intensity of the adversarial perturbation
A tradeoff parameter controlling the impact of a regularization term
A nonnegative function that represents the distance between two output vectors: cross entropy (CE), KL divergence (KL), and quadratic error (QE)
An output vector of a neural network with parameters () and an input ()
An output vector of a neural network with parameters (), an input (), and noise ()
An upper part of a neural network, , of an adversarial dropout layer where
A lower part of a neural network, , of an adversarial dropout layer
Appendix D. Performance Comparison with Other Models
D.1. CIFAR10 : Supervised classification results with additional baselines
We compared our results with the reported performances of additional, closely related CNN-based classifiers for supervised learning. As mentioned in the paper, we did not consider recent advanced architectures such as ResNet [He et al.2016] and DenseNet [Huang et al.2016].
Method  Error rate (%)

Network in Network [Lin, Chen, and Yan2013]  8.81
All-CNN [Springenberg et al.2014]  7.25
Deep Supervised Net [Lee et al.2015]  7.97
Highway Network [Srivastava, Greff, and Schmidhuber2015]  7.72
Π model [Laine and Aila2016]  5.56
Temporal ensembling [Laine and Aila2016]  5.60
VAT [Miyato et al.2017]  5.81
Π model (our implementation)  5.77 ± 0.11
VAT (our implementation)  5.65 ± 0.17
AdD  5.46 ± 0.16
VAdD (KL)  5.27 ± 0.10
VAdD (QE)  5.24 ± 0.12
VAdD (KL) + VAT  4.40 ± 0.12
VAdD (QE) + VAT  4.73 ± 0.04
D.2. CIFAR10 : Semi-supervised classification results with additional baselines
We compared our results with the reported performances of additional baseline models for semi-supervised learning. Our implementations closely reproduced the reported results of the baselines and showed the performance improvement from adversarial dropout.
Method  Error rate (%)

Ladder network [Rasmus et al.2015]  20.40
CatGAN [Springenberg2015]  19.58
GAN with feature matching [Salimans et al.2016]  18.63
Π model [Laine and Aila2016]  12.36
Temporal ensembling [Laine and Aila2016]  12.16
Sajjadi et al. [Sajjadi, Javanmardi, and Tasdizen2016]  11.29
VAT [Miyato et al.2017]  10.55
Π model (our implementation)  12.62 ± 0.29
VAT (our implementation)  11.96 ± 0.10
VAdD (KL)  11.68 ± 0.19
VAdD (QE)  11.32 ± 0.11
VAdD (KL) + VAT  10.07 ± 0.11
VAdD (QE) + VAT  9.22 ± 0.10
Appendix E. Proof of Linear Regression Regularization
In this section, we give the detailed derivation of the regularization terms induced by adversarial training and adversarial dropout.
Linear Regression with Adversarial Training
Let be a data point and be a target where . The objective of linear regression is to find a that minimizes .
To express adversarial examples, we denote as the adversarial example of where , utilizing the fast gradient sign method (FGSM) [Goodfellow, Shlens, and Szegedy2014], and is a parameter controlling the degree of adversarial noise. With the adversarial examples, the objective function of adversarial training can be viewed as follows:
(22) 
This can be decomposed into
(23)  
where is the loss function without adversarial noise. Note that the gradient is , and . The above equation can be rewritten as follows:
(24) 
where .
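The decomposition of the adversarial-training objective into the original loss, a first-order penalty on the input gradients, and a quadratic penalty on the weights can be checked numerically. The sketch below assumes a sum-of-squared-errors loss and FGSM perturbations of the form x + ε·sign(∇ₓ loss). The data, dimensions, and variable names are made up for illustration, and the penalty terms are written in the form that follows from these assumptions rather than copied from the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 20, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Squared-error loss (an assumption about the objective).
residual = y - X @ w                       # r_i = y_i - x_i^T w
loss = np.sum(residual ** 2)

# Gradient of the loss with respect to each input x_i: -2 * r_i * w.
grad_x = -2.0 * residual[:, None] * w[None, :]

# FGSM adversarial examples and the resulting adversarial-training loss.
signs = np.sign(grad_x)
X_adv = X + eps * signs
loss_adv = np.sum((y - X_adv @ w) ** 2)

# Decomposition: original loss + first-order penalty + quadratic penalty.
first_order = np.sum(np.abs(eps * grad_x))
gamma = signs.T @ signs                    # sum_i sign(g_i) sign(g_i)^T
second_order = eps ** 2 * (w @ gamma @ w)

print(loss_adv, loss + first_order + second_order)
```

Both printed values agree up to floating-point error. Note that the two penalty terms scale only with ε, which is the same for every data point.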
Linear Regression with Adversarial Dropout
To represent adversarial dropout, we denote as the adversarially dropped input of where with the hyperparameter, , controlling the degree of adversarial dropout. For simplicity, we used the all-ones vector as the base condition of the adversarial dropout. If we apply our proposed algorithm, the adversarial dropout can be defined as follows:
(25) 
where is the lowest element of . This solution satisfies the constraint, . With this adversarial dropout condition, the objective function of adversarial dropout can be defined as follows:
(26) 
This can be decomposed into
(27)  
The second term of the right-hand side can be viewed as
(28) 
By defining a set , the second term can be rewritten as follows.
(29) 
Note that the gradient is and is always negative when . The second term can be rewritten as follows.
(30) 
Finally, the objective function of adversarial dropout can be reorganized as follows.
(31) 
where and .
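The same kind of numerical check applies to adversarial dropout on the inputs of a linear regression. The sketch below assumes a sum-of-squared-errors loss, an all-ones base mask, and a selection rule that drops, for each data point, at most k features whose product of feature value and input gradient is lowest and negative. The rule, the value of k, and all variable names are assumptions for this illustration and may differ in detail from the proposed algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 20, 6, 2
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

residual = y - X @ w
loss = np.sum(residual ** 2)               # squared error (assumed objective)
grad_x = -2.0 * residual[:, None] * w[None, :]

# Dropping feature j of sample i changes the loss, to first order, by
# -x_ij * grad_ij, so the most negative scores mark the most harmful drops.
score = X * grad_x

# Drop at most k features per sample: the k lowest scores, and only
# those that are negative (dropping them increases the loss).
mask = np.ones_like(X)
lowest = np.argsort(score, axis=1)[:, :k]
rows = np.arange(n)[:, None]
mask[rows, lowest] = np.where(score[rows, lowest] < 0, 0.0, 1.0)

loss_add = np.sum((y - (mask * X) @ w) ** 2)

# Decomposition: original loss + first-order penalty + quadratic penalty.
dropped = 1.0 - mask                       # indicator of the dropped set
first_order = np.sum(np.abs(score) * dropped)
second_order = np.sum(((dropped * X) @ w) ** 2)

print(loss_add, loss + first_order + second_order)
```

Both printed values agree up to floating-point error. In contrast to the adversarial-training case, both penalty terms here scale with the feature values of each data point, and they vanish when k is zero because no feature is dropped.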