Fraternal Dropout
Abstract
Recurrent neural networks (RNNs) are an important class of neural network architectures for language modeling and sequential prediction. However, RNNs are known to be harder to optimize than feedforward neural networks, and a number of techniques have been proposed in the literature to address this problem. In this paper we propose a simple technique called fraternal dropout that takes advantage of dropout to achieve this goal. Specifically, we propose to train two identical copies of an RNN (that share parameters) with different dropout masks while minimizing the difference between their (pre-softmax) predictions. In this way our regularization encourages the representations of RNNs to be invariant to the dropout mask, and thus robust. We show that our regularization term is upper bounded by the expectation-linear dropout objective, which has been shown to address the gap due to the difference between the train and inference phases of dropout. We evaluate our model and achieve state-of-the-art results in sequence modeling tasks on two benchmark datasets, Penn Treebank and WikiText-2. We also show that our approach leads to performance improvement by a significant margin in image captioning (Microsoft COCO) and semi-supervised (CIFAR-10) tasks.
Konrad Żołna (konrad.zolna@gmail.com), Devansh Arpit, Dendi Suhubdy & Yoshua Bengio

Jagiellonian University 
MILA, Université de Montréal 
CIFAR Senior Fellow 
1 Introduction
Recurrent neural networks (RNNs) like long short-term memory (LSTM; Hochreiter & Schmidhuber (1997)) networks and gated recurrent units (GRU; Chung et al. (2014)) are popular architectures for sequence modeling tasks like language generation, translation, speech synthesis, and machine comprehension. However, they are harder to optimize than feedforward networks due to challenges like variable-length input sequences, repeated application of the same transition operator at each time step, and a largely dense embedding matrix that depends on the vocabulary size. Because of these challenges, batch normalization and its variants (layer normalization, recurrent batch normalization, recurrent normalization propagation) have not been as successful in RNNs as their counterparts in feedforward networks (Laurent et al., 2016), although they do provide considerable performance gains. Similarly, naive application of dropout (Srivastava et al., 2014) has been shown to be ineffective in RNNs (Zaremba et al., 2014). Therefore, regularization techniques for RNNs are an active area of research.
To address these challenges, Zaremba et al. (2014) proposed to apply dropout only to the non-recurrent connections in multi-layer RNNs. Variational dropout (Gal & Ghahramani (2016)) uses the same dropout mask throughout a sequence during training. DropConnect (Wan et al., 2013) applies the dropout operation on the weight matrices. Zoneout (Krueger et al. (2016)), in a similar spirit to dropout, randomly chooses to use the previous time step's hidden state instead of the current one. Similarly, as a substitute for batch normalization, layer normalization normalizes the hidden units within each sample to have zero mean and unit standard deviation. Recurrent batch normalization applies batch normalization but with unshared mini-batch statistics for each time step (Cooijmans et al., 2016).
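To make the differences between these dropout variants concrete, the following is a minimal sketch in plain Python, not code from any of the cited papers. The function names, the list-based representation of hidden states and the use of inverted dropout scaling are all assumptions made for illustration.

```python
import random

def bernoulli_mask(size, drop_prob, rng):
    """Inverted-dropout mask: 0 with probability drop_prob, else 1 / (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]

def standard_dropout(seq, drop_prob, rng):
    """Sample a fresh mask at every time step."""
    return [[h * m for h, m in zip(step, bernoulli_mask(len(step), drop_prob, rng))]
            for step in seq]

def variational_dropout(seq, drop_prob, rng):
    """Sample one mask and reuse it at every time step of the sequence."""
    mask = bernoulli_mask(len(seq[0]), drop_prob, rng)
    return [[h * m for h, m in zip(step, mask)] for step in seq]

def zoneout(prev_hidden, new_hidden, zone_prob, rng):
    """Keep each unit's previous value with probability zone_prob."""
    return [p if rng.random() < zone_prob else n
            for p, n in zip(prev_hidden, new_hidden)]

rng = random.Random(0)
sequence = [[1.0] * 6 for _ in range(4)]  # 4 time steps, 6 hidden units, all ones
varied = variational_dropout(sequence, 0.5, rng)
# With variational dropout every time step shows the same dropped units.
same_pattern = all(step == varied[0] for step in varied)
```

Because the toy sequence is constant, `same_pattern` holds for variational dropout by construction, while `standard_dropout` will generally drop different units at different time steps.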
Merity et al. (2017a) and Merity et al. (2017b), on the other hand, show that activity regularization (AR) and temporal activation regularization (TAR) [1] are also effective methods for regularizing LSTMs.
Footnote 1: TAR and Zoneout are similar in their motivations because both lead adjacent time-step hidden states to be close on average.
In this paper we propose a simple regularization based on dropout that we call fraternal dropout, where we minimize an equally weighted sum of prediction losses from two identical copies of the same LSTM with different dropout masks, and add as a regularization the difference between the (pre-softmax) predictions of the two networks. We analytically show that our regularization objective is equivalent to minimizing the variance in predictions from different i.i.d. dropout masks. Our approach thus encourages the predictions to be invariant to dropout masks. We also discuss how our regularization is related to expectation-linear dropout (Ma et al., 2016), the Π-model (Laine & Aila, 2016) and activity regularization (Merity et al., 2017b), and empirically show that our method provides non-trivial gains over these related methods, which we examine further in our ablation study (Section 5).
2 Fraternal dropout
Dropout is a powerful regularization for neural networks. It is usually more effective on densely connected layers because they suffer more from overfitting than convolution layers, where the parameters are shared. For this reason dropout is an important regularization for RNNs. However, dropout has a gap between its training and inference phases, since the latter assumes linear activations to correct for the factor by which the expected value of each activation would be different (Ma et al., 2016). In addition, the predictions of models with dropout generally vary with the dropout mask. The desirable property in such cases would be for the final predictions to be invariant to the dropout mask.
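As such, the idea behind fraternal dropout is to train a neural network model in a way that encourages the variance in predictions under different dropout masks to be as small as possible. Specifically, consider an RNN model denoted by $\mathcal{M}(\theta)$ that takes as input $\mathbf{X}$, where $\theta$ denotes the model parameters. Let $\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)$ be the prediction of the model for input sample $\mathbf{X}$ at time $t$, for dropout mask $\mathbf{s}_i^t$ and current input $\mathbf{z}^t$, where $\mathbf{z}^t$ is a function of $\mathbf{X}$ and the hidden states corresponding to the previous time steps. Similarly, let $\ell^t(\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta), \mathbf{Y})$ be the corresponding $t$-th time step loss value for the overall input-target sample pair $(\mathbf{X}, \mathbf{Y})$.
Then in fraternal dropout, we simultaneously feed the input sample $\mathbf{X}$ forward through two identical copies of the RNN that share the same parameters $\theta$ but use different dropout masks $\mathbf{s}_i^t$ and $\mathbf{s}_j^t$ at each time step $t$. This yields two loss values at each time step $t$, given by $\ell^t(\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta), \mathbf{Y})$ and $\ell^t(\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta), \mathbf{Y})$. Then the overall loss function of fraternal dropout is given by,
$$\ell_{FD}(\mathbf{X}, \mathbf{Y}) = \frac{1}{2}\Big(\ell^t(\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta), \mathbf{Y}) + \ell^t(\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta), \mathbf{Y})\Big) + \frac{\kappa}{m}\,\mathcal{R}_{FD}(\mathbf{z}^t; \theta) \tag{1}$$
where $\kappa$ is the regularization coefficient, $m$ is the dimension of $\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)$ and $\mathcal{R}_{FD}(\mathbf{z}^t; \theta)$ is the fraternal dropout regularization given by,
$$\mathcal{R}_{FD}(\mathbf{z}^t; \theta) := \mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big\|_2^2\Big] \tag{2}$$
We use Monte Carlo sampling to approximate $\mathcal{R}_{FD}(\mathbf{z}^t; \theta)$, where $\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)$ and $\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)$ are the same as the ones used to calculate the $\ell^t$ values. Hence, the additional computation is negligible.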
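The fraternal dropout loss for a single sample can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in our experiments: the toy linear model with input dropout, the function names, and the scaling of the regularizer by the coefficient divided by the prediction dimension are assumptions. The regularizer is a one-sample Monte Carlo estimate that reuses the two predictions already computed for the target losses.

```python
import math
import random

def dropout_mask(size, drop_prob, rng):
    """Inverted-dropout mask over the input features."""
    keep = 1.0 - drop_prob
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]

def logits(weights, inputs, mask):
    """Pre-softmax prediction of a toy linear model with input dropout."""
    dropped = [x * s for x, s in zip(inputs, mask)]
    return [sum(w * x for w, x in zip(row, dropped)) for row in weights]

def cross_entropy(pred, target):
    """Negative log-likelihood of the target class under softmax(pred)."""
    top = max(pred)
    log_norm = top + math.log(sum(math.exp(p - top) for p in pred))
    return log_norm - pred[target]

def fraternal_loss(weights, inputs, target, kappa, drop_prob, rng):
    """Return (total loss, averaged target loss, squared prediction difference)."""
    mask_i = dropout_mask(len(inputs), drop_prob, rng)
    mask_j = dropout_mask(len(inputs), drop_prob, rng)
    pred_i = logits(weights, inputs, mask_i)
    pred_j = logits(weights, inputs, mask_j)
    target_loss = 0.5 * (cross_entropy(pred_i, target) + cross_entropy(pred_j, target))
    # One-sample Monte Carlo estimate of the regularizer, reusing both predictions.
    reg = sum((a - b) ** 2 for a, b in zip(pred_i, pred_j))
    return target_loss + kappa / len(pred_i) * reg, target_loss, reg

rng = random.Random(0)
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # 2 classes, 3 input features
total, target_loss, reg = fraternal_loss(weights, [1.0, 2.0, 3.0], 1, 0.1, 0.5, rng)
```

With `drop_prob` set to zero both masks are identical, the regularizer vanishes and the total reduces to the ordinary target loss.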
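We note that the regularization term of our objective is equivalent to minimizing the variance in the prediction function with different dropout masks, as shown below (proof in Appendix).
Remark 1.
Let $\mathbf{s}_i^t$ and $\mathbf{s}_j^t$ be i.i.d. dropout masks and $\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) \in \mathbb{R}^m$ be the prediction function as described above. Then,
$$\mathcal{R}_{FD}(\mathbf{z}^t; \theta) = \mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big\|_2^2\Big] = 2\sum_{q=1}^{m} \mathrm{var}_{\mathbf{s}_i^t}\big(p_q^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)\big) \tag{3}$$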
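Remark 1 can be checked numerically on a toy model. The sketch below is an illustration under stated assumptions rather than part of the proof: it uses a small tanh model with Bernoulli input dropout (both hypothetical choices) and enumerates every mask exactly, so no sampling error is involved. It compares the expected squared distance between predictions under two independent masks with twice the summed per-coordinate variance.

```python
import itertools
import math

def predict(weights, inputs, mask):
    """Toy nonlinear prediction with input dropout."""
    dropped = [x * s for x, s in zip(inputs, mask)]
    return [math.tanh(sum(w * x for w, x in zip(row, dropped))) for row in weights]

def mask_distribution(size, drop_prob):
    """All binary masks of a given size with their Bernoulli probabilities."""
    dist = []
    for mask in itertools.product([0.0, 1.0], repeat=size):
        kept = int(sum(mask))
        prob = (1.0 - drop_prob) ** kept * drop_prob ** (size - kept)
        dist.append((mask, prob))
    return dist

def expected_pairwise_distance(weights, inputs, drop_prob):
    """Exact expectation of the squared distance under two independent masks."""
    dist = mask_distribution(len(inputs), drop_prob)
    total = 0.0
    for (mask_i, prob_i), (mask_j, prob_j) in itertools.product(dist, repeat=2):
        p_i = predict(weights, inputs, mask_i)
        p_j = predict(weights, inputs, mask_j)
        total += prob_i * prob_j * sum((a - b) ** 2 for a, b in zip(p_i, p_j))
    return total

def twice_summed_variance(weights, inputs, drop_prob):
    """Two times the sum over output coordinates of the prediction variance."""
    dist = mask_distribution(len(inputs), drop_prob)
    preds = [(predict(weights, inputs, mask), prob) for mask, prob in dist]
    total = 0.0
    for q in range(len(weights)):
        mean = sum(prob * p[q] for p, prob in preds)
        total += sum(prob * (p[q] - mean) ** 2 for p, prob in preds)
    return 2.0 * total

weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
lhs = expected_pairwise_distance(weights, [1.0, -2.0, 0.5], 0.3)
rhs = twice_summed_variance(weights, [1.0, -2.0, 0.5], 0.3)
```

Up to floating-point rounding, `lhs` and `rhs` agree for any choice of weights, inputs and dropout probability.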
3 Related Work
3.1 Relation to Expectation Linear Dropout (ELD)
Ma et al. (2016) analytically showed that the expected (over samples) error between a model's expected prediction over all dropout masks and the prediction using the average mask is upper bounded. Based on this result, they propose to explicitly minimize the difference (we have adapted their regularization to our notation),
$$\Big\|\mathbb{E}_{\mathbf{s}}\big[\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}; \theta)\big] - \mathbf{p}^t\big(\mathbf{z}^t, \mathbb{E}_{\mathbf{s}}[\mathbf{s}]; \theta\big)\Big\|_2 \tag{4}$$
where $\mathbf{s}$ is the dropout mask. However, due to feasibility considerations, they instead propose to use the following regularization in practice,
$$\mathcal{R}_{ELD}(\mathbf{z}^t; \theta) := \mathbb{E}_{\mathbf{s}_i^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t\big(\mathbf{z}^t, \mathbb{E}_{\mathbf{s}}[\mathbf{s}]; \theta\big)\big\|_2^2\Big] \tag{5}$$
Specifically, this is achieved by feeding the input forward twice through the network, with and without a dropout mask, and minimizing the main network loss (with dropout) along with the regularization term specified above (but without backpropagating gradients through the network without dropout). The goal of Ma et al. (2016) is to minimize the network loss along with the expected difference between the prediction from an individual dropout mask and the prediction from the expected dropout mask. We note that our regularization objective is upper bounded by the expectation-linear dropout regularization, as shown below (proof in Appendix).
Proposition 1.
$\mathcal{R}_{FD}(\mathbf{z}^t; \theta) \leq 4\,\mathcal{R}_{ELD}(\mathbf{z}^t; \theta)$.
This result shows that minimizing the ELD objective indirectly minimizes our regularization term. Finally, as indicated above, they apply the target loss only on the network with dropout. In fact, in our own ablation studies (see Section 5) we find that backpropagating the target loss through the network without dropout makes optimizing the model harder. However, in our setting, simultaneously backpropagating the target loss through both networks yields both a performance gain and a convergence gain. We believe convergence is faster for our regularization because network weights are more likely to get target-based updates from backpropagation in our case. This is especially true for weight dropout (Wan et al., 2013), since in this case dropped weights do not get updated in the training iteration.
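The bound in Proposition 1 can also be illustrated numerically. The following sketch is an assumption-laden toy check, not a substitute for the proof: it uses a hypothetical tanh model with Bernoulli input dropout, takes the ELD regularizer to be the expected squared distance to the prediction under the mean mask, and computes both quantities exactly by enumerating all masks.

```python
import itertools
import math

def predict(weights, inputs, mask):
    """Toy nonlinear prediction with input dropout."""
    dropped = [x * s for x, s in zip(inputs, mask)]
    return [math.tanh(sum(w * x for w, x in zip(row, dropped))) for row in weights]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_regularizers(weights, inputs, drop_prob):
    """Return the exact fraternal and ELD regularizers for the toy model."""
    size = len(inputs)
    preds = []
    for mask in itertools.product([0.0, 1.0], repeat=size):
        kept = int(sum(mask))
        prob = (1.0 - drop_prob) ** kept * drop_prob ** (size - kept)
        preds.append((predict(weights, inputs, mask), prob))
    # Prediction under the expected mask (each unit kept with probability 1 - drop_prob).
    p_mean_mask = predict(weights, inputs, [1.0 - drop_prob] * size)
    r_fd = sum(pi * pj * squared_distance(a, b) for a, pi in preds for b, pj in preds)
    r_eld = sum(prob * squared_distance(a, p_mean_mask) for a, prob in preds)
    return r_fd, r_eld

weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
r_fd, r_eld = exact_regularizers(weights, [1.0, -2.0, 0.5], 0.3)
```

For this toy model `r_fd` stays below `4 * r_eld`, as the proposition predicts.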
3.2 Relation to Π-model
Laine & Aila (2016) propose the Π-model with the goal of improving performance on classification tasks in the semi-supervised setting. They propose a model similar to ours (considering the equivalent deep feedforward version of our model), except that they apply the target loss only on one of the networks and use a time-dependent weighting function (while we use a constant $\kappa$). The intuition in their case is to leverage unlabeled data by using it to minimize the difference in prediction between the two copies of the network with different dropout masks. Further, they also test their model in the supervised setting but do not explain the improvements they obtain by using this regularization.
We note that in our case we analytically show that minimizing our regularizer (also used in the Π-model) is equivalent to minimizing the variance in the model predictions (Remark 1). Furthermore, we also show the relation of our regularizer to expectation-linear dropout (Proposition 1). In Section 5, we study the effect of applying the target-based loss on both networks, which is not done in the Π-model. We find that applying the target loss on both networks leads to significantly faster convergence. Finally, we bring to attention that temporal ensembling (another model proposed by Laine & Aila (2016), claimed to be a better version of the Π-model for semi-supervised learning) is intractable in natural language processing applications because storing averaged predictions over all of the time steps would exhaust memory (since predictions are usually huge, tens of thousands of values). On a final note, we argue that in the supervised case using a time-dependent weighting function instead of a constant value is not needed. Since ground truth labels are known, we have not observed the problem mentioned by Laine & Aila (2016), that the network gets stuck in a degenerate solution when the weight of the regularization term is too large in earlier epochs of training. We note that it is much easier to search for an optimal constant value, which is what we do in our case, than to tune a time-dependent function.
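The difference between a time-dependent weighting function and a constant coefficient can be sketched as follows. The Gaussian ramp-up shape, the ramp-up length and all names below are assumptions based on our reading of the schedule described by Laine & Aila (2016); they are not taken from their code.

```python
import math

def gaussian_rampup(epoch, rampup_epochs, max_weight):
    """Time-dependent weight: grows along a Gaussian curve, then stays at max_weight."""
    if epoch >= rampup_epochs:
        return max_weight
    progress = epoch / rampup_epochs
    return max_weight * math.exp(-5.0 * (1.0 - progress) ** 2)

def constant_weight(epoch, kappa):
    """Constant weight, as used for fraternal dropout: independent of the epoch."""
    return kappa

# The ramp-up adds two quantities to tune (shape and length); the constant adds one.
schedule = [gaussian_rampup(e, 80, 1.0) for e in (0, 20, 40, 80)]
```

The constant schedule leaves a single scalar to search over, which is the simplification argued for above.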
4 Experiments
4.1 Language Models
For language modeling we test our model on two benchmark datasets, the Penn Treebank (PTB) dataset (Marcus et al., 1993) and the WikiText-2 (WT2) dataset (Merity et al., 2016). We used the preprocessing from Mikolov et al. (2010) for the PTB corpus and the Moses tokenizer (Koehn et al., 2007) for the WT2 dataset.
For both datasets we applied the 3-layer AWD-LSTM architecture described in Merity et al. (2017a) [2]. The model used for PTB has 24 million parameters, compared with 34 million for WT2, because WT2 has a larger vocabulary, for which we use a larger embedding matrix. Apart from this difference, the architectures are identical.
Footnote 2: We used the official GitHub repository code for this paper.
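The gap between the two parameter counts can be reproduced with a short calculation. The sizes used below are assumptions that are not stated in this text: an embedding size of 400, a hidden size of 1150, tied input and output embeddings, two bias vectors per LSTM gate, and vocabulary sizes of 10,000 (PTB) and 33,278 (WT2), which we take from the commonly used AWD-LSTM configuration and dataset releases.

```python
def lstm_layer_params(input_size, hidden_size):
    """Four gates, each with input weights, recurrent weights and two bias vectors."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

def awd_lstm_params(vocab_size, emb_size=400, hidden_size=1150):
    """Parameter count of a 3-layer LSTM language model with tied embeddings."""
    layers = (lstm_layer_params(emb_size, hidden_size)
              + lstm_layer_params(hidden_size, hidden_size)
              + lstm_layer_params(hidden_size, emb_size))
    embedding = vocab_size * emb_size  # shared with the output layer (tied weights)
    output_bias = vocab_size
    return layers + embedding + output_bias

ptb_params = awd_lstm_params(10_000)
wt2_params = awd_lstm_params(33_278)
embedding_gap = wt2_params - ptb_params  # due only to the vocabulary size
```

Under these assumptions the counts round to 24 million and 34 million, and the whole difference comes from the vocabulary-dependent embedding and output bias.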
Table 1: Perplexity on word-level Penn Treebank (PTB).

| Model | Parameters | Validation | Test |
|---|---|---|---|
| Zaremba et al. (2014) - LSTM (medium) | 10M | 86.2 | 82.7 |
| Zaremba et al. (2014) - LSTM (large) | 24M | 82.2 | 78.4 |
| Gal & Ghahramani (2016) - Variational LSTM (medium) | 20M | 81.9 | 79.7 |
| Gal & Ghahramani (2016) - Variational LSTM (large) | 66M | 77.9 | 75.2 |
| Inan et al. (2016) - Variational LSTM | 51M | 71.1 | 68.5 |
| Inan et al. (2016) - Variational RHN | 24M | 68.1 | 66.0 |
| Zilly et al. (2016) - Variational RHN | 23M | 67.9 | 65.4 |
| Melis et al. (2017) - 5-layer RHN | 24M | 64.8 | 62.2 |
| Melis et al. (2017) - 4-layer skip connection LSTM | 24M | 60.9 | 58.3 |
| Merity et al. (2017a) - AWD-LSTM 3-layer | 24M | 60.0 | 57.3 |
| fraternal dropout + AWD-LSTM 3-layer | 24M | 58.9 | 56.8 |
We evaluate our model using the perplexity metric and compare our results against the existing state-of-the-art results. The results are reported in Table 1. Our approach achieves state-of-the-art performance compared with existing benchmarks.
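For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below is a minimal illustration; the function name and the use of natural logarithms are assumptions about the convention, not a description of our evaluation code.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_log_probs = [math.log(1.0 / 4.0)] * 10
uniform_perplexity = perplexity(uniform_log_probs)
```

Lower is better: a perplexity of k means the model is, on average, as uncertain as a uniform choice among k tokens.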
Table 2: Perplexity on word-level WikiText-2.

| Model | Parameters | Validation | Test |
|---|---|---|---|
| Merity et al. (2016) - Variational LSTM + Zoneout | 20M | 108.7 | 100.9 |
| Merity et al. (2016) - Variational LSTM | 20M | 101.7 | 96.3 |
| Inan et al. (2016) - Variational LSTM | 28M | 91.5 | 87.0 |
| Melis et al. (2017) - 5-layer RHN | 24M | 78.1 | 75.6 |
| Melis et al. (2017) - 1-layer LSTM | 24M | 69.3 | 65.9 |
| Melis et al. (2017) - 2-layer skip connection LSTM | 24M | 69.1 | 65.9 |
| Merity et al. (2017a) - AWD-LSTM 3-layer | 34M | 68.6 | 65.8 |
| fraternal dropout + AWD-LSTM 3-layer | 34M | 66.8 | 64.1 |
On the WikiText-2 language modeling task we outperform the current state of the art by a significant margin (64.1 versus 65.8 test perplexity). The final results are presented in Table 2.
More details about the experiment can be found in Section 5.4.
4.2 Image captioning
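Table 3: BLEU scores on the image captioning task.

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|
| Show and Tell Xu et al. (2015) | 66.6 | 46.1 | 32.9 | 24.6 |
| Baseline | 68.8 | 50.8 | 36.1 | 25.6 |
| Fraternal dropout, | 69.3 | 51.4 | 36.6 | 26.1 |
| Fraternal dropout, | 69.3 | 51.5 | 36.9 | 26.3 |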
We also apply fraternal dropout on an image captioning task. We use the well-known show and tell model (Vinyals et al., 2014) as a baseline [3]. We emphasize that in the image captioning task, the image encoder and sentence decoder architectures are usually learned together. Since we want to focus on the benefits of using fraternal dropout in RNNs, we use a frozen pretrained ResNet-101 (He et al., 2015) model as our image encoder. This means that our results are not directly comparable with other state-of-the-art methods; however, we report results for the original methods so readers can see that our baseline performs well. The final results are presented in Table 3.
Footnote 3: We used the PyTorch implementation with default hyperparameters from github.com/ruotianluo/neuraltalk2.pytorch.
We argue that in this task smaller $\kappa$ values are optimal because the image captioning encoder is given all information in the beginning and hence the variance of consecutive predictions is smaller than in unconditioned natural language processing tasks. Fraternal dropout may be beneficial here mainly because it averages gradients over different masks and hence updates weights more frequently.
5 Ablation Studies
In this section, the goal is to study existing methods closely related to ours: expectation-linear dropout (Ma et al., 2016), the Π-model (Laine & Aila, 2016) and activity regularization (Merity et al., 2017b). All of our ablation experiments, which use a single-layer LSTM, share the same hyperparameters and model architecture [4].
Footnote 4: We use a batch size of 64; truncated backpropagation with 35 time steps; a constant zero state provided as the initial state with probability 0.01 (similar to Melis et al. (2017)); SGD with learning rate 30 (no momentum), which is multiplied by 0.1 whenever validation performance does not improve for 20 epochs; weight dropout of 0.5 on the hidden-to-hidden matrix; dropping every word in a mini-batch with probability 0.1; embedding dropout 0.65; output dropout 0.4 (final value of LSTM before embedding); an input embedding size of 655, with the input/output size of the LSTM equal to the embedding size 655 (and the embedding weights tied); gradient clipping of 0.25; and weight decay.
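The ablation setup can be summarized as a configuration plus a learning-rate rule. The sketch below is illustrative only: the dictionary keys and the `PlateauDecay` class are hypothetical names, the weight decay entry is left out because its value is not given in this text, and the decay rule is one reasonable reading of "multiplied by 0.1 whenever validation performance does not improve for 20 epochs".

```python
ABLATION_CONFIG = {
    "batch_size": 64,
    "bptt_steps": 35,
    "zero_initial_state_prob": 0.01,
    "learning_rate": 30.0,
    "lr_decay_factor": 0.1,
    "lr_patience_epochs": 20,
    "weight_dropout": 0.5,
    "word_dropout": 0.1,
    "embedding_dropout": 0.65,
    "output_dropout": 0.4,
    "embedding_size": 655,
    "hidden_size": 655,
    "tied_embeddings": True,
    "gradient_clipping": 0.25,
}

class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr, patience, factor):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_perplexity):
        """Record one epoch's validation perplexity and return the learning rate to use next."""
        if val_perplexity < self.best:
            self.best = val_perplexity
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

schedule = PlateauDecay(ABLATION_CONFIG["learning_rate"],
                        ABLATION_CONFIG["lr_patience_epochs"],
                        ABLATION_CONFIG["lr_decay_factor"])
```

Calling `schedule.step(...)` once per epoch with the validation perplexity returns the learning rate for the next epoch.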
5.1 Expectation-linear dropout (ELD)
The relation with expectation-linear dropout (Ma et al., 2016) has been discussed in Section 3.1. Here we perform experiments to study the difference in performance when using the ELD regularization versus our regularization (FD). In addition to ELD, we also study a modification (ELDM) of ELD which applies the target loss to both copies of the LSTM in ELD, similar to FD (notice that in their case only one LSTM has dropout). Finally, we have a baseline model without any of these regularizations. The learning dynamics curves are shown in Figure 1. Our regularization converges faster than the other methods. In terms of generalization, we find that FD is similar to ELD, but the baseline and ELDM are much worse. Interestingly, looking at the train and validation curves together, ELDM seems to suffer from optimization problems.
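The three objectives compared here differ only in which branches use dropout and which receive the target loss. The sketch below makes that explicit for a single prediction. It is an illustration under assumptions: the function names are hypothetical, the regularizer is taken to be a mean squared difference scaled by a coefficient in all three cases, and the equal weighting of the two target losses in ELDM is our choice for the sketch.

```python
import math

def cross_entropy(pred, target):
    """Negative log-likelihood of the target class under softmax(pred)."""
    top = max(pred)
    return top + math.log(sum(math.exp(p - top) for p in pred)) - pred[target]

def mean_squared_difference(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fd_objective(pred_mask_i, pred_mask_j, target, kappa):
    """FD: both branches use dropout and both receive the target loss."""
    target_loss = 0.5 * (cross_entropy(pred_mask_i, target)
                         + cross_entropy(pred_mask_j, target))
    return target_loss + kappa * mean_squared_difference(pred_mask_i, pred_mask_j)

def eld_objective(pred_dropout, pred_no_dropout, target, kappa):
    """ELD: only the dropout branch receives the target loss.

    In a real implementation pred_no_dropout would be treated as a constant,
    so no gradient flows through the branch without dropout.
    """
    return (cross_entropy(pred_dropout, target)
            + kappa * mean_squared_difference(pred_dropout, pred_no_dropout))

def eldm_objective(pred_dropout, pred_no_dropout, target, kappa):
    """ELDM: like ELD, but the target loss is applied to both branches."""
    target_loss = 0.5 * (cross_entropy(pred_dropout, target)
                         + cross_entropy(pred_no_dropout, target))
    return target_loss + kappa * mean_squared_difference(pred_dropout, pred_no_dropout)
```

When the two predictions coincide, all three objectives reduce to the plain cross-entropy, so any difference in behavior comes from how the branches disagree.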
5.2 Π-model
Since the Π-model (Laine & Aila, 2016) is similar to our algorithm (even though it is designed for semi-supervised learning in feedforward networks), we study the difference in performance with the Π-model [5] both qualitatively and quantitatively to establish the advantage of our approach. First, we run both a single-layer LSTM and the 3-layer AWD-LSTM on the PTB task to check how their model compares with ours in the case of language modeling. The results are shown in Figure 1 and 3. We find that our model converges significantly faster than the Π-model. We believe this happens because we backpropagate the target loss through both networks (in contrast to the Π-model), which leads to weights getting updated using target-based gradients more often.
Footnote 5: We used a constant weighting function, since we want to focus on the difference when the target loss is backpropagated through one or two networks. Additionally, we find tuning a function instead of a constant infeasible.
Even though we designed our algorithm specifically to address problems in RNNs, to have a fair comparison we compare with the Π-model on a semi-supervised task, which is their goal. Specifically, we used the CIFAR-10 dataset, which consists of images from 10 classes. Following the usual splits used in the semi-supervised learning literature, we use 4 thousand labeled and 41 thousand unlabeled samples for training, 5 thousand labeled samples for validation and 10 thousand labeled samples for the test set. We use the original ResNet-56 (He et al., 2015) architecture. We run a grid search over $\kappa$ and the dropout rate, and leave the rest of the hyperparameters unchanged. We additionally check the importance of using unlabeled data. The results are reported in Table 4. We find that our algorithm performs on par with the Π-model. When unlabeled data is not used, fraternal dropout provides slightly better results than ordinary dropout.
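Table 4: Results on the semi-supervised CIFAR-10 task.

| Model | Dropout rate | Unlabeled data used | Validation | Test |
|---|---|---|---|---|
| Ordinary | 0.1 | No | 78.4 (± 0.25) | 76.9 (± 0.31) |
| None | 0.0 | No | 78.8 (± 0.59) | 77.1 (± 0.3) |
| Fraternal () | 0.05 | No | 79.3 (± 0.38) | 77.6 (± 0.35) |
| Ordinary + Π-model | 0.1 | Yes | 80.2 (± 0.33) | 78.5 (± 0.46) |
| Fraternal () | 0.1 | Yes | 80.5 (± 0.18) | 79.1 (± 0.37) |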
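One way unlabeled data can enter the objective in the semi-supervised setting is sketched below. This is an assumption-based illustration and not our training code: the function name is hypothetical, the target loss is averaged over labeled samples only, and the consistency term between the two dropout branches is averaged over all samples.

```python
import math

def cross_entropy(pred, target):
    """Negative log-likelihood of the target class under softmax(pred)."""
    top = max(pred)
    return top + math.log(sum(math.exp(p - top) for p in pred)) - pred[target]

def semi_supervised_fd_loss(batch, kappa):
    """batch: list of (pred_mask_i, pred_mask_j, label), with label None if unlabeled.

    Returns (total loss, target loss over labeled samples, consistency over all samples).
    """
    target_terms = []
    consistency_terms = []
    for pred_i, pred_j, label in batch:
        # Every sample, labeled or not, contributes to the consistency term.
        consistency_terms.append(
            sum((a - b) ** 2 for a, b in zip(pred_i, pred_j)) / len(pred_i))
        if label is not None:
            target_terms.append(
                0.5 * (cross_entropy(pred_i, label) + cross_entropy(pred_j, label)))
    target_loss = sum(target_terms) / len(target_terms) if target_terms else 0.0
    consistency = sum(consistency_terms) / len(consistency_terms)
    return target_loss + kappa * consistency, target_loss, consistency

batch = [
    ([0.2, 0.1], [0.4, -0.1], 0),     # labeled sample
    ([1.0, -1.0], [0.5, -0.5], None),  # unlabeled sample
]
total, target_loss, consistency = semi_supervised_fd_loss(batch, 0.1)
```

Dropping the unlabeled rows from `batch` corresponds to the "Unlabeled data used: No" setting in Table 4.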
5.3 Activity regularization and temporal activity regularization analysis
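The authors of Merity et al. (2017b) study the importance of activity regularization (AR) [6] and temporal activity regularization (TAR) in LSTMs, given as,
$$\mathcal{R}_{AR}(\mathbf{z}^t; \theta) = \frac{\alpha}{d}\,\big\|\mathbf{h}^t\big\|_2^2 \tag{6}$$
$$\mathcal{R}_{TAR}(\mathbf{z}^t; \theta) = \frac{\beta}{d}\,\big\|\mathbf{h}^t - \mathbf{h}^{t-1}\big\|_2^2 \tag{7}$$
where $\mathbf{h}^t \in \mathbb{R}^d$ is the LSTM's output activation at time step $t$ (hence it depends on both the current input $\mathbf{z}^t$ and the model parameters $\theta$), and $\alpha$ and $\beta$ are the regularization coefficients. Notice that the AR and TAR regularizations are applied on the output of the LSTM, while our regularization is applied on the pre-softmax output $\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)$ of the LSTM. However, since our regularization can be decomposed as
$$\mathcal{R}_{FD}(\mathbf{z}^t; \theta) = \mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big\|_2^2\Big] \tag{8}$$
$$= \mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)\big\|_2^2 + \big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big\|_2^2 - 2\,\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)^{\top}\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\Big] \tag{9}$$
and encapsulates an $\ell_2$ term along with the dot product term, we perform experiments to confirm that the gains in our approach are not due to the $\ell_2$ regularization alone. A similar argument goes for the TAR objective. We run a grid search on $\alpha$ and $\beta$, which includes the hyperparameters mentioned in Merity et al. (2017a). For our regularization, we search over $\kappa$. Furthermore, we also compare with a regularization (PR) that regularizes $\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)\|_2^2$ to further rule out any gains only from $\ell_2$ regularization. Based on this grid search, we pick the best model on the validation set for all the regularizations, and additionally report a baseline model without any of these four regularizations. The learning dynamics are shown in Figure 4. Our regularization performs better both in terms of convergence and generalization compared with other methods. Average hidden state activation is reduced when any of the regularizers described is applied (see Figure 3).
Footnote 6: In our actual experiments with AR we used the dropout-masked activation $\mathbf{m} \odot \mathbf{h}^t$, where $\mathbf{m}$ is the dropout mask, in place of $\mathbf{h}^t$, because it was implemented as such in the original paper's GitHub repository (Merity et al., 2017a).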
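The quantities in this subsection can be written out in a few lines. The sketch below assumes that AR and TAR are squared norms averaged over the hidden dimension and scaled by their coefficients; the function names are hypothetical. It also checks the decomposition of the squared prediction difference into two squared norms and a dot product term for one pair of predictions.

```python
def activity_reg(h, alpha):
    """AR: scaled mean squared activation of the LSTM output."""
    return alpha / len(h) * sum(x * x for x in h)

def temporal_activity_reg(h, h_prev, beta):
    """TAR: scaled mean squared change of the LSTM output between adjacent time steps."""
    return beta / len(h) * sum((a - b) ** 2 for a, b in zip(h, h_prev))

def squared_distance(p_i, p_j):
    return sum((a - b) ** 2 for a, b in zip(p_i, p_j))

def decomposed_distance(p_i, p_j):
    """Same quantity as squared_distance, split into norm and dot product terms."""
    norm_i = sum(a * a for a in p_i)
    norm_j = sum(b * b for b in p_j)
    dot = sum(a * b for a, b in zip(p_i, p_j))
    return norm_i + norm_j - 2.0 * dot

p_i = [0.3, -1.2, 0.7]
p_j = [0.1, -0.9, 1.1]
direct = squared_distance(p_i, p_j)
decomposed = decomposed_distance(p_i, p_j)
```

The norm terms are what a pure squared-norm penalty on predictions (PR) would control; the dot product term is what distinguishes the fraternal regularizer from it.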
5.4 Language modeling fair comparison
As mentioned in Section 4.1, influenced by Melis et al. (2017), we want to make sure that fraternal dropout outperforms existing methods not simply because of an extensive hyperparameter grid search. Hence, in our experiments we left the vast majority of the hyperparameters mentioned in the original paper unchanged, i.e. embedding and hidden state sizes, gradient clipping value, weight decay and the values used for all dropout layers (dropout on the word vectors, the output between LSTM layers, the output of the final LSTM, and embedding dropout). However, a few changes were necessary:

the coefficients for AR and TAR had to be altered because fraternal dropout also affects RNN activations (as explained in Section 5.3); we did not run a grid search to obtain the best values but simply deactivated the AR and TAR regularizers;

since we need twice as much memory, the batch size is halved so the model needs approximately the same amount of memory and hence fits on the same GPU.
The last change is altering the ASGD non-monotone interval hyperparameter. We ran a grid search on it and obtained very similar results for the largest values (40, 50 and 60). Hence, our model is trained longer using the ordinary SGD optimizer than the original model.
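The role of the non-monotone interval can be sketched as follows. The rule below is our reading of the switching criterion in the public AWD-LSTM code (switch from SGD to ASGD once the current validation perplexity is worse than the best value recorded more than the interval's number of checks ago); the function name and argument names are hypothetical.

```python
def should_switch_to_asgd(val_history, current_val, nonmono):
    """Decide whether to switch from SGD to ASGD.

    val_history: validation perplexities from earlier checks, oldest first.
    current_val: validation perplexity of the current check.
    nonmono: the non-monotone interval (number of most recent checks to ignore).
    """
    if len(val_history) <= nonmono:
        return False
    return current_val > min(val_history[:-nonmono])

history = [70.0, 90.0, 80.0]
switch_small_interval = should_switch_to_asgd(history, 85.0, 2)
switch_large_interval = should_switch_to_asgd(history, 85.0, 5)
```

A larger interval requires more recorded checks before the switch can trigger, which is consistent with the model spending more time under ordinary SGD.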
To double check that we do not obtain trivial gains, we run ten learning procedures with the original hyperparameters and different seeds (without fine-tuning) on the PTB dataset to compute confidence intervals. We record the average and the minimum of the best validation perplexity across these runs, and the same for test perplexity. Our score beats the minimum values obtained with ordinary dropout on both validation and test perplexity.
Due to lack of computational power, we ran a single training procedure for fraternal dropout on the WT2 dataset. In this experiment, we used the best hyperparameters found for the PTB dataset (the $\kappa$ value, the non-monotone interval and the halved batch size).
We confirm that ASGD benefits from the fine-tuning step (Merity et al., 2017a). However, this is a very time-consuming practice, and since different hyperparameters may be used in this additional part of the learning procedure, the probability of obtaining better results due to extensive grid search is higher. Hence, in our experiments we use the same fine-tuning procedure as implemented in the official repository (even though fraternal dropout was not used there). We present the importance of fine-tuning in Table 5.
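Table 5: Importance of fine-tuning (validation and test perplexity on PTB and WT2).

| Dropout | Fine-tuning | PTB validation | PTB test | WT2 validation | WT2 test |
|---|---|---|---|---|---|
| Ordinary | None | 60.7 | 58.8 | 69.1 | 66.0 |
| Ordinary | One | 60.0 | 57.3 | 68.6 | 65.8 |
| Fraternal | None | 59.8 | 58.0 | 68.3 | 65.3 |
| Fraternal | One | 58.9 | 56.8 | 66.8 | 64.1 |
| Fraternal | Two | 58.5 | 56.2 | – | – |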
We argue that running grid search for all hyperparameters jointly may lead to better results (altering dropout rates may be especially beneficial since our method explicitly takes advantage of using dropout). However, our goal here is to rule out the possibility of outperforming just because of using better hyperparameters.
6 Conclusion
In this paper we propose a simple regularization method for RNNs called fraternal dropout that acts as a regularization by reducing the variance in model predictions across different dropout masks. We show that our model achieves state-of-the-art results on benchmark language modeling tasks along with faster convergence. We also analytically study the relationship between our regularization and expectation-linear dropout (Ma et al., 2016). We perform a number of ablation studies to evaluate our model from different aspects and carefully compare it with related methods both qualitatively and quantitatively.
Acknowledgements
The authors would like to acknowledge the support of the following agencies for research funding and computing support: NSERC, CIFAR, and IVADO. We would like to thank Rosemary Nan Ke and Philippe Lacaille for their thoughts and comments throughout the project. We would also like to thank Stanisław Jastrzębski** and Evan Racah** for useful discussions. (** equal contribution)
References
 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 Cooijmans et al. (2016) Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron C. Courville. Recurrent batch normalization. CoRR, abs/1603.09025, 2016. URL http://arxiv.org/abs/1603.09025.
 Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pp. 1019–1027, 2016.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Inan et al. (2016) Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.
 Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pp. 177–180. Association for Computational Linguistics, 2007.
 Krueger et al. (2016) David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville, et al. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
 Laine & Aila (2016) Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
 Laurent et al. (2016) César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, and Yoshua Bengio. Batch normalized recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 2657–2661. IEEE, 2016.
 Ma et al. (2016) Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy. Dropout with expectation-linear regularization. arXiv preprint arXiv:1609.08017, 2016.
 Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
 Melis et al. (2017) Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
 Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016. URL http://arxiv.org/abs/1609.07843.
 Merity et al. (2017a) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017a.
 Merity et al. (2017b) Stephen Merity, Bryan McCann, and Richard Socher. Revisiting activation regularization for language rnns. arXiv preprint arXiv:1708.01009, 2017b.
 Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, pp. 3, 2010.
 Rasmus et al. (2015) Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko. Semi-supervised learning with ladder network. CoRR, abs/1507.02672, 2015. URL http://arxiv.org/abs/1507.02672.
 Sajjadi et al. (2016) Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 1163–1171, 2016.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
 Vinyals et al. (2014) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014. URL http://arxiv.org/abs/1411.4555.
 Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th international conference on machine learning (ICML-13), pp. 1058–1066, 2013.
 Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015. URL http://arxiv.org/abs/1502.03044.
 Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Zilly et al. (2016) Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.
Appendix
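Remark 1.
Let $\mathbf{s}_i^t$ and $\mathbf{s}_j^t$ be i.i.d. dropout masks and $\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) \in \mathbb{R}^m$ be the prediction function as described above. Then,
$$\mathcal{R}_{FD}(\mathbf{z}^t; \theta) = \mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big\|_2^2\Big] = 2\sum_{q=1}^{m} \mathrm{var}_{\mathbf{s}_i^t}\big(p_q^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta)\big) \tag{10}$$
Proof.
For simplicity of notation, we omit the time index $t$.
\begin{align}
\mathcal{R}_{FD}(\mathbf{z}; \theta) &= \mathbb{E}_{\mathbf{s}_i, \mathbf{s}_j}\Big[\big\|\mathbf{p}(\mathbf{z}, \mathbf{s}_i; \theta) - \mathbf{p}(\mathbf{z}, \mathbf{s}_j; \theta)\big\|_2^2\Big] \tag{11}\\
&= \mathbb{E}_{\mathbf{s}_i, \mathbf{s}_j}\Big[\sum_{q=1}^{m}\big(p_q(\mathbf{z}, \mathbf{s}_i; \theta) - p_q(\mathbf{z}, \mathbf{s}_j; \theta)\big)^2\Big] \tag{12}\\
&= \sum_{q=1}^{m}\mathbb{E}_{\mathbf{s}_i, \mathbf{s}_j}\Big[\big(p_q(\mathbf{z}, \mathbf{s}_i; \theta) - p_q(\mathbf{z}, \mathbf{s}_j; \theta)\big)^2\Big] \tag{13}\\
&= \sum_{q=1}^{m}\mathbb{E}_{\mathbf{s}_i, \mathbf{s}_j}\Big[p_q^2(\mathbf{z}, \mathbf{s}_i; \theta) + p_q^2(\mathbf{z}, \mathbf{s}_j; \theta) - 2\,p_q(\mathbf{z}, \mathbf{s}_i; \theta)\,p_q(\mathbf{z}, \mathbf{s}_j; \theta)\Big] \tag{14}\\
&= \sum_{q=1}^{m}\Big(2\,\mathbb{E}_{\mathbf{s}_i}\big[p_q^2(\mathbf{z}, \mathbf{s}_i; \theta)\big] - 2\,\mathbb{E}_{\mathbf{s}_i}\big[p_q(\mathbf{z}, \mathbf{s}_i; \theta)\big]^2\Big) \tag{15}\\
&= 2\sum_{q=1}^{m}\mathrm{var}_{\mathbf{s}_i}\big(p_q(\mathbf{z}, \mathbf{s}_i; \theta)\big) \tag{16}
\end{align}
Step (15) uses the fact that $\mathbf{s}_i$ and $\mathbf{s}_j$ are independent and identically distributed.
∎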
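Proposition 1.
$\mathcal{R}_{FD}(\mathbf{z}^t; \theta) \leq 4\,\mathcal{R}_{ELD}(\mathbf{z}^t; \theta)$.
Proof.
Let $\bar{\mathbf{s}} := \mathbb{E}_{\mathbf{s}}[\mathbf{s}]$, then
\begin{align}
\mathcal{R}_{FD}(\mathbf{z}^t; \theta) &= \mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big\|_2^2\Big] \tag{17}\\
&= \mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \bar{\mathbf{s}}; \theta) + \mathbf{p}^t(\mathbf{z}^t, \bar{\mathbf{s}}; \theta) - \mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big\|_2^2\Big] \tag{18}\\
&= 4\,\mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\Big\|\tfrac{1}{2}\big(\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \bar{\mathbf{s}}; \theta)\big) + \tfrac{1}{2}\big(\mathbf{p}^t(\mathbf{z}^t, \bar{\mathbf{s}}; \theta) - \mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big)\Big\|_2^2\Big] \tag{19}
\end{align}
Then using Jensen's inequality,
\begin{align}
\mathcal{R}_{FD}(\mathbf{z}^t; \theta) &\leq 4\,\mathbb{E}_{\mathbf{s}_i^t, \mathbf{s}_j^t}\Big[\tfrac{1}{2}\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \bar{\mathbf{s}}; \theta)\big\|_2^2 + \tfrac{1}{2}\big\|\mathbf{p}^t(\mathbf{z}^t, \bar{\mathbf{s}}; \theta) - \mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_j^t; \theta)\big\|_2^2\Big] \tag{20}\\
&= 4\,\mathbb{E}_{\mathbf{s}_i^t}\Big[\big\|\mathbf{p}^t(\mathbf{z}^t, \mathbf{s}_i^t; \theta) - \mathbf{p}^t(\mathbf{z}^t, \bar{\mathbf{s}}; \theta)\big\|_2^2\Big] = 4\,\mathcal{R}_{ELD}(\mathbf{z}^t; \theta) \tag{21}
\end{align}
∎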