On Fast Dropout and its Applicability to Recurrent Networks
Abstract
Recurrent Neural Networks (RNNs) are rich models for the processing of sequential data. Recent work on advancing the state of the art has focused on the optimization or modelling of RNNs, mostly motivated by addressing the problems of vanishing and exploding gradients. The control of overfitting has seen considerably less attention. This paper contributes to that by analyzing fast dropout, a recent regularization method for generalized linear models and neural networks, from a backpropagation-inspired perspective. We show that fast dropout implements a quadratic form of an adaptive, per-parameter regularizer, which rewards large weights in the light of underfitting, penalizes them for overconfident predictions and vanishes at minima of an unregularized training loss. The derivatives of that regularizer are exclusively based on the training error signal. One consequence of this is the absence of a global weight attractor, which is particularly appealing for RNNs, since the dynamics are not biased towards a certain regime. We test the hypothesis that this improves the performance of RNNs on four musical data sets and find it confirmed.
Justin Bayer, Christian Osendorfer, Daniela Korhammer, Nutan Chen, Sebastian Urban and Patrick van der Smagt Lehrstuhl für Robotik und Echtzeitsysteme Fakultät für Informatik Technische Universität München bayer.justin@googlemail.com, osendorf@in.tum.de, korhammd@in.tum.de, ntchen86@gmail.com, surban@tum.de, smagt@brml.org
1 Introduction
Recurrent Neural Networks are among the most powerful models for sequential data. The capability of representing any measurable sequence-to-sequence mapping to arbitrary accuracy (Hammer, 2000) makes them universal approximators. Nevertheless they were given only little attention in the last two decades due to the problems of vanishing and exploding gradients (Hochreiter, 1991; Bengio et al., 1994; Pascanu et al., 2012). Error signals either blowing up or decaying exponentially for events many time steps apart rendered them largely impractical for the exact problems they were supposed to solve. This made successful training impossible on many tasks up until recently without resorting to special architectures or abandoning gradient-based optimization. Successful application on tasks with long-range dependencies has thus relied on one of those two paradigms. The former is to make use of long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). These approaches are among the best methods for the modelling of speech and handwriting (Graves et al., 2013, 2008; Graves, 2013). The latter is to rely on sensible initializations leading to echo-state networks (Jäger et al., 2003).
The publication of Martens and Sutskever (2011) can nowadays be considered a landmark, since it showed that even standard RNNs can be trained with the right optimization method. While a sophisticated Hessian-free optimizer was employed initially, further research (Sutskever et al., 2013; Bengio et al., 2012) has shown that carefully designed first-order methods can find optima of similar quality.
After all, the problem of underfitting standard RNNs can be dealt with to the extent that RNNs are practical in many areas, e.g., language modelling (Sutskever et al., 2011; Mikolov et al., 2010). In contrast, the problem of overfitting in standard RNNs has (due to the lack of necessity) been tackled only by few. As noted in Pascanu et al. (2012), using priors with a single optimum on the parameters may have detrimental effects on the representation capability of RNNs: a global attractor is constructed in parameter space. In the case of a prior with a mode at zero (e.g. an $L_2$ regularizer) this biases the network towards solutions which let information die out exponentially fast in time, making it impossible to memorize events for an indefinite amount of time.
Graves (2011) proposes to stochastically and adaptively distort the weights of LSTM-based RNNs, which is justified from the perspective of variational Bayes and the minimum description length principle. Overfitting is practically non-existent in the experiments conducted. It is untested whether this approach works well for standard RNNs; along the lines of the observations of Pachitariu and Sahani (2013), one might hypothesize that the injected noise disturbs the dynamics of RNNs too much and leads to divergence during training.
The deep neural network community has recently embraced a regularization method called dropout (Hinton et al., 2012). The gist is to randomly discard units from the network during training, leading to less interdependent feature detectors in the intermediary layers. Here, dropping out merely means to set the output of that unit to zero. An equivalent view is to set the complete outgoing weight vector to zero. It is questionable whether a straight transfer of dropout to RNNs is possible, since the resulting changes to the dynamics of an RNN during every forward pass are quite dramatic. This is the reason why Pachitariu and Sahani (2013) only use dropout on those parts of the RNN which are not dynamic, i.e. the connections feeding from the hidden into the output layer.
Our contribution is to show that using a recent smooth approximation to dropout (Wang and Manning, 2013) regularizes RNNs effectively. Since the approximation is deterministic, we may assert that all dynamic parts of the network operate in reasonable regimes. We show that fast dropout does not keep RNNs from reaching rich dynamics during training, which is not obvious due to the relation of classic dropout to L2 regularization (Wager et al., 2013).
The structure of the paper is as follows. We will first review RNNs and fast dropout (FD) (Wang and Manning, 2013). A novel analysis of the derivatives of fast dropout leads to an interpretation where we can perform a decomposition into a loss based on the average output of a network’s units and a regularizer based on its variance. We will discuss why this is a form that is well suited to RNNs and consequently conduct experiments that confirm our hypothesis.
2 Methods
In this section we will first review RNNs and fast dropout. We will then introduce a novel interpretation of what the fast dropout loss constitutes in section 2.2.2 and show relationships to two other regularizers.
2.1 Recurrent Neural Networks
We will define RNNs in terms of two components. For one, we are ultimately interested in an output $\mathbf{y}$, which we can calculate given the parameters $\theta$ of a network and some input $\mathbf{x}$. Secondly, we want to learn the parameters, which is done by the design and optimization of a function of the parameters $\mathcal{L}(\theta)$, commonly dubbed loss, cost, or error function.
Calculating the Output of an RNN
Given an input sequence $\mathbf{x} = (x_1, \dots, x_T)$, $x_t \in \mathbb{R}^{\kappa}$, we produce an output $\mathbf{y} = (y_1, \dots, y_T)$, $y_t \in \mathbb{R}^{\omega}$, which is done via an intermediary representation called the hidden state layer $\mathbf{h} = (h_1, \dots, h_T)$, $h_t \in \mathbb{R}^{\gamma}$. $\kappa$, $\omega$, and $\gamma$ are the dimensionalities of the inputs, outputs, and hidden state at each time step. Each component of the layers is sometimes referred to as a unit or a “neuron”. Depending on the associated layer, these are then input, hidden, or output units. We will also denote the set of units which feed into some unit $i$ as the incoming units of $i$. The units into which a unit $i$ feeds are called the outgoing units of $i$. For a recurrent network with a single hidden layer, this is done via iteration of the following equations from $t = 1$ to $T$:

$$h_t = f_h(x_t W_{in} + h_{t-1} W_{rec} + b_h),$$
$$y_t = f_y(h_t W_{out} + b_y),$$

where $\{W_{in}, W_{rec}, W_{out}\}$ are weight matrices and $\{b_h, b_y\}$ bias vectors. These form the set of parameters $\theta$ together with the initial hidden state $h_0$. The dimensionalities of all weight matrices, bias vectors, and initial hidden states are determined by the dimensionalities of the input sequences as well as desired hidden layer and output layer sizes. The functions $f_h$ and $f_y$ are so-called transfer functions and mostly coordinate-wise applied nonlinearities. We will call the activations of units presynaptic before the application of $f$ and postsynaptic afterwards. Typical choices include the logistic sigmoid $f(\xi) = \frac{1}{1 + \exp(-\xi)}$, tangent hyperbolicus and, more recently, the rectified linear $f(\xi) = \max(\xi, 0)$ (Zeiler et al., 2013). If we set the recurrent weight matrix $W_{rec}$ to zero, we recover a standard neural network (Bishop, 1995) applied independently to each time step.
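The iteration described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `matvec` and `rnn_forward` are hypothetical, a column-vector convention (weights multiply from the left) is used, the hidden transfer function is fixed to tanh and the output transfer function to the logistic sigmoid.

```python
import math

def matvec(W, v):
    """Return W @ v for a matrix stored as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_forward(xs, W_in, W_rec, W_out, b_h, b_y, h0):
    """Iterate h_t = tanh(W_in x_t + W_rec h_{t-1} + b_h) and
    y_t = sigmoid(W_out h_t + b_y) over the sequence xs."""
    h, ys = h0, []
    for x in xs:
        pre = [a + b + c for a, b, c in zip(matvec(W_in, x), matvec(W_rec, h), b_h)]
        h = [math.tanh(p) for p in pre]                      # postsynaptic hidden state
        out = [a + b for a, b in zip(matvec(W_out, h), b_y)]
        ys.append([1.0 / (1.0 + math.exp(-o)) for o in out])
    return ys

# Tiny example: 2 inputs, 3 hidden units, 1 output, 4 time steps.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W_in = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_rec = [[0.0, 0.4, -0.2], [0.3, 0.0, 0.1], [-0.5, 0.2, 0.0]]
W_out = [[1.0, -1.0, 0.5]]
ys = rnn_forward(xs, W_in, W_rec, W_out, [0.0] * 3, [0.0], [0.0] * 3)
```

With the recurrent matrix set to zero, the same input produces the same output regardless of its position in the sequence, which is the feed-forward special case mentioned in the text.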
Loss Function and Adaption of Parameters
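We will restrict ourselves to RNNs for the supervised case, where we are given a data set $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{t}^{(i)})\}_{i=1}^{N}$ consisting of $N$ pairs with $\mathbf{x}^{(i)} \in \mathbb{R}^{T \times \kappa}$ and $\mathbf{t}^{(i)} \in \mathbb{R}^{T \times \omega}$, $\kappa$ and $\omega$ being the input and output dimensionalities. Here $T$ refers to the sequence length, which we assume to be constant over the data set. We want to adapt the parameters $\theta$ of the network such that each of its outputs $\mathbf{y}^{(i)}$ is close to $\mathbf{t}^{(i)}$. Closeness is typically formulated as a loss function, e.g. the mean squared error $\mathcal{L}_{mse}(\theta) = \frac{1}{N}\sum_i \|\mathbf{t}^{(i)} - \mathbf{y}^{(i)}\|_2^2$ or the binary cross entropy $\mathcal{L}_{bce}(\theta) = -\frac{1}{N}\sum_i \left[\mathbf{t}^{(i)} \log \mathbf{y}^{(i)} + (1 - \mathbf{t}^{(i)}) \log (1 - \mathbf{y}^{(i)})\right]$, summed over time steps and components. If a loss is locally differentiable, finding good parameters can be performed by gradient-based optimization, such as nonlinear conjugate gradients or stochastic gradient descent. The gradients can be calculated efficiently via backpropagation through time (BPTT) (Rumelhart et al., 1986).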
2.2 Fast Dropout
In fast dropout (Wang and Manning, 2013), each unit in the network is assumed to be a random variable. To assure tractability, only the first and second moments of those random variables are kept, which suffices for a very good approximation. Since the presynaptic activation of each unit is a weighted sum of its incoming units (of which each is dropped out with a certain probability) we can safely assume Gaussianity for those inputs due to the central limit theorem. As we will see, this is sufficient to find efficient ways to propagate the mean and variance through the nonlinearity $f$.
2.2.1 Forward propagation
We will now inspect the forward propagation for a layer into a single unit, that is

$$a = (d \circ x)^T w, \qquad y = f(a),$$

where $\circ$ denotes the element-wise product and $f$ is a non-linear transfer function as before. Let the input layer $x$ to the unit be Gaussian distributed with diagonal covariance by assumption: $x \sim \mathcal{N}(\mu_x, \sigma_x^2)$. Furthermore, we have Bernoulli distributed variables indicating whether an incoming unit is not being dropped out, organized in a vector $d$ with $d_i \sim \mathcal{B}(p)$, $p$ being the complementary dropout rate. The weight vector $w$ is assumed to be constant.
A neural network will in practice consist of many such nodes, with some of them, the output units, directly contributing to the loss function $\mathcal{L}$. Others, the input units, will not stem from calculation but come from the data set. Each component of $x$ represents an incoming unit, which might be an external input to the network or a hidden unit. In general, $y$ will have a complex distribution depending highly on the nature of $f$. Given that the input to a function $f$ is Gaussian distributed, we obtain the mean and variance of the output as follows:

$$\mathrm{E}[y] = f_\mu(a) = \int f(\xi)\, \mathcal{N}(\xi \mid \mathrm{E}[a], \mathrm{V}[a])\, d\xi,$$
$$\mathrm{V}[y] = f_\sigma(a) = \int \left(f(\xi) - f_\mu(a)\right)^2 \mathcal{N}(\xi \mid \mathrm{E}[a], \mathrm{V}[a])\, d\xi.$$
Forward propagation through the nonlinearity for calculation of the postsynaptic activation can be approximated very well in the case of the logistic sigmoid and the tangent hyperbolicus and done exactly in case of the rectifier (for details, see Wang and Manning (2013)). While the rectifier has been previously reported to be a useful ingredient in RNNs (Bengio et al., 2012) we found that it leads to unstable learning behaviour in preliminary experiments and thus neglected it in this study, solely focusing on the tangent hyperbolicus. Other popular transfer functions, such as the softmax, need to be approximated either via sampling or an unscented transform (Julier and Uhlmann, 1997).
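As a rough illustration of propagating a Gaussian through a transfer function, the sketch below computes the output mean and variance numerically and compares them with the well-known closed form for the rectifier. It is not the approximation scheme of Wang and Manning (2013); the function names, the grid size and the integration width are assumptions made for this example.

```python
import math

def gaussian_moments(f, mean, var, n=4001, width=8.0):
    """Mean and variance of f(a) for a ~ N(mean, var), via a Riemann sum
    over [mean - width*sd, mean + width*sd]."""
    sd = math.sqrt(var)
    dx = 2.0 * width * sd / (n - 1)
    norm = 1.0 / (sd * math.sqrt(2.0 * math.pi))
    m1 = m2 = 0.0
    for k in range(n):
        xi = mean - width * sd + k * dx
        weight = norm * math.exp(-0.5 * ((xi - mean) / sd) ** 2) * dx
        y = f(xi)
        m1 += weight * y
        m2 += weight * y * y
    return m1, m2 - m1 * m1

def relu_moments(mean, var):
    """Exact mean and variance of max(a, 0) for a ~ N(mean, var)."""
    sd = math.sqrt(var)
    z = mean / sd
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    m1 = mean * cdf + sd * pdf
    m2 = (mean * mean + var) * cdf + mean * sd * pdf
    return m1, m2 - m1 * m1

tanh_mean, tanh_var = gaussian_moments(math.tanh, 0.0, 1.0)
relu_exact = relu_moments(0.3, 0.7)
relu_numeric = gaussian_moments(lambda xi: max(xi, 0.0), 0.3, 0.7)
```

For a zero-mean input the tanh output mean is zero by symmetry, and the rectifier's closed form agrees with the numerical integral up to discretization error.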
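To obtain a Gaussian approximation for $a = (d \circ x)^T w$, we will use $\hat{a} \sim \mathcal{N}(\mathrm{E}[a], \mathrm{V}[a])$. The mean and variance of $a$ can be obtained as follows. Since $d$ and $x$ are independent it follows that

$$\mathrm{E}[a] = (\mathrm{E}[x] \circ \mathrm{E}[d])^T w. \tag{1}$$

For independent random variables $A$ and $B$, $\mathrm{V}[AB] = \mathrm{V}[A]\,\mathrm{E}[B]^2 + \mathrm{E}[A]^2\,\mathrm{V}[B] + \mathrm{V}[A]\,\mathrm{V}[B]$. If we assume the components of $x$ to be independent, we can write

$$\mathrm{V}[a] = \left(p(1-p)\,\mathrm{E}[x]^2 + p\,\mathrm{V}[x]\right)^T w^2, \tag{2}$$

where squares of vectors are taken element-wise. (Footnote 1: In contrast to Wang and Manning (2013), we do not drop the variance not related to dropout. In their case, $\mathrm{V}[a] = p(1-p)\,(\mathrm{E}[x]^2)^T w^2$ was used instead of Equation (2).)
Furthermore, the independence assumption is necessary such that the Lyapunov condition is satisfied (Lehmann, 1999) for the central limit theorem to hold, ensuring that $a$ is approximately Gaussian.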
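The moments of the presynaptic activation can be checked against simulation. The following sketch assumes keep probability `p`, independent Gaussian inputs and a fixed weight vector; the function names and the example numbers are hypothetical and chosen only for illustration.

```python
import random

def fd_presynaptic_moments(mu_x, var_x, w, p):
    """Mean and variance of a = sum_i d_i x_i w_i with d_i ~ Bernoulli(p)
    and independent x_i ~ N(mu_i, var_i)."""
    mean = sum(p * m * wi for m, wi in zip(mu_x, w))
    var = sum((p * (1.0 - p) * m * m + p * v) * wi * wi
              for m, v, wi in zip(mu_x, var_x, w))
    return mean, var

def monte_carlo_moments(mu_x, var_x, w, p, n_samples=100000, seed=0):
    """Estimate the same two moments by sampling dropout masks and inputs."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(n_samples):
        a = 0.0
        for m, v, wi in zip(mu_x, var_x, w):
            if rng.random() < p:  # the incoming unit is kept with probability p
                a += rng.gauss(m, v ** 0.5) * wi
        total += a
        total_sq += a * a
    mean = total / n_samples
    return mean, total_sq / n_samples - mean * mean

mu_x, var_x, w, p = [1.0, -0.5, 2.0], [0.2, 0.1, 0.5], [0.3, -0.7, 0.4], 0.8
analytic = fd_presynaptic_moments(mu_x, var_x, w, p)
sampled = monte_carlo_moments(mu_x, var_x, w, p)
```

The closed-form values and the sampled estimates should agree up to Monte Carlo error, which shrinks as `n_samples` grows.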
Propagating the mean and the variance through $f$ via its mean and variance maps $f_\mu$ and $f_\sigma$ suffices for determining the presynaptic moments of the outgoing units. At the output of the whole model, we will simplify matters and ignore the variance. Some loss functions take the variance into account (e.g., a Gaussian log-likelihood as done in Bayer et al. (2013)). Sampling can be a viable alternative as well.
Fast Dropout for RNNs
The extension of fast dropout to recurrent networks is straightforward from a technical perspective. First, we note that we can concatenate the input vector at time step $t$, $x_t$, and the hidden state at the previous time step, $h_{t-1}$, into a single vector: $c_t = [x_t, h_{t-1}]$. We obtain a corresponding weight matrix by concatenation of the input-to-hidden and recurrent weight matrices $W_{in}$ and $W_{rec}$: $W_c = [W_{in}, W_{rec}]$. We can thus reduce the computation to the step from above.
2.2.2 Beyond the Backward Pass: A Regularization Term
Given the forward pass, we used automatic differentiation with Theano (Bergstra et al., 2010) to calculate the gradients. Nevertheless, we will contribute a close inspection of the derivatives. This will prove useful since it makes it possible to interpret fast dropout as an additional regularization term independent of the exact choice of loss function.
Consider a loss $\mathcal{L}(\mathcal{D}; \theta)$ which is a function of the data $\mathcal{D}$ and parameters $\theta$ (footnote 2: we will frequently omit the explicit dependency on $\mathcal{D}$ and $\theta$ where clear from context). In machine learning, we wish this loss to be minimal under unseen data $\mathcal{D}$ although we only have access to a training set $\mathcal{D}_{train}$. A typical approach is to optimize another loss $\mathcal{J}(\mathcal{D}_{train}; \theta)$ as a proxy in the hope that a good minimum of it will correspond to a good minimum of $\mathcal{L}$ for unseen data. Learning is often done by the optimization of $\mathcal{J} = \mathcal{L} + \mathcal{R}$, where $\mathcal{R}$ is called a regularizer. A common example of a regularizer is to place a prior on the parameters, in which case it is a function of $\theta$ and corresponds to the log-likelihood of the parameters. For weight decay, this is a spherical Gaussian with inverse scale $\lambda$, i.e. $\mathcal{R}_{wd}(\theta) = \lambda \|\theta\|_2^2$. Regularizers can be more sophisticated, e.g. Rifai et al. (2011) determine directions in input space to which a model’s outputs should be invariant. More recently, dropout (i.e. non-fast dropout) for generalized linear models has been interpreted as a semi-supervised regularization term encouraging confident predictions by Wager et al. (2013).
While it seems difficult to bring the objective function of fast dropout $\mathcal{J}_{fd}(\mathcal{D}; \theta)$ into the form of $\mathcal{L} + \mathcal{R}$, it is possible with the derivatives of each node. For this, we perform backpropagation-like calculations.
Let $a = (d \circ x)^T w$ and $y = f(a)$ be the pre- and postsynaptic activations of a component of a layer in the network. First note that $\partial \mathcal{J} / \partial w_i = \partial \mathcal{J} / \partial a \cdot \partial a / \partial w_i$ according to the chain rule. Since $a$ is a random variable, it will be described in one of two forms. In the case of a Gaussian approximation, we will summarize it in terms of its mean and variance; this approach is used if propagation through $f$ is possible in closed form. In the case of sampling, we will have a single instantiation $\hat{a}$ of the random variable, which we can propagate through $f$. An analysis of both cases is as follows.
Gaussian approximation
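We find the derivative of $\mathcal{J}$ with respect to one of its incoming weights to be

$$\frac{\partial \mathcal{J}}{\partial w_i} = \frac{\partial \mathcal{J}}{\partial \mathrm{E}[a]} \frac{\partial \mathrm{E}[a]}{\partial w_i} + \frac{\partial \mathcal{J}}{\partial \mathrm{V}[a]} \frac{\partial \mathrm{V}[a]}{\partial w_i}.$$

We know that $\mathrm{E}[a] = (\mathrm{E}[x] \circ \mathrm{E}[d])^T w$ and thus $\partial \mathrm{E}[a] / \partial w_i = \mathrm{E}[x]_i\, \mathrm{E}[d]_i$. This can be recognized as the standard backpropagation term if we consider the dropout variable $d_i$ as fixed. We will thus define

$$\frac{\partial \mathcal{L}^a}{\partial w_i} := \frac{\partial \mathcal{J}}{\partial \mathrm{E}[a]} \frac{\partial \mathrm{E}[a]}{\partial w_i} \tag{3}$$

and subsequently refer to it as the local derivative of the training loss. The second term can be analysed similarly. We apply the chain rule once more, which yields

$$\frac{\partial \mathcal{J}}{\partial \mathrm{V}[a]} \frac{\partial \mathrm{V}[a]}{\partial w_i} = \underbrace{\frac{\partial \mathcal{J}}{\partial \mathrm{V}[a]}}_{=: \delta^a} \frac{\partial \mathrm{V}[a]}{\partial w_i^2} \frac{\partial w_i^2}{\partial w_i},$$

for which any further simplification of $\delta^a$ depends on the exact form of $\mathcal{J}$. The remaining two factors can be written down explicitly, i.e.

$$\frac{\partial \mathrm{V}[a]}{\partial w_i^2} = p(1-p)\,\mathrm{E}[x]_i^2 + p\,\mathrm{V}[x]_i, \qquad \frac{\partial w_i^2}{\partial w_i} = 2 w_i.$$

Setting

$$\eta_i^a := |\delta^a| \left(p(1-p)\,\mathrm{E}[x]_i^2 + p\,\mathrm{V}[x]_i\right) \ge 0,$$

we conclude that

$$\frac{\partial \mathcal{J}}{\partial \mathrm{V}[a]} \frac{\partial \mathrm{V}[a]}{\partial w_i} = 2\, \mathrm{sgn}(\delta^a)\, \eta_i^a\, w_i =: \frac{\partial \mathcal{R}^a_{approx}}{\partial w_i}.$$

In alignment with Equation (3) this lets us arrive at

$$\frac{\partial \mathcal{J}}{\partial w_i} = \frac{\partial \mathcal{L}^a}{\partial w_i} + \frac{\partial \mathcal{R}^a_{approx}}{\partial w_i}$$

and offers an interpretation of fast dropout as an additive regularization term. An important and limiting aspect of this decomposition is that it only holds locally at $a$.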
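The split of the gradient into a mean-driven part and a variance-driven part can be verified numerically. The sketch below assumes a single linear unit and, purely for illustration, a Gaussian negative log-likelihood on its presynaptic moments; all function names and numbers are hypothetical.

```python
import math

def moments(w, mu_x, var_x, p):
    """Presynaptic mean and variance under dropout with keep probability p."""
    mean = sum(p * m * wi for m, wi in zip(mu_x, w))
    var = sum((p * (1 - p) * m * m + p * v) * wi * wi
              for m, v, wi in zip(mu_x, var_x, w))
    return mean, var

def loss(w, mu_x, var_x, p, t):
    """Assumed loss: Gaussian negative log-likelihood of target t."""
    m, v = moments(w, mu_x, var_x, p)
    return (m - t) ** 2 / (2 * v) + 0.5 * math.log(2 * math.pi * v)

def decomposed_grad(w, mu_x, var_x, p, t):
    """Return (loss term, regularizer term) of the gradient for each weight."""
    m, v = moments(w, mu_x, var_x, p)
    dJ_dmean = (m - t) / v
    delta = -((m - t) ** 2) / (2 * v * v) + 1 / (2 * v)  # dJ/dV[a]
    loss_term = [dJ_dmean * p * mu for mu in mu_x]
    reg_term = [2 * delta * (p * (1 - p) * mu * mu + p * var) * wi
                for mu, var, wi in zip(mu_x, var_x, w)]
    return loss_term, reg_term

def numeric_grad(w, mu_x, var_x, p, t, eps=1e-6):
    """Central finite differences of the loss with respect to each weight."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, mu_x, var_x, p, t) - loss(down, mu_x, var_x, p, t)) / (2 * eps))
    return grads

mu_x, var_x, w, p, t = [1.0, -0.5, 2.0], [0.2, 0.1, 0.5], [0.3, -0.7, 0.4], 0.8, 0.5
loss_term, reg_term = decomposed_grad(w, mu_x, var_x, p, t)
numeric = numeric_grad(w, mu_x, var_x, p, t)
```

The sum of the two analytic terms should match the finite-difference gradient; the second term is proportional to the weight itself, which is what makes the weight-decay reading possible.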
We note that depending on the sign of the error signal $\delta^a$, fast dropout can take on three different behaviours:

$\delta^a = 0$: The error signal is zero and thus the variance of the unit is considered to be optimal for the loss. The fast dropout term vanishes; this is especially true at optima of the overall loss.

$\delta^a < 0$: The unit should increase its variance. The exact interpretation of this depends on the loss, but in many cases this is related to the expectation of the unit being quite erroneous and leads to an increase of scatter of the output. The fast dropout term encourages a quadratic growth of the weights.

$\delta^a > 0$: The unit should decrease its variance. As before, this depends on the exact loss function but will mostly be related to the expectation of the unit being quite right, which makes a reduction of scatter desirable. The fast dropout term encourages a quadratic shrinkage of the weights.
This behaviour can be illustrated for output units by numerically inspecting the values and gradients of the presynaptic moments given a loss. For that we consider a single unit $y = f(a)$ and a loss $d(y, t)$ measuring the divergence of its output to a target value $t$. The presynaptic variance $\mathrm{V}[a]$ can enter the loss not at all or in one of two ways, respected by either the loss (see Bayer et al., 2013) or the transfer function. Three examples for this are

(i) Squared loss on the mean, i.e. $d(y, t) = (\mathrm{E}[y] - t)^2$ with $y = a$,

(ii) Gaussian log-likelihood on the moments, i.e. $d(y, t) = \frac{(\mathrm{E}[y] - t)^2}{2\,\mathrm{V}[y]} + \frac{1}{2}\log\left(2\pi\,\mathrm{V}[y]\right)$ with $y = a$,

(iii) Negative Bernoulli cross entropy, i.e. $d(y, t) = -t \log \mathrm{E}[y] - (1 - t) \log (1 - \mathrm{E}[y])$ with $y = \frac{1}{1 + \exp(-a)}$.

We visualize the presynaptic mean and variance, their gradients and their respective loss values in Figure 1. For the two latter cases, erroneous units first increase the variance, then move towards the correct mean and subsequently reduce the variance.
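The sign change of the variance gradient can be made explicit for the Gaussian log-likelihood case. As a worked example, assume an identity transfer function and the loss $\mathcal{J} = \frac{(\mathrm{E}[a] - t)^2}{2\,\mathrm{V}[a]} + \frac{1}{2}\log(2\pi\,\mathrm{V}[a])$; this specific form is an assumption made for illustration. Then

$$\frac{\partial \mathcal{J}}{\partial \mathrm{V}[a]} = \frac{1}{2\,\mathrm{V}[a]} - \frac{(\mathrm{E}[a] - t)^2}{2\,\mathrm{V}[a]^2} = \frac{\mathrm{V}[a] - (\mathrm{E}[a] - t)^2}{2\,\mathrm{V}[a]^2}, \qquad \frac{\partial \mathcal{J}}{\partial \mathrm{E}[a]} = \frac{\mathrm{E}[a] - t}{\mathrm{V}[a]}.$$

The variance gradient is negative while the squared error $(\mathrm{E}[a] - t)^2$ exceeds the variance, so a unit that is far off is pushed to increase its variance. It vanishes when the two are equal and turns positive once the mean is close to the target, at which point the variance is pushed down again.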
Sampling
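An already mentioned alternative to the forward propagation of $\mathrm{E}[a]$ and $\mathrm{V}[a]$ through $f$ is to incarnate $\hat{a}$ via sampling and calculate $f(\hat{a})$. Let $s \sim \mathcal{N}(0, 1)$ and $\hat{a} = \mathrm{E}[a] + s \sqrt{\mathrm{V}[a]}$. We can then use $\hat{a}$ explicitly and it follows that

$$\frac{\partial \mathcal{J}}{\partial \hat{a}} \frac{\partial \hat{a}}{\partial w_i} = \frac{\partial \mathcal{J}}{\partial \hat{a}} \frac{\partial \hat{a}}{\partial \mathrm{E}[a]} \frac{\partial \mathrm{E}[a]}{\partial w_i} + \frac{\partial \mathcal{J}}{\partial \hat{a}} \frac{\partial \hat{a}}{\partial \mathrm{V}[a]} \frac{\partial \mathrm{V}[a]}{\partial w_i}.$$

Again, we recognize $\frac{\partial \mathcal{J}}{\partial \hat{a}} \frac{\partial \hat{a}}{\partial \mathrm{E}[a]} \frac{\partial \mathrm{E}[a]}{\partial w_i}$ as the standard backpropagation formula with dropout variables.
The variance term can be written as

$$\frac{\partial \mathcal{J}}{\partial \hat{a}} \frac{\partial \hat{a}}{\partial \sqrt{\mathrm{V}[a]}} \frac{\partial \sqrt{\mathrm{V}[a]}}{\partial w_i},$$

which, making use of results from earlier in the section, is equivalent to

$$\frac{\partial \mathcal{J}}{\partial \hat{a}}\, s\, \sqrt{p(1-p)\,\mathrm{E}[x]_i^2 + p\,\mathrm{V}[x]_i}. \tag{4}$$

The value of this is a zero-centred Gaussian random variable, since $s$ is Gaussian. The scale is independent of the current weight value and only determined by the postsynaptic moments of the incoming unit, the dropout rate and the error signal. We conclude that also in this case we can write

$$\frac{\partial \mathcal{J}}{\partial w_i} = \frac{\partial \mathcal{L}^a}{\partial w_i} + \frac{\partial \mathcal{R}^a_{draw}}{\partial w_i},$$

where $\partial \mathcal{R}^a_{draw} / \partial w_i$ is defined as in Equation (4) and is essentially an adaptive noise term.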
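The sampling route can be sketched as follows, assuming the presynaptic sample is drawn as mean plus standard normal noise times the standard deviation. The function names are hypothetical; the second function is the derivative of that sample with respect to the variance, which is where the zero-mean noise variable enters the gradient.

```python
import math
import random

def sample_presynaptic(mean, var, rng):
    """Draw a_hat = E[a] + s * sqrt(V[a]) with s ~ N(0, 1); also return s."""
    s = rng.gauss(0.0, 1.0)
    return mean + s * math.sqrt(var), s

def d_sample_d_var(var, s):
    """Derivative of a_hat with respect to V[a] for a fixed noise draw s."""
    return s / (2.0 * math.sqrt(var))

rng = random.Random(0)
mean, var = 1.16, 0.254
draws = [sample_presynaptic(mean, var, rng)[0] for _ in range(50000)]
emp_mean = sum(draws) / len(draws)
emp_var = sum((a - emp_mean) ** 2 for a in draws) / len(draws)
```

Because the noise draw has zero mean, the variance-related part of the gradient averages out to zero over many samples and acts as noise on individual updates.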
We want to stress the fact that in the approximation as well as in the sampling case the regularization term vanishes at any optimum of the training loss. A consequence of this is that no global attractor is formed, which makes the method theoretically useful for RNNs. One might argue that fast dropout should then not have a regularizing effect at all. Yet, regularization influences not only the final solution but also the optimization, leading to different optima.
Relationship to Weight Decay
As already mentioned, imposing a Gaussian distribution centred at the origin with precision $\lambda$ as a prior on the weights leads to a method called weight decay. It is not only probabilistically sound but also works well empirically, see e.g. Bishop et al. (2006). Recalling that the derivative of weight decay is of the form $\partial \mathcal{R}_{wd} / \partial w_i = 2 \lambda w_i$, we can reinterpret $\mathcal{R}^a_{approx}$ as a weight decay term where the coefficient is weight-wise, dependent on the current activations and possibly negative. Weight decay will always be slightly wrong on the training set, since the derivative of the weight decay term has to match the one of the unregularized loss. In order for $\mathcal{L} + \mathcal{R}_{wd}$ to be minimal, $\mathcal{L}$ cannot be minimal unless so is $\mathcal{R}_{wd}$.
Relationship to Adaptive Weight Noise
In this method (Graves, 2011) not the units but the weights are stochastic, which is in practice implemented by performing Monte Carlo sampling. We can use a similar technique to FD to find a closed-form approximation. In this case, a layer is $y = f(x^T w)$, where we have no dropout variables and the weights are Gaussian distributed with $w \sim \mathcal{N}(\mu_w, \sigma_w^2)$, with covariance diagonal and organized into a vector. We assume Gaussian density for $a = x^T w$. Using similar algebra as above, we find that

$$\mathrm{E}[a] = \mathrm{E}[x]^T \mu_w, \tag{5}$$
$$\mathrm{V}[a] = \mathrm{V}[x]^T \mu_w^2 + \mathrm{V}[x]^T \sigma_w^2 + (\mathrm{E}[x]^2)^T \sigma_w^2. \tag{6}$$
It is desirable to determine whether fast dropout and “fast adaptive weight noise” are special cases of each other. Showing that the approaches are different can be done by equating equations (1) and (5) and solving for $\mu_w$. This shows that rescaling $w$ by $p$ suffices in the case of the expectation. It is however not as simple as that for the variance, i.e. for equations (2) and (6), where the solution depends on $\mathrm{E}[x]$ and $\mathrm{V}[x]$ and thus is not independent of the input to the network. Yet, both methods share the property that no global attractor is present in the loss: the “prior” is part of the optimization and not fixed.
2.3 Bag of Tricks
Throughout the experiments we will resort to several “tricks” that have been introduced recently for more stable and efficient optimization of neural networks and RNNs especially. First, we make use of rmsprop (Tieleman and Hinton, 2012), an optimizer which divides the gradient by the square root of an exponential moving average of its squares. This approach is similar to Adagrad (Duchi et al., 2011), which instead accumulates the squared gradients over the whole history. In preliminary experiments we found that enhancing rmsprop with Nesterov’s accelerated gradient (Sutskever, 2013) greatly reduces the training time.
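One plausible way to combine rmsprop with Nesterov momentum is sketched below. The exact variant used in the experiments is not specified in the text, so the update order, the default values and the name `rmsprop_nesterov` are assumptions made for illustration only.

```python
import math

def rmsprop_nesterov(grad, theta, step_rate=0.01, decay=0.9, momentum=0.5,
                     n_iter=1000, eps=1e-8):
    """Minimize a function given its gradient, scaling each step by a running
    root-mean-square of the gradient and evaluating it at a look-ahead point."""
    velocity = [0.0] * len(theta)
    mean_sq = [0.0] * len(theta)
    for _ in range(n_iter):
        # Nesterov: evaluate the gradient after a provisional momentum step.
        lookahead = [t + momentum * v for t, v in zip(theta, velocity)]
        g = grad(lookahead)
        # Exponential moving average of the squared gradient.
        mean_sq = [decay * m + (1 - decay) * gi * gi for m, gi in zip(mean_sq, g)]
        velocity = [momentum * v - step_rate * gi / (math.sqrt(m) + eps)
                    for v, gi, m in zip(velocity, g, mean_sq)]
        theta = [t + v for t, v in zip(theta, velocity)]
    return theta

def quad_loss(theta):
    """Badly conditioned quadratic used as a toy objective."""
    return 0.5 * (10.0 * theta[0] ** 2 + theta[1] ** 2)

def quad_grad(theta):
    return [10.0 * theta[0], theta[1]]

theta_start = [1.0, -2.0]
theta_final = rmsprop_nesterov(quad_grad, theta_start)
```

Because each coordinate is normalized by its own gradient magnitude, both directions of the toy problem make progress at a similar pace despite the different curvatures.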
To initialize the RNNs to stable dynamics we followed the initialization protocol of Sutskever et al. (2013) of setting the spectral radius $\rho$ to a specific value and the maximum amount of incoming connections of a unit to $\nu$; we did not find it necessary to centre the inputs and outputs. The effect of not only using the recurrent weight matrix for propagating the states through time but also its element-wise square for advancing the variances can be quantified. The stability of a network is coupled to the spectral radius of the recurrent weight matrix $\rho(W_{rec})$; thus, the stability of forward propagating the variance is related to the spectral radius of its element-wise square $\rho(W_{rec}^2)$. Since $\rho(A \circ B) \le \rho(A)\rho(B)$ for non-negative matrices and non-singular matrices $A$ and $B$ (Horn and Johnson, 2012), setting $W_{rec}$ to full rank and its spectral radius to $\rho(W_{rec})$ assures $\rho(W_{rec}^2) = \rho(|W_{rec}|^2) \le \rho(|W_{rec}|)^2$, where $|\cdot|$ denotes taking the absolute value element-wise. We also use the gradient clipping method introduced in Pascanu et al. (2012), with a fixed threshold of 225.
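The bound on the spectral radius of the element-wise square rests on the fact that, for non-negative matrices, the spectral radius of an element-wise product is at most the product of the spectral radii. The sketch below checks this for a random matrix using power iteration, which is adequate here because both matrices involved have strictly positive entries. The function name and the matrix size are assumptions for illustration.

```python
import random

def power_iteration(A, n_iter=500):
    """Spectral radius of an entrywise positive matrix via power iteration."""
    v = [1.0] * len(A)
    rho = 0.0
    for _ in range(n_iter):
        Av = [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]
        rho = max(Av)              # with max(v) == 1 this tends to the Perron root
        v = [x / rho for x in Av]
    return rho

rng = random.Random(1)
n = 6
W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
W_abs = [[abs(w) for w in row] for row in W]   # element-wise absolute value
W_sq = [[w * w for w in row] for row in W]     # element-wise square

rho_abs = power_iteration(W_abs)
rho_sq = power_iteration(W_sq)
```

Rescaling the recurrent matrix so that the radius of its absolute value stays below one therefore also keeps the matrix that advances the variances contractive.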
Since the hidden-to-hidden connections and the hidden-to-output connections in an RNN can make use of hidden units in quite distinct ways, we found it beneficial to separate the dropout rates. Specifically, a hidden unit may have a different probability of being dropped out when feeding into the hidden layer at the next time step than when feeding into the output layer. Taking this one step further, we also consider networks in which we completely neglect fast dropout for the hidden-to-output connections; an ordinary forward pass is used instead. Note that this is not the same as setting the dropout rate to zero, since the variance of the incoming units is completely neglected. Whether this is done is treated as another hyperparameter for the experiment.
3 Experiments and Results
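Table 1: Test negative log-likelihood per data set and method.

| Data set | FD | plain | RNN-NADE | Deep RNN |
|---|---|---|---|---|
| Piano-midi.de | 7.39 | 7.58 | 7.05 | – |
| Nottingham | 3.09 | 3.43 | 2.31 | 2.95 |
| MuseData | 6.75 | 6.99 | 5.60 | 6.59 |
| JSBChorales | 8.01 | 8.58 | 5.19 | 7.92 |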
3.1 Musical Data
All experiments were done by performing a random search (Bergstra and Bengio, 2012) over the hyper parameters (see Table 2 in the Appendix for an overview), where 32 runs were performed for each data set. We report the test loss of the model with the lowest validation error over all training runs, using the same split as in Bengio et al. (2012). To improve speed, we organize sequences into minibatches by first splitting all sequences of the training and validation set into chunks of length 100. Zeros are prepended to those sequences which have fewer than 100 time steps. The test error is reported on the unsplit sequences.
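The chunking step can be sketched as follows. The helper name `chunk_sequence` is hypothetical, and the sketch assumes a non-empty sequence stored as a list of equally sized feature vectors, with zeros prepended to any chunk that ends up shorter than the target length.

```python
def chunk_sequence(seq, length=100):
    """Split a sequence of feature vectors into chunks of `length` time steps;
    chunks that come out shorter are left-padded with zero vectors."""
    dim = len(seq[0])
    chunks = [seq[i:i + length] for i in range(0, len(seq), length)]
    padded = []
    for chunk in chunks:
        n_missing = length - len(chunk)
        padded.append([[0.0] * dim for _ in range(n_missing)] + chunk)
    return padded

# A piano roll with 250 time steps and 88 notes yields three chunks.
seq = [[float(t)] * 88 for t in range(1, 251)]
chunks = chunk_sequence(seq)
```

Equal-length chunks can then be stacked into a single array per minibatch, which is what makes the batched forward and backward passes cheap.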
Training RNNs to generatively model polyphonic music is a valuable benchmark for RNNs due to its high dimensionality and the presence of long- as well as short-term dependencies. These data sets have been evaluated previously by Bengio et al. (2012), where the model achieving the best results, RNN-NADE (Boulanger-Lewandowski et al., 2013), makes specific assumptions about the data (i.e. binary observables). RNNs do not attach any assumptions to the inputs.
3.1.1 Setup
The data consists of four distinct data sets, namely Piano-midi.de (classical piano music), Nottingham (folk music), MuseData (orchestral) and JSBChorales (chorales by Johann Sebastian Bach). Each has a dimensionality of 88 per time step, organized into different piano rolls, which are sequences of binary vectors; each component of these vectors indicates whether a note is occurring at the given time step. We use the RNN’s output to model the sufficient statistics of a Bernoulli random variable, i.e.

$$p(x_{t,i} = 1 \mid x_{1:t-1}) = y_{t,i},$$

which describes the probability that note $i$ is present at time step $t$. The output non-linearity of the network is a sigmoid which projects the points to the interval $(0, 1)$. We perform learning by the minimization of the average negative log-likelihood (NLL); in this case, this is the average binary cross-entropy

$$\mathcal{L}(\theta) = -\frac{1}{N (T - 1)} \sum_{k} \sum_{t=2}^{T} \sum_{i} \left[ x^{(k)}_{t,i} \log y^{(k)}_{t,i} + \left(1 - x^{(k)}_{t,i}\right) \log \left(1 - y^{(k)}_{t,i}\right) \right],$$

where $k$ indexes the training sample, $i$ the component of the target and $t$ the time step.
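The negative log-likelihood of a single piano roll can be computed as in the sketch below. The normalization, a sum over the notes averaged over time steps, is an assumption made for this example, as is the function name `average_nll`.

```python
import math

def average_nll(targets, outputs):
    """Negative Bernoulli log-likelihood, summed over notes and averaged over
    time steps. `targets` holds binary vectors, `outputs` probabilities in (0, 1)."""
    total = 0.0
    for x_t, y_t in zip(targets, outputs):
        for x, y in zip(x_t, y_t):
            total -= x * math.log(y) + (1 - x) * math.log(1 - y)
    return total / len(targets)

# Three time steps, 88 notes, every eighth note active.
targets = [[1.0 if i % 8 == 0 else 0.0 for i in range(88)] for _ in range(3)]
uninformed = [[0.5] * 88 for _ in range(3)]
confident = [[0.9 if x == 1.0 else 0.1 for x in x_t] for x_t in targets]

nll_uninformed = average_nll(targets, uninformed)
nll_confident = average_nll(targets, confident)
```

A predictor that outputs 0.5 for every note scores 88 times the natural logarithm of 2 per time step under this normalization, which gives a simple reference point when reading such numbers.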
3.1.2 Results
Although a common metric for evaluating the performance on such benchmarks is accuracy (Bay et al., 2009), we restrict ourselves to the NLL, since accuracy is not what is optimized and the NLL is merely a proxy for it. We present the results of FD-RNNs compared with the various other methods in Table 1. Our method is only surpassed by methods which either incorporate more specific assumptions about the data or employ various forms of depth (Boulanger-Lewandowski et al., 2013; Pascanu et al., 2013). We want to stress once more that we performed only 32 runs for each data set. This shows the relative ease of obtaining good results despite the huge space of potential hyper parameters.
One additional observation is the range of the eigenvalues of the recurrent weight matrix during training. We performed an additional experiment on JSBChorales where we inspected the eigenvalues and the test loss. We found that the spectral radius first increases sharply to a rather high value and then decreases slowly to settle to a specific value. We tried to replicate this behaviour in plain RNNs, but found that RNNs never exceeded a certain spectral radius at which they stuck. This stands in line with the observation from Section 2.2.2 that weights are encouraged to grow when the error is high and shrink during convergence to the optimum. See Figure 2 for a plot of the spectral radius over the stages of the training process.
4 Conclusion
We have contributed to the field of neural networks in two ways. First, we have analysed a fast approximation of the dropout regularization method by bringing its derivative into the same form as that of a loss regularized with an additive term. We have used this form to gain further insights into the behaviour of fast dropout for neural networks in general and shown that this objective function does not bias the solutions towards those which perform suboptimally on the unregularized loss. Second, we have hypothesized that this is beneficial especially for RNNs. We confirmed this hypothesis by conducting quantitative experiments on an already established benchmark used in the context of learning recurrent networks.
References
 Bay et al. (2009) Bay, M., Ehmann, A. F., and Downie, J. S. (2009). Evaluation of multiple-F0 estimation and tracking systems. In ISMIR, pages 315–320.
 Bayer et al. (2013) Bayer, J., Osendorfer, C., Urban, S., et al. (2013). Training neural networks with implicit variance. In Proceedings of the 20th International Conference on Neural Information Processing , ICONIP2013.
 Bengio et al. (2012) Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. (2012). Advances in optimizing recurrent networks. arXiv preprint arXiv:1212.0901.
 Bengio et al. (1994) Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.
 Bergstra and Bengio (2012) Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13:281–305.
 Bergstra et al. (2010) Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
 Bishop (1995) Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford university press.
 Bishop et al. (2006) Bishop, C. M. et al. (2006). Pattern recognition and machine learning, volume 1. springer New York.
 Boulanger-Lewandowski et al. (2013) Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). High-dimensional sequence transduction. In ICASSP.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.
 Graves (2011) Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356.
 Graves (2013) Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
 Graves et al. (2008) Graves, A., Fernández, S., Liwicki, M., Bunke, H., and Schmidhuber, J. (2008). Unconstrained online handwriting recognition with recurrent neural networks. Advances in Neural Information Processing Systems, 20:1–8.
 Graves et al. (2013) Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. arXiv preprint arXiv:1303.5778.
 Hammer (2000) Hammer, B. (2000). On the approximation capability of recurrent neural networks. Neurocomputing, 31(1):107–123.
 Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
 Hochreiter (1991) Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Master’s thesis, Institut für Informatik, Technische Universität, München.
 Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
 Horn and Johnson (2012) Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge university press.
 Jäger et al. (2003) Jäger, H. et al. (2003). Adaptive nonlinear system identification with echo state networks. networks, 8:9.
 Julier and Uhlmann (1997) Julier, S. J. and Uhlmann, J. K. (1997). New extension of the kalman filter to nonlinear systems. In AeroSense’97, pages 182–193. International Society for Optics and Photonics.
 Lehmann (1999) Lehmann, E. L. (1999). Elements of large-sample theory. Springer Verlag.
 Martens and Sutskever (2011) Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. Proc. 28th Int. Conf. on Machine Learning.
 Mikolov et al. (2010) Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. Proceedings of Interspeech.
 Pachitariu and Sahani (2013) Pachitariu, M. and Sahani, M. (2013). Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650.
 Pascanu et al. (2013) Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2013). How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
 Pascanu et al. (2012) Pascanu, R., Mikolov, T., and Bengio, Y. (2012). On the difficulty of training recurrent neural networks. Technical report.
 Rifai et al. (2011) Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., and Muller, X. (2011). The manifold tangent classifier. In Advances in Neural Information Processing Systems, pages 2294–2302.
 Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by backpropagating errors. Nature, 323(6088):533–536.
 Sutskever (2013) Sutskever, I. (2013). Training Recurrent Neural Networks. PhD thesis, University of Toronto.
 Sutskever et al. (2013) Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning.
 Sutskever et al. (2011) Sutskever, I., Martens, J., and Hinton, G. (2011). Generating text with recurrent neural networks. Proceedings of the 2011 International Conference on Machine Learning (ICML2011).
 Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5  rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
 Wager et al. (2013) Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. arXiv preprint arXiv:1307.1493.
 Wang and Manning (2013) Wang, S. and Manning, C. (2013). Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML13), pages 118–126.
 Zeiler et al. (2013) Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., et al. (2013). On rectified linear units for speech processing. ICASSP.
Appendix A
A.1 Hyper Parameters for Musical Data Experiments
We show the hyper parameter ranges for the musical data in Table 2. The ones from which the numbers in Table 1 resulted are given in Table 3.
Hyper parameter  Choices 

#hidden layers  
#hidden units  
Transfer function  tanh 
Use fast dropout for final layer  yes, no 
Step rate  
Momentum  
Decay  
or no 
Hyper parameter  Pianomidi.de  Nottingham  MuseData  JSBChorales 

#hidden units  
–  
Use fast dropout for final layer  no  yes  yes  yes 
Step rate  
Momentum  
Decay  
for  
for  


no  no 