Regularizing Recurrent Networks
Abstract
Advances in parallel processing have led to a surge in applications of multilayer perceptrons (MLPs) and deep learning over the past decades. Recurrent Neural Networks (RNNs) add representational power to feedforward MLPs by providing a way to treat sequential data. However, RNNs are hard to train with conventional error backpropagation because of the difficulty of relating inputs over many timesteps. Regularization approaches from the MLP sphere, like dropout and noisy weight training, have been insufficiently applied and tested on simple RNNs. Moreover, solutions have been proposed to improve convergence in RNNs, but not enough has been done to improve their ability to remember long-term dependencies.
In this study, we aim to empirically evaluate the remembering and generalization ability of RNNs on polyphonic musical datasets. The models are trained with injected noise, random dropout and norm-based regularizers, and their respective performances are compared to well-initialized plain RNNs and advanced regularization methods like fast-dropout. We conclude with evidence that training with noise does not improve performance, as conjectured by a few earlier works in RNN optimization.
1 Introduction
Recurrent Neural Networks are variations of multilayer-perceptron-based function approximators that are used for prediction on time-series data. Such data may be text in various languages, a musical sequence, a video, or a trend analysis in the financial domain. As with MLPs, the most popular training techniques are all based on some form of backpropagation of error gradients (rumelhart1988learning). To train an RNN, the gradients are backpropagated through time, on a time-unfolded representation of the network.
When such a time-series network is trained by traditional backpropagation of error gradients, it suffers from one of two peculiar analytical problems: exploding gradients or vanishing gradients. When the error gradients are backpropagated through what is essentially a set of identical weight matrices, the gradients may shrink (vanishing gradients) or grow (exploding gradients) exponentially fast, until they become insignificant for training purposes or lead to instability. Conceptually, the problem of vanishing gradients exists in any deep neural network that relies on propagating its error downwards to train the weights. This issue is particularly harmful for RNNs because it damages the capability of a network to learn properties of the problem that are long-term dependent. In simple terms, because of its inherently sequential nature, a recurrent network needs to store not only the state representation of the input at time $t$, but also of those seen at earlier times $t - \tau$. In the presence of vanishing gradients, this becomes intractable for delays $\tau$ exceeding a few dozen timesteps.
Due to the unstable behaviour of RNNs in dynamic space, they were not studied extensively until sophisticated second-order optimization methods were introduced for feedforward neural networks (martens2010deep) and then extended to RNNs. Also groundbreaking have been structural solutions like Long Short-Term Memory (LSTM) (hochreiter1997long), which established state-of-the-art results on text prediction tasks, pathological tasks and the like.
To date, there have been no empirical studies of claims such as the one made in pascanu2012difficulty that regularizing the recurrent weights by restricting their growth will fail to prevent vanishing gradients. Nor have the standard regularization-for-overfitting techniques from MLP training been evaluated on RNNs with respect to remembering long-term dependencies. In this study, we aim to evaluate the effect of norm-based regularization methods, artificial noise injection and dropout in weights before propagating derivatives, both on the ability of the network to remember long-term dependencies and on convergence.
2 Related Work
bengio2013advances present an experimental study that discusses the latest optimization trends in RNNs, including gradient clipping, second-order optimization methods like Hessian-free, leaky integration units (LSTMs are also discussed as a part of this), momentum tricks in stochastic gradient descent (SGD), powerful output probability models based on deterministic variations of Restricted Boltzmann Machines, and sparse gradients as a regularization trick. The evaluations presented in that paper are on the same music datasets that we use in our study, in addition to the Penn Treebank Corpus of text data.
maas2012recurrent describe deep recurrent networks that consist of denoising autoencoders (vincent2008extracting) at each timestep, to extract rich features out of audio signals by learning time-series representations from deliberately noised input. The noise itself is not modelled by the autoencoder, which is the key idea behind learning a denoised input representation.
RNNs are typically described as a set of three transition functions, viz. input-to-hidden, hidden-to-hidden and hidden-to-output. pascanu2013construct delve into the matter of “depth” in RNNs by describing and evaluating the workings of an RNN when one or more of these three transitions are made deeper than a single layer.
The study by hochreiter1997long is a solution to the long-term dependency problem in RNNs. In it, the authors propose a structural variation of a conventional RNN where, by adding additional short-term memory units that fire randomly, the long time-delay remembering capability of an RNN increases significantly. graves2013generating extended the study of LSTMs by applying the idea to generate complex sequences of words in a text corpus, and handwriting patterns learned from real-valued positional information in calligraphy. zaremba2014recurrent improved generalization in LSTMs by applying Dropout (hinton2012improving) only to the non-recurrent connections.
murray1994enhanced present an analysis of noisy MLP training models, where the cost function is appended with a noise term to improve trajectory of the training curve, generalization of the network and increase fault tolerance from data. The results were shown to be particularly useful in the field of VLSI network design.
The study of jim1996analysis attempts to extend the noisy gradient descent model from feedforward networks to RNNs. The authors focus on convergence of RNNs, rather than the long term dependency problem. The noisy update model is applied to automata solving problems, which typically do not have pathologically long sequences that need to be remembered at arbitrary time delays.
In an analysis by schaefer2008learning, the authors claim that the widely discussed problem of long-term dependency identification in RNNs does not really exist. This claim is supported by working a pathological sequence task through an RNN and demonstrating its performance on increasing time delays between the relevant input and output values. However, this study does not present results on the standard audio, video or text corpus data that are used in other pertinent RNN publications.
3 Formulation of RNNs
RNNs are semantically applicable to tasks that are based on temporal consistency. Besides being universal function approximators, MLPs can be viewed as learning orthogonal representations of the input features. RNNs exploit this representation technique by duplicating the hidden layers of an MLP across timesteps and fully connecting consecutive hidden layers in time. We therefore get an unfolded representation of an RNN in time, as shown in Fig. 1.
We, hence, define an RNN as
$$y_t = f_o(W_{out} h_t + b_o) \tag{1}$$
$$h_t = f_h(W_{rec} h_{t-1} + W_{in} x_t + b_h) \tag{2}$$
At timestep $t$, $x_t$ is the input, $h_t$ is the activation of the hidden layer and $y_t$ is the output of the network. The complete parameter set of the model is given by the input-to-hidden weights, $W_{in}$, hidden-to-hidden weights, $W_{rec}$, hidden-to-output weights, $W_{out}$, hidden layer bias, $b_h$, and output layer bias, $b_o$. $f_o$ and $f_h$ are the nonlinear activation functions at the output and hidden layers, respectively.
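To make the definition concrete, the following is a minimal NumPy sketch of the forward pass, not the authors' implementation. It assumes a tanh hidden nonlinearity and a sigmoid output (the choices reported later for the experiments), a zero initial hidden state, and illustrative names (`rnn_forward`, `W_in`, `W_rec`, `W_out`, `b_h`, `b_o`) and layer sizes that are ours.

```python
import numpy as np

def rnn_forward(x_seq, W_in, W_rec, W_out, b_h, b_o):
    """Run a plain RNN over x_seq of shape (T, n_in)."""
    h = np.zeros(W_rec.shape[0])  # assumption: zero initial hidden state
    hs, ys = [], []
    for x_t in x_seq:
        # hidden transition: recurrent term + input term + bias
        h = np.tanh(W_rec @ h + W_in @ x_t + b_h)
        # sigmoid output layer
        y = 1.0 / (1.0 + np.exp(-(W_out @ h + b_o)))
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 8, 4, 5  # illustrative sizes only
params = dict(
    W_in=rng.normal(0.0, 0.1, (n_hid, n_in)),
    W_rec=rng.normal(0.0, 0.1, (n_hid, n_hid)),
    W_out=rng.normal(0.0, 0.1, (n_out, n_hid)),
    b_h=np.zeros(n_hid),
    b_o=np.zeros(n_out),
)
hs, ys = rnn_forward(rng.normal(size=(T, n_in)), **params)
```

The same weight matrices are reused at every timestep, which is what makes the unfolded network "a set of identical weights" during backpropagation through time.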
4 Exploding Gradients and Effect on Longterm Dependencies
bengio1994learning and pascanu2012difficulty explain the dynamics of the weight training using backpropagation through time in RNNs.
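Consider the error function, $E$, applied on the outputs of the RNN. Calculating the error gradient
$$\frac{\partial E}{\partial \theta} = \sum_{1 \le t \le T} \frac{\partial E_t}{\partial \theta} \tag{3}$$
$$\frac{\partial E_t}{\partial \theta} = \sum_{1 \le k \le t} \frac{\partial E_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial^{+} h_k}{\partial \theta} \tag{4}$$
where $\theta$ is the concatenated matrix of $W_{in}$, $W_{rec}$, $W_{out}$, $b_h$ and $b_o$, $E_t$ is the error at timestep $t$, and $\partial^{+} h_k / \partial \theta$ is the immediate partial derivative of $h_k$ with respect to $\theta$ (with $h_{k-1}$ held constant).
It is clear from Eq. (4) that the derivative of the loss function at every timestep, $t$, is affected by the activations at timesteps $k \le t$.
Furthermore, consider the term $\partial h_t / \partial h_k$ on the right hand side of Eq. (4)
$$\frac{\partial h_t}{\partial h_k} = \prod_{t \ge i > k} \frac{\partial h_i}{\partial h_{i-1}} \tag{5}$$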
The successive multiplication of the real-valued derivatives at timesteps $k+1, \dots, t$ for all indices $k$ in Eq. (4) may lead to the norm of the product growing very large or vanishing to zero, exponentially fast in time. This is harmful for storing long-term dependencies because, by the time the error gradient at $t$ has been propagated back to $k$, the explosion or vanishing of the norm may have made the training regime unsuitable for any meaningful updates.
This compounding of the error gradient can happen in one of two opposite directions, both depending on the largest absolute eigenvalue (spectral radius), $\rho$, of the recurrent weight matrix. If the spectral radius is much less than 1, the gradient might vanish over time (if using a sigmoid-like nonlinearity). On the other hand, if the spectral radius is bigger than 1, the gradient might explode over time.
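The dependence on the spectral radius can be illustrated numerically. The sketch below is a deliberate simplification: it assumes a linear recurrence, so each per-step Jacobian is just the recurrent matrix and the Jacobian product reduces to a matrix power. The helper names, matrix size, number of steps and radii are our own choices.

```python
import numpy as np

def rescale_to_radius(W, rho):
    """Scale W so that its largest absolute eigenvalue equals rho."""
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho / radius)

def jacobian_product_norm(W_rec, steps):
    """Spectral norm of the product of `steps` identical Jacobians (linear case)."""
    J = np.eye(W_rec.shape[0])
    for _ in range(steps):
        J = W_rec @ J
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 20))
norm_small = jacobian_product_norm(rescale_to_radius(W, 0.5), 50)  # shrinks towards zero
norm_large = jacobian_product_norm(rescale_to_radius(W, 1.5), 50)  # grows very large
```

With a saturating nonlinearity the per-step Jacobian also contains the derivative of the activation, which can only shrink the product further; this is why a radius above 1 is necessary but not sufficient for explosion.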
5 Demonstration with a Simple Regime
Let us demonstrate the delicate nature of training a recurrent weight matrix, using an oversimplified architecture (a more expansive explanation, also from a dynamical systems perspective, can be found in pascanu2012understanding).
In Eq. 2, assume that there is no new input coming at every timestep, so that the second term with the input $x_t$ becomes unnecessary. Furthermore, assume that the hidden state $h_t$ is a one-dimensional variable, which means that the recurrent weight $W_{rec}$ and the hidden bias $b_h$ have dimensions [1, 1] and [1] respectively.
Our objective, then, is to start with a zero value and reach a given target value in a set number of timesteps. Fig. 2 shows the training graph over 10000 different initialization sets of $W_{rec}$ and $b_h$. On the third axis, $E$ represents the squared loss of the model.
The steep wall perpendicular to the parameter space represents an explosion in gradients of the loss function. When the largest eigenvalue in the parameter matrix explodes, the curvature of the error surface compounds too, which is what the wall illustrates.
Most noteworthy is that when the search routine is at a point on the top surface of the error curve, it takes its next step in a direction perpendicular to the face of the wall. Depending on the learning rate, it might then land on the ground beyond the valley where the error reaches its minimum. This is not a big problem in itself, because the search, left to explore the ground region, must eventually come back to the valley. Note, however, that this holds only until the search direction collides with the wall again, at which point a small change in the norm of the update takes the search back to the top of the hill, and the entire search process repeats.
The key, then, is to have a method that would smoothen the minima valley and decrease the slope of the steep wall so as to allow optimization to move in a less arbitrary fashion given a sufficiently small learning rate. A more acceptable routine may look like the one shown in Fig. 3.
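The simplified regime of this section can be reproduced with a few lines of NumPy. This is a sketch under our own assumptions: a tanh nonlinearity, a target of 0.7, 50 timesteps and a 100 x 100 grid (10000 parameter pairs) over [-3, 3]. None of these values are stated in the text.

```python
import numpy as np

def loss_surface(ws, bs, target=0.7, steps=50):
    """Squared loss of a 1-D, input-free RNN for every (w, b) pair on a grid."""
    W, B = np.meshgrid(ws, bs)
    H = np.zeros_like(W)              # start from a zero hidden state
    for _ in range(steps):
        H = np.tanh(W * H + B)        # scalar recurrence, evaluated on the whole grid
    return (H - target) ** 2

ws = np.linspace(-3.0, 3.0, 100)      # candidate recurrent weights
bs = np.linspace(-3.0, 3.0, 100)      # candidate biases
loss = loss_surface(ws, bs)
```

Plotting `loss` over the grid gives a surface on which abrupt changes in loss between neighbouring parameter values can be inspected directly.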
6 Existing Solutions
6.1 Initialization and Momentum Tricks
Momentum (polyak1964some) with SGD method has the added advantage of preserving the directions of consistent change over multiple updates. The persistent change in directions can be thought of as the dominant velocity in which the update moves during the optimization process. sutskever2013importance describe Gradient descent with momentum as
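$$v_{t+1} = \mu v_t - \epsilon \nabla E(\theta_t) \tag{6}$$
$$\theta_{t+1} = \theta_t + v_{t+1} \tag{7}$$
where $\theta_t$ is the weight matrix after $t$ updates, $v_t$ is the update value, $\mu$ is the momentum, $\epsilon$ is the step rate of learning and $\nabla E(\theta_t)$ is the partial derivative of the error function w.r.t. the parameter $\theta_t$.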
nesterov1983method introduced Nesterov Accelerated Gradient (NAG) method for effective velocity preservation in optimization process. In the manner of classical momentum, NAG can be formalized as
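$$v_{t+1} = \mu v_t - \epsilon \nabla E(\theta_t + \mu v_t) \tag{8}$$
$$\theta_{t+1} = \theta_t + v_{t+1} \tag{9}$$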
The small, but key, difference between classical momentum method and NAG is that in the latter, first a partial update to the parameters is done using the last update value, and then the gradient calculation is done for the next update.
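The difference between the two update rules is easiest to see side by side. The sketch below uses hypothetical helper names (`momentum_step`, `nag_step`) and an arbitrary gradient callback `grad_fn`; the default momentum and step rate are illustrative.

```python
def momentum_step(theta, v, grad_fn, mu=0.9, eps=0.01):
    """Classical momentum: the gradient is evaluated at the current parameters."""
    v_new = mu * v - eps * grad_fn(theta)
    return theta + v_new, v_new

def nag_step(theta, v, grad_fn, mu=0.9, eps=0.01):
    """NAG: the gradient is evaluated after a partial update with the last velocity."""
    v_new = mu * v - eps * grad_fn(theta + mu * v)
    return theta + v_new, v_new
```

For example, on the quadratic $E(\theta) = \theta^2 / 2$ (so that `grad_fn = lambda th: th`), starting from $\theta = 1$ and zero velocity, the two methods agree on the first step and differ from the second step on, because only NAG looks ahead along the accumulated velocity.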
The second trick presented by sutskever2013importance relates to the random initialization of the hidden-to-hidden and input-to-hidden weight matrices. The sparsifying technique presented there is inspired by martens2010deep: all but 15 (or some other fixed number) of the incoming connection weights of each unit are set to zero, and the rest are sampled from a Gaussian distribution. The reasoning behind this weight setting is that a sparse connection matrix helps to diversify the incoming connections from a lower layer.
As a second initialization step, the spectral radius is kept close to 1, so as to decrease the possibility of the gradients exploding or vanishing over a long time delay when using a sigmoid transfer function.
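Both initialization steps can be sketched as follows. The function names, the default standard deviation and the convention that row `i` holds the incoming weights of unit `i` are assumptions made for illustration.

```python
import numpy as np

def sparse_init(n_rows, n_cols, n_nonzero=15, std=0.01, rng=None):
    """Keep n_nonzero Gaussian incoming weights per unit (row); zero the rest."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        idx = rng.choice(n_cols, size=min(n_nonzero, n_cols), replace=False)
        W[i, idx] = rng.normal(0.0, std, size=idx.size)
    return W

def rescale_spectral_radius(W, rho=1.0):
    """Rescale a square matrix so its largest absolute eigenvalue equals rho."""
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho / radius)

W_rec = rescale_spectral_radius(
    sparse_init(100, 100, n_nonzero=15, rng=np.random.default_rng(0)), rho=1.1
)
```

Rescaling multiplies every entry by the same factor, so the sparsity pattern produced by the first step is preserved.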
6.2 Echo-State Networks
It has been argued by jaeger2004harnessing that a random draw from a predetermined distribution can be used to set the input-to-hidden and hidden-to-hidden connection weights, instead of learning them iteratively. This method, however, is not applied to the hidden-to-output layer connections, which are trained using closed form solutions that involve calculating the pseudoinverse of a Hessian matrix.
A completely random draw without controlling the distribution parameters might be harmful for setting such weights, though. For instance, if the spectral radius of the hidden-to-hidden weight matrix is much higher or lower than 1, there is a clear possibility that the long-term dependency effects either become intractable or vanish, respectively, over time. Hence, we follow the general rule that the spectral radius of the hidden-to-hidden weight matrix is restricted to be close to 1 (1.1, 0.9 etc.) and the input-to-hidden weights are drawn with a small standard deviation of about 0.001.
6.3 Hessian-Free Optimization
martens2010deep propose a second-order Hessian-Free (HF) optimization method, inspired by Newton’s method, to train deep neural networks with random initializations. The HF method obviates the need for pre-training in deep models, which was previously thought to be the most promising way of starting the optimization process, due to the presence of deep pathologies (hinton2006fast; hinton2006reducing).
With respect to the objective function, $f$, HF concerns itself with optimizing a simpler sub-objective of $f$ by finding local approximations to it. This is done as follows: for a parameter update from $\theta_n$ to $\theta_{n+1}$, it optimizes a sub-objective function
$$Q_{\theta_n}(\theta) = M_{\theta_n}(\theta) + \lambda R_{\theta_n}(\theta) \tag{10}$$
The term $M_{\theta_n}(\theta)$ represents a quadratic approximation to $f$. Normally, $M_{\theta_n}$ is chosen to be the Taylor-series expansion of $f$ up to second-order terms. This is the same expansion that is used in Newton’s optimization method, with the key difference that there are no additional assumptions like a low-rank matrix. This would typically make the optimization harder, since it would involve the inversion of a large matrix. What differentiates HF from other second-order optimization methods is that $M_{\theta_n}$ is only partially optimized, using the conjugate gradient method instead of gradient descent.
The term $R_{\theta_n}(\theta)$ is a regularization function, weighted by $\lambda$, that penalizes the solution as it moves farther away from $\theta_n$ (this modification to the HF method of martens2010deep was proposed by sutskever2013training).
6.4 Long Short-Term Memory (LSTM)
While not strictly a solution to the exploding/vanishing gradients problem, LSTMs (hochreiter1997long) have been shown empirically (graves2013generating) to have state-of-the-art performance on sequence generation and long-range time-series prediction tasks. LSTM alleviates the temporal dependency preservation problem of plain RNNs by structurally modifying the naive neural nodes of the RNN model to produce a more complex LSTM memory cell.
An LSTM cell consists of the following novel links, as in Fig. 4, in addition to the conventional hidden units:

Input gate, to control the inflow of an input vector into the hidden state. Takes a value in $[0, 1]$.

Output gate, to control the outflow of a hidden state activation to the next layer of the LSTM-RNN. Takes a value in $[0, 1]$.

Forget gate, to control the value retention of a memory cell. This link uses the input vector and the hidden activation value to determine whether the activation is fed back to the unit for retention over longer time sequences. Takes a value in $[0, 1]$.
The original LSTM by hochreiter1997long uses SGD for training, but it suffers from the exploding gradient problem. In order to solve that, the solution of graves2013generating uses gradient clipping technique to limit the norm of the gradients and hence stop them from growing too large with time. Even so, the structural complexity of LSTM memory units makes it difficult to implement and harder to train on most systems that do not allow calculation of arbitrary gradients.
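Limiting the norm of the gradient, as described above, can be sketched in a few lines. This shows the generic rescale-by-norm variant; the function name and the threshold are illustrative, and it is not claimed to be the exact clipping rule used in graves2013generating.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)  # keep the direction, shrink the length
    return grad

clipped = clip_by_norm(np.array([3.0, 4.0]), max_norm=1.0)     # norm 5 is scaled to norm 1
unchanged = clip_by_norm(np.array([0.3, 0.4]), max_norm=1.0)   # norm 0.5 is left as is
```

Clipping preserves the direction of the update and only bounds its size, which is why it addresses exploding but not vanishing gradients.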
6.5 Fast-Dropout RNNs
wang2013fast suggest an approximation for dropout (hinton2012improving) in deep neural networks. The suggestion is to treat every neuron as a random variable whose incoming connections are randomly set to zero with a probability of $p$. It is safe to assume that such a random variable tends to be Gaussian for a sufficiently large number (approximately 10 or more) of incoming connections. The resulting models had training times that were orders of magnitude better than those of a naive dropout approach, and the test results matched, and were sometimes better than, those of plain MLPs.
bayer2013fast verified the validity of the fast-dropout approach on RNNs. This was done by concatenating the input-to-hidden and hidden-to-hidden weights into a single array, and applying the same approximation to the incoming connections as in wang2013fast. Fast-dropout applied to RNNs works as a regularizer because the Gaussian approximation of the dropout term leads to a local derivative of the random variable representation of the node that acts as an additive regularization term.
Fast-dropout, when applied with the initialization tricks of Sec. 6.1 on standard music datasets, produces state-of-the-art results.
7 Norm-based Regularizers
The first method of regularization in RNNs that we evaluate is Tikhonov regularization (bishop1991improving) on the input-to-hidden, recurrent and hidden-to-output weight matrices. It has been claimed in previous RNN-related works (pascanu2012difficulty) that L1 and L2 penalties on the weight matrix, when added to the cost function of the estimator, may work against improving the long-term dependency remembrance of the network and only partially alleviate the exploding gradients problem.
Using the same example for demonstration as in Sec. 5, we illustrate the effect of L1 and L2 regularizers on the training regime of a timeseries network.
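For reference, the penalty added to the cost can be sketched as below. The function name, the coefficient name `lam` and the choice to sum the penalty over all matrices with a single shared coefficient are assumptions for illustration.

```python
import numpy as np

def norm_penalty(weights, lam, kind="L2"):
    """L1 or (squared) L2 penalty summed over a list of weight matrices."""
    if kind == "L1":
        return lam * sum(np.abs(W).sum() for W in weights)
    if kind == "L2":
        return lam * sum((W ** 2).sum() for W in weights)
    raise ValueError(f"unknown regularizer: {kind}")

# regularized cost = data loss + norm_penalty([W_in, W_rec, W_out], lam, kind)
```

Both penalties pull the weights towards zero, which in turn tends to pull the spectral radius of the recurrent matrix down.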
8 Stochastic Noise Injection
Noise injection is used as a regularization method in feedforward-only neural networks (bremermann1991brain, flower1993summed, jabri1992weight) to improve generalization. The motive behind adding stochastic noise of different natures to the synaptic weights is to improve fault tolerance in the input and to handle unseen data gracefully during prediction.
Adding noise to the weights during optimization works as a regularizer by, essentially, converting the state-space search into a search in a coarser region of the weight space than it would have been without the additional noise. This property of noisy training has been exploited for training the recurrent weights in RNNs too. By adjusting the weight space to a grainier region, we are promised not only faster convergence but also a cure for the exploding gradients problem. A detailed analysis of Gaussian noise injection in the recurrent weight matrix and its behaviour as a regularizer is given in Appendix A.
In RNNs, the work of jim1996analysis demonstrates the application of stochastic noise to the recurrent layers, in much the same way as for feedforward MLPs. In the following subsections, we use the additive and multiplicative noise models of jim1996analysis to evaluate the performance of a recurrent network in terms of preserving long-term dependencies in musical chord sequences. Our analysis of the noisy recurrent weight training model is followed by a noisy input-to-hidden weight model.
8.1 Noise in Recurrent Weights
The first type of noise injection we analyze is in the recurrent weight matrix. In all the analyzed noisy training methods, we restrict ourselves to non-cumulative noise models. In non-cumulative noise methods, the intensity of noise injected at each timestep, $t$, is independent of the amount of noise injected at $t-1$. As we saw earlier, backpropagation-through-time in RNNs trains essentially the same set of weights across timesteps and, hence, we postulate that cumulatively increasing the noise intensity over time might decrease the convergence performance of the network.
Other than the cumulative nature of the recurrent weight noise, there are two main considerations for deciding the nature of the noise that must be injected at each recurrent layer:

Should the same noise vector be inserted at every timestep in the unrolled representation of the network (per-sequence noise) or a different noise vector be sampled for every timestep (per-timestep noise)?

Should the noise be a multiplicative factor of the state of the weight vector (multiplicative noise) or simply an additive noise vector sampled from a given distribution (additive noise)?
Additive Noise
Additive noise in recurrent weights at timestep, $t$, is given by
$$\tilde{W}_{rec}^{(t)} = W_{rec} + \Delta_t \tag{11}$$
$\tilde{W}_{rec}^{(t)}$ is the modified version of the recurrent weight matrix $W_{rec}$ after adding the noise term. The noise vector, $\Delta_t$, is chosen from a zero-mean normal distribution with standard deviation $\sigma$, i.e. $\Delta_t \sim \mathcal{N}(0, \sigma^2)$.
In the per-timestep recurrent noise model, we sample a new noise vector, $\Delta_t$, for every timestep in the unrolled representation, for every iteration of weight update in the optimization process. In the per-sequence recurrent noise model, we sample a new noise vector, $\Delta$, for every iteration in the optimization process and add the same noise to each timestep in the network.
Multiplicative Noise
Multiplicative noise in recurrent weights, analogously, is given by
$$\tilde{W}_{rec}^{(t)} = W_{rec} \odot (1 + \Delta_t) \tag{12}$$
The nature of $\Delta_t$ is the same as before, and $\odot$ denotes the element-wise product.
As with additive noise, multiplicative noise is also evaluated on the two variants of per-timestep and per-sequence noise models.
In both the additive and multiplicative noise models, the perturbation of the weight matrix is done only during the optimization period, and not during forward propagation. During weight training, the original values of the weight matrices are preserved even as noise is added for the gradient calculation in backpropagation-through-time.
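The four noise variants (additive or multiplicative, per-timestep or per-sequence) can be sketched with one helper. The function name and signature are hypothetical, and the multiplicative form `W * (1 + noise)` is our assumption, chosen so that the perturbed weights keep the clean weights as their mean.

```python
import numpy as np

def noisy_recurrent_weights(W_rec, n_steps, sigma, mode="additive",
                            per_timestep=True, rng=None):
    """Return an array of shape (n_steps, *W_rec.shape) with one perturbed copy per timestep."""
    rng = np.random.default_rng() if rng is None else rng
    n_draws = n_steps if per_timestep else 1
    noise = rng.normal(0.0, sigma, size=(n_draws,) + W_rec.shape)
    if not per_timestep:
        # per-sequence: reuse the single draw at every timestep
        noise = np.repeat(noise, n_steps, axis=0)
    if mode == "additive":
        return W_rec + noise
    if mode == "multiplicative":
        return W_rec * (1.0 + noise)
    raise ValueError(f"unknown noise mode: {mode}")
```

The helper returns new arrays and never modifies `W_rec` in place, mirroring the statement that the original weights are preserved while noise is used only for the gradient calculation.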
8.2 Noise in Feedforward layers
As with noise in the recurrent weight matrix, we would like to close the loop on experimentation by applying the noisy weights training on the feedforward connections too.
During training of feedforward connections with backpropagation of gradients, we use the following weight formulae for noisy weights
(13)  
(14) 
We only work with the per-timestep noise model for feedforward layers.
9 Dropout as a Regularizer
Random dropout in MLP connections is used as a generalization technique (hinton2012improving) that works by preventing co-adaptation of multiple features in the training set. A variation of dropout in the activation units is DropConnect (wan2013regularization), where random elements of the weight matrix are dropped instead.
We use the DropConnect model on the recurrent weight matrix to try to improve the long-term dependency preserving tendency of our network. As with stochastic noise injection, dropout in recurrent weights can be applied in two different ways:

A possibly unique set of weights is dropped out at every timestep (per-timestep dropout).

The same set of weights is dropped out at every timestep (per-sequence dropout).
After searching over the range 0–1, we find the best dropout rate suitable for the recurrent connections.
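The two dropout variants differ only in how often the mask is resampled. A minimal sketch, with a hypothetical helper name and without any rescaling of the kept weights (the text does not say whether rescaling is used):

```python
import numpy as np

def dropconnect_masks(shape, n_steps, p_drop, per_timestep=True, rng=None):
    """Boolean keep-masks of shape (n_steps, *shape); False marks a dropped weight."""
    rng = np.random.default_rng() if rng is None else rng
    n_draws = n_steps if per_timestep else 1
    keep = rng.random((n_draws,) + tuple(shape)) >= p_drop
    if not per_timestep:
        # per-sequence: the same weights are dropped at every timestep
        keep = np.repeat(keep, n_steps, axis=0)
    return keep

# usage: W_rec_t = W_rec * masks[t] inside the unrolled forward/backward pass
```

With `p_drop = 0` no weight is dropped and with `p_drop = 1` every recurrent weight is dropped, which gives the two ends of the searched range.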
10 Experiments
10.1 Datasets
For evaluating the proposed regularization techniques, we use musical datasets. These are note-based representations of score sheets from four sources: JSB Chorales (harmonized chorales of J.S. Bach), Piano-midi.de (classical music from different sources), Nottingham (folk tunes) and MuseData (classical music).
The dimensionality at each timestep for all four datasets is 88. After dividing the original dataset into training, validation and testing sets (approximately 60%–20%–20% respectively), we split the training and validation samples into chunks of 100 timesteps each. We choose this number because in our experience, for a dataset such as music scores, a length of 100 is long enough to make remembering long term dependencies a necessity while at the same time not making it unreasonably difficult for a network to do so. For samples that are smaller than 100 steps long, we pad them with zeros at the front.
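The chunking and front-padding step can be sketched as follows. The function name is ours, and treating the trailing remainder of a long sequence like a short sample (padding it at the front) is an assumption; the text only states the rule for samples shorter than 100 steps.

```python
import numpy as np

def chunk_and_pad(seq, chunk_len=100):
    """Split a (T, dim) piano-roll into chunks of chunk_len steps, zero-padding short ones at the front."""
    T, dim = seq.shape
    chunks = []
    for start in range(0, T, chunk_len):
        piece = seq[start:start + chunk_len]
        if piece.shape[0] < chunk_len:
            pad = np.zeros((chunk_len - piece.shape[0], dim), dtype=seq.dtype)
            piece = np.concatenate([pad, piece], axis=0)  # zeros go in front
        chunks.append(piece)
    return np.stack(chunks)

example = chunk_and_pad(np.ones((250, 88)))  # 250 steps give 3 chunks of 100
```

Padding at the front keeps the real notes adjacent to the end of the chunk, so the final predictions are always conditioned on actual data.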
We do no such splitting or prefixing for the test dataset, and use the original sized data chunks for prediction.
10.2 Model Description
Our setup for all four polyphonic music datasets consists of one hidden layer of neurons at each timestep of the RNN. The number of hidden units in the layer is listed in Appendix B. The hidden units use the hyperbolic-tangent (tanh) nonlinearity and the output nodes use the sigmoid. The model parameters are tasked with describing the random variable, $\hat{y}_{t,k}$, such that
$$\hat{y}_{t,k} = P(y_{t,k} = 1 \mid x_1, \dots, x_t)$$
where $y_{t,k}$ denotes the state of note $k$ at timestep $t$ which, if present, is $1$ and $0$ otherwise.
The loss function which is optimized by this RNN is a mean cross-entropy (CE) loss over all timesteps
$$E = -\frac{1}{N T} \sum_{n=1}^{N} \sum_{t=1}^{T} \sum_{k} \left[ y^{(n)}_{t,k} \log \hat{y}^{(n)}_{t,k} + \left(1 - y^{(n)}_{t,k}\right) \log\left(1 - \hat{y}^{(n)}_{t,k}\right) \right]$$
where $k$ denotes the note index, $t$ denotes the timestep and $n$ denotes the training sample index.
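A mean cross-entropy of this kind can be computed as below. The sketch assumes the loss is summed over the notes of a timestep and averaged over timesteps and samples; the function name and the clipping constant `eps` are ours.

```python
import numpy as np

def mean_cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary CE summed over notes, averaged over samples and timesteps.

    y_true, y_prob: arrays of shape (n_samples, n_steps, n_notes).
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    ce = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return ce.sum(axis=-1).mean()
```

As a sanity check, a model that outputs 0.5 for all 88 notes scores $88 \ln 2 \approx 61$ per timestep regardless of the targets.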
10.3 Results
On the four datasets, we report the average CE errors in Tab. 1. The results for RNN with norm-based regularizer (RNN-NBR), per-timestep noise (RNN-N), per-sequence noise (RNN-NS), multiplicative noise per-timestep (RNN-MN), multiplicative noise per-sequence (RNN-MNS), dropout per-timestep (RNN-DO), dropout per-sequence (RNN-DOS) and feedforward noise (RNN-FF) are given compared to plain RNNs (with initialization in the correct regime) and fast dropout RNN (RNN-FD). Advanced training methods like fast dropout and RNN-NADE (boulanger2012modeling) perform measurably better on this data.
We see that injecting stochastic noise or randomly dropping out weights in recurrent layers during training does not necessarily improve the performance of the RNN training or generalization to the test set. In fact, for most datasets, simply tuning the initialization parameters viz. standard deviation of the weight parameter sampling, sparsification of the weight matrix and spectral radius of the recurrent weight vectors, provides better test performance on the musical datasets, than using the noise injection techniques.
Model     JSBC  Not.  Pmidi  Muse
PlainRNN  8.58  3.43  7.58   6.99
RNN-FD    8.01  3.09  7.39   6.75
RNN-NBR   8.83  3.70  7.78   8.62
RNN-N     8.92  3.56  7.66   8.40
RNN-NS    8.96  3.58  7.74   8.40
RNN-MN    8.64  3.51  7.71   8.13
RNN-MNS   8.64  3.50  7.70   8.12
RNN-DO    8.48  3.49  7.65   7.98
RNN-DOS   8.55  3.57  7.67   8.00
RNN-FF    8.67  3.54  7.69   8.10
As postulated by bayer2013fast, we too observe that the largest eigenvalue, when training with stochastic noise or dropout in the recurrent weights, gets stuck at a lower spectral radius after a fixed number of epochs over multiple tries. There is less incentive for weight matrices with lower spectral radii to change their values by a bigger amount, due to the lack of error information that can be stored over longer time delays. This can be seen in Fig. 5 and Fig. 6. However, this is not the case with norm-based regularizers, where the spectral radius continues to grow, albeit very slowly (Fig. 7).
Tab. 2 in Appendix B gives the range of values from which we generate the regularization coefficient $\lambda$ for the norm-based regularizer. Fig. 8 shows the average logarithmic test errors over different $\lambda$ for both L1 and L2 regularizers.
Fig. 9 shows the average test errors over different $\sigma$ (standard deviation) of additive stochastic noise. The general trend indicates that the network performance decreases as $\sigma$ increases.
Fig. 10 shows the average test errors over different values of $p$ (probability that an incoming recurrent weight is set to zero), for uniform dropout per-sequence. The general trend indicates that the network performance improves as $p$ is increased.
11 Conclusion
Through an exhaustive set of experiments with noisy weight updates, random dropout and norm-based regularization, we have shown that conjectures about the inefficacy of MLP-specific regularizers on RNNs are verifiable. pascanu2012difficulty conjectured that a norm-based penalty on the loss function may reduce the training regime of an RNN to a single point attractor, since the length of the eigenvectors of the weight matrix never grows by more than a limited amount. A matrix of weights with such a low spectral radius would avoid exploding gradients only at the cost of storing long-term dependency effects. We can see this from the demonstration of a simple RNN (Fig. 3). In fact, the analytic presentation of the noisy weight training method shows that noise in weights can also be explained as a loss regularization term.
As the results of stochastic noise, L1 and L2 regularizers on RNNs have not been sufficiently tackled by past works in the field, we believe that we have closed a much needed empirical gap by showing that second order optimization methods, structural solutions or more sophisticated methods of training are indeed imperative to deal with the issues of vanishing gradients and long term dependency in recurrent networks.
References
Appendix A Analysis of Noisy Weights
In this section we attempt to show that adding stochastic noise to the weight matrix is equivalent to adding a regularization term to the loss function of the RNN.
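Let us define the presynaptic activation of the incoming connections to one hidden unit as $a = \sum_i w_i x_i$. Then, upon adding multiplicative noise to the weight vector, we have
$$a' = \sum_i w_i (1 + \Delta_i) x_i = \sum_i w_i x_i + \sum_i w_i x_i \Delta_i$$
The noise, $\Delta_i$, is drawn from a zero-mean Gaussian ($\Delta_i \sim \mathcal{N}(0, \sigma^2)$). Additionally, considering $w_i$ and $x_i$ as constants, we have
$$E[a'] = \sum_i w_i x_i + \sum_i w_i x_i E[\Delta_i] = \sum_i w_i x_i = a \tag{16}$$
This shows that the expected value of $a'$ is the same as the expected value of the presynaptic signal that is not perturbed by noise.
For computing the variance of $a'$, we know that $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.
Therefore,
$$\mathrm{Var}(a') = \mathrm{Var}\Big(\sum_i w_i x_i \Delta_i\Big) + \mathrm{Var}\Big(\sum_i w_i x_i\Big) + 2\,\mathrm{Cov}\Big(\sum_i w_i x_i, \sum_i w_i x_i \Delta_i\Big) \tag{17}$$
The second and third terms on the right hand side of Eq. (17) are zero since the variance in question is that of a constant input, $\sum_i w_i x_i$.
For the first term of Eq. (17), assuming the noise components $\Delta_i$ are independent,
$$\mathrm{Var}\Big(\sum_i w_i x_i \Delta_i\Big) = \sum_i w_i^2 x_i^2 \,\mathrm{Var}(\Delta_i) = \sigma^2 \sum_i w_i^2 x_i^2 \tag{18}$$
where $\sigma$ is the standard deviation of the Gaussian noise matrix, $\Delta$.
Putting this back into Eq. (17), we get
$$\mathrm{Var}(a') = \sigma^2 \sum_i w_i^2 x_i^2 \tag{19}$$
The forms of $E[a']$ and $\mathrm{Var}(a')$ imply that
$$a' \sim \mathcal{N}\Big(a, \; \sigma^2 \sum_i w_i^2 x_i^2\Big) \tag{20}$$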
This means that if the multiplicative noise is assumed to have been sampled from a Gaussian distribution, it is equivalent to assume that the presynaptic activations are sampled from a Gaussian.
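The mean and variance of the noisy presynaptic activation can be checked by simulation. The sketch assumes multiplicative noise of the form $w_i (1 + \Delta_i)$ with independent zero-mean Gaussian $\Delta_i$ of standard deviation $\sigma$; the weights, inputs, $\sigma$ and sample count are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)           # fixed incoming weights of one hidden unit
x = rng.normal(size=10)           # fixed inputs to that unit
sigma = 0.05                      # noise standard deviation

delta = rng.normal(0.0, sigma, size=(200_000, 10))   # one noise draw per sample
a = np.sum(w * x)                                     # clean presynaptic activation
a_noisy = np.sum(w * x * (1.0 + delta), axis=1)       # perturbed activations

predicted_var = sigma ** 2 * np.sum((w * x) ** 2)     # sigma^2 * sum_i w_i^2 x_i^2
empirical_mean, empirical_var = a_noisy.mean(), a_noisy.var()
```

With this many samples, `empirical_mean` should lie close to `a` and `empirical_var` close to `predicted_var`, in line with a Gaussian presynaptic activation centred on the noise-free value.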
This equivalence to a sampling form brings us to the sampling form of presynaptic activation explained by bayer2013fast, instead of smooth Gaussian approximation.
In place of , let us use , which we define as –
Where, .
Using the above incarnation of to its sampling form, , we may define an effective loss function as follows –
(21) 
We will analyse the terms on the right hand side of Eq. 21 one at a time.
For a presynaptic activation, , Eq. 22 is similar to the usual backpropagation term w.r.t a loss function, . Therefore, we may simply use the following form of the gradient term –
(23) 
Consider now the second term of Eq. 21 –
(24) 
This is the same as the postsynaptic gradient term, scaled by the standard deviation of the noise, , and independent of the actual weight values.
Hence, we can write Eq. 21 as –
(25) 
Where the second term on the right hand side is the regularization term due to multiplicative noise addition to the synaptic weights.
Similar analysis can be done for dropout in recurrent weight matrix, where the Gaussian distribution of the noise vector can be replaced by a Bernoulli distribution approximation when choosing .
Appendix B Hyper Parameters for RNN Models
For each of the eight RNN models for which the results are listed in Tab. 1 we generate 50 experiments with model hyper parameters chosen from the ranges given in Tab. 2.
Initialization parameters  for  {1e-3, 1, 1e-4} 

for  {1e-1, 1e-2, 1e-3}  
Sparsify  {15, 25, 50}  
limit  {0.9, 1.0, 1.1}  
Regularizer  Regularizer  {L1, L2} 
Regularizer  [10e-2, 10e-4]  
Dropout  [0.0, 1.0]  
Additive and multiplicative noise  for  [0.01, 0.1] 
Optimizer (rmsprop) parameters  Momentum  {0.9, 0.95, 0.99} 
Step rate  {1e-2, 1e-3, 1e-4}  
Batch size  {27, 81} 
RNN-NBR  RNN-N  RNN-NS  RNN-MN  RNN-MNS  RNN-DO  RNN-DOS  RNN-FF  
Initialization  for  0.0001  0.001  0.001  0.0001  0.0001  0.001  0.001  0.001 
for  0.1  0.1  0.001  0.01  0.001  0.01  0.001  0.1  
Sparsify  15  50  50  50  25  25  50  15  
limit  1.1  0.9  1.0  0.9  0.9  1.0  1.0  0.9  
Regularizer  Regularizer  L2  –  –  –  –  –  –  – 
3.93  –  –  –  –  –  –  –  
Dropout  –  –  –  –  –  0.92  0.56  –  
Noise  for  –  0.01  0.04  0.06  0.01  –  –  0.09 
Optimizer  Momentum  0.9  0.99  0.90  0.95  0.90  0.95  0.95  0.90 
Step rate  0.001  0.0001  0.001  0.0001  0.0001  0.0001  0.0001  0.0001  
Batch size  81  27  27  81  27  81  81  81  
Hidden layer  # hidden  200  200  200  200  200  200  200  200 
RNN-NBR  RNN-N  RNN-NS  RNN-MN  RNN-MNS  RNN-DO  RNN-DOS  RNN-FF  
Initialization  for  0.0001  0.0001  0.001  0.0001  0.0001  0.001  0.001  0.0001 
for  0.1  0.1  0.01  0.001  0.001  0.01  0.1  0.001  
Sparsify  15  25  25  15  25  15  25  15  
limit  0.9  1.1  1.0  1.0  1.0  0.9  1.1  1.1  
Regularizer  Regularizer  L2  –  –  –  –  –  –  – 
3.77  –  –  –  –  –  –  –  
Dropout  –  –  –  –  –  0.36  0.78  –  
Noise  for  –  0.01  0.02  0.02  0.06  –  –  0.05 
Optimizer  Momentum  0.95  0.95  0.95  0.95  0.95  0.90  0.90  0.95 
Step rate  0.0001  0.0001  0.0001  0.0001  0.0001  0.001  0.001  0.0001  
Batch size  81  27  27  27  27  81  81  27  
Hidden layer  # hidden  200  200  200  200  200  200  200  200 
RNN-NBR  RNN-N  RNN-NS  RNN-MN  RNN-MNS  RNN-DO  RNN-DOS  RNN-FF  
Initialization  for  0.0001  0.001  0.001  0.0001  0.0001  0.001  0.0001  0.0001 
for  0.001  0.1  0.001  0.001  0.1  0.001  0.01  0.1  
Sparsify  15  25  15  15  15  15  25  50  
limit  0.90  1.0  1.0  1.0  0.90  1.0  0.90  0.90  
Regularizer  Regularizer  L2  –  –  –  –  –  –  – 
3.52  –  –  –  –  –  –  –  
Dropout  –  –  –  –  –  0.69  0.51  –  
Noise  for  –  0.05  0.04  0.04  0.02  –  –  0.08 
Optimizer  Momentum  0.95  0.95  0.99  0.90  0.95  0.90  0.95  0.90 
Step rate  0.0001  0.0001  0.0001  0.001  0.0001  0.0001  0.0001  0.0001  
Batch size  27  27  81  81  81  27  81  81  
Hidden layer  # hidden  100  100  100  100  100  100  100  100 
RNN-NBR  RNN-N  RNN-NS  RNN-MN  RNN-MNS  RNN-DO  RNN-DOS  RNN-FF  
Initialization  for  0.001  0.0001  0.0001  0.001  0.001  0.0001  0.0001  0.0001 
for  0.01  0.01  0.1  0.1  0.1  0.001  0.1  0.001  
Sparsify  25  50  15  50  50  50  25  15  
limit  1.0  0.9  1.1  1.0  1.0  1.1  1.0  0.9  
Regularizer  Regularizer  L1  –  –  –  –  –  –  – 
3.80  –  –  –  –  –  –  –  
Dropout  –  –  –  –  –  0.93  0.80  –  
Noise  for  –  0.02  0.02  0.04  0.09  –  –  0.01 
Optimizer  Momentum  0.90  0.99  0.95  0.95  0.90  0.90  0.90  0.95 
Step rate  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  
Batch size  81  27  27  81  81  27  81  81  
Hidden layer  # hidden  600  600  600  600  600  600  600  600 