Abstract
Training recurrent neural networks (RNNs) is difficult due to degeneracies in the optimization landscape, a problem also known as the vanishing/exploding gradients problem. Short of designing new RNN architectures, previously proposed methods for dealing with this problem usually boil down to orthogonalization of the recurrent dynamics, either at initialization or during the entire training period. The basic motivation behind these methods is that orthogonal transformations are isometries of the Euclidean space, hence they preserve (Euclidean) norms and effectively deal with the vanishing/exploding gradients problem. However, this idea ignores the crucial effects of nonlinearity and noise. In the presence of a nonlinearity, orthogonal transformations no longer preserve norms, suggesting that alternative transformations might be better suited to nonlinear networks. Moreover, in the presence of noise, norm preservation itself ceases to be the ideal objective. A more sensible objective is instead maximizing the signal-to-noise ratio (SNR) of the propagated signal. Previous work has shown that in the linear case, recurrent networks that maximize the SNR display strongly non-normal dynamics, and that orthogonal networks are highly suboptimal by this measure. Motivated by this finding, in this paper we investigate the potential of non-normal RNNs, i.e. RNNs with a non-normal recurrent connectivity matrix, in sequential processing tasks. Our experimental results show that non-normal RNNs significantly outperform their orthogonal counterparts in a diverse range of benchmarks. We also find evidence for increased non-normality and hidden chain-like feedforward structures in trained RNNs initialized with orthogonal recurrent connectivity matrices.
Improved memory in recurrent neural networks with sequential non-normal dynamics
Emin Orhan, Xaq Pitkow
Introduction
Modeling long-term dependencies with recurrent neural networks (RNNs) is a hard problem due to degeneracies inherent in the optimization landscapes of these models, a problem also known as the vanishing/exploding gradients problem (Hochreiter, 1991; Bengio et al., 1994). One approach to addressing this problem has been designing new RNN architectures that are less prone to such difficulties, hence better able to capture long-term dependencies in sequential data (Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Chang et al., 2017; Bai et al., 2018). An alternative approach is to stick with the basic vanilla RNN architecture instead, but to constrain its dynamics in some way so as to eliminate or reduce the degeneracies that otherwise afflict the optimization landscape. Previous proposals belonging to this second category generally boil down to orthogonalization of the recurrent dynamics, either at initialization or during the entire training period (Le et al., 2015; Arjovsky et al., 2016; Wisdom et al., 2016). The basic idea behind these methods is that orthogonal transformations are isometries of the Euclidean space, hence they preserve distances and norms, which enables them to deal effectively with the vanishing/exploding gradients problem.
However, this idea ignores the crucial effects of nonlinearity and noise. Orthogonal transformations no longer preserve distances and norms in the presence of a nonlinearity, suggesting that alternative transformations might be better suited to nonlinear networks. Similarly, in the presence of noise, norm preservation itself ceases to be the ideal objective. One must instead maximize the signal-to-noise ratio (SNR) of the propagated signal. In neural networks, noise comes in both through the stochasticity of the stochastic gradient descent (SGD) algorithm and sometimes also through direct noise injection for regularization purposes, as in dropout. Previous work has shown that even in the linear case, recurrent networks that maximize the SNR display strongly non-normal dynamics, and that orthogonal networks are highly suboptimal by this measure (Ganguli et al., 2008). Motivated by these observations, in this paper we investigate the potential of non-normal RNNs, i.e. RNNs with a non-normal recurrent connectivity matrix, in sequential processing tasks. Recall that a normal matrix is a matrix with an orthonormal set of eigenvectors, whereas a non-normal matrix does not have such a set of eigenvectors. This property allows non-normal systems to display interesting transient behaviors that are not available in normal systems. This kind of transient behavior, specifically a particular kind of transient amplification of the signal in certain non-normal systems, underlies their superior memory properties (Ganguli et al., 2008), as will be discussed further below.
Our empirical results show that non-normal vanilla RNNs significantly outperform their orthogonal counterparts in a diverse range of benchmarks.
Ganguli et al. (2008) studied the memory properties of linear recurrent networks injected with a scalar temporal signal, $s_t$, and noise, $\mathbf{z}_t$:

$$\mathbf{x}_t = W \mathbf{x}_{t-1} + \mathbf{v} s_t + \mathbf{z}_t \qquad (1)$$
The noise $\mathbf{z}_t$ is assumed to be i.i.d. Gaussian with $\langle \mathbf{z}_t \rangle = \mathbf{0}$ and $\langle \mathbf{z}_t \mathbf{z}_{t'}^\top \rangle = \varepsilon \delta_{tt'} I$. Ganguli et al. (2008) then analyzed the Fisher memory matrix (FMM) of this system, defined as:

$$J_{kl}(\mathbf{s}) \equiv \Big\langle -\frac{\partial^2 \log p(\mathbf{x}_t \mid \mathbf{s})}{\partial s_{t-k} \, \partial s_{t-l}} \Big\rangle_{p(\mathbf{x}_t \mid \mathbf{s})} \qquad (2)$$
For linear networks with Gaussian noise, it is easy to show that $J_{kl}$ is, in fact, independent of the past signal history $\mathbf{s}$. Ganguli et al. (2008) specifically analyzed the diagonal of the FMM, $J(k) \equiv J_{kk}$, which can be written explicitly as:

$$J(k) = \mathbf{v}^\top (W^\top)^k C_n^{-1} W^k \mathbf{v} \qquad (3)$$
where $C_n = \varepsilon \sum_{k=0}^{\infty} W^k (W^\top)^k$ is the noise covariance matrix, and the norm of $W^k \mathbf{v}$ can be roughly thought of as representing the signal strength. The total Fisher memory is the sum of $J(k)$ over all past time steps $k$:

$$J_{\mathrm{tot}} = \sum_{k=0}^{\infty} J(k) \qquad (4)$$
Intuitively, $J(k)$ measures the information contained in the current state of the system, $\mathbf{x}_t$, about a signal that entered the system $k$ time steps ago, $s_{t-k}$. $J_{\mathrm{tot}}$ is then a measure of the total information contained in the current state of the system about the entire past signal history, $\mathbf{s}$.
The main result in Ganguli et al. (2008) shows that $J_{\mathrm{tot}} = 1$ for all normal matrices (including all orthogonal matrices), whereas in general $J_{\mathrm{tot}} \leq N$, where $N$ is the network size. Remarkably, the extensive memory upper bound can be achieved by certain highly non-normal systems, and several examples are explicitly given in Ganguli et al. (2008). Two of those examples are illustrated in Figure 1a (right): a unidirectional "chain" network and a chain network with feedback. In the chain network, the recurrent connectivity is given by $W_{ij} = \alpha \, \delta_{i,j+1}$, and in the chain-with-feedback network it is given by $W_{ij} = \alpha \, \delta_{i,j+1} + \beta \, \delta_{i,j-1}$, where $\alpha$ and $\beta$ are the feedforward and feedback connection weights, respectively, and $\delta$ is the Kronecker delta function. In addition, in order to achieve optimal memory, the signal must be fed in at the source neuron of these networks, i.e. $\mathbf{v} = (1, 0, \ldots, 0)^\top$.
Figure 1b compares the Fisher memory curves, $J(k)$, of these non-normal networks with the Fisher memory curves of two example normal networks, namely recurrent networks with identity or random orthogonal connectivity matrices. The two non-normal networks have extensive memory capacity, i.e. $J_{\mathrm{tot}} = O(N)$, whereas for the normal examples $J_{\mathrm{tot}} = 1$. The crucial property that enables extensive memory in non-normal networks is transient amplification: after the signal enters the network, it is amplified supralinearly for an extended transient period before it eventually dies out (Figure 1c). This kind of transient amplification is not possible in normal networks.
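As a concrete illustration, the following minimal NumPy sketch (our own, not taken from the paper's code; the network size, scales, and truncation horizon are arbitrary choices) numerically evaluates the Fisher memory curve of Eq. (3) for a chain network and a random orthogonal network:

```python
import numpy as np

def fisher_memory_curve(W, v, n_steps, eps=1.0, horizon=500):
    """J(k) = v^T (W^T)^k C_n^{-1} W^k v  (Eq. 3), with the noise covariance
    C_n = eps * sum_k W^k (W^k)^T truncated at a long but finite horizon."""
    N = W.shape[0]
    Cn = np.zeros((N, N))
    Wk = np.eye(N)
    for _ in range(horizon):
        Cn += eps * Wk @ Wk.T
        Wk = Wk @ W
    Cn_inv = np.linalg.inv(Cn)
    J = np.empty(n_steps)
    u = v.copy()
    for k in range(n_steps):
        J[k] = u @ Cn_inv @ u   # v^T (W^T)^k C_n^{-1} W^k v
        u = W @ u               # advance W^k v -> W^{k+1} v
    return J

N = 100
v = np.zeros(N)
v[0] = 1.0  # feed the signal in at the source neuron

# Chain: W_ij = alpha * delta_{i,j+1}. The matrix is nilpotent, so the
# dynamics remain stable even with alpha > 1 (transient amplification).
W_chain = 1.1 * np.eye(N, k=-1)

# Random orthogonal connectivity, scaled below 1 so that C_n is finite.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
W_orth = 0.99 * Q

for name, W in [("chain", W_chain), ("orthogonal", W_orth)]:
    J = fisher_memory_curve(W, v, n_steps=2 * N)
    print(f"{name}: J_tot = {J.sum():.2f}")
```

With these settings, the chain's total memory grows with $N$, while the orthogonal network's total memory stays pinned at 1, in line with Eq. (4) and the bounds above.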
The preceding analysis, due to Ganguli et al. (2008), is exact in linear networks. Analysis becomes more difficult in the presence of a nonlinearity. However, we now demonstrate that the non-normal networks shown in Figure 1a have advantages that extend beyond the linear case. The advantages in the nonlinear case are due to reduced interference in these non-normal networks between signals entering the network at different time points in the past. To demonstrate this, we will ignore the effect of noise and consider the effect of nonlinearity on the linear decodability of past signals from the current network activity. We thus consider deterministic nonlinear networks of the form:
$$\mathbf{x}_t = f(W \mathbf{x}_{t-1} + \mathbf{v} s_t) \qquad (5)$$
and ask how well we can linearly decode a signal that entered the network $k$ time steps ago, $s_{t-k}$, from the current activity of the network, $\mathbf{x}_t$. Figure 2c compares the decoding performance in a nonlinear orthogonal network with the decoding performance in the nonlinear chain network. Just as in the linear case with noise (Figure 2b), the chain network outperforms the orthogonal network.
To understand intuitively why this is the case, consider a chain network with $\alpha = 1$ and $\mathbf{v} = (1, 0, \ldots, 0)^\top$. In this model, the responses of the neurons after $t$ time steps (at time $t$) are given by $f(s_t), f(f(s_{t-1})), \ldots, f^{(t)}(s_1)$, respectively, starting from the source neuron, where $f^{(t)}$ denotes the $t$-fold composition of $f$. Although the nonlinearity makes perfect linear decoding of the past signal impossible, one may still imagine being able to decode the past signal with reasonable accuracy as long as $f$ is not "too nonlinear". A similar intuition holds for the chain network with feedback as well, as long as the feedforward connection weight, $\alpha$, is sufficiently stronger than the feedback connection weight, $\beta$. A condition like this must already be satisfied if the network is to maintain its optimal memory properties and also be dynamically stable at the same time (Ganguli et al., 2008).
In normal networks, however, linear decoding is further degraded by interference from signals entering the network at different time points, in addition to the degradation caused by the nonlinearity. This is easiest to see in the identity network (a similar argument holds for the random orthogonal example too), where the responses of the neurons after $t$ time steps are identically given by $f(f(\cdots f(s_1) + \cdots + s_{t-1}) + s_t)$, if one assumes $\mathbf{v} = (1, \ldots, 1)^\top$. Linear decoding is harder in this case, because a signal is both distorted by multiple steps of nonlinearity and also mixed with signals entering at other time points.
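This intuition is easy to probe numerically. The sketch below (our own construction, not the paper's experiment; the tanh nonlinearity, network size, delay $k$, and ridge decoder are all illustrative assumptions) fits a linear decoder for $s_{t-k}$ on the final state of a nonlinear chain network and a nonlinear orthogonal network:

```python
import numpy as np

def final_states(W, v, S, f=np.tanh):
    """Run x_t = f(W x_{t-1} + v s_t)  (Eq. 5) and return the final states."""
    n_seq, T = S.shape
    X = np.zeros((n_seq, W.shape[0]))
    for i in range(n_seq):
        x = np.zeros(W.shape[0])
        for t in range(T):
            x = f(W @ x + v * S[i, t])
        X[i] = x
    return X

rng = np.random.default_rng(0)
N, T, k = 100, 50, 20                     # decode s_{T-k} from x_T
S = rng.standard_normal((2500, T))
y = S[:, T - 1 - k]

v_chain = np.zeros(N); v_chain[0] = 1.0   # input at the source neuron
W_chain = np.eye(N, k=-1)                 # chain with alpha = 1
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
W_orth = Q                                # orthogonal connectivity
v_orth = np.ones(N) / np.sqrt(N)          # distributed input

for name, W, v in [("chain", W_chain, v_chain), ("orth", W_orth, v_orth)]:
    X = final_states(W, v, S)
    A, b = X[:2000], y[:2000]             # ridge-regression decoder
    w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(N), A.T @ b)
    mse = np.mean((X[2000:] @ w - y[2000:]) ** 2)
    print(f"{name}: test MSE for s_(t-{k}) = {mse:.3f}")
```

In the chain, the delayed signal sits (distorted but unmixed) in a single unit, so the decoder only has to undo the nonlinearity; in the orthogonal network it is additionally mixed with signals from other time steps.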
Because assuming an a priori non-normal structure for an RNN runs the risk of being too restrictive, in this paper we instead explore the promise of non-normal networks as initializers for RNNs. Throughout the paper, we will primarily be comparing the four RNN architectures schematically depicted in Figure 1a as initializers: two of them normal networks (identity and random orthogonal) and the other two non-normal networks (chain and chain with feedback), the last two being motivated by their optimal memory properties in the linear case, as reviewed above. We provide PyTorch and Keras classes implementing the proposed non-normal initializers at the following public repository: https://github.com/eminorhan/nonnormal-init.
The copy, addition, and permuted sequential MNIST tasks have been commonly used as benchmarks in previous RNN studies (Arjovsky et al., 2016; Bai et al., 2018; Chang et al., 2017; Hochreiter & Schmidhuber, 1997; Le et al., 2015; Wisdom et al., 2016). We now briefly describe each of these tasks.
Copy task: The input is a sequence of integers of length $T + 20$. The first 10 integers in the sequence define the target subsequence that is to be copied and consist of integers between 1 and 8 (inclusive). The next $T - 1$ integers are set to 0. The integer after that is set to 9, which acts as the cue indicating that the model should start copying the target subsequence. The final 10 integers are set to 0. The output sequence that the model is trained to reproduce consists of $T + 10$ 0s followed by the target subsequence from the input that is to be copied. To make sure that the task requires a sufficiently long memory capacity, we used a large sequence length $T$, comparable to the largest sequence length considered in Arjovsky et al. (2016) for the same task.
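For concreteness, a minimal NumPy generator for this task might look as follows (a sketch under the conventions just described, not the paper's data pipeline; the batch size and the value of $T$ below are arbitrary):

```python
import numpy as np

def make_copy_batch(batch_size, T, rng):
    """Input: 10 symbols from {1..8}, T-1 zeros, the cue 9, then 10 zeros.
    Target: T+10 zeros followed by the 10 symbols to be copied."""
    x = np.zeros((batch_size, T + 20), dtype=np.int64)
    y = np.zeros((batch_size, T + 20), dtype=np.int64)
    targets = rng.integers(1, 9, size=(batch_size, 10))  # integers 1..8
    x[:, :10] = targets
    x[:, T + 9] = 9          # cue: start copying
    y[:, T + 10:] = targets  # reproduce the target at the end
    return x, y

x, y = make_copy_batch(batch_size=32, T=500, rng=np.random.default_rng(0))
```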
Addition task: The input consists of two sequences of length $T$. The first one is a sequence of random numbers drawn uniformly from the interval $[0, 1]$. The second sequence is an indicator sequence with 1s at exactly two positions and 0s everywhere else. The positions of the two 1s indicate the positions of the numbers to be added in the first sequence. The target output is the sum of the two corresponding numbers. The position of the first 1 is drawn uniformly from the first half of the sequence and the position of the second 1 is drawn uniformly from the second half of the sequence. Again, to ensure that the task requires a sufficiently long memory capacity, we chose $T = 750$, which is the same as the largest sequence length considered in Arjovsky et al. (2016) for the same task.
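A corresponding sketch for the addition task (again our own illustrative code, with an arbitrary batch size):

```python
import numpy as np

def make_addition_batch(batch_size, T, rng):
    """Two input channels: uniform [0,1] values and a two-hot indicator."""
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    markers = np.zeros((batch_size, T))
    rows = np.arange(batch_size)
    i = rng.integers(0, T // 2, size=batch_size)   # first half
    j = rng.integers(T // 2, T, size=batch_size)   # second half
    markers[rows, i] = 1.0
    markers[rows, j] = 1.0
    x = np.stack([values, markers], axis=-1)       # shape (batch, T, 2)
    y = values[rows, i] + values[rows, j]          # target sums
    return x, y

x, y = make_addition_batch(batch_size=32, T=750, rng=np.random.default_rng(0))
```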
Permuted sequential MNIST (psMNIST): This is a sequential version of the standard MNIST benchmark where the pixels are fed to the model one pixel at a time. To make the task hard enough, we used the permuted version of the sequential MNIST task where a fixed random permutation is applied to the pixels to eliminate any spatial structure before they are fed into the model.
We used the elu nonlinearity for the copy and the permuted sequential MNIST tasks (Clevert et al., 2016), and the relu nonlinearity for the addition problem (because relu proved to be more natural for remembering positive numbers).
As mentioned above, the scaled identity and the scaled random orthogonal networks constituted the normal initializers. In the scaled identity initializer, the recurrent connectivity matrix was initialized as $W = \lambda I$, where the scale $\lambda$ is a hyperparameter. In the random orthogonal initializer, the recurrent connectivity matrix was initialized as $W = \lambda Q$, where $Q$ is a random dense orthogonal matrix, and the input matrix was initialized in the same way as in the identity initializer.
The feedforward chain and the chain-with-feedback networks constituted the non-normal initializers. In the chain initializer, the recurrent connectivity matrix was initialized as $W_{ij} = \alpha \, \delta_{i,j+1}$ and the input matrix was initialized as $V = [I_m, 0, \ldots, 0]^\top$, where $I_m$ denotes the $m$-dimensional identity matrix, so that the $m$-dimensional input drives the source end of the chain. In the chain-with-feedback initializer, the recurrent connectivity matrix was initialized as $W_{ij} = \alpha \, \delta_{i,j+1} + \beta \, \delta_{i,j-1}$ and the input matrix was initialized in the same way as in the chain initializer.
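The sketch below illustrates how these initializers can be constructed in PyTorch (a minimal re-implementation consistent with the definitions above, not the classes from our repository; the input-matrix convention for the normal initializers is an assumption here):

```python
import torch

def chain_(W, alpha, beta=0.0):
    """In-place: W_ij = alpha * delta_{i,j+1} + beta * delta_{i,j-1}."""
    n = W.shape[0]
    W.zero_()
    idx = torch.arange(n - 1)
    W[idx + 1, idx] = alpha  # feedforward weights along the chain
    W[idx, idx + 1] = beta   # optional feedback weights
    return W

def scaled_identity_(W, lam):
    W.zero_()
    W.fill_diagonal_(lam)
    return W

def scaled_orthogonal_(W, lam):
    return torch.nn.init.orthogonal_(W, gain=lam)

def source_input_(V):
    """Feed the m-dimensional input into the first m (source) neurons."""
    n, m = V.shape
    V.zero_()
    V[:m, :m] = torch.eye(m)
    return V

n, m = 128, 2
W = chain_(torch.empty(n, n), alpha=1.0, beta=0.05)
V = source_input_(torch.empty(n, m))
```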
We used the rmsprop optimizer for all models, which we found to be the best method for this set of tasks. The learning rate of the optimizer was a hyperparameter which we tuned separately for each model and each task, searching over a fixed grid of learning rates. We ran each model on each task multiple times, using consecutive integers as random seeds.
In addition, the following model-specific hyperparameters were searched over for each task:

- Chain model: the feedforward connection weight, $\alpha$
- Chain-with-feedback model: the feedback connection weight, $\beta$
- Scaled identity model: the scale, $\lambda$
- Random orthogonal model: the scale, $\lambda$
This yields a total of 462 different runs for each experiment in the non-normal models and a larger number of runs for each experiment in the normal models. Note that we thus ran more extensive hyperparameter searches for the normal models than for the non-normal models in this set of tasks.
Figure 3a-c shows the validation losses for each model with the best hyperparameter settings. The non-normal initializers generally outperform the normal initializers. Figure 3d-f shows for each model the number of "successful" runs that converged to a validation loss below a criterion level (which we set to be 50% of the loss for a baseline random model). The chain model outperformed all other models by this measure (despite having a smaller total number of runs than the normal models). In the copy task, for example, none of the runs for the normal models was able to achieve the criterion level, whereas 46 out of 462 runs for the chain model and 11 out of 462 runs for the feedback chain model reached the criterion loss.
To investigate if the benefits of non-normal initializers extend to more realistic problems, we conducted experiments with three standard language modeling tasks: the word-level Penn Treebank (PTB), character-level PTB, and character-level enwik8 benchmarks.
For the language modeling experiments in this subsection, we used the code base provided by Salesforce Research (Merity et al., 2018a; b): https://github.com/salesforce/awd-lstm-lm. We refer the reader to Merity et al. (2018a; 2018b) for a more detailed description of the benchmarks. For the experiments in this subsection, we generally preserved the model setup used in Merity et al. (2018a; 2018b), except for the following differences: 1) we replaced the gated RNN architectures (LSTMs and QRNNs) used in Merity et al. (2018a; 2018b) with vanilla RNNs; 2) we observed that vanilla RNNs require weaker regularization than gated RNN architectures, so in the word-level PTB task we set all dropout rates to zero, in the character-level PTB task we set all dropout rates except dropoute (which was kept at a small nonzero value) to zero, and in the enwik8 benchmark we likewise set all dropout rates to zero; 3) we trained the word-level PTB models for 60 epochs, the character-level PTB models for 500 epochs, and the enwik8 models for 35 epochs.
We compared the same four models described in the previous subsection. As in Merity et al. (2018a), we used the Adam optimizer and thus only optimized the model-specific hyperparameters $\alpha$, $\beta$, and $\lambda$ for the experiments in this subsection. For the $\alpha$ hyperparameter in the chain model and the $\lambda$ hyperparameter in the scaled identity and random orthogonal models, we searched over 21 uniformly spaced values; for the chain-with-feedback model, we set the feedforward connection weight, $\alpha$, to the optimal value it had in the chain model and searched over 21 uniformly spaced values of the feedback weight, $\beta$. In addition, we repeated each experiment 3 times using different random seeds, yielding a total of 63 runs for each model and each benchmark.
The results are shown in Figure 4 and in Table 1. Figure 4 shows the validation loss over the course of training in units of bits per character (bpc). Table 1 reports the test losses at the end of training. The non-normal models outperform the normal models on the word-level and character-level PTB benchmarks. The differences between the models are less clear on the enwik8 benchmark. However, in terms of the test loss, the non-normal feedback chain model significantly outperforms the other models on all three benchmarks (Table 1).
Table 1. Test losses at the end of training.

Model | PTB word | PTB char. | enwik8
Identity | 6.550 ± 0.002 | 1.312 ± 0.000 | 1.783 ± 0.003
Ortho. | 6.557 ± 0.002 | 1.312 ± 0.001 | 1.843 ± 0.046
Chain | 6.514 ± 0.001 | 1.308 ± 0.000 | 1.803 ± 0.017
Fb. chain | 6.510 ± 0.001 | 1.307 ± 0.000 | 1.774 ± 0.002
3-layer LSTM | 5.878 | 1.175 | 1.232
We note that the vanilla RNN models perform significantly worse than the gated RNN architectures considered in Merity et al. (2018a; 2018b). We conjecture that this is because gated architectures are generally better at modeling contextual dependencies, hence they have inductive biases better suited to language modeling tasks. The primary benefit of non-normal dynamics, on the other hand, is enabling a longer memory capacity. Below, we will discuss whether non-normal dynamics can be used in gated RNN architectures to improve performance as well.
Next, we conducted experiments with a reinforcement learning (RL) agent trained in the car racing environment CarRacing-v0 in OpenAI Gym. Specifically, we used the model introduced in Ha & Schmidhuber (2018) for this environment. For the experiments reported in this subsection, we also used the code base provided by the authors: https://github.com/hardmaru/WorldModelsExperiments. Briefly, in this model, the agent first collects a large number of rollouts from the environment using a random policy. These random rollouts are then used as training data for a variational autoencoder (VAE), learning a compact, low-dimensional representation, $z_t$, of the agent's high-dimensional observations. Then, a predictive model of this latent representation is learned via an RNN. More specifically, at each time step, the RNN takes as input the current action of the agent, $a_t$, and the current latent state of the environment, $z_t$, and predicts the next latent state, $z_{t+1}$. Using an RNN as a predictive model enables the agent to learn potentially complex dependencies between the histories of the agent's actions and of the state of the environment. In the final step, a simple linear controller is trained to perform the actual car racing task, using the hidden state of the predictive RNN model, $h_t$, and the latent state of the environment, $z_t$. Ha & Schmidhuber (2018) train the predictive RNN model and the controller separately (i.e. the entire model is not trained end-to-end), so we only consider the training of the RNN in our experiments and ignore the training of the controller. Accordingly, the loss values reported below are the validation losses (i.e. negative log-likelihoods) for the predictive model only. For further details, we refer the reader to Ha & Schmidhuber (2018). We essentially use the same setup that they use, except for a few differences: 1) we replace the LSTM with a vanilla RNN (with the same number of units) as the predictive model; 2) we use a smaller number of random rollouts (300 vs. 10000); 3) we use the Adam optimizer with a learning rate of 0.0005, instead of the rmsprop optimizer.
Table 2. Final validation losses for the predictive model.

Model | Validation loss
Identity | 1.409 ± 0.004
Chain | 1.392 ± 0.005
For the experiments in this subsection, we only compared RNNs initialized with a scaled identity matrix with RNNs initialized with a chain structure. The hyperparameter searches conducted were identical to the searches described above for the language modeling experiments. Table 2 shows the results. The chain model outperformed the identity model in terms of the final validation loss for the predictive model.
We observed that training made vanilla RNNs initialized with orthogonal recurrent connectivity matrices non-normal. We quantified the non-normality of the trained recurrent connectivity matrices using a measure introduced by Henrici (1962): $d(W) \equiv \sqrt{\|W\|_F^2 - \sum_i |\lambda_i|^2}$, where $\|\cdot\|_F$ denotes the Frobenius norm and $\lambda_i$ is the $i$-th eigenvalue of $W$. This measure equals 0 for all normal matrices and is positive for non-normal matrices. We found that $d(W)$ became positive for all successfully trained RNNs initialized with orthogonal recurrent connectivity matrices. Table 3 reports the aggregate statistics of $d(W)$ for RNNs initialized with the identity and random orthogonal initializers and trained on the toy benchmarks.
Table 3. Henrici index $d(W)$ of trained recurrent connectivity matrices.

Task | Identity | Orthogonal
Addition-750 | 2.33 ± 1.02 | 2.74 ± 0.07
psMNIST | 1.01 ± 0.12 | 2.72 ± 0.08
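The Henrici measure is straightforward to compute; a short sketch (with the square root clamped at zero to absorb floating-point rounding, an implementation detail of ours):

```python
import numpy as np

def henrici(W):
    """d(W) = sqrt(||W||_F^2 - sum_i |lambda_i|^2); 0 iff W is normal."""
    eigvals = np.linalg.eigvals(W)
    gap = np.sum(np.abs(W) ** 2) - np.sum(np.abs(eigvals) ** 2)
    return np.sqrt(max(gap, 0.0))  # clamp tiny negative rounding error

Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((100, 100)))
print(henrici(Q))                  # ~0: orthogonal matrices are normal
print(henrici(np.eye(100, k=-1)))  # positive: the chain is non-normal
```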
Although increased non-normality in trained RNNs is an interesting observation, the Henrici index, by itself, does not tell us what structural features in trained RNNs contribute to this increased non-normality. Given the benefits of chain-like feedforward non-normal structures in RNNs for improved memory, we hypothesized that training might have installed hidden chain-like feedforward structures in trained RNNs and that these feedforward structures were responsible for their increased non-normality.
To uncover these hidden feedforward structures, we performed an analysis suggested by Rajan et al. (2016). In this analysis, we first injected a unit pulse of input into the network at the beginning of the trial and let the network evolve for a number of time steps afterwards according to its recurrent dynamics, with no direct input. We then ordered the recurrent units by the time of their peak activity (using a small amount of jitter to break potential ties between units) and plotted the mean recurrent connection weights, $\langle W_{ij} \rangle$, as a function of the order difference between the units, $\Delta \equiv \mathrm{order}(i) - \mathrm{order}(j)$. Positive $\Delta$ values correspond to connections from earlier-peaking units to later-peaking units, and vice versa for negative $\Delta$ values. In trained RNNs, the mean recurrent weight profile as a function of $\Delta$ had an asymmetric peak, with connections in the "forward" direction being, on average, stronger than those in the opposite direction. Figure 5 shows examples with orthogonally initialized RNNs trained on the addition and the permuted sequential MNIST tasks. Note that for a purely feedforward chain, the weight profile would have a single peak at $\Delta = 1$ and would be zero elsewhere. Although the weight profiles for trained RNNs are not this extreme, the prominent asymmetric bump with a peak at a positive $\Delta$ indicates a hidden chain-like feedforward structure in these networks.
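In code, this analysis can be sketched roughly as follows (our own paraphrase of the procedure, with a tanh nonlinearity and a source-unit pulse as illustrative assumptions):

```python
import numpy as np

def weight_profile(W, pulse, f=np.tanh, n_steps=100, seed=0):
    """Mean W_ij as a function of Delta = order(i) - order(j), where units
    are ordered by the time of their peak activity after a unit pulse."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    acts = np.zeros((n_steps, n))
    x = f(pulse)                     # pulse at t = 0, then free evolution
    for t in range(n_steps):
        acts[t] = x
        x = f(W @ x)
    peak_t = acts.argmax(axis=0) + 1e-6 * rng.random(n)  # jitter breaks ties
    order = np.argsort(np.argsort(peak_t))               # rank of each unit
    delta = order[:, None] - order[None, :]              # order(i) - order(j)
    return {d: W[delta == d].mean() for d in range(-(n - 1), n)}

# Sanity check: the profile of a pure chain peaks at Delta = 1, 0 elsewhere.
n = 50
pulse = np.zeros(n); pulse[0] = 1.0      # pulse enters at the source unit
profile = weight_profile(np.eye(n, k=-1), pulse)
print(profile[1], profile[2])            # 1.0 and 0.0
```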
So far, we have only considered vanilla RNNs. An important question is whether the benefits of non-normal dynamics demonstrated above for vanilla RNNs also extend to gated RNN architectures like LSTMs or GRUs (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). Gated RNN architectures have better inductive biases than vanilla RNNs in many practical tasks of interest such as language modeling (e.g. see Table 1 for a comparison of vanilla RNN architectures with an LSTM architecture of similar size in the language modeling benchmarks), thus it would be practically very useful if their performance could be improved through an inductive bias for non-normal dynamics.
Table 4. Test losses for the LSTM initializers.

Model | PTB word | PTB char. | enwik8
Ortho. | 5.937 ± 0.002 | 1.230 ± 0.001 | 1.583 ± 0.001
Chain | 5.935 ± 0.001 | 1.230 ± 0.001 | 1.586 ± 0.000
Plain | 5.949 ± 0.007 | 1.245 ± 0.001 | 1.584 ± 0.002
Mixed | 5.944 ± 0.004 | 1.227 ± 0.000 | 1.577 ± 0.001
To address this question, we treated the input, forget, output, and update gates of the LSTM architecture as analogous to vanilla RNNs and initialized the recurrent and input matrices inside these gates in the same way as in the chain or the orthogonal initialization of vanilla RNNs above. We also compared these with a more standard initialization scheme where all the weights were drawn from a uniform distribution $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k$ is the reciprocal of the hidden layer size (labeled "plain" in Table 4). This is the default initializer for the LSTM weight matrices in PyTorch: https://pytorch.org/docs/stable/nn.html#lstm.
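As a concrete sketch, the chain initializer can be applied to the gates of a PyTorch LSTM as follows (our own illustration; the layer sizes and the value of alpha are arbitrary):

```python
import torch
import torch.nn as nn

def chain_matrix(n, alpha, beta=0.0):
    W = torch.zeros(n, n)
    idx = torch.arange(n - 1)
    W[idx + 1, idx] = alpha
    W[idx, idx + 1] = beta
    return W

lstm = nn.LSTM(input_size=64, hidden_size=256)
with torch.no_grad():
    n = lstm.hidden_size
    # weight_hh_l0 stacks the recurrent matrices of the four gates
    # (input, forget, cell/update, output) row-wise into a (4n, n) tensor.
    for g in range(4):
        lstm.weight_hh_l0[g * n:(g + 1) * n].copy_(chain_matrix(n, alpha=1.0))
```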
We compared these initializers in the language modeling benchmarks. The chain initializer did not perform better than the orthogonal initializer (Table 4), suggesting that non-normal dynamics in gated RNN architectures may not be as helpful as it is in vanilla RNNs. In hindsight, this is not too surprising, because our initial motivation for introducing non-normal dynamics relied heavily on the vanilla RNN architecture, and gated RNNs can be dynamically very different from vanilla RNNs.
When we looked at the trained LSTM weight matrices more closely, we found that, although still non-normal, the recurrent weight matrices inside the input, forget, and output gates (i.e. the sigmoid gates) did not have the same signatures of hidden chain-like feedforward structures observed in vanilla RNNs. Specifically, the weight profiles of the LSTM recurrent weight matrices inside these three gates did not display the asymmetric bump characteristic of a prominent chain-like feedforward structure, but were instead monotonic functions of $\Delta$ (Figure 6a-c), suggesting a qualitatively different kind of dynamics where the individual units are more persistent over time. The recurrent weight matrix inside the update gate (the tanh gate), on the other hand, did display the signature of a hidden chain-like feedforward structure (Figure 6d). When we incorporated these two different structures in different gates of the LSTMs, by using a chain initializer for the update gate and a monotonically increasing recurrent weight profile for the other gates (labeled "mixed" in Table 4), the resulting initializer outperformed the other initializers on the character-level PTB and enwik8 benchmarks.
Motivated by their optimal memory properties in a simplified linear setting (Ganguli et al., 2008), in this paper we investigated the potential benefits of certain highly non-normal, chain-like RNN architectures in capturing long-term dependencies in sequential tasks. Our results clearly demonstrate an advantage for such non-normal architectures as initializers for vanilla RNNs, compared with the commonly used orthogonal initializers. We further found evidence for the induction of such chain-like feedforward structures in trained vanilla RNNs even when these RNNs are initialized with orthogonal recurrent connectivity matrices.
The benefits of these chain-like non-normal initializers do not directly carry over to more complex, gated RNN architectures such as LSTMs and GRUs. In some important practical problems such as language modeling, the gains from using these kinds of gated architectures seem to far outweigh the gains obtained from the non-normal initializers in vanilla RNNs (see Table 1). However, we also uncovered important regularities in trained LSTM weight matrices, namely that the recurrent weight profiles of the input, forget, and output gates (the sigmoid gates) in trained LSTMs display a monotonically increasing pattern, whereas the recurrent matrix inside the update gate (the tanh gate) displays a chain-like feedforward structure similar to that observed in vanilla RNNs (Figure 6). We showed that these regularities can be exploited to improve the training and/or generalization performance of gated RNN architectures by introducing them as useful inductive biases to these models.
There is a close connection between the identity initialization of RNNs (Le et al., 2015) and the widely used identity skip connections (or residual connections) in deep feedforward networks (He et al., 2016). Given the superior performance of chain-like non-normal initializers over the identity initialization demonstrated in the context of vanilla RNNs in this paper, it could be interesting to look for similar chain-like non-normal architectural motifs that could be used in deep feedforward networks in place of the identity skip connections.
References
 Arjovsky et al. (2016) Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
 Bai et al. (2018) Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.
 Bengio et al. (1994) Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw., 5:157–166, 1994.
 Chang et al. (2017) Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., Cui, X., Witbrock, M., HasegawaJohnson, M., and Huang, T. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems 30, 2017.
 Cho et al. (2014) Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, 2014.
 Clevert et al. (2016) Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR), 2016.
 Ganguli et al. (2008) Ganguli, S., Huh, D., and Sompolinsky, H. Memory traces in dynamical systems. PNAS, 105(48):18970–18975, 2008.
 Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pp. 2455–2467, 2018.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Henrici (1962) Henrici, P. Bounds for iterates, inverses, spectral variation and fields of values of non-normal matrices. Numerische Mathematik, 4:24–40, 1962.
 Hochreiter (1991) Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, Institut f. Informatik, Technische Univ. Munich, 1991.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Le et al. (2015) Le, Q., Jaitly, N., and Hinton, G. A simple way to initialize recurrent networks of rectified linear units. 2015. URL https://arxiv.org/abs/1504.00941.
 Merity et al. (2018a) Merity, S., Keskar, N. S., and Socher, R. An analysis of neural language modeling at multiple scales. arXiv:1803.08240, 2018a.
 Merity et al. (2018b) Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations (ICLR), 2018b.
 Rajan et al. (2016) Rajan, K., Harvey, C. D., and Tank, D. W. Recurrent network models of sequence generation and memory. Neuron, 90(1):128–142, 2016.
 Wisdom et al. (2016) Wisdom, S., Powers, T., Hershey, J., Roux, J. L., and Atlas, L. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems 29, 2016.