# Improved memory in recurrent neural networks with sequential non-normal dynamics

## Abstract

Training recurrent neural networks (RNNs) is a hard problem due to degeneracies in the optimization landscape, a problem also known as vanishing/exploding gradients. Short of designing new RNN architectures, previous methods for dealing with this problem usually boil down to orthogonalization of the recurrent dynamics, either at initialization or during the entire training period. The basic motivation behind these methods is that orthogonal transformations are isometries of the Euclidean space, hence they preserve (Euclidean) norms and effectively deal with vanishing/exploding gradients. However, this ignores the crucial effects of non-linearity and noise. In the presence of a non-linearity, orthogonal transformations no longer preserve norms, suggesting that alternative transformations might be better suited to non-linear networks. Moreover, in the presence of noise, norm preservation itself ceases to be the ideal objective. A more sensible objective is maximizing the signal-to-noise ratio (SNR) of the propagated signal instead. Previous work has shown that in the linear case, recurrent networks that maximize the SNR display strongly non-normal, sequential dynamics and orthogonal networks are highly suboptimal by this measure. Motivated by this finding, here we investigate the potential of non-normal RNNs, i.e. RNNs with a non-normal recurrent connectivity matrix, in sequential processing tasks. Our experimental results show that non-normal RNNs outperform their orthogonal counterparts in a diverse range of benchmarks. We also find evidence for increased non-normality and hidden chain-like feedforward motifs in trained RNNs initialized with orthogonal recurrent connectivity matrices.

## 1 Introduction

Modeling long-term dependencies with recurrent neural networks (RNNs) is a hard problem due to degeneracies inherent in the optimization landscapes of these models, a problem also known as the vanishing/exploding gradients problem (Hochreiter, 1991; Frasconi, 1994). One approach to addressing this problem has been designing new RNN architectures that are less prone to such difficulties, hence are better able to capture long-term dependencies in sequential data (Hochreiter and Schmidhuber, 1997; Cho et al., 2014; Chang et al., 2017; Bai et al., 2018). An alternative approach is to stick with the basic vanilla RNN architecture instead, but to constrain its dynamics in some way so as to eliminate or reduce the degeneracies that otherwise afflict the optimization landscape. Previous proposals belonging to this second category generally boil down to orthogonalization of the recurrent dynamics, either at initialization or during the entire training period (Le et al., 2015; Arjovsky et al., 2016; Wisdom et al., 2016). The basic idea behind these methods is that orthogonal transformations are isometries of the Euclidean space, hence they preserve distances and norms, which enables them to deal effectively with the vanishing/exploding gradients problem.

However, this idea ignores the crucial effects of non-linearity and noise. Orthogonal transformations no longer preserve distances and norms in the presence of a non-linearity, suggesting that alternative transformations might be better suited to non-linear networks (this point was noted by Pennington et al. (2017) and Chen et al. (2018) before, where isometric initializations that take the non-linearity into account were proposed). Similarly, in the presence of noise, norm preservation itself ceases to be the ideal objective. One must instead maximize the signal-to-noise ratio (SNR) of the propagated signal. In neural networks, noise comes in both through the stochasticity of the stochastic gradient descent (SGD) algorithm and sometimes also through direct noise injection for regularization purposes, as in dropout (Srivastava et al., 2014). Previous work has shown that even in a simple linear setting, recurrent networks that maximize the SNR display strongly non-normal, sequential dynamics and orthogonal networks are highly suboptimal by this measure (Ganguli et al., 2008).

Motivated by these observations, in this paper, we investigate the potential of non-normal RNNs, i.e. RNNs with a non-normal recurrent connectivity matrix, in sequential processing tasks. Recall that a normal matrix is a matrix with an orthonormal set of eigenvectors, whereas a non-normal matrix does not have an orthonormal set of eigenvectors. This property allows non-normal systems to display interesting transient behaviors that are not available in normal systems. This kind of transient behavior, specifically a particular kind of transient amplification of the signal in certain non-normal systems, underlies their superior memory properties (Ganguli et al., 2008), as will be discussed further below. Our empirical results show that non-normal vanilla RNNs significantly outperform their orthogonal counterparts in a diverse range of benchmarks.

## 2 Background

### 2.1 Memory in linear recurrent networks with noise

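Ganguli et al. (2008) studied memory properties of linear recurrent networks injected with a scalar temporal signal $s_t$ and noise $\mathbf{z}_t$:

$$\mathbf{h}_t = \mathbf{W}\mathbf{h}_{t-1} + \mathbf{v}s_t + \mathbf{z}_t \tag{1}$$

The noise is assumed to be i.i.d. with $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \epsilon\mathbf{I})$. Ganguli et al. (2008) then analyzed the Fisher memory matrix (FMM) of this system, defined as:

$$\mathbf{J}_{kl}(s_{\leq t}) \equiv \left\langle -\frac{\partial^2}{\partial s_{t-k}\,\partial s_{t-l}} \log p(\mathbf{h}_t \mid s_{\leq t}) \right\rangle_{p(\mathbf{h}_t \mid s_{\leq t})} \tag{2}$$

For linear networks with Gaussian noise, it is easy to show that $\mathbf{J}_{kl}(s_{\leq t})$ is, in fact, independent of the past signal history $s_{\leq t}$. Ganguli et al. (2008) specifically analyzed the diagonal of the FMM: $J(k) \equiv \mathbf{J}_{kk}$, which can be written explicitly as:

$$J(k) = \mathbf{v}^\top (\mathbf{W}^k)^\top \mathbf{C}_n^{-1} \mathbf{W}^k \mathbf{v} \tag{3}$$

where $\mathbf{C}_n = \epsilon \sum_{j=0}^{\infty} \mathbf{W}^j (\mathbf{W}^j)^\top$ is the noise covariance matrix, and the norm of $\mathbf{W}^k\mathbf{v}$ can be roughly thought of as representing the signal strength. The total Fisher memory is the sum of $J(k)$ over all past time steps $k$:

$$J_{\mathrm{tot}} = \sum_{k=0}^{\infty} J(k) \tag{4}$$

Intuitively, $J(k)$ measures the information contained in the current state of the system, $\mathbf{h}_t$, about a signal that entered the system $k$ time steps ago, $s_{t-k}$. $J_{\mathrm{tot}}$ is then a measure of the total information contained in the current state of the system about the entire past signal history, $s_{\leq t}$.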

The main result in Ganguli et al. (2008) shows that for all normal matrices (including all orthogonal matrices), whereas in general , where is the network size. Remarkably, the memory upper bound can be achieved by certain highly non-normal systems and several examples are explicitly given in Ganguli et al. (2008). Two of those examples are illustrated in Figure 1a (right): a uni-directional “chain” network and a chain network with feedback. In the chain network, the recurrent connectivity is given by and in the chain with feedback network, it is given by , where and are the feedforward and feedback connection weights, respectively (here denotes the Kronecker delta function). In addition, in order to achieve optimal memory, the signal must be fed at the source neuron in these networks, i.e. .

Figure 1b compares the Fisher memory curves, , of these non-normal networks with the Fisher memory curves of two example normal networks, namely recurrent networks with identity or random orthogonal connectivity matrices. The two non-normal networks have extensive memory capacity, i.e. , whereas for the normal examples, . The crucial property that enables extensive memory in non-normal networks is transient amplification: after the signal enters the network, it is amplified supralinearly for a time of length before it eventually dies out (Figure 1c). This kind of transient amplification is not possible in normal networks.
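The Fisher memory curve can be evaluated numerically for small networks. The sketch below is illustrative only and is not the authors' code: it assumes the linear dynamics $\mathbf{h}_t = \mathbf{W}\mathbf{h}_{t-1} + \mathbf{v}s_t + \mathbf{z}_t$ with isotropic Gaussian noise of variance `eps`, approximates the noise covariance by a truncated series, and uses hypothetical settings (network size 20, orthogonal scale 0.9, unit chain weight, signal injected at the first neuron).

```python
import numpy as np

def fisher_memory_curve(W, v, eps=1.0, n_lags=200, n_terms=2000):
    """Return J(k) = v^T (W^k)^T C^{-1} W^k v for k = 0, ..., n_lags - 1.

    C = eps * sum_j W^j (W^j)^T is approximated by its first n_terms terms,
    which is adequate when the powers of W decay or vanish (assumption).
    """
    n = W.shape[0]
    C = np.zeros((n, n))
    P = np.eye(n)                       # P holds W^j
    for _ in range(n_terms):
        C += eps * (P @ P.T)
        P = W @ P
    J = np.empty(n_lags)
    x = v.astype(float)                 # x holds W^k v
    for k in range(n_lags):
        J[k] = x @ np.linalg.solve(C, x)
        x = W @ x
    return J

N = 20                                  # hypothetical network size
rng = np.random.default_rng(0)
v = np.zeros(N)
v[0] = 1.0                              # unit-norm input at the first neuron

# Normal example: scaled random orthogonal matrix (scale 0.9 is an assumption).
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
J_orth = fisher_memory_curve(0.9 * Q, v)

# Non-normal example: uni-directional chain with unit feedforward weights.
W_chain = np.eye(N, k=-1)               # W[i, i-1] = 1
J_chain = fisher_memory_curve(W_chain, v)

print("total memory, orthogonal:", J_orth.sum())
print("total memory, chain:     ", J_chain.sum())
```

Under these assumptions (unit noise variance, unit-norm input) the scaled orthogonal network has a total memory of approximately 1, while the unit-weight chain gives $J(k) = 1/(k+1)$ and therefore a larger total. The feedforward weight in the chain is a free parameter here, and larger totals depend on how it is chosen.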

### 2.2 A toy non-linear example: Non-linearity and noise induce similar effects

The preceding analysis by Ganguli et al. (2008) is exact in linear networks. Analysis becomes more difficult in the presence of a non-linearity. However, we now demonstrate that the non-normal networks shown in Figure 1a have advantages that extend beyond the linear case. The advantages in the non-linear case are due to reduced interference in these non-normal networks between signals entering the network at different time points in the past.

To demonstrate this with a simple example, we will ignore the effect of noise for now and consider the effect of non-linearity on the linear decodability of past signals from the current network activity. We thus consider deterministic non-linear networks of the form (see Appendix A for additional details):

$$\mathbf{h}_t = f(\mathbf{W}\mathbf{h}_{t-1} + \mathbf{v}s_t) \tag{5}$$

where $f$ is the elementwise non-linearity, and ask how well we can linearly decode a signal that entered the network $k$ time steps ago, $s_{t-k}$, from the current activity of the network, $\mathbf{h}_t$. Figure 2c compares the decoding performance in a non-linear orthogonal network with the decoding performance in the non-linear chain network. Just as in the linear case with noise (Figure 2b), the chain network outperforms the orthogonal network.

To understand intuitively why this is the case, consider a chain network with and . In this model, the responses of the neurons after time steps (at ) are given by , , …, , respectively, starting from the source neuron. Although the non-linearity makes perfect linear decoding of the past signal impossible, one may still imagine being able to decode the past signal with reasonable accuracy as long as is not “too non-linear”. A similar intuition holds for the chain network with feedback as well, as long as the feedforward connection weight, , is sufficiently stronger than the feedback connection strength, . A condition like this must already be satisfied if the network is to maintain its optimal memory properties and also be dynamically stable at the same time (Ganguli et al., 2008).

In normal networks, however, linear decoding is further degraded by interference from signals entering the network at different time points, in addition to the degradation caused by the non-linearity. This is easiest to see in the identity network (a similar argument holds for the random orthogonal example too), where the responses of the neurons after time steps are identically given by , if one assumes . Linear decoding is harder in this case, because a signal is both distorted by multiple steps of non-linearity and also mixed with signals entering at other time points.
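The interference argument can be checked directly on a small example. The following sketch is only an illustration under stated assumptions: a deterministic update of the form `h = f(W @ h + v * s)`, the elu non-linearity, a zero initial state, unit feedforward weights in the chain, input at the first neuron for the chain, and an all-ones input vector for the identity network. The helper names `run` and `iterate` are hypothetical.

```python
import numpy as np

def elu(x):
    # elu with unit scale: identity for x > 0, exp(x) - 1 otherwise
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def run(W, v, signal, f=elu):
    """Drive the network with a scalar signal, starting from a zero state."""
    h = np.zeros(W.shape[0])
    for s in signal:
        h = f(W @ h + v * s)
    return h

def iterate(f, x, n):
    """Apply f to x n times."""
    for _ in range(n):
        x = f(x)
    return x

N = 6
rng = np.random.default_rng(0)
signal = rng.standard_normal(N)          # signal[t - 1] is the input at step t

# Chain: W[i, i-1] = 1, signal enters at the source neuron.
W_chain = np.eye(N, k=-1)
v_chain = np.zeros(N)
v_chain[0] = 1.0
h_chain = run(W_chain, v_chain, signal)

# After N steps, neuron i (0-indexed) holds f applied i + 1 times to the
# single input that entered i steps before the last one: no mixing.
expected = np.array([iterate(elu, signal[N - 1 - i], i + 1) for i in range(N)])

# Identity: every neuron receives every input, so all neurons carry the
# same nested mixture of the whole input history.
h_id = run(np.eye(N), np.ones(N), signal)

print(np.allclose(h_chain, expected))    # each chain neuron isolates one input
print(np.allclose(h_id, h_id[0]))        # identity neurons are all identical
```

In the chain, each neuron depends on exactly one past input, distorted only by repeated application of the non-linearity. In the identity network, all neurons carry the same scalar, which mixes every past input.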

## 3 Results

### 3.1 Experiments

Because assuming an a priori fixed non-normal structure for an RNN runs the risk of being too restrictive, in this paper, we instead explore the promise of non-normal networks as initializers for RNNs. Throughout the paper, we will be primarily comparing the four RNN architectures schematically depicted in Figure 1a as initializers: two of them normal networks (identity and random orthogonal) and the other two non-normal networks (chain and chain with feedback), the last two being motivated by their optimal memory properties in the linear case, as reviewed above.

#### Copy, addition, permuted sequential MNIST

Copy, addition, and permuted sequential MNIST tasks were commonly used as benchmarks in previous RNN studies (Arjovsky et al., 2016; Bai et al., 2018; Chang et al., 2017; Hochreiter and Schmidhuber, 1997; Le et al., 2015; Wisdom et al., 2016). We now briefly describe each of these tasks.

Copy task: The input is a sequence of integers of length . The first integers in the sequence define the target subsequence that is to be copied and consist of integers between and (inclusive). The next integers are set to . The integer after that is set to , which acts as the cue indicating that the model should start copying the target subsequence. The final integers are set to . The output sequence that the model is trained to reproduce consists of s followed by the target subsequence from the input that is to be copied. To make sure that the task requires a sufficiently long memory capacity, we used a large sequence length, , comparable to the largest sequence length considered in Arjovsky et al. (2016) for the same task.

Addition task: The input consists of two sequences of length . The first one is a sequence of random numbers drawn uniformly from the interval . The second sequence is an indicator sequence with s at exactly two positions and s everywhere else. The positions of the two s indicate the positions of the numbers to be added in the first sequence. The target output is the sum of the two corresponding numbers. The position of the first is drawn uniformly from the first half of the sequence and the position of the second is drawn uniformly from the second half of the sequence. Again, to ensure that the task requires a sufficiently long memory capacity, we chose , which is the same as the largest sequence length considered in Arjovsky et al. (2016) for the same task.
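The addition task described above can be generated along the following lines. This is a minimal sketch rather than the code used in the experiments: the function name `addition_batch` is hypothetical, and the sampling interval is exposed as the `low` and `high` arguments, whose defaults of 0 and 1 are an assumption. The sequence length of 750 and the batch size of 16 follow the values stated elsewhere in the text.

```python
import numpy as np

def addition_batch(batch_size, T, rng, low=0.0, high=1.0):
    """Return inputs of shape (batch, T, 2) and targets of shape (batch,)."""
    values = rng.uniform(low, high, size=(batch_size, T))
    markers = np.zeros((batch_size, T))
    # One marked position in each half of the sequence.
    first = rng.integers(0, T // 2, size=batch_size)
    second = rng.integers(T // 2, T, size=batch_size)
    rows = np.arange(batch_size)
    markers[rows, first] = 1.0
    markers[rows, second] = 1.0
    inputs = np.stack([values, markers], axis=-1)
    targets = values[rows, first] + values[rows, second]
    return inputs, targets

rng = np.random.default_rng(0)
inputs, targets = addition_batch(batch_size=16, T=750, rng=rng)
print(inputs.shape, targets.shape)
```

Each time step thus carries a pair (value, indicator), and the target is a single scalar per sequence.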

Permuted sequential MNIST (psMNIST): This is a sequential version of the standard MNIST benchmark where the pixels are fed to the model one pixel at a time. To make the task hard enough, we used the permuted version of the sequential MNIST task where a fixed random permutation is applied to the pixels to eliminate any spatial structure before they are fed into the model.

We used vanilla RNNs with recurrent units in the psMNIST task and recurrent units in the copy and addition tasks. We used the elu nonlinearity for the copy and the psMNIST tasks (Clevert et al., 2016), and the relu nonlinearity for the addition problem (because relu proved to be more natural for remembering positive numbers). Batch size was 16 in all tasks.

As mentioned above, the scaled identity and the scaled random orthogonal networks constituted the normal initializers. In the scaled identity initializer, the recurrent connectivity matrix was initialized as and the input matrix was initialized as . In the random orthogonal initializer, the recurrent connectivity matrix was initialized as , where is a random dense orthogonal matrix, and the input matrix was initialized in the same way as in the identity initializer.

The feedforward chain and the chain with feedback networks constituted our non-normal initializers. In the chain initializer, the recurrent connectivity matrix was initialized as and the input matrix was initialized as , where denotes the -dimensional identity matrix. Note that this choice of is a natural generalization of the source-injecting input vector that was found to be optimal in the linear case with scalar signals to multi-dimensional inputs (as long as ). In the chain with feedback initializer, the recurrent connectivity matrix was initialized as and the input matrix was initialized in the same way as in the chain initializer.
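The four recurrent initializers can be written down compactly. The sketch below is an illustration, not the authors' implementation: the function names and the parameter names `scale`, `alpha` (feedforward weight) and `beta` (feedback weight) are hypothetical, the numerical values are placeholders, and the input matrices are omitted.

```python
import numpy as np

def scaled_identity(n, scale):
    return scale * np.eye(n)

def scaled_orthogonal(n, scale, rng):
    # QR of a Gaussian matrix; the sign fix makes Q uniformly distributed
    # over orthogonal matrices.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    q = q * np.sign(np.diag(r))
    return scale * q

def chain(n, alpha):
    # Unit i receives input only from unit i - 1.
    return alpha * np.eye(n, k=-1)

def chain_with_feedback(n, alpha, beta):
    # Feedforward weight alpha, feedback weight beta from unit i + 1 to unit i.
    return alpha * np.eye(n, k=-1) + beta * np.eye(n, k=1)

def is_normal(W, tol=1e-10):
    # A real matrix is normal iff it commutes with its transpose.
    return np.allclose(W @ W.T, W.T @ W, atol=tol)

rng = np.random.default_rng(0)
n = 8
for name, W in [
    ("identity", scaled_identity(n, 1.0)),
    ("orthogonal", scaled_orthogonal(n, 1.0, rng)),
    ("chain", chain(n, 1.0)),
    ("chain with feedback", chain_with_feedback(n, 1.0, 0.1)),
]:
    print(name, "normal" if is_normal(W) else "non-normal")
```

The commutation check confirms the grouping used in the text: the identity and orthogonal matrices are normal, while both chain variants are non-normal whenever the feedforward and feedback weights differ.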

We used the rmsprop optimizer for all models, which we found to be the best method for this set of tasks. The learning rate of the optimizer was a hyperparameter which we tuned separately for each model and each task. The following learning rates were considered in the hyper-parameter search: . We ran each model on each task times using the integers from to as random seeds.

In addition, the following model-specific hyperparameters were searched over for each task:

Chain: feedforward connection weight, .

Chain with feedback: feedback connection weight, .

Scaled identity: scale, .

Random orthogonal: scale, .

This yields a total of different runs for each experiment in the non-normal models and a total of different runs in the normal models. Note that we ran more extensive hyper-parameter searches for the normal models than for the non-normal models in this set of tasks.

Figure 3a-c shows the validation losses for each model with the best hyper-parameter settings. The non-normal initializers generally outperform the normal initializers. Figure 3d-f shows for each model the number of “successful” runs that converged to a validation loss below a criterion level (which we set to be 50% of the loss for a baseline random model). The chain model outperformed all other models by this measure (despite having a smaller total number of runs than the normal models). In the copy task, for example, none of the runs for the normal models was able to achieve the criterion level, whereas 46 out of 462 runs for the chain model and 11 out of 462 runs for the feedback chain model reached the criterion loss (see Appendices B & C for further results and discussion).

#### Language modeling experiments

To investigate if the benefits of non-normal initializers extend to more realistic problems, we conducted experiments with three standard language modeling tasks: word-level Penn Treebank (PTB), character-level PTB, and character-level enwik8 benchmarks.

For the language modeling experiments in this subsection, we used the code base provided by Salesforce Research (Merity et al., 2018a, b): https://github.com/salesforce/awd-lstm-lm. We refer the reader to Merity et al. (2018a, b) for a more detailed description of the benchmarks. For the experiments in this subsection, we generally preserved the model setup used in Merity et al. (2018a, b), except for the following differences: 1) We replaced the gated RNN architectures (LSTMs and QRNNs) used in Merity et al. (2018a, b) with vanilla RNNs; 2) We observed that vanilla RNNs require weaker regularization than gated RNN architectures. Therefore, in the word-level PTB task, we set all dropout rates to . In the character-level PTB task, all dropout rates except `dropoute` were set to , while `dropoute` itself was set to . In the enwik8 benchmark, all dropout rates were set to ; 3) We trained the word-level PTB models for 60 epochs, the character-level PTB models for 500 epochs and the enwik8 models for 35 epochs.

We compared the same four models described in the previous subsection. As in Merity et al. (2018a), we used the Adam optimizer and thus only optimized the , , hyper-parameters for the experiments in this subsection. For the hyper-parameter in the chain model and the hyper-parameter in the scaled identity and random orthogonal models, we searched over values uniformly spaced between and (inclusive); whereas for the chain with feedback model, we set the feedforward connection weight, , to the optimal value it had in the chain model and searched over values uniformly spaced between and (inclusive). In addition, we repeated each experiment 3 times using different random seeds, yielding a total of 63 runs for each model and each benchmark.

The results are shown in Figure 4 and in Table 1. Figure 4 shows the validation loss over the course of training in units of bits per character (bpc). Table 1 reports the test losses at the end of training. The non-normal models outperform the normal models on the word-level and character-level PTB benchmarks. The differences between the models are less clear on the enwik8 benchmark. However, in terms of the test loss, the non-normal feedback chain model outperforms the other models on all three benchmarks (Table 1).

Table 1: Test losses at the end of training on the language modeling benchmarks.

| Model | PTB word | PTB char. | enwik8 |
|---|---|---|---|
| Identity | 6.550 ± 0.002 | 1.312 ± 0.000 | 1.783 ± 0.003 |
| Ortho. | 6.557 ± 0.002 | 1.312 ± 0.001 | 1.843 ± 0.046 |
| Chain | 6.514 ± 0.001 | 1.308 ± 0.000 | 1.803 ± 0.017 |
| Fb. chain | 6.510 ± 0.001 | 1.307 ± 0.000 | 1.774 ± 0.002 |
| 3-layer LSTM | 5.878 | 1.175 | 1.232 |

We note that the vanilla RNN models perform significantly worse than the gated RNN architectures considered in Merity et al. (2018a, b). We conjecture that this is because gated architectures are generally better at modeling contextual dependencies, hence they have inductive biases better suited to language modeling tasks. The primary benefit of non-normal dynamics, on the other hand, is enabling a longer memory capacity. Below, we will discuss whether non-normal dynamics can be used in gated RNN architectures to improve performance as well.

### 3.2 Hidden feedforward structures in trained RNNs

We observed that training made vanilla RNNs initialized with orthogonal recurrent connectivity matrices non-normal. We quantified the non-normality of the trained recurrent connectivity matrices using a measure introduced by Henrici (1962): $d(\mathbf{W}) \equiv \sqrt{\|\mathbf{W}\|_F^2 - \sum_i |\lambda_i|^2}$, where $\|\cdot\|_F$ denotes the Frobenius norm and $\lambda_i$ is the $i$-th eigenvalue of $\mathbf{W}$. This measure equals $0$ for all normal matrices and is positive for non-normal matrices. We found that $d(\mathbf{W})$ became positive for all successfully trained RNNs initialized with orthogonal recurrent connectivity matrices. Table 2 reports the aggregate statistics of $d(\mathbf{W})$ for orthogonally initialized RNNs trained on the toy benchmarks.

Table 2: Henrici index of trained RNNs, by task and initializer.

| Task | Identity | Orthogonal |
|---|---|---|
| Addition-750 | 2.33 ± 1.02 | 2.74 ± 0.07 |
| psMNIST | 1.01 ± 0.12 | 2.72 ± 0.08 |
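A non-normality measure of this kind is straightforward to compute from a weight matrix. The sketch below assumes the standard unnormalized form of Henrici's departure from normality, the square root of the squared Frobenius norm minus the sum of squared eigenvalue magnitudes. The function name `henrici_index` is hypothetical, and any normalization used for the reported values is not assumed here.

```python
import numpy as np

def henrici_index(W):
    """Departure from normality: sqrt(||W||_F^2 - sum_i |lambda_i|^2)."""
    fro_sq = np.sum(np.abs(W) ** 2)
    eig_sq = np.sum(np.abs(np.linalg.eigvals(W)) ** 2)
    # Clip at zero to guard against small negative values from round-off.
    return float(np.sqrt(max(fro_sq - eig_sq, 0.0)))

n = 5
print(henrici_index(np.eye(n)))          # normal matrix: approximately 0
print(henrici_index(np.eye(n, k=-1)))    # unit-weight chain: sqrt(n - 1)
```

For a unit-weight chain all eigenvalues are zero, so the index reduces to the Frobenius norm, which is why a purely feedforward structure registers as strongly non-normal.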

Although increased non-normality in trained RNNs is an interesting observation, the Henrici index, by itself, does not tell us what structural features in trained RNNs contribute to this increased non-normality. Given the benefits of chain-like feedforward non-normal structures in RNNs for improved memory, we hypothesized that training might have installed hidden chain-like feedforward structures in trained RNNs and that these feedforward structures were responsible for their increased non-normality.

To uncover these hidden feedforward structures, we performed an analysis suggested by Rajan et al. (2016). In this analysis, we first injected a unit pulse of input to the network at the beginning of the trial and let the network evolve for time steps afterwards according to its recurrent dynamics with no direct input. We then ordered the recurrent units by the time of their peak activity (using a small amount of jitter to break potential ties between units) and plotted the mean recurrent connection weights, , as a function of the order difference between two units, . Positive values correspond to connections from earlier peaking units to later peaking units, and vice versa for negative values. In trained RNNs, the mean recurrent weight profile as a function of had an asymmetric peak, with connections in the “forward” direction being, on average, stronger than those in the opposite direction. Figure 5 shows examples with orthogonally initialized RNNs trained on the addition and the permuted sequential MNIST tasks. Note that for a purely feedforward chain, the weight profile would have a single peak at and would be zero elsewhere. Although the weight profiles for trained RNNs are not this extreme, the prominent asymmetric bump with a peak at a positive value indicates a hidden chain-like feedforward structure in these networks.
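The ordering analysis can be sketched as follows. This is an illustrative reconstruction under assumptions, not the analysis code behind Figure 5: the update is taken to be `h = f(W @ h)` after a unit pulse delivered through a hypothetical input vector `v`, `W[i, j]` is read as the weight from unit `j` to unit `i`, and the order difference is defined as the rank of the receiving unit minus the rank of the sending unit, so that positive values correspond to connections from earlier to later peaking units.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def weight_profile(W, v, n_steps, f=elu, rng=None):
    """Mean recurrent weight as a function of peak-time order difference."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = W.shape[0]
    h = f(v.astype(float))               # unit pulse at the first step
    activity = [h]
    for _ in range(n_steps):             # free evolution, no further input
        h = f(W @ h)
        activity.append(h)
    activity = np.stack(activity)        # shape (n_steps + 1, n)
    # Peak time of each unit, with a small jitter to break ties.
    peak = activity.argmax(axis=0) + rng.uniform(0.0, 1e-3, size=n)
    order = np.argsort(peak)
    W_sorted = W[np.ix_(order, order)]
    diffs = np.arange(-(n - 1), n)
    # Entries W_sorted[i, i - d] connect a sender to a receiver d ranks later.
    profile = np.array([np.diagonal(W_sorted, offset=-d).mean() for d in diffs])
    return diffs, profile

n = 6
v = np.zeros(n)
v[0] = 1.0
diffs, profile = weight_profile(np.eye(n, k=-1), v, n_steps=10)
for d, w in zip(diffs, profile):
    print(d, w)
```

For the pure unit-weight chain used in this example, the profile is 1 at an order difference of +1 and 0 everywhere else, which is the extreme case that the asymmetric bump in trained networks is compared against.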

### 3.3 Do benefits of non-normal dynamics extend to gated RNN architectures?

So far, we have only considered vanilla RNNs. An important question is whether the benefits of non-normal dynamics demonstrated above for vanilla RNNs also extend to gated RNN architectures like LSTMs or GRUs (Hochreiter and Schmidhuber, 1997; Cho et al., 2014). Gated RNN architectures have better inductive biases than vanilla RNNs in many practical tasks of interest such as language modeling (e.g. see Table 1 for a comparison of vanilla RNN architectures with an LSTM architecture of similar size in the language modeling benchmarks), thus it would be practically very useful if their performance could be improved through an inductive bias for non-normal dynamics.

Table 3: Language modeling results for LSTMs with different initializers.

| Model | PTB word | PTB char. | enwik8 |
|---|---|---|---|
| Ortho. | 5.937 ± 0.002 | 1.230 ± 0.001 | 1.583 ± 0.001 |
| Chain | 5.935 ± 0.001 | 1.230 ± 0.001 | 1.586 ± 0.000 |
| Plain | 5.949 ± 0.007 | 1.245 ± 0.001 | 1.584 ± 0.002 |
| Mixed | 5.944 ± 0.004 | 1.227 ± 0.000 | 1.577 ± 0.001 |

To address this question, we treated the input, forget, output, and update gates of the LSTM architecture as analogous to vanilla RNNs and initialized the recurrent and input matrices inside these gates in the same way as in the chain or the orthogonal initialization of vanilla RNNs above. We also compared these with a more standard initialization scheme where all the weights were drawn from a uniform distribution $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k$ is the reciprocal of the hidden layer size (labeled plain in Table 3). This is the default initializer for the LSTM weight matrices in PyTorch: https://pytorch.org/docs/stable/nn.html#lstm. We compared these initializers in the language modeling benchmarks. The chain initializer did not perform better than the orthogonal initializer (Table 3), suggesting that non-normal dynamics may not be as helpful in gated RNN architectures as they are in vanilla RNNs. In hindsight, this is not too surprising, because our initial motivation for introducing non-normal dynamics heavily relied on the vanilla RNN architecture and gated RNNs can be dynamically very different from vanilla RNNs.

When we looked at the trained LSTM weight matrices more closely, we found that, although still non-normal, the recurrent weight matrices inside the input, forget, and output gates (i.e. the sigmoid gates) did not have the same signatures of hidden chain-like feedforward structures observed in vanilla RNNs. Specifically, the weight profiles in the LSTM recurrent weight matrices inside these three gates did not display the asymmetric bump characteristic of a prominent chain-like feedforward structure, but were instead approximately monotonic functions of (Figure 6a-c), suggesting a qualitatively different kind of dynamics where the individual units are more persistent over time. The recurrent weight matrix inside the update gate (the tanh gate), on the other hand, did display the signature of a hidden chain-like feedforward structure (Figure 6d). When we incorporated these two structures in different gates of the LSTMs, by using a chain initializer for the update gate and a monotonically increasing recurrent weight profile for the other gates (labeled mixed in Table 3), the resulting initializer outperformed the other initializers on character-level PTB and enwik8 tasks.

## 4 Discussion

Motivated by their optimal memory properties in a simplified linear setting (Ganguli et al., 2008), in this paper, we investigated the potential benefits of certain highly non-normal chain-like RNN architectures in capturing long-term dependencies in sequential tasks. Our results demonstrate an advantage for such non-normal architectures as initializers for vanilla RNNs, compared to the commonly used orthogonal initializers. We further found evidence for the induction of such chain-like feedforward structures in trained vanilla RNNs even when these RNNs were initialized with orthogonal recurrent connectivity matrices.

The benefits of these chain-like non-normal initializers do not directly carry over to more complex, gated RNN architectures such as LSTMs and GRUs. In some important practical problems such as language modeling, the gains from using these kinds of gated architectures seem to far outweigh the gains obtained from the non-normal initializers in vanilla RNNs (see Table 1). However, we also uncovered important regularities in trained LSTM weight matrices, namely that the recurrent weight profiles of the input, forget, and output gates (the sigmoid gates) in trained LSTMs display a monotonically increasing pattern, whereas the recurrent matrix inside the update gate (the tanh gate) displays a chain-like feedforward structure similar to that observed in vanilla RNNs (Figure 6). We showed that these regularities can be exploited to improve the training and/or generalization performance of gated RNN architectures by introducing them as useful inductive biases.

A concurrent work to ours also emphasized the importance of non-normal dynamics in RNNs (Kerg et al., 2019). The main difference between Kerg et al. (2019) and our work is that we explicitly introduce sequential motifs in RNNs at initialization as a useful inductive bias for improved long-term memory (motivated by the optimal memory properties of these motifs in simpler cases), whereas their approach does not constrain the shape of the non-normal part of the recurrent connectivity matrix, hence does not utilize sequential non-normal dynamics as an inductive bias. In some of their tasks, Kerg et al. (2019) also uncovered a feedforward, chain-like motif in trained vanilla RNNs similar to the one reported in this paper (Figure 5).

There is a close connection between the identity initialization of RNNs (Le et al., 2015) and the widely used identity skip connections (or residual connections) in deep feedforward networks (He et al., 2016). Given the superior performance of chain-like non-normal initializers over the identity initialization demonstrated in the context of vanilla RNNs in this paper, it could be interesting to look for similar chain-like non-normal architectural motifs that could be used in deep feedforward networks in place of the identity skip connections.

## Appendix A Details and extensions of the linear decoding experiments

This appendix contains the details of the linear decoding experiments in section 2.2 and reports the results of additional linear decoding experiments. The experiments in section 2.2 compare the signal propagation properties of vanilla RNNs with either random orthogonal or chain connectivity matrices. In both cases, the overall scale of the recurrent connectivity matrices is set to . The input weight vector is for the chain model and for the random orthogonal model (thus the overall scales of both the feedforward and the recurrent inputs are identical in the two models). The RNNs themselves are not trained in these experiments. At each time point, an i.i.d. random scalar signal is fed into the network as input (Equation 5). We simulate 250 trials for each model and ask how well we can linearly decode the signal at the first time step, , from the recurrent activities at time step 100, . We do this by linearly regressing on (using the 250 simulated samples) and report the value for the linear regression in Figure 2.
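The decoding procedure can be sketched in a few lines. The code below is a hedged illustration, not the experiment code: it uses the 250 trials and 100 time steps stated above, but the network size (100), the recurrent scale (1), the standard normal signal distribution, the input vectors, and the function name `decoding_r2` are all assumptions made for the example. The reported quantity is the in-sample coefficient of determination of an ordinary least-squares fit with an intercept.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def decoding_r2(W, v, n_trials=250, n_steps=100, f=elu, seed=0):
    """R^2 for linearly decoding the first input from the final state."""
    rng = np.random.default_rng(seed)
    signals = rng.standard_normal((n_trials, n_steps))   # assumed distribution
    H = np.zeros((n_trials, W.shape[0]))                 # one row per trial
    for t in range(n_steps):
        H = f(H @ W.T + signals[:, [t]] * v)
    X = np.column_stack([H, np.ones(n_trials)])          # add an intercept
    y = signals[:, 0]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

N = 100                                                  # assumed network size
rng = np.random.default_rng(1)

W_chain = np.eye(N, k=-1)
v_chain = np.zeros(N)
v_chain[0] = 1.0

Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
v_orth = np.ones(N) / np.sqrt(N)                         # unit-norm input

r2_chain = decoding_r2(W_chain, v_chain)
r2_orth = decoding_r2(Q, v_orth)
print("chain:", r2_chain, "orthogonal:", r2_orth)
```

Because the fit is evaluated on the same 250 samples it was estimated from, the resulting values are optimistic when the number of units is comparable to the number of trials. The comparison between the two networks is the quantity of interest rather than the absolute values.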

In simulations with noise (Figure 2b), an additional i.i.d. random noise term, , is added to each recurrent neuron at each time step . The standard deviation of the noise, , is set to in the experiments shown in Figure 2b. To show that the results are not sensitive to the noise scale, we ran additional experiments with lower () and higher () levels of noise (Figure 7). In both cases, the chain network still outperforms the orthogonal network. Note that these “linear + noise” experiments satisfy the conditions of the analytical theory in Ganguli et al. (2008), so these results are as expected from the theory.

As mentioned in the main text, the “non-linear + no noise” experiments reported in Figure 2c used the elu non-linearity. To show that the results are not sensitive to the choice of the non-linearity, we also ran additional experiments with tanh and relu non-linearities (Figure 8). As with the elu non-linearity, the chain network outperforms the orthogonal network with the tanh and relu non-linearities as well, suggesting that the results are not sensitive to the choice of the non-linearity.

## Appendix B The effect of the feedback strength parameter () in the chain with feedback model

In this appendix, we consider the effect of the feedback strength parameter, , for the chain with feedback model in the context of the experiments reported in section 3.1.1. We focus on the psMNIST task specifically, because this is the only task where the feedback chain model converges to a low loss solution for a sufficiently large number of hyper-parameter configurations. For the addition and copy tasks, there are not enough successful hyper-parameter configurations to draw reliable inferences about the effect of (see Figure 3d-f). Figure 9 shows the validation loss at the end of training as a function of in the psMNIST task. In this figure, we considered all networks that achieved a validation loss lower than the random baseline model (i.e. ) at the end of training (an overwhelming majority of the networks satisfied this criterion). Figure 9 shows that the final validation loss is a monotonically increasing function of in this task, suggesting that large feedback strengths are harmful for the model performance.

## Appendix C Comparison with previous models

In this appendix, we compare our results with those obtained by previous models, focusing specifically on the experiments in section 3.1.1 (because the tasks in this section are commonly used as RNN benchmarks).

uRNN: We first note that our copy and addition tasks use the largest sequence lengths considered in Arjovsky et al. (2016) for the same tasks (500 for the copy task and 750 for the addition task). Hence our results are directly comparable to those reported in Arjovsky et al. (2016) (the random baselines shown by the dashed lines in Figure 3a-b are identical to those in Arjovsky et al. (2016) for the same conditions). The unitary evolution RNN (uRNN) model proposed in Arjovsky et al. (2016) comfortably learns the copy-500 task (with 128 recurrent units), quickly reaching a near-zero loss (see their Figure 1, bottom right); however, it struggles with the addition task, barely reaching the half-baseline criterion even with 512 recurrent units (see their Figure 2, bottom right). This difference in the behavior of the uRNN model between the two tasks is predicted by Henaff et al. (2016), who show that random orthogonal and near-identity recurrent connectivity matrices have much better inductive biases in the copy and addition tasks, respectively. Because of its parametrization, uRNN behaves more like a random orthogonal RNN than a near-identity RNN.

In contrast, our non-normal RNNs, especially the chain model, comfortably clear the half-baseline criterion both in copy-500 and addition-750 tasks (with 100 recurrent units), quickly achieving very small loss values in both tasks with the optimal hyper-parameter configurations (Figure 3a-b). Note that this is despite the fact that our models use fewer recurrent units than the uRNN model in Arjovsky et al. (2016) (100 vs. 128 or 512 recurrent units).
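For reference, the random baselines and the half-baseline criterion can be sketched as follows. The formulas follow the task descriptions in Arjovsky et al. (2016) as we understand them: a memoryless strategy in the copy task with 8 symbols and 10 items to copy, and a constant prediction of 1 in the addition task. The sketch also assumes that the criterion means a loss below half of the baseline loss. The function names and defaults are illustrative and are not taken from the released code.

```python
import math

def copy_baseline(T, n_symbols=8, n_copy=10):
    """Cross-entropy of a memoryless strategy: n_copy * log(n_symbols) / (T + 2 * n_copy)."""
    return n_copy * math.log(n_symbols) / (T + 2 * n_copy)

# MSE of always predicting 1 for the sum of two uniform [0, 1] numbers (variance 1/6).
ADDITION_BASELINE = 0.167

def clears_half_baseline(loss, baseline):
    """Assumed criterion: the loss is below half of the random-baseline loss."""
    return loss < 0.5 * baseline

print(f"copy-500 baseline: {copy_baseline(500):.4f}")
print(f"addition baseline: {ADDITION_BASELINE:.3f}")
```

With these definitions, a hypothetical copy-500 loss of 0.01 would clear the criterion, whereas a hypothetical addition loss of 0.1 would not.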

nnRNN: Kerg et al. (2019) report results for the copy () and psMNIST tasks only. They have not reported training success for longer variants of the copy task (specifically for ). Kerg et al. (2019) also have not reported successful training in the addition task, whereas our non-normal RNNs showed training success both in copy-500 and addition-750 tasks (Figure 3a-b).

We conclude that our non-normal initializers for vanilla RNNs perform comparably to, or better than, the uRNN and nnRNN models in standard long-term memory benchmarks. One of the biggest strengths of our proposal compared to these previous models is its much greater simplicity. Both uRNN and nnRNN require a complete re-parametrization of the vanilla RNN model (nnRNN even requires a novel optimization method). Our method, on the other hand, proposes much simpler, easy-to-implement, plug-and-play type sequential initializers that keep the standard parametrization of RNNs intact.

critical RNN: Chen et al. (2018) note that the conditions for dynamical isometry in vanilla RNNs are identical to those in fully-connected feed-forward networks studied in Pennington et al. (2017). Pennington et al. (2017), in turn, note that dynamical isometry is not achievable exactly in networks with relu activation, but it is achievable in networks with tanh activation, where it essentially boils down to initializing the weights to small values. Pennington et al. (2017) give a specific example of a dynamically isometric tanh network (with , , and ). We set up a similar tanh RNN model, but were not able to train it successfully in the copy or addition tasks. Again, as with the nnRNN results, this shows the challenging nature of these two tasks and suggests that dynamical isometry may not be enough for successful training in these tasks. A possible reason for this is that although critical initialization takes the non-linearity into account, it still does not take the noise into account (i.e. it is not guaranteed to maximize the SNR).

### Footnotes

- Code available at: https://github.com/eminorhan/nonnormal-init

### References

- Unitary evolution recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning.
- An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271.
- Dilated recurrent neural networks. In Advances in Neural Information Processing Systems 30.
- Dynamical isometry and a mean field theory of RNNs: gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning, pp. 872–881.
- Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734.
- Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR).
- Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural. Netw. 5, pp. 157–66.
- Memory traces in dynamical systems. PNAS 105 (48), pp. 18970–18975.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Recurrent orthogonal networks and long-memory tasks. In 33rd International Conference on Machine Learning, pp. 2978–2986.
- Bounds for iterates, inverses, spectral variation and fields of values of non-normal matrices. Numerische Mathematik 4, pp. 24–40.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Untersuchungen zu dynamischen neuronalen Netzen. Ph.D. Thesis, Institut f. Informatik, Technische Univ. Munich.
- Non-normal recurrent neural network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics. arXiv:1905.12080.
- A simple way to initialize recurrent networks of rectified linear units. arXiv:1504.00941.
- An analysis of neural language modeling at multiple scales. arXiv:1803.08240.
- Regularizing and optimizing LSTM language models. In International Conference on Learning Representations (ICLR).
- Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems (NIPS), pp. 4785–4795.
- Recurrent network models of sequence generation and memory. Neuron 90 (1), pp. 128–142.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems 29.