Architectural Complexity Measures of Recurrent Neural Networks

# Architectural Complexity Measures of Recurrent Neural Networks

Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin,
Roland Memisevic, Ruslan Salakhutdinov and Yoshua Bengio
MILA, Université de Montréal, University of Toronto, Carnegie Mellon University,
Institut des Hautes Études Scientifiques, France, CIFAR
Equal contribution.

###### Abstract

In this paper, we systematically analyze the connecting architectures of recurrent neural networks (RNNs). Our main contribution is twofold: first, we present a rigorous graph-theoretic framework describing the connecting architectures of RNNs in general. Second, we propose three architecture complexity measures of RNNs: (a) the recurrent depth, which captures the RNN’s over-time nonlinear complexity, (b) the feedforward depth, which captures the local input-output nonlinearity (similar to the “depth” in feedforward neural networks (FNNs)), and (c) the recurrent skip coefficient which captures how rapidly the information propagates over time. We rigorously prove each measure’s existence and computability. Our experimental results show that RNNs might benefit from larger recurrent depth and feedforward depth. We further demonstrate that increasing recurrent skip coefficient offers performance boosts on long term dependency problems.


29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

## 1 Introduction

Recurrent neural networks (RNNs) have been shown to achieve promising results on many difficult sequential learning problems [graves2013generating, bahdanau2014neural, sutskever2014sequence, nitish_video, kiros_skipthought]. There is also much work attempting to reveal the principles behind the challenges and successes of RNNs, including optimization issues [martens2011learning, pascanu2013difficulty], gradient vanishing/exploding related problems [hochreiter1991untersuchungen, bengio1994learning], and the analysis and design of new RNN transition functional units such as LSTMs, GRUs and their variants [hochreiter1997long, greff2015lstm, cho2014learning, jozefowicz2015empirical].

This paper focuses on another important theoretical aspect of RNNs: the connecting architecture. Ever since [schmidhuber1992learning, el1996hierarchical] introduced different forms of “stacked RNNs”, researchers have taken architecture design for granted and have paid less attention to the exploration of other connecting architectures. Some examples include [raiko2012deep, graves2013generating, hermans2013training] who explored the use of skip connections; [pascanu2013construct] who pointed out the distinction of constructing a “deep” RNN from the view of the recurrent paths and the view of the input-to-hidden and hidden-to-output maps. However, they did not rigorously formalize the notion of “depth” and its implications in “deep” RNNs. Besides “deep” RNNs, there still remains a vastly unexplored field of connecting architectures. We argue that one barrier for better understanding the architectural complexity is the lack of a general definition of the connecting architecture. This forced previous researchers to mostly consider the simple cases while neglecting other possible connecting variations. Another barrier is the lack of quantitative measurements of the complexity of different RNN connecting architectures: even the concept of “depth” is not clear with current RNNs.

In this paper, we try to address these two barriers. We first introduce a general formulation of RNN connecting architectures, using a well-defined graph representation. Observing that the RNN undergoes multiple transformations not only feedforwardly (from input to output within a time step) but also recurrently (across multiple time steps), we carry out a quantitative analysis of the number of transformations in these two orthogonal directions, which results in the definitions of recurrent depth and feedforward depth. These two depths can be viewed as general extensions of the work of [pascanu2013construct]. We also explore a quantity called the recurrent skip coefficient which measures how quickly information propagates over time. This quantity is strongly related to vanishing/exploding gradient issues, and helps deal with long term dependency problems. Skip connections crossing different timescales have also been studied by [Lin-ieeetnn96, el1996hierarchical, sutskever2010temporal, koutnik2014clockwork]. Instead of specific architecture design, we focus on analyzing the graph-theoretic properties of recurrent skip coefficients, revealing the fundamental difference between the regular skip connections and the ones which truly increase the recurrent skip coefficients. We rigorously prove each measure’s existence and computability under the general framework.

We empirically evaluate models with different recurrent/feedforward depths and recurrent skip coefficients on various sequential modelling tasks. We also show that our experimental results further validate the usefulness of the proposed definitions.

## 2 General Formulations of RNN Connecting Architectures

RNNs are learning machines that recursively compute new states by applying transition functions to previous states and inputs. The connecting architecture of an RNN describes how information flows between different nodes. In this section, we formalize the concept of the connecting architecture by extending the traditional graph-based illustration to a more general definition with a finite directed multigraph and its unfolded version. Let us first define the notion of the RNN cyclic graph, which can be viewed as a cyclic graphical representation of RNNs. We attach “weights” to the edges in the cyclic graph; these weights represent time delay differences between the source and destination nodes in the unfolded graph.

###### Definition 2.1.

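Let $\mathcal{G}_c = (V_c, E_c)$ be a weighted directed multigraph (a directed multigraph is a directed graph that allows multiple directed edges connecting two nodes), in which $V_c = V_{in} \cup V_{out} \cup V_{hid}$ is a finite nonempty set of nodes and $E_c \subset V_c \times V_c \times \mathbb{Z}$ is a finite set of directed edges. Each $e = (u, v, \sigma) \in E_c$ denotes a directed weighted edge pointing from node $u$ to node $v$ with an integer weight $\sigma$. Each node $v \in V_c$ is labelled by an integer tuple $(i, p)$. Here $i \in \{0, 1, \ldots, m-1\}$ denotes the time index of the given node, where $m$ is the period number of the RNN, and $p \in S$, where $S$ is a finite set of node labels. We call the weighted directed multigraph $\mathcal{G}_c = (V_c, E_c)$ an RNN cyclic graph if (1) for every edge $e = (u, v, \sigma) \in E_c$, letting $i_u$ and $i_v$ denote the time indices of nodes $u$ and $v$, we have $\sigma = i_v - i_u + k \cdot m$ for some $k \in \mathbb{Z}$; (2) there exists at least one directed cycle (a directed cycle is a closed walk with no repetitions of edges) in $\mathcal{G}_c$; (3) for any closed walk $\omega$, the sum of all the $\sigma$ along $\omega$ is not zero.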

Condition (1) assures that we can get a periodic graph (repeating pattern) when unfolding the RNN through time. Condition (2) excludes feedforward neural networks from the definition by forcing $\mathcal{G}_c$ to have at least one cycle. Condition (3) simply avoids cycles after unfolding. The cyclic representation can be seen as a time-folded representation of RNNs, as shown in Figure 1(a). Given an RNN cyclic graph $\mathcal{G}_c$, we unfold $\mathcal{G}_c$ over time by the following procedure:

###### Definition 2.2 (Unfolding).

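Given an RNN cyclic graph $\mathcal{G}_c = (V_c, E_c)$, we define a new infinite set of nodes $V_{un} = \{(i + km, p) \mid (i, p) \in V_c, k \in \mathbb{Z}\}$. The new set of edges $E_{un} \subset V_{un} \times V_{un}$ is constructed as follows: $((t, p), (t', p')) \in E_{un}$ if and only if there is an edge $e = ((i, p), (i', p'), \sigma) \in E_c$ such that $t' - t = \sigma$ and $t \equiv i \pmod{m}$. The new directed graph $\mathcal{G}_{un} = (V_{un}, E_{un})$ is called the unfolding of $\mathcal{G}_c$. Any infinite directed graph that can be constructed from an RNN cyclic graph through unfolding is called an RNN unfolded graph.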

###### Lemma 2.1.

The unfolding of any RNN cyclic graph is a directed acyclic graph (DAG).
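The unfolding procedure and the lemma can be illustrated on a finite time window. The following is a minimal sketch, not the authors' implementation: the function names `unfold` and `is_acyclic`, the edge encoding `((i, p), (i2, p2), sigma)`, and the example 2-layer stacked graph are all assumptions made for illustration.

```python
from collections import defaultdict

def unfold(cyclic_edges, m, t_min, t_max):
    """Unfold a cyclic graph over the finite window t_min..t_max.

    cyclic_edges: list of ((i, p), (i2, p2), sigma) with 0 <= i, i2 < m.
    Returns unfolded edges ((t, p), (t + sigma, p2)) for every t with t mod m == i.
    """
    unfolded = []
    for (i, p), (i2, p2), sigma in cyclic_edges:
        # Condition (1) of the cyclic-graph definition: sigma = i2 - i + k*m.
        assert (sigma - (i2 - i)) % m == 0
        for t in range(t_min, t_max + 1):
            if t % m == i and t_min <= t + sigma <= t_max:
                unfolded.append(((t, p), (t + sigma, p2)))
    return unfolded

def is_acyclic(edges):
    """Kahn's algorithm: True if the directed graph contains no cycle."""
    indegree = defaultdict(int)
    successors = defaultdict(list)
    nodes = set()
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    ready = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return visited == len(nodes)

# Hypothetical 2-layer stacked RNN with period m = 1: weight-0 feedforward
# edges plus a weight-1 self-loop on each hidden node.
stacked = [
    ((0, "x"), (0, "h1"), 0),
    ((0, "h1"), (0, "h2"), 0),
    ((0, "h2"), (0, "y"), 0),
    ((0, "h1"), (0, "h1"), 1),
    ((0, "h2"), (0, "h2"), 1),
]
edges = unfold(stacked, m=1, t_min=0, t_max=3)
window_is_dag = is_acyclic(edges)
```

Because every closed walk in the cyclic graph has a nonzero weight sum, no walk in the unfolded window can return to its starting time index, which is what the acyclicity check confirms for this example.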

Figure 1(a) shows an example of the two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$ of a given RNN. Consider the edge from node going to node in $\mathcal{G}_c$. The fact that it has weight 1 indicates that the corresponding edge in $\mathcal{G}_{un}$ travels one time step, . Note that node also has a loop with weight 2. This loop corresponds to the edge . The two kinds of graph representations presented above have a one-to-one correspondence. Also, any graph structure on $\mathcal{G}_{un}$ is naturally mapped into a graph structure on $\mathcal{G}_c$. Given an edge tuple $\bar{e} = (u, v, \sigma)$ in $\mathcal{G}_c$, $\sigma$ stands for the number of time steps crossed by $\bar{e}$’s covering edges in $E_{un}$, i.e., every corresponding edge $e \in \mathcal{G}_{un}$ must go from some time index $t$ to $t + \sigma$. Hence $\sigma$ corresponds to the “time delay” associated with $e$. In addition, the period number $m$ in Definition 2.1 can be interpreted as the time length of the entire non-repeated recurrent structure in the unfolded RNN graph $\mathcal{G}_{un}$. In other words, shifting $\mathcal{G}_{un}$ through time by $km$ time steps results in a DAG that is identical to $\mathcal{G}_{un}$, and $m$ is the smallest number that has this property for $\mathcal{G}_{un}$. Most traditional RNNs have $m = 1$, while some special structures like hierarchical or clockwork RNNs [el1996hierarchical, koutnik2014clockwork] have $m > 1$. For example, Figure 1(a) shows that the period number of this specific RNN is 2.

The connecting architecture describes how information flows among RNN units. Assume $\bar{v} \in V_c$ is a node in $\mathcal{G}_c$, and let $In(\bar{v})$ denote the set of incoming nodes of $\bar{v}$, $In(\bar{v}) = \{\bar{u} \mid (\bar{u}, \bar{v}) \in E_c\}$. In the forward pass of the RNN, the transition function $F_{\bar{v}}$ takes the outputs of the nodes $In(\bar{v})$ as inputs and computes a new output. For example, vanilla RNN units with different activation functions, LSTMs and GRUs can all be viewed as units with specific transition functions. We now give the general definition of an RNN:

###### Definition 2.3.

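An RNN is a tuple $(\mathcal{G}_c, \mathcal{G}_{un}, \{F_{\bar{v}}\}_{\bar{v} \in V_c})$, in which $\mathcal{G}_{un} = (V_{un}, E_{un})$ is the unfolding of the RNN cyclic graph $\mathcal{G}_c$, and $\{F_{\bar{v}}\}_{\bar{v} \in V_c}$ is the set of transition functions. In the forward pass, for each hidden and output node $v \in V_{un}$, the transition function $F_{\bar{v}}$ takes all incoming nodes of $v$ as the input to compute the output.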

An RNN is homogeneous if all the hidden nodes share the same form of the transition function.

## 3 Measures of Architectural Complexity

In this section, we develop different measures of RNNs’ architectural complexity, focusing mostly on the graph-theoretic properties of RNNs. To analyze an RNN solely from its architectural aspect, we make the mild assumption that the RNN is homogeneous. We further assume the RNN to be unidirectional. For a bidirectional RNN, it is more natural to measure the complexities of its unidirectional components.

### 3.1 Recurrent Depth

Unlike feedforward models where computations are done within one time frame, RNNs map inputs to outputs over multiple time steps. In some sense, an RNN undergoes transformations along both feedforward and recurrent dimensions. This fact suggests that we should investigate its architectural complexity from these two different perspectives. We first consider the recurrent perspective.

The conventional definition of depth is the maximum number of nonlinear transformations from inputs to outputs. Observe that a directed path in an unfolded graph representation $\mathcal{G}_{un}$ corresponds to a sequence of nonlinear transformations. Given an unfolded RNN graph $\mathcal{G}_{un}$, for all $i, n \in \mathbb{Z}$, let $D_i(n)$ be the length of the longest path from any node at starting time $i$ to any node at time $i + n$. From the recurrent perspective, it is natural to investigate how $D_i(n)$ changes over time. Generally speaking, $D_i(n)$ increases as $n$ increases for all $i$. Such increase is caused by the recurrent structure of the RNN, which keeps adding new nonlinearities over time. Since $D_i(n)$ approaches $\infty$ as $n$ approaches $\infty$ (without loss of generality, we assume the unidirectional RNN approaches positive infinity), to measure the complexity of $D_i(n)$ we consider its asymptotic behaviour, i.e., the limit of $\frac{D_i(n)}{n}$ as $n \to \infty$. Under a mild assumption, this limit exists. The following theorem proves that this limit is computable and well defined:

###### Theorem 3.2 (Recurrent Depth).

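Given an RNN and its two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$, we denote by $C(\mathcal{G}_c)$ the set of directed cycles in $\mathcal{G}_c$. For $\vartheta \in C(\mathcal{G}_c)$, let $l(\vartheta)$ denote the length of $\vartheta$ and $\sigma_s(\vartheta)$ denote the sum of edge weights $\sigma$ along $\vartheta$. Under a mild assumption (see a full treatment of the limit in general cases in Theorem A.1 and Proposition A.1.1 in the Appendix),

$$d_r = \lim_{n \to +\infty} \frac{D_i(n)}{n} = \max_{\vartheta \in C(\mathcal{G}_c)} \frac{l(\vartheta)}{\sigma_s(\vartheta)}. \tag{1}$$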

More intuitively, $d_r$ is a measure of the average maximum number of nonlinear transformations per time step as $n$ gets large. Thus, we call it recurrent depth:

###### Definition 3.1 (Recurrent Depth).

Given an RNN and its two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$, we call $d_r$, defined in Eq.(1), the recurrent depth of the RNN.

In Figure 1(a), one can easily verify that , , , Thus , , , ., which eventually converges to as $n \to \infty$. As $n$ increases, most parts of the longest path coincide with the path colored in red. As a result, $d_r$ coincides with the number of nodes the red path goes through per time step. Similarly in $\mathcal{G}_c$, observe that the red cycle achieves the maximum () in Eq.(1). Usually, one can directly calculate $d_r$ from $\mathcal{G}_c$. It is easy to verify that simple RNNs and stacked RNNs share the same recurrent depth, which is equal to 1. This reveals that their nonlinearities increase at the same rate, which suggests that they will behave similarly in the long run. This fact is often neglected, since one would typically consider the number of layers as a measure of depth, and think of stacked RNNs as “deep” and simple RNNs as “shallow”, even though their discrepancies are not due to recurrent depth (which regards time) but due to feedforward depth, defined next.
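The cycle formula in Eq.(1) can be evaluated directly on a small cyclic graph. The sketch below is illustrative only: the helper names, the edge encoding `(u, v, sigma)`, and the three example graphs are assumptions, and it enumerates simple cycles only, which suffices when every cycle has a positive weight sum (a closed walk decomposes into simple cycles, and the ratio of the sums cannot exceed the largest individual ratio).

```python
from fractions import Fraction

def cycle_stats(edges):
    """Return (length, weight_sum) for every simple directed cycle.

    edges: list of (u, v, sigma); parallel edges and self-loops are allowed.
    Each cycle is found once, starting from its smallest node.
    """
    out = {}
    for u, v, sigma in edges:
        out.setdefault(u, []).append((v, sigma))
    found = []

    def dfs(root, node, length, weight, seen):
        for nxt, sigma in out.get(node, []):
            if nxt == root:
                found.append((length + 1, weight + sigma))
            elif nxt > root and nxt not in seen:
                dfs(root, nxt, length + 1, weight + sigma, seen | {nxt})

    for root in sorted(out):
        dfs(root, root, 0, 0, {root})
    return found

def recurrent_depth(edges):
    """d_r = max over cycles of l(cycle) / sigma_s(cycle), as in Eq.(1)."""
    return max(Fraction(length, weight) for length, weight in cycle_stats(edges))

# Hypothetical cyclic graphs with period 1 (node names are illustrative).
shallow = [("x", "h", 0), ("h", "y", 0), ("h", "h", 1)]
stacked = [("x", "h1", 0), ("h1", "h2", 0), ("h2", "y", 0),
           ("h1", "h1", 1), ("h2", "h2", 1)]
# Extra connection from the second layer at time t to the first layer at t + 1.
top_down = stacked + [("h2", "h1", 1)]

depths = [recurrent_depth(g) for g in (shallow, stacked, top_down)]
```

In this sketch the shallow and stacked graphs both give a recurrent depth of 1, while the extra top-down edge creates a cycle of length 2 with weight sum 1.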

### 3.2 Feedforward Depth

Recurrent depth does not fully characterize the nature of nonlinearity of an RNN. As previous work suggests [sutskever2014sequence], stacked RNNs do outperform shallow ones with the same hidden size on problems where a more immediate input and output process is modeled. This is not surprising, since the growth rate of $D_i(n)$ only captures the number of nonlinear transformations in the time direction, not in the feedforward direction. The perspective of feedforward computation puts more emphasis on the specific paths connecting inputs to outputs. Given an RNN unfolded graph $\mathcal{G}_{un}$, let $D^*_i(n)$ be the length of the longest path from any input node at time step $i$ to any output node at time step $i + n$. Clearly, when $n$ is small, the recurrent depth cannot serve as a good description for $D^*_i(n)$. In fact, it heavily depends on another quantity which we call feedforward depth. The following proposition guarantees the existence of such a quantity and demonstrates the role of both measures in quantifying the nonlinearity of an RNN.

###### Proposition 3.3.1 (Input-Output Length Least Upper Bound).

Given an RNN with recurrent depth $d_r$, we denote $d_f = \sup_{i, n \in \mathbb{Z}} D^*_i(n) - n \cdot d_r$. The supremum $d_f$ exists, and thus we have the following upper bound for $D^*_i(n)$:

$$D^*_i(n) \le n \cdot d_r + d_f.$$

The above upper bound explicitly shows the interplay between recurrent depth and feedforward depth: when $n$ is small, $D^*_i(n)$ is largely bounded by $d_f$; when $n$ is large, $d_r$ captures the nature of the bound ($\approx n \cdot d_r$). These two measures are equally important, as they separately capture the maximum number of nonlinear transformations of an RNN in the long run and in the short run.

###### Definition 3.2.

(Feedforward Depth) Given an RNN with recurrent depth $d_r$ and its two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$, we call $d_f$, defined in Proposition 3.3.1, the feedforward depth of the RNN. (Conventionally, an architecture with depth 1 is a three-layer architecture containing one hidden layer. But in our definition, since it goes through two transformations, we count the depth as 2 instead of 1. This should be particularly noted with the concept of feedforward depth, which can be thought of as the conventional depth plus 1.)

The following theorem proves the computability of $d_f$:

###### Theorem 3.4 (Feedforward Depth).

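Given an RNN and its two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$, we denote by $\xi(\mathcal{G}_c)$ the set of directed paths that start at an input node and end at an output node in $\mathcal{G}_c$. For $\gamma \in \xi(\mathcal{G}_c)$, denote by $l(\gamma)$ the length and by $\sigma_s(\gamma)$ the sum of $\sigma$ along $\gamma$. Then we have:

$$d_f = \sup_{i, n \in \mathbb{Z}} D^*_i(n) - n \cdot d_r = \max_{\gamma \in \xi(\mathcal{G}_c)} l(\gamma) - \sigma_s(\gamma) \cdot d_r,$$

where $m$ is the period number and $d_r$ is the recurrent depth of the RNN.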

For example, in Figure 1(a), one can easily verify that . Most commonly, is the same as , i.e., the maximum length from an input to its current output.
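The path formula for the feedforward depth can be sketched in the same style. This is an illustrative sketch under stated assumptions: the function name, the edge encoding `(u, v, sigma)` and the example graphs are hypothetical, the recurrent depth is passed in as a known value, and only simple paths are enumerated (adding a cycle to a path changes the objective by $l - \sigma_s \cdot d_r \le 0$ for that cycle, so it cannot increase the maximum).

```python
from fractions import Fraction

def feedforward_depth(edges, inputs, outputs, d_r):
    """max over simple input-to-output paths of l(path) - sigma_s(path) * d_r."""
    out = {}
    for u, v, sigma in edges:
        out.setdefault(u, []).append((v, sigma))
    best = None

    def dfs(node, length, weight, seen):
        nonlocal best
        if node in outputs and length > 0:
            value = length - weight * Fraction(d_r)
            best = value if best is None else max(best, value)
        for nxt, sigma in out.get(node, []):
            if nxt not in seen:
                dfs(nxt, length + 1, weight + sigma, seen | {nxt})

    for source in inputs:
        dfs(source, 0, 0, {source})
    return best

# Hypothetical cyclic graphs with period 1; both have recurrent depth 1.
shallow = [("x", "h", 0), ("h", "y", 0), ("h", "h", 1)]
stacked = [("x", "h1", 0), ("h1", "h2", 0), ("h2", "y", 0),
           ("h1", "h1", 1), ("h2", "h2", 1)]

d_f_shallow = feedforward_depth(shallow, {"x"}, {"y"}, d_r=1)
d_f_stacked = feedforward_depth(stacked, {"x"}, {"y"}, d_r=1)
```

Under the counting convention of Definition 3.2, the one-hidden-layer graph goes through two transformations and the two-layer stacked graph through three.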

### 3.3 Recurrent Skip Coefficient

Depth provides a measure of the complexity of the model. But such a measure is not sufficient to characterize behavior on long-term dependency tasks. In particular, since models with large recurrent depths have more nonlinearities through time, gradients can explode or vanish more easily. On the other hand, it is known that adding skip connections across multiple time steps may help improve the performance on long-term dependency problems [Lin-ieeetnn96, sutskever2010temporal]. To measure such a “skipping” effect, we should instead pay attention to the length of the shortest path from time $i$ to time $i + n$. In $\mathcal{G}_{un}$, for all $i, n \in \mathbb{Z}$, let $d_i(n)$ be the length of the shortest path. Similar to the recurrent depth, we consider the growth rate of $d_i(n)$.

###### Theorem 3.6 (Recurrent Skip Coefficient).

Given an RNN and its two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$, under mild assumptions (see Proposition A.3.1 in the Appendix),

$$j = \lim_{n \to +\infty} \frac{d_i(n)}{n} = \min_{\vartheta \in C(\mathcal{G}_c)} \frac{l(\vartheta)}{\sigma_s(\vartheta)}. \tag{2}$$

Since it is often the case that $j$ is smaller than or equal to 1, it is more intuitive to consider its reciprocal.

###### Definition 3.3.

(Recurrent Skip Coefficient) Given an RNN and corresponding $\mathcal{G}_{un}$ and $\mathcal{G}_c$, we define $s = \frac{1}{j}$, whose reciprocal is defined in Eq.(2), as the recurrent skip coefficient of the RNN. (This definition is very similar to the definition of the recurrent depth; we therefore refer readers to the examples in Figure 1 for some illustrations.)

With a larger recurrent skip coefficient, the number of transformations per time step is smaller. As a result, the nodes in the RNN are more capable of “skipping” across the network, allowing unimpeded information flow across multiple time steps, thus alleviating the problem of learning long term dependencies. In particular, such effect is more prominent in the long run, due to the network’s recurrent structure. Also note that not all types of skip connections can increase the recurrent skip coefficient. We will consider specific examples in our experimental results section.
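The distinction between skip connections that do and do not raise the recurrent skip coefficient can be made concrete with the cycle formula of Eq.(2). The following is a hedged sketch: the helper names, the edge encoding `(u, v, sigma)`, the skip length 5 and the example graphs are assumptions chosen for illustration, and only simple cycles with positive weight sums are considered.

```python
from fractions import Fraction

def cycle_stats(edges):
    """Return (length, weight_sum) for every simple directed cycle."""
    out = {}
    for u, v, sigma in edges:
        out.setdefault(u, []).append((v, sigma))
    found = []

    def dfs(root, node, length, weight, seen):
        for nxt, sigma in out.get(node, []):
            if nxt == root:
                found.append((length + 1, weight + sigma))
            elif nxt > root and nxt not in seen:
                dfs(root, nxt, length + 1, weight + sigma, seen | {nxt})

    for root in sorted(out):
        dfs(root, root, 0, 0, {root})
    return found

def recurrent_skip_coefficient(edges):
    """Reciprocal of j = min over cycles of l(cycle) / sigma_s(cycle), Eq.(2)."""
    j = min(Fraction(length, weight) for length, weight in cycle_stats(edges))
    return 1 / j

# Hypothetical graphs (period 1). Only recurrent edges matter for the cycles.
baseline = [("h", "h", 1)]
# A skip connection from h at time t to h at time t + 5 closes a new cycle.
recurrent_skip = baseline + [("h", "h", 5)]
# A skip from layer 1 at time t to layer 2 at time t + 5 closes no new cycle.
stacked_skip = [("h1", "h1", 1), ("h2", "h2", 1),
                ("h1", "h2", 0), ("h1", "h2", 5)]

coefficients = [recurrent_skip_coefficient(g)
                for g in (baseline, recurrent_skip, stacked_skip)]
```

In this sketch only the second graph reaches a coefficient of 5; the third graph has a connection spanning the same time length but keeps a coefficient of 1, because the extra edge lies on no cycle.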

## 4 Experiments and Results

In this section we conduct a series of experiments to investigate the following questions: (1) Is recurrent depth a trivial measure? (2) Can increasing depth yield performance improvements? (3) Can increasing the recurrent skip coefficient improve the performance on long term dependency tasks? (4) Does the recurrent skip coefficient suggest something more compared to simply adding skip connections? We show our evaluations on both RNNs and LSTMs.

### 4.1 Tasks and Training Settings

PennTreebank dataset: We evaluate our models on character level language modelling using the PennTreebank dataset [marcus1993building]. It contains 5059k characters for training, 396k for validation and 446k for test, and has an alphabet size of 50. We set each training sequence to have a length of 50. Quality of fit is evaluated by the bits-per-character (BPC) metric, which is $\log_2$ of perplexity.
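Since BPC is the base-2 logarithm of the per-character perplexity, it can be obtained from a mean negative log-likelihood measured in nats by a change of base. The function names below are hypothetical and the snippet is only a small illustration of the conversion.

```python
import math

def bpc_from_perplexity(perplexity):
    """Bits per character is the base-2 logarithm of per-character perplexity."""
    return math.log2(perplexity)

def bpc_from_nats(mean_nll_nats):
    """Convert a mean per-character negative log-likelihood in nats to bits."""
    return mean_nll_nats / math.log(2)

# A uniform guess over a 50-character alphabet has perplexity 50.
uniform_bpc = bpc_from_perplexity(50)
same_bpc = bpc_from_nats(math.log(50))
```

The uniform-guess value gives a simple upper reference point for an alphabet of size 50; any trained model should score well below it.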

text8 dataset: Another dataset used for character level language modelling is the text8 dataset, which contains characters from Wikipedia with an alphabet size of 27. We follow the setting from [mikolov2012subword] and each training sequence has length of 180.

adding problem: The adding problem (and the following copying memory problem) was introduced in [hochreiter1997long]. For the adding problem, each input has two sequences with length of where the first sequence are numbers sampled from uniform[0, 1] and the second sequence are all zeros except two elements which indicates the position of the two elements in the first sequence that should be summed together. The output is the sum. We follow the most recent results and experimental settings in [arjovsky2015unitary] (same for copying memory).
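A single training example for the adding problem, as described above, could be generated roughly as follows. This is a sketch under assumptions: the function name and the sequence-length parameter `T` are hypothetical, and the two marked positions are drawn anywhere in the sequence, which may differ from the exact sampling scheme of the cited setup.

```python
import random

def adding_example(T, rng):
    """Return (inputs, target) for one adding-problem sequence of length T.

    inputs is a list of (value, marker) pairs: value is uniform in [0, 1],
    marker is 1.0 at exactly two positions and 0.0 elsewhere.
    target is the sum of the two marked values.
    """
    values = [rng.uniform(0.0, 1.0) for _ in range(T)]
    first, second = rng.sample(range(T), 2)  # two distinct positions
    markers = [0.0] * T
    markers[first] = markers[second] = 1.0
    target = values[first] + values[second]
    return list(zip(values, markers)), target

inputs, target = adding_example(100, random.Random(0))
```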

copying memory problem: Each input sequence has length of , where the first values are random integers between to . The model should remember them after steps. The rest of the sequence are all zeros, except for the last 11 entries in the sequence, which starts with as a marker indicating that the model should begin to output its memorized values. The model is expected to give zero outputs at every time step except the last 10 entries, where it should generate (copy) the values in the same order as it has seen at the beginning of the sequence. The goal is to minimize the average cross entropy of category predictions at each time step.
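The layout of a copying memory example can be sketched as below. Several specifics are assumptions rather than values taken from the text: the function name, the delay parameter `T`, the symbol range 1 to 8, the blank symbol 0 and the marker symbol 9 follow a common formulation of this task; the 10 memorized values and the 11-entry tail follow the description above.

```python
import random

def copying_example(T, rng, n_values=10, n_symbols=8):
    """Return (inputs, targets) for one copying-memory sequence.

    Assumed layout: n_values random symbols in 1..n_symbols, then T - 1 blanks,
    then one marker, then n_values blanks. Targets are blank everywhere except
    the last n_values steps, which repeat the initial symbols in order.
    """
    blank, marker = 0, n_symbols + 1
    values = [rng.randint(1, n_symbols) for _ in range(n_values)]
    inputs = values + [blank] * (T - 1) + [marker] + [blank] * n_values
    targets = [blank] * (T + n_values) + values
    return inputs, targets

inputs, targets = copying_example(100, random.Random(0))
```

With this layout the model has to carry the initial symbols across roughly `T` steps before the marker tells it to start reproducing them.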

sequential MNIST dataset: Each MNIST image data is reshaped into a sequence, turning the digit classification task into a sequence classification one with long-term dependencies [le2015simple, arjovsky2015unitary]. A slight modification of the dataset is to permute the image sequences by a fixed random order beforehand (permuted MNIST). Results in [le2015simple] have shown that both tanh RNNs and LSTMs did not achieve satisfying performance, which also highlights the difficulty of this task.

For all of our experiments we use Adam [kingma2014adam] for optimization, and conduct a grid search on the learning rate in . For RNNs, the parameters are initialized with samples from a uniform distribution. For LSTM networks we adopt a similar initialization scheme, while the forget gate biases are chosen by the grid search on . We employ early stopping and the batch size was set to .

### 4.2 Recurrent Depth is Non-trivial

To investigate the first question, we compare 4 similar connecting architectures: 1-layer (shallow) “sh”, 2-layers stacked “st”, 2-layers stacked with an extra bottom-up connection “bu”, and 2-layers stacked with an extra top-down connection “td”, as shown in Figure 2(a), left panel. Although the four architectures look quite similar, they have different recurrent depths: sh, st and bu have $d_r = 1$, while td has $d_r = 2$. Note that the specific construction of the extra nonlinear transformations in td is not conventional. Instead of simply adding intermediate layers in the hidden-to-hidden connection, as reported in [pascanu2013construct], more nonlinearities are gained by a recurrent flow from the first layer to the second layer and then back to the first layer at each time step (see the red path in Figure 2(a), left panel).

We first evaluate our architectures using RNN on PennTreebank, where sh has hidden-layer size of . Next, we evaluate four different models for text8 which are RNN-small, RNN-large, LSTM-small, LSTM large, where the model’s sh architecture has hidden-layer size of 512, 2048, 512, 1024 respectively. Given the architecture of the sh model, we set the remaining three architectures to have the same number of parameters. Table 1, left panel, shows that the td architecture outperforms all the other architectures for all the different models. Specifically, td in RNN achieves a test BPC of 1.49 on PennTreebank, which is comparable to the BPC of 1.48 reported in [krueger2015regularizing] using stabilization techniques. Similar improvements are shown for LSTMs, where td architecture in LSTM-large achieves BPC of 1.49 on text8, outperforming the BPC of 1.54 reported in [mikolov2012subword] with Multiplicative RNN (MRNN). It is also interesting to note the improvement we obtain when switching from bu to td. The only difference between these two architectures lies in changing the direction of one connection (see Figure 2(a)), which also increases the recurrent depth. Such a fundamental difference is by no means self-evident, but this result highlights the necessity of the concept of recurrent depth.

### 4.3 Comparing Depths

From the previous experiment, we found some evidence that with larger recurrent depth, the performance might improve. To further investigate various implications of depths, we carry out a systematic analysis for both recurrent depth and feedforward depth on text8 and sequential MNIST datasets. We build models in total with and , respectively (as shown in Figure 2(b)). We ensure that all the models have roughly the same number of parameters (e.g., the model with and has a hidden-layer size of ).

Table 1, right panel, displays results on the text8 dataset. We observed that when fixing feedforward depth (or fixing recurrent depth ), increasing recurrent depth from to (or increasing feedforward depth from to ) does improve the model performance. The best test BPC is achieved by the architecture with . This suggests that reasonably increasing and can aid in better capturing the over-time nonlinearity of the input sequence. However, for too large (or ) like or , increasing (or ) only hurts models performance. This can potentially be attributed to the optimization issues when modelling large input-to-output dependencies (see Appendix B.4 for more details). With sequential MNIST dataset, we next examined the effects of and when modelling long term dependencies (more in Appendix B.4). In particular, we observed that increasing does not bring any improvement to the model performance, and increasing might even be detrimental for training. Indeed, it appears that only captures the local nonlinearity and has less effect on the long term prediction. This result seems to contradict previous claims [hermans2013training] that stacked RNNs (, ) could capture information in different time scales and would thus be more capable of dealing with learning long-term dependencies. On the other hand, a large indicates multiple transformations per time step, resulting in greater gradient vanishing/exploding issues [pascanu2013construct], which suggests that should be neither too small nor too large.

### 4.4 Recurrent Skip Coefficients

To investigate whether increasing the recurrent skip coefficient $s$ improves model performance on long term dependency tasks, we compare models with increasing $s$ on the adding problem, the copying memory problem and the sequential MNIST problem (without/with permutation, denoted as MNIST and pMNIST). Our baseline model is the shallow architecture proposed in [le2015simple]. To increase the recurrent skip coefficient $s$, we add connections from time step $t$ to time step $t + k$ for some fixed integer $k$, shown in Figure 2(a), right panel. By using this specific construction, the recurrent skip coefficient increases from 1 (i.e., baseline) to $k$, and the new model with the extra connection has 2 hidden matrices (one from $t$ to $t + 1$ and the other from $t$ to $t + k$).
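The construction with two hidden matrices can be sketched as a plain recurrence. The code below is an assumption-laden illustration, not the authors' model: the names `W1`, `Wk`, `U` and `k`, the tanh nonlinearity, the zero initial states and the pure-Python matrix product are all choices made for this example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def run_skip_rnn(xs, W1, Wk, U, k):
    """h_t = tanh(W1 h_{t-1} + Wk h_{t-k} + U x_t), with zero states before t = 0."""
    zero = [0.0] * len(W1)
    hs = []
    for t, x in enumerate(xs):
        h_prev = hs[t - 1] if t >= 1 else zero
        h_skip = hs[t - k] if t >= k else zero
        pre = [a + b + c for a, b, c in
               zip(matvec(W1, h_prev), matvec(Wk, h_skip), matvec(U, x))]
        hs.append([math.tanh(p) for p in pre])
    return hs

# One hidden unit, the one-step matrix switched off, and an input only at t = 0:
# the signal then travels exclusively along the k = 2 skip connections.
xs = [[1.0]] + [[0.0]] * 4
hs = run_skip_rnn(xs, W1=[[0.0]], Wk=[[1.0]], U=[[1.0]], k=2)
```

In this toy setting the input at time 0 reaches time 4 after only two further nonlinear transformations, instead of the four that the one-step path would require.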

For the adding problem, we follow the same setting as in [arjovsky2015unitary]. We evaluate the baseline LSTM with 128 hidden units and an LSTM with and 90 hidden units (roughly the same number of parameters as the baseline). The results are quite encouraging: as suggested in [arjovsky2015unitary] baseline LSTM works well for input sequence lengths but fails when . On the other hand, we observe that the LSTM with learns perfectly when , and even if we increase to 1000, LSTM with still works well and the loss reaches to zero.

For the copying memory problem, we use a single layer RNN with 724 hidden units as our basic model, and 512 hidden units with skip connections. So they have roughly the same number of parameters. Models with a higher recurrent skip coefficient outperform those without skip connections by a large margin. When , test set cross entropy (CE) of a basic model only yields 0.2409, but with it is able to reach a test set cross entropy of 0.0975. When , a model with yields a test set CE of 0.1328, while its baseline could only reach 0.2025. We varied the sequence length () and recurrent skip coefficient () in a wide range (where varies from 100 up to 300, and from 10 up to 50), and found that this kind of improvement persists.

For the sequential MNIST problem, the hidden-layer size of the baseline model is set to and models with have hidden-layer sizes of .

The results in Table 2, top-left panel, show that RNNs with recurrent skip coefficient larger than could improve the model performance dramatically. Within a reasonable range of , test accuracy increases quickly as becomes larger. We note that our model is the first RNN model that achieves good performance on this task, even improving upon the method proposed in [le2015simple]. In addition, we also formally compare with the previous results reported in [le2015simple, arjovsky2015unitary], where our model (referred to as s) has a hidden-layer size of , which is about the same number of parameters as in the model of [arjovsky2015unitary]. Table 2, bottom-left panel, shows that our simple architecture improves upon the RNN by on MNIST, and achieves almost the same performance as LSTM on the MNIST dataset with only number of parameters  [arjovsky2015unitary]. Note that obtaining good performance on sequential MNIST requires a larger than that for MNIST (see Appendix B.4 for more details). LSTMs also showed performance boost and much faster convergence speed when using larger , as displayed in Table 2, top-right panel. LSTM with already performs quite well and increasing did not result in any significant improvement, while in MNIST, the performance gradually improves as increases from to . We also observed that the LSTM network performed worse on permuted MNIST compared to a RNN. Similar result was also reported in [le2015simple].

### 4.5 Recurrent Skip Coefficients vs. Skip Connections

We also investigated whether the recurrent skip coefficient can suggest something more than simply adding skip connections. We design 4 specific architectures shown in Figure 2(b), right panel. (1) is the baseline model with a 2-layer stacked architecture, while the other three models add extra skip connections in different ways. Note that these extra skip connections all cross the same time length . In particular, (2) and (3) share quite similar architectures. However, ways in which the skip connections are allocated makes big differences on their recurrent skip coefficients: (2) has , (3) has and (4) has . Therefore, even though (2), (3) and (4) all add extra skip connections, the fact that their recurrent skip coefficients are different might result in different performance.

We evaluated these architectures on the sequential MNIST and MNIST datasets. The results show that differences in  indeed cause big performance gaps regardless of the fact that they all have skip connections (see Table 2, bottom-right panel). Given the same , the model with a larger performs better. In particular, model (3) is better than model (2) even though they only differ in the direction of the skip connections. It is interesting to see that for MNIST (unpermuted), the extra skip connection in model (2) (which does not really increase the recurrent skip coefficient) brings almost no benefits, as model (2) and model (1) have almost the same results. This observation highlights the following point: when addressing the long term dependency problems using skip connections, instead of only considering the time intervals crossed by the skip connection, one should also consider the model’s recurrent skip coefficient, which can serve as a guide for introducing more powerful skip connections.

## 5 Conclusion

In this paper, we first introduced a general formulation of RNN architectures, which provides a solid framework for the architectural complexity analysis. We then proposed three architectural complexity measures: recurrent depth, feedforward depth, and recurrent skip coefficients, capturing both short term and long term properties of RNNs. We also found empirical evidence that increasing recurrent depth and feedforward depth might yield performance improvements, that increasing feedforward depth might not help on long term dependency tasks, and that increasing the recurrent skip coefficient can largely improve performance on long term dependency tasks. These measures and results can provide guidance for the design of new recurrent architectures for particular learning tasks.

## Acknowledgments

The authors acknowledge the following agencies for funding and support: NSERC, Canada Research Chairs, CIFAR, Calcul Quebec, Compute Canada, Samsung, ONR Grant N000141310721, ONR Grant N000141512791 and IARPA Raytheon BBN Contract No. D11PC20071. The authors thank the developers of Theano [team2016theano] and Keras [chollet2015], and also thank Nicolas Ballas, Tim Cooijmans, Ryan Lowe, Mohammad Pezeshki, Roger Grosse and Alex Schwing for their insightful comments.


## Appendix A Proofs

To show Theorem 3.2, we first consider the most general case in which $d_r$ is defined (Theorem A.1). Then we discuss the mild assumptions under which we can reduce to the original limit (Proposition A.1.1). Additionally, we introduce some notation that will be used throughout the proof. If $v = (t, p)$ is a node in the unfolded graph, it has a corresponding node in the folded graph, which is denoted by $\bar{v} = (\bar{t}, p)$.

###### Theorem A.1.

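Given an RNN cyclic graph and its unfolded representation $(\mathcal{G}_c, \mathcal{G}_{un})$, we denote by $C(\mathcal{G}_c)$ the set of directed cycles in $\mathcal{G}_c$. For $\vartheta \in C(\mathcal{G}_c)$, denote by $l(\vartheta)$ the length of $\vartheta$ and by $\sigma_s(\vartheta)$ the sum of $\sigma$ along $\vartheta$. Write $d_i = \limsup_{n \to +\infty} \frac{\mathfrak{D}_i(n)}{n}$. (Here $\mathfrak{D}_i(n)$ is not defined when there does not exist a path from time $i$ to time $i+n$. We simply omit undefined cases when we consider the limsup. In a more rigorous sense, it is the limsup of a subsequence of $\{\mathfrak{D}_i(n)\}_{n=1}^{\infty}$, where $\mathfrak{D}_i(n)$ is defined.) We have:

• The quantity $d_i$ is periodic, in the sense that $d_{i+m} = d_i$ for all $i \in \mathbb{N}$.

• Let $d_r = \max_i d_i$, then

$$
d_r = \max_{\vartheta \in C(\mathcal{G}_c)} \frac{l(\vartheta)}{\sigma_s(\vartheta)} \tag{3}
$$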
###### Proof.

The first statement is easy to prove. Because of the periodicity of the graph, any path from time step $i$ to $i+n$ corresponds to an isomorphic path from time step $i+m$ to $i+m+n$. Passing to the limit, we can deduce the first statement.

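Now we prove the second statement. Write $\vartheta_0 = \arg\max_{\vartheta} \frac{l(\vartheta)}{\sigma_s(\vartheta)}$. First we prove that $d_r \geq \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)}$. Let $c_1 = (t_1, p_1)$ be a node such that, if we denote by $\bar{c}_1 = (\bar{t}_1, p_1)$ the image of $c_1$ on the cyclic graph, we have $\bar{c}_1 \in \vartheta_0$. Consider the subsequence $S_0 = \left\{ \frac{\mathfrak{D}_{\bar{t}_1}(k \sigma_s(\vartheta_0))}{k \sigma_s(\vartheta_0)} \right\}_{k=1}^{\infty}$ of $\left\{ \frac{\mathfrak{D}_{\bar{t}_1}(n)}{n} \right\}_{n=1}^{\infty}$. From the definition of $\mathfrak{D}$ and the fact that $\vartheta_0$ is a directed cycle, we have $\mathfrak{D}_{\bar{t}_1}(k \sigma_s(\vartheta_0)) \geq k\, l(\vartheta_0)$, by considering the path on $\mathcal{G}_{un}$ corresponding to following $\vartheta_0$ $k$ times. So we have

$$
d_r \geq \limsup_{n \to +\infty} \frac{\mathfrak{D}_{\bar{t}_1}(n)}{n} \geq \limsup_{k \to +\infty} \frac{\mathfrak{D}_{\bar{t}_1}(k \sigma_s(\vartheta_0))}{k \sigma_s(\vartheta_0)} \geq \frac{k\, l(\vartheta_0)}{k \sigma_s(\vartheta_0)} = \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)}
$$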

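Next we prove $d_r \leq \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)}$. It suffices to prove that, for any $\epsilon > 0$, there exists $N > 0$, such that for any path $\gamma : \{(t_0, p_0), (t_1, p_1), \cdots, (t_{n_\gamma}, p_{n_\gamma})\}$ with $t_{n_\gamma} - t_1 > N$, we have $\frac{n_\gamma}{t_{n_\gamma} - t_1} \leq \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)} + \epsilon$. We denote by $\bar{\gamma}$ the image of $\gamma$ on the cyclic graph. $\bar{\gamma}$ is a walk with repeated nodes and edges. Also, we assume there are in total $\Gamma$ nodes in the cyclic graph $\mathcal{G}_c$.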

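We first decompose $\bar{\gamma}$ into a path and a set of directed cycles. More precisely, there is a path $\gamma_0$ and a sequence of directed cycles $C = C_1(\gamma), C_2(\gamma), \cdots, C_w(\gamma)$ on $\mathcal{G}_c$ such that:

• The starting and end nodes of $\gamma_0$ are the same as those of $\bar{\gamma}$. (If $\bar{\gamma}$ starts and ends at the same node, take $\gamma_0$ as empty.)

• The catenation of the sequences of directed edges of $\gamma_0, C_1(\gamma), C_2(\gamma), \cdots, C_w(\gamma)$ is a permutation of the sequence of edges of $\bar{\gamma}$.

The existence of such a decomposition can be proved iteratively by removing directed cycles from $\bar{\gamma}$. Namely, if $\bar{\gamma}$ is not a path, there must be some directed cycle $C'$ on $\bar{\gamma}$. Removing $C'$ from $\bar{\gamma}$, we get a new walk $\bar{\gamma}'$. Inductively applying this removal, we finally get a (possibly empty) path and a sequence of directed cycles. For a directed path or loop $\gamma$, we write $D(\gamma)$ for the distance between the ending node and the starting node when travelling through $\gamma$ once. We have

$$
D(\gamma_0) := \bar{t}_{n_\gamma} - \bar{t}_0 + \sum_{i=1}^{|\gamma_0|} \sigma(e_i)
$$

where $e_i$, $i \in \{1, \cdots, |\gamma_0|\}$, are all the edges of $\gamma_0$, and $\bar{t}$ denotes $t$ modulo $m$: $t \equiv \bar{t} \pmod{m}$.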

So we have:

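$$
|D(\gamma_0)| \leq m + \Gamma \cdot \max_{e \in \mathcal{G}_c} \sigma(e) = M
$$

For convenience, we denote by $l_0, l_1, \cdots, l_w$ the lengths of the path $\gamma_0$ and the directed cycles $C_1(\gamma), C_2(\gamma), \cdots, C_w(\gamma)$. Obviously we have:

$$
n_\gamma = \sum_{i=0}^{w} l_i
$$

And also, we have

$$
t_{n_\gamma} - t_1 = \sum_{i=1}^{w} \sigma_s(C_i) + D(\gamma_0)
$$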

So we have:

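$$
\frac{n_\gamma}{t_{n_\gamma} - t_1} = \frac{l_0}{t_{n_\gamma} - t_1} + \sum_{i=1}^{w} \frac{l_i}{t_{n_\gamma} - t_1} \leq \frac{\Gamma}{N} + \sum_{i=1}^{w} \frac{l_i}{t_{n_\gamma} - t_1}
$$

In which we have for all $i \in \{1, 2, \cdots, w\}$:

$$
\frac{l_i}{t_{n_\gamma} - t_1} = \frac{l_i}{\sigma_s(C_i)} \cdot \frac{\sigma_s(C_i)}{t_{n_\gamma} - t_1} \leq \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)} \cdot \frac{\sigma_s(C_i)}{t_{n_\gamma} - t_1}
$$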

So we have:

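$$
\sum_{i=1}^{w} \frac{l_i}{t_{n_\gamma} - t_1} \leq \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)} \left[ 1 - \frac{D(\gamma_0)}{t_{n_\gamma} - t_1} \right] \leq \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)} + \frac{M'}{N}
$$

in which $M'$ and $\Gamma$ are constants depending only on the RNN $\mathcal{G}_c$.

Finally we have:

$$
\frac{n_\gamma}{t_{n_\gamma} - t_1} \leq \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)} + \frac{M' + \Gamma}{N}
$$

Taking $N > \frac{M' + \Gamma}{\epsilon}$, we obtain $d_r \leq \frac{l(\vartheta_0)}{\sigma_s(\vartheta_0)}$.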
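As an illustration of Eq. (3), the following Python sketch computes the maximum of $l(\vartheta)/\sigma_s(\vartheta)$ for a small cyclic graph by brute-force enumeration of simple directed cycles. The edge-list encoding `(u, v, sigma)`, the function names and the two-node example graph are assumptions made for this illustration and do not come from the paper. Restricting to simple cycles should suffice here, because the ratio of a composite cycle is a weighted average of the ratios of its simple components.

```python
from fractions import Fraction


def simple_cycles(nodes, edges):
    """Yield every simple directed cycle as a list of (u, v, sigma) edges."""
    order = {v: k for k, v in enumerate(nodes)}
    out = {v: [] for v in nodes}
    for u, v, sigma in edges:
        out[u].append((u, v, sigma))

    def extend(start, node, path, visited):
        for edge in out[node]:
            nxt = edge[1]
            if nxt == start:
                # Closing edge: the cycle is rooted at its lowest-ranked node.
                yield path + [edge]
            elif order[nxt] > order[start] and nxt not in visited:
                yield from extend(start, nxt, path + [edge], visited | {nxt})

    for start in nodes:
        yield from extend(start, start, [], {start})


def recurrent_depth(nodes, edges):
    """Return max over simple cycles of (number of edges) / (total delay)."""
    ratios = []
    for cycle in simple_cycles(nodes, edges):
        length = len(cycle)
        delay = sum(sigma for _, _, sigma in cycle)
        if delay <= 0:
            raise ValueError("every directed cycle needs positive total delay")
        ratios.append(Fraction(length, delay))
    if not ratios:
        raise ValueError("the graph has no directed cycle")
    return max(ratios)


# Hypothetical two-node graph: two self-loops with delay 1, a delay-free
# edge h1 -> h2, and a delay-1 edge h2 -> h1 feeding back into h1.
nodes = ["h1", "h2"]
edges = [("h1", "h1", 1), ("h2", "h2", 1), ("h1", "h2", 0), ("h2", "h1", 1)]
d_r = recurrent_depth(nodes, edges)
```

Exact rationals (`Fraction`) are used so that ties between cycles are compared without floating-point error. The enumeration is exponential in the worst case, which is acceptable only for the small hand-designed graphs considered in this kind of analysis.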

###### Proposition A.1.1.

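Given an RNN and its two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$, if $\exists\, \vartheta \in C(\mathcal{G}_c)$ such that $\vartheta$ achieves the maximum in Eq. (3) and the corresponding path of $\vartheta$ in $\mathcal{G}_{un}$ visits nodes at every time step, then we have

$$
d_r = \max_{i \in \mathbb{Z}} \left( \limsup_{n \to +\infty} \frac{\mathfrak{D}_i(n)}{n} \right) = \lim_{n \to +\infty} \frac{\mathfrak{D}_i(n)}{n}
$$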
###### Proof.

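We only need to prove that, in such a graph, for all $i \in \mathbb{Z}$ we have

$$
\liminf_{n \to +\infty} \frac{\mathfrak{D}_i(n)}{n} \geq \max_{i \in \mathbb{Z}} \left( \limsup_{n \to +\infty} \frac{\mathfrak{D}_i(n)}{n} \right) = d_r
$$

because it is obvious that

$$
\liminf_{n \to +\infty} \frac{\mathfrak{D}_i(n)}{n} \leq d_r
$$

Namely, it suffices to prove that, for all $i \in \mathbb{Z}$ and all $\epsilon > 0$, there is an $N_\epsilon > 0$ such that when $n > N_\epsilon$, we have $\frac{\mathfrak{D}_i(n)}{n} \geq d_r - \epsilon$. On the other hand, for $k \in \mathbb{N}$, if we assume $k \sigma_s(\vartheta) \leq n < (k+1) \sigma_s(\vartheta)$, then according to the condition we have

$$
\frac{\mathfrak{D}_i(n)}{n} \geq \frac{k \cdot l(\vartheta)}{(k+1)\, \sigma_s(\vartheta)} = \frac{l(\vartheta)}{\sigma_s(\vartheta)} - \frac{l(\vartheta)}{\sigma_s(\vartheta)} \cdot \frac{1}{k+1}
$$

We can see that if we set $k + 1 \geq \frac{l(\vartheta)}{\sigma_s(\vartheta)\, \epsilon}$, we obtain the inequality we wanted to prove.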

We now prove Proposition 3.3.1 and Theorem 3.4 as follows.

###### Proposition A.1.2.

Given an RNN with recurrent depth $d_r$, we denote

$$
d_f = \sup_{i, n \in \mathbb{Z}} \mathfrak{D}^*_i(n) - n \cdot d_r.
$$

The supremum exists and we have the following least upper bound:

$$
\mathfrak{D}^*_i(n) \leq n \cdot d_r + d_f.
$$
###### Proof.

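We first prove that $d_f < +\infty$. Write $d_f(i) = \sup_{n \in \mathbb{Z}} \mathfrak{D}^*_i(n) - n \cdot d_r$. It is easy to verify that $d_f(\cdot)$ is periodic, so it suffices to prove that for each $i \in \mathbb{N}$, $d_f(i) < +\infty$. Hence it suffices to prove

$$
\limsup_{n \to \infty} \left( \mathfrak{D}^*_i(n) - n \cdot d_r \right) < +\infty.
$$

From the definition, we have $\mathfrak{D}_i(n) \geq \mathfrak{D}^*_i(n)$. So we have

$$
\mathfrak{D}^*_i(n) - n \cdot d_r \leq \mathfrak{D}_i(n) - n \cdot d_r.
$$

From the proof of Theorem A.1, there exist two constants $M'$ and $\Gamma$ depending only on the RNN $\mathcal{G}_c$, such that

$$
\frac{\mathfrak{D}_i(n)}{n} \leq d_r + \frac{M' + \Gamma}{n}.
$$

So we have

$$
\limsup_{n \to \infty} \left( \mathfrak{D}^*_i(n) - n \cdot d_r \right) \leq \limsup_{n \to \infty} \left( \mathfrak{D}_i(n) - n \cdot d_r \right) \leq M' + \Gamma.
$$

Also, we have $d_f = \sup_{i, n \in \mathbb{Z}} \mathfrak{D}^*_i(n) - n \cdot d_r$, so for any $i, n \in \mathbb{Z}$,

$$
d_f \geq \mathfrak{D}^*_i(n) - n \cdot d_r.
$$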

###### Theorem A.2.

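Given an RNN and its two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$, we denote by $\xi(\mathcal{G}_c)$ the set of directed paths that start at an input node and end at an output node in $\mathcal{G}_c$. For $\gamma \in \xi(\mathcal{G}_c)$, denote by $l(\gamma)$ the length and by $\sigma_s(\gamma)$ the sum of $\sigma$ along $\gamma$. Then we have:

$$
d_f = \sup_{i, n \in \mathbb{Z}} \mathfrak{D}^*_i(n) - n \cdot d_r = \max_{\gamma \in \xi(\mathcal{G}_c)} l(\gamma) - \sigma_s(\gamma) \cdot d_r.
$$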
###### Proof.

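Let $\gamma : \{(t_0, p_0), (t_1, p_1), \cdots, (t_{n_\gamma}, p_{n_\gamma})\}$ be a path in $\mathcal{G}_{un}$ from an input node $(t_0, p_0)$ to an output node $(t_{n_\gamma}, p_{n_\gamma})$, where $t_0 = i$ and $t_{n_\gamma} = i + n$. We denote by $\bar{\gamma}$ the image of $\gamma$ on the cyclic graph. From the proof of Theorem A.1, for each $\bar{\gamma}$ in $\mathcal{G}_c$, we can decompose it into a path $\gamma_0$ and a sequence of directed cycles $C = C_1(\gamma), C_2(\gamma), \cdots, C_w(\gamma)$ on $\mathcal{G}_c$ satisfying the properties listed in the proof of Theorem A.1. We denote by $l_0, l_1, \cdots, l_w$ the lengths of the path $\gamma_0$ and the directed cycles $C_1(\gamma), C_2(\gamma), \cdots, C_w(\gamma)$. We know $\frac{l_k}{\sigma_s(C_k)} \leq d_r$ for all $k = 1, 2, \cdots, w$ by definition. Thus,

$$
\begin{aligned}
l_k &\leq d_r \cdot \sigma_s(C_k) \\
\sum_{k=1}^{w} l_k &\leq d_r \cdot \sum_{k=1}^{w} \sigma_s(C_k)
\end{aligned}
$$

Note that $n = \sigma_s(\gamma_0) + \sum_{k=1}^{w} \sigma_s(C_k)$. Therefore,

$$
\begin{aligned}
l(\gamma) - n \cdot d_r &= l_0 + \sum_{k=1}^{w} l_k - n \cdot d_r \\
&\leq l_0 + d_r \cdot \left( \sum_{k=1}^{w} \sigma_s(C_k) - n \right) \\
&= l_0 - d_r \cdot \sigma_s(\gamma_0)
\end{aligned}
$$

for all time steps $i$ and all integers $n$. The above inequality suggests that in order to take the supremum over all paths in $\mathcal{G}_{un}$, it suffices to take the maximum over directed paths in $\xi(\mathcal{G}_c)$. On the other hand, the equality can be achieved simply by choosing the corresponding path of $\gamma_0$ in $\mathcal{G}_{un}$. The desired conclusion then follows immediately.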
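The closed form in Theorem A.2 can be evaluated on a small cyclic graph by enumerating simple directed paths from an input node to an output node. The sketch below is a hedged illustration only: the `(u, v, sigma)` edge encoding, the function name `feedforward_depth`, the choice of a single input and a single output node, and the three-node example are assumptions made here, and path length is taken to be the number of edges.

```python
from fractions import Fraction


def feedforward_depth(edges, source, target, d_r):
    """Return max over simple paths source -> target of l(path) - sigma_s(path) * d_r.

    Returns None if target is unreachable from source.
    """
    out = {}
    for u, v, sigma in edges:
        out.setdefault(u, []).append((v, sigma))

    d_r = Fraction(d_r)
    best = None
    # Each stack entry: (current node, edges used, accumulated delay, visited nodes).
    stack = [(source, 0, 0, frozenset([source]))]
    while stack:
        node, length, delay, seen = stack.pop()
        if node == target:
            value = length - delay * d_r
            best = value if best is None else max(best, value)
            continue
        for nxt, sigma in out.get(node, []):
            if nxt not in seen:  # keep the path simple (no repeated nodes)
                stack.append((nxt, length + 1, delay + sigma, seen | {nxt}))
    return best


# Hypothetical graph: a is the input node, c the output node.
# Its cycles are a->b->c->a (ratio 3/1) and a->c->a (ratio 2/2), so d_r = 3.
edges = [("a", "b", 0), ("b", "c", 0), ("a", "c", 1), ("c", "a", 1)]
d_f = feedforward_depth(edges, "a", "c", d_r=3)
```

In this example the delay-free path a, b, c contributes $2 - 0 \cdot 3$, while the delayed edge from a to c contributes $1 - 1 \cdot 3$, so the delayed path is penalised by the recurrent depth exactly as in the formula.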

Lastly, we show Theorem 3.6.

###### Theorem A.3.

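Given an RNN cyclic graph and its unfolded representation $(\mathcal{G}_c, \mathcal{G}_{un})$, we denote by $C(\mathcal{G}_c)$ the set of directed cycles in $\mathcal{G}_c$. For $\vartheta \in C(\mathcal{G}_c)$, denote by $l(\vartheta)$ the length of $\vartheta$ and by $\sigma_s(\vartheta)$ the sum of $\sigma$ along $\vartheta$. Write $s_i = \liminf_{n \to +\infty} \frac{\mathfrak{d}_i(n)}{n}$. We have:

• The quantity $s_i$ is periodic, in the sense that $s_{i+m} = s_i$ for all $i \in \mathbb{N}$.

• Let $s = \min_i s_i$, then

$$
s = \min_{\vartheta \in C(\mathcal{G}_c)} \frac{l(\vartheta)}{\sigma_s(\vartheta)}. \tag{4}
$$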
###### Proof.

The proof is essentially the same as that of Theorem A.1, so we omit it here. ∎

###### Proposition A.3.1.

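Given an RNN and its two graph representations $\mathcal{G}_{un}$ and $\mathcal{G}_c$, if $\exists\, \vartheta \in C(\mathcal{G}_c)$ such that $\vartheta$ achieves the minimum in Eq. (4) and the corresponding path of $\vartheta$ in $\mathcal{G}_{un}$ visits nodes at every time step, then we have

$$
s = \min_{i \in \mathbb{Z}} \left( \liminf_{n \to +\infty} \frac{\mathfrak{d}_i(n)}{n} \right) = \lim_{n \to +\infty} \frac{\mathfrak{d}_i(n)}{n}.
$$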
###### Proof.

The proof is essentially the same as that of Proposition A.1.1, so we omit it here. ∎

## Appendix B Experiment Details

### B.1 RNNs with tanh

In this section we explain in detail the functional dependency among nodes in RNNs with $\tanh$.

The transition function for each node is the $\tanh$ function. The output of a node $v$ is a vector $h_v$. To compute the output of a node, we take all incoming nodes as input, sum over their affine transformations, and then apply the $\tanh$ function (we omit the bias term for simplicity):

$$
h_v = \tanh\left( \sum_{u \in \mathrm{In}(v)} W(u)\, h_u \right),
$$

where $W(\cdot)$ represents a real matrix.

As a more concrete example, consider the “bottom-up” architecture in Figure 3, with which we did the experiment described in Section 4.2. To compute the output of node $v$,

$$
h_v = \tanh\left( W(u)\, h_u + W(p)\, h_p + W(q)\, h_q \right). \tag{5}
$$
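The node update above can be sketched in a few lines of dependency-free Python. This is only an illustration under assumptions: matrices are plain nested lists, the helper names `matvec` and `tanh_node` are hypothetical, and the bias term is omitted as in the text. It is not the implementation used in the experiments, which relied on Theano and Keras.

```python
import math


def matvec(W, h):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def tanh_node(incoming):
    """Compute h_v = tanh(sum_u W(u) h_u).

    `incoming` is a non-empty list of (W, h) pairs, one per node in In(v).
    """
    pre = None
    for W, h in incoming:
        term = matvec(W, h)
        pre = term if pre is None else [a + b for a, b in zip(pre, term)]
    return [math.tanh(a) for a in pre]


# Two incoming nodes with hypothetical 2x2 weights.
h_v = tanh_node([
    ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]),
    ([[2.0, 0.0], [0.0, 2.0]], [0.25, 0.25]),
])
```

For a node with three incoming nodes, as in Eq. (5), the list passed to `tanh_node` would simply contain three `(W, h)` pairs.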

### B.2 LSTMs

In this section we explain the Multidimensional LSTM (introduced by [Graves2007]) which we use for experiments with LSTMs.

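The output of a node $v$ of the LSTM is a 2-tuple ($c_v$, $h_v$), consisting of a cell memory state $c_v$ and a hidden state $h_v$. The transition function $F$ is applied to each node indistinguishably. We describe the computation of $F$ below in a sequential manner (we omit the bias term for simplicity).

$$
\begin{aligned}
z &= g\left( \sum_{u \in \mathrm{In}(v)} W_z(u)\, h_u \right) && \text{block input} \\
i &= \sigma\left( \sum_{u \in \mathrm{In}(v)} W_i(u)\, h_u \right) && \text{input gate} \\
o &= \sigma\left( \sum_{u \in \mathrm{In}(v)} W_o(u)\, h_u \right) && \text{output gate} \\
\{f_u\} &= \left\{ \sigma\left( \sum_{u' \in \mathrm{In}(v)} W_{f_u}(u')\, h_{u'} \right) \;\middle|\; u \in \mathrm{In}(v) \right\} && \text{a set of forget gates} \\
c_v &= i \odot z + \sum_{u \in \mathrm{In}(v)} f_u \odot c_u && \text{cell state} \\
h_v &= o \odot c_v && \text{hidden state}
\end{aligned}
$$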
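The update equations above can be sketched as follows. This is a minimal, dependency-free illustration rather than the code used in the experiments: the dictionary-based encoding of incoming nodes, the function names, and the choice $g = \tanh$ are assumptions made here (the text does not specify $g$), and the forget-gate sum is taken over $h_{u'}$. Biases are omitted, as in the equations.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine_sum(weights, hidden):
    """Return sum_u W(u) h_u; `weights` and `hidden` are dicts keyed by node u."""
    n_rows = len(next(iter(weights.values())))
    total = [0.0] * n_rows
    for u, W in weights.items():
        for r, row in enumerate(W):
            total[r] += sum(w * x for w, x in zip(row, hidden[u]))
    return total


def md_lstm_node(hidden, cells, Wz, Wi, Wo, Wf, g=math.tanh):
    """One node update; Wf[u] holds the weights {u': W_{f_u}(u')} of forget gate f_u."""
    z = [g(a) for a in affine_sum(Wz, hidden)]        # block input
    i = [sigmoid(a) for a in affine_sum(Wi, hidden)]  # input gate
    o = [sigmoid(a) for a in affine_sum(Wo, hidden)]  # output gate
    c = [ik * zk for ik, zk in zip(i, z)]
    for u, c_u in cells.items():
        # One forget gate per incoming node u, gating that node's cell state.
        f_u = [sigmoid(a) for a in affine_sum(Wf[u], hidden)]
        c = [ck + fk * cuk for ck, fk, cuk in zip(c, f_u, c_u)]
    h = [ok * ck for ok, ck in zip(o, c)]             # hidden state, h_v = o * c_v
    return c, h


# One incoming node "u" with 1-dimensional states and all-zero weights.
zero = {"u": [[0.0]]}
c_v, h_v = md_lstm_node({"u": [1.0]}, {"u": [2.0]}, zero, zero, zero, {"u": zero})
```

With all weights zero, every gate evaluates to $\sigma(0) = 0.5$ and the block input to $g(0) = 0$, so the new cell state is half of the incoming cell state; this makes the role of the per-node forget gates easy to check by hand.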

Note that the Multidimensional LSTM includes the usual definition of LSTM as a special case, where the extra forget gates are 0 (i.e., bias term set to $-\infty$) and the extra weight matrices are 0. We again consider the architecture in Fig. 3. We first compute the block input, the input gate and the output gate by summing over all affine transformed outputs of $u, p, q$, and then applying the activation function. For example, to compute the input gate, we have

$$
i = \sigma\left( W_i(u)\, h_u + W_i(p)\, h_p + W_i(q)\, h_q \right).
$$

Next, we compute one forget gate for each pair of