Low-rank passthrough neural networks
Abstract
Deep learning consists in training neural networks to perform computations that sequentially unfold in many steps over a time dimension or an intrinsic depth dimension. Effective learning in this setting is usually accomplished by specialized network architectures that are designed to mitigate the vanishing gradient problem of naive deep networks. Many of these architectures, such as LSTMs, GRUs, Highway Networks and Deep Residual Networks, are based on a single structural principle: the state passthrough.
We observe that these architectures, hereby characterized as Passthrough Networks, in addition to the mitigation of the vanishing gradient problem, enable the decoupling of the network state size from the number of parameters of the network, a possibility that is exploited in some recent works but not thoroughly explored.
In this work we propose simple, yet effective, low-rank and low-rank plus diagonal matrix parametrizations for Passthrough Networks which exploit this decoupling property, reducing the data complexity and memory requirements of the network while preserving its memory capacity. We present competitive experimental results on synthetic tasks and a near state-of-the-art result on sequential randomly-permuted MNIST classification, a hard task on natural data.
Antonio Valerio Miceli Barone (work partially done while affiliated with University of Pisa)

School of Informatics 
The University of Edinburgh 
amiceli@inf.ed.ac.uk 
1 Overview
Deep neural networks can perform nontrivial computation by the repeated application of parametric nonlinear transformation layers to vectorial (or, more generally, tensorial) data. This staging of many computation steps can be done over a time dimension for tasks involving sequential inputs or outputs of varying length, yielding a recurrent neural network, or over an intrinsic circuit depth dimension, yielding a deep feedforward neural network, or both. Training these deep models is complicated by the exploding and vanishing gradient problems (Hochreiter, 1991; Bengio et al., 1994).
Starting from the original LSTM of Hochreiter & Schmidhuber (1997), various network architectures have been proposed to ameliorate the vanishing gradient problem in the recurrent neural network setting, such as the modern LSTM (Graves & Schmidhuber, 2005), the GRU (Cho et al., 2014b) and other variants (Greff et al., 2015; Józefowicz et al., 2015). These architectures led to a number of breakthroughs in different tasks such as speech recognition (Graves et al., 2013), machine translation (Cho et al., 2014a; Bahdanau et al., 2014), natural language parsing (Vinyals et al., 2014), question answering (Iyyer et al., 2014) and many others. More recently, similar methods have been applied in the feedforward neural network setting, yielding state-of-the-art results with architectures such as Highway Networks (Srivastava et al., 2015), Deep Residual Networks (He et al., 2015) and Grid LSTM (Kalchbrenner et al., 2015), which also generalizes to networks that are deep in both an intrinsic dimension and a time dimension, or even in multiple additional dimensions. All these architectures are based on a single structural principle which, in this work, we will refer to as the state passthrough. We will thus refer to these architectures as Passthrough Networks.
Another difficulty in training neural networks is the tradeoff between the network representation power and its number of trainable parameters, which affects its data complexity during training in addition to its implementation memory requirements. More specifically, the number of parameters influences the representation power in two ways: on one hand, it can be thought of as the number of tunable "knobs" or "switches" that need to be set to represent a given computable function. On the other hand, however, the number of parameters constrains, in most neural architectures, the size of the partial results that are propagated inside the network: its internal memory capacity.
In typical "fully connected" neural architectures, a layer acting on an $n$-dimensional state vector has $O(n^2)$ parameters stored in one or more matrices. Since a sufficiently complex function requires a large number of bits to be represented regardless of architectural details, we cannot hope to find low-dimensional representations for really hard learning tasks, but there can be many functions of practical interest that are simple enough to be represented by a relatively small number of bits while still requiring some sizable amount of memory to be computed. Therefore, representing these functions on a fully connected neural network can be wasteful in terms of number of parameters. For some tasks, this quadratic dependency between state size and parameter number can cause a model to go from underfitting the training set to overfitting it just by the addition of a single state component. For this reason, a number of low-dimensional neural layer parametrizations have been proposed, such as convolutional layers (LeCun et al., 2004; Krizhevsky et al., 2012), which impose a sparse, local, periodic structure on the parameter matrices, or multiplicative matrix decompositions, notably the Unitary Evolution RNNs (Arjovsky et al., 2015) (which also address the vanishing gradient problem) and others (Le et al., 2013; Moczulski et al., 2015).
In this work we observe that the state passthrough allows for a systematic decoupling of the network state size from the number of parameters: since by default the state vector passes mostly unaltered through the layers, each layer can be made simple enough to be described only by a small number of parameters without affecting the overall memory capacity of the network. This effectively spreads the computation over the depth or time dimension of the network, but without making the network "thin" (as proposed, for instance, by Srivastava et al. (2015)).
To the best of our knowledge, this decoupling has not been described in a systematic way, although it has been exploited by some convolutional passthrough architectures for image recognition (Srivastava et al., 2015; He et al., 2015) or algorithmic tasks (Kaiser & Sutskever, 2015), or architectures with addressable read-write memory (Graves et al., 2014; Gregor et al., 2015; Neelakantan et al., 2015; Kurach et al., 2015; Danihelka et al., 2016).
In this work we introduce a unified view of passthrough architectures, describe their state size-parameter size decoupling property, and propose simple but effective low-dimensional parametrizations that exploit this decoupling, based on low-rank or low-rank plus diagonal matrix decompositions. Our approach extends the LSTM architecture with a single projection layer by Sak et al. (2014), which has been applied to speech recognition, natural language modeling (Józefowicz et al., 2016), video analysis (Sun et al., 2015) and other tasks. We provide an experimental evaluation of our approach on GRU and Highway Network architectures on various machine learning tasks, including a near state-of-the-art result for the hard task of sequential randomly-permuted MNIST image recognition (Le et al., 2015).
2 Model
In this section we will introduce a notation to describe various neural network architectures, then we will formally describe passthrough architectures and finally we will introduce our low-dimensional parametrizations for these architectures.
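A neural network can be described as a dynamical system that transforms an input $u$ into an output $y$ over multiple time steps $T$. At each step $t$ the network has an $n$-dimensional state vector $x(t) \in \mathbb{R}^n$ defined as
$$x(t) = \begin{cases} \mathrm{in}(u, \theta) & \text{if } t = 0 \\ f(x(t-1), t, u, \theta) & \text{if } t \geq 1 \end{cases} \tag{1}$$
where $\mathrm{in}$ is a state initialization function, $f$ is a state transition function and $\theta \in \mathbb{R}^k$ is a vector of trainable parameters. The output
$$y = \mathrm{out}(x(0:T), \theta) \tag{2}$$
is generated by an output function $\mathrm{out}$, where $x(0:T)$ denotes the whole sequence of states visited during the execution.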
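The recursion above can be sketched in a few lines of Python. This is only an illustration of the notation, not the implementation used in the experiments: the names `run_network`, `init_fn`, `step_fn` and `out_fn`, as well as the toy scalar model, are hypothetical.

```python
def run_network(init_fn, step_fn, out_fn, u, theta, num_steps):
    """Unroll a network described as a dynamical system.

    init_fn builds the initial state, step_fn maps the previous state to the
    next one, and out_fn reads the whole list of visited states.
    """
    states = [init_fn(u, theta)]                          # state at step 0
    for t in range(1, num_steps + 1):
        states.append(step_fn(states[-1], t, u, theta))   # state at step t
    return out_fn(states, theta)


# Toy instance (hypothetical): a scalar state that adds theta * u at every step
# and an output function that only reads the final state.
y = run_network(
    init_fn=lambda u, theta: 0.0,
    step_fn=lambda x_prev, t, u, theta: x_prev + theta * u,
    out_fn=lambda states, theta: states[-1],
    u=2.0,
    theta=0.5,
    num_steps=3,
)
```

Feedforward and recurrent networks differ only in which arguments the three functions actually use.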
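In a feedforward neural network with constant hidden layer width $n$, the input $u \in \mathbb{R}^m$ and the output $y \in \mathbb{R}^l$ are vectors of fixed dimension $m$ and $l$ respectively, $T$ is a model hyperparameter and the functions above can be simplified as
$$\begin{aligned} \mathrm{in}(u, \theta) &= \mathrm{in}(u, \theta_{in}) \\ f(x(t-1), t, u, \theta) &= f(x(t-1), \theta_t) \\ \mathrm{out}(x(0:T), \theta) &= \mathrm{out}(x(T), \theta_{out}) \end{aligned} \tag{3}$$
highlighting the dependence of the different layers on different subsets of parameters.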
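In a recurrent neural network the input $u$ is typically a list of $m$-dimensional vectors $u(t) \in \mathbb{R}^m$ for $t \in \{1, \dots, T\}$ where $T$ is variable, and the output $y$ is either a single $l$-dimensional vector or a list of such vectors. The model functions can be written as
$$\begin{aligned} \mathrm{in}(u, \theta) &= \mathrm{in}(\theta_{in}) \\ f(x(t-1), t, u, \theta) &= f(x(t-1), u(t), \theta_f) \\ \mathrm{out}(x(0:T), \theta) &= \mathrm{out}(x(0:T), \theta_{out}) \end{aligned} \tag{4}$$
where for a fixed-dimensional output we assume that only the value at the final step $T$ is meaningful.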
Other neural architectures, such as "seq2seq" transducers without attention (Cho et al., 2014a), can also be described with this framework.
2.1 Passthrough networks
Passthrough networks can be defined as networks where the state transition function $f$ has a special form such that, at each step $t$, the state vector $x(t)$ (or a sub-vector $x'(t)$) is propagated to the next step modified only by some (nearly) linear, elementwise transformations.
Let the state vector $x(t) \equiv (x'(t), x''(t))$ be the concatenation of $x'(t) \in \mathbb{R}^{n'}$ and $x''(t) \in \mathbb{R}^{n''}$ with $n' + n'' = n$ (where $n''$ can be equal to zero). We define a network to have a state passthrough on $x'$ if $x'$ evolves as
$$x'(t) = f_\pi(x(t-1), t, u, \theta) \odot f_\tau(x(t-1), t, u, \theta) + x'(t-1) \odot f_\gamma(x(t-1), t, u, \theta) \tag{5}$$
where $f_\pi$ is the next state proposal function, $f_\tau$ is the transform function, $f_\gamma$ is the carry function and $\odot$ denotes elementwise vector multiplication.
The rest of the state vector $x''(t)$, if present, evolves according to some other function $f''$. In practice $x''(t)$ is only used in LSTM variants, while in other passthrough architectures $x'(t) = x(t)$.
We denote the state passthrough as additive if $f_\tau = f_\gamma = 1$ (the all-ones vector). This choice is used in the original LSTM of Hochreiter & Schmidhuber (1997) and in the Deep Residual Network of He et al. (2015) (the Deep Residual Network does not exactly fit this definition of passthrough network due to the ReLU nonlinearities applied between the layers, but it is similar enough that it can be considered to be based on the same principle).
We denote the state passthrough as convex if $f_\gamma = 1 - f_\tau$. This choice is used in GRUs (Cho et al., 2014b) and Highway Networks (Srivastava et al., 2015). Modern LSTM variants (Greff et al., 2015) typically use a transform function (the "input gate", which scales the proposal) and a carry function (the "forget gate", which scales the previous state) that are independent of each other.
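As a concrete example, we can describe a fully connected Highway Network as
$$\begin{aligned} f_\pi(x(t-1), t, u, \theta) &= g\left(W^{(\pi)}(t)\, x(t-1) + b^{(\pi)}(t)\right) \\ f_\tau(x(t-1), t, u, \theta) &= \sigma\left(W^{(\tau)}(t)\, x(t-1) + b^{(\tau)}(t)\right) \end{aligned} \tag{6}$$
where $g$ is an elementwise activation function, usually the ReLU (Glorot et al., 2011) or the hyperbolic tangent, $\sigma$ is the elementwise logistic sigmoid, and, for each $t \in \{1, \dots, T\}$, the parameters $W^{(\pi)}(t)$ and $W^{(\tau)}(t)$ are matrices in $\mathbb{R}^{n \times n}$ and $b^{(\pi)}(t)$ and $b^{(\tau)}(t)$ are vectors in $\mathbb{R}^n$. Dependence on the input occurs only through the initialization function, which is model-specific and is omitted here, as is the output function.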
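As an illustration, a single fully connected Highway layer with a convex passthrough can be written as follows. This is a sketch in plain Python (lists instead of an array library, ReLU as the activation); the helper names and the toy values are assumptions made for this example, and batch normalization and dropout are omitted.

```python
import math


def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_step(x_prev, W_pi, b_pi, W_tau, b_tau):
    """One Highway layer: proposal * transform + previous state * (1 - transform)."""
    proposal = [max(0.0, a + b) for a, b in zip(matvec(W_pi, x_prev), b_pi)]    # ReLU
    transform = [sigmoid(a + b) for a, b in zip(matvec(W_tau, x_prev), b_tau)]
    # Convex passthrough: the carry gate is one minus the transform gate.
    return [p * g + x * (1.0 - g) for p, g, x in zip(proposal, transform, x_prev)]


# With a strongly negative transform bias the gate is almost closed,
# so the state passes through the layer nearly unchanged.
x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x_next = highway_step(x, zeros, [0.0, 0.0], zeros, [-20.0, -20.0])
```

The toy call shows why a negative initial transform bias makes a freshly initialized layer behave close to the identity.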
2.2 Low-rank passthrough networks
In fully connected architectures there are $n \times n$ matrices that act on the state vector, such as the $W^{(\pi)}$ and $W^{(\tau)}$ matrices of the Highway Network of eq. 6. Each of these matrices has $n^2$ entries, thus for large $n$ the entries of these matrices can make up the majority of independently trainable parameters of the model.
As discussed in the previous section, this parametrization can be wasteful. Specifically, this parametrization implies that, at each step, all the information in each state component can affect all the information in any state component at the next step. That is, the computation performed at each step is essentially fully global. Classical physical systems, however, consist of spatially separated parts with primarily local interactions; long-distance interactions are possible but they tend to be limited by propagation delays, bandwidth and noise. Therefore it may be beneficial to bias our model class towards models that tend to adhere to these physical constraints by using a parametrization which reduces the number of parameters required to represent them.
We can accomplish this low-dimensional parametrization by imposing some constraints on the matrices that parametrize the state transitions. One way of doing this is to impose a convolutional structure on these matrices, which corresponds to strict locality and periodicity constraints as in a cellular automaton. These constraints may work well in certain domains such as vision, but may be overly restrictive in other domains.
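We propose instead to impose a low-rank constraint on these matrices. This is easily accomplished by rewriting each of these matrices as the product of two matrices where the inner dimension $d$ is a model hyperparameter. For instance, in the case of the Highway Network of eq. 6 we can redefine
$$\begin{aligned} W^{(\pi)}(t) &= U^{(\pi)}(t)\, V^{(\pi)}(t) \\ W^{(\tau)}(t) &= U^{(\tau)}(t)\, V^{(\tau)}(t) \end{aligned} \tag{7}$$
where $U^{(\pi)}(t), U^{(\tau)}(t) \in \mathbb{R}^{n \times d}$ and $V^{(\pi)}(t), V^{(\tau)}(t) \in \mathbb{R}^{d \times n}$. When $d < n/2$ this results in a reduction of the number of independent parameters of the model, since each pair of factors has $2nd$ entries instead of $n^2$.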
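The parameter saving of the factorization follows from simple counting, which the following sketch makes explicit. The function names and the state size used in the example are hypothetical and are not the settings of the experiments.

```python
def full_matrix_params(n):
    """Entries of one dense n x n state-to-state matrix."""
    return n * n


def lowrank_params(n, d):
    """Entries of the two factors: an n x d matrix and a d x n matrix."""
    return 2 * n * d


def lowrank_saves_parameters(n, d):
    """True exactly when 2 * n * d < n * n, that is when d < n / 2."""
    return lowrank_params(n, d) < full_matrix_params(n)


n = 1000  # hypothetical state size
counts = {d: lowrank_params(n, d) for d in (10, 100, 500)}
# The dense matrix has 1000000 entries; the factorized one has
# 20000, 200000 and 1000000 entries for d = 10, 100 and 500.
```

At the break-even point, where the inner dimension is half the state size, the factorization no longer saves anything.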
This low-rank constraint can be thought of as a bandwidth constraint on the computation performed at each step: the $V$ matrices first project the state into a smaller subspace, extracting the information needed for that specific step, then the $U$ matrices project it back to the original state space, spreading the selected information to all the state components that need to be updated.
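The two-stage view of the low-rank product can be sketched as follows: the state is first projected to the inner dimension and then expanded back, without ever forming the full matrix. The helper names and the small rank-one example are assumptions made for illustration.

```python
def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    """Dense matrix-matrix product, used here only to check the result."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def lowrank_matvec(U, V, x):
    z = matvec(V, x)      # project the n-dimensional state down to d components
    return matvec(U, z)   # spread the d components back over the n state components


U = [[1.0], [2.0], [3.0]]   # n x d with n = 3, d = 1
V = [[1.0, 0.0, -1.0]]      # d x n
x = [4.0, 5.0, 6.0]

y = lowrank_matvec(U, V, x)            # two small products
y_dense = matvec(matmul(U, V), x)      # same result through the explicit 3 x 3 matrix
```

Besides storing fewer parameters, the two-stage product costs on the order of $nd$ multiplications per step instead of $n^2$.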
Note that if we were to apply this constraint to a non-passthrough architecture, such as a Multi-Layer Perceptron or an Elman Recurrent Neural Network, it would create an information bottleneck within each layer, effectively reducing the memory capacity of the model. But in a passthrough architecture the memory capacity is unaffected, since the state passthrough takes care of propagating all the information that does not need to be updated during one step to the next step. Therefore we exploit the decoupling property of the state passthrough. A similar approach has been proposed for the LSTM architecture by Sak et al. (2014), although they force the $V$ matrices to be the same for all the functions of the state transition, while we allow each parameter matrix to be parametrized independently by a pair of $U$ and $V$ matrices.
Low-rank passthrough architectures are universal in that they retain the same representation classes of their parent architectures. This is trivially true if the inner dimension $d$ is allowed to be $n$ in the worst case, and for some architectures even if $d$ is held constant. For instance, it is easily shown that for any Highway Network with state size $n$ and $T$ hidden layers and for any $\epsilon > 0$, there exists a Low-rank Highway Network with constant inner dimension, a larger but bounded state size and a larger but bounded number of layers that computes the same function within an $\epsilon$ margin of error.
2.3 Low-rank plus diagonal passthrough networks
As we show in the experimental section, on some tasks the low-rank constraint may prove to be excessively restrictive if the goal is to train a model with fewer parameters than one with arbitrary matrices. A simple extension is to add to each low-rank parameter matrix a diagonal parameter matrix, yielding a matrix that is full-rank but still parametrized in a low-dimensional space. For instance, for the Highway Network architecture we modify eq. 7 to
$$\begin{aligned} W^{(\pi)}(t) &= U^{(\pi)}(t)\, V^{(\pi)}(t) + D^{(\pi)}(t) \\ W^{(\tau)}(t) &= U^{(\tau)}(t)\, V^{(\tau)}(t) + D^{(\tau)}(t) \end{aligned} \tag{8}$$
where $D^{(\pi)}(t), D^{(\tau)}(t) \in \mathbb{R}^{n \times n}$ are trainable diagonal parameter matrices.
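A low-rank plus diagonal product can be sketched in the same style, storing only the diagonal entries of the extra matrix. The function names and the toy values below are illustrative assumptions.

```python
def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lowrank_plus_diag_matvec(U, V, diag, x):
    """Compute (U V + D) x where D is diagonal and stored as the list `diag`."""
    low = matvec(U, matvec(V, x))                          # low-rank part
    return [l + d_i * x_i for l, d_i, x_i in zip(low, diag, x)]


def lowrank_plus_diag_params(n, d):
    """Two n-by-d-sized factors plus the n diagonal entries."""
    return 2 * n * d + n


U = [[1.0], [2.0], [3.0]]
V = [[1.0, 0.0, -1.0]]
diag = [1.0, 1.0, 1.0]
x = [4.0, 5.0, 6.0]
y = lowrank_plus_diag_matvec(U, V, diag, x)   # [-2, -4, -6] + [4, 5, 6]
```

The diagonal term adds only $n$ parameters per matrix, yet the resulting matrix is in general full-rank.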
Low-rank plus diagonal decompositions have been used for over a century in factor analysis in statistics (Spearman, 1904), system identification (Kalman, 1982) and other applications. They arise naturally in the estimation of linear relationships between variables from noisy measurements, under certain independence assumptions on the measurement noise. Refer to Saunderson et al. (2012) and Ning et al. (2015) for a review.
At first, it may seem that adding diagonal parameter matrices is redundant in passthrough networks. After all, the state passthrough itself can be considered as a diagonal matrix applied to the state vector, which is then additively combined with the new proposed state computed by the proposal function. However, since the state passthrough completely skips over all nonlinear activation functions (except in the Residual Network architecture where it only skips over some of them), these formulations are not equivalent. In particular, the low-rank plus diagonal parametrization may help in recurrent neural networks which receive input at each time step, since it allows each component of the state vector to directly control how much input signal is inserted into it at each step. We demonstrate the effectiveness of this model in the sequence copy tasks described in the experiments section.
3 Experiments
In this section we report a preliminary experiment on Low-rank Highway Networks on the MNIST dataset and several experiments on Low-rank GRUs.
3.1 Low-rank Highway Networks
We applied the low-rank and low-rank plus diagonal Highway Network architecture to the classic benchmark task of handwritten digit classification on the MNIST dataset.
We used the low-rank architecture described by equations 6 and 7, with hidden layers, ReLU activation function, state dimension and maximum rank (internal dimension) . The input-to-state layer is a dense matrix followed by a (biased) ReLU activation and the state-to-output layer is a dense matrix followed by a (biased) identity activation. We did not use any convolution layer, pooling layer or data augmentation technique.
We used dropout (Srivastava et al., 2014) in order to achieve regularization. We applied standard dropout layers with dropout probability just before the input-to-state layer and just before the state-to-output layer. We also applied dropout inside each hidden layer in the following way: we inserted dropout layers with inside both the proposal function and the transform function, immediately before both the matrices and the matrices, totaling four dropout layers per hidden layer, although the random dropout matrices are shared between proposal and transform functions. Dropout applied this way does not disrupt the state passthrough, thus it does not cause a reduction of memory capacity during training. We further applied L2-regularization with coefficient per example on the hidden-to-output parameter matrix.
We also used batch normalization (Ioffe & Szegedy, 2015) after the input-to-state matrix and after each parameter matrix in the hidden layers.
Parameter matrices are randomly initialized using a uniform distribution with scale equal to where is the input dimension. Initial bias vectors are all initialized at zero except for those of the transform functions in the hidden layers, which are initialized at .
We trained to minimize the sum of the per-class L2-hinge loss plus the L2-regularization cost (Tang, 2013). Optimization was performed using Adam (Kingma & Ba, 2014) with standard hyperparameters, learning rate starting at halving every three epochs without validation improvements. Minibatch size was equal to . Code is available online (https://github.com/Avmb/lowrankhighwaynetwork).
We ran our experiments on a machine with a 24 core Intel(R) Xeon(R) CPU X5670 2.93GHz, 24 GB of RAM. We did not use a GPU. Training took approximately 4 hours.
We obtained perfect training accuracy and test accuracy. While this result does not reach the state of the art for this task ( test accuracy with unsupervised dimensionality reduction reported by Tang (2013)), it is still relatively close.
We also tested the low-rank plus diagonal Highway Network architecture of eq. 8 with the same settings as above, obtaining a test accuracy of . The inclusion of diagonal parameter matrices does not seem to help in this particular task.
3.2 Low-rank GRUs
We applied the Low-rank and Low-rank plus diagonal GRU architectures to a subset of sequential benchmarks described in the Unitary Evolution Recurrent Neural Networks article by Arjovsky et al. (2015), specifically the memory task, the addition task and the sequential randomly permuted MNIST task. For the memory tasks, we also considered two different variants proposed by Danihelka et al. (2016) and Henaff et al. (2016) which are hard for the uRNN architecture.
We chose to compare against the uRNN architecture because it set state-of-the-art results in terms of both data complexity and accuracy and because it is an architecture with similar design objectives as low-rank passthrough architectures, namely a low-dimensional parametrization and the mitigation of the vanishing gradient problem, but it is based on quite different principles (it does not use a state passthrough as defined in this work, instead it relies on the reversibility and norm-preservation properties of unitary matrices in order to preserve state information between time steps, and uses a multiplicative unitary decomposition in order to achieve low-dimensional parametrization).
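The GRU architecture (Cho et al., 2014b) is a passthrough recurrent neural network defined as
$$\begin{aligned} \mathrm{in}(u, \theta) &= \theta_{in} \\ f_\omega(x(t-1), t, u, \theta) &= \sigma\left(\Theta^{(\omega)} u(t) + W^{(\omega)} x(t-1) + b^{(\omega)}\right) \\ f_\gamma(x(t-1), t, u, \theta) &= \sigma\left(\Theta^{(\gamma)} u(t) + W^{(\gamma)} x(t-1) + b^{(\gamma)}\right) \\ f_\tau(x(t-1), t, u, \theta) &= 1 - f_\gamma(x(t-1), t, u, \theta) \\ f_\pi(x(t-1), t, u, \theta) &= \tanh\left(\Theta^{(\pi)} u(t) + W^{(\pi)} \left(x(t-1) \odot f_\omega(x(t-1), t, u, \theta)\right) + b^{(\pi)}\right) \end{aligned} \tag{9}$$
Note that with respect to the definition of the Highway Network architecture of eq. 6, the initial state $\theta_{in}$ is a model parameter, there is an additional function $f_\omega$ (the "reset" gate), parameters do not depend on time and input is included in the computation at each step through the $\Theta$ matrices. We have also defined the transform function in terms of the carry function rather than vice versa for consistency with the literature, although the two formulations are equivalent.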
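One step of a GRU written in this passthrough form can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the parameter dictionary layout, the key names (`Theta_*` for input-to-state matrices, `W_*` for state-to-state matrices, `b_*` for biases) and the toy values are hypothetical.

```python
import math


def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x_prev, u_t, p):
    """One GRU step: reset gate, carry (update) gate, proposal, convex passthrough."""
    def affine(gate, state):
        from_input = matvec(p["Theta_" + gate], u_t)
        from_state = matvec(p["W_" + gate], state)
        return [a + b + c for a, b, c in zip(from_input, from_state, p["b_" + gate])]

    reset = [sigmoid(z) for z in affine("omega", x_prev)]
    carry = [sigmoid(z) for z in affine("gamma", x_prev)]
    gated_state = [x * r for x, r in zip(x_prev, reset)]
    proposal = [math.tanh(z) for z in affine("pi", gated_state)]
    # The transform gate is defined as one minus the carry gate.
    return [q * (1.0 - c) + x * c for q, c, x in zip(proposal, carry, x_prev)]


n, m = 2, 1  # toy state and input sizes
params = {}
for gate in ("omega", "gamma", "pi"):
    params["Theta_" + gate] = [[0.0] * m for _ in range(n)]
    params["W_" + gate] = [[0.0] * n for _ in range(n)]
    params["b_" + gate] = [0.0] * n
params["b_gamma"] = [20.0] * n   # carry gate saturated near one

x_prev = [0.5, -0.25]
x_next = gru_step(x_prev, [1.0], params)   # state is carried over almost unchanged
```

In a low-rank variant, each state-to-state product inside `affine` would be replaced by a two-stage product with a pair of thin factors.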
We turn this architecture into the Low-rank GRU architecture by redefining each of the $W$ matrices as the product of two matrices with inner dimension $d$. For the memory tasks, which turned out to be difficult for the low-rank parametrization, we also consider the low-rank plus diagonal parametrization. We also applied the low-rank plus diagonal parametrization for the sequential permuted MNIST task.
In our experiments we optimized using RMSProp (Tieleman & Hinton, 2012) with gradient component clipping at . Code is available online (https://github.com/Avmb/lowrankgru). Our code is based on the uRNN code published by the original authors (https://github.com/amarshah/complex_RNN), specifically on the LSTM implementation, for the sake of a fair comparison. In order to achieve convergence on the memory task, however, we had to slightly modify the optimization procedure: we replaced gradient component clipping with gradient norm clipping (with NaN detection and recovery), and we added a small term in the parameter update formula. No modifications of the original optimizer implementation were required for the other tasks.
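The difference between the two clipping schemes can be sketched as follows. The function names and thresholds are hypothetical, and the NaN handling shown here (returning `None` so that the caller can skip the update) is only one possible recovery strategy, assumed for illustration; the text does not specify the one actually used.

```python
import math


def clip_by_component(grad, threshold):
    """Clamp each gradient component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]


def clip_by_norm(grad, max_norm):
    """Rescale the whole gradient so that its Euclidean norm is at most max_norm.

    Returns None when the norm is NaN, signalling the caller to skip this
    update (an assumed recovery strategy).
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if math.isnan(norm):
        return None
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


grad = [3.0, -4.0]
by_component = clip_by_component(grad, 1.0)   # direction changes: [1.0, -1.0]
by_norm = clip_by_norm(grad, 1.0)             # direction preserved: about [0.6, -0.8]
```

Norm clipping preserves the direction of the gradient, whereas component clipping can change it.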
We ran our experiments on the same machine as the experiments described in the previous section, with the exception of the largest sequential permuted MNIST experiment (lowrank plus diagonal GRU with , which was run on a machine with a Geforce GTX TITAN X GPU).
We will now present a short description of each task, the experimental details and results.
3.2.1 Memory task
The input of an instance of this task is a sequence of discrete symbols in a ten symbol alphabet , encoded as one-hot vectors. The first symbols in the sequence are "data" symbols i.i.d. sampled from , followed by "blank" symbols, then a distinguished "run" symbol , followed by more "blank" symbols. The desired output sequence consists of "blank" symbols followed by the "data" symbols as they appeared in the input sequence. Therefore the model has to remember the "data" symbol string over the temporal gap of size , which is challenging for a recurrent neural network when is large. In our experiment we set , which is the hardest setting explored in the uRNN work. The training set consists of training examples and validation/test examples.
The architecture is described by eq. (9), with an additional output layer with a dense matrix followed by a (biased) softmax. We train to minimize the cross-entropy loss.
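A generator for instances of this kind of task can be sketched as follows. The concrete layout is an assumption made for illustration: the indices chosen for the "blank" and "run" symbols, the number of data symbols, and the exact number of blanks on each side of the "run" symbol are hypothetical parameters, not values taken from the text.

```python
import random

ALPHABET_SIZE = 10      # ten symbol alphabet, as in the task description
NUM_DATA_SYMBOLS = 8    # assumed: symbols 0..7 carry data
BLANK, RUN = 8, 9       # assumed indices of the "blank" and "run" symbols


def make_memory_example(data_len, gap_len, rng):
    """Return (inputs, targets) as lists of symbol indices of equal length."""
    data = [rng.randrange(NUM_DATA_SYMBOLS) for _ in range(data_len)]
    inputs = data + [BLANK] * (gap_len - 1) + [RUN] + [BLANK] * data_len
    targets = [BLANK] * (gap_len + data_len) + data
    return inputs, targets


def one_hot(symbol, size=ALPHABET_SIZE):
    return [1.0 if i == symbol else 0.0 for i in range(size)]


rng = random.Random(0)
inputs, targets = make_memory_example(data_len=3, gap_len=5, rng=rng)
encoded_inputs = [one_hot(s) for s in inputs]   # what the network would receive
```

The model only sees the data symbols at the start of the sequence and must reproduce them after the "run" symbol, so the required memory span grows with `gap_len`.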
We were able to solve this task using a GRU with full recurrent matrices with state size , learning rate , minibatch size , initial bias of the carry functions (the ”update” gates) , however this model has many more parameters, nearly in the recurrent layer only, than the uRNN work which has about , and it converges much more slowly than the uRNN.
We were not able to achieve convergence with a pure low-rank model without exceeding the number of parameters of the fully connected model, but we achieved fast convergence with a low-rank plus diagonal model with , with other hyperparameters set as above. This model has still more parameters ( in the recurrent layer, total) than the uRNN model and converges more slowly but still reasonably fast, reaching test cross-entropy nats and almost perfect classification accuracy in less than updates.
We also consider two variants of this task which are difficult for the uRNN model. For both these tasks we used the same settings as above except that the task size parameter is set at for consistency with the works that introduced these variants.
In the variant of Danihelka et al. (2016), the length of the sequence to be remembered is randomly sampled between and for each sequence. They manage to achieve fast convergence with their Associative LSTM architecture with parameters, and slower convergence with standard LSTM models. Our low-rank plus diagonal GRU architecture, which has fewer parameters than their Associative LSTM, performs comparably or better, reaching test cross-entropy nats and almost perfect classification accuracy in less than updates.
In the variant of Henaff et al. (2016), the length of the sequence to be remembered is fixed at but the model is expected to copy it after a variable number of time steps randomly chosen, for each sequence, between and . The authors achieve slow convergence with a standard LSTM model, while our low-rank plus diagonal GRU architecture achieves fast convergence, reaching test cross-entropy nats and almost perfect classification accuracy in less than updates, and perfect test accuracy in updates.
3.2.2 Addition task
For each instance of this task, the input sequence has length and consists of two real-valued components: at each step the first component is independently sampled from the interval with uniform probability, while the second component is equal to zero everywhere except at two randomly chosen time steps, one in each half of the sequence, where it is equal to one. The result is a single real value computed from the final state which we want to be equal to the sum of the two elements of the first component of the sequence at the positions where the second component was set at one. In our experiment we set . The training set consists of training examples and validation/test examples.
We use a Low-rank GRU with input matrix, output matrix and (biased) identity output activation. We train to minimize the mean squared error loss. We use the following hyperparameter configuration: State size , maximum rank . This results in approximately parameters in the recurrent hidden layer. Learning rate was set at , minibatch size , initial bias of the carry functions (the "update" gates) was set to .
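An instance generator for this task can be sketched as follows. The sampling range of the first component (the unit interval) and the sequence length used in the example are assumptions made for illustration, as are the function and variable names.

```python
import random


def make_addition_example(length, rng):
    """Return (inputs, target): inputs is a list of (value, marker) pairs."""
    values = [rng.random() for _ in range(length)]   # assumed range: [0, 1)
    half = length // 2
    first = rng.randrange(0, half)          # marked position in the first half
    second = rng.randrange(half, length)    # marked position in the second half
    markers = [0.0] * length
    markers[first] = 1.0
    markers[second] = 1.0
    target = values[first] + values[second]
    return list(zip(values, markers)), target


rng = random.Random(0)
inputs, target = make_addition_example(length=10, rng=rng)
marked_positions = [i for i, (_, m) in enumerate(inputs) if m == 1.0]
```

The model has to ignore all unmarked values and carry the first marked value across an arbitrarily long stretch of the sequence before adding the second one.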
We trained on minibatches, obtaining a mean squared error on the test set of , which is a better result than the one reported in the uRNN article, in terms of training time and final accuracy.
3.2.3 Sequential MNIST task
This task consists of handwritten digit classification on the MNIST dataset with the caveat that the input is presented to the model one pixel value at a time, over 784 time steps (one per pixel of the 28 × 28 image). To further increase the difficulty of the task, the inputs are reordered according to a random permutation (fixed for all the task instances).
We use a Low-rank GRU with input matrix, output matrix and (biased) softmax output activation.
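The input preparation for this task can be sketched as follows: a single permutation is drawn once and then applied to the flattened pixels of every image. The function names, the seed and the tiny six-pixel "image" are hypothetical and serve only as an illustration.

```python
import random


def make_permutation(num_pixels, seed):
    """Draw one fixed permutation of the pixel positions."""
    perm = list(range(num_pixels))
    random.Random(seed).shuffle(perm)
    return perm


def to_pixel_sequence(image, perm):
    """Turn a flattened image into a sequence with one pixel per time step."""
    return [image[k] for k in perm]


perm = make_permutation(num_pixels=6, seed=0)   # shared by all task instances
image_a = [10, 11, 12, 13, 14, 15]
image_b = [20, 21, 22, 23, 24, 25]
sequence_a = to_pixel_sequence(image_a, perm)
sequence_b = to_pixel_sequence(image_b, perm)
```

Because the permutation destroys the spatial locality of neighboring pixels, the model cannot rely on short-range structure and has to integrate information over the whole sequence.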
Learning rate was set at , minibatch size , initial bias of the carry functions (the ”update” gates) was set to .
We considered two hyperparameter configurations:

State size , maximum rank .

State size , maximum rank .
Configuration 1 reaches a validation accuracy of in iterations. Final test accuracy is . The reported uRNN accuracy is . Our model however takes to reach a validation accuracy comparable to the final accuracy of the uRNN model, which is instead reached in about iterations.
Configuration 2 reaches a validation accuracy of in iterations, with test accuracy of . Note that even with the rather extreme bottleneck of , this model performs well.
For this task, we also consider three low-rank plus diagonal parametrizations. We report the best validation accuracy and test accuracy results, in addition to the results for a full-rank baseline GRU:

State size , maximum rank . Validation accuracy: , test accuracy: .

State size , maximum rank . Validation accuracy: , test accuracy: .

State size , fullrank. Validation accuracy: , test accuracy: .

State size , maximum rank . Validation accuracy: , test accuracy: .
Note that the low-rank plus diagonal GRU is more accurate than the full-rank GRU with the same state size, while the low-rank GRU is slightly less accurate, indicating the utility of the diagonal component of the parametrization for this task.
These results surpass the uRNN and are on par with more complex architectures with time-skip connections (Zhang et al., 2016) (reported test set accuracy ). To our knowledge, at the time of this writing, the best result on this task is the LSTM with recurrent batch normalization by Cooijmans et al. (2016) (reported test set accuracy ). The architectural innovations of these works are orthogonal to our own and in principle they can be combined with it.
4 Conclusions and future work
We presented a framework that unifies the description of various types of recurrent and feedforward neural networks as passthrough neural networks.
We proposed low-dimensional parametrizations for passthrough neural networks based on low-rank or low-rank plus diagonal decompositions of the matrices that occur in the hidden layers.
We experimentally compared our models with state-of-the-art models, obtaining competitive results including a near state-of-the-art result for the randomly-permuted sequential MNIST task.
Our parametrizations are an alternative to the convolutional parametrizations explored by Srivastava et al. (2015); He et al. (2015); Kaiser & Sutskever (2015). We note that the two approaches can be combined in at least two ways:

A low-rank (plus diagonal) decomposition (with a suitable axis reshaping) can be applied to convolutional filter banks when the number of channels is large.

The "local" state acted on by the convolutional passthrough filters can be paired with a "global" state acted on by low-rank (plus diagonal) passthrough matrices. The global state is replicated on additional channels to update the local state and the local state is pooled to update the global state. This arrangement may be useful in particular in the Neural GPU (Kaiser & Sutskever, 2015) in order to augment the cellular automaton with "global variables", which would otherwise need to be replicated on the cell states and threaded over the computation.
Low-rank and low-rank plus diagonal parametrizations are linear; alternative parametrizations could include nonlinear activation functions, effectively replacing each hidden parameter matrix with an MLP, similar to the network-in-network approach of Lin et al. (2013).
We leave the exploration of these extensions to future work.
Acknowledgments
We thank Giuseppe Attardi and the Department of Computer Science of University of Pisa for letting us use their machines to run the experiments presented in this paper.
References
 Arjovsky et al. (2015) Arjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary evolution recurrent neural networks. CoRR, abs/1511.06464, 2015. URL http://arxiv.org/abs/1511.06464.
 Bahdanau et al. (2014) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.
 Bengio et al. (1994) Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
 Cho et al. (2014a) Cho, Kyunghyun, van Merriënboer, Bart, Bahdanau, Dzmitry, and Bengio, Yoshua. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014a.
 Cho et al. (2014b) Cho, Kyunghyun, van Merriënboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014b.
 Cooijmans et al. (2016) Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. Recurrent Batch Normalization. ArXiv e-prints, March 2016.
 Danihelka et al. (2016) Danihelka, I., Wayne, G., Uria, B., Kalchbrenner, N., and Graves, A. Associative Long Short-Term Memory. ArXiv e-prints, February 2016.
 Glorot et al. (2011) Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.
 Graves & Schmidhuber (2005) Graves, Alex and Schmidhuber, Jürgen. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
 Graves et al. (2013) Graves, Alex, Mohamed, Abdelrahman, and Hinton, Geoffrey E. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013. URL http://arxiv.org/abs/1303.5778.
 Graves et al. (2014) Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Greff et al. (2015) Greff, Klaus, Srivastava, Rupesh Kumar, Koutník, Jan, Steunebrink, Bas R, and Schmidhuber, Jürgen. Lstm: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
 Gregor et al. (2015) Gregor, Karol, Danihelka, Ivo, Graves, Alex, and Wierstra, Daan. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 He et al. (2015) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 Henaff et al. (2016) Henaff, M., Szlam, A., and LeCun, Y. Orthogonal RNNs and Long-Memory Tasks. ArXiv e-prints, February 2016.
 Hochreiter (1991) Hochreiter, Sepp. Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universität München, 1991.
 Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Iyyer et al. (2014) Iyyer, Mohit, Boyd-Graber, Jordan, Claudino, Leonardo, Socher, Richard, and Daumé III, Hal. A neural network for factoid question answering over paragraphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 633–644, 2014.
 Józefowicz et al. (2015) Józefowicz, Rafal, Zaremba, Wojciech, and Sutskever, Ilya. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pp. 2342–2350, 2015. URL http://jmlr.org/proceedings/papers/v37/jozefowicz15.html.
 Józefowicz et al. (2016) Józefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
 Kaiser & Sutskever (2015) Kaiser, Lukasz and Sutskever, Ilya. Neural gpus learn algorithms. CoRR, abs/1511.08228, 2015. URL http://arxiv.org/abs/1511.08228.
 Kalchbrenner et al. (2015) Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.
 Kalman (1982) Kalman, R.E. et al. System Identification from Noisy Data. Defense Technical Information Center, 1982. URL https://books.google.it/books?id=TdCNwAACAAJ.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Kurach et al. (2015) Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural random-access machines. CoRR, abs/1511.06392, 2015. URL http://arxiv.org/abs/1511.06392.
 Le et al. (2013) Le, Quoc, Sarlós, Tamás, and Smola, Alex. Fastfood - approximating kernel expansions in loglinear time. In Proceedings of the international conference on machine learning, 2013.
 Le et al. (2015) Le, Quoc V, Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
 LeCun et al. (2004) LeCun, Yann, Huang, Fu Jie, and Bottou, Leon. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II–97. IEEE, 2004.
 Lin et al. (2013) Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv preprint arXiv:1312.4400, 2013.
 Moczulski et al. (2015) Moczulski, Marcin, Denil, Misha, Appleyard, Jeremy, and de Freitas, Nando. ACDC: A structured efficient linear layer. CoRR, abs/1511.05946, 2015. URL http://arxiv.org/abs/1511.05946.
 Neelakantan et al. (2015) Neelakantan, Arvind, Le, Quoc V., and Sutskever, Ilya. Neural programmer: Inducing latent programs with gradient descent. CoRR, abs/1511.04834, 2015. URL http://arxiv.org/abs/1511.04834.
 Ning et al. (2015) Ning, Lipeng, Georgiou, Tryphon T, Tannenbaum, Allen, and Boyd, Stephen P. Linear models based on noisy data and the frisch scheme. SIAM Review, 57(2):167–197, 2015.
 Sak et al. (2014) Sak, Hasim, Senior, Andrew W, and Beaufays, Françoise. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH, pp. 338–342, 2014.
 Saunderson et al. (2012) Saunderson, James, Chandrasekaran, Venkat, Parrilo, Pablo A, and Willsky, Alan S. Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. SIAM Journal on Matrix Analysis and Applications, 33(4):1395–1416, 2012.
 Spearman (1904) Spearman, Charles. “General intelligence,” objectively determined and measured. The American Journal of Psychology, 15(2):201–292, 1904.
 Srivastava et al. (2014) Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Srivastava et al. (2015) Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
 Sun et al. (2015) Sun, Chen, Shetty, Sanketh, Sukthankar, Rahul, and Nevatia, Ram. Temporal localization of fine-grained actions in videos by domain transfer from web images. In Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, pp. 371–380. ACM, 2015.
 Tang (2013) Tang, Yichuan. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239, 2013.
 Tieleman & Hinton (2012) Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - rmsprop, 2012.
 Vinyals et al. (2014) Vinyals, Oriol, Kaiser, Lukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey. Grammar as a foreign language. arXiv preprint arXiv:1412.7449, 2014.
 Zhang et al. (2016) Zhang, Saizheng, Wu, Yuhuai, Che, Tong, Lin, Zhouhan, Memisevic, Roland, Salakhutdinov, Ruslan, and Bengio, Yoshua. Architectural complexity measures of recurrent neural networks. arXiv preprint arXiv:1602.08210, 2016.