Approximating Real-Time Recurrent Learning with Random Kronecker Factors
Abstract
Despite all the impressive advances of recurrent neural networks, sequential data is still in need of better modelling. Truncated backpropagation through time (TBPTT), the learning algorithm most widely used in practice, suffers from the truncation bias, which drastically limits its ability to learn long-term dependencies. The Real-Time Recurrent Learning algorithm (RTRL) addresses this issue, but its high computational requirements make it infeasible in practice. The Unbiased Online Recurrent Optimization algorithm (UORO) approximates RTRL with a smaller runtime and memory cost, but its gradients are noisy, which also limits its practical applicability. In this paper we propose the Kronecker Factored RTRL (KFRTRL) algorithm that uses a Kronecker product decomposition to approximate the gradients for a large class of RNNs. We show that KFRTRL is an unbiased and memory-efficient online learning algorithm. Our theoretical analysis shows that, under reasonable assumptions, the noise introduced by our algorithm is not only stable over time but also asymptotically much smaller than that of the UORO algorithm. We also confirm these theoretical results experimentally. Further, we show empirically that the KFRTRL algorithm captures long-term dependencies and almost matches the performance of TBPTT on real-world tasks by training Recurrent Highway Networks on a synthetic string memorization task and on the Penn TreeBank task, respectively. These results indicate that RTRL-based approaches might be a promising future alternative to TBPTT.
Asier Mujika†, Department of Computer Science, ETH Zürich, Switzerland, asierm@inf.ethz.ch
Florian Meier, Department of Computer Science, ETH Zürich, Switzerland, meierflo@inf.ethz.ch
Angelika Steger, Department of Computer Science, ETH Zürich, Switzerland, steger@inf.ethz.ch
† Author was supported by grant no. CRSII5_173721 of the Swiss National Science Foundation.
Preprint. Work in progress.
1 Introduction
Processing sequential data is a central problem in the field of machine learning. In recent years, Recurrent Neural Networks (RNNs) have achieved great success, outperforming all other approaches in many different sequential tasks such as machine translation, language modeling and reinforcement learning.
Despite this success, it remains unclear how to train such models. The standard algorithm, Truncated Backpropagation Through Time (TBPTT) [19], considers the RNN as a feed-forward model over time with shared parameters. While this approach works extremely well in the range of a few hundred timesteps, it scales very poorly to longer time dependencies. As the time horizon is increased, the parameters are updated less frequently and more memory is required to store all past states. This makes TBPTT ill-suited for learning long-term dependencies in sequential tasks.
An appealing alternative to TBPTT is Real-Time Recurrent Learning (RTRL) [20]. This algorithm allows online updates of the parameters and learning arbitrarily long-term dependencies by exploiting the recurrent structure of the network for forward propagation of the gradient. Despite its impressive theoretical properties, RTRL is impractical for decently sized RNNs because runtime and memory costs scale poorly with network size.
As a remedy to this issue, Tallec and Ollivier [17] proposed the Unbiased Online Recurrent Optimization algorithm (UORO). This algorithm computes an unbiased approximation of the gradients, which reduces the runtime and memory costs such that they are similar to the costs required to run the RNN forward. Unbiasedness is of central importance since it guarantees convergence to a local optimum. Still, the variance of the gradients slows down learning.
Here we propose the Kronecker Factored RTRL (KFRTRL) algorithm. This algorithm builds on the ideas of the UORO algorithm, but uses Kronecker factors for the RTRL approximation. We show both theoretically and empirically that this drastically reduces the noise in the approximation and greatly improves learning. However, this comes at the cost of requiring more computation and only being applicable to a class of RNNs. Still, this class of RNNs is very general and includes TanhRNN and Recurrent Highway Networks [21] among others.
The main contributions of this paper are:

We propose the KFRTRL online learning algorithm.

We theoretically prove that our algorithm is unbiased and that, under reasonable assumptions, the noise is stable over time and asymptotically by a factor smaller than the noise of UORO.

We test KFRTRL on a binary string memorization task where our networks can learn dependencies of up to steps.

We evaluate on character-level Penn TreeBank, where the performance of KFRTRL almost matches that of TBPTT for steps.

We empirically confirm that the variance of KFRTRL is stable over time and that increasing the number of units does not increase the noise significantly.
2 Related Work
Training Recurrent Neural Networks for finite-length sequences is currently almost exclusively done using Backpropagation Through Time [16] (BPTT). The network is "unrolled" over time and is considered as a feed-forward model with shared parameters (the same parameters are used at each time step). This makes it easy to run backpropagation and compute the exact gradients needed for gradient descent.
However, this approach does not scale well to very long sequences, as the whole sequence needs to be processed before calculating the gradients, which makes training extremely slow and very memory intensive. In fact, BPTT cannot be applied to an online stream of data. In order to circumvent this issue, Truncated Backpropagation Through Time [19] (TBPTT) is generally used. The RNN is only "unrolled" for a fixed number of steps (the truncation horizon) and gradients beyond these steps are ignored. Therefore, if the truncation horizon is smaller than the length of the dependencies needed to solve a task, the network cannot learn it.
Several approaches have been proposed to deal with the truncation horizon. Anticipated Reweighted Truncated Backpropagation [18] samples different truncation horizons and weights the calculated gradients such that the expected gradient is that of the whole sequence. Jaderberg et al. [6] proposed Decoupled Neural Interfaces, where the network learns to predict incoming gradients from the future. Then, it uses these predictions for learning. The main assumption of this model is that all future gradients can be computed as a function of the current hidden state.
A more extreme proposal is calculating the gradients forward and not doing any kind of BPTT. This is known as Real-Time Recurrent Learning [20] (RTRL). RTRL allows updating the model parameters online after observing each input/output pair; we explain it in detail in Section 3. However, its large running time of order $O(n^4)$ and memory requirements of order $O(n^3)$, where $n$ is the number of units of a fully connected RNN, make it impractical for large networks. To fix this, Tallec and Ollivier [17] presented the Unbiased Online Recurrent Optimization (UORO) algorithm. This algorithm approximates RTRL using a low-rank matrix. This makes the runtime of the algorithm of the same order as a single forward pass in an RNN, $O(n^2)$. However, the low-rank approximation introduces a lot of variance, which negatively affects learning as we show in Section 5.
Other alternatives are Reservoir computing approaches [9] like Echo State Networks [7] or Liquid State Machines [10]. In these approaches, the recurrent weights are fixed and only the output connections are learned. This allows online learning, as gradients do not need to be propagated back in time. However, it prevents any kind of learning in the recurrent connections, which makes the RNN computationally much less powerful.
3 Real-Time Recurrent Learning and UORO
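RTRL [20] is an online learning algorithm for RNNs. Contrary to TBPTT, no previous inputs or network states need to be stored. At any timestep $t$, RTRL only requires the hidden state $h_{t-1}$, the input $x_t$ and $\frac{dh_{t-1}}{d\theta}$ in order to compute $\frac{dh_t}{d\theta}$. With $\frac{dh_t}{d\theta}$ at hand, $\frac{dL_t}{d\theta} = \frac{dL_t}{dh_t}\frac{dh_t}{d\theta}$ is obtained by applying the chain rule. Thus, the parameters can be updated online, that is, for each observed input/output pair one parameter update can be performed.
In order to present the RTRL update precisely, let us first define an RNN formally. An RNN is a differentiable function $f$ that maps an input $x_t$, a hidden state $h_{t-1}$ and parameters $\theta$ to the next hidden state $h_t = f(x_t, h_{t-1}, \theta)$. At any timestep $t$, RTRL computes $\frac{dh_t}{d\theta}$ by applying the chain rule:
$$\frac{dh_t}{d\theta} = \frac{\partial h_t}{\partial h_{t-1}}\frac{dh_{t-1}}{d\theta} + \frac{\partial h_t}{\partial x_t}\frac{dx_t}{d\theta} + \frac{\partial h_t}{\partial \theta}\frac{d\theta}{d\theta} \tag{1}$$
$$\phantom{\frac{dh_t}{d\theta}} = \frac{\partial h_t}{\partial h_{t-1}}\frac{dh_{t-1}}{d\theta} + \frac{\partial h_t}{\partial \theta}, \tag{2}$$
where the middle term vanishes because we assume that the inputs do not depend on the parameters. For notational simplicity, define $G_t := \frac{dh_t}{d\theta}$, $H_t := \frac{\partial h_t}{\partial h_{t-1}}$ and $F_t := \frac{\partial h_t}{\partial \theta}$, which reduces the above equation to
$$G_t = H_t G_{t-1} + F_t. \tag{3}$$
Both $H_t$ and $F_t$ are straightforward to compute for RNNs. We assume $h_0$ to be fixed, which implies $G_0 = 0$. With all this, RTRL obtains the exact gradient for each timestep and enables online updates of the parameters. However, updating the parameters means that $G_t$ is only exact in the limit where the learning rate is arbitrarily small. In practice learning rates are sufficiently small such that this is not an issue.
The downside of RTRL is that for a fully connected RNN with $n$ units the matrices $H_t$ and $G_t$ have size $n \times n$ and $n \times n^2$, respectively. Therefore, computing Equation 3 takes $O(n^4)$ operations and requires $O(n^3)$ storage, which makes RTRL impractical for large networks.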
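As an illustration, the recursion $G_t = H_t G_{t-1} + F_t$ can be sketched in NumPy for a single-matrix tanh cell. The cell $h_t = \tanh(\hat h W)$, the function name `rtrl_step` and the row-major flattening of the parameters are assumptions made for this sketch only; they are not taken from the experiments in this paper.

```python
import numpy as np

def rtrl_step(W, h_prev, x, G_prev):
    """One exact RTRL step for the toy cell h_t = tanh(h_hat @ W).

    h_hat = [h_prev, x] has m entries, W has shape (m, n), and G_prev has
    shape (n, m * n): it stores dh_{t-1}/dtheta for theta = W.reshape(-1).
    """
    n = h_prev.size
    h_hat = np.concatenate([h_prev, x])
    h = np.tanh(h_hat @ W)                   # next hidden state
    d = 1.0 - h ** 2                         # derivative of tanh at the pre-activation
    H = d[:, None] * W[:n, :].T              # H_t = dh_t/dh_{t-1}, shape (n, n)
    F = np.kron(h_hat[None, :], np.diag(d))  # F_t = dh_t/dtheta, shape (n, m * n)
    G = H @ G_prev + F                       # the product H @ G_prev dominates the cost
    return h, G
```

The product `H @ G_prev` multiplies an $n \times n$ matrix with an $n \times O(n^2)$ matrix, which is where the $O(n^4)$ runtime and $O(n^3)$ storage of RTRL come from.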
The UORO algorithm [17] addresses this issue and reduces runtime and memory requirements to $O(n^2)$ at the cost of obtaining an unbiased but noisy estimate of $G_t$. More precisely, the UORO algorithm keeps an unbiased rank-one estimate of $G_t$ by approximating $G_t$ as the outer product of two vectors of size $n$ and size $n^2$, respectively. At any time $t$, the UORO update consists of two approximation steps. Assume that the unbiased approximation $G'_{t-1}$ of $G_{t-1}$ is given. First, $F_t$ is approximated by a rank-one matrix. Second, the sum of two rank-one matrices, $H_t G'_{t-1}$ and the approximation of $F_t$, is approximated by a rank-one matrix yielding the estimate $G'_t$ of $G_t$. The estimate is provably unbiased and the UORO update requires the same runtime and memory as updating the RNN [17].
4 Kronecker Factored RTRL
Our proposed Kronecker Factored RTRL algorithm (KFRTRL) is an online learning algorithm for RNNs, which does not require storing any previous inputs or network states. KFRTRL approximates the derivative of the internal state with respect to the parameters (see Section 3) by a Kronecker product. The following theorem shows that the KFRTRL algorithm satisfies various desirable properties.
Theorem 1.
For the class of RNNs defined in Lemma 1, the estimate $G'_t$ obtained by the KFRTRL algorithm satisfies

$G'_t$ is an unbiased estimate of $G_t$, that is $\mathbb{E}[G'_t] = G_t$, and

assuming that the spectral norm of is at most for some arbitrary small , then at any time , the mean of the variances of the entries of is of order .
Moreover, one timestep of the KFRTRL algorithm requires $O(n^3)$ operations and $O(n^2)$ memory.
We remark that the class of RNNs defined in Lemma 1 contains many widely used RNN architectures like Recurrent Highway Networks and TanhRNNs, but does not include GRUs [4] or LSTMs [5]. Further, the assumption on the spectral norm is reasonable, as otherwise gradients might grow exponentially as noted by Bengio et al. [2]. Lastly, the bottleneck of the algorithm is a matrix multiplication, and thus for sufficiently large matrices a faster matrix multiplication algorithm may be practical.
In the remainder of this section, we explain the main ideas behind the KFRTRL algorithm (formal proofs are given in the appendix). In Section 5 we show that these theoretical properties carry over to practice. KFRTRL is well suited for learning long-term dependencies (see Section 5.1) and almost matches the performance of TBPTT on a complex real-world task, that is, character-level language modeling (see Section 5.2). Moreover, we confirm empirically that the variance of the KFRTRL estimate is stable over time and scales well as the network size increases (see Section 5.3).
Before giving the theoretical background and motivating the key ideas of KFRTRL, we give a brief overview of the KFRTRL algorithm. At any timestep $t$, KFRTRL maintains a vector $u_t$ and a matrix $A_t$, such that $G'_t = u_t \otimes A_t$ satisfies $\mathbb{E}[G'_t] = G_t$. Both $H_t G'_{t-1}$ and $F_t$ are factored as a Kronecker product, and then the sum of these two Kronecker products is approximated by one Kronecker product. This approximation step (see Lemma 2) works analogously to the second approximation step of the UORO algorithm (see the rank-one trick in [17]). The detailed algorithmic steps of KFRTRL are presented in Algorithm 1 and motivated below.
Theoretical motivation of the KFRTRL algorithm
The key observation that motivates our algorithm is that for many popular RNN architectures $F_t$ can be exactly decomposed as the Kronecker product of a vector and a diagonal matrix, see Lemma 1. Such a decomposition exists if every parameter affects exactly one element of $h_t$ assuming $h_{t-1}$ is fixed. This condition is satisfied by many popular RNN architectures like TanhRNN and Recurrent Highway Networks. The class of RNNs considered in the following lemma contains all these RNN architectures.
Lemma 1.
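Assume the learnable parameters $\theta$ are a set of matrices $W^1, \dots, W^r$, let $\hat{h}_{t-1}$ be the hidden state $h_{t-1}$ concatenated with the input $x_t$ and let $z^k = \hat{h}_{t-1} W^k$ for $k = 1, \dots, r$. Assume that $h_t$ is obtained by pointwise operations over the $z^k$'s, that is, $(h_t)_j = f(z^1_j, \dots, z^r_j)$, where $f$ is such that $\frac{\partial f}{\partial z^k_j}$ is bounded by a constant. Let $D^k \in \mathbb{R}^{n \times n}$ be the diagonal matrix defined by $D^k_{jj} = \frac{\partial (h_t)_j}{\partial z^k_j}$, and let $D = (D^1 | \dots | D^r)$. Then, it holds $\frac{\partial h_t}{\partial \theta} = \hat{h}_{t-1} \otimes D$.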
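The decomposition of Lemma 1 can be checked numerically on a small example. The sketch below assumes a toy gated cell with $r = 2$ weight matrices, $h = \tanh(z^1) \cdot \sigma(z^2)$, stores both matrices side by side in one array `W`, and flattens the parameters in row-major order; the cell and all function names are illustrative assumptions, not an architecture used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def next_state(W, h_hat):
    """Toy gated cell: h = tanh(z1) * sigmoid(z2), where [z1, z2] = h_hat @ W."""
    n = W.shape[1] // 2
    z = h_hat @ W
    return np.tanh(z[:n]) * sigmoid(z[n:])

def kron_factor_D(W, h_hat):
    """Return D = (D^1 | D^2) such that dh/dtheta = kron(h_hat, D)."""
    n = W.shape[1] // 2
    z = h_hat @ W
    t, s = np.tanh(z[:n]), sigmoid(z[n:])
    D1 = np.diag((1.0 - t ** 2) * s)   # dh_j / dz1_j
    D2 = np.diag(t * s * (1.0 - s))    # dh_j / dz2_j
    return np.hstack([D1, D2])         # shape (n, 2n)

def numerical_jacobian(W, h_hat, eps=1e-6):
    """Central finite differences of next_state with respect to W.reshape(-1)."""
    theta = W.reshape(-1)
    cols = []
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        plus = next_state((theta + e).reshape(W.shape), h_hat)
        minus = next_state((theta - e).reshape(W.shape), h_hat)
        cols.append((plus - minus) / (2 * eps))
    return np.stack(cols, axis=1)
```

Comparing `np.kron(h_hat[None, :], kron_factor_D(W, h_hat))` with `numerical_jacobian(W, h_hat)` should show agreement up to finite-difference error, because each weight only influences a single coordinate of the next state.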
Further, we observe that the sum of two Kronecker products can be approximated by a single Kronecker product. The following lemma, which is the analogue of Proposition in [15] for Kronecker products, states how this is achieved.
Lemma 2.
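Let $C = A_1 \otimes B_1 + A_2 \otimes B_2$, where the matrix $A_1$ has the same size as the matrix $A_2$ and $B_1$ has the same size as $B_2$. Let $c_1$ and $c_2$ be chosen independently and uniformly at random from $\{-1, 1\}$ and let $p_1, p_2 > 0$ be positive reals. Define $A' = c_1 p_1 A_1 + c_2 p_2 A_2$ and $B' = c_1 \frac{1}{p_1} B_1 + c_2 \frac{1}{p_2} B_2$. Then, $A' \otimes B'$ is an unbiased approximation of $C$, that is $\mathbb{E}[A' \otimes B'] = C$. Moreover, the variance of this approximation is minimized by setting the free parameters $p_i = \sqrt{\|B_i\| / \|A_i\|}$.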
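Since only two random signs are involved, the unbiasedness claimed in Lemma 2 can be verified exactly by averaging over all four sign combinations. The following NumPy sketch does this; the function names are hypothetical, the norm is taken to be the Frobenius norm, and all four input matrices are assumed to be non-zero so that the scaling factors are well defined.

```python
import itertools
import numpy as np

def kron_sum_estimate(A1, B1, A2, B2, c1, c2):
    """Single Kronecker-product estimate of A1 (x) B1 + A2 (x) B2 for signs c1, c2."""
    p1 = np.sqrt(np.linalg.norm(B1) / np.linalg.norm(A1))  # variance-minimizing scale
    p2 = np.sqrt(np.linalg.norm(B2) / np.linalg.norm(A2))
    A = c1 * p1 * A1 + c2 * p2 * A2
    B = (c1 / p1) * B1 + (c2 / p2) * B2
    return A, B

def expected_estimate(A1, B1, A2, B2):
    """Exact expectation of the estimate, enumerating the four sign choices."""
    total = 0.0
    for c1, c2 in itertools.product([-1.0, 1.0], repeat=2):
        A, B = kron_sum_estimate(A1, B1, A2, B2, c1, c2)
        total = total + np.kron(A, B)
    return total / 4.0
```

The cross terms $A_1 \otimes B_2$ and $A_2 \otimes B_1$ carry the factor $c_1 c_2$, which has mean zero, so `expected_estimate` should return $A_1 \otimes B_1 + A_2 \otimes B_2$ up to rounding.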
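Lastly, we show by induction that random vectors $u_t$ and random matrices $A_t$ exist, such that $G'_t = u_t \otimes A_t$ satisfies $\mathbb{E}[G'_t] = G_t$. Assume that $G'_{t-1} = u_{t-1} \otimes A_{t-1}$ satisfies $\mathbb{E}[G'_{t-1}] = G_{t-1}$. Equation 3 and Lemma 1 imply that
$$G_t = H_t\, \mathbb{E}[G'_{t-1}] + F_t = H_t\, \mathbb{E}[u_{t-1} \otimes A_{t-1}] + \hat{h}_{t-1} \otimes D_t. \tag{4}$$
Next, by linearity of expectation and since the first dimension of $u_{t-1}$ is $1$, it follows
$$G_t = \mathbb{E}[u_{t-1} \otimes (H_t A_{t-1})] + \hat{h}_{t-1} \otimes D_t. \tag{5}$$
Finally, we obtain by Lemma 2 for any $p_1, p_2 > 0$
$$G_t = \mathbb{E}\left[\left(c_1 p_1 u_{t-1} + c_2 p_2 \hat{h}_{t-1}\right) \otimes \left(c_1 \tfrac{1}{p_1} H_t A_{t-1} + c_2 \tfrac{1}{p_2} D_t\right)\right], \tag{6}$$
where the expectation is taken over the probability distribution of $u_{t-1}$, $A_{t-1}$, $c_1$ and $c_2$.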
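With these observations at hand, we are ready to present the KFRTRL algorithm. At any timestep $t$ we receive the estimates $u_{t-1}$ and $A_{t-1}$ from the previous timestep. First, compute $h_t$, $D_t$ and $H_t$. Then, choose $c_1$ and $c_2$ uniformly at random from $\{-1, +1\}$ and compute
$$u_t = c_1 p_1 u_{t-1} + c_2 p_2 \hat{h}_{t-1} \tag{7}$$
$$A_t = c_1 \tfrac{1}{p_1} H_t A_{t-1} + c_2 \tfrac{1}{p_2} D_t, \tag{8}$$
where $p_1 = \sqrt{\|H_t A_{t-1}\| / \|u_{t-1}\|}$ and $p_2 = \sqrt{\|D_t\| / \|\hat{h}_{t-1}\|}$. Lastly, our algorithm computes $\frac{dL_t}{dh_t} G'_t$, which is used for optimizing the parameters. For a detailed pseudocode of the KFRTRL algorithm see Algorithm 1. In order to see that $\frac{dL_t}{dh_t} G'_t$ is an unbiased estimate of $\frac{dL_t}{d\theta}$, we apply once more linearity of expectation: $\mathbb{E}\left[\frac{dL_t}{dh_t} G'_t\right] = \frac{dL_t}{dh_t} \mathbb{E}[G'_t] = \frac{dL_t}{dh_t} G_t = \frac{dL_t}{d\theta}$.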
One KFRTRL update has runtime $O(n^3)$ and requires $O(n^2)$ memory. In order to see the statement for the memory requirement, note that all involved matrices and vectors have $O(n^2)$ elements, except $G'_t$. However, we do not need to explicitly compute $G'_t$ in order to obtain $\frac{dL_t}{dh_t} G'_t$, because $\frac{dL_t}{dh_t} (u_t \otimes A_t) = u_t \otimes \left(\frac{dL_t}{dh_t} A_t\right)$ can be evaluated in this order. In order to see the statement for the runtime, note that $H_t$ and $A_{t-1}$ have both size $O(n) \times O(n)$. Therefore, computing $H_t A_{t-1}$ requires $O(n^3)$ operations. All other arithmetic operations trivially require runtime $O(n^2)$.
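The update described above can be sketched in NumPy as follows. This is a minimal sketch, not the reference implementation: it assumes a single-matrix tanh cell $h_t = \tanh(\hat h W)$ instead of a Recurrent Highway Network, Frobenius norms, and a fallback scale of 1 when a norm is zero (for example at the first step, where the previous estimate is zero). The names `kf_rtrl_step`, `kf_rtrl_gradient` and `_balance` are hypothetical.

```python
import numpy as np

def _balance(norm_b, norm_a):
    """Scale sqrt(|B| / |A|); falls back to 1.0 when either norm is zero."""
    return np.sqrt(norm_b / norm_a) if norm_a > 0 and norm_b > 0 else 1.0

def kf_rtrl_step(W, h_prev, x, u_prev, A_prev, signs):
    """One step for h_t = tanh(h_hat @ W); returns h_t and the factors (u_t, A_t)."""
    n = h_prev.size
    h_hat = np.concatenate([h_prev, x])
    h = np.tanh(h_hat @ W)
    d = 1.0 - h ** 2
    D = np.diag(d)                      # diagonal factor of dh_t/dtheta
    H = d[:, None] * W[:n, :].T         # dh_t/dh_{t-1}
    HA = H @ A_prev                     # the O(n^3) bottleneck
    c1, c2 = signs                      # independent random signs in {-1, +1}
    p1 = _balance(np.linalg.norm(HA), np.linalg.norm(u_prev))
    p2 = _balance(np.linalg.norm(D), np.linalg.norm(h_hat))
    u = c1 * p1 * u_prev + c2 * p2 * h_hat
    A = (c1 / p1) * HA + (c2 / p2) * D
    return h, u, A

def kf_rtrl_gradient(dL_dh, u, A):
    """Gradient estimate with the shape of W, without forming kron(u, A)."""
    return np.outer(u, dL_dh @ A)
```

In training, `signs` would be drawn afresh at every step, for instance with `rng.choice([-1.0, 1.0], size=2)` for a NumPy `Generator` called `rng`. Passing the signs explicitly keeps the step deterministic, which makes it possible to check unbiasedness by enumerating all sign sequences on a tiny network.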
Comparison of the KFRTRL with the UORO algorithm
Since the variance of the gradient estimate is directly linked to convergence speed and performance, let us first compare the variance of the two algorithms. Theorem 1 states that the mean of the variances of the entries of are of order . In the appendix, we show a slightly stronger statement, that is, if for all , then the mean of the variances of the entries is of order , where is the number of elements of . The bound is obtained by a trivial bound on the size of the entries of and and using . In the appendix, we show further that already the first step of the UORO approximation, in which is approximated by a rankone matrix, introduces noise of order . Assuming that all further approximations would not add any noise, then the same trivial bounds on yield a mean variance of . We conclude that the variance of KFRTRL is asymptotically by (at least) a factor smaller than the variance of .
Next, let us compare the generality of the algorithms when applying them to different network architectures. The KFRTRL algorithm requires that in one timestep each parameter only affects one element of the next hidden state (see Lemma 1). Although many widely used RNN architectures satisfy this requirement, in this respect the UORO algorithm is preferable, as it is applicable to arbitrary RNN architectures.
Finally, let us compare memory requirements and runtime of KFRTRL and UORO. In terms of memory requirements, both algorithms require $O(n^2)$ storage and perform equally well. In terms of runtime, KFRTRL requires $O(n^3)$, while UORO requires $O(n^2)$ operations. However, the faster runtime of UORO comes at the cost of worse variance and therefore worse performance. In order to reduce the variance of UORO by a factor $n$, one would need $n$ independent samples of $G'_t$. This could be achieved by reducing the learning rate by a factor of $n$, which would then require $n$ times more data, or by sampling $n$ times in parallel, which would require $n$ times more memory. Still, our empirical investigation shows that KFRTRL outperforms UORO, even when taking $n$ UORO samples of $G'_t$ to reduce the variance (see Figure 3). Moreover, for sufficiently large networks the scaling of the KFRTRL runtime improves by using fast matrix multiplication algorithms.
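The sample-count argument above rests on a standard fact about averaging independent unbiased estimates. As a brief reminder, with $X_1, \dots, X_k$ denoting $k$ independent, identically distributed estimates of one entry of the gradient (the symbols $X_i$, $\bar{X}$ and $k$ are notation introduced only for this remark):

```latex
\bar{X} = \frac{1}{k}\sum_{i=1}^{k} X_i, \qquad
\mathbb{E}[\bar{X}] = \mathbb{E}[X_1], \qquad
\operatorname{Var}(\bar{X}) = \frac{1}{k^2}\sum_{i=1}^{k}\operatorname{Var}(X_i)
                            = \frac{\operatorname{Var}(X_1)}{k}.
```

Averaging therefore preserves unbiasedness and divides the variance by the number of samples, at the price of a proportional increase in computation or memory.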
5 Experiments
In this section, we quantify how the reduced variance of KFRTRL, compared to that of UORO, affects learning. First, we evaluate the ability to learn long-term dependencies on a deterministic binary string memorization task. Since most real-world problems are more complex and of stochastic nature, we secondly evaluate the performance of the learning algorithms on a character-level language modeling task, which is a more realistic benchmark. For these two tasks, we also compare to learning with Truncated BPTT. Finally, we investigate the variance of KFRTRL and UORO by comparing to their exact counterpart, RTRL. For all experiments, we use a single-layer Recurrent Highway Network [21]. (Footnote 1: For implementation simplicity, we use instead of as nonlinearity function. Both functions have very similar properties, and therefore, we do not believe this has any significant effect.)
5.1 Copy Task
In the copy task experiment, a binary string is presented sequentially to an RNN. Once the full string has been presented, it should reconstruct the original string without any further information. Figure 5.1 shows several input-output pairs. We refer to the length of the string as . Figure 5.1 summarizes the results. The smaller variance of KFRTRL greatly helps learning faster and capturing longer dependencies. KFRTRL and UORO manage to solve the task on average up to and , respectively. As expected, TBPTT cannot learn dependencies that are longer than the truncation horizon.
5.2 Character level language modeling on the Penn Treebank dataset
A standard test for RNNs is character-level language modeling. The network receives a text sequentially, character by character, and at each timestep it must predict the next character. This task is very challenging, as it requires both long and short term dependencies. Additionally, it is highly stochastic, i.e. for the same input sequence there are many possible continuations, but only one is observed at each training step. Figure 2 and Table 1 summarize the results. Truncated BPTT outperforms both online learning algorithms, but KFRTRL almost reaches its performance and considerably outperforms UORO. For this task, the noise introduced in the approximation is more harmful than the truncation bias. This is probably the case because the short term dependencies dominate the loss, as indicated by the small difference between TBPTT with truncation horizon 5 and 25.
For this experiment we use the Penn TreeBank [11] dataset, which is a collection of Wall Street Journal articles. The text is lower cased and the vocabulary is restricted to 10K words. Out of vocabulary words are replaced by "<unk>" and numbers by "N". We split the data following Mikolov et al. [14]. The experimental setup is the same as in the copy task, and we pick the optimal learning rate from the same range. Apart from that, we reset the hidden state to the all zeros state with a fixed probability at each time step. This technique was introduced by Melis et al. [12] to improve the performance on the validation set, as the initial state for the validation is the all zeros state. Additionally, this helps the online learning algorithms, as it resets the gradient approximation, getting rid of stale gradients. Similar techniques have been shown [3] to also improve RTRL.
Table 1:
Name                 Validation   Test   #params
KFRTRL               1.77         1.72   K
UORO                 2.63         2.61   K
TBPTT5               1.64         1.58   K
TBPTT25              1.61         1.56   K
Merity et al. [13]   -            1.18   13.8M
5.3 Variance Analysis
With our final set of experiments, we empirically measure how the noise evolves over time and how it is affected by the number of units $n$. Here, we also compare to taking $n$ samples of UORO; we refer to this as UOROAVG. This brings the computation cost on par with that of KFRTRL, $O(n^3)$. For each experiment, we compute the angle between the gradient estimate and the exact gradient of the loss with respect to the parameters. Intuitively, measures how aligned the gradients are, even if the magnitude is different. Figure 3 shows that is stable over time and the noise does not accumulate for any of the three algorithms. Figure 3 shows that KFRTRL scales better with the number of units than both UOROAVG and UORO. In the first experiment, we run several untrained RHNs with units over the first characters of Penn TreeBank. In the second experiment, we compute after running RHNs with different number of units for steps on Penn TreeBank. We perform repetitions per experiment and plot the mean and standard deviation.
6 Conclusion
In this paper, we have presented the KFRTRL online learning algorithm. We have proven that it approximates RTRL in an unbiased way, and that under reasonable assumptions the noise is stable over time and much smaller than that of UORO, the only other previously known unbiased RTRL approximation algorithm. Additionally, we have empirically verified that the reduced variance of our algorithm greatly improves learning for the two tested tasks. In the first task, an RHN trained with KFRTRL effectively captures long-term dependencies (it learns to memorize binary strings of length up to ). In the second task, it almost matches the performance of TBPTT in a standard RNN benchmark, character-level language modeling on Penn TreeBank. More importantly, our work opens up interesting directions for future work, as even minor reductions of the noise could make the approach a viable alternative to TBPTT, especially for tasks with inherent long-term dependencies. For example, constraining the weights, constraining the activations or using some form of regularization could reduce the noise. Further, it may be possible to design architectures that make the approximation less noisy. Moreover, one might attempt to improve the runtime of KFRTRL by using approximate matrix multiplication algorithms or inducing properties on the matrix that allow for fast matrix multiplications, like sparsity or low rank. This work advances the understanding of how unbiased gradients can be computed, which is of central importance as unbiasedness is essential for theoretical convergence guarantees. Since RTRL-based approaches satisfy this key assumption, it is of interest to develop them further.
References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[3] T. Catfolis. A method for improving the real-time recurrent learning algorithm. Neural Networks, 6(6):807–821, 1993.
[4] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[5] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[6] M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
[7] H. Jaeger. The “echo state” approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.
[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.
[10] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.
[11] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[12] G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
[13] S. Merity, N. S. Keskar, and R. Socher. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.
[14] T. Mikolov, I. Sutskever, A. Deoras, H.-S. Le, S. Kombrink, and J. Cernocky. Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 2012.
[15] Y. Ollivier, C. Tallec, and G. Charpiat. Training recurrent networks online without backtracking. arXiv preprint arXiv:1507.07680, 2015.
[16] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
[17] C. Tallec and Y. Ollivier. Unbiased online recurrent optimization. arXiv preprint arXiv:1702.05043, 2017.
[18] C. Tallec and Y. Ollivier. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.
[19] R. J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2:490–501, 1990.
[20] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
[21] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.