Lie Access Neural Turing Machine
Abstract
Following the recent trend in explicit neural memory structures, we present a new design of an external memory, wherein memories are stored in a Euclidean key space $\mathbb{R}^n$. An LSTM controller performs reads and writes via specialized read and write heads. It can move a head by either providing a new address in the key space (aka random access) or moving from its previous position via a Lie group action (aka Lie access). In this way, the “L” and “R” instructions of a traditional Turing Machine are generalized to arbitrary elements of a fixed Lie group action. For this reason, we name this new model the Lie Access Neural Turing Machine, or LANTM.
We tested two different configurations of LANTM against an LSTM baseline in several basic experiments. We found the right configuration of LANTM to outperform the baseline in all of our experiments. In particular, we trained LANTM on addition of $k$-digit numbers for $2 \le k \le 16$, but it was able to generalize almost perfectly to $17 \le k \le 32$, all with a number of parameters 2 orders of magnitude below that of the LSTM baseline.
1 Introduction
Recurrent neural networks (RNNs) are powerful devices that, unlike conventional neural networks, are able to keep state across time. They achieved great results in diverse fields like machine translation [22, 3, 1], speech recognition [5, 2], image captioning [17, 12, 23], and many others. However, despite such advances, traditional RNNs still have trouble maintaining memory for long periods of time, presenting an obstacle to attaining humanlike general intelligence.
Following the pioneering work of Graves et al. [6] and Weston et al. [25], researchers have studied many variations of external memories equipped to RNNs or explicit memory structures which ameliorate the problem discussed above and obtained great results in applications like question answering [25, 21, 13], algorithm learning [6, 11, 10, 14, 27, 7], machine translation [11], and others. In this paper we propose a new variation of external memory.
In a conventional RAM used in personal computers, memory is stored at integer addresses, and access is either random or sequential. Here we replace the integers with $\mathbb{R}^n$, and to retrieve memory, the controller can either issue a brand new address or “drag” the previous address in some chosen “direction” (formally, apply a Lie group action to the previous address). The former is the analog of random access, and the latter is the analog of sequential access. We call the latter “Lie access,” with the meaning parametrized by a Lie group which specifies how this “dragging” is to be done. We call a model built around this concept of “Lie access” a Lie Access Neural Turing Machine, or LANTM. We give two specific implementations in section 3 and explore them in section 4 with several experiments. While we will refer to these implementations also as LANTMs, we want to stress that they are certainly not the only ways of instantiating the “Lie access” concept.
2 Background
2.1 Lie groups
We assume the reader has a basic knowledge of groups and group actions and a passing notion that Lie groups are just groups with “differentiable” operations. Such a background should enable one to understand the rest of this paper other than section 6. We refer readers who need slightly more exposition on these topics to Appendix B.1.
2.2 Recurrent Neural Networks
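Unlike the conventional feedforward neural network, a recurrent neural network (RNN) has self-connections. Mathematically, an RNN is a function $\rho: X \times H \to Y \times H$, where $X$ is the input space, $Y$ the output space, and $H$ the space of internal states. On input $(x^{(1)}, \ldots, x^{(T)}) \in X^T$ and with initial state $h^{(0)} \in H$, the RNN transitions into states $h^{(1)}, \ldots, h^{(T)}$ (internally) and returns a sequence $y^{(1)}, \ldots, y^{(T)}$ (externally) defined recursively by
$$(y^{(t)}, h^{(t)}) = \rho(x^{(t)}, h^{(t-1)}), \qquad t = 1, \ldots, T.$$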
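As a minimal sketch of this recurrence, the loop below unrolls a transition function over an input sequence. The names `unroll`, `step`, and `running_sum` are illustrative assumptions, not part of the model; `step` stands in for the RNN function, taking an input and the previous state and returning an output and the next state.

```python
def unroll(step, inputs, initial_state):
    """Apply step(x, h) -> (y, h_next) along a sequence.

    Returns the list of outputs (external) and the list of states (internal).
    """
    state = initial_state
    outputs, states = [], []
    for x in inputs:
        y, state = step(x, state)
        outputs.append(y)
        states.append(state)
    return outputs, states


def running_sum(x, h):
    """Toy transition function (hypothetical): the state is the sum so far."""
    h_next = h + x
    return h_next, h_next


outputs, states = unroll(running_sum, [1, 2, 3], 0)
```

Any function with the same signature, including a trained LSTM step, could be passed as `step`.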
In this work, we use a particular variant of RNN called the Long Short-Term Memory (LSTM) [8]. The LSTM’s hidden state consists of two variables $(c^{(t)}, h^{(t)})$, where $h^{(t)}$ is also the output to the external world (i.e. it fills the role of $y^{(t)}$ in the above description). The variable $c^{(t)}$ is the “memory” of the machine, designed to be maintained for a long time when necessary. There are many variants of LSTM. In this paper we define the transition function as follows:
where $\sigma$ is the logistic function. $i$, $f$, and $o$ are called the input, forget, and output gates, respectively, which multiplicatively modulate different quantities in the computation. The weights are trainable through backpropagation through time (BPTT) [24]. The undashed parts of figure 1 show a schematic of the equations above.
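For orientation, here is a sketch of one LSTM step under the common formulation without peephole connections; this is an assumption, and the variant used in this paper may differ in detail. The parameter names (`W_i`, `b_i`, and so on) and the helper `affine` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, vector, bias):
    """Return weights @ vector + bias for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, c_prev, h_prev, params):
    """One step of a standard LSTM cell (assumed variant, no peepholes)."""
    z = list(x) + list(h_prev)  # current input joined with previous output
    i = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]  # input gate
    f = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]  # forget gate
    o = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]  # output gate
    g = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]  # candidate
    # The memory keeps a gated share of its old value plus a gated candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The output is a gated, squashed view of the memory.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return c, h
```

With all weights and biases at zero every gate equals 0.5, so the memory is simply halved at each step, which makes the multiplicative role of the gates easy to see.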
In models with external memories, LSTM often serves as the controller [6, 7, 27]. This means that 1) the entire system carries state over time from both the LSTM and the external memory, 2) the LSTM controller collects reading from and computes additional instructions to the external memory, and 3) the LSTM possibly performs extra processing to return the desired output at each time point. The dashed parts of figure 1 demonstrate a typical such arrangement, in which represents the state of the memory, represents the reading from the memory, represents a subroutine used for reading from and writing to the memory. The entire system is now described by the recurrence defined by
where is a set of instructions to read from and write to the memory, as illustrated in figure 1. is usually a softmax layer that produces a distribution over all possible symbols in a language task such as those explored in this paper, and this is indeed the case with LANTM. In the next section, we show how LANTM implements .
3 Lie Access Memory
The Lie Access Neural Turing Machine (LANTM) is inspired by the external memory architecture of Neural Turing Machine (NTM): a neural network controller reads from and writes to a memory structure via specially designed, differentiable functions called “heads”. The heads themselves do not have any trainable parameters, so the only learning done is by the controller, and the entire network can be trained by gradient descent.
In a LANTM, the memory structure is a dictionary, with keys in a Euclidean space $\mathbb{R}^n$ for a fixed $n$, called the key space or address space, and with values (called memory vectors) in another Euclidean space $\mathbb{R}^m$ for a fixed $m$ (called the memory width). At each time step, each read head converts instructions from the controller to a read address that retrieves a reading from the memory by a weighted inverse squared law, to be elaborated below. Each write head converts instructions from the controller to a new memory vector and a new address, along with a scalar called the memory strength of the vector. Such a triple is essentially appended to the memory.
The most important hyperparameter of a LANTM is its choice of Lie group that acts on the key space. At each time step, the controller may emit new addresses for each head (random access) or issue Lie actions that change the old addresses (Lie access). One may imagine the key space to be a piece of paper, and the read and write heads to be stones placed on this paper. The controller is a hand that moves the stones from turn to turn. Sometimes it may lift a stone up and place it somewhere completely unrelated to its original position (random access); other times it may drag a stone along a chosen direction (Lie access). Thus Lie access generalizes sequential access in a conventional memory array to a continuous setting.
In the design discussed in this paper, there is no explicit erasure. However, the machine can theoretically store the exact negation of a memory vector at the same location to cancel out that memory, albeit the required precision to do so would probably be overwhelming.
What follows are details of the overview given above.
3.1 Read
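Let $M^{(t)}$ denote the set of memory vectors stored in the key space by time $t$. We choose a canonical ordering on this set, for example by time added, and write $m_i^{(t)}$ for the $i$th vector in this order. Denote by $a_i^{(t)}$ the corresponding address of $m_i^{(t)}$ and by $S_i^{(t)}$ the corresponding memory strength of $m_i^{(t)}$. In this section we introduce two weight schemes for retrieving a value from the memory via an address. The main idea of both is summarized by figure 2.
The read key $k^{(t)}$ produces weightings $w_i^{(t)}$ over all memory vectors $m_i^{(t)}$, each with address $a_i^{(t)}$, by normalizing their inverse squared distances and multiplying by their strengths $S_i^{(t)}$:
$$w_i^{(t)} = \frac{S_i^{(t)} \, \|k^{(t)} - a_i^{(t)}\|^{-2}}{\sum_j S_j^{(t)} \, \|k^{(t)} - a_j^{(t)}\|^{-2}},$$
with the convention that it takes the limit value when $k^{(t)} = a_i^{(t)}$ for some $i$.
Footnote 1: In practice, as the formula for $w_i^{(t)}$ can induce numerical instability as $\|k^{(t)} - a_i^{(t)}\| \to 0$ for some $i$, we adjust the formula with a small $\epsilon > 0$, so that
$$w_i^{(t)} = \frac{S_i^{(t)} \, (\|k^{(t)} - a_i^{(t)}\|^{2} + \epsilon)^{-1}}{\sum_j S_j^{(t)} \, (\|k^{(t)} - a_j^{(t)}\|^{2} + \epsilon)^{-1}}.$$
The reading is then defined as
$$\rho^{(t)} = \sum_i w_i^{(t)} m_i^{(t)}.$$
We call this method of converting a read key to a set of weightings via a polynomial law InvNormalize, or InvNorm for short, in contrast with the use of an exponential law in the case of the SoftMax weight scheme, which computes the weights as
$$w_i^{(t)} = \frac{S_i^{(t)} \exp\!\big(-\|k^{(t)} - a_i^{(t)}\|^{2} / T^{(t)}\big)}{\sum_j S_j^{(t)} \exp\!\big(-\|k^{(t)} - a_j^{(t)}\|^{2} / T^{(t)}\big)},$$
where $T^{(t)}$ is a temperature emitted by the controller at time $t$ that represents the certainty of its reading. The higher $T^{(t)}$ is, the more $w^{(t)}$ tends to be uniform.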
Given the ubiquity of SoftMax in the machine learning literature, one may consider it a natural choice for the weight scheme. But as will be seen in the experiments, InvNorm is crucial in making the Euclidean space work as an address space.
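The two weight schemes can be compared side by side in a short sketch. Several things here are assumptions rather than details taken from the paper: the function names, the default `eps` value, and the exact form of the SoftMax exponent (negative squared distance divided by the temperature).

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def invnorm_weights(key, addresses, strengths, eps=1e-9):
    """Strength-scaled inverse squared distances, normalized to sum to 1.

    eps (an assumed value) avoids division by zero when key hits an address.
    """
    raw = [s / (sq_dist(key, a) + eps) for a, s in zip(addresses, strengths)]
    total = sum(raw)
    return [r / total for r in raw]


def softmax_weights(key, addresses, strengths, temperature):
    """Exponential-law counterpart; a higher temperature gives flatter weights."""
    dists = [sq_dist(key, a) for a in addresses]
    d_min = min(dists)  # shifting by the minimum leaves the ratios unchanged
    raw = [s * math.exp(-(d - d_min) / temperature)
           for d, s in zip(dists, strengths)]
    total = sum(raw)
    return [r / total for r in raw]


def read(weights, vectors):
    """Weighted sum of the stored memory vectors."""
    width = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(width)]
```

One way to see the difference: under the polynomial law a read key that lands exactly on a stored address receives essentially all of the weight regardless of how far away the other addresses are, while under the exponential law the weights depend only on differences between squared distances.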
3.2 Write
There is no extra ingredient to writing other than adding the produced memory vector, its strength, and its address to the collection of memory vectors, strengths, and addresses. To ensure that memory selection by weighted average works well, we squash the values of the memory vector to $(-1, 1)$ by $\tanh$, but squashing by the logistic sigmoid function is also conceivable. Without such squashing, a memory vector with large values can dominate the output of a weight method despite having a low weight.
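A minimal sketch of the append-only write follows. The class name `LieAccessMemory` and its fields are hypothetical, and `tanh` is assumed as the squashing function.

```python
import math


class LieAccessMemory:
    """Append-only store of (memory vector, address, strength) triples."""

    def __init__(self):
        self.vectors = []
        self.addresses = []
        self.strengths = []

    def write(self, raw_vector, address, strength):
        # Squash each component into a bounded range (tanh is an assumption)
        # so that no single vector can dominate a weighted average.
        self.vectors.append([math.tanh(v) for v in raw_vector])
        self.addresses.append(tuple(address))
        self.strengths.append(strength)


memory = LieAccessMemory()
memory.write([100.0, 0.0], (0.0, 1.0), 0.5)
```

There is no erase operation in this sketch, matching the design described above, so the store only grows.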
3.3 Addressing procedure
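Here we describe how the read and write keys are produced. The procedure is the same for both read and write keys, so we assume that we are to compute a single key $k^{(t)}$. We describe the abstraction of the process over a fixed Lie group $G$ acting smoothly on the key space $\mathbb{R}^n$.
The controller emits 3 things: a candidate key $\tilde{k}^{(t)}$, a mixing coefficient, or gate, $g^{(t)} \in [0, 1]$ (via the sigmoid function), and an action $v^{(t)} \in G$ that we also call step. The gate $g^{(t)}$ mixes the previous key $k^{(t-1)}$ with the candidate key $\tilde{k}^{(t)}$ to produce a preaction key $\bar{k}^{(t)}$, which is transformed by $v^{(t)}$ to produce the final key $k^{(t)}$ (here $\cdot$ denotes group action):
$$\bar{k}^{(t)} = g^{(t)} \tilde{k}^{(t)} + (1 - g^{(t)}) k^{(t-1)}, \qquad k^{(t)} = v^{(t)} \cdot \bar{k}^{(t)}.$$
Figure 3 summarizes the addressing procedure.
In our experiments, the Lie group is the translation group $\mathbb{R}^2$ acting additively on the key space $\mathbb{R}^2$. This means that the controller outputs 2 numbers $(a, b)$, so that $v = (a, b)$ acts upon a key $(x, y)$ by
$$v \cdot (x, y) = (x + a, y + b).$$
Section C in the Appendix gives example implementations for the scaling rotation and the rotation groups acting on $\mathbb{R}^2$.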
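The addressing step for the translation group can be sketched as below. The function names are illustrative, and the gate convention (a gate of 1 selects the candidate key, a gate of 0 keeps the previous key) is an assumption chosen to agree with the later remark that saturating the gates to 0 keeps the head moving along a straight line.

```python
def mix(gate, candidate, previous):
    """Convex combination: gate=1 takes the candidate, gate=0 keeps previous."""
    return tuple(gate * c + (1.0 - gate) * p
                 for c, p in zip(candidate, previous))


def translate(action, key):
    """Translation group acting additively on the key space."""
    return tuple(k + a for k, a in zip(key, action))


def next_key(previous_key, candidate_key, gate, action):
    """Mix previous and candidate keys, then apply the Lie action."""
    pre_action_key = mix(gate, candidate_key, previous_key)
    return translate(action, pre_action_key)


# Pure Lie access: ignore the candidate and drag the previous key.
dragged = next_key((1.0, 2.0), (9.0, 9.0), 0.0, (0.5, -1.0))
# Pure random access: jump to the candidate and apply the identity action.
jumped = next_key((1.0, 2.0), (9.0, 9.0), 1.0, (0.0, 0.0))
```

Intermediate gate values blend the two modes, which is what keeps the whole procedure differentiable.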
3.4 Interpolation of Lie action
For readers unfamiliar with the Lie group examples mentioned below, we recommend a visit to section C in the Appendix.
For groups like the translation group $(\mathbb{R}^n, +)$, there is a well-defined convex interpolation between two elements that stays in the group. For some others like the scaling rotation group, the straight-line interpolation $s v + (1 - s) w$ for $s \in [0, 1]$ sometimes produces elements outside the group (in this case sometimes the elements cancel out and give 0), but does so with probability zero in a suitable sense.
Then, as for keys, we can let the controller output a candidate action $\tilde{v}^{(t)}$ and a mixing coefficient $\gamma^{(t)}$ to smoothly mix with the previous action $v^{(t-1)}$ to produce a final action
$$v^{(t)} = \gamma^{(t)} \tilde{v}^{(t)} + (1 - \gamma^{(t)}) v^{(t-1)}.$$
This allows the controller to “move in a straight line within the group of actions” by merely left saturating (i.e. squashing to 0) the gates $g^{(t)}$ and $\gamma^{(t)}$ for all $t$, so that $v^{(t)} = v^{(t-1)}$ and each key is the previous key moved by the same action. Of course, the “straight line” can actually be curved depending on the group. For example, when the group is the scaling rotation group, a typical “straight line” will be a spiral tending exponentially toward the origin or growing exponentially unbounded.
Even if a group doesn’t have a natural straight-line interpolation, there may be another way to mix two actions. In the case of the rotation group, we can just project a straight-line interpolation onto the circle (barring a measure zero chance of interpolating into the origin).
Footnote 2: There is, in fact, a canonical way to interpolate the most common Lie groups, including all of the groups mentioned above, based on the exponential map and the Baker-Campbell-Hausdorff formula [16], but the details are outside the scope of this paper and the computational cost, while acceptable in control theory settings, is too hefty for us. Interested readers are referred to [20] and [18].
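A small sketch of action mixing is given below, under two assumptions: translation actions are plain vectors, and rotations are represented as unit vectors $(\cos\theta, \sin\theta)$ so that projection onto the circle is a normalization. The function names are hypothetical.

```python
import math


def lerp_action(gamma, candidate, previous):
    """Straight-line mix of two actions; gamma=0 keeps the previous action."""
    return tuple(gamma * c + (1.0 - gamma) * p
                 for c, p in zip(candidate, previous))


def mix_rotation(gamma, candidate, previous):
    """Mix two rotations given as unit vectors, then project onto the circle."""
    x, y = lerp_action(gamma, candidate, previous)
    norm = math.hypot(x, y)
    if norm == 0.0:
        # The measure-zero case: the two rotations cancel out exactly.
        raise ValueError("interpolation hit the origin; no rotation defined")
    return (x / norm, y / norm)


halfway = mix_rotation(0.5, (1.0, 0.0), (0.0, 1.0))
```

For the translation group `lerp_action` alone suffices, since any convex combination of translations is again a translation.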
4 Experiments
In our experiments, the Lie group for both types of LANTM is the translation group acting on $\mathbb{R}^2$, and we used Lie action interpolation as specified above. We outline the most important experimental setup in the main text below but defer other details to Appendix A.
Footnote 3: Early on we experimented with the scaling rotation group, which produced acceptable results when input lengths were small but encountered numerical problems when input lengths were large due to exponentiating scale.
4.1 Permutation and arithmetic tasks
We tested the two variations of LANTM along with a baseline LSTM in an encoder-decoder setup (cf. [22]) on the copy, reverse, and bigram flip tasks as done in [7], as well as the double and addition tasks designed in a similar vein. Table 1 shows input/output templates for each permutation task.
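task  input  output

copy  $a_1 a_2 \cdots a_k$  $a_1 a_2 \cdots a_k$
reverse  $a_1 a_2 \cdots a_k$  $a_k a_{k-1} \cdots a_1$
bigramFlip  $a_1 a_2 a_3 a_4 \cdots a_{2k-1} a_{2k}$  $a_2 a_1 a_4 a_3 \cdots a_{2k} a_{2k-1}$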
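The three permutation tasks can be stated as target functions on a symbol sequence. The sketch below uses hypothetical function names and assumes that bigram flip swaps the two symbols within each consecutive pair, as the task name suggests.

```python
def copy_task(symbols):
    """Target for the copy task: the input itself."""
    return list(symbols)


def reverse_task(symbols):
    """Target for the reverse task: the input read backwards."""
    return list(reversed(symbols))


def bigram_flip_task(symbols):
    """Target for bigram flip: swap the symbols within each consecutive pair."""
    if len(symbols) % 2 != 0:
        raise ValueError("bigram flip needs an even number of symbols")
    flipped = []
    for i in range(0, len(symbols), 2):
        flipped.extend([symbols[i + 1], symbols[i]])
    return flipped


example = bigram_flip_task(list("abcd"))
```

Each function gives only the target sequence; the start, end, and output markers used during training are handled separately.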
In each arithmetic task, all numbers, input or output, are formatted with the least significant digit on the left and with zero padding. The double task takes an integer $x \in [0, 10^k)$ padded to $k$ digits and outputs $2x$, zero padded to $k+1$ digits. The addition task takes two integers $x, y \in [0, 10^k)$ padded to $k$ digits and interleaved, forming a length $2k$ input sequence, and outputs $x + y$, zero padded to $k+1$ digits. Table 2 shows example inputs and outputs for each task.
task  input  output  explanation 

double  
addition 
task  min train  max train  min test  max test 

copy  2  64  65  128 
reverse  2  64  65  128 
bigramFlip  2  32  33  64 
double  2  40  41  80 
addition  2  16  17  32 
The machines are first fed a learnable initial state and then provided with the input sequence, flanked by a start-of-input (SOI) symbol and a repetition of an end-of-input (EOI) symbol. The machines are to output the correct sequence during the response phase, which starts when they receive the first EOI symbol. The repetition of the EOI symbol effectively means that the correct symbols are not shown to the machines during answering, i.e. we do not use teacher forcing. The machines must also correctly emit an end-of-output (EOO) symbol to terminate their answers. Figure 5 is an example of inputs and correct outputs during a copy task.
As usual, prediction is performed via argmax but training is done by minimizing negative log likelihood. To evaluate the performance of the models, we compute the fraction of characters correctly predicted and the fraction of all answers completely correctly predicted, respectively called “fine score” and “coarse score” following [7].
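The two scores can be computed as in the sketch below. The function name is hypothetical, and two details are assumptions: characters are compared position by position, and the fine score is normalized by the total number of target characters.

```python
def fine_and_coarse_scores(predictions, targets):
    """Return (fine, coarse): per-character and whole-answer accuracy."""
    total_chars = 0
    correct_chars = 0
    correct_answers = 0
    for predicted, target in zip(predictions, targets):
        total_chars += len(target)
        # Positions missing from a short prediction simply count as wrong.
        correct_chars += sum(p == t for p, t in zip(predicted, target))
        if list(predicted) == list(target):
            correct_answers += 1
    return correct_chars / total_chars, correct_answers / len(targets)


fine, coarse = fine_and_coarse_scores(["abc", "abd"], ["abc", "abc"])
```

A single wrong character leaves the fine score high but drops that answer out of the coarse score, which is why the two can diverge sharply on long outputs.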
Task parameters and hyperparameters. We trained the models on the above tasks for input sizes summarized by table 3. For all tasks, the LANTM has a single-layer, 50-cell or 100-cell LSTM controller. The memory width (i.e. the size of each memory vector) is 20. For all tasks, the LSTM baseline has 4 layers, each with 256 cells. In the Appendix, the exact parameters for each model in each task are listed in table A.1, and other experimental details are given in section A. Notice that the LSTM has 2 orders of magnitude more parameters than the LANTM models.
Results. LANTM-InvNorm was able to master all tasks and generalize nearly perfectly to 2x the training sizes, as shown in table 4. LANTM-SoftMax did as well on the copy and double tasks but failed at all the others, performing worse than the LSTM baseline. The baseline itself learned tasks with smaller training input sizes (bigramFlip, double, addition) almost flawlessly, but generalization to 2x training size was inadequate on all tasks, with coarse score not exceeding 6%.
task  model  1x coarse  1x fine  2x coarse  2x fine

copy  LANTM-InvNorm  100%  100%  100%  100%
copy  LANTM-SoftMax  100%  100%  99%  100%
copy  LSTM  58%  97%  0%  52%
reverse  LANTM-InvNorm  100%  100%  100%  100%
reverse  LANTM-SoftMax  1%  12%  0%  4%
reverse  LSTM  65%  95%  0%  44%
bigramFlip  LANTM-InvNorm  100%  100%  99%  100%
bigramFlip  LANTM-SoftMax  12%  40%  0%  10%
bigramFlip  LSTM  98%  100%  4%  58%
double  LANTM-InvNorm  100%  100%  100%  100%
double  LANTM-SoftMax  100%  100%  100%  100%
double  LSTM  98%  100%  2%  60%
addition  LANTM-InvNorm  100%  100%  99%  100%
addition  LANTM-SoftMax  17%  61%  0%  29%
addition  LSTM  97%  100%  6%  64%
We tested the learned InvNorm model on larger, arbitrarily selected input sizes. The results are summarized by table 5. On permutation tasks, it generalized quite well when challenged by 4 times the training size, getting more than 90% of test problems correct. On the double task, its extrapolation performance was similar, with 86% coarse score on 4x training size. Notice that LANTM-InvNorm on several of the tasks (8x bigramFlip, 8x double, 4x addition) achieved high fine scores when extrapolating to large input sizes despite having low coarse scores. This suggests that the extrapolation errors systematically occur at the end of each output on those tasks.
task  4x coarse  4x fine  5x coarse  5x fine  8x coarse  8x fine 

copy  100%  100%  91%  100%  
reverse  91%  98%  12%  65%  
bigramFlip  96%  100%  12%  96%  
double  86%  99%  21%  90%  
addition  2%  95% 
We have created videos of the read and write locations of LANTM-InvNorm and LANTM-SoftMax while learning each of the 5 tasks, tracking their progress over time. They are available in the Supplementary Materials, with details explained in appendix D. In appendix E, we look at the behaviors of trained LANTM-InvNorm through their read and write locations, gate values, and example input/output to analyze what exactly they learned and where their extrapolation errors come from when challenged by extreme input lengths.
4.2 Python programs
The above problem setting is highly structured and favors the design of LANTM. In this task we trained the models on generated Python programs, following [26], a setting that is more natural. The dataset comprises 6 types of programs on integers: addition/subtraction, identity, multiplication with one small operand, small for loops, variable substitution, and ternary “if else” statements, as illustrated in table A.2.
The models are required to read the input program, which terminates with a “print” statement, and output the correct integer response, in reverse sequence, without being fed the correct answer (same as in our last experiment, but different from [26], which used teacher forcing). We performed curriculum learning, using the “mixed” strategy of [26], starting from 2-digit operands up to 4-digit operands. We evaluated the models on their coarse and fine scores on randomly sampled 4-digit programs. Training was done by RMSProp with learning rate 0.002, which was multiplied by 0.8 whenever the validation accuracy became lower than the highest of the last four.
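The learning rate rule admits a direct sketch. The function name is hypothetical, and the reading of “the highest of the last four” as the maximum over the four most recent earlier validation accuracies is an assumption.

```python
def decayed_learning_rates(val_accuracies, lr=0.002, factor=0.8, window=4):
    """Return the learning rate in force after each validation round."""
    history = []
    rates = []
    for accuracy in val_accuracies:
        recent = history[-window:]  # up to `window` earlier accuracies
        if recent and accuracy < max(recent):
            lr *= factor  # validation got worse than a recent best: decay
        history.append(accuracy)
        rates.append(lr)
    return rates


rates = decayed_learning_rates([0.10, 0.20, 0.15, 0.30])
```

In this example only the third validation round triggers a decay, because 0.15 falls below the earlier 0.20.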
Here the LSTM baseline is a single layer of 128 cells, and the LANTM models also have controllers of the same size. In addition, each LANTM model has memory size 128.
The results are summarized by table 6. We noted that the small loop programs were the most difficult program type, for which all models predicted less than half of the characters correctly, so we trained them in a separate experiment only on small loop programs. The results are given in table 7.
model  coarse  fine 

LSTM  35%  66% 
LANTM-InvNorm  39%  74%
LANTM-SoftMax  35%  67%
model  coarse  fine 

LSTM  0%  51% 
LANTM-InvNorm  0%  55%
LANTM-SoftMax  0%  55%
Here the advantage of LANTM over LSTM is not as dramatic. The memory accesses of LANTM were not nearly as orderly and neat as in the previous experiment, but rather erratic-looking. An interactive plot of example read and write locations and other state data of LANTM-InvNorm while learning small loops can be found in the Supplementary Materials.
4.3 Language modelling
Finally, we tested the models on the Penn Treebank corpus. To train and predict continuously, whenever the external memories of LANTMs filled up to 100 memory vectors, the oldest 60 vectors were discarded. As in the last experiment, the LSTM baseline is a single layer of 128 cells, and the LANTM models also have controllers of the same size. In addition, each LANTM model has memory size 128. We unrolled BPTT to 20 steps, and trained with Adagrad with learning rate 0.05, which was halved each time the validation perplexity exceeded that of the previous epoch.
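The memory truncation rule is simple enough to state in a few lines. The function name is hypothetical, and the memory is assumed to be a list ordered from oldest to newest entry.

```python
def truncate_memory(entries, capacity=100, drop=60):
    """Discard the oldest `drop` entries once the memory reaches `capacity`.

    `entries` is modified in place and also returned for convenience.
    """
    if len(entries) >= capacity:
        del entries[:drop]
    return entries


full = truncate_memory(list(range(100)))
not_full = truncate_memory(list(range(99)))
```

After a truncation 40 entries remain, so the memory holds between 40 and 100 vectors at any time during continuous operation.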
model  validation  test 

LSTM  130  124 
LANTM-InvNorm  128  123
LANTM-SoftMax  134  130
We observed that LANTM-InvNorm had its read and write locations at two distant clusters, so that its read weights were diffuse across the entire memory. This may be due to the repeated application of an (approximately) single Lie action over the long course of training, blowing up the magnitude of keys, which degrades random access, as the typical squashing functions of the controller limit the range of keys it can produce. This means that, rather than storing useful information at particular locations, the machine stored deltas at each time step, so that the whole memory averaged together gave the desired information. LANTM-SoftMax also exhibited the same behavior, but because high-fidelity access only required the read key to be much closer to the desired key than to other keys (rather than requiring its distance to be absolutely small, as with InvNorm), we cannot immediately infer that it also only stored deltas.
5 Related Works
Zaremba et al. [26] taught LSTM to evaluate simple Python programs via curriculum learning, which formed the basis of one of our experiments. Kalchbrenner et al. [11] arranged LSTM cells in a multidimensional grid to form the grid long short-term memory, and learned copy and addition tasks as well. Graves et al. [6] created the NTM, which has inspired much of the design in our work. Zhang et al. [29] found several tweaks to the NTM to improve its convergence and performance. Grefenstette et al. [7] designed smooth versions of stack, queue, and deque as external memories to an LSTM controller. Their unbounded memory and experimental setups were direct influences on this paper. Zaremba et al. [27] used reinforcement learning to obviate the need of the NTM to involve the entire memory during memory retrieval. Weston et al. [25] came upon similar ideas in the memory network as the NTM at around the same time, but with less focus on sequence learning and more on question answering (QA) tasks. Sukhbaatar et al. [21] improved on their results to give a memory network trainable via gradient descent end-to-end and allowing multiple adaptive memory queries (“multiple hops”), which help in complex relational reasoning. The dynamic memory network of Kumar et al. [13] added an episodic memory module similar to the multiple hops feature of Sukhbaatar et al.’s model, but which dynamically chose when to stop accessing memory rather than stopping after a fixed number of times. They achieved state-of-the-art results in several tasks such as QA and sequence modelling. Danihelka et al. [4] designed an external memory based on holographic reduced representations, which can store unlimited memory, but the larger the size, the noisier the retrieval. Kaiser et al. [10] created the neural GPU based on convolutional kernels, which learned long multiplication of binary numbers up to 20 bits but was able to generalize to 2000 bits. Kurach et al. [14] generalized the random access of conventional RAMs to create the Neural Random Access Machine, which learned simple algorithms and was able to generalize to larger lengths, and whose memory access during inference can be done in constant time. Neelakantan et al. [19] investigated adding gradient noise to training, and found that in many of the models mentioned above, this method improved the performance or allowed a greater percentage of random initializations to converge to the optimum.
6 Generalization and Theoretical Considerations
Footnote 4: This part mentions some advanced mathematical concepts but is not necessary to the understanding of the rest of the paper.
We want to stress that the model explained in section 3 is but one way to implement Lie access memory. Indeed, the Euclidean key space could be generalized to any Riemannian manifold equipped with a subgroup of its isometry group, as 1) a notion of metric is required in Lie access memory (hence the Riemannian part), and 2) one wants the ability to store and retrieve information in a “straight line,” which suggests that the Lie action be invariant with respect to the metric (hence the isometry part).
A potentially useful Riemannian manifold other than $\mathbb{R}^n$ is the hyperbolic space, specifically the Poincaré disk model [15]. As seen in the language modelling task, repeated application of a Lie action on $\mathbb{R}^n$ may blow up the magnitude of keys, degrading random access. The Poincaré disk model has its points in the (open) unit ball, which prevents this problem from occurring. The other standard Riemannian model, the sphere, is not quite as desirable in this setting, because it “wraps around” (i.e. is not acyclic, in homological/homotopic terms), which can confuse gradient descent.
7 Conclusion
In this paper we introduced Lie access memory and explored two different implementations in the experiments. The LANTM model with the InvNorm weight scheme performed better than the baseline in all tasks, and spectacularly so in sequence and addition tasks, where it learned to generalize to extraordinary lengths, whereas the model with the SoftMax weight scheme failed to outperform the baseline in the reverse, bigramFlip, addition, and language modelling tasks. LANTM-InvNorm held its largest advantage over LSTMs in the case of long, structured tasks.
The Python program experiment shows that in less structured environments or environments with redundant or useless information, our LANTM designs could not utilize their memory as impressively as in more structured environments. Thus further work needs to be done toward combining logical reasoning with natural language processing.
We adopted a simple way to turn the episodic nature of our unbounded memory to continuous use, but it was far from perfect. In the language modelling experiment, the LANTM models did not seem to use the memory in a remarkable way. Future work should explore different options for adapting Lie access memory to continuous tasks, for example, by bounding the memory or by using the Poincaré disk model as the underlying manifold as suggested in section 6.
References
 [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat], September 2014. arXiv: 1409.0473.
 [2] Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing Multimedia Content using Attention-based Encoder–Decoder Networks. arXiv:1507.01053 [cs], July 2015. arXiv: 1507.01053.
 [3] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat], June 2014. arXiv: 1406.1078.
 [4] Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative Long Short-Term Memory. arXiv:1602.03032 [cs], February 2016. arXiv: 1602.03032.
 [5] Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. Speech Recognition with Deep Recurrent Neural Networks. arXiv:1303.5778 [cs], March 2013. arXiv: 1303.5778.
 [6] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. arXiv:1410.5401 [cs], October 2014. arXiv: 1410.5401.
 [7] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to Transduce with Unbounded Memory. arXiv:1506.02516 [cs], June 2015. arXiv: 1506.02516.
 [8] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Comput., 9(8):1735–1780, November 1997.
 [9] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An Empirical Exploration of Recurrent Network Architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015.
 [10] Łukasz Kaiser and Ilya Sutskever. Neural GPUs Learn Algorithms. arXiv:1511.08228 [cs], November 2015. arXiv: 1511.08228.
 [11] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid Long Short-Term Memory. arXiv:1507.01526 [cs], July 2015. arXiv: 1507.01526.
 [12] Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. arXiv:1412.2306 [cs], December 2014. arXiv: 1412.2306.
 [13] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, and Richard Socher. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. arXiv:1506.07285 [cs], June 2015. arXiv: 1506.07285.
 [14] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural Random-Access Machines. arXiv:1511.06392 [cs], November 2015. arXiv: 1511.06392.
 [15] John Lee. Riemannian Manifolds: An Introduction to Curvature. Number 176 in Graduate Texts in Mathematics. Springer-Verlag, 1997.
 [16] John Lee. Introduction to Smooth Manifolds. Number 218 in Graduate Texts in Mathematics. Springer, 2nd edition, 2012.
 [17] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). arXiv:1412.6632 [cs], December 2014. arXiv: 1412.6632.
 [18] A. Marthinsen. Interpolation in Lie Groups. SIAM Journal on Numerical Analysis, 37(1):269–285, January 1999.
 [19] Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding Gradient Noise Improves Learning for Very Deep Networks. arXiv:1511.06807 [cs, stat], November 2015. arXiv: 1511.06807.
 [20] Tatiana Shingel. Interpolation in special orthogonal groups. IMA Journal of Numerical Analysis, 29(3):731–745, July 2009.
 [21] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-To-End Memory Networks. arXiv:1503.08895 [cs], March 2015. arXiv: 1503.08895.
 [22] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215 [cs], September 2014. arXiv: 1409.3215.
 [23] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555 [cs], November 2014. arXiv: 1411.4555.
 [24] Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
 [25] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory Networks. arXiv:1410.3916 [cs, stat], October 2014. arXiv: 1410.3916.
 [26] Wojciech Zaremba and Ilya Sutskever. Learning to Execute. arXiv:1410.4615 [cs], October 2014. arXiv: 1410.4615.
 [27] Wojciech Zaremba and Ilya Sutskever. Reinforcement Learning Neural Turing Machines - Revised. arXiv:1505.00521 [cs], May 2015. arXiv: 1505.00521.
 [28] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent Neural Network Regularization. arXiv:1409.2329 [cs], September 2014. arXiv: 1409.2329.
 [29] Wei Zhang, Yang Yu, and Bowen Zhou. Structured Memory for Neural Turing Machines. arXiv:1510.03931 [cs], October 2015. arXiv: 1510.03931.
Appendix A Experimental details
A.1 Permutation and arithmetic tasks
The baselines of our experiments are LSTMs in an encoder-decoder setup as described in [22]. We tested 2 variations of LANTM, with an InvNorm and a SoftMax address mechanism, along with the LSTM baseline, on the permutation and arithmetic tasks to be described. The Lie group for both types of LANTM is the translation group acting on $\mathbb{R}^2$. For both LANTMs and LSTM, we embed the input vocabulary continuously via a real embedding matrix into a Euclidean space before feeding into the models; we also pass the outputs through a softmax layer to arrive at probability distributions over the vocabulary set (this is the box in figure 1). As usual, prediction is performed via argmax but training is done by minimizing negative log likelihood.
The machines are first fed a learnable initial state and then provided with the input sequence, flanked by a start-of-input (SOI) symbol and a repetition of an end-of-input (EOI) symbol. The machines are to output the correct sequence during the response phase, which starts when they receive the first EOI symbol. The repetition of the EOI symbol effectively ensures that the correct symbols are not shown to the machines during answering. The machines must also correctly emit an end-of-output (EOO) symbol to terminate their answers. The LANTM models are not allowed to write to the memory during the response phase, so that there is more emphasis on collecting the right information during the input phase. Figure 5 is an example of inputs and correct outputs during a copy task.
Tasks. Each task has a length parameter . The permutation tasks include

copy
input: output: 
reverse
input: output: 
bigramFlip
input: output:
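As a hedged illustration of the three permutation tasks, the sketch below computes the expected output for a given input sequence. The function names and the list-of-symbols representation are assumptions made for this example, and the SOI, EOI and EOO framing is omitted. The pairwise swap used for bigramFlip follows the behaviour described in appendix E.1.3, and the input is assumed to have even length.

```python
from typing import List, Sequence


def copy_target(seq: Sequence[str]) -> List[str]:
    """copy: output the input unchanged."""
    return list(seq)


def reverse_target(seq: Sequence[str]) -> List[str]:
    """reverse: output the input in reverse order."""
    return list(reversed(seq))


def bigram_flip_target(seq: Sequence[str]) -> List[str]:
    """bigramFlip: swap the two symbols of each consecutive pair.

    Assumes an even-length input (a restriction chosen for this sketch).
    """
    if len(seq) % 2 != 0:
        raise ValueError("bigramFlip expects an even number of symbols")
    out: List[str] = []
    for i in range(0, len(seq), 2):
        out.extend([seq[i + 1], seq[i]])
    return out
```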
The arithmetic tasks include the following. Note that all numbers, input or output, are formatted with the least significant digits on the left and with zero padding.

double. Let be an integer in the range , with zero padding in front (on the right) to make up digits.
input: in base 10, zero padded to digits output: in base 10, zero padded to digits 
addition. Let and be integers in the range , with zero padding in front (on the right) to make up digits. If they have digits and , respectively, with the least significant digits on the left, then
input: output:
In other words, we interleave the inputs. Thus this is a different encoding of the addition problem from previous works like [26] and [9].
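The encoding described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact data generator: the helper names are hypothetical, the first operand is assumed to come first in each interleaved pair, and the output is assumed to be padded to one more digit than the operands so that a final carry fits.

```python
from typing import List, Tuple


def to_digits_lsd_first(x: int, width: int) -> List[int]:
    """Base-10 digits of x, least significant digit first, zero padded to `width`."""
    if x < 0 or x >= 10 ** width:
        raise ValueError("x does not fit in the requested number of digits")
    return [(x // 10 ** i) % 10 for i in range(width)]


def interleave(a: List[int], b: List[int]) -> List[int]:
    """Interleave two equal-length digit lists as a1 b1 a2 b2 ..."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of digits")
    return [d for pair in zip(a, b) for d in pair]


def double_example(x: int, k: int) -> Tuple[List[int], List[int]]:
    """Input/output pair for the double task (output width k + 1 is an assumption)."""
    return to_digits_lsd_first(x, k), to_digits_lsd_first(2 * x, k + 1)


def addition_example(x: int, y: int, k: int) -> Tuple[List[int], List[int]]:
    """Input/output pair for the addition task (output width k + 1 is an assumption)."""
    inp = interleave(to_digits_lsd_first(x, k), to_digits_lsd_first(y, k))
    return inp, to_digits_lsd_first(x + y, k + 1)
```

For instance, `addition_example(123, 989, 3)` interleaves the digit lists `[3, 2, 1]` and `[9, 8, 9]` and encodes the sum 1112 with its least significant digit first.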
Task parameters and hyperparameters. We trained the models on the above tasks for the input sizes summarized in table 3. For all tasks, the LANTM has a single-layer, 50-cell or 100-cell LSTM controller. The Lie group for all LANTMs is the translation group acting on the key space. The memory width (i.e. the size of each memory vector) is 20. For all tasks, the LSTM baseline has 4 layers, each with 256 cells. The exact setting of parameters for each model in each task is listed in table A.1.
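| Model | Task | LSTM size | Vocab | Embed. | Mem. width | LR | #Param |
|---|---|---|---|---|---|---|---|
| LANTM-InvNorm | copy | 50 | 128 | 7 | 20 | 0.02 | 26105 |
| LANTM-InvNorm | reverse | 50 | 128 | 7 | 20 | 0.02 | 26105 |
| LANTM-InvNorm | bigramFlip | 100 | 128 | 7 | 20 | 0.02 | 70155 |
| LANTM-InvNorm | addition | 50 | 14 | 14 | 20 | 0.01 | 20291 |
| LANTM-InvNorm | double | 50 | 14 | 7 | 20 | 0.02 | 18695 |
| LANTM-SoftMax | copy | 50 | 128 | 7 | 20 | 0.02 | 26156 |
| LANTM-SoftMax | reverse | 50 | 128 | 7 | 20 | 0.02 | 26156 |
| LANTM-SoftMax | bigramFlip | 100 | 128 | 10 | 20 | 0.02 | 72123 |
| LANTM-SoftMax | addition | 50 | 14 | 14 | 20 | 0.01 | 20291 |
| LANTM-SoftMax | double | 50 | 14 | 14 | 20 | 0.02 | 20291 |
| LSTM | copy | 4 layers × 256 | 128 | 7 | NA | 0.0002 | 1918222 |
| LSTM | reverse | 4 layers × 256 | 128 | 7 | NA | 0.0002 | 1918222 |
| LSTM | bigramFlip | 4 layers × 256 | 128 | 7 | NA | 0.0002 | 1918222 |
| LSTM | addition | 4 layers × 256 | 14 | 64 | NA | 0.0002 | 1918222 |
| LSTM | double | 4 layers × 256 | 14 | 64 | NA | 0.0002 | 1918222 |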
“Vocab” is the size of the vocabulary (i.e. the total number of possible characters of each input sequence). “Embed.” is the dimension of the embedding space. For example, if “Embed.” is 7, then each character is mapped to a 7-dimensional real vector. “Mem. width” is the size of each memory vector. “LR” is the learning rate. “#Param” gives the total number of trainable parameters.
Training and testing. We seek to minimize the negative log likelihood of the individual output characters given the input. All models are trained through RMSProp with momentum 0.95. Every epoch has 10 batches, and every batch has 32 instances of the task. For the LANTM models, after 100 epochs, we halve the learning rate if the best error so far has not improved in 30 epochs. The LSTMs are trained with learning rate 0.0002, with no learning rate adjustments during training.
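The learning rate rule for the LANTM models can be sketched as below. The class name and interface are hypothetical, and several details are assumptions not stated in the text: epochs are counted from 1, the stall counter keeps running during the first 100 epochs, and the counter is reset after each halving.

```python
class HalvingSchedule:
    """Halve the learning rate when the best error has stalled.

    Assumptions for this sketch: epochs are counted from 1, the stall counter
    keeps running during the warm-up, and it is reset after each halving.
    """

    def __init__(self, lr: float, warmup: int = 100, patience: int = 30):
        self.lr = lr
        self.warmup = warmup
        self.patience = patience
        self.best = float("inf")
        self.since_best = 0
        self.epoch = 0

    def step(self, error: float) -> float:
        """Record this epoch's error and return the learning rate to use next."""
        self.epoch += 1
        if error < self.best:
            self.best = error
            self.since_best = 0
        else:
            self.since_best += 1
        if self.epoch > self.warmup and self.since_best >= self.patience:
            self.lr /= 2.0
            self.since_best = 0
        return self.lr
```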
Since the training sets are large and separate from the test sets, we train until convergence, testing the models periodically (every 20 epochs for the LANTM models, and every 200 epochs for the LSTM baseline). After training is complete, the best test scores are tabulated.
We tested the models by drawing 100 batches of random problems and computing fine and coarse scores as in [7]. Fine score refers to the percentage of digits or characters (including the EOO marker) that the model correctly outputs. Coarse score refers to the percentage of total problems that the model answers completely correctly.
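A minimal sketch of the two scores follows. The function name is hypothetical, each target is assumed to already include the EOO marker, and the treatment of predictions that are shorter or longer than the target is an assumption of this example.

```python
from typing import Sequence, Tuple


def fine_and_coarse(
    predictions: Sequence[Sequence[str]], targets: Sequence[Sequence[str]]
) -> Tuple[float, float]:
    """Return (fine, coarse) scores as percentages.

    Predictions are compared with targets position by position; positions
    missing from a prediction count as errors (assumption of this sketch).
    """
    total_chars = 0
    correct_chars = 0
    correct_problems = 0
    for pred, tgt in zip(predictions, targets):
        hits = sum(1 for i, t in enumerate(tgt) if i < len(pred) and pred[i] == t)
        correct_chars += hits
        total_chars += len(tgt)
        # A problem counts as solved only if every character matches exactly.
        if hits == len(tgt) and len(pred) == len(tgt):
            correct_problems += 1
    fine = 100.0 * correct_chars / total_chars
    coarse = 100.0 * correct_problems / len(targets)
    return fine, coarse
```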
Tweaks to the LANTM model. We applied two tweaks to the LANTM model: 1) We initialized the mix coefficients for write address and action to strong negative values. This means that the LANTM would tend to write in a straight line. 2) We normalized the step sizes to approximately 1 but did not normalize the initial state step sizes. We found that these two tweaks improved convergence speed and consistency (footnote 6). Note that with the second tweak, the “group” of actions is no longer a group. This is akin to restricting the head shifts of an NTM to and [6].
Footnote 6: A video of the reads and writes of a LANTM-InvNorm learning the copy task with no biases (tweak 1) is available in the Supplementary Materials. Compare with the corresponding video with biases. Details of the videos can be found in appendix D.
a.2 python programs
There are 6 types of programs of integers: addition/subtraction, identity, multiplication with one small operand, small for loops, variable substitution, and ternary “ if else ” statements, as illustrated in table A.2.
| Program type | Input | Target |
|---|---|---|
| identity | print(4103) | 3014. |
| small mult. | print((14*5608)) | 21587. |
| if then else | print((4242 if 83026721 else 3716)) | 2424. |
| var. subst. | f=3184;print((f29)) | 33728 |
| addition | print((3547+7004)) | 15501. |
| small loop | b=1398;for x in range(10):b=6843;print(b) | 23076. |
The models were required to read the input program, which terminates with a “print” statement, and output the correct integer response, in reverse sequence, without being fed the correct answer (same as in sequence and arithmetic tasks, but different from [26], which used teacher forcing). The LANTM models were prohibited from writing during the answer phase, as above. All input symbols were embedded into before being fed to the machines.
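To make the target format concrete, the sketch below encodes an integer result as the reversed digit string expected from the models. The function name is hypothetical, and the trailing "." is read off the targets in table A.2 and assumed here to play the role of the end-of-output marker. Negative results are left out of this sketch.

```python
def encode_target(value: int) -> str:
    """Reverse the decimal digits of a non-negative result and append '.'."""
    if value < 0:
        raise ValueError("this sketch only covers non-negative results")
    return str(value)[::-1] + "."
```

Applied to the identity, small multiplication and addition rows of table A.2, this reproduces the listed targets.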
We performed curriculum learning, using the “mixed” strategy of [26], starting from 2-digit operands up to 4-digit operands. We evaluated the models on their coarse and fine scores on randomly sampled 4-digit programs. Training was done by RMSProp with learning rate 0.002, which was multiplied by 0.8 whenever the validation accuracy became lower than the highest of the last four. BPTT was always performed over the entire input and response phase.
The LSTM baseline had a single layer of 128 cells, as did the controllers of LANTM-InvNorm and LANTM-SoftMax, which also had a memory width of 128. This comes out to 127,890 parameters for the LSTM baseline and 212,149 parameters for the LANTM models. The LSTM was initialized to have weights uniformly in except that the forget gates are set to 1. The controllers of the LANTM models have weights initialized uniformly in and the forget gates set to 1 as well. There were no write biases or normalization of step sizes.
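One possible reading of the decay rule above is sketched here. The class name is hypothetical, and "the highest of the last four" is interpreted as the maximum of the four most recent validation accuracies before the current one, which is an assumption.

```python
from collections import deque


class DecayOnDrop:
    """Multiply the learning rate by `factor` when validation accuracy falls
    below the best of the previous `window` validation accuracies."""

    def __init__(self, lr: float = 0.002, factor: float = 0.8, window: int = 4):
        self.lr = lr
        self.factor = factor
        self.history = deque(maxlen=window)  # keeps only the last `window` values

    def step(self, val_accuracy: float) -> float:
        """Record a validation accuracy and return the updated learning rate."""
        if self.history and val_accuracy < max(self.history):
            self.lr *= self.factor
        self.history.append(val_accuracy)
        return self.lr
```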
a.3 language modelling
The Penn treebank corpus consists of 929k/73k/82k train/validation/test words, with a total vocabulary of 10k words. We followed [28] for preprocessing the corpus. We used batch size of 32.
We embed the words into before feeding into the models. The LSTM baseline is 1 layer of 128 cells, and the LANTM models have controllers of the same size, along with memory vectors in . This translates to 4,047,632 parameters for LSTM and 4,323,329 parameters for the LANTM models. The LSTM was initialized to have weights uniformly in except that the forget gates are set to 1. The controllers of the LANTM models have weights initialized uniformly in and the forget gates set to 1 as well. The write biases were set to 10 as in the sequence and arithmetic tasks, but there is no normalization of step sizes. Whenever the external memories filled up to 100, the oldest 60 memory vectors were discarded.
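The memory truncation described above can be sketched as a simple buffer policy. The function name and the list-based memory are assumptions for illustration, as is the choice to evict at the moment the memory reaches 100 entries rather than on the following write.

```python
from typing import Any, List


def write_with_eviction(
    memory: List[Any], entry: Any, capacity: int = 100, discard: int = 60
) -> None:
    """Append `entry`; once `capacity` entries are stored, drop the oldest `discard`."""
    memory.append(entry)
    if len(memory) >= capacity:
        # Entries are kept in write order, so the oldest ones are at the front.
        del memory[:discard]
```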
The number of BPTT steps is 20, and we used Adagrad with learning rate 0.05, which was halved each time the validation perplexity exceeded that of the previous epoch.
Appendix B Background
b.1 Lie groups
We here review basic concepts of (Lie) group theory.
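A group is a set $G$ with operations $\cdot$ (multiplication), ${}^{-1}$ (inverse), and $e$ (unit) of arity respectively 2, 1, 0, such that

(associativity) $(a \cdot b) \cdot c = a \cdot (b \cdot c)$ for all $a, b, c \in G$,

(inverse) $a \cdot a^{-1} = a^{-1} \cdot a = e$ for all $a \in G$,

(identity) $a \cdot e = e \cdot a = a$ for all $a \in G$.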
The classical examples are , , matrix groups like , and cyclic groups .
A group often “acts on” another object or set, like a hand twists a rubik’s cube. For example, imagine an equilateral triangle with its vertices colored differently. Rotating the triangle by 120 degrees permutes the vertex color but leaves the overall shape unchanged. If we let correspond respectively to rotations of the equilateral triangle by 0, 120, or 240 degrees, and addition in corresponds to applying two such rotations consecutively, then is said to act on the set of color permutations of the triangle, because it maps one such permutation to another by a rotation. Or, consider as a set of vectors and as a set of points. One may drag an element of by a vector from , thus mapping it to another element of . Then we say acts on by vector addition. As this example illustrates, a group always acts on itself by the group multiplication (in the example, this is addition of vectors). So in fact, every group acts on another set. Formally, a group action of group on set is defined as a mapping such that

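$e \cdot x = x$ for all $x \in X$, where $e$ is the unit of the group $G$ and $X$ is the set acted on,

$g \cdot (h \cdot x) = (gh) \cdot x$ for all $g, h \in G$ and $x \in X$.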
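As a small check of the two axioms on the triangle example, the sketch below lets the cyclic group of order 3 act on vertex colorings by rotation. The encoding of group elements as the integers 0, 1, 2 under addition modulo 3, and the function names, are choices made for this illustration.

```python
from itertools import product
from typing import Iterable, Tuple

Coloring = Tuple[str, str, str]  # colors of the three vertices, in order


def act(g: int, x: Coloring) -> Coloring:
    """Rotate the triangle by g * 120 degrees (g taken modulo 3)."""
    g %= 3
    return x[-g:] + x[:-g] if g else x


def is_group_action(colorings: Iterable[Coloring]) -> bool:
    """Check the identity and compatibility axioms on the given colorings."""
    for x in colorings:
        if act(0, x) != x:
            return False
        for g, h in product(range(3), repeat=2):
            if act(g, act(h, x)) != act((g + h) % 3, x):
                return False
    return True
```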
It is the ubiquity of group action that explains the ubiquity of groups in mathematics. In this paper, we only borrow the language of groups and group actions to the extent it neatly expresses many ideas central to our design. No advanced ideas from mathematics are used.
A Lie group is a group with a smooth manifold structure such that multiplication and inverse operations are smooth maps. Similarly, a smooth group action of a Lie group on smooth manifold is just a group action that is smooth. In the context of smooth Lie group action, we also call elements of Lie actions.
The reader who has had no experience with smooth topology need not worry too much about the precise meaning of these definitions beyond the intuition that “a Lie group is a group such that most things you do to it are differentiable” and “a smooth Lie group action is a differentiable group action”. Indeed, the only reason we require a Lie group rather than a group is so that its group action yields to gradient descent. (To that end, it is not strictly necessary for the groups to be infinitely differentiable, but as all common differentiable groups are Lie groups and all groups explored in this paper are Lie groups, this distinction is not needed.) The reader hoping to learn the basics of smooth manifolds and Lie groups can consult John Lee’s excellent Introduction to Smooth Manifolds [16].
Appendix C Example representation of Lie group actions on the key space
c.1 Example: The scaling rotation group
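The scaling rotation group is the group of linear transformations of $\mathbb{R}^n$ that decompose into a rotation followed by a dilation (or contraction).
In the specific case of $n = 2$, the controller would produce 2 numbers $a, b$, which represent the element
$$\begin{pmatrix} a & -b \\ b & a \end{pmatrix}$$
of the group. The matrix acts on a key $k = (x, y)^T$ by left matrix multiplication
$$\begin{pmatrix} a & -b \\ b & a \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} ax - by \\ bx + ay \end{pmatrix}.$$
This is the same as scaling by the scalar $\sqrt{a^2 + b^2}$ and then rotating (i.e. left multiplication) by the orthogonal matrix
$$\frac{1}{\sqrt{a^2 + b^2}} \begin{pmatrix} a & -b \\ b & a \end{pmatrix}.$$
Another viewpoint is to treat $(a, b)$ as the complex number $a + bi$. Then one can view the action for $k = (x, y)$ as the complex multiplication $(a + bi)(x + yi)$.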
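The equivalence between the matrix view and the complex-number view can be checked numerically. The sketch below assumes the two controller outputs, called a and b here, represent the matrix with rows (a, -b) and (b, a); the function names are hypothetical.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def scaling_rotation(a: float, b: float, key: Vec2) -> Vec2:
    """Apply the matrix [[a, -b], [b, a]] to `key` by left multiplication."""
    x, y = key
    return (a * x - b * y, b * x + a * y)


def as_complex_product(a: float, b: float, key: Vec2) -> Vec2:
    """The same action written as the complex product (a + bi) * (x + yi)."""
    z = complex(a, b) * complex(*key)
    return (z.real, z.imag)


def scale_and_angle(a: float, b: float) -> Tuple[float, float]:
    """Decompose (a, b) into a dilation factor and a rotation angle in radians."""
    return math.hypot(a, b), math.atan2(b, a)
```

Because the dilation factor multiplies the key at every step, repeated application grows or shrinks keys geometrically, which is consistent with the numerical problems on long inputs noted in appendix A.1.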
c.2 Example: The rotation group
The rotation, or special orthogonal, group is, as its name suggests, the group of all linear transformations of $\mathbb{R}^n$ expressible as a rotation.
When $n = 2$, we can just modify the scheme from the last example by scaling the two numbers $(a, b)$ produced by the controller to unit norm, $(a, b) \mapsto (a, b)/\sqrt{a^2 + b^2}$. The rest follows in just the same way.
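A minimal sketch of this normalization in two dimensions follows, again assuming the controller outputs are two numbers, called a and b here, that parametrize the matrix with rows (a, -b) and (b, a). After normalization the action preserves the norm of the key.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def rotate(a: float, b: float, key: Vec2) -> Vec2:
    """Normalize (a, b) to unit norm, then apply [[a, -b], [b, a]] to `key`."""
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("(a, b) must be nonzero to define a rotation")
    a, b = a / norm, b / norm
    x, y = key
    return (a * x - b * y, b * x + a * y)
```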
Appendix D Videos of read/write
For each task and each of LANTM-InvNorm and LANTM-SoftMax, we created a video of sample reads and writes over the course of learning; the entire album is available in the Supplementary Materials. Each video was created as follows:

At the end of each epoch, we randomly selected an input of the maximum training length specific to that task (for example, in the case of the addition task, two 16-digit numbers interleaved).

We ran the model, with all weights set as trained so far, on this input and recorded the read and write locations in the key space, along with the strength of each memory vector.

When training was complete, we plotted the recording of each epoch in a separate frame and strung the frames together into a video file. The write locations are marked by red circles, filled so that a darker fill color means higher memory strength. The read locations are marked by blue disks and connected together chronologically by a blue line (the read line).
Even though we did not explicitly indicate the directionality of the read line, one may infer the directionality of the write sequence by noting that a red circle with white filling marks the beginning of the writes. Then the read sequence will follow this directionality in all tasks other than the reverse task.
Analysis. One sees clearly that LANTM-InvNorm learned to write in a straight line (which is not surprising given our tweaks to the model) and then read along that same line. On the other hand, LANTM-SoftMax tended to quarantine its read locations to one end of the write line in the reverse, bigramFlip, and addition tasks. In the copy and double tasks, the read line does not stick to the write line as closely with LANTM-SoftMax as with LANTM-InvNorm. This is expected since SoftMax assigns a memory vector a high weight just if its location is closer to the read location than that of any other memory vector, whereas InvNorm requires the memory location to be very close to the read location.
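The qualitative difference between the two address mechanisms can be illustrated with a toy computation. The formulas below are assumptions chosen for illustration and are not taken from this section: InvNorm is modelled as normalized inverse squared distance, SoftMax as a softmax over negative squared distances with a hypothetical temperature of 0.1, and memory strengths are ignored.

```python
import math
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def sq_dist(p: Vec2, q: Vec2) -> float:
    """Squared Euclidean distance between two points of the key space."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def invnorm_weights(query: Vec2, keys: Sequence[Vec2], eps: float = 1e-9) -> List[float]:
    """Weights proportional to inverse squared distance (assumed form)."""
    raw = [1.0 / (sq_dist(query, k) + eps) for k in keys]
    total = sum(raw)
    return [r / total for r in raw]


def softmax_weights(
    query: Vec2, keys: Sequence[Vec2], temperature: float = 0.1
) -> List[float]:
    """Softmax over negative squared distances (assumed form and temperature)."""
    logits = [-sq_dist(query, k) / temperature for k in keys]
    m = max(logits)  # subtract the maximum for numerical stability
    raw = [math.exp(v - m) for v in logits]
    total = sum(raw)
    return [r / total for r in raw]
```

With keys at (0, 0), (1, 0), (2, 0) and a read location of (0.3, 0), both schemes favour the first key, but under these assumptions the softmax weight is more concentrated on it, so a read that is only roughly aligned with the write line still retrieves mostly one memory vector.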
Appendix E Close analysis
In this section, we discuss the performance of LANTM-InvNorm through various statistics and example inputs and outputs.
e.1 Permutation tasks
e.1.1 copy
Figure 0(a) shows the read and write locations of such a LANTM-InvNorm, trained on length 1 to 64 input, running on a typical length 320 input. As one might expect, the reads and writes proceed along straight lines in the key space. The actual read locations keep close to the corresponding write locations. In this execution, the LANTM made no errors (figure 0(c)).
Figure 0(b) shows the values of the 4 gates governing the computation of read and write keys. A value of 0 means the gate takes the previous step or key location, while a value of 1 means the gate takes the newly computed step or key location. While the write location gates during the input phase and the read location gates during the response phase were, as expected, pushed to 0, the write step and read step gates were unexpectedly pushed to 1. Thus the LANTM must have memorized a fixed step size and used it for both reads and writes.
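The gate semantics can be sketched as a blend between the previous and the newly computed value. The text only fixes the two endpoints (0 keeps the previous value, 1 takes the new one); the linear interpolation in between and the function name are assumptions of this example.

```python
from typing import Tuple


def gate_mix(
    gate: float, previous: Tuple[float, ...], proposed: Tuple[float, ...]
) -> Tuple[float, ...]:
    """Blend two vectors: gate = 0 keeps `previous`, gate = 1 takes `proposed`."""
    return tuple(gate * n + (1.0 - gate) * p for p, n in zip(previous, proposed))
```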
e.1.2 reverse
The counterparts of these graphs for the reverse task are exhibited in figure E.2. On the left we have data for length 128 input, demonstrating a correct execution, while on the right we have data for length 300 input, demonstrating what goes on when extrapolating to higher input sizes.
We see that LANTM trained on the reverse task functions much like that trained on the copy task, with read and write heads traversing straight lines, except that now the directionalities are opposed. However, when running on length 300 input, the read line, i.e. the curve connecting the read locations in sequence, bends suddenly toward the end, causing almost all reads at the end to diverge from the writes and making almost all outputs at the end incorrect. This is somewhat surprising, for one might have expected error to come in the form of the accumulation of a small difference between the slopes of the read and write lines. Along with the sudden dip in read step gate value at the end (blue line in figure 1(d)), the bending of the read line suggests that the LSTM controller started to forget its learned program as the answering phase drew toward a conclusion.
e.1.3 bigramFlip
The same phenomena appear with the bigramFlip task, where reads and writes happen along 2 closely aligned lines, but when tested on a long input, the reads abruptly fall out of order: while in the reverse task the read line visibly bends away from the write line, here the lines stay straight but each step in the read line is elongated, starting around the 187th read (figure 2(b)).
One might be surprised to see that the read happens along a line instead of zigzagging inside the write line. On closer inspection, we find that LANTM works as follows:

LANTM stores representations of the inputs in input order.

Meanwhile it memorizes the first two input characters and outputs them in the reverse order after reading the first two EOI symbols.

When it sees the first EOI symbols, it starts reading the second bigram, i.e. it reads characters 3 and 4 (or their representations in memory; this corresponds to the 5th and 6th memory vectors) after seeing the first and second EOI symbols. This effectively allows it to “look ahead” and have each bigram on hand before having to output the flipped image of it.

The LSTM flips each “look ahead” bigram and outputs it in order. Repeat for each bigram.
Unique to the LANTM trained on bigramFlip is the oscillation of the read step gate between 0 and 1 (figure 2(c) and 2(d)). This seems more like an artifact of the learning process than a feature of the learned computation, as it would also imply that the controller memorized a single fixed read step, and that the error that occurs with extrapolation seems to stem from the adulteration of this memory.
e.2 Arithmetic tasks
In the double task, the LANTM behaved much like it did in the copy task. It stored the input in a line and then computed the doubling with carry digitwise.
In the addition task, the LANTM learned to compress each pair of digits of the input numbers (which, as mentioned above, are interleaved) and store them in the odd write locations; the even write locations had vanishing memory strength (figure 4(a) and 4(b)). The LANTM then read off the information by skipping through the odd memory locations.
As with copy and reverse tasks, the read step gate values during the response phase were all close to 1, meaning that the LANTM kept the read step in the LSTM controller memory. This suggests that the read step gate might be an unnecessary design.