Self-Attentive Associative Memory

Heretofore, neural networks with external memory have been restricted to a single memory with lossy representations of memory interactions. A rich representation of relationships between memory pieces calls for a high-order and segregated relational memory. In this paper, we propose to separate the storage of individual experiences (item memory) and their occurring relationships (relational memory). The idea is implemented through a novel Self-attentive Associative Memory (SAM) operator. Founded upon the outer product, SAM forms a set of associative memories that represent the hypothetical high-order relationships between arbitrary pairs of memory elements, through which a relational memory is constructed from an item memory. The two memories are wired into a single sequential model capable of both memorization and relational reasoning. We achieve competitive results with our proposed two-memory model in a diversity of machine learning tasks, from challenging synthetic problems to practical testbeds such as geometry, graph, reinforcement learning, and question answering.


1 Introduction

Humans excel in remembering items and the relationship between them over time [26, 17]. Numerous neurocognitive studies have revealed this striking ability is largely attributed to the perirhinal cortex and hippocampus, two brain regions that support item memory (e.g., objects, events) and relational memory (e.g., locations of objects, orders of events), respectively [7, 6]. Relational memory theory posits that there exists a representation of critical relationships amongst arbitrary items, which allows inferential reasoning capacity [9, 40]. It remains unclear how the hippocampus can select the stored items in clever ways to unearth their hidden relationships and form the relational representation.

Research on artificial intelligence has focused on designing item-based memory models with recurrent neural networks (RNNs) [14, 10, 13] and memory-augmented neural networks (MANNs) [11, 12, 19, 21]. These memories support long-term retrieval of previously seen items yet lack explicit mechanisms to represent arbitrary relationships amongst the constituent pieces of the memories. Recently, further attempts have been made to foster relational modeling by enabling memory-memory interactions, which is essential for relational reasoning tasks [29, 30, 35]. However, no effort has been made to jointly and explicitly model both item memory and relational memory.

We argue that dual memories in a single system are crucial for solving problems that require both memorization and relational reasoning. Consider graphs wherein each node is associated with versatile features, for example a road network structure: graph 1, where the nodes are building landmarks, and graph 2, where the nodes are flora details. The goal here is to reason over the structure and output the associated features of the nodes instead of pointers or indices to the nodes. Learning to output associated node features enables generalization to entirely novel features, i.e., a model can be trained to generate a navigation path with building landmarks (graph 1) and tested in the novel context of generating a navigation path with flora landmarks (graph 2). This may be achieved if the model stores the features and structures in its item and relational memory, separately, and reasons over the two memories using rules acquired during training.

Another example requiring both item and relational memory can be understood by amalgamating the $N$-farthest [30] and associative recall [11] tasks. $N$-farthest requires relational memory to return a fixed one-hot encoding representing the index of the $N$-farthest item, while associative recall returns the item itself, requiring item memory. If these tasks are amalgamated to compose Relational Associative Recall (RAR), which returns the $N$-farthest item from a query (see Sec. 3.2), it is clear that both item and relational memories are required.

Three limitations of the current approaches are: (i) the relational representation is often computed without being stored, which prevents reusing precomputed relationships in sequential tasks [35, 29]; (ii) the few works that manage both items and relationships in a single memory make it hard to understand how relational reasoning occurs [30, 32]; (iii) the memory-memory relationship is coarse, since it is represented as either dot product attention [35] or weighted summation via neural networks [29]. Concretely, the former uses a scalar to measure the cosine distance between two vectors and the latter packs all information into one vector via only additive interactions.

To overcome the current limitations, we hypothesize a two-memory model, in which the relational memory exists separately from the item memory. To maintain a rich representation of the relationship between items, the relational memory should be higher-order than the item memory. That is, the relational memory stores multiple relationships, each of which should be represented by a matrix rather than a scalar or vector. Otherwise, the capacity of the relational memory is downgraded to that of the item memory. Finally, as there are two separate memories, they must communicate to enrich the representation of one another.

To implement our hypotheses, we introduce a novel operator that facilitates the communication from the item memory to the relational memory. The operator, named Self-attentive Associative Memory (SAM), extends the dot product attention to our outer product attention. The outer product is critical for constructing higher-order relational representations since it retains bit-level interactions between two input vectors and thus has potential for rich representational learning [33]. SAM transforms a second-order (matrix) item memory into a third-order relational representation through two steps. First, SAM decodes a set of patterns from the item memory. Second, SAM associates each pair of patterns using the outer product and sums them up to form a hetero-associative memory. The memory thus stores relationships between stored items, accumulated across timesteps to form a relational memory.

The role of the item memory is to memorize the input data over time. To selectively encode the input data, the item memory is implemented as a gated auto-associative memory. Together with the previous read-out values from the relational memory, the item memory is used as the input for SAM to construct the relational memory. In return, the relational memory transfers its knowledge to the item memory through a distillation process. This backward transfer triggers recurrent dynamics between the two memories, which may be essential for simulating hippocampal processes [18]. Another distillation process is used to transform the relational memory into the output value.

Taken together, we contribute a new neural memory model dubbed the SAM-based Two-memory Model (STM) that takes inspiration from the existence of both item and relational memory in the human brain. In this design, the relational memory is higher-order than the item memory and thus necessitates a core operator that manages the information exchange from the item memory to the relational memory. The operator, namely Self-attentive Associative Memory (SAM), utilizes the outer product to construct a set of hetero-associative memories representing relationships between arbitrary stored items. We apply our model to a wide range of tasks that may require both item and relational memory: various algorithmic learning tasks, geometric and graph reasoning, reinforcement learning and question-answering tasks. Several analytical studies on the characteristics of our proposed model are also given in the Appendix.

2 Methods

2.1 Outer product attention (OPA)

Outer product attention (OPA) is a natural extension of the query-key-value dot product attention [35]. Dot product attention (DPA) for a single query and pairs of keys-values can be formulated as follows,

$$\alpha_i = \operatorname{softmax}\left(\frac{q \cdot k_i}{\sqrt{d}}\right) \quad (1)$$
$$DPA\left(q, \{k_i\}, \{v_i\}\right) = \sum_{i} \alpha_i v_i \quad (2)$$
where $q, k_i, v_i \in \mathbb{R}^d$, $\cdot$ is the dot product, and softmax forms the normalizing function. We propose a new outer product attention with a similar formulation yet different meaning,

$$OPA\left(q, \{k_i\}, \{v_i\}\right) = \sum_{i} f\left(q \odot k_i\right) \otimes v_i \quad (3)$$
where $q, k_i, v_i \in \mathbb{R}^d$, $\odot$ is element-wise multiplication, $\otimes$ is the outer product and $f$ is chosen as an element-wise function.

A crucial difference between DPA and OPA is that while the former retrieves an attended item (a vector in $\mathbb{R}^d$), the latter forms a relational representation (a matrix in $\mathbb{R}^{d \times d}$). As a relational representation, the OPA output captures all bit-level associations between the key-scaled query and the value. This offers two benefits: a higher-order representational capacity that DPA cannot provide, and a form of associative memory that can later be used to retrieve stored items by using a contraction operation (see Appendix C-Prop. 6).
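As a concrete illustration, here is a minimal numpy sketch of DPA and OPA for a single query. This is a sketch only: the softmax scaling by the square root of the dimension and the choice f = tanh are assumptions, not necessarily the paper's exact configuration.

```python
# DPA returns a vector (an attended item); OPA returns a matrix
# (a relational representation). f = tanh is an illustrative assumption.
import numpy as np

def dpa(q, K, V):
    # softmax-weighted sum of values -> a d-dim vector
    scores = K @ q / np.sqrt(q.size)           # one scalar per key
    a = np.exp(scores) / np.exp(scores).sum()  # softmax weights
    return a @ V                               # shape (d,)

def opa(q, K, V, f=np.tanh):
    # sum of outer products -> a d x d matrix of bit-level associations
    out = np.zeros((q.size, V.shape[1]))
    for k_i, v_i in zip(K, V):
        out += np.outer(f(q * k_i), v_i)
    return out                                 # shape (d, d)

rng = np.random.default_rng(0)
d, n = 4, 3
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
item = dpa(q, K, V)       # vector: attended item
relation = opa(q, K, V)   # matrix: relational representation
```

The vector/matrix output shapes make the capacity difference concrete: OPA keeps d-squared scalars where DPA keeps d.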

OPA is closely related to DPA. The relationship between the two, for simple choices of the attention functions, is presented as follows.

Proposition 1.

Assume that $f$ is a linear transformation. Then we can extract $DPA$ from $OPA$ by using an element-wise linear transformation $g$ and a contraction $C: \mathbb{R}^{d \times d} \rightarrow \mathbb{R}^{d}$ such that

$$DPA\left(q, \{k_i\}, \{v_i\}\right) = C\left(g\left(OPA\left(q, \{k_i\}, \{v_i\}\right)\right)\right) \quad (4)$$
Proof: see Appendix A. ∎

Moreover, applying a high dimensional transformation to the OPA output is equivalent to the well-known bi-linear model (see Appendix B-Prop. 4). By introducing OPA, we obtain a new building block that naturally supports both powerful relational binding and item memorization.
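The spirit of Proposition 1 can be checked numerically: with a linear f (identity here, standing in for softmax), summing the OPA tensor over the query dimension recovers the linear-attention DPA output. This is an illustrative check under those assumptions, not the formal proof of Appendix A.

```python
# With f = identity, contracting OPA over the query axis reproduces
# the (unnormalized, linear-attention) DPA output exactly.
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 4
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# linear-attention DPA: weights are raw dot products (no softmax)
dpa_lin = sum((q @ k_i) * v_i for k_i, v_i in zip(K, V))

# OPA with f = identity, then contract (sum) over the first axis
opa = sum(np.outer(q * k_i, v_i) for k_i, v_i in zip(K, V))
contracted = opa.sum(axis=0)

ok = np.allclose(dpa_lin, contracted)
```

The identity holds because summing `outer(q * k_i, v_i)` over its first axis yields `(q . k_i) * v_i` term by term.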

Figure 1: STM (left) and SAM (right). SAM uses neural networks to extract query, key and value elements from a matrix memory $\mathcal{M}$. Then, it applies outer product attention to output a third-order relational representation. In STM, at every timestep, the item memory is updated with new input using gating mechanisms (Eq. 10). The item memory, plus the read-out from the relational memory, is forwarded to SAM, resulting in a new relational representation that updates the relational memory (Eq. 11-12). The relational memory transfers its knowledge to the item memory (Eq. 13) and to the output value (Eq. 14).

2.2 Self-attentive Associative Memory (SAM)

We introduce a novel and generic operator based upon OPA that constructs relational representations from an item memory. The relational information is extracted by preserving the outer products between arbitrary pairs of items from the item memory. Hence, we name this operator Self-attentive Associative Memory (SAM). Given an item memory $\mathcal{M} \in \mathbb{R}^{n \times d}$ and parametric weights $W_q \in \mathbb{R}^{n_q \times n}$, $W_k \in \mathbb{R}^{n_k \times n}$ and $W_v \in \mathbb{R}^{n_k \times n}$, SAM retrieves queries, keys and values from $\mathcal{M}$ as $Q$, $K$ and $V$, respectively,

$$Q = LN\left(W_q \mathcal{M}\right), \quad K = LN\left(W_k \mathcal{M}\right), \quad V = LN\left(W_v \mathcal{M}\right) \quad (5\text{--}7)$$
where $LN$ is the layer normalization operation [2]. Then SAM returns a relational representation $\mathcal{S} \in \mathbb{R}^{n_q \times d \times d}$, in which the $l$-th element of the first dimension is defined as

$$\mathcal{S}[l] = \sum_{i=1}^{n_k} f\left(q_l \odot k_i\right) \otimes v_i \quad (8)$$
where $l = 1, 2, ..., n_q$; $q_l$, $k_i$ and $v_i$ denote the $l$-th row vector of matrix $Q$ and the $i$-th row vectors of matrices $K$ and $V$, respectively. A diagram illustrating SAM operations is given in Fig. 1 (right).

It should be noted that $\mathcal{M}$ can be any item memory, including slot-based memories [21], direct inputs [35] or associative memories [16, 14]. We choose $\mathcal{M}$ as a form of classical associative memory, which is biologically plausible [23]. From $\mathcal{M}$ we read query, key and value items to form a new set of hetero-associative memories using Eq. 8. Each hetero-associative memory represents the relationship between a query and all values. The role of the keys is to maintain the possibility of perfect retrieval for the item memory (Appendix C-Prop. 6).

The high-order structure of SAM allows it to preserve the bit-level relationships between a query and a value in a matrix. SAM compresses several relationships with regard to a query by summing all the matrices, forming a hetero-associative memory containing $d^2$ scalars, where $d$ is the dimension of the value vectors. As there are $n_k$ relationships given one query, the summation results in on average $d^2/n_k$ scalars of representation per relationship, which is greater than one if $d^2 > n_k$. By contrast, current self-attention mechanisms use the dot product to measure the relationship between any pair of memory slots, which means one scalar per relationship.
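A minimal sketch of the SAM operator may help fix ideas. Layer normalization is simplified to standard normalization, the random matrices stand in for the learned W_q, W_k, W_v, and f = tanh is an assumption.

```python
# Sketch of SAM: item memory (n x d) -> relational tensor (nq x d x d).
import numpy as np

def layer_norm(x, eps=1e-5):
    # simplified, parameter-free layer normalization
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def sam(M, Wq, Wk, Wv, f=np.tanh):
    """M: item memory (n x d). Wq: (nq x n); Wk, Wv: (nk x n)."""
    Q = layer_norm(Wq @ M)   # nq query rows
    K = layer_norm(Wk @ M)   # nk key rows
    V = layer_norm(Wv @ M)   # nk value rows
    nq, d = Q.shape
    S = np.zeros((nq, d, d))
    for l in range(nq):
        for k_i, v_i in zip(K, V):
            # every query-value pair contributes a full d x d matrix
            S[l] += np.outer(f(Q[l] * k_i), v_i)
    return S

rng = np.random.default_rng(2)
n, d, nq, nk = 6, 8, 3, 6
M = rng.standard_normal((n, d))
S = sam(M,
        rng.standard_normal((nq, n)),
        rng.standard_normal((nk, n)),
        rng.standard_normal((nk, n)))
```

Each of the `nq` slices stores `d * d` scalars for `nk` relationships, illustrating the capacity argument above.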

2.3 SAM-based Two-Memory Model (STM)

To effectively utilize the SAM operator, we design a system which consists of two memory units: $\mathcal{M}^i$ for items and $\mathcal{M}^r$ for relationships. From a high-level view, at each timestep $t$, we use the current input data $x_t$ and the previous state of the memories to produce the output $o_t$ and the new state of the memories. The memory executions are described as follows.


$\mathcal{M}^i$-Write

The item memory distributes the data from the input across its rows in the form of an associative memory. For an input $x_t$, we update the item memory as

$$\mathcal{M}^i_t = \mathcal{M}^i_{t-1} + f_1\left(x_t\right) \otimes f_2\left(x_t\right) \quad (9)$$
where $f_1$ and $f_2$ are feed-forward neural networks that output $d$-dimensional vectors. This update does not discriminate between input data and inherits the low capacity of classical associative memory [28]. We leverage the gating mechanisms of LSTM [13] to improve Eq. 9 as

$$\mathcal{M}^i_t = F_t \odot \mathcal{M}^i_{t-1} + I_t \odot \left(f_1\left(x_t\right) \otimes f_2\left(x_t\right)\right) \quad (10)$$
where $F_t$ and $I_t$ are forget and input gates, respectively. Detailed implementation of these gates is in Appendix D.
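The gated write to the item memory can be sketched as follows; the per-row sigmoid gates are simplified stand-ins for the LSTM-style gates detailed in Appendix D.

```python
# Sketch of the gated associative item-memory write (Eqs. 9-10).
# The gate networks here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
d_in, d = 4, 6
W1 = rng.standard_normal((d, d_in))   # f1: input -> d-dim vector
W2 = rng.standard_normal((d, d_in))   # f2: input -> d-dim vector
Wf = rng.standard_normal((d, d_in))   # forget-gate weights (assumption)
Wi = rng.standard_normal((d, d_in))   # input-gate weights (assumption)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def item_write(M_prev, x):
    candidate = np.outer(np.tanh(W1 @ x), np.tanh(W2 @ x))  # f1(x) outer f2(x)
    F = sigmoid(Wf @ x)   # forget gate, applied per row
    I = sigmoid(Wi @ x)   # input gate, applied per row
    return F[:, None] * M_prev + I[:, None] * candidate

M = np.zeros((d, d))
for t in range(3):
    M = item_write(M, rng.standard_normal(d_in))
```

Without the gates the update reduces to the plain Hebbian accumulation of Eq. 9.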


$\mathcal{M}^r$-Read

As relationships stored in $\mathcal{M}^r$ are represented as associative memories, the relational memory can be read to reconstruct previously seen items. As shown in Appendix C-Prop. 7, the read is basically a two-step contraction,


where $f_3$ is a feed-forward neural network that outputs an $n_q$-dimensional vector. The read value $v^r_{t-1}$ provides an additional input, coming from the previous state of $\mathcal{M}^r$, to the relational construction process, as shown later in Eq. 12.
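A rough sketch of the two-step contraction read is below. The assumptions: f_3 weights the n_q relation slices, and a uniform cue performs the second contraction (in the actual model the contraction cues are learned).

```python
# Sketch of the two-step contraction read from the relational memory.
# R has shape (nq, d, d); f3 is a stand-in feed-forward map.
import numpy as np

rng = np.random.default_rng(4)
nq, d, d_in = 3, 6, 4
R = rng.standard_normal((nq, d, d))
W3 = rng.standard_normal((nq, d_in))  # f3: input -> nq-dim vector (assumption)

def relational_read(R, x):
    r = np.tanh(W3 @ x)               # step 1: weights over the nq slices
    M_mix = np.tensordot(r, R, 1)     # contract first axis -> (d, d) matrix
    cue = np.ones(d) / d              # step 2: a retrieval cue (assumption)
    return cue @ M_mix                # -> d-dim read-out vector

v = relational_read(R, rng.standard_normal(d_in))
```

The first contraction selects a mixture of hetero-associative matrices; the second retrieves a stored item from that mixture.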

$\mathcal{M}^r$-Write

We use SAM to read from $\mathcal{M}^i$ and construct a candidate relational memory, which is simply added to the previous relational memory to perform the relational update,

$$\mathcal{M}^r_t = \alpha_1 \mathcal{M}^r_{t-1} + \alpha_2\, SAM\left(\mathcal{M}^i_t + v^r_{t-1} \otimes x_t\right) \quad (12)$$

where $\alpha_1$ and $\alpha_2$ are blending hyper-parameters and $v^r_{t-1}$ is the read value from the relational memory (Eq. 11). The input for SAM is a combination of the current item memory $\mathcal{M}^i_t$ and the association between the extracted item from the previous relational memory and the current input data $x_t$. Here, $v^r_{t-1}$ enhances the relational memory with information from the distant past. The resulting relational memory stores associations between several pairs of items in a tensor of size $n_q \times d \times d$.
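The relational write can be sketched as below. The SAM stand-in is a stripped-down version of Sec. 2.2 (queries only, no key/value networks or layer norm), and the direct outer product with a pre-projected input is an assumption.

```python
# Sketch of the relational write: blend the previous relational memory with
# SAM applied to (item memory + association of last read-out and input).
import numpy as np

rng = np.random.default_rng(5)
d, nq = 6, 3
alpha1, alpha2 = 1.0, 1.0            # blending hyper-parameters (illustrative)
Wq = rng.standard_normal((nq, d))    # query extractor (assumption)

def sam_stub(M):
    # simplified SAM: rows of M serve as both keys and values
    Q = Wq @ M
    return np.stack([
        sum(np.outer(np.tanh(Q[l] * k), v) for k, v in zip(M, M))
        for l in range(nq)
    ])                               # shape (nq, d, d)

M_t = rng.standard_normal((d, d))    # current item memory
v_prev = rng.standard_normal(d)      # previous relational read-out
x_proj = rng.standard_normal(d)      # input projected to d dims (assumption)
R_prev = np.zeros((nq, d, d))

R_t = alpha1 * R_prev + alpha2 * sam_stub(M_t + np.outer(v_prev, x_proj))
```

The additive blend lets relationships accumulate across timesteps rather than being recomputed from scratch.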


$\mathcal{M}^r$-Transfer

In this phase, the relational knowledge from $\mathcal{M}^r_t$ is transferred to the item memory by using a high dimensional transformation,

$$\mathcal{M}^i_t \leftarrow \mathcal{M}^i_t + \alpha_3\, f_4\left(\mathrm{vec}_{12}\left(\mathcal{M}^r_t\right)\right) \quad (13)$$
where $\mathrm{vec}_{12}$ is a function that flattens the first two dimensions of its input tensor, $f_4$ is a feed-forward neural network that maps $\mathbb{R}^{n_q d} \rightarrow \mathbb{R}^{d}$, and $\alpha_3$ is a blending hyper-parameter. As shown in Appendix B-Prop. 5, with a trivial $f_4$, the transfer behaves as if the item memory were enhanced with long-term stored values from the relational memory. Hence, $\mathcal{M}^r$-Transfer is also helpful in supporting long-term recall (empirical evidence in Sec. 3.1). In addition, at each timestep, we distill the relational memory into an output vector $o_t$. We alternately flatten and apply high-dimensional transformations as follows,

$$o_t = f_6\left(f_5\left(\mathrm{vec}_{23}\left(\mathcal{M}^r_t\right)\right)\right) \quad (14)$$
where $\mathrm{vec}_{23}$ is a function that flattens the last two dimensions of its input tensor. $f_5$ and $f_6$ are two feed-forward neural networks that map $\mathbb{R}^{d^2} \rightarrow \mathbb{R}^{d_o}$ and $\mathbb{R}^{n_q} \rightarrow \mathbb{R}$, respectively, applied along the corresponding dimensions. The output dimension $d_o$ is a hyper-parameter.

Unlike the contraction (Eq. 11), the distillation process does not simply reconstruct the stored items. Rather, thanks to the high-dimensional transformations, it captures bi-linear representations stored in the relational memory (proof in Appendix B). Hence, despite its vector form, the output of our model holds a rich representation that is useful for both sequential and relational learning. We discuss further how to quantify the degree of relational distillation in Appendix G. A summary of the components of STM is presented in Fig. 1 (left).
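The transfer and distillation steps can be sketched with random matrices standing in for the feed-forward networks; the shapes follow the text (flatten the first two dimensions for transfer, the last two for the output).

```python
# Sketch of relational-to-item transfer and output distillation.
# The weight matrices and blending constant are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
nq, d, d_out = 3, 6, 5
R = rng.standard_normal((nq, d, d))   # relational memory
M = rng.standard_normal((d, d))       # item memory
alpha3 = 0.5                          # blending hyper-parameter

# Transfer: (nq, d, d) -> (nq*d, d) -> (d, d), blended into the item memory
V1 = rng.standard_normal((d, nq * d))          # maps nq*d -> d per column
M_new = M + alpha3 * np.tanh(V1 @ R.reshape(nq * d, d))

# Output: (nq, d, d) -> (nq, d*d) -> (nq, d_out) -> d_out-dim vector
V2 = rng.standard_normal((d_out, d * d))       # per-slice bi-linear readout
V3 = rng.standard_normal((1, nq))              # contracts the nq slices
o = (V3 @ np.tanh(R.reshape(nq, d * d) @ V2.T)).ravel()
```

Because the readout acts on flattened outer-product slices, the output vector retains bi-linear information rather than a mere sum of stored items.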

3 Results

3.1 Ablation study

We test different model configurations on two classical tasks for sequential and relational learning: associative retrieval [1] and $N$-farthest [30] (see Appendix E for task details and learning curves).

Associative retrieval

This task measures the ability to recall a seen item given its associated key and thus involves item memory. We use the settings with input sequence lengths 30 and 50 [41]. Three main factors affect the item memory of STM: the dimension of the auto-associative item memory, the gating mechanisms (Eq. 10) and the relational transfer (Eq. 13). Hence, we ablate our full-feature STM by creating three other versions: a small STM with transfer, a small STM without transfer, and an STM without gates. $n_q$ is fixed to a small value as the task does not require much relational learning.

Table 1 reports the number of epochs required to converge and the final testing accuracy. Without the proposed gating mechanism, STM struggles to converge, which highlights the importance of extending the capacity of the auto-associative item memory. The convergence speed of STM is significantly improved with a bigger item memory size. Relational transfer seems more useful for longer input sequences since, if requested, it can support long-term retrieval. Compared to the other fast-weight baselines, the full-feature STM performs far better, as it needs only 10 and 20 epochs to solve the tasks of length 30 and 50, respectively.

Model | Length 30 E. | Length 30 A. | Length 50 E. | Length 50 A.
Fast weight | 50 | 100 | 5000 | 20.8
WeiNet | 35 | 100 | 50 | 100
STM (small, w/o transfer) | 10 | 100 | 100 | 100
STM (small) | 20 | 100 | 80 | 100
STM (w/o gates) | 100 | 24 | 100 | 20
STM (full) | 10 | 100 | 20 | 100
Table 1: Comparison of models on the associative retrieval task: number of epochs E. required to converge (lower is better) and test accuracy at convergence A. (%, higher is better). Results for Fast weight and WeiNet are reported from [41].


$N$-farthest

This task evaluates the ability to learn the relationships between stored vectors. The goal is to find the $N$-farthest vector from a query vector, which requires a relational memory for distances between vectors and a sorting mechanism over the distances. For relational reasoning tasks, the pivotal factor is the number of extracted items $n_q$ used to establish the relational memory. Hence, we run our STM with different $n_q$ using the same problem setting (8-dimensional input vectors), optimizer (Adam) and batch size (1600) as in [30]. We also run the task with TPR [32], a high-order fast-weight model that is designed for reasoning.

As reported in Table 2, increasing $n_q$ gradually improves the accuracy of STM. As there are 8 input vectors in this task, at each timestep the model literally needs to extract 8 items to compute all pairwise distances. However, as each extracted item is an entangled representation of all stored vectors and the temporarily computed distances are stored in a separate high-order storage, even with a small $n_q$, STM achieves moderate results. With the largest $n_q$, STM nearly solves the task perfectly, outperforming RMC by a large margin. We have tried to tune TPR for this task without success (see Appendix E). This illustrates the challenge of training high-order neural networks in diverse contexts.

Model | Accuracy (%)
DNC | 25
RMC | 91
TPR | 13
STM (small $n_q$) | 84
STM (medium $n_q$) | 95
STM (large $n_q$) | 98
Table 2: Comparison of models on the $N$-farthest task (test accuracy, %). Results for DNC and RMC are reported from [30].

3.2 Algorithmic synthetic tasks

Figure 2: Bit error per sequence vs training iteration for algorithmic synthetic tasks.

Algorithmic synthetic tasks [11] examine sequential models on memorization capacity (e.g., Copy, Associative recall) and simple relational reasoning (e.g., Priority sort). Even without explicit relational memory, MANNs have demonstrated good performance [11, 22], but they have been verified for only low-dimensional input vectors (<8 bits). As higher-dimensional inputs necessitate higher-fidelity memory storage, we evaluate the high-fidelity reconstruction capacity of sequential models on these algorithmic tasks with 32-bit input vectors.

Two chosen algorithmic tasks are Copy and Priority sort. Item memory is enough for Copy, where the models just output the input vectors in the same order in which they are presented. For Priority sort, a relational operation that compares the priorities of input vectors is required to produce the seen input vectors sorted according to the priority score attached to each input vector. The relationship is between input vectors and is thus simply first-order (see Appendix G for more on the order of a relationship).

Inspired by the Associative recall and $N$-farthest tasks, we create a new task named Relational Associative Recall (RAR). In RAR, the input sequence is a list of items followed by a query item. Each item is a list of several 32-bit vectors and thus can be interpreted as a concatenated long vector. The requirement is to reconstruct the seen item that is farthest from or closest (yet unequal) to the query. The type of the relationship is conditioned on the last bit of the query vector, i.e., if the last bit is 1, the target is the farthest item; if it is 0, the closest. The evaluated models must compute the distances from the query item to all other seen items and then compare the distances to find the farthest/closest one. Hence, this task is similar to the $N$-farthest task, which is second-order relational and thus needs relational memory. However, this task is more challenging since the models must reconstruct the seen items (32-bit vectors). Compared to the possible one-hot outputs in $N$-farthest, the output space in RAR is $2^{32}$ per step, thereby requiring a high-fidelity item memory.
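A sketch of how a RAR example could be generated, following the description above; the item count and vectors-per-item are illustrative choices, not the paper's exact configuration.

```python
# Generate one RAR example: a list of items, a query, and the target item
# (farthest or closest to the query, selected by the query's last bit).
import numpy as np

def make_rar_example(n_items=5, vectors_per_item=2, bits=32, seed=0):
    rng = np.random.default_rng(seed)
    width = vectors_per_item * bits
    # each item is several 32-bit vectors, viewed as one concatenated vector
    items = rng.integers(0, 2, (n_items, width)).astype(float)
    query = rng.integers(0, 2, width).astype(float)
    want_farthest = bool(query[-1])            # last bit selects the relation
    dists = np.linalg.norm(items - query, axis=1)
    if want_farthest:
        target = items[np.argmax(dists)]
    else:
        dists[dists == 0] = np.inf             # "closest yet unequal"
        target = items[np.argmin(dists)]
    return items, query, target

items, query, target = make_rar_example()
```

A model must both hold the items (item memory) and rank their distances to the query (relational memory) to emit `target`.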

We evaluate our model STM against the following 4 baselines: LSTM [13], attentional LSTM (ALSTM) [4], NTM [11] and RMC [30]. Details of the implementation are listed in Appendix F. The learning curves (mean and error bar over 5 runs) are presented in Fig. 2.

LSTM is often the worst performer as it is based on vector memory. ALSTM is especially good for Copy as it has the privilege of accessing input vectors at every step of decoding. However, when dealing with relational reasoning, the memory-less attention in ALSTM does not help much. NTM performs well on Copy and moderately on Priority sort, yet badly on RAR, possibly due to its bias towards item memory. Although equipped with a self-attention relational memory, RMC demonstrates trivial performance on all tasks. This suggests a limitation of using dot-product attention to represent relationships when the tasks stress memorization or the relational complexity goes beyond dot-product capacity. Amongst all models, only the proposed STM demonstrates consistently good performance, almost achieving zero errors on all 3 tasks. Notably, for RAR, only STM surpasses the bottleneck error of 30 bits (at which 0% of items are perfectly reconstructed), reaching a far lower bit error corresponding to 87% of items perfectly reconstructed.

Model | #Parameters | Convex hull (small) | Convex hull (large) | TSP (small) | TSP (large) | Shortest path | Minimum spanning tree
LSTM | 4.5 M | 89.15 | 82.24 | 73.15 (2.06) | 62.13 (3.19) | 72.38 | 80.11
ALSTM | 3.7 M | 89.92 | 85.22 | 71.79 (2.05) | 55.51 (3.21) | 76.70 | 73.40
DNC | 1.9 M | 89.42 | 79.47 | 73.24 (2.05) | 61.53 (3.17) | 83.59 | 82.24
RMC | 2.8 M | 93.72 | 81.23 | 72.83 (2.05) | 37.93 (3.79) | 66.71 | 74.98
STM | 1.9 M | 96.85 | 91.88 | 73.96 (2.05) | 69.43 (3.03) | 93.43 | 94.77
Table 3: Prediction accuracy (%) for geometric and graph reasoning with random one-hot features. Numbers in parentheses are tour lengths, an additional metric for TSP. Average optimal tour lengths found by brute-force search for the two TSP settings are 2.05 and 2.88, respectively.
Figure 3: Average reward vs number of games for reinforcement learning task in n-frame skip settings.

3.3 Geometric and graph reasoning

Problems on geometry and graphs are a good testbed for relational reasoning, where geometry stipulates spatial relationships between points, and graphs the relational structure of nodes and edges. Classical problems include Convex hull and the Traveling salesman problem (TSP) for geometry, and Shortest path and Minimum spanning tree for graphs. Convex hull and TSP data are from [36], where the input sequence is a list of points' coordinates. Graphs in Shortest path and Minimum spanning tree are generated with solutions found by the Dijkstra and Kruskal algorithms, respectively. A graph input is represented as a sequence of triplets. The desired output is a sequence of the associated features of the solution points/nodes (more in Appendix H).

We generate a random one-hot associated feature for each point/node, which is stacked into the input vector. This allows us to output the nodes' associated features, unlike [36], which just outputs pointers to the nodes. Our modification creates a challenge for both training and testing. Training is more complex as the features of the nodes vary even for the same graph. Testing is challenging as the associated features are likely to differ from those seen in training. A correct prediction for a timestep is made when the predicted feature matches the ground truth feature of that timestep perfectly. To measure performance, we use the average prediction accuracy across steps. We use the same baselines as in Sec. 3.2, except that we replace NTM with DNC, as DNC performs better on graph reasoning [12].

We report the best performance of the models on the testing datasets in Table 3. Although our STM has the fewest parameters, it consistently outperforms the other baselines by a significant margin. As usual, LSTM demonstrates average performance across tasks. RMC and ALSTM are only good at Convex hull. DNC performs better on graph-like problems such as Shortest path and Minimum spanning tree. For the NP-hard TSP, despite moderate point accuracy, all models achieve nearly minimal solutions with average tour lengths close to the optimal 2.05 in the smaller setting. When increasing the difficulty with more points, none of these models reaches the average optimal tour length of 2.88. However, only STM comes close to the optimal solution without the need for pointer and beam search mechanisms. Armed with both item and relational memory, STM's superior performance suggests a qualitative difference in the way STM and other methods solve these problems.

3.4 Reinforcement learning

Memory is helpful for partially observable Markov decision processes [5]. We apply our memory to LSTM agents in Atari game environments using A3C training [24]. More details are given in Appendix I. In Atari games, each state is represented by the visual features of a video frame and is thus partially observable. To perform well, RL agents should remember and relate several frames to model the game state comprehensively. These abilities are challenged when over-sampling and under-sampling the observations, respectively. We analyze the performance of LSTM agents and their STM-augmented counterparts under these settings using one game: Pong.

To be specific, we test the two agents with different frame skips (0, 4, 16, 32). We create the n-frame skip setting by allowing the agent to see the environment only after every n frames, where 4-frame skip is the standard in most Atari environments. When no frame skip is applied (over-sampling), the observations are dense and the game is long (up to 9000 steps per game), which requires a high-capacity item memory. On the contrary, when many frames are skipped (under-sampling), the observations become scarce and the agents must model the connection between frames meticulously, demanding a better relational memory.
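The n-frame skip setting can be illustrated with a toy environment (an assumption standing in for an Atari game): the agent's action is held for n frames and only the last frame is observed.

```python
# Toy illustration of n-frame skip: the agent observes only every n-th frame.
class ToyEnv:
    """Stand-in environment that ends after 100 internal frames."""
    def __init__(self):
        self.t = 0
    def step(self, action):
        self.t += 1
        obs, reward, done = self.t, 0.0, self.t >= 100
        return obs, reward, done

def run_with_frameskip(env, n_skip, policy):
    observations = []
    obs, done = 0, False
    while not done:
        action = policy(obs)
        for _ in range(max(1, n_skip)):   # repeat the action for n frames
            obs, reward, done = env.step(action)
            if done:
                break
        observations.append(obs)          # the agent sees only this frame
    return observations

dense = run_with_frameskip(ToyEnv(), 1, lambda o: 0)   # over-sampling
sparse = run_with_frameskip(ToyEnv(), 4, lambda o: 0)  # under-sampling
```

With a larger skip the observation sequence shortens, so each observed frame must carry more relational weight.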

We run each configuration 5 times and report the mean and error bar of the moving average reward through training time in Fig. 3. In the standard condition (4-frame skip), both agents achieve perfect performance, and STM outperforms LSTM slightly in terms of convergence speed. The performance gain becomes clearer under the extreme conditions of over-sampling and under-sampling. STM agents require less practice to accomplish higher rewards, especially in the 32-frame skip environment, which illustrates that having strong item and relational memory in a single model is beneficial to RL agents.

3.5 Question answering

bAbI is a question answering dataset that evaluates the ability to remember and reason over textual information [38]. Although synthetically generated, the dataset contains 20 challenging tasks such as pathfinding and basic induction, which possibly require both item and relational memory. Following [32], each story is preprocessed into a sentence-level sequence, which is fed into our STM as the input sequence. We jointly train STM on all tasks using normal supervised training (more in Appendix J). We compare our model with recent memory networks and report the results in Table 4.

MANNs such as DNC and NUTM have strong item memory, yet do not explicitly support relational learning, leading to significantly higher errors compared to the other models. On the contrary, TPR is explicitly equipped with relational bindings but lacks item memory, and thus clearly underperforms our STM. Universal Transformer (UT) supports a manually set item memory with dot product attention, showing a higher mean error than STM with its learned item memory and outer product attention. Moreover, our STM using a normal supervised loss outperforms MNM-p trained with a meta-level loss, establishing a new state of the art on the bAbI dataset. Notably, STM achieves this result with low variance, solving all 20 tasks in 9 out of 10 runs (see Appendix J).

Model | Mean error (%) | Best error (%)
DNC [12] | 12.8 ± 4.7 | 3.8
NUTM [22] | 5.6 ± 1.9 | 3.3
TPR [32] | 1.34 ± 0.52 | 0.81
UT [8] | 1.12 ± 1.62 | 0.21
MNM-p [25] | 0.55 ± 0.74 | 0.18
STM | 0.39 ± 0.18 | 0.15
Table 4: bAbI task: mean ± std. and best error (%) over 10 runs.

4 Related Work

Background on associative memory

Associative memory is a classical concept to model memory in the brain [23]. While outer product is one common way to form the associative memory, different models employ different memory retrieval mechanisms. For example, Correlation Matrix Memory (CMM) and Hopfield network use dot product and recurrent networks, respectively [16, 14]. The distinction between our model and other associative memories lies in the fact that our model’s association comes from several pieces of the memory itself rather than the input data. Also, unlike other two-memory systems [20, 22] that simulate data/program memory in computer architecture, our STM resembles item and relational memory in human cognition.

Background on attention

Attention is a mechanism that allows interactions between a query and a set of stored keys/values [11, 3]. The self-attention mechanism allows stored items to interact with each other, either in feed-forward [35] or recurrent [30, 21] networks. Although some form of relational memory can be kept in these approaches, they all use dot product attention to measure every interaction as a scalar, and thus lose much relational information. We use the outer product to represent each interaction as a matrix, and thus our outer product self-attention is supposed to be richer than the current self-attention mechanisms (Prop. 1).

SAM as fast-weight

The outer product implements Hebbian learning, a fast learning rule that can be used to build fast-weights [37]. As the name implies, fast-weights update whenever an input is introduced to the network and store the input pattern temporarily for sequential processing [1]. Meta-trained fast-weights [25] and gating of fast-weights [31, 41] have been introduced to improve memory capacity. Unlike these fast-weight approaches, our model is not built on top of other RNNs. Recurrence is naturally supported within STM.

The tensor product representation (TPR), a form of high-order fast-weight, can be designed for structural reasoning [33]. In a recent work [32], a third-order TPR resembles our relational memory in that both are third-order tensors. However, TPR does not enable interactions amongst stored patterns through a self-attention mechanism, and the meaning of each dimension of the TPR is not related to that of our relational memory. More importantly, TPR is restricted to the question answering task.

SAM as bi-linear model

Bi-linear pooling produces output from two input vectors by considering all pairwise bit interactions and thus can be implemented by means of outer product [34]. To reduce computation cost, either low-rank factorization [39] or outer product approximation [27] is used. These approaches aim to enrich feed-forward layers with bi-linear poolings yet have not focused on maintaining a rich memory of relationships.

Low-rank bi-linear pooling has been extended to perform visual attention [15]. It results in a different formulation from our outer product attention, which is equivalent to full-rank bi-linear pooling (Sec. 2.1). These methods are designed for static visual question answering, while our approach maintains a relational memory over time and can be applied to any sequential problem.

5 Conclusions

We have introduced the SAM-based Two-memory Model (STM), which implements both item and relational memory. To wire up the two-memory system, we employ a novel operator named Self-attentive Associative Memory (SAM) that constructs the relational memory from outer-product relationships between arbitrary pieces of the item memory. We apply read, write, and transfer operators to access, update, and distill the knowledge from the two memories. The ability of the proposed STM to remember items and their relationships is validated through a suite of diverse tasks including associative retrieval, N-farthest, vector algorithms, geometric and graph reasoning, reinforcement learning, and question answering. In all scenarios, our model demonstrates strong performance, confirming the usefulness of having both item and relational memory in one model.


  1. Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, pages 4331–4339, 2016a.
  2. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016b.
  3. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  4. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations, 2015.
  5. Bram Bakker. Reinforcement learning with long short-term memory. In Advances in neural information processing systems, pages 1475–1482, 2002.
  6. Mark J Buckley. The role of the perirhinal cortex and hippocampus in learning, memory, and perception. The Quarterly Journal of Experimental Psychology Section B, 58(3-4):246–268, 2005.
  7. Neal J Cohen, Russell A Poldrack, and Howard Eichenbaum. Memory for items and memory for relations in the procedural/declarative memory framework. Memory, 5(1-2):131–178, 1997.
  8. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
  9. Howard Eichenbaum. Memory, amnesia, and the hippocampal system. MIT press, 1993.
  10. Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  11. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
  12. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
  13. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  14. John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
  15. Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574, 2018.
  16. Teuvo Kohonen. Correlation matrix memories. IEEE transactions on computers, 100(4):353–359, 1972.
  17. Alex Konkel and Neal J Cohen. Relational memory and the hippocampus: representations and methods. Frontiers in neuroscience, 3:23, 2009.
  18. Dharshan Kumaran and James L McClelland. Generalization through the recurrent interaction of episodic memories: a model of the hippocampal system. Psychological review, 119(3):573, 2012.
  19. Hung Le, Truyen Tran, Thin Nguyen, and Svetha Venkatesh. Variational memory encoder-decoder. In Advances in Neural Information Processing Systems, pages 1508–1518, 2018a.
  20. Hung Le, Truyen Tran, and Svetha Venkatesh. Dual memory neural computer for asynchronous two-view sequential learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pages 1637–1645, New York, NY, USA, 2018b. ACM. ISBN 978-1-4503-5552-0. doi: 10.1145/3219819.3219981.
  21. Hung Le, Truyen Tran, and Svetha Venkatesh. Learning to remember more with less memorization. In International Conference on Learning Representations, 2019.
  22. Hung Le, Truyen Tran, and Svetha Venkatesh. Neural stored-program memory. In International Conference on Learning Representations, 2020.
  23. David Marr and W Thomas Thach. A theory of cerebellar cortex. In From the Retina to the Neocortex, pages 11–50. Springer, 1991.
  24. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  25. Tsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, and Adam Trischler. Metalearned neural memory. In Advances in Neural Information Processing Systems, pages 13310–13321, 2019.
  26. Ingrid R Olson, Katie Page, Katherine Sledge Moore, Anjan Chatterjee, and Mieke Verfaellie. Working memory for conjunctions relies on the medial temporal lobe. Journal of Neuroscience, 26(17):4596–4601, 2006.
  27. Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 239–247. ACM, 2013.
  28. Raúl Rojas. Neural networks: a systematic introduction. Springer Science & Business Media, 2013.
  29. Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
  30. Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pages 7299–7310, 2018.
  31. Imanol Schlag and Jürgen Schmidhuber. Gated fast weights for on-the-fly neural program generation. In NIPS Metalearning Workshop, 2017.
  32. Imanol Schlag and Jürgen Schmidhuber. Learning to reason with third order tensor products. In Advances in Neural Information Processing Systems, pages 9981–9993, 2018.
  33. Paul Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence, 46(1-2):159–216, 1990.
  34. Joshua B Tenenbaum and William T Freeman. Separating style and content with bilinear models. Neural computation, 12(6):1247–1283, 2000.
  35. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  36. Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.
  37. Christoph von der Malsburg. The correlation theory of brain function, 1981.
  38. Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
  39. Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 1821–1830, 2017.
  40. Dagmar Zeithamova, Margaret L Schlichting, and Alison R Preston. The hippocampus and inferential reasoning: building memories to navigate future decisions. Frontiers in human neuroscience, 6:70, 2012.
  41. Wei Zhang and Bowen Zhou. Learning to update auto-associative memory in recurrent neural networks for improving sequence memorization. ArXiv, abs/1709.06493, 2017.