The geometry of integration in text classification RNNs
Despite the widespread application of recurrent neural networks (RNNs) across a variety of tasks, a unified understanding of how RNNs solve these tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of a specific natural language processing task: text classification. Using tools from dynamical systems analysis, we study recurrent networks trained on a battery of both natural and synthetic text classification tasks. We find the dynamics of these trained RNNs to be both interpretable and low-dimensional. Specifically, across architectures and datasets, RNNs accumulate evidence for each class as they process the text, using a low-dimensional attractor manifold as the underlying mechanism. Moreover, the dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset; in particular, we describe how simple word-count statistics computed on the training dataset can be used to predict these properties. Our observations span multiple architectures and datasets, reflecting a common mechanism RNNs employ to perform text classification. To the degree that integration of evidence towards a decision is a common computational primitive, this work lays the foundation for using dynamical systems techniques to study the inner workings of RNNs.
Modern recurrent neural networks (RNNs) can achieve strong performance in natural language processing (NLP) tasks such as sentiment analysis, document classification, language modeling, and machine translation. However, the inner workings of these networks remain largely mysterious.
As RNNs are parameterized dynamical systems tuned to perform specific tasks, a natural way to understand them is to leverage tools from dynamical systems analysis. A challenge inherent to this approach is that the state space of modern RNN architectures—the number of units comprising the hidden state—is often high-dimensional, with layers routinely comprising hundreds of neurons. This dimensionality renders the application of standard representation techniques, such as phase portraits, difficult. Another difficulty arises from the fact that these are monolithic systems trained end-to-end. Instead of modular components with clearly delineated responsibilities that can be understood and tested independently, neural networks could learn an intertwined blend of different mechanisms needed to solve a task, making understanding them that much harder.
Recent work has shown that modern RNN architectures trained on binary sentiment classification learn low-dimensional, interpretable dynamical systems (Maheswaranathan et al., 2019). These RNNs were found to implement an integration-like mechanism, moving their hidden states along a line of stable fixed points to keep track of accumulated positive and negative tokens. Later, Maheswaranathan and Sussillo (2020) showed that contextual processing in these networks, e.g. handling word sequences like “not good”, was built on top of the line-integration mechanism, employing an additional subspace which the network entered upon encountering a modifier word (e.g. “not” and “very”). The understanding achieved in those works suggests the potential of the dynamical systems perspective. However, it remains to be seen if this perspective can shed light on RNNs in more complicated settings, beyond binary sentiment classification.
This work applies the dynamical systems perspective to understand how RNNs perform various multi-class text classification tasks, including document classification, review star prediction (from one to five), and emotion tagging. Because text classification is a fundamental task in natural language processing, and the underlying evidence-integration mechanisms are core computational primitives, precisely understanding how RNNs solve text classification may prove useful in understanding more complex NLP problems.
We find the network dynamics in these text classification tasks to be low-dimensional and interpretable. All architectures use manifolds of attractive fixed points to accumulate evidence for each class. When class labels are categorical (e.g. “sports” or “business”), the dimensionality of the integration manifold simply reflects the number of classes in the dataset. Integration is also the dominant mechanism when class labels are ordered (as in star prediction); here, however, the manifolds are two-dimensional in both three- and five-star settings. We show that this dimensionality reflects underlying correlations in the training dataset, and that simple count statistics in the raw dataset are sufficient to explain the observed dimensionality. This unified picture also allows us to understand the integration manifolds in the case of emotion tagging, which, unlike the previous tasks we study, allows each phrase to carry multiple labels.
In studying these tasks, we rely heavily on synthetic datasets, which offer a controlled setting to study network dynamics. We develop synthetic data for each type of classification task; networks trained on these datasets exhibit dynamics and geometry similar to those of their natural counterparts.
Related work Apart from the recent studies mentioned in the introduction—which to our knowledge are the only ones to study discrete-time RNNs in language tasks using dynamical systems analyses—the dynamical properties of continuous-time RNNs have been extensively studied (Vyas et al., 2020) for their potential connections to neural computation in biological systems.
A related body of work, interpretability of neural language models, is reviewed thoroughly in Belinkov and Glass (2018). Common methods of analysis include training auxiliary classifiers (e.g., part-of-speech) on RNN trajectories to probe the network’s representations; use of challenge sets to capture wider language phenomena than seen in natural corpora; and visualization of hidden unit activations as in Karpathy et al. (2015) and Radford et al. (2017).
Models We study three common RNN architectures: LSTMs (Hochreiter and Schmidhuber, 1997), GRUs (Cho et al., 2014), and UGRNNs (Collins et al., 2016). We denote their $n$-dimensional hidden state and $m$-dimensional input at time $t$ as $\mathbf{h}_t$ and $\mathbf{x}_t$, respectively. The function that applies the hidden state update for these networks will be denoted by $F$, so that $\mathbf{h}_t = F(\mathbf{h}_{t-1}, \mathbf{x}_t)$. The network’s hidden state after the entire example is processed, $\mathbf{h}_T$, is fed through a linear layer to get $N$ output logits, one for each label: $\mathbf{y} = \mathbf{W} \mathbf{h}_T + \mathbf{b}$. We call the rows of $\mathbf{W}$ ‘readout vectors’ and denote the readout corresponding to the $i$-th neuron by $\mathbf{r}_i$, for $i = 1, \ldots, N$. Throughout the main text, we will present results for the GRU architecture. Qualitative features of results were found to be constant across all architectures; additional results for the LSTM and UGRNN are given in Appendix E.
Tasks The classification tasks we study fall into three categories. In the categorical case, samples are classified into non-overlapping classes, for example “sports” or “politics”. By contrast, in the ordered case, there is a natural ordering among labels: for example, predicting a numerical rating (say, out of five stars) accompanying a user’s review. Like the categorical labels, ordered labels are exclusive. Some tasks, however, involve labels which may not be exclusive; an example of this multi-labeled case is tagging a document for the presence of one or more emotions. A detailed description of the natural and synthetic datasets used is provided in Appendices C and D, respectively.
Linearization and eigenmodes
Part of our analysis relies on linearization to render the complex RNN dynamics tractable.
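This linearization is possible because, as we will see, the RNN states visited during training and inference lie near approximate fixed points $\mathbf{h}^*$ of the dynamics: points that the update equation leaves (approximately) unchanged, i.e. for which $\mathbf{h}^* \approx F(\mathbf{h}^*, \mathbf{x}^*)$. Near such a point, the dynamics are well approximated by the first-order expansion
$$\mathbf{h}_t - \mathbf{h}^* \approx \mathbf{J}^{\text{rec}} \, (\mathbf{h}_{t-1} - \mathbf{h}^*) + \mathbf{J}^{\text{inp}} \, (\mathbf{x}_t - \mathbf{x}^*),$$
where we have defined the recurrent and input Jacobians $J^{\text{rec}}_{ij} = \partial F_i / \partial h_j$ and $J^{\text{inp}}_{ij} = \partial F_i / \partial x_j$, respectively, both evaluated at $(\mathbf{h}^*, \mathbf{x}^*)$ (see Appendix A for details).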
In the linear approximation, the spectrum of $\mathbf{J}^{\text{rec}}$ plays a key role in the resulting dynamics. Each eigenmode of $\mathbf{J}^{\text{rec}}$ represents a displacement whose magnitude either grows or shrinks exponentially in time, with a timescale $\tau_a$ determined by the magnitude of the corresponding (complex) eigenvalue $\lambda_a$ via the relation $\tau_a = \left| 1 / \log |\lambda_a| \right|$. Thus, eigenvalues within the unit circle represent stable (decaying) modes, while those outside represent unstable (growing) modes. The Jacobians we find in practice almost exclusively have stable modes, most of which decay on very short timescales (a few tokens). Eigenmodes near the unit circle have long timescales, and therefore facilitate the network’s storage of information.
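The relation between eigenvalue magnitude and timescale can be made concrete with a short NumPy sketch. The 3x3 matrix below is a hypothetical stand-in for a recurrent Jacobian, chosen only so that two modes sit near the unit circle; it is not taken from any trained network.

```python
import numpy as np

def mode_timescales(jacobian):
    """Eigenvalues of a Jacobian and their timescales tau = |1 / log|lambda|| (in tokens)."""
    eigvals = np.linalg.eigvals(jacobian)
    magnitudes = np.abs(eigvals)
    # |lambda| = 1 gives an infinite timescale; silence the divide warning for that case.
    with np.errstate(divide="ignore"):
        taus = np.abs(1.0 / np.log(magnitudes))
    return eigvals, taus

# Hypothetical recurrent Jacobian: two slow modes and one fast mode.
J_rec = np.diag([0.99, 0.98, 0.30])
eigvals, taus = mode_timescales(J_rec)

order = np.argsort(-taus)  # slowest modes first
for lam, tau in zip(eigvals[order], taus[order]):
    print(f"|lambda| = {abs(lam):.2f}  ->  tau = {tau:.1f} tokens")
```

In this toy case the two eigenvalues near 1 persist for tens of tokens, while the mode at 0.30 decays in under one token, which mirrors the separation between integration modes and fast modes described above.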
Latent semantic analysis For a given text classification task, one can summarize the data by building a matrix of word or token counts for each class (analogous to a document-term matrix (Manning and Schutze, 1999), where the documents are classes). Here, the $(i, j)$ entry corresponds to the number of times the $j$-th word in the vocabulary appears in examples belonging to the $i$-th class. In effect, the column corresponding to a given word forms an “evidence vector”, i.e. a large entry in a particular row suggests strong evidence for the corresponding class. Latent semantic analysis (LSA) (Deerwester et al., 1990) looks for structure in this matrix via a singular value decomposition (SVD); if the evidence vectors lie predominantly in a low-dimensional subspace, LSA will pick up on this structure. The top singular modes define a “semantic space”: the left singular vectors correspond to the projections of each class label into this space, and the right singular vectors correspond to how individual tokens are represented in this space.
Below, we will show that RNNs trained on classification tasks pick up on the same structure in the dataset as LSA; the dimensionality and geometry of the semantic space predict corresponding features of the RNNs.
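As an illustration of the LSA procedure, the following NumPy sketch builds a class-by-word count matrix and takes its SVD. The toy corpus and the function names are hypothetical. Subtracting each word's mean count across classes before the SVD is an assumption of this sketch (it removes the direction that carries no relative evidence) and is not a preprocessing step stated in the text.

```python
import numpy as np

def class_word_counts(examples, labels, vocab, n_classes):
    """Count matrix with one row per class and one column per vocabulary word."""
    index = {word: j for j, word in enumerate(vocab)}
    counts = np.zeros((n_classes, len(vocab)))
    for tokens, label in zip(examples, labels):
        for tok in tokens:
            counts[label, index[tok]] += 1
    return counts

def lsa(counts, center=True):
    """SVD of the count matrix.

    center=True subtracts each word's mean count across classes so that only
    relative evidence remains (an assumption of this sketch).
    """
    M = counts - counts.mean(axis=0, keepdims=True) if center else counts
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    var_explained = s**2 / np.sum(s**2)
    return U, s, Vt, var_explained

# Toy 3-class corpus (hypothetical): each class favors its own evidence word.
vocab = ["evid0", "evid1", "evid2", "neutral"]
examples = [
    ["evid0"] * 10 + ["evid1", "evid2"] + ["neutral"] * 5,
    ["evid1"] * 10 + ["evid0", "evid2"] + ["neutral"] * 5,
    ["evid2"] * 10 + ["evid0", "evid1"] + ["neutral"] * 5,
]
labels = [0, 1, 2]

counts = class_word_counts(examples, labels, vocab, n_classes=3)
U, s, Vt, var_explained = lsa(counts)
print(np.round(var_explained, 3))
```

For this symmetric toy corpus the first two singular modes carry all of the variance and the third vanishes, the same two-dimensional structure that the text reports for three-class categorical data. The rows of `U` play the role of class vectors, and the rows of `Vt` give the token representations.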
Regularization While the main text focuses on the interaction between dataset statistics and resulting network dimensionality, regularization also plays a role in determining the dynamical structures. In particular, strongly regularizing the network can reduce the dimensionality of the resulting manifolds, while weakly regularizing can increase the dimensionality. Focusing on $\ell_2$-regularization, we document this effect for the synthetic and natural datasets in Appendices D.1 and F, respectively.
3.1 Categorical classification yields simplex attractors
We begin by analyzing networks trained on categorical classification datasets, with natural examples including news articles (AG News dataset) and encyclopedia entries (DBPedia Ontology dataset). We find dynamics in these networks which are largely low-dimensional and governed by integration. Contrary to our initial expectations, however, the dimensionality of the network’s integration manifolds is not simply equal to the number of classes in the dataset. For example, rather than exploring a three-dimensional cube, RNNs trained on 3-class categorical tasks exhibit a largely two-dimensional state space which resembles an equilateral triangle (Fig. 1a, d). As we will see, this is an example of a pattern that generalizes to larger numbers of classes.
Synthetic categorical data To study how networks perform $N$-class categorical classification, we introduce a toy language whose vocabulary includes $N+1$ words: a single evidence word “evid$_i$” for each label $i$, and a neutral word “neutral”. Synthetic phrases, generated randomly, are labeled with the class for which they contain the most evidence words (see Appendix D for more details). This is analogous to a simple mechanism which classifies documents as, e.g., “sports” or “finance” based on whether they contain more instances of the word “football” or “dollar”.
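A minimal generator for this kind of toy language might look as follows. The uniform sampling over the vocabulary, the word names `evid0`, `evid1`, ..., and the tie-breaking rule (lowest class index wins) are assumptions of this sketch; the procedure actually used is the one described in Appendix D.

```python
import numpy as np

def make_phrase(rng, n_classes, length):
    """Sample a random phrase and label it by its most frequent evidence word."""
    vocab = np.array([f"evid{i}" for i in range(n_classes)] + ["neutral"])
    tokens = rng.choice(vocab, size=length)  # uniform sampling (assumption)
    counts = np.array([np.sum(tokens == f"evid{i}") for i in range(n_classes)])
    label = int(np.argmax(counts))  # ties go to the lowest index (assumption)
    return tokens.tolist(), label, counts

rng = np.random.default_rng(0)
tokens, label, counts = make_phrase(rng, n_classes=3, length=40)
print(tokens[:8], "->", label, counts)
```

Because the label depends only on the counts, any permutation of a phrase receives the same label, which is why an order-insensitive integrator is sufficient for this task.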
The main features of the categorical networks’ integration manifolds are clearly seen in the 3-class synthetic case. First, the dynamics are low-dimensional: performing PCA on the set of hidden states explored from hundreds of test phrases reveals that more than of its variance is contained in the top two dimensions. Projected onto these dimensions, the set of network trajectories takes the shape of an equilateral triangle (Fig. 1a). Diving deeper into the dynamics of the trained network, we examine the deflections, or change in hidden state, $\Delta \mathbf{h}_t = \mathbf{h}_t - \mathbf{h}_{t-1}$, induced by each word. The deflections due to evidence words “evid$_i$” align with the corresponding readout vector at all times (Fig. 1b). Meanwhile, the deflection caused by the “neutral” word is much smaller, and on average, nearly zero. This suggests that the RNN dynamics approximate those of a two-dimensional integrator: as the network processes each example, evidence words move its hidden state within the triangle in a manner that is approximately constant across the phrase. The location of the hidden state within the triangle encodes the integrated, relative counts of evidence for each of the three classes. Since the readouts are of approximately equal magnitude and align with the triangle’s vertices, the phrase is ultimately classified by whichever vertex is closest to the final hidden state. This corresponds to the evidence word that occurs most often in the given phrase.
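The integrator picture can be checked against an idealized model. In the sketch below, which is a caricature under stated assumptions and not the trained network, each evidence word deflects the state by exactly its class readout vector, the neutral word causes no deflection, and the readouts are the centered one-hot vectors that form an equilateral triangle.

```python
import numpy as np

N = 3
# Idealized readout vectors: one-hot vectors with the mean removed, so the
# three rows sum to zero and span a two-dimensional plane (assumption).
readouts = np.eye(N) - 1.0 / N

# Idealized deflections: each evidence word moves the state along its readout.
deflection = {f"evid{i}": readouts[i] for i in range(N)}
deflection["neutral"] = np.zeros(N)

def classify(tokens):
    """Integrate deflections over the phrase, then read out with a linear layer."""
    h = np.zeros(N)
    for tok in tokens:
        h = h + deflection[tok]
    logits = readouts @ h
    return int(np.argmax(logits)), h

label, h_final = classify(["evid1", "neutral", "evid2", "evid1"])
print(label, h_final)
```

In this model the logit for class $j$ equals the count of “evid$_j$” minus the mean count over classes, so the predicted label is the most frequent evidence word, and the state never leaves the plane orthogonal to the all-ones direction.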
Natural categorical data Despite the simplicity of the synthetic categorical dataset, its working mechanism generalizes to networks trained on natural datasets. We focus here on the 3-class AG News dataset, with matching results for 4-class AG news and 3- and 4-class DBPedia Ontology in Appendix E. Hidden states of these networks, as in the synthetic case, fill out an approximate equilateral triangle whose vertices once again lie parallel to the readout vectors (Fig. 1d). While these results bear a strong resemblance to their synthetic counterparts, the manifolds for natural datasets are, unsurprisingly, less symmetric.
Though the vocabulary in natural corpora is much larger than the synthetic vocabulary, the network still learns the same underlying mechanism: by suitably arranging its input Jacobian and embedding vectors, it aligns an input word’s deflection in the direction that changes relative class scores appropriately (Fig. 1e). Most words behave like the synthetic word “neutral”, causing little movement within the plane; certain words, however, (like “football”) cause a large shift toward a particular vertex (in this case, “Sports”). Again, the perturbation is relatively uniform across the plane, indicating that the order of words does not strongly influence the network’s prediction.
In both synthetic and natural cases, the two-dimensional integration mechanism is enabled by a manifold of approximate fixed points near the network trajectories, which allow the network to maintain its position in the absence of new evidence. As the position within the plane encodes the network’s integrated evidence, this maintenance is essential. In all 3-class categorical networks, we find a planar, approximately triangle-shaped manifold of fixed points which lie near the network trajectories (Fig. 1c, f); vertices of this manifold align with the readout vectors. PCA reveals the dimensionality of this manifold to be identical to that of the hidden states.
The fixed points’ stability is crucial to ensure robust storage of the integrated evidence. We check for stability directly by linearizing the dynamics around each fixed point and examining the spectra of the recurrent Jacobians. Almost all of the Jacobian’s eigenvalues lie well within the unit circle, corresponding to perturbations which decay on the timescale of a few tokens. Only two modes, which lie within the fixed-point plane, are capable of preserving information on timescales on the order of the mean document length (Fig. 2a). This linear stability analysis confirms our picture of a two-dimensional attractor manifold of fixed points; the network dynamics quickly suppress activity in dimensions outside of the fixed-point plane. Integration, i.e. motion within the fixed-point plane, is enabled by two eigenmodes with long time constants.
LSA predictions Intuitively, the two-dimensional structure in this three-class classification task reflects the fact that the network tracks relative score between the three classes to make its prediction. To see this two-dimensional structure quantitatively in the dataset statistics, we apply latent semantic analysis (LSA) to the dataset, finding a low-rank approximation to the evidence vectors of all the words in the vocabulary. This analysis (Fig. 1h) shows that two modes are necessary to capture the variance, just as we observed in the RNNs. Moreover, the class vectors projected into this space (Fig. 1g) match exactly the structure observed in the RNN readouts. The network appears to pick up on the same structure in the dataset’s class counts identified by LSA.
General $N$-class categorical networks
The triangular structure seen in the 3-class networks above is an example of a general pattern: $N$-class categorical classification tasks result in an $(N-1)$-dimensional simplex attractor (Fig. 3a). We verify this with synthetic data consisting of up to classes, analyzing the subspace of $\mathbb{R}^n$ explored by the resulting networks. More than of the variance of the hidden states is contained in $N-1$ dimensions, with the subspace taking on the approximate shape of a regular $(N-1)$-simplex centered about the origin (Fig. 1b).
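The geometry of a regular simplex centered at the origin is easy to verify numerically. The construction below (centered one-hot vectors) is one standard way to build such a simplex and is used here purely for illustration; it says nothing about how the trained networks arrive at this shape.

```python
import numpy as np

def simplex_vertices(n_classes):
    """Vertices of a regular (N-1)-simplex centered at the origin, embedded in R^N."""
    eye = np.eye(n_classes)
    return eye - eye.mean(axis=0, keepdims=True)

N = 4
V = simplex_vertices(N)

# The N vertices span only N-1 dimensions because they sum to zero.
rank = np.linalg.matrix_rank(V)

# All pairs of distinct vertices are equidistant.
pairwise = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)

# The angle between any two vertex directions has cosine -1/(N-1).
unit = V / np.linalg.norm(V, axis=1, keepdims=True)
cosines = unit @ unit.T

print(rank, np.round(pairwise[0, 1], 3), np.round(cosines[0, 1], 3))
```

For $N = 3$ this gives the equilateral triangle with vertex directions 120 degrees apart, and for $N = 4$ the tetrahedron, matching the shapes described for the 3-class and 4-class networks.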
As a natural example of this simplex, the full, 4-class AG News dataset results in networks whose trajectories explore an approximate 3-simplex, or tetrahedron. The fixed points also form a 3-dimensional tetrahedral attractor (Fig. 3c). Additional results for 4-class natural datasets, which also yield tetrahedron attractors, are shown in Appendix E.
3.2 Ordered classification yields plane attractors
Having seen networks employ simplex attractors to integrate evidence in categorical classification, we turn to ordered datasets, with Yelp and Amazon review star prediction as natural examples. Star prediction is a more finely-grained version of binary sentiment prediction, which RNNs solve by integrating valence along a one-dimensional line attractor (Maheswaranathan et al., 2019). One might therefore expect star prediction to rely on a similar one-dimensional mechanism. This turns out not to be the case for either 3-class or 5-class star prediction.
For a network trained on the 5-class Yelp dataset, we plot the two-dimensional projection of RNN trajectories while processing a test batch of reviews as well as the readout vectors for each class (Fig. 4d). Similar results for 3-class Yelp and Amazon networks are in Appendix E. The top two dimensions capture more than 98% of the variance in the explored states: as with categorical classification tasks, the dynamics here are largely low-dimensional. A manifold of fixed points that is also planar exists nearby (Fig. 4f). The label predicted by the network is determined almost entirely by the position within the plane. Additionally, eigenmodes of the linearized dynamics around these fixed points show two slow modes with timescales comparable to document length, separated by a clear gap from the other, much faster, modes (Fig. 2c, d). These two integration modes lie almost entirely within the fixed-point plane, while the others are nearly orthogonal to it.
These facts suggest that, in contrast to binary sentiment analysis, 5-class (and 3-class) ordered networks are two-dimensional, tracking two scalars associated with each token rather than a single sentiment score. As an initial clue to understanding what these two dimensions represent, we examine the deflections in the plane caused by particular words (Fig. 4f). These deflections span two dimensions: in contrast to a one-dimensional integrator, the word “horrible” has a different effect than multiple instances of a weaker word like “poor”. These two dimensions of the deflection vector seem to roughly correspond to a word’s “sentiment” (e.g. good vs. bad) and “intensity” (strong vs. neutral). In this two-dimensional integration, a word like “okay” is treated by the network as evidence of a neutral (e.g., 3-star) review.
Inspired by these observations, we build a synthetic ordered dataset with a word bank , in which each word now has a separate sentiment and intensity score.
More generally, the appearance of a plane attractor in both 3-class and 5-class ordered classification shows that in integration models, relationships (such as order) between classes can change the dimensionality of the network’s integration manifold. These relationships cause the LSA evidence vectors for each word to lie in a low-dimensional space. As in the previous section, we can see this low-dimensional structure in the dataset statistics themselves using LSA, which shows that two singular values explain more than 95% of the variance (Fig. 4f). Thus, the planar structure of these networks, with dimensions tracking both (roughly) sentiment and intensity, is a consequence of correlations present in the dataset itself.
3.3 Multi-labeled classification yields independent attractors
So far, we have studied classification datasets where there is only a single label per example. This only requires networks to keep track of the relative evidence for each class, as the overall evidence does not affect the classification. Put another way, the softmax activation used in the final layer will normalize out the total evidence accumulated for a given example. This results in networks that, for an $N$-way classification task, need to integrate (or remember) at most $N-1$ quantities, as we have seen above. However, this is not true in multi-label classification. Here, individual class labels are assigned independently to each example (the task involves $N$ independent binary decisions). Networks trained on this task do need to keep track of the overall evidence level.
To study how this changes the geometry of integration, we trained RNNs on a multi-label classification dataset, GoEmotions (Demszky et al., 2020). Here, the labels are emotions and a particular text may be labeled with multiple emotions. We trained networks on two reduced variants of the full dataset, only keeping two or three labels. The results for three labels are detailed in Appendix E.5. For the two-class version, we only kept the labels “admiration” and “approval”, and additionally resampled the dataset so that each of the possible label combinations was equally likely. We found that RNNs learned a two-dimensional integration manifold where the readout vectors span a two-dimensional subspace (Fig. 5d), rather than a one-dimensional line as in binary classification. Across the fixed point manifold, there were consistently two slow eigenvalues (Fig. 5e), corresponding to the two integration modes. Similar to the previous datasets, increasing regularization would (eventually) compress the dimensionality, again measured using the participation ratio (Fig. 5f).
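The difference between the two output layers can be seen directly. In the sketch below, "total evidence" is modeled as a common shift added to all logits, which is an illustrative simplification: softmax probabilities are unchanged by such a shift, while independent sigmoid outputs are not.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 0.5, -1.0])
shifted = logits + 3.0  # same relative evidence, more total evidence

print(softmax(logits), softmax(shifted))   # identical probabilities
print(sigmoid(logits), sigmoid(shifted))   # different probabilities
```

The shift invariance removes one degree of freedom from an $N$-way softmax classifier, which is why at most $N-1$ integrated quantities are needed in the exclusive case, whereas $N$ independent binary decisions each depend on the absolute level of their own logit.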
The synthetic version of this task classifies a phrase as if it were composed of two independent sentiment analyses (detailed in Appendix D.3). This is meant to represent the presence/absence of a given emotion in a given phrase, but ignores the possibility of correlations between certain emotions. After training a network on this data, we find a low-dimensional hidden-state and fixed-point space that both take on the shape of a square (Fig. 5a, c). The deflections of words affecting independent labels act along orthogonal directions (Fig. 5b).
These results suggest that integration manifolds are also found in RNNs trained on multi-labeled classification datasets. Moreover, the geometry of the corresponding fixed points and readouts is different from the exclusive case; instead of an $(N-1)$-dimensional simplex we get an $N$-dimensional hypercube. Again, this makes intuitive sense given that the networks must keep track of $N$ independent quantities in order to solve these tasks.
In this work we have studied text classification RNNs using dynamical systems analysis. We found integration via attractor manifolds to underlie these tasks, and showed how the dimension and geometry of the manifolds were determined by statistics of the training dataset. As specific examples, we see $(N-1)$-dimensional simplexes in $N$-class categorical classification where the network needs to track relative class scores; two-dimensional attractors in ordered classification, reflecting the need to track sentiment and intensity; and $N$-dimensional hypercubes in $N$-class multi-label classification.
We hope this line of analysis — using dynamical systems tools to understand RNNs — builds toward a deeper understanding of how neural language models perform more involved tasks in NLP, including language modeling or translation. These tasks cannot be solved by a pure integration mechanism, but it is plausible that integration serves as a useful computational primitive in RNNs more generally, similar to how line attractor dynamics serve as a computational primitive on top of which contextual processing occurs (Maheswaranathan and Sussillo, 2020).
The authors would like to thank Ethan Dyer and Jascha Sohl-Dickstein for helpful discussions. The work of KA was supported, in part, by the Simons Foundation as part of the Simons Collaboration on Ultra Quantum Matter.
Appendix A Methods
A.1 Fixed-points and Linearization
We study several RNN architectures and we will generically denote their $n$-dimensional hidden state and $m$-dimensional input at time $t$ as $\mathbf{h}_t$ and $\mathbf{x}_t$, respectively. The function that applies the hidden state update for these networks will be denoted by $F$, so that $\mathbf{h}_t = F(\mathbf{h}_{t-1}, \mathbf{x}_t)$. The output logits are a readout of the final hidden state, $\mathbf{y} = \mathbf{W} \mathbf{h}_T + \mathbf{b}$. We will denote the readout corresponding to the $i$th neuron by $\mathbf{r}_i$, for $i = 1, \ldots, N$.
We define a fixed point $\mathbf{h}^*$ of the hidden-state space to satisfy the expression $\mathbf{h}^* = F(\mathbf{h}^*, \mathbf{x})$. This definition of fixed points is inherently $\mathbf{x}$-dependent. In this text, we focus on fixed points when $\mathbf{x} = \bar{\mathbf{x}}$, where $\bar{\mathbf{x}}$ is defined to be the average input of the system. Since the natural datasets we consider in this work generally have a large vocabulary of words and the input of the system is one-hot encoded, for those datasets we will make the approximation $\bar{\mathbf{x}} \approx \mathbf{0}$. We will also be interested in finding points in hidden state space that only satisfy this fixed point relation approximately, i.e. $\mathbf{h} \approx F(\mathbf{h}, \mathbf{x})$. The slowness of the approximate fixed points can be characterized by defining a loss function $q(\mathbf{h}) \propto \| \mathbf{h} - F(\mathbf{h}, \mathbf{x}) \|_2^2$. Throughout this text we use the term fixed point to include these approximate fixed points as well.
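Expanding around a given hidden state and input, $(\mathbf{h}^e, \mathbf{x}^e)$, the first-order approximation of $F$ is
$$\mathbf{h}_t \approx F(\mathbf{h}^e, \mathbf{x}^e) + \mathbf{J}^{\text{rec}} \, (\mathbf{h}_{t-1} - \mathbf{h}^e) + \mathbf{J}^{\text{inp}} \, (\mathbf{x}_t - \mathbf{x}^e),$$
where we have defined the recurrent and input Jacobians as $J^{\text{rec}}_{ij} = \partial F_i / \partial h_j$ and $J^{\text{inp}}_{ij} = \partial F_i / \partial x_j$, respectively, both evaluated at $(\mathbf{h}^e, \mathbf{x}^e)$. If we expand about a fixed point $\mathbf{h}^*$ at input $\mathbf{x}^*$, so that $F(\mathbf{h}^*, \mathbf{x}^*) = \mathbf{h}^*$, the effect of an input $\mathbf{x}_t$ on the hidden state can be approximated by $\mathbf{J}^{\text{inp}} (\mathbf{x}_t - \mathbf{x}^*)$. Writing the eigendecomposition, $\mathbf{J}^{\text{rec}} = \mathbf{R} \boldsymbol{\Lambda} \mathbf{L}$, with $\mathbf{L} = \mathbf{R}^{-1}$, we have
$$\mathbf{J}^{\text{rec}} = \sum_{a=1}^{n} \lambda_a \, \mathbf{v}_a \boldsymbol{\ell}_a^\top,$$
where $\boldsymbol{\Lambda}$ is the diagonal matrix containing the (complex) eigenvalues $\lambda_1, \ldots, \lambda_n$, sorted in order of decreasing magnitude; $\mathbf{v}_a$ are the columns of $\mathbf{R}$; and $\boldsymbol{\ell}_a^\top$ are the rows of $\mathbf{L}$. The magnitude of each eigenvalue of $\mathbf{J}^{\text{rec}}$ corresponds to a time constant $\tau_a = \left| 1 / \log |\lambda_a| \right|$. The time constants, $\tau_a$, approximately determine how long and what information the system remembers from a given input.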
We find fixed points by minimizing a function which computes the magnitude of the displacement resulting from applying the update rule $F$ at point $\mathbf{h}$. That is, we numerically solve
$$\min_{\mathbf{h}} \; \| \mathbf{h} - F(\mathbf{h}, \bar{\mathbf{x}}) \|_2^2 .$$
We seed the minimization procedure with hidden states visited by the network while processing test examples. To better sample the region, we also add some isotropic Gaussian noise to the initial points.
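The fixed-point search can be sketched on a toy system. Everything specific below is an assumption made for illustration: a small randomly initialized tanh RNN at fixed input stands in for a trained network, the loss carries a factor of 1/2, and plain gradient descent with a hand-derived gradient replaces whatever optimizer was actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Hypothetical stand-in for a trained RNN at fixed input: h_t = tanh(W h_{t-1} + b).
W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)
b = 0.1 * rng.standard_normal(n)

def F(h):
    return np.tanh(W @ h + b)

def q(h):
    """Half the squared displacement produced by one application of the update rule."""
    d = F(h) - h
    return 0.5 * d @ d

def grad_q(h):
    f = F(h)
    J = (1.0 - f**2)[:, None] * W        # recurrent Jacobian dF/dh at h
    return (J - np.eye(n)).T @ (f - h)   # chain rule applied to q

def find_fixed_point(h0, lr=0.2, steps=2000):
    h = h0.copy()
    for _ in range(steps):
        h -= lr * grad_q(h)
    return h

# Seed with a "visited" state plus isotropic Gaussian noise, as described above.
visited_state = rng.standard_normal(n)
h_star = find_fixed_point(visited_state + 0.1 * rng.standard_normal(n))

# Linearize at the fixed point and inspect the spectrum.
J_star = (1.0 - F(h_star)**2)[:, None] * W
print(q(h_star), np.abs(np.linalg.eigvals(J_star)).max())
```

In practice one would run this from many seeds and keep every point whose loss falls below a tolerance, so that slow points are collected along with exact fixed points.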
A.2 Dimensionality Measures
Here we provide details regarding the measures used to determine the dimensionality of both our hidden-state and fixed-point manifolds. When we discuss the dimensionality of a set of points, we will mean their intrinsic dimensionality. Roughly, this is the dimensionality of a manifold that summarizes the discrete data points, accounting for the fact that said manifold could be embedded in a higher-dimensional space in a non-linear fashion. For example, if the discrete points lie along a one-dimensional line that is non-linearly embedded in some two-dimensional space, then the measure of intrinsic dimensionality should be close to 1.
Let $X = \{\mathbf{x}_i\}$, for $i = 1, \ldots, M$, be the set of points for which we wish to measure the dimensionality. In this text, $X$ is either a set of hidden-states or a set of fixed points and each $\mathbf{x}_i$ is a point in hidden-state space. To determine an accurate measure of dimensionality, we use the following measures:
Variance explained threshold. Let $\mu_1 \geq \mu_2 \geq \cdots$ be the eigenvalues from PCA (i.e. the variances) on $X$. A simple measure of dimensionality is the number of PCA dimensions needed to reach a certain percentage of variance explained. For a low number of classes, this threshold can simply be set at fixed values such as 90% or 95%. However, we would expect such a threshold to break down as the number of classes increases, so we also use an $N$-dependent threshold of %.
Global participation ratio. Again using PCA on $X$ as above, the participation ratio (PR) is defined to be a scalar function of the PCA eigenvalues $\mu_i$:
$$\text{PR} = \frac{\left( \sum_i \mu_i \right)^2}{\sum_i \mu_i^2}. \tag{5}$$
Intuitively, this is a scalar measure of the number of “important” PCA dimensions.
Local participation ratio. Since PCA is a linear mapping, both of the above measures will fail if the manifold is highly non-linear. We thus implement a local PCA as follows: we choose a random point and compute its $k$ nearest neighbors, then perform PCA on this subset of points. We then calculate the participation ratio on the eigenvalues of the local PCA using equation 5. We repeat the process over several random points, and then average the results. This measure is dependent upon the hyperparameter $k$.
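MLE measure of intrinsic dimension (Levina and Bickel, 2005). This is a nearest-neighbor based measure of dimension. For a point $\mathbf{x}_i$, let $T_k(\mathbf{x}_i)$ be the Euclidean distance to its $k$th nearest neighbor. Define the scalar quantities
$$\hat{m}_k(\mathbf{x}_i) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(\mathbf{x}_i)}{T_j(\mathbf{x}_i)} \right]^{-1}, \qquad \hat{m}_k = \frac{1}{M} \sum_{i=1}^{M} \hat{m}_k(\mathbf{x}_i).$$
This measure is also dependent upon the number of nearest neighbors $k$.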
We plot these dimensionality measures used on synthetic categorical data for class sizes to in Figure 6. Despite their simplicity, we find the variance explained threshold and the global participation ratio to be the best match to what is theoretically predicted, hence we use these measures in the main text and in what follows.
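The PCA-based measures described in this section can be sketched in a few lines of NumPy. The function names, the default neighbor count `k=20`, the number of sampled points, and the test data (a noisy two-dimensional plane embedded in ten dimensions) are all hypothetical choices for this illustration.

```python
import numpy as np

def pca_variances(X):
    """PCA eigenvalues (variances along principal axes), in decreasing order."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s**2 / (X.shape[0] - 1)

def dims_for_variance(variances, threshold=0.95):
    """Smallest number of leading PCA dimensions whose variance fraction reaches the threshold."""
    frac = np.cumsum(variances) / np.sum(variances)
    return int(min(np.searchsorted(frac, threshold) + 1, len(variances)))

def participation_ratio(variances):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    return np.sum(variances) ** 2 / np.sum(variances**2)

def local_participation_ratio(X, k=20, n_samples=50, seed=0):
    """Average participation ratio of PCA on the k nearest neighbors of random points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    prs = []
    for i in idx:
        dist = np.linalg.norm(X - X[i], axis=1)
        neighbors = X[np.argsort(dist)[:k]]  # the point itself is included
        prs.append(participation_ratio(pca_variances(neighbors)))
    return float(np.mean(prs))

# Hypothetical test data: a 2-D Gaussian cloud on a plane in 10-D, plus small noise.
rng = np.random.default_rng(1)
basis, _ = np.linalg.qr(rng.standard_normal((10, 2)))  # orthonormal columns
X = rng.standard_normal((2000, 2)) @ basis.T + 1e-3 * rng.standard_normal((2000, 10))

variances = pca_variances(X)
print(dims_for_variance(variances), participation_ratio(variances),
      local_participation_ratio(X))
```

On this planar example the variance threshold returns two dimensions and the global participation ratio is close to two. The local measure is noisier because each estimate uses only `k` points.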
Appendix B Models and Training
The three architectures we study are specified below, with $\mathbf{W}$ and $\mathbf{b}$ respectively representing trainable weight matrices and bias parameters, and $\mathbf{h}_t$ denoting the hidden state at timestep $t$. All other vectors represent intermediate quantities; $\sigma$ represents a pointwise sigmoid nonlinearity; and $f$ is either the ReLU or tanh nonlinearity.
Update-Gate RNN (UGRNN)
Gated Recurrent Unit (GRU)
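As a reference point, a GRU update in its commonly used form (Cho et al., 2014) can be sketched in NumPy as follows. The parameter names, the random initialization, and the convention that the update gate multiplies the previous state are assumptions of this sketch and may differ from the exact parameterization used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, p, f=np.tanh):
    """One GRU update. `p` maps illustrative names (W_rh, b_r, ...) to arrays."""
    r = sigmoid(p["W_rh"] @ h_prev + p["W_rx"] @ x + p["b_r"])  # reset gate
    u = sigmoid(p["W_uh"] @ h_prev + p["W_ux"] @ x + p["b_u"])  # update gate
    c = f(p["W_ch"] @ (r * h_prev) + p["W_cx"] @ x + p["b_c"])  # candidate state
    return u * h_prev + (1.0 - u) * c

def init_params(n, m, rng):
    """Random weights for hidden size n and input size m (hypothetical initialization)."""
    p = {}
    for gate in "ruc":
        p[f"W_{gate}h"] = rng.standard_normal((n, n)) / np.sqrt(n)
        p[f"W_{gate}x"] = rng.standard_normal((n, m)) / np.sqrt(m)
        p[f"b_{gate}"] = np.zeros(n)
    return p

rng = np.random.default_rng(0)
n, m = 16, 4
params = init_params(n, m, rng)

h = np.zeros(n)
for token_id in [0, 3, 3, 1]:
    x = np.eye(m)[token_id]  # one-hot input, as in the synthetic setting
    h = gru_step(h, x, params)
print(h.shape)
```

When the update gate saturates near one, the state is copied forward unchanged. This is one way a gated network can hold its position on a manifold of approximate fixed points between informative tokens.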
With the natural datasets, we form the input vectors by using a (learned) 128-dimensional embedding layer. These UGRNNs and GRUs have hidden-state dimension , while the LSTMs have hidden-state dimension . For the synthetic datasets, due to their small vocabulary size, we simply pass one-hot encoded inputs in the RNN architectures, i.e. we use no embedding layer. For UGRNNs and GRU, we use a hidden-state dimension of , while for LSTMs we use a -dimensional hidden-state.
The model’s predictions (logits for each class) are computed by passing the final hidden state through a linear layer. In the synthetic experiments, we do not add a bias term to this linear readout layer, chosen for simplicity and ease of interpretation.
We train the networks using the Adam optimizer (Kingma and Ba, 2014) with an exponentially-decaying learning rate schedule. We train using cross-entropy loss with added $\ell_2$ regularization, penalizing the squared norm of the network parameters. Natural experiments use a batch size of 64 with initial learning rate , clipping gradients to a maximum value of 30; the learning rate decays by every step. Synthetic experiments use a batch size of 128, initial learning rate , and a gradient clip of ; the learning rate decays by every step.
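The three training ingredients named above (an exponentially decaying learning rate, gradient clipping, and a squared-norm penalty) can be sketched as small helper functions. All numerical values in the example are hypothetical placeholders, and clipping by the joint norm of all gradients is an assumption; the text does not say whether clipping is applied by norm or by value.

```python
import numpy as np

def exponential_decay(step, init_lr, decay_rate, decay_steps):
    """Learning rate after `step` updates: init_lr * decay_rate ** (step / decay_steps)."""
    return init_lr * decay_rate ** (step / decay_steps)

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so that their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / max(total, 1e-12))
    return [g * scale for g in grads]

def l2_penalty(params, strength):
    """Regularization term added to the loss: strength * sum of squared parameters."""
    return strength * sum(np.sum(p**2) for p in params)

# Hypothetical values, for illustration only.
lrs = [exponential_decay(s, init_lr=0.01, decay_rate=0.5, decay_steps=1000)
       for s in (0, 1000, 2000)]
clipped = clip_by_global_norm([np.array([30.0, 40.0])], max_norm=30.0)
penalty = l2_penalty([np.ones(3)], strength=0.1)
print(lrs, clipped, penalty)
```

Increasing `strength` is the knob that corresponds to the regularization sweeps described in the main text, where stronger penalties eventually compress the dimensionality of the integration manifold.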
Appendix C Natural Dataset Details
We use the following text classification datasets in this study:
The Yelp reviews dataset (Zhang et al., 2015) consists of Yelp reviews, labeled by the corresponding star rating (1 through 5). Each of the five classes features 130,000 training examples and 10,000 test examples. The mean length of a review is 143 words.
The Amazon reviews dataset (Zhang et al., 2015) consists of reviews of products bought on Amazon.com over an 18-year period. As with the Yelp dataset, these reviews are labeled by the corresponding star rating (1 through 5). Each of the five classes features 600,000 training examples and 130,000 test examples. The mean length of a review is 86 words.
The DBPedia ontology dataset (Zhang et al., 2015) consists of titles and abstracts of Wikipedia articles in one of 14 non-overlapping categories, from DBPedia 2014. Categories include: company, educational institution, artist, athlete, office holder, mean of transportation, building, natural place, village, animal, plant, album, film, and written work. Each class contains 40,000 training examples and 5,000 testing examples. We use the abstract only for classification; mean abstract length is 56 words.
The AG’s news corpus (Zhang et al., 2015) contains titles and descriptions of news articles from the web, in the categories: world, sports, business, sci/tech. Each category features 30,000 training examples and 1,900 testing examples. We use only the descriptions for classification; the mean length of a description is 35 words.
The GoEmotions dataset (Demszky et al., 2020) contains text from 58,000 Reddit comments collected between 2005 and 2019. These comments are labeled with the following 27 emotions: admiration, approval, annoyance, gratitude, disapproval, amusement, curiosity, love, optimism, disappointment, joy, realization, anger, sadness, confusion, caring, excitement, surprise, disgust, desire, fear, remorse, embarrassment, nervousness, pride, relief, grief. The mean length of a comment is 16 words.
Two main characteristics distinguish these datasets: (i) whether there is a notion of order among the class labels, and (ii) whether labels are exclusive. The reviews datasets, Amazon and Yelp, are naturally ordered, while the labels in the other datasets are unordered. All of the datasets besides GoEmotions feature exclusive labels; only in GoEmotions can two or more labels (e.g., the emotions anger and disappointment) characterize the same example. In addition to the standard five-class versions of the ordered datasets, we form three-class subsets by collecting reviews with 1, 3, and 5 stars (excluding reviews with 2 and 4 stars).
We build a vocabulary for each dataset by converting all characters to lowercase and extracting the 32,768 most common words in the training corpus. Tokenization is performed with the TensorFlow TF.Text WordpieceTokenizer.
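The vocabulary construction can be sketched as below. The sketch splits on whitespace purely for illustration, which is an assumption and a simplification: it does not reproduce wordpiece tokenization, and the function name is hypothetical.

```python
from collections import Counter

def build_vocabulary(documents, max_size=32768):
    """Lowercase every document and keep the `max_size` most common words."""
    counts = Counter(
        word for doc in documents for word in doc.lower().split()
    )
    return [word for word, _ in counts.most_common(max_size)]

docs = ["The cat", "the dog", "A cat"]
print(build_vocabulary(docs, max_size=2))
```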
Appendix D Synthetic Dataset Details
In this appendix we provide many additional details and results from our synthetic datasets. Although these datasets represent significantly simplified settings compared to their realistic counterparts, often the results from training RNNs on the synthetic and natural datasets are strikingly similar.
D.1 Categorical Dataset
For the categorical synthetic dataset used in Section 3.1, we generate phrases of words, drawing from a word bank consisting of words, . Each word has an -dimensional vector of integers associated with it, with for all . The word “” has score defined by and for . Meanwhile, the word “neutral” has for all . Additionally, each phrase has a corresponding score that also takes the form of an -dimensional vector of integers, . A phrase’s score is equal to the sum of scores of the words contained in said phrase, . The phrase is then assigned a label corresponding to the class with the maximum score, .
In the main text we analyze synthetic datasets where phrases are drawn from a uniform distribution over all possible scores, .
To do so, we enumerate all possible scores a phrase of length can produce as well as all possible word combinations that can generate a given score.
It is also possible to build phrases by drawing each word from a uniform distribution over all words in .
In practice, we find all results on synthetic datasets have minor quantitative differences when comparing these two methods, but qualitatively the results are the same.
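The second generation method, drawing each word uniformly from the word bank, can be sketched as follows. The sketch assumes that each evidence word contributes one unit of score to its own class and nothing to the others, and that the neutral word contributes nothing; the word names and the tie-breaking rule (lowest class index wins) are hypothetical choices made for illustration.

```python
import random

def make_word_bank(n_classes):
    """Assumed scores: evidence word i is one-hot at class i; 'neutral' is all zeros."""
    bank = {
        f"evid{i}": [int(j == i) for j in range(n_classes)]
        for i in range(n_classes)
    }
    bank["neutral"] = [0] * n_classes
    return bank

def sample_phrase(bank, length, rng):
    """Draw each word independently and uniformly from the word bank."""
    words = sorted(bank)
    return [rng.choice(words) for _ in range(length)]

def phrase_score(phrase, bank):
    """A phrase's score is the sum of the scores of its words."""
    n_classes = len(bank["neutral"])
    return [sum(bank[w][j] for w in phrase) for j in range(n_classes)]

def label(score):
    """Class with the maximum score; ties go to the lowest index (an assumption)."""
    return max(range(len(score)), key=score.__getitem__)

rng = random.Random(0)
bank = make_word_bank(3)
phrase = sample_phrase(bank, 10, rng)
print(phrase, phrase_score(phrase, bank), label(phrase_score(phrase, bank)))
```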
As highlighted in the main text, after training on this synthetic data we find the explored hidden-state space to resemble a regular -simplex. This holds for a large range of values relative to the natural datasets. In Figure 7, we plot the (global) participation ratio, defined in equation 5, as a function of the number of classes, .
In addition to the hidden states forming a simplex, we observe that the readout vectors have approximately equal magnitude and are aligned along the vertices of said -simplex. In Figure 8, we plot several measures on the readout vectors that support this claim, which we now discuss. We find the readout vectors to have very nearly the same magnitude (Fig. 8, left panel). The angle (in degrees) between a pair of vectors that point from the center of a -simplex to two of its vertices is
For example, for a regular -simplex, i.e. an equilateral triangle, this predicts an angle between readout vectors of degrees. The distribution of pairwise angles between readout vectors is plotted in the center panel of Figure 8. Lastly, if the readouts lie entirely within the -simplex, all of them should live in the same subspace. To measure this, define to be the projection of into the subspace formed by the other readout vectors, i.e. the span of the set . We then define the subspace percentage, , as follows,
If all the readouts lie within the same subspace, then . The right panel of Figure 8 shows that in practice for the synthetic data with a regularization parameter of .
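As a quick numerical check of the predicted readout angle, the sketch below uses the standard geometric result that the center-to-vertex vectors of a regular n-simplex meet at an angle of arccos(-1/n). The function name is illustrative.

```python
import math

def simplex_vertex_angle_degrees(n):
    """Angle between two center-to-vertex vectors of a regular n-simplex."""
    return math.degrees(math.acos(-1.0 / n))

for n in (1, 2, 3):
    print(n, round(simplex_vertex_angle_degrees(n), 2))
```

For the equilateral triangle this gives 120 degrees, and for the tetrahedron approximately 109.47 degrees.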
Why a Regular -Simplex? Here we propose an intuitive scenario that leads the network’s hidden states to form a regular -simplex. To classify a given phrase correctly, the network must learn to keep track of the value of the -dimensional score vector . One way this can be done is as follows: let the network’s hidden state live in some dimensional subspace. Within this subspace, let the readout vectors be orthogonal and have equal magnitude. Furthermore, define a Cartesian coordinate system to have basis vectors aligned with the readouts, with the coordinate along the direction of readout . Then, the coordinates within this subspace encode the components of the -dimensional score vector : the evidence word ’ moves the hidden state along the coordinate direction by some fixed amount and so . Note the subspace of explored by hidden states has a finite extent, since the phrases are of finite length. This subspace can be further subdivided into regions corresponding to different class labels: if for all then and the phrase is classified as Class . The left panel of Figure 9 shows an example of the -dimensional subspace for .
The important step that gets us from a subspace of to the regular -simplex is the presence of the softmax layer used when calculating loss. Since this function normalizes the scores, it is only the relative size of the components of that matters. Removing the dependence on the absolute score values corresponds to projecting onto the subspace orthogonal to the -dimensional ones vector, . This projection results in an -simplex with the readouts aligned with the vertices. A demonstration of this procedure for is shown in Figure 9.
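The projection argument can be checked numerically. The sketch below, with illustrative names and a three-class example, first confirms that the softmax is unchanged when a constant is added to every logit, and then projects orthogonal, equal-magnitude readouts onto the subspace orthogonal to the ones vector.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def project_out_ones(vectors):
    """Project each row onto the subspace orthogonal to the ones vector."""
    n = vectors.shape[1]
    ones = np.ones(n) / np.sqrt(n)
    return vectors - np.outer(vectors @ ones, ones)

# Softmax depends only on relative logit values.
z = np.array([2.0, -1.0, 0.5])
print(np.allclose(softmax(z), softmax(z + 5.0)))

# Orthogonal, equal-magnitude readouts become the vertices of a regular simplex.
n_classes = 3
readouts = np.eye(n_classes)
vertices = project_out_ones(readouts)
norms = np.linalg.norm(vertices, axis=1)
cosines = (vertices @ vertices.T) / np.outer(norms, norms)
print(norms)
print(cosines)
```

After projection the rows have equal norm and every pair has cosine -1/(n_classes - 1), which is the defining property of the vertices of a regular simplex centered at the origin.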
D.2 Ordered Dataset
As alluded to in the main text, we try two renditions of ordered synthetic data. The details of both are given below. The first relies on a ground truth of only a sentiment score, while the second classifies based on both sentiment and neutrality. Although the first is simpler and still bears many resemblances to natural data (i.e. Yelp and Amazon), we find the second to be a better match overall.
Sentiment Only Synthetic Data The first synthetic dataset for ordered data is very similar to that of the categorical sets, with minor differences in the word bank and in how phrases are assigned labels. For -class ordered datasets, the word bank always consists of only three words . We now take the word and phrase scores to be 1-dimensional and , , and . We then subdivide the range of possible scores into equal regions, and a phrase is labeled by the region into which its score falls. Given the above definitions, the range of scores is , and so for the , with labels , we define some threshold . Then a label is assigned as follows:
Meanwhile, for , one could draw the region divisions at the score values . Similar to the categorical data above, in the main text we draw phrases from a uniform distribution over all possible scores.
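The equal-region labeling can be sketched as follows. The sketch assumes word scores of +1, -1, and 0 for ‘good’, ‘bad’, and ‘neutral’, so that a phrase's score lies between minus and plus its length; the treatment of scores that fall exactly on a region boundary is also an assumption, as is the function name.

```python
def ordered_label(phrase, n_classes):
    """Index (0 = most negative) of the equal-width score region containing the phrase."""
    scores = {"good": 1, "bad": -1, "neutral": 0}  # assumed word scores
    length = len(phrase)
    total = sum(scores[word] for word in phrase)
    # Scores range over [-length, length]; split that range into equal regions.
    width = 2 * length / n_classes
    index = int((total + length) / width)
    return min(index, n_classes - 1)  # the maximum score belongs to the top region

print(ordered_label(["good"] * 6, 3))
print(ordered_label(["neutral"] * 6, 3))
print(ordered_label(["bad"] * 6, 3))
```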
The case corresponds to sentiment analysis, and its hidden state space, word deflections, and fixed point manifold are plotted in Figure 10.
The simplest ordered dataset beyond binary sentiment analysis is that of , and a plot showing the final hidden states, deflections, and fixed-point manifold is shown in the top row of Figure 11. In the bottom row, we show the same plots for . In both cases, the hidden-state trajectories move away from onto a curve embedded in a 2D plane, with the curve bent around the origin of said plane. The readout vectors are evenly fanned out in the 2D plane, which subdivides the curve into regions corresponding to each of the classes. The curve subdivisions reflect the ordering of the score subdivisions: for we see ‘Neutral’ lying in between ‘Positive’ and ‘Negative’, and for the stars are ordered from 1 to 5.
In contrast to categorical data, the word deflections are highly varied and depend strongly on a state’s location in hidden-state space. On average, the words ‘good’ and ‘bad’ move the hidden state further left/right along the curve. Although for the word ‘neutral’ is on average smaller, it tends to move the hidden state along the ‘Neutral’ or ‘3 Star’ readout. These dynamics are how the network encodes the relative count of ‘good’ and ‘bad’ words in a phrase, which ultimately determines the phrase’s classification. We show the fixed points in the far right panel of Figure 11. For , the fixed point manifold mostly resembles that of a one-dimensional bent line attractor, with a small region that is two-dimensional along the ‘Neutral’ readout. For , the fixed point manifold is much more planar. Thus, the case exhibits very similar dynamics to that of the line attractor studied in Maheswaranathan et al. (2019); the attractor is now simply subdivided into three regions due to the readout vector alignments.
Sentiment and Neutrality Synthetic Data Instead of classifying a phrase based on a single sentiment score, our second ordered synthetic model classifies a phrase based on two scores that track the sentiment and intensity of a given phrase. We draw from an enhanced word bank consisting of . We take the two-dimensional word score to have components corresponding to where positive (negative) sentiment scores correspond to positive (negative) sentiment and positive (negative) intensity scores correspond to high (low) emotion. The word score values we use are
As with the other synthetic models, we sum all word scores across a phrase to arrive at a phrase’s sentiment and intensity score, . We then assign the phrase a label based on the following criterion:
Thus, phrases with negative (low) intensity whose intensity magnitude is greater than their sentiment magnitude are classified as ‘Three Star’, i.e. as neutral phrases. Otherwise, phrases with low intensity, corresponding to the less extreme reviews, are classified as either ‘Two Star’ or ‘Four Star’ based on their sentiment. Finally, phrases with high intensity are labeled either ‘One Star’ or ‘Five Star’, again based on their sentiment.
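The verbal description of the labeling rule can be written out as a short function. This is a sketch of the rule as described in the preceding paragraph, taking the summed sentiment and intensity scores as inputs; how exact ties are resolved (zero intensity treated as high, zero sentiment assigned to the lower star rating) is an assumption made only so that the function is total.

```python
def star_label(sentiment, intensity):
    """Star rating (1 to 5) from a phrase's summed sentiment and intensity scores."""
    if intensity < 0 and abs(intensity) > abs(sentiment):
        return 3  # low intensity dominates sentiment: neutral
    if intensity < 0:
        return 4 if sentiment > 0 else 2  # low intensity, less extreme review
    return 5 if sentiment > 0 else 1      # high intensity, extreme review

for s, i in [(1, -3), (3, -1), (-3, -1), (2, 2), (-2, 2)]:
    print((s, i), star_label(s, i))
```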
D.3 Multi-Labeled Dataset
Here we provide details of the synthetic multi-labeled dataset, which corresponds to the natural dataset GoEmotions in Section 3.3 of the main text. Let us introduce this by taking the as an explicit example, where each phrase can have up to two labels. We draw from a word bank consisting of , where
We then classify each phrase with two labels, individually based on the score vector components and . Namely,
Thus there are four possible combinations of labels. For this synthetic dataset, we generate phrases by uniformly drawing words one-by-one from . Generalization of the above construction to an arbitrary number of possible labels is straightforward: one simply adds additional -dimension score vectors and for each possible label and then uses the components of the score to assign the labels, , individually.
The results after training a network on the dataset are shown in the main text in Figure 5, and results for are shown in Figure 12. Again, we see the explored hidden-state space to be low-dimensional, but notably it now resembles a three-dimensional cube. This is certainly a large departure from the categorical dataset, for which we would expect a (seven-dimensional) regular -simplex. Instead, what we see here is the “outer product” of three ordered datasets. That is, we expect a single ordered dataset (i.e. binary sentiment analysis) to have a hidden-state space that resembles a line attractor. As one might expect, tasking the network with analyzing three such sentiments at once leads to three line attractors that are orthogonal to one another, forming a cube. This is supported in the center panel of Figure 5, where we see the various sentiment evidences are orthogonal to one another.
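A minimal sketch of the two-label construction follows. The word names, the unit word scores, and the rule that each label is determined by the sign of its own score component are all assumptions made for illustration; they are chosen to be consistent with the four possible label combinations noted above.

```python
import random

# Hypothetical word bank: each word scores on at most one of the two labels.
WORD_SCORES = {
    "good1": (1, 0), "bad1": (-1, 0),
    "good2": (0, 1), "bad2": (0, -1),
    "neutral": (0, 0),
}

def sample_phrase(length, rng):
    """Draw words one by one, uniformly from the word bank."""
    words = sorted(WORD_SCORES)
    return [rng.choice(words) for _ in range(length)]

def multi_labels(phrase):
    """One binary label per score component, assigned independently by sign."""
    totals = [sum(WORD_SCORES[w][k] for w in phrase) for k in range(2)]
    return tuple(int(t > 0) for t in totals)

rng = random.Random(0)
phrase = sample_phrase(8, rng)
print(phrase, multi_labels(phrase))
```

Extending the sketch to more labels amounts to lengthening the score tuples and adding one pair of evidence words per label.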
Appendix E Additional results on natural datasets
E.1 AG News
E.2 DBPedia 3-class and 4-class categorical prediction
E.3 Yelp 5-class star prediction
Figures 19, 18, and 20 show the fixed-point manifolds associated with a GRU, LSTM and UGRNN, respectively, trained on the 5-class and 3-class Yelp datasets. These reviews are naturally labeled with five star ratings; we create a 3-class subset by removing examples labeled with 2 and 4 stars. These figures complement Figure 1 in the main text.
E.4 Amazon 5-class and 3-class star prediction
E.5 3-class GoEmotions
In addition to the 2 class variant presented in the main text, we also trained a 3 class version of the GoEmotions dataset. We filtered the dataset to just include the following three classes: “admiration”, “approval”, and “annoyance” (these were selected as they were the classes with the largest number of examples). These results are presented in Figure 24. For this network, despite having three classes, we find that the fixed points are largely two dimensional (Fig. 24a). The timescales of the eigenvalues of the Jacobian computed at these fixed points have two slow modes (Fig. 24b), which overlap with the two modes (Fig. 24c); thus we have a roughly 2D plane attractor. However, the participation ratio (Fig. 24d) indicates that the dimensionality of this attractor is slightly higher than the 2D case shown in Fig. 5. We suspect that these differences are due to the strong degree of class imbalance present in the GoEmotions dataset. There are very few examples with multiple labels, for any particular combination of labels. In synthetic multi-labeled data (which is class balanced), we see much clearer 3D structure when training a 3 class network (Fig. 12).
Appendix F The effect of regularization: collapse, context, and correlations
Regularizing the parameters of the network during training can have a strong effect on the dimension of the resulting dynamics. We describe this effect first for the datasets with ordered labels, Yelp and Amazon reviews. We penalize the squared -norm of the parameters, adding the term to the cross-entropy prediction loss; is the penalty and are the network parameters.
Collapse: Figure 25 shows the performance of the LSTM, GRU, and UGRNN as a function of the penalty. As the penalty is varied, the test accuracy usually decreases gradually; however, at a few values, the accuracy takes a large hit. The first two of these jumps correspond to a decrease in the dimension of the integration manifold from 2D to 1D and then from 1D to 0D. The resulting 1D manifold is shown, for the example of a GRU on the Amazon dataset, in Figure 26. The effects of collapse on the other architectures for the ordered datasets are identical.
When the regularization is sufficient to collapse the manifold to a 1D line, the dynamics are quite similar to the 1D line attractors studied in Maheswaranathan et al. (2019). A single accumulated valence score is tracked by the network as it moves along the line; this tracking occurs via a single eigenmode with a time constant comparable to the average document length, aligned with the fixed-point manifold. The differences between the binary- and 5-class line-attractor networks lie largely in the way the final states are classified; in the 5-class case, the line attractor is divided into sections based largely on the angle the line makes with the readout vector of each class.
The collapse to a 0D manifold with a higher penalty is most strikingly seen in the recurrent Jacobian spectra at the fixed points (Figure 27). Here there are no modes which remember activity on the timescale of the mean document length. Given this lack of integration, it is unclear how these networks are achieving accuracies above random chance.
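The link between Jacobian eigenvalues and memory timescales can be made concrete with a short sketch. For discrete-time linearized dynamics, a mode with eigenvalue λ decays as |λ|^t, so its time constant in units of timesteps is |1 / ln|λ||. The function name and the toy Jacobian below are illustrative and are not taken from a trained network.

```python
import numpy as np

def mode_timescales(jacobian):
    """Time constants (in timesteps) of the linearized modes, slowest first."""
    magnitudes = np.abs(np.linalg.eigvals(jacobian))
    taus = np.abs(1.0 / np.log(magnitudes))
    return np.sort(taus)[::-1]

# Toy Jacobian with one slow (integrating) mode and one fast mode.
jacobian = np.diag([0.99, 0.5])
print(mode_timescales(jacobian))
```

A network that integrates evidence over a document needs at least one time constant on the order of the document length; in the collapsed 0D case described above, all time constants are much shorter than that.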
Context: While the focus of this study has been on how networks perform integration, it is clear from the plots in Figure 25 that the best-performing models are doing more than just bag-of-words style integration. When the order of words in the sentence is shuffled, these models take a hit in accuracy. Interestingly, when the coefficient is increased from the smallest values we use, the contextual effects are the first to be lost: the model’s accuracy on shuffled and ordered examples becomes the same.
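The shuffling comparison can be sketched as follows. The helper names are hypothetical, and the toy bag-of-words predictor stands in for a trained network purely to illustrate that an order-insensitive model shows no accuracy gap between ordered and shuffled inputs.

```python
import random

def shuffled_copy(tokens, rng):
    """Return the same tokens in a random order."""
    tokens = list(tokens)
    rng.shuffle(tokens)
    return tokens

def accuracy(predict, examples):
    return sum(predict(tokens) == label for tokens, label in examples) / len(examples)

def context_gap(predict, examples, seed=0):
    """Accuracy on ordered examples minus accuracy on word-shuffled examples."""
    rng = random.Random(seed)
    shuffled = [(shuffled_copy(tokens, rng), label) for tokens, label in examples]
    return accuracy(predict, examples) - accuracy(predict, shuffled)

def bag_of_words_predict(tokens):
    """Toy order-insensitive classifier: compare counts of 'good' and 'bad'."""
    return int(tokens.count("good") > tokens.count("bad"))

examples = [(["good", "good", "bad"], 1), (["bad", "neutral", "bad"], 0)]
print(context_gap(bag_of_words_predict, examples))
```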
Understanding precisely how contextual processing is carried out by the network is an interesting direction for future work. It is important to show, however, that the basic two-dimensional integration mechanism we have presented in the main text still underlies the dynamics of the networks which are capable of handling context. To show this, we plot in Figure 28 the fixed point manifold, colored by the predicted class. As with the models which are not order-sensitive, the classification of the fixed points depends largely on their top two coordinates (after PCA projection). This is the case even though the PCA explained variance clearly shows extension of the dynamics into higher dimensions. It is thus likely that, similarly to how Maheswaranathan and Sussillo (2020) found that the contextual-processing mechanism was a perturbation on top of the integration dynamics for binary sentiment classification, the same is true for more finely-grained sentiment classification.
Correlations: As might be expected, increasing regularization also causes collapse in models trained on categorical classification tasks. For example, as shown in Figure 29, the tetrahedral manifold seen in 4-class AG News networks becomes a square at higher values of , collapsing from three dimensions to two. That is, instead of class labels corresponding to vertices of a tetrahedron, when the regularization is increased, these labels correspond to the vertices of a square.
Interestingly, in the collapse to a square, we find that, regardless of architecture and across 10 random seeds per architecture, the ordering of vertices around the square appears to reflect correlations between classes. Up to symmetries, the only possible orderings of vertices around the square are: (i) World Sci/Tech Business Sports, (ii) World Sci/Tech Sports Business, and (iii) World Sports Sci/Tech Business. In practice, we observe that most of the time (26 out of 30 trials), order (iii) appears; otherwise, order (i) appears. We never observe order (ii).
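The count of three distinct orderings can be verified by enumeration. The sketch below, with an illustrative function name, treats two orderings as equivalent when they differ only by a rotation or a reflection of the square.

```python
from itertools import permutations

def canonical_cycle(order):
    """Smallest representative of an ordering under rotations and reflections."""
    n = len(order)
    variants = []
    for seq in (order, order[::-1]):
        for shift in range(n):
            variants.append(tuple(seq[shift:] + seq[:shift]))
    return min(variants)

classes = ("World", "Sci/Tech", "Business", "Sports")
distinct = sorted({canonical_cycle(p) for p in permutations(classes)})
for ordering in distinct:
    print(ordering)
```

The 24 permutations of four labels fall into (4 - 1)! / 2 = 3 equivalence classes, matching orderings (i) through (iii).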
To show how this ordering arises from correlations between class labels, we train a bag-of-words model on the full 4-class dataset. Taking the most common 5000 words in the vocabulary, we plot, in Figure 30, the changes in each logit due to these words. As the figure shows, for most pairs of classes there is a weak negative correlation between the evidence for the pair. However, between the classes “Sports” and “Business”, there is a strong negative correlation (=-0.81); between “Sports” and “Sci/Tech”, there is a slightly weaker negative correlation (=-0.61). Stated another way, words which constitute positive evidence for “Sports” are likely to constitute negative evidence for “Business” and/or “Sci/Tech”. This matches with the geometries we observe in practice, where “Sports” and “Business” readouts are ‘repelled’ most often, and otherwise “Sports” and “Sci/Tech” are repelled.
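The correlation analysis can be sketched as follows, assuming the bag-of-words model is summarized by a matrix whose rows are words and whose columns are the per-class logit changes. The data below are synthetic and serve only to show the computation; they are not the AG News weights, and the names are illustrative.

```python
import numpy as np

def evidence_correlations(word_logit_changes):
    """Pearson correlation between classes; input has shape (n_words, n_classes)."""
    return np.corrcoef(word_logit_changes, rowvar=False)

# Synthetic stand-in: 5000 words, 4 classes, with classes 0 and 1 anti-correlated.
rng = np.random.default_rng(0)
changes = rng.normal(size=(5000, 4))
changes[:, 1] = -changes[:, 0] + 0.5 * rng.normal(size=5000)

corr = evidence_correlations(changes)
print(np.round(corr, 2))
```

A strongly negative off-diagonal entry indicates that words providing positive evidence for one class tend to provide negative evidence for the other.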
- Although the fixed-point expression depends on the input , throughout this text we will only study fixed points under a network’s average input which is usually well approximated by (see Appendix A).
- A -simplex is a line segment, a -simplex a triangle, a -simplex a tetrahedron, etc. A simplex is regular if it has the highest degree of symmetry (e.g. an equilateral triangle is a regular -simplex).
- Interestingly, a synthetic model that only tracks sentiment fails to match the dynamics of natural ordered data for . We take this as further evidence that natural ordered datasets classify based on two-dimensional integration. This simple model still produces surprisingly rich dynamics that we detail in Appendix D.2.
- We have also analyzed the same synthetically generated data for variable phrase lengths. The qualitative results focused on in this text did not change in this setting.
- The ordered dataset is equivalent to the categorical dataset. Intuitively, ‘good’ and ‘bad’ can be thought of as evidence vectors for the classes ‘Positive’ and ‘Negative’, respectively. Just as in the categorical classification, whichever of these evidence words appears the most in a given phrase will be the phrase’s label.
- Analysis methods in neural language processing: A survey. CoRR abs/1812.08951.
- Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (10), pp. 1404–1407.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078.
- Capacity and trainability in recurrent neural networks.
- Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), pp. 391–407.
- GoEmotions: a dataset of fine-grained emotions. pp. 4040–4054.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Visualizing and understanding recurrent networks. CoRR abs/1506.02078.
- Adam: a method for stochastic optimization. International Conference on Learning Representations.
- Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, pp. 777–784.
- How recurrent networks implement contextual processing in sentiment analysis. arXiv preprint arXiv:2004.08013.
- Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics. In Advances in Neural Information Processing Systems 32, pp. 15696–15705.
- Foundations of statistical natural language processing. MIT Press.
- Measuring the strangeness of strange attractors. Physica D 9 (1-2), pp. 189–208.
- Learning to generate reviews and discovering sentiment. CoRR abs/1704.01444.
- Computation through neural population dynamics. Annual Review of Neuroscience 43, pp. 249–275.
- Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Eds.), pp. 649–657.