StateDenoised Recurrent Neural Networks
Abstract
Recurrent neural networks (RNNs) are difficult to train on sequence processing tasks, not only because input noise may be amplified through feedback, but also because any inaccuracy in the weights has similar consequences as input noise. We describe a method for denoising the hidden state during training to achieve more robust representations thereby improving generalization performance. Attractor dynamics are incorporated into the hidden state to ‘clean up’ representations at each step of a sequence. The attractor dynamics are trained through an auxillary denoising loss to recover previously experienced hidden states from noisy versions of those states. This statedenoised recurrent neural network (SDRNN) performs multiple steps of internal processing for each external sequence step. On a range of tasks, we show that the SDRNN outperforms a generic RNN as well as a variant of the SDRNN with attractor dynamics on the hidden state but without the auxillary loss. We argue that attractor dynamics—and corresponding connectivity constraints—are an essential component of the deep learning arsenal and should be invoked not only for recurrent networks but also for improving deep feedforward nets and intertask transfer.
1 Introduction
Noise robustness is a fundamental challenge for every information processing system. A traditional approach to handling noise is to design preprocessing stages that estimate and filter noise from an input signal [Boll, 1979]. More recently in machine learning, loss functions have been explored to achieve invariance to taskirrelevant perturbations in the input [Simard et al., 1992, Zheng et al., 2016]. Although such methods are suitable for handling external noise, we argue in this article that internal noise can be an even greater challenge. To explain what we mean by internal noise, consider a deepnet architecture in which representations are transformed from one hidden layer to the next. To the degree that the network weights are not precisely tuned to extract only the critical features of the domain, irrelevant features may be selected and may potentially interfere with subsequent processing. Recurrent architectures are particularly fragile because internalstate dynamics can amplify noise over the course of sequence processing [Mathis and Mozer, 1994]. In this article, we propose a suppression method that improves the generalization performance of deep nets. We focus on recurrent nets, where the potential benefits are most significant.
Our approach draws inspiration from the human brain, a notably robust and flexible information processing system. Although the brain operates in an extremely highdimensional continuous state space, the conscious mind categorizes and interprets the world via language. We argue that categorization and interpretation processes serve to suppress noise and increase the reliability of behavior. Categorization treats instances that vary in incidental ways as identical: a chair is something you can sit on regardless of whether it is made of wood or metal. Language plays a similar role in cognition. For example, Witthoft et al. [2003] demonstrated that color terms of one’s native language influence a perceptualjudgment task—selecting a color patch that best matches a reference patch. The linguistic label assigned to a patch is the basis of matching, rather than the lowlevel perceptual data.
Suppressing variability facilitates communication in informationprocessing systems. Whether we are considering groups of humans, regions within the brain, components of an articulated neural net architecture, or layers of a feedforward network, information must be communicated from a sender to a receiver. To ensure that the receiver interprets a message as intended, the sender should limit its messages to a canonical form that the receiver has interpreted successfully in the past. Language serves this role in interhuman communication. We explore noise suppression methods to serve this role in intranetwork communication in deeplearning models.
In the next section, we describe a recurrent neural network architecture for cleaning up noisy representations—an attractor net. We then propose integrating attractor networks into deep networks, specifically sequenceprocessing networks, for denoising internal states. We present a series of experiments on smallscale problems to illustrate the methodology and analyze its benefits, followed by experiments on more naturalistic problems which also show reliable and sometimes substantial benefits of denoising, particularly in datalimited situations. We postpone a discussion of related work until we present the technical content of our work, which should facilitate a comparison.
2 Noise suppression via attractor dynamics
The attractor nets we explore are discretetime nonlinear dynamical systems. Given a static dimensional input , the network state at iteration , , is updated according to:
(1) 
where is a nonlinearity and is a weight matrix. Under certain conditions on and , the state is guaranteed to converge to a limit cycle or fixed point. A limit cycle of length occurs if . A fixed point is the special case of .
Attractor nets have a long history beginning with the work of Hopfield [1982] that was partly responsible for the 1980s wave of excitement in neural networks. Hopfield defined a mapping from network state, , to scalar energy values via an energy (or Lyapunov) function, and showed that the dynamics of the net perform local energy minimization. The shape of the energy landscape is determined by weights and input (Figure 2a). Hopfield’s original work is based on binaryvalued neurons with asynchronous update; since then similar results have been obtained for certain nets with continuousvalued nonlinear neurons acting in continuous [Hopfield, 1984] or discrete time [Koiran, 1994]. {SCfigure}[1][bt]
We adopt Koiran’s 1994 framework, which dovetails with the standard deep learning assumption of synchronous updates on continuousvalued neurons. Koiran shows that with symmetric weights (), nonnegative selfconnections (), and a bounded nonlinearity that is continuous and strictly increasing except at the extrema (e.g., tanh, logistic, or their piecewise linear approximations), the network asymptotically converges over iterations to a fixed point or limit cycle of length 2. As a shorthand, we refer to convergence in this manner as stability.
Attractor nets have been used for multiple purposes, including contentaddressable memory, information retrieval, and constraint satisfaction [Mozer, 2009, Siegelmann, 2008]. In each application, the network is given an input containing partial or corrupted information, which we will refer to as the cue, denoted ; and the cue is mapped to a canonical or wellformed output in . For example, to implement a contentaddressable memory, a set of vectors, , must first be stored. The energy landscape is sculpted to make each an attractor via a supervised training procedure in which the target output is the vector for some and the input is a noisecorrupted version of the target. Following training, noisecorrupted inputs should be ‘cleaned up’ to reconstruct the state.
In the model described by Equation 1, the attractor dynamics operate in the same representational space as the input (cue) and output. Historically, this is the common architecture. By projecting the input to a higher dimensional latent space, we can design attractor nets with greater representational capacity. Figure 2b shows our architecture, with an dimensional input, , an dimensional output, , and an dimensional attractor state, , where typically for representational flexibility. The input is projected to the attractor space via an affine transform:
(2) 
The attractor dynamics operate as in Equation 1, with initialization and . Finally, the asymptotic attractor state is mapped to the output:
(3) 
where is an output activation function and the and are free parameters of the model.
To conceptualize a manner in which this network might operate, might copy the input features forward and the attractor net might use degrees of freedom in its state representation to maintain these visible features. The other degrees of freedom could be used as latent features that impose higherorder constraints among the visible features (in much the same manner as the hidden units in a restricted Boltzmann machine). When the attractor state is mapped to the output, would then transfer only the visible features.
As Equations 1 and 2 indicate, the input biases the attractor state at every iteration , rather than—as in the standard recurrent net architecture—being treated as an initial value, e.g., . Effectively, there are shortcircuit connections between input and output to avoid vanishing gradients that arise in deep networks (Figure 2c). As a result of the connectivity, it is trivial for the network to copy to —and thus to propagate gradients back from to . For example, the network will implement the mapping if: , , , , and or .
In our simulations, we use an alternative formulation of the architecture that also enables the copying of to by treating the input as unbounded and imposing a bounding nonlinearity on the output. This variant consists of: the input being replaced with in Equation 2, in Equation 3, and the nonlinearity in the attractor dynamics being shifted back one iteration, i.e., Equation 1 becomes . This formulation is elegant if is the activation pattern from a layer of tanh neurons, in which case the tanh and nonlinearities cancel. Otherwise, to ensure numerical stability, one can define the input for some small .
2.1 Training a denoising attractor network
We demonstrate the supervised training of a set of attractor states, , with . The input to the network is a noisy version of some state , , with , and the corresponding target output is simply . With noisy instances of each attractor for training, we define a normalized denoising loss
(4) 
to be minimized by stochastic gradient descent. The aim is to sculpt attractor basins whose diameters are related to . The normalization in Equation 4 serves to scale the loss such that indicates failure of denoising and indicates complete denoising. The range of this loss helps with calibration when combined with other losses.
We trained attractor networks varying the number of target attractor states, , the noise corruption, , and the number of units in the attractor net, . In all simulations, the input and output dimensionality is fixed at , training inputs and testing inputs were generated for each of the attractors. The network is run to convergence. Due to the possibility of a limit cycle of 2 steps, we used the convergence criterion , where is the output at iteration of the attractor dynamics. This criterion ensures that no element of the state is changing by more than some small . For , we found convergence typically in under 5 steps, nearly always under 10.
Figure 1 shows the percentage of noise variance removed by the attractor dynamics on the test set, defined as . The three panels correspond to different levels of noise in the training set; in all cases, the noise in the test set was fixed at . The 4 curves each correspond to a different size of the attractor net, . Larger should have greater capacity for storing attractors, but also should afford more opportunity to overfit the training set. For all noise levels, the smallest () and largest () nets have lower storage capacity than the intermediate () nets, as reflected by a more precipitous drop in noise suppression as , the number of attractors to be stored, increases. One takeaway from this result is that roughly we should choose , that is, the hidden attractor space should be about twice as large as the input/output space. This result does not appear to depend on the number of attractors stored, but it may well depend on the volume of training data. Another takeaway from this simulation is that the noise level in training should match that in testing: training with less ( and more ( noise than in testing () resulted in poorer noise suppression. Based on this simulation, the optimal parameters, and , should yield a capacity for the network that is roughly , as indicated by the performance drop that accelerates for .
3 Statedenoised recurrent neural networks
After demonstrating that attractor dynamics can be used to denoise a vector representation, the next question we investigate is whether denoising can be applied to internal states of a deep network to improve its generalization performance. As we explained earlier, our focus is on denoising the hidden state of recurrent neural networks (RNNs) for sequence processing. RNNs are particularly susceptible to noise because feedback loops can amplify noise over the course of a sequence.
[1.11111][b!]
Figure 3 shows an unrolled RNN architecture with an integrated attractor net. The iterations of the attractor net are unrolled vertically in the Figure, and the steps of the sequence are unrolled horizontally, with each column indicating a single step of the sequence with a corresponding input and output. The goal is for the hidden state to be denoised by the attractor net, yielding a cleaned hidden state, which then propagates to the next step. In order for the architecture to operate as we describe, the parameters of the attractor net (blue and green connections in Figure 3) must be trained to minimize and the parameters of the RNN (purple, orange, and red connections in Figure 3) must be trained to minimize a supervised task loss, , defined over the outputs. We have found it generally beneficial to also train the parameters of the attractor net on , and we describe this version of the model.
We refer to the architecture in Figure 3 as a state denoised recurrent neural net (or SDRNN) when it is trained on the two distinct objectives, and , and as a recurrent neural net with attractors (or RNN+A) when all parameters in the model are trained solely on . For each minibatch of training data, the SDRNN training procedure consists of:

Take one stochastic gradient step on for all RNN and attractor weights.

Recompute hidden state for each training sequence and each step of the sequence , .

With just the attractor net, take one stochastic gradient step on for the attractor weights using the procedure in Section 2.1 with .

Optionally, step 3 might be repeated until .
4 Simulations
4.1 Parity task
We studied a streamed parity task in which 10 binary inputs are presented in sequence and the target output following the last sequence element is if an odd number of s is present in the input or otherwise. The architecture has tanh hidden units, attractor units, and a single input and a single output. We trained the attractor net with and ran it for exactly 15 iterations (more than sufficient to converge). Models were trained on 256 randomly selected binary sequences. Two distinct test sets were used to evaluate models: one consisted of the heldout 768 binary sequences, and a second test set consisted of three copies of each of the 256 training sequences with additive uniform input noise. We performed one hundred replications of a basic RNN, the SDRNN, and the RNN+A (the architecture in Figure 3 trained entirely on the task objective). Other details of this and subsequent simulations are presented in the Supplementary Materials.
Figure 2a shows box plots of performance on the heldout sequences by the RNN, SDRNN, and RNN+A. The box has lines indicating first, second, and third quartiles, and the notches indicate uncertainty in the mean. Figure 2b shows the same pattern of results for the noisy test sequences. The SDRNN significantly outperforms both the RNN and the RNN+A: it generalizes better to novel sequences and is better at ignoring additive noise in test cases, although such noise was absent from training.
In principle, parity should be performed more robustly if a system has a highly restricted state space. Ideally, the state space would itself be binary, indicating whether the number of inputs thus far is even or odd. Such a restricted representation should force better generalization. Indeed, quantizing the hidden activation space for all sequence steps of the test set, we obtain a lower entropy for the SDRNN (3.70, standard error .06) than for the RNN (4.03, standard error .05). However, what is surprising about this simulation is that gradientbased procedures could learn such a restricted representation, especially when two orthogonal losses compete with each other during training. The competing goals are clearly beneficial, as the RNN+A and SDRNN share the same architecture and differ only in the addition of the denoising loss.
Figure 2c,d show the effect of varying in training attractors. If is too small, the attractor net will do little to alter the state, and the model will behave as an ordinary RNN. If is too large, many states will be collapsed together and performance will suffer. An intermediate thus must be best, although what is ‘just right’ may turn out to be domain dependent.
We explored two additional manipulations that had no systematic effect on performance. First, we varied the fixed number of attractor iterations (Figures 2e,f). It is reassuring that more iterations of the attractor net—yielding a deeper net—did not harm performance in any way. It also appears that only two attractor iterations is not as effective as increasing the number of iterations. Second, we hypothesized that because the attractornet targets change over the course of learning, it might facilitate training to introduce weight decay in order to forget the attractors learned early in training. We introduced an regularizer using the ridge loss, , where is the symmetric attractor weight matrix in Equation 1 and is a weight decay strength. As Figures 2g,h indicate, is most effective. For this reason, we performed no further experiments with the ridge loss.
The noise being suppressed during training is neither input noise nor label noise; it is noise in the internal state due to weights that have not yet been adapted to the task. Nonetheless, denoising internal state during training appears to help the model overcome noise in inputs and generalize better.
4.2 Majority task
We next studied a majority task in which the input is a binary sequence and the target output is if a majority of inputs are 1, or otherwise. We trained networks on 100 distinct randomly drawn fixedlength sequences, for length . We performed 100 replications for each and each model. We ensured that runs of the various models were matched using the same weight initialization and the same training and test sets. All models had tanh hidden units, attractor units, .
We chose the majority task because, in contrast to the parity task, we were uncertain if a restricted state representation would facilitate task performance. For the majority task of a given length , the network needs to distinguish roughly states. Collapsing them together is potentially dangerous: if the net does not keep exact count of the input sequence imbalance between 0’s and 1’s, it may fail.
As in the parity task, we tested both on novel binary sequences and training sequences with additive uniform noise. Figures 3a,b show that neither the RNN nor RNN+A beats the SDRNN for any sequence length on either test set. The SDRNN seems superior to the RNN for short novel sequences and long noisy sequences. For short noisy sequences, both architectures reach a ceiling. The only disappointment in this simulation is the lack of a difference for novel long sequences.
4.3 Reber grammar
The Reber grammar [Reber, 1967], shown in Figure 3c, has long been a test case for artificial grammar learning [e.g., Hochreiter and Schmidhuber, 1997]. The task involves discriminating between strings that can and cannot be generated by the finitestate grammar. We generated positive strings by sampling from the grammar with uniform transition probabilities. Negative strings were generated from positive strings by substituting a single symbol for another symbol such that the resulting string is outofgrammar. Examples of positive and negative strings are btssxxttvpse and bptvpxtspse, respectively. Our networks used a onehot encoding of the seven input symbols, tanh hidden units, attractor units, and . The number of training examples was varied from 50 to 800, always with 2000 test examples. Both the training and test sets were balanced to have an equal number of positive to negative strings. One hundred replications of each simulation was run.
Figure 3d presents mean test set accuracy on the Reber grammar as a function of the number of examples used for training. As with previous data sets, the SDRNN strictly outperforms the RNN, which in turn outperforms the RNN+A.
4.4 Symmetry task
The symmetry task involves detecting symmetry in fixedlength symbol strings such as acafbbfaca. This task is effectively a memory task for an RNN because the first half of the sequence must be retained to compare against the second half. We generated strings of length , where is the number of symbols in the left and right sides and is the number of intermediate fillers. For , we generated symbols . We then formed a string whose elements are determined by : for , for , and for . The filler was simply a unique symbol. Negative cases were generated from a randomly drawn positive case by either exchanging two adjacent distinct nonnull symbols, e.g., acafbbafca, or substituting a single symbol with another, e.g., ahafbbfaca. Our training and test sets had an equal number of positive and negative examples, and the negative examples were divided equally between the sequences with exchanges and substitutions.
We trained on 5000 examples and tested on an additional 2000, with the half sequence having length and with an or slot filler. The longer filler makes temporal credit assignment more challenging. As shown in Figures 4.4a,b, the SDRNN obtains as much as a 70% reduction in test error over either the RNN or RNN+A.
[1.11111][b!]
4.5 Partofspeech tagging
In natural language processing, partofspeech (POS) tagging assigns a label to each word in a sentence. Because words change their role depending on context (e.g. “run” can be a noun or a verb), the task requires recognizing how the context influences the label. We used a data set from the NLTK library [Bird et al., 2009]. Each sentence is a sequence, presented one word at a time. Words are represented using a 100dimensional pretrained GloVe embedding [Pennington et al., 2014], and the 147 most popular POS tags are used as outputs (with an additional catchall category) in a onehot representation. Models are trained with a varying number of examples and tested on a fixed 17,202 sentences. Early stopping is performed with a validation set which is 20% of the training set. We compare SDRNN and RNN on four replications of each simulation; in each replication the same subsets of training/validation data are used.
The basic RNN architecture is a singlelayer bidirectional GRU network [Chung et al., 2015] with a softmax output layer. The SDRNN incorporates one attractor network for the forward GRU and another for the backward GRU. We use GRUs in each direction and attractor units in each network. Additional details of the training procedure are presented in the Supplementary Materials. Figure 4.4c compares performance for the RNN and SDRNN as a function of training set size. For small data sets, the SDRNN shows an impressive advantage.
5 Related work in deep learning
Our SDRNN has an intriguing relationship to other recent developments in the deep learning literature. The architecture most closely related to SDRNN is the fast associative memory model of Ba et al. [2016]. Like SDRNN, their model is a sequenceprocessing recurrent net whose hidden state is modulated by a distinct multistage subnetwork. This subnetwork has associative weights which are trained by a different objective than the primary task objective. Ba et al.’s focus is on using the associative weights for rapid learning over the time span of a single sequence. At the start of every sequence, the weights are reset. As sequence elements are presented and hidden state is updated, the associative weights are trained via Hebbian learning. The goal is to evoke a memory of states that arose earlier in the sequence in order to integrate those states with the current state. Although their goal is quite different than our denoising goal, Hebbian learning does enforce weight symmetry so their model might be viewed as having a variety of attractor dynamics. However, their dynamics operate in the hidden activation space, in contrast to our more computationally powerful architecture which operates in a hiddenhidden state space (i.e., the hidden representation of the RNN is mapped to another latent representation).
Graves [2016] has proposed a model for sequence processing which performs multiple steps of internal updating for every external update. Like Graves’s model, SDRNN will vary in the amount of computation time needed for internal updates based on how long it takes for the attractor net to settle into a well formed state.
Turning to research with a cognitive science focus, Andreas et al. [2017] have proposed a model that efficiently learns new concepts and control policies by operating in a linguistically constrained representational space. The space is obtained by pretraining on a language task, and this pretraining imposes structure on subsequent learning. One can view the attractor dynamics of the SDRNN as imposing similar structure, although the bias comes not from a separate task or data set, but from representations already learned for the primary task. Related to language, the consciousness prior of Bengio [2017] suggests potential benefits of operating in a reduced or simplified representational space. The prior deals with the fact that the brain operates in a very high dimensional state space and some restricted form of this representation that selects some information at the expense of other may be necessary for rapid learning and efficient inference. For related ideas, also see Hinton [1990].
Finally, on a methodological note, Trinh et al. [2018] have proposed an auxillary loss function to support the training of RNNs. Although their loss is quite different in nature than our denoising loss—it is based on predicting the input sequence itself—we share the goal of improving performance on sequence processing by incorporating additional, taskorthogonal criteria.
6 Conclusions
Noise robustness is a highly desirable property in neural networks. When a neural network performs well, it naturally exhibits a sort of noise suppression: activation in a layer is relatively invariant to noise injected at lower layers [Arora et al., 2018]. We developed a recurrent net architecture with an explicit objective of attaining noise robustness, regardless of whether the noise is in the inputs, output labels, or the model parameters. The notion of using attractor dynamics to ensure representations are wellformed in the context of the task at hand is quite general, and we believe that attractor networks should prove useful in deep feedforward nets as well as in recurrent nets. In fact, many of the deepest vision nets might well be reconceptualized as approximating attractor dynamics. For example, we are currently investigating the use of convolutional attractor networks for image transformations such as denoising, superresolution, and coloring [Thygarangan et al., 2018]. Successfully transforming images requires satisfying many simultaneous constraints which are elegantly and compactly expressed in a Lyapunov (energy) function and can be approximated by deep feedforward nets. A denoising approach may also be particularly effective for transfer learning. In transfer learning, representations from one domain are applied to another, quite analogous to our SDRNN which leverages representations from one training epoch to support the next.
7 Acknowledgements
This research was supported by the National Science Foundation awards EHR1631428 and SES1461535. We thank Tom Hotchkiss and Victor Lindsey for helpful suggestions on an earlier draft of this paper.
References
 Andreas et al. [2017] J. Andreas, D. Klein, and S. Levine. Learning with latent language. CoRR, abs/1711.00482, 2017.
 Arora et al. [2018] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. CoRR, abs/1802.05296, 2018.
 Ba et al. [2016] J. Ba, G. E. Hinton, V. Mnih, J. Z. Leibo, and C. Ionescu. Using fast weights to attend to the recent past. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4331–4339. Curran Associates, Inc., 2016.
 Bengio [2017] Y. Bengio. The consciousness prior. CoRR, abs/1709.08568, 2017.
 Bird et al. [2009] S. Bird, E. Loper, and E. Klein. Natural language processing with python. O’Reilly Media Inc., 2009. URL https://www.nltk.org/book/.
 Boll [1979] S. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2):113–120, Apr 1979. ISSN 00963518. doi: 10.1109/TASSP.1979.1163209.
 Chung et al. [2015] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning  Volume 37, ICML’15, pages 2067–2075. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045338.
 Graves [2016] A. Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016.
 Hinton [1990] G. E. Hinton. Mapping partwhole hierarchies into connectionist networks. Artificial Intelligence, 46:47–75, 1990.
 Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long shortterm memory. Neural Computation, 9(8):1735–1780, 1997.
 Hopfield [1982] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
 Hopfield [1984] J. J. Hopfield. Neurons with graded response have collective computational properties like those of twostate neurons. Proceedings of the national academy of sciences, 81(10):3088–3092, 1984.
 Koiran [1994] P. Koiran. Dynamics of discrete time, continuous state hopfield networks. Neural Computation, 6(3):459–468, 1994.
 Masson and Loftus [2003] M. Masson and G. Loftus. Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology, 57:203–220, 2003.
 Mathis and Mozer [1994] D. W. Mathis and M. Mozer. On the computational utility of consciousness. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 11–18, Cambridge, MA, 1994. MIT Press.
 Mozer [2009] M. C. Mozer. Attractor networks. In T. Bayne, A. Cleeremans, and P. Wilken, editors, Oxford Companion to Consciousness, pages 88–89. Oxford University Press, Oxford UK, 2009.
 Pennington et al. [2014] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on EMNLP, pages 1532–1543, 2014.
 Reber [1967] A. S. Reber. Implicit learning of artificial grammars. Verbal learning and verbal behavior, 5:855–863, 1967.
 Siegelmann [2008] H. T. Siegelmann. Analogsymbolic memory that tracks via reconsolidation. Physica D: Nonlinear Phenomena, 237(9):1207 – 1214, 2008. doi: https://doi.org/10.1016/j.physd.2008.03.038.
 Simard et al. [1992] P. Simard, B. Victorri, Y. LeCun, and J. Denker. Tangent prop  a formalism for specifying selected invariances in an adaptive network. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 895–903. MorganKaufmann, 1992.
 Thygarangan et al. [2018] A. Thygarangan, R. V. Lindsey, and M. C. Mozer. Convolutional attractor networks for image processing. In preparation, 2018.
 Trinh et al. [2018] T. H. Trinh, A. M. Dai, T. Luong, and Q. V. Le. Learning longerterm dependencies in rnns with auxiliary losses. CoRR, abs/1803.00144, 2018.
 Witthoft et al. [2003] N. Witthoft, J. Winawer, L. Wu, M. Frank, A. Wade, and L. Boroditsky. Effects of language on color discriminability. In R. Alterman and D. Kirsh, editors, Proceedings of the TwentyFifth Annual Conference of the Cognitive Science Society, volume Pts 1 and 2, pages 1247–1252. Erlbaum Associates, 2003.
 Zheng et al. [2016] S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.
Appendix A Model initialization
We initialize the attractor net with all attractor weights being drawn from a normal distribution with mean zero and standard deviation .01. Additionally, we add 1.0 to the ondiagonal weights of and , i.e., and for , where and is the number of units in the input to the attractor net and is the number of internal (hidden) units in the attractor net.
Appendix B Experimental details
In all experiments, we chose a fixed initial learning rate with the ADAM optimizer. We used the same learning rate for and . For all tasks, is mean squared error. For the synthetic simulations (parity, majority, Reber, and symmetry), we trained for a fixed upper bound on the number of epochs but stopped training if the training set performance reached asymptote. (There was no noise in any of these data sets, and thus performance below 100% is indicative that the network had not fully learned the task. Continuing to train after 100% accuracy had been attained on the training set tended to lead to overfitting, so we stopped training at that point. In all simulations, for testing we use the weights that achieve the highest accuracy on the training set (not the lowest loss).
b.1 Parity
We use mean squared error for in Parity, Majority, Reber, and Symmetry. In Parity, the learning rate is .008 for and . Because the data set is noise free, training continues until classification accuracy on the training set is 100% or until 5000 epochs are reached. Training is in complete batches to avoid the noise of minibatch training. Hidden units are tanh, and the data set has a balance on expectation of positive and negative examples. Our entropy calculation is based on discretizing each hidden unit to 8 equalsized intervals in and casting the 10dimensional hidden state into one of bins. For this and only this simulation, the attractor weights were trained only on the attractor loss. In all following simulations, the attractor weights are trained on both losses (as described in the main text).
b.2 Majority
The networks are trained for 2500 epochs or until perfect classification accuracy is achieved on the training set. The attractor net is run for 5 steps. Ten hidden units are used and . On expectation there is a balance between the number of positive and negative examples in both the training and test sets.
b.3 Reber
The networks are trained for 2500 or until perfect classification accuracy is achieved on the training set. We filtered out strings of length greater than 20, and we left padded strings with shorter lengths with the begin symbol, b. The training and test sets had an equal number of positive and negative examples. The architecture included hidden units and attractor units, and the attractor net is run for 5 steps. Without exploring alternatives, we decided to postpone the introduction of until 100 epochs had passed.
b.4 Symmetry
The networks are trained for 2500 or until perfect classification accuracy is achieved on the training set. A learning rate of .002 was used for both losses for and .003 for both losses for . The attractor net was run for 5 iterations.
b.5 Part of speech tagging
The data set had 472 unique POS tags. We preserved only the 147 most popular tags, and lumped all the others into a catchall category, for a total of 148 tags. The catchall category contained only 0.21% of all words. We also kept only the 20,000 most frequent words in the vocabulary, placing the rest into a catchall category with POS also tagged as belonging to the catchall category. An independent attractor network is used for each RNN direction. We use GRU hidden units, attractor units, , and 15 attractor iterations. We apply a dropout of to the output of RNN layer before projecting it into the final POS class probability layer of size . RNN models need to tag every word in the sequence, so averaged categorical crossentropy is used as an optimization objective, where the target distribution is a onehot encoding of POS tags. Attractor nets were run for 15 iterations with .