The Neural State Pushdown Automata

Ankur Mali1, Alexander Ororbia2, C. Lee Giles1
1Penn State University, State College, PA 16801
2Rochester Institute of Technology, Rochester, NY 14623
1aam35@psu.edu, 2ago@cs.rit.edu, 1clg20@psu.edu
Abstract

In order to learn complex grammars, recurrent neural networks (RNNs) require sufficient computational resources to ensure correct grammar recognition. A widely used approach to expanding model capacity is to couple an RNN to an external memory stack. Here, we introduce a “neural state” pushdown automaton (NSPDA), which consists of a digital stack, instead of an analog one, coupled to a neural network state machine, and we empirically show its effectiveness in recognizing various context-free grammars (CFGs). First, we develop the underlying mechanics of the proposed higher order recurrent network and its manipulation of a stack, as well as how to stably program its underlying pushdown automaton (PDA) to achieve desired finite-state network dynamics. Next, we introduce a noise regularization scheme for higher-order (tensor) networks, to our knowledge the first of its kind, and design an algorithm for improved incremental learning. Finally, we design a method for inserting grammar rules into the NSPDA and empirically show that this prior knowledge improves its training convergence time by an order of magnitude and, in some cases, leads to better generalization. The NSPDA is also compared to a classical analog stack neural network pushdown automaton (NNPDA) as well as a wide array of first- and second-order RNNs with and without external memory, trained using different learning algorithms. Our results show that, for Dyck(2) languages, prior rule-based knowledge is critical for optimization convergence and for ensuring generalization to longer sequences at test time. We observe that many RNNs with and without memory, but no prior knowledge, fail to converge and generalize poorly on CFGs.

Introduction

Despite their success, artificial neural networks (ANNs), especially recurrent neural networks (RNNs), have repeatedly been shown to struggle with generalizing in a sophisticated, systematic manner, often uncovering misleading statistical associations instead of true causal relations. Verifying what is learned by these black-box models remains an open challenge, centering around one central issue – the lack of interpretability and modularity. The fact that successful ANN optimization depends heavily on large quantities of data only serves to further worsen the problem. One research direction towards developing more interpretable ANNs focuses on the extraction of rules from, and the assimilation of rules into, RNNs [angluin1983inductive, fu1977]. To solve difficult grammatical inference problems, various types of specialized RNNs have been designed [lstmcfg, boden2000context, tabor2000fractal, wiles1995learning, sennhauser2018evaluating, nam2019number]. However, it has been shown that RNNs augmented with external memory structures, such as the neural network pushdown automaton (NNPDA), are more powerful than RNNs without, both historically [giles1992learning, pollack1990recursive, zeng1994discrete] and recently, using differentiable memory [joulin2015inferring, grefenstette2015learning, graves2014neural, kurach2015neural, zeng1994discrete, hao2018context, yogatama2018memory, graves2016hybrid]. Yet most of these models lack interpretability, and how they learn any given grammar remains debatable. In the past, rule integration methods have been proposed to tackle the interpretability issue [giles1992learning, omlin1996constructing] and offer a promising path towards the design of ANNs with an underlying knowledge structure that is a bit more understandable and transparent. However, to the best of our knowledge, there exists no method for inserting rules into the states of the far more powerful class of higher order, memory-augmented RNNs. In working towards interpretable, memory-based neural models, our contributions are the following:

  • We propose the neural state pushdown automaton and its incremental training method, which exploits the concept of iterative refinement.

  • We develop a novel regularization method that empirically yields better generalization in complex, memory-based RNNs. To our knowledge, we are the first to propose a weight regularizer that works with higher-order RNNs.

  • We propose a method for programming states into a neural state machine with binary second and third-order weights.

  • We develop a method for inserting rules into stack-based recurrent networks.

  • We compare our model with the NNPDA and other RNNs, trained using different learning algorithms.

Motivation & Related Work

Research related to integrating knowledge into ANNs has existed for quite some time, such as through the design of state machines [tivno1998finite, omlin1996constructing]. Recent efforts in the domain of natural language processing have shown the effectiveness of using state machines for tasks such as visual question answering, which allow an agent to directly use higher-level semantic concepts to represent visual and linguistic modalities [manning2019nsm]. With respect to rule insertion itself, there exists a great deal of work showcasing its effectiveness when used with ANNs [abuMostafa1990hints] as well as with RNNs [giles1992learning, omlin1996constructing]. Notably, [omlin1996constructing] showed how deterministic finite automaton rules could be encoded into second order RNNs. One important, classical model that we draw inspiration from is the neural network pushdown automaton (NNPDA) [nndpa1998sun]. The structure of our proposed model is similar to the NNPDA, but, as we will discuss, the major difference is that our model works with a digital stack as opposed to an analog one. Interestingly enough, prior work has also shown how to insert “hints” into the NNPDA, where knowledge of “dead states” can be used to guide its learning process [nndpa1998sun]. In the spirit of this hint-based methodology, we will develop a method for encoding useful rules related to target CFGs into our neural state pushdown automaton (NSPDA). This is, to our knowledge, the first approach of its kind, since no rule-insertion methodology has previously been proposed for complex state-based models. Creating such a procedure allows us to exploit the far greater representational capabilities of memory-augmented RNNs while offering an intuitive way to understand the knowledge contained in and acquired by RNNs. In this work, we focus on RNNs that control a discrete stack, particularly our proposed NSPDA. We will empirically determine whether the inductive biases we encode into its synaptic weights speed up the parameter optimization process and, furthermore, improve model generalization over longer sequences at test time. The results of our experiments, which compare a wide variety of RNNs (of varying order, with and without memory), will strongly contradict the claim presented in recent work [gru2019pda] that first order RNNs, like the popular gated recurrent unit RNN [chung2014empirical], are as powerful as a PDA. In essence, our work demonstrates that for an RNN to recognize a complex CFG, it will, at the very least, require external memory. Our results also demonstrate the value of encoding even partial PDA information, which positively impacts convergence time and model generalization.

The Neural State Pushdown Automaton

Neural Architecture

The model we propose, the NSPDA with iterative refinement, is shown in Figure 1. The NSPDA consists of fully connected recurrent neurons, which we label as state neurons, primarily to distinguish them from the neurons that function as output neurons. Introducing the concept of state neurons is important when considering the notion of higher-order networks, i.e., second or third order RNNs, which allow us to map state representations directly to outputs. In this model, at each time step $t$, the state neurons receive signals from the input neurons, the previous state, and the stack-read neurons. The input neurons process a string, one character at a time, while non-recurrent neurons, also labeled as “action neurons”, represent an operation to be performed on a stack data structure, i.e., push/pop/no-op. The action neurons also serve as the controller, which can be either recurrent or linear (recurrent controllers usually perform better in practice, so we focus on these in this paper). Furthermore, “read” neurons are used to keep track of the symbol present at the top of the stack. To make the above high-level description concrete, consider a single hidden-layer NSPDA. A full symbol sequence sample is a pair $(x^{1:T}, l)$, where the binary label $l$ indicates whether the sequence is valid ($l = 1$) or not ($l = 0$). When processing a (binary) symbol/token at discrete time step $t$, the NSPDA computes a new state vector $s^{t+1} \in \{0,1\}^{N_s}$, where $N_x$ is the total number of input/sensory neurons (or the dimensionality of the input space, sometimes classically referred to as the alphabet size) and $N_s$ is the total number of state neurons. The action neuron vector is defined as $a^{t} \in \{-1,0,1\}^{N_x}$ and the read neuron vector is defined as $r^{t} \in \{0,1\}^{N_x}$, i.e., the action and read spaces are of the same dimensionality as the input. Taken together, the above sets of input, state, and read neurons represent a full NSPDA model with parameters $\Theta = \{W^s, W^a\}$ (plus the output parameters introduced below). Crucially, $W^s$ and $W^a$ are both 4-dimensional (4D) synaptic weight tensors, i.e., the binary “to-state” tensor $W^s$ and the 4D ternary “to-action” tensor $W^a$ (note that $-1$ is “pop”, $0$ is “no-op”, and $+1$ is “push”). At time $t$, inference (for a third order NSPDA) is conducted as follows:

$s_i^{t+1} = G_s\big(\textstyle\sum_{j,k,l} W^s_{ijkl}\, s_j^{t}\, r_k^{t}\, x_l^{t} + b_i\big)$ (1)
$a_i^{t+1} = G_a\big(\textstyle\sum_{j,k,l} W^a_{ijkl}\, s_j^{t}\, r_k^{t}\, x_l^{t}\big)$ (2)
$r^{t+1} = \mathrm{READ}\big(\mathrm{STACK}(a^{t+1})\big)$, i.e., the one-hot code of the symbol at the top of the stack after applying action $a^{t+1}$ (3)

where $\theta_s$, $\theta_a$, and $\theta_r$ are threshold values that determine what the next state of each discrete unit will be (sampled uniformly from a special interval so as to create continuous values for backprop to work with). Note that $s^{t+1}$ is the next hidden state, $a^{t+1}$ is the next stack action, and $r^{t+1}$ is the next value of the neuron that reads the content at the top of the stack. $G_s$ and $G_a$ are non-linear activation functions, specifically, quantized sigmoidal functions, defined as:

$\sigma(v) = 1/(1 + e^{-v})$ (4)
$G_s(v) = \mathbf{1}[\sigma(v) > \theta_s]$ (5)
$G_a(v) = \mathbf{1}[\sigma(v) > \theta_a] - \mathbf{1}[\sigma(v) < 1 - \theta_a]$ (6)
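
To make the above concrete, the following Python sketch implements one inference step of a third-order NSPDA under our reconstruction of Equations 1-6. The function names, the tanh-based squashing for actions, and the default thresholds are illustrative assumptions rather than a definitive implementation.

    import numpy as np

    def quantized_state(v, theta_s=0.5):
        # Eqs. (4)-(5): squash with a sigmoid, then threshold to a binary state.
        p = 1.0 / (1.0 + np.exp(-v))
        return (p > theta_s).astype(np.float64)

    def quantized_action(v, theta_a=0.5):
        # Eq. (6)-style ternary quantization: -1 (pop), 0 (no-op), +1 (push).
        p = np.tanh(v)  # squash to (-1, 1); one plausible choice of squasher
        return np.where(p > theta_a, 1.0, np.where(p < -theta_a, -1.0, 0.0))

    def nspda_step(s, r, x, Ws, Wa, b):
        # Third-order inference, Eqs. (1)-(2): the 4D tensors are indexed as
        # (next unit i, state j, read k, input l).
        pre_s = np.einsum('ijkl,j,k,l->i', Ws, s, r, x) + b
        pre_a = np.einsum('ijkl,j,k,l->i', Wa, s, r, x)
        return quantized_state(pre_s), quantized_action(pre_a)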

As the NSPDA processes a string, a prediction of its validity is made at each step. Specifically, the output weights $W^y$ (and bias scalar $b^y$) are used to map the state vector to the output space. The output model is defined as $y^{t} = \sigma(W^y \cdot s^{t} + b^y)$, where $\sigma$ is the logistic link function. The actual external stack itself is manipulated by discrete-valued action neurons that trigger a discrete push or pop action (as given by Equation 2). Take, for example, a 2-letter alphabet, i.e., $\Sigma = \{a, b\}$. The dimensions of the action and read spaces would then, in this case, be $N_x = 2$. When using a digital stack, the following actions can be taken:

  • PUSH: the current input is pushed onto the top of the stack. Example: to push the symbol “a”, use $a^{t} = (1, 0)$, after which the read becomes $r^{t+1} = (1, 0)$.

  • POP: the element at the top of the stack is removed. Example: to remove the symbol “b”, use $a^{t} = (0, -1)$, after which $r^{t+1}$ reflects the newly exposed top symbol.

  • NO-OP: this simply means “no operation”, or, in other words, nothing is to be done with the stack. Example: use $a^{t} = (0, 0)$, leaving $r^{t+1}$ unchanged.

In the case of the vector $r^{t}$, we are reading the symbol currently located at the top of the stack at each time step (the corresponding read vectors are shown above in the action vector examples). Our goal is to make sure the RNN chooses the correct action during training and yet still maintains stable binary read states.
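
The digital stack itself can be illustrated with a short sketch; the one-entry-per-symbol action encoding below follows the examples above and is an assumption of this illustration.

    class DigitalStack:
        # A toy digital stack driven by ternary action vectors.
        def __init__(self, alphabet=("a", "b")):
            self.alphabet = alphabet
            self.items = []

        def apply(self, action):
            # action: one ternary entry per symbol (+1 push, -1 pop, 0 no-op).
            for idx, v in enumerate(action):
                if v > 0:                    # PUSH the symbol at this index
                    self.items.append(self.alphabet[idx])
                elif v < 0 and self.items:   # POP the top element
                    self.items.pop()

        def read(self):
            # One-hot read vector for the top-of-stack symbol (zeros if empty).
            r = [0.0] * len(self.alphabet)
            if self.items:
                r[self.alphabet.index(self.items[-1])] = 1.0
            return r

For example, DigitalStack().apply((1.0, 0.0)) pushes “a”, after which read() returns [1.0, 0.0].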

Figure 1: NSPDA diagram with iterative refinement step

Learning and Optimization

First, we define the loss function used both to measure the performance of the network and to optimize its parameters. Classically, state neural models such as the NNPDA exclusively made use of a binary loss function that only considered whether a string was valid or invalid [das1993using]. Furthermore, these models only made a prediction/classification at the very end of the sequence. In contrast, the NSPDA is an iterative, step-by-step predictive model. Thus, we use a sequence loss based on binary cross entropy. (In preliminary experiments, models using a squared error loss, with and without regularization penalties, had great difficulty in converging; we found cross entropy to be far more effective.) The instantaneous loss, for a single sequence $(x^{1:T}, l)$, is:

$\mathcal{L}(\Theta) = -\sum_{t=1}^{T} \big[\, l \log y^{t} + (1 - l) \log(1 - y^{t}) \,\big]$ (7)

where $y^{t}$ is the $t$-th prediction/output from the final state neuron. Note that the label $l$ is copied to each step in time, which injects an extra error signal throughout the sequence length, improving the optimization process (as opposed to relying on only a single output error signal being effectively propagated backwards through the underlying computation graph). To compute updates for the NSPDA’s parameters, we employed several gradient-based approaches, including the popular and common back-propagation through time (BPTT) procedure as well as online algorithms such as real-time recurrent learning (RTRL) [williams1989rtrl] and unbiased online recurrent optimization (UORO) [tallec2017uoro]. In short, all of these algorithms compute gradients of the loss function (Equation 7) with respect to the NSPDA weights. The primary difference between them is that BPTT is based on a reverse-mode differentiation routine while RTRL is based on forward-mode differentiation (and UORO is a faster, higher-variance approximation of RTRL). We describe UORO and RTRL in further detail in the appendix. While UORO and RTRL are not commonly used to train modern-day RNNs, they offer faster ways to train them without requiring graph unfolding. Thus, we compare the results of using each in our experiments.
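
A direct transcription of Equation 7 might look as follows (a minimal sketch; y_hat holds the per-step outputs $y^{t}$ and label is the copied string label $l$):

    import numpy as np

    def sequence_bce(y_hat, label, eps=1e-8):
        # Eq. (7): the string label is copied to every time step, injecting an
        # error signal at each position of the sequence.
        y = np.clip(np.asarray(y_hat, dtype=np.float64), eps, 1.0 - eps)
        return float(-np.sum(label * np.log(y) + (1.0 - label) * np.log(1.0 - y)))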

Iterative Refinement

One important element we introduced into the training protocol of the NSPDA is iterative refinement, an algorithm proposed in the signal processing literature for incorporating partial iterative inference into a next-step predictive RNN [ororbia2019iterdecode]. At a high level, this means that, during training, at step $t$, the NSPDA is forced to predict the same target ($l$) $K$ times (except for the state transitions that are provided as “hints”, which we describe in a later section). Crucially, the state vector is still carried over these steps, through the recurrent synapses relating the state of the model at time $t$ to that at time $t+1$. To adapt iterative refinement to a next-step sequence model like the NSPDA, it can be cleanly introduced by manipulating the sequence loss of Equation 7 as follows:

$\mathcal{L}(\Theta) = -\sum_{t=1}^{T} \sum_{p=1}^{P_t} \big[\, l \log y^{t,p} + (1 - l) \log(1 - y^{t,p}) \,\big]$ (8)
$P_t = K(1 - h_t) + h_t$ (9)

noting that we have introduced the variable $P_t$ to augment the sample, where $h = (h_1, \dots, h_T)$ is a binary “hint” vector (automatically generated); $h_t = 1$ signals that a hint is used at step $t$, while $h_t = 0$ means “no hint”. Empirically, we found a fixed, small value of $K$ worked well. In [ororbia2019iterdecode], using an RNN’s recurrent weights as a lateral processing mechanism [ororbia2019lifelong] was related to an RNN acting as a deep feedforward network with tied weights across hidden layers (a “prediction episode”). This means that additional nonlinearity (via depth) is being efficiently exploited without incurring the memory cost of storing extra weights. We found that iterative refinement introduces greater stability into the learning process, primarily when gradient noise is used. Note that, even in this case, while we work with full-precision weights for gradient computation, the weights are converted to discrete values before evaluation is conducted.
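
Under our reading of Equations 8 and 9, a refinement-aware training step can be sketched as follows; model.step and the automatically generated hint vector are hypothetical stand-ins for the NSPDA’s single-step inference.

    import math

    def refinement_loss(model, s, tokens, label, hints, K=3):
        # Each unhinted step is predicted K times; hinted steps get one pass.
        # The state s persists across refinement passes (Eqs. (8)-(9)).
        total = 0.0
        for x_t, h_t in zip(tokens, hints):
            passes = 1 if h_t == 1 else K
            for _ in range(passes):
                s, y_t = model.step(s, x_t)  # hypothetical single-step API
                y_t = min(max(y_t, 1e-8), 1.0 - 1e-8)
                total += -(label * math.log(y_t) + (1 - label) * math.log(1.0 - y_t))
        return total, s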

Two-Stage Incremental Learning

Incremental learning, or, in other words, training procedures that sort data samples by their inherent difficulty and progressively present them to a neural agent, has been shown to be quite effective when training RNNs on input data known to have some structure [elman1993learning, das1993using]. Based on this prior finding, we developed a two-stage incremental learning approach for improving a higher-order RNN’s ability to generalize to longer sequences. Formally, Algorithm 1 depicts the overall process. We found that a stochastic learning rate [ororbia2019iterdecode] worked better in the first stage, while a fixed learning rate combined with a stochastic noise process applied to the weights (similar to gradient noise) worked better during the second stage.

Input: $\Theta$ (model weights), training set $X$, validation set $V$, $p$ (midpoint length threshold), $\lambda$ (learning rate)
————————— Stage #1 —————————
Calculate longest string length $L$ in $X$
// Sequential Curriculum Update Phase
for $n = 1$ to $p$ do
     $X_n$ = Extract from $X$ all strings of length $n$
     TRAIN(Model($\Theta$), $X_n$, $\lambda$) // Single pass through $X_n$
// Random Curriculum Phase
while Model($\Theta$) not converged on $V$ or epoch budget not exhausted do
     TRAIN(Model($\Theta$), SHUFFLE($X_{1:p}$), $\lambda$)
————————— Stage #2 —————————
// Sequential Curriculum Update Phase
for $n = p+1$ to $L$ do
     $X_n$ = Extract from $X$ all strings of length $n$
     TRAIN(Model($\Theta$), $X_n$, $\lambda$) // Single pass through $X_n$
// Random Curriculum Phase
while Model($\Theta$) not converged on $V$ or epoch budget not exhausted do
     TRAIN(Model($\Theta$), SHUFFLE($X$), $\lambda$)
return $\Theta$ // Return final trained model weights
Algorithm 1: Two-Stage Incremental Learning

As we will see later experimentally, whenever the data has some exploitable structure that allows for an automatic sorting of samples by increasing complexity, incremental learning is highly effective in training higher-order RNNs. In the case of CFGs, we can sort samples based on string length and progressively build a model that can learn to generalize to increasingly longer string sequences. Algorithm 1 depicts the full process (note that the midpoint threshold $p$ is fixed in this paper and that $e$ is a variable that marks the number of epochs so far).
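
As a rough sketch of Algorithm 1’s data handling, the two stages can be organized as below; train_one_pass and converged are hypothetical helpers standing in for the TRAIN call and the validation check.

    import random

    def two_stage_incremental(model, dataset, p_mid):
        # Sort (string, label) samples by string length; stage 1 covers lengths
        # up to the midpoint threshold, stage 2 covers the remaining lengths.
        by_len = sorted(dataset, key=lambda ex: len(ex[0]))
        stage1 = [ex for ex in by_len if len(ex[0]) <= p_mid]
        stage2 = [ex for ex in by_len if len(ex[0]) > p_mid]
        for subset in (stage1, stage2):
            for ex in subset:                # sequential curriculum phase
                train_one_pass(model, [ex])
            while not converged(model):      # random curriculum phase
                random.shuffle(subset)
                train_one_pass(model, subset)
        return model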

Regularizing Higher Order RNNs

When any RNN is trained for long periods of time, the model tends to memorize the input training data, which damages its ability to generalize to unseen sequence data, i.e., overfitting. Higher order RNNs are also susceptible to overfitting given their high capacity and complexity, and yet, to our knowledge, no regularizer has previously been proposed to help these kinds of RNNs combat overfitting. In this work we extend to RNNs an adaptive (layer-dependent) noise scheme that was originally proposed for training neurobiologically-plausible ANNs [ororbia2019biologically], where it showed strong positive results on simple feedforward classification tasks. Notably, our noise-based regularizer applies to the higher-dimensional tensors that are fundamental to implementing any $n$-th order RNN.

Input: tensor $W$, e.g., $W^s$ or $W^a$; $K$ = percentage of noise
function CreatePartitions($W$, $K$)
     $m$ = product of the leading dimensions of $W$ // view $W$ as $m$ matrices
     $k$ = $\lceil K \cdot m \rceil$ // number of matrices to perturb
     randomly select $k$ matrix indices in $W$
     divide $W$ into the selected and unselected partitions
     create the set $S$ of selected matrices
     Return $S$
function AdaptiveNoise($W$, $K$)
     $S$ = CreatePartitions($W$, $K$)
     for each matrix $M$ in $S$ do
          draw Gaussian scalar sample $g$; $M \leftarrow M + g$
     remap the matrices in $S$ back into a tensor shaped like $W$
     Return $W$ // use the updated weights for gradient computation
Algorithm 2: Adaptive Noise Regularizer

We are also motivated by the fact that injecting noise into gradients can encourage exploration of an RNN’s error optimization landscape [Goodfellow16] in one of two ways: 1) at the input, i.e., data augmentation [Goodfellow16], or 2) at the recurrence [krueger2016zoneout]. Our regularizer falls under the second case. (We also implemented a data augmentation approach but found it yielded poor results when learning context-free grammars.) The key details of our noise-based regularizer are depicted in Algorithm 2. Based on preliminary experiments, we found that a noise level below 30% and above 8% helps the network converge faster and, more importantly, generalize better on unseen sequences longer than those found in the training set. Experimentally, we will later see that this regularizer improves generalization even when prior knowledge is not integrated into the RNN.
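
A minimal NumPy sketch of Algorithm 2, under our reading (the odd/even partition-sizing details are simplified away), is given below; the default percentage and noise scale are illustrative.

    import numpy as np

    def adaptive_noise(W, noise_pct=0.15, sigma=0.01, rng=None):
        # View the higher-order tensor as a stack of matrices, pick a fixed
        # percentage of them at random, and shift each by a Gaussian scalar.
        rng = rng or np.random.default_rng()
        mats = W.reshape(-1, W.shape[-2], W.shape[-1]).copy()
        k = max(1, int(noise_pct * mats.shape[0]))
        for i in rng.choice(mats.shape[0], size=k, replace=False):
            mats[i] += sigma * rng.standard_normal()  # one scalar per matrix
        return mats.reshape(W.shape)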

Integrating Prior Knowledge

Programming and Inserting Rules

We start by defining the data-generating process that any RNN is to learn from, i.e., a PDA that generates a set of positive and negative strings. Formally, the $n$-state PDA is defined as a 7-tuple $M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F)$ where:

  • $\Sigma$ is the input alphabet

  • $Q$ is the finite set of states

  • $\Gamma$ is known as the stack alphabet (a finite set of tokens)

  • $q_0 \in Q$ is the start state

  • $Z_0 \in \Gamma$ is the initial stack symbol

  • $F \subseteq Q$ is the set of accepting states

  • $\delta: Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma \rightarrow$ finite subsets of $Q \times \Gamma^{*}$ is the state transition mapping.
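
As a concrete (toy, hypothetical) instance of this 7-tuple, a PDA for the language $\{a^n b^n : n \geq 1\}$ can be written as a small transition table:

    # PDA for {a^n b^n : n >= 1}; an illustrative stand-in, not one of the
    # grammars studied in our experiments.
    PDA = {
        "states": {"q0", "q1", "qf"}, "start": "q0", "accept": {"qf"},
        "stack_start": "Z",
        # (state, input, top-of-stack) -> (next state, string replacing top)
        "delta": {
            ("q0", "a", "Z"): ("q0", "AZ"),  # push an A for each leading a
            ("q0", "a", "A"): ("q0", "AA"),
            ("q0", "b", "A"): ("q1", ""),    # begin matching b's: pop an A
            ("q1", "b", "A"): ("q1", ""),
            ("q1", "",  "Z"): ("qf", "Z"),   # input consumed, only Z left: accept
        },
    }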

To insert rules describing known state transitions into the ($n$-state) NSPDA, one must program its recurrent weights (which can be second or third order). Since the number of states in the PDA is not known beforehand, we assume that the number of state neurons satisfies $N_s \geq n$ and that the network has enough capacity to learn an unknown context-free grammar. In order to program and insert rules, we propose adapting methodology originally developed for second-order RNNs and deterministic finite state automata (DFA) [omlin1996constructing] to the case of PDA-based RNNs. Specifically, we exploit the similarity between the state transitions of the target PDA and the underlying dynamics of a stack-driven RNN. Consider a known transition $\delta(q_j, x_l, \gamma_k) = (q_i, \gamma')$, where $\gamma_k$ is the top of the stack and $\gamma'$ is the sequence of symbols replacing it. We then identify PDA states $q_i$ and $q_j$, which correspond to state neurons $s_i$ and $s_j$, respectively. Recall that each symbol has specific stack operations associated with it, which provide prior knowledge as to when to push to and when to pop from the stack. It is desirable that state neuron $s_i$ has a high output close to $1$ and $s_j$ has a low output close to $0$ after reading the input symbol using input neuron $x_l$ and the top of the stack using read neuron $r_k$ (remember that a read depends on an action neuron, as depicted in model Equation 3). This condition can be achieved by doing the following: 1) set the (third order) weight $W^s_{ijkl}$ to a large positive value, which helps to ensure that state neuron $s_i$ at the next time step will be high (and since $G_s$ is sigmoidal, this tends towards $1$), and 2) set $W^s_{jjkl}$ to a large negative value, which makes the output of state neuron $s_j$ low (tending towards $0$). The next item to consider is the set of ternary action weights stored in $W^a$, which drive the action neurons that yield the stack operations (recall that $[-1, 0, 1]$ maps to [pop, no-op, push]). First, we must assume that the total contribution of the weighted output of all other state neurons can be neglected – this can be achieved by setting all other state neurons to the lowest value. In addition, we assume that each state neuron can only be assigned to one known state of the PDA. If we have prior knowledge of accepting and non-accepting states related to a particular neuron, we may then bias its output. We start from the leftmost neuron in the state vector and work towards the rightmost, programming each one by one. Armed with these assumptions, we can stably encode rules into the NSPDA by programming the bias $b_i$ to be a large positive value if the PDA state is an accepting state. Otherwise, we set $b_i$ to be a large negative value if the state is non-accepting. If no such knowledge of the PDA is available, $b_i$ remains unchanged. Though described for a third order NSPDA, the above approach for programming weights applies to a second order model as well. In a lower order NSPDA, with 3D weight tensors $W^s$ and $W^a$, state updates and transitions are conducted by concatenating a read neuron with an input neuron to create a single vector. However, when programming a second order model, we are now working with a DFA [omlin1996constructing] instead of a PDA, which limits the capabilities of the NSPDA (as well as restricts its capacity) since we do not possess any knowledge about what to push or pop. Nevertheless, when combined with our proposed learning procedure that incorporates iterative refinement, we believe that the second order NSPDA can still learn what action to perform. However, the issue of dimensionality arises – the state space of a lower order model is very large when compared to that of a third order NSPDA.
In the case of a PDA-based model, pushing multiple symbols might lead to reaching the same accepting state; however, in the case of a DFA-based model (the second order NSPDA), we create separate sets of accepting states for each symbol. We found that this splitting mechanism was crucial in getting our network to work properly with a digital stack. While the above rule insertion scheme seems simple enough, determining the actual values for the weights that are to be programmed can be quite problematic. In the case of third order synaptic connections (with binary weights), the number of possible weight configurations grows exponentially with the number of neurons, which would quickly render our method impractical and nearly useless. However, we can sidestep this computational infeasibility by making use of “hints” [omlin1996constructing] within the framework of “orthogonal state encoding”. By assuming that the PDA starts generating a valid grammar at its initial state, we can randomly choose a single state and set the output of one state neuron equal to $1$; the outputs of all the other state neurons are set equal to $0$. Following this, we set the values of the weights (according to known state transitions) using the approach described above. Notably, these weights, though initially programmed, are still adaptable, making them amenable to tuning to the target grammar underlying a data sample. Programming the weights of second or third order networks jointly impacts the behavior of the state neurons, the read neurons, and the input neurons. Following the scheme described above yields sparse NSPDA representations of PDA states. It is difficult to program an NSPDA with a minimal number of states, despite the fact that we have a theoretical guarantee that the third order model is equivalent to PDA dynamics [nndpa1998sun]. As we will observe in our results, the proposed methodology significantly reduces the NSPDA’s convergence time during optimization (leading to training times roughly comparable to those characteristic of first order RNNs), which is particularly important given the fact that its inference process entails 4D tensor products (which are far more expensive than the matrix computations of modern-day RNNs).
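
The weight-programming scheme itself reduces to a few assignments; the sketch below assumes the tensor index convention used earlier, and H is an illustrative “large” magnitude (the programmed weights remain trainable afterwards).

    def insert_transition(Ws, i, j, k, l, H=6.0):
        # Known transition delta(q_j, x_l, gamma_k) -> q_i: drive state neuron i
        # high and state neuron j low when input l is read with top-of-stack k.
        Ws[i, j, k, l] = +H   # next-state neuron i tends towards 1
        Ws[j, j, k, l] = -H   # current-state neuron j is driven towards 0
        return Ws

    def program_accepting_bias(b, i, accepting, H=6.0):
        # Encode knowledge of (non-)accepting PDA states in the state biases.
        b[i] = +H if accepting else -H
        return b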

Rule Method                 | W1   W2  | W1   W2  | W1   W2  | W1   W2
NNPDA w/o hints             |  –    –  |  –    –  |  –    –  |  –    –
NNPDA w/ dead neuron hints  |  –    –  |  –    –  |  –    –  |  –    –
NSPDA w/o hints             |  –    –  |  –    –  |  –    –  |  –    –
NSPDA w/ Hint #1            |  –    –  |  –    –  |  –    –  |  –    –
NSPDA w/ Hint #2            | 70   72  | 150  138 | 389  134 | 222  148
Table 1: Comparison between NSPDAs trained w/ and w/o hints using either 2nd order weights (W1) or 3rd order weights (W2).
Train Method  | M1     M2    | M1     M2    | M1      M2     | M1      M2
Standard      |  –      –    |  –      –    |  –       –     |  –       –
IL            |  –      –    |  –      –    |  –       –     |  –       –
2-IL (ours)   | 2001   2199  | 9899   10001 | 130192  129998 | 177189  177190
Table 2: Incremental learning NSPDA (without hints) performance results. Each value is the average number of characters required to reach convergence (M1 = 2nd order NSPDA, M2 = 3rd order NSPDA).
Regularization Method | M1    M2   | M1    M2   | M1    M2   | M1    M2
w/o reg               |  –     –   |  –     –   |  –     –   |  –     –
w/ reg                | 0.00  0.00 | 0.06  0.01 | 0.99  0.00 | 0.09  0.00
Table 3: Mean classification error for an NSPDA w/ & w/o adaptive noise (tested on strings up to the maximum evaluation length).
RNN Type                  | Train  Test | Train  Test | Train  Test | Train  Test | Train  Test
RNN                       |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
LSTM                      |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
LSTM-p                    |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
GRU                       |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
Stack RNN 40+10           |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
Stack RNN 40+10+rounding  |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
listRNN 40+5              |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
2nd Order RNN             |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
2nd Order RNN reg (ours)  |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
NNPDA                     |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
NNPDA reg (ours)          |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
NSPDA, M1 (ours)          |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
NSPDA, M2 (ours)          | 0.00  0.00  | 0.00  0.01  | 0.00  0.00  | 0.00  0.00  | 0.01  0.88
Table 4: Mean classification error for various recurrent architectures when tested on strings up to the maximum evaluation length.

Experimental Details

We focused on five context-free grammars, including Dyck(2) languages, which are among the more difficult CFGs to recognize. For each grammatical inference task, we create a dataset that contains positive and negative (string) samples, with sequence lengths drawn uniformly from a bounded interval. From the samples generated, we randomly drew a subset of the total number of tokens generated. The number of state neurons for the second order NSPDA, and likewise for the third order NSPDA, was set as a function of the alphabet size and the assumed number of PDA states. All models made use of the iterative refinement loss (Equation 9); weight updates were computed using whichever algorithm, i.e., BPTT, truncated BPTT (TBPTT), RTRL, or UORO, yielded the best performance for a given model. For higher order networks, UORO performed better and we use it to optimize all RNNs of this type in this study (for all first order RNNs, we found BPTT worked best and use it to train all RNNs of that type; in the appendix, we offer a comparison of the various weight update rules when training an NSPDA). Gradients were hard-clipped. Parameters were updated using stochastic gradient descent (SGD) with the stochastic learning rate annealing scheme proposed in [ororbia2019iterdecode]. All models were trained for a fixed maximum number of epochs (or until convergence was reached, which was marked as 100% training accuracy), and every experiment was repeated multiple times. All of our models used our proposed rule encoding scheme and all of the RNNs were trained using our proposed two-stage incremental learning procedure. In Table 2, to demonstrate the value of our proposed two-stage incremental training procedure (2-IL), we compare an NSPDA trained without any incremental learning, one trained with ours, and one trained with the incremental learning approach (IL) proposed in [das1993using], and find that our approach yields the best results across all grammars. All higher-order RNNs made use of our proposed adaptive noise regularizer; in Table 3, we examine how the NSPDA performs with and without it. With respect to the hints used, for all tables presented in the main paper, whenever hint usage is indicated, we mean Hint #2 (which worked the best empirically). In the appendix, we provide a detailed breakdown and ablation for all of the models investigated in this paper; specifically, we present results for models trained with and without our regularizer as well as under various hint insertion conditions (no hints, Hint #1, and Hint #2). Baseline Algorithms: To provide the proper context to demonstrate the effectiveness of our proposed NSPDA, we conduct a thorough comparison of our model against as many baseline RNN models as possible. These include a plethora of first order RNNs, such as variations of the stack-RNN [joulin2015inferring] (all metaparameters set according to the original source), including the two stack variants as well as the linked-list model (using the same model labels as the original paper), the Long Short-Term Memory RNN [hochreiter1997long] with (LSTM) and without peepholes (LSTM-p), the Gated Recurrent Unit (GRU) RNN [chung2014empirical], and a simple Elman RNN.
We also compared against gated first order RNNs with multiplicative units, but due to space constraints, we report those results in the appendix. We furthermore compare against second order RNNs with regularization (2nd Order RNN reg) and without (2nd Order RNN), as well as the classical NNPDA with and without regularization (NNPDA reg). All baseline RNNs had a single layer of neurons, and the individual hyperparameters of each were optimized based on validation set performance.

Results and Discussion

To the best of our knowledge, we are the first to conduct a comparison across such a wide variety of RNN models of first, second, and third order, with and without external (stack-based) memory. For simple algorithmic patterns (non-Dyck(2) CFGs), first order RNNs like the LSTM and GRU perform reasonably well, primarily because they utilize dynamic counting [lstmcfg, sennhauser2018evaluating], yet they do not learn any state transitions. This is evidenced by their performance on the complex Dyck(2) CFG, where the majority of RNNs exhibit great difficulty in generalizing to longer sequences. These results corroborate those of prior work, specifically those demonstrating that the LSTM essentially performs a form of dynamic counting, making it ill-suited to recognizing complex grammars [Lstmdynamiccounting]. As pointed out by [Lstmdynamiccounting], there is a strong need for neural architectures with external memory, i.e., a stack, to solve complex CFGs but, in this study, we furthermore argue that prior knowledge is needed as well. This makes sense given that it is known that prior information often leads to greatly improved reasoning and better generalization [manning2019nsm]. The stack and list RNNs do make use of (continuous) external memory (in fact, multiple stacks/lists) but, theoretically, only one stack should be sufficient to recognize strings of arbitrary length from a PDA, while a 2-stack PDA is as powerful as a Turing machine [hopcroft2pda]. However, quite surprisingly, a stack-RNN with even 10 stacks has difficulty in generalizing to a complex grammar. This lines up with the theory – [hopcroft2pda] proved that adding more than two stacks to a PDA does not provide any further computational advantage. Finally, it is impressive to see that higher order RNNs coupled with external memory, particularly with a discrete stack structure (as opposed to a continuous stack like that of the stack-RNN), perform so well across all CFGs. It is important to note that even the way our state-based RNN operates is markedly different from the way those of the past did – the NSPDA works as a next-step prediction model, which allows us to use the powerful iterative refinement procedure as a way to aggressively error-correct its states when predicting string validity (at least during training time). Table 4 shows that our NSPDA model generalizes very well when trained on short sequences but tested on sequences far longer than those seen during training. Finally, our results demonstrate the value of rule insertion, which, as we see empirically, in some cases improved convergence speed by a wide margin.

Conclusions

In this work, we proposed the neural state pushdown automaton (NSPDA) and its learning process, which utilizes an iterative refinement-based loss function, a two-stage incremental training procedure, an adaptive noise regularization scheme (which works with any higher order network), and a method for stably encoding rules into the model itself. Our experimental results, which focused on context-free grammars (CFGs), demonstrate that prior knowledge is essential to learning memory-augmented networks that recognize complex CFGs well. Notably, we have empirically demonstrated the expressivity and flexibility of a higher order temporal neural model that learns how to manipulate an external discrete stack. While our proposed neural model works with a discrete stack, its underlying framework could be extended to manipulate other kinds of data structures, a subject of future work. When trained on various CFGs, the state-based neural models we optimize converge faster and are more expressive than even powerful classical models such as the neural network pushdown automaton. Furthermore, we have shown that modern-day, popular recurrent network structures (all of which are first order) struggle greatly to recognize complex grammars. These discovered limitations of first order RNNs indicate that ANN research should consider the exploration of more expressive, memory-augmented models that offer ways to better integrate prior knowledge.

References

Appendix

Additional Results

In Table 7, we report an expanded version of the model performance table that appears in the main paper. In it, we report the performance of 3 modern gated RNNs with multiplicative gating units, i.e., MI-RNN, MI-LSTM, and MI-GRU. Interestingly enough, one could consider multiplicative units to be a crude approximation of second order state neurons. Table 5 shows results for stably programming the weights of the NSPDA, which, in effect, demonstrates that a programmed NSPDA (without learning) is equivalent to the PDA of a complex grammar. In Table 6, we highlight how various learning algorithms affect the generalization ability of higher order recurrent networks. Here, we compare back-propagation through time (BPTT) to online learning algorithms such as real-time recurrent learning (RTRL) and unbiased online recurrent optimization (UORO). We describe these procedures in further detail in the next section. Notably, in our experiments, we observed that UORO boosts performance for higher order recurrent networks while being faster than RTRL, the original algorithm-of-choice for training higher order, state-based models. Furthermore, we remark that truncated BPTT (TBPTT) can, for some CFGs, slightly improve model performance over BPTT (but in others, such as the palindrome CFG, leads to worse generalization).

Model           | n=60  n=480  n=960 | n=60  n=480  n=960 | n=60  n=480  n=960 | n=60  n=480  n=960
2nd Order NSPDA |   –     –      –   |   –     –      –   |   –     –      –   |   –     –      –
3rd Order NSPDA |   –     –      –   |   –     –      –   |   –     –      –   |   –     –      –
Table 5: Mean classification error results when using a programmed NSPDA (lower is better).
Learning Algorithm | M1   M2 | M1   M2 | M1   M2 | M1   M2
BPTT               |  –    – |  –    – |  –    – |  –    –
TBPTT              |  –    – |  –    – |  –    – |  –    –
RTRL               |  –    – |  –    – |  –    – |  –    –
UORO               |  –    – |  –    – |  –    – |  –    –
Table 6: Mean classification error for the NSPDA trained via various learning algorithms (tested on strings up to the maximum evaluation length).
RNN Type                  | Train  Test | Train  Test | Train  Test | Train  Test | Train  Test
RNN                       |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
LSTM                      |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
LSTM-p                    |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
GRU                       |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
Stack RNN 40+10           |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
Stack RNN 40+10+rounding  |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
listRNN 40+5              |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
MI-RNN                    |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
MI-LSTM                   |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
MI-GRU                    |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
2nd Order RNN             |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
2nd Order RNN reg (ours)  |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
NNPDA                     |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
NNPDA reg (ours)          |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
NSPDA, M1 (ours)          |   –     –   |   –     –   |   –     –   |   –     –   |   –     –
NSPDA, M2 (ours)          | 0.00  0.00  | 0.00  0.01  | 0.00  0.00  | 0.00  0.00  | 0.01  0.88
Table 7: Mean classification error for various recurrent architectures when tested on strings up to the maximum evaluation length.

On Training Algorithms

For all of the RNNs we study, we compared their (validation) performance under various online and offline learning algorithms. As mentioned in the last section, we found that UORO worked best for the NSPDA, which is advantageous in that UORO is faster than RTRL (markedly so in terms of complexity) and does not require model unfolding as the popular, standard BPTT/TBPTT algorithms do. These results, again, are summarized in Table 6. Below we briefly describe the non-standard approaches to training RNNs, specifically RTRL and UORO. Notably, we are the first to implement and adapt UORO for calculating updates to the weights of higher order networks.

Real-Time Recurrent Learning

Real-time recurrent learning (RTRL) is a classical online learning procedure for training RNNs [williams1989rtrl]. The aim is to optimize the parameters $\theta$ of a state-based model in order to minimize a total (sequence) loss. The state model is abstracted by the following function:

$s^{t+1} = F_{state}(x^{t+1}, s^{t}, \theta)$ (10)

RTRL computes the derivative of the model’s states and outputs with respect to the synaptic weights during the model’s forward computation, as data points in the sequence are processed iteratively, i.e., without any unfolding as in BPTT. When the task is next-step prediction (predict $x^{t+1}$ given the history $x^{1}, \dots, x^{t}$), the loss to optimize, using RTRL, is defined as follows:

$\mathcal{L}(\theta) = \sum_{t} \ell(y^{t}, x^{t+1})$, where $y^{t}$ is the prediction at step $t$ (11)

Once we differentiate Equation 10 with respect to $\theta$, we obtain:

$\dfrac{\partial s^{t+1}}{\partial \theta} = \dfrac{\partial F_{state}}{\partial s^{t}} \dfrac{\partial s^{t}}{\partial \theta} + \dfrac{\partial F_{state}}{\partial \theta}$ (12)

where at each time $t$ we compute $\partial s^{t+1} / \partial \theta$ based on $\partial s^{t} / \partial \theta$. These values are then used to directly compute $\partial \mathcal{L} / \partial \theta$. The above is, in short, how RTRL calculates its gradients without resorting to backward transfer or computation-graph unfolding (as in reverse-mode differentiation). Since the Jacobian $\partial s / \partial \theta$ has one entry per (state, parameter) pair, for standard RNNs with $n$ hidden units this calculation scales as $O(n^4)$ in time [williams1995gradient]. This high complexity makes RTRL highly impractical for training very wide and very deep recurrent models. However, in the case of a third order model like the NSPDA (or an NNPDA), the number of states needed for learning a target grammar is generally far smaller than that required by second or first order models (as we mentioned in the main paper). This means that a procedure such as RTRL is still applicable and useful, at least for training RNNs to recognize context-free grammars (of low input dimensionality).
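
For intuition, a minimal RTRL step for a vanilla first order RNN $s^{t+1} = \tanh(W s^{t} + U x^{t})$ is sketched below (an illustrative reduction; the NSPDA analogue would substitute the third order state map of Equation 1):

    import numpy as np

    def rtrl_step(s, x, W, U, J_W, J_U):
        # Forward-mode update of the Jacobians J_W = ds/dW and J_U = ds/dU for
        # s_new = tanh(W s + U x); no graph unfolding is required (Eq. (12)).
        s_new = np.tanh(W @ s + U @ x)
        D = np.diag(1.0 - s_new ** 2)  # derivative of tanh at the pre-activation
        n = W.shape[0]
        dF_dW = np.einsum('ip,q->ipq', np.eye(n), s)  # immediate dependence on W
        dF_dU = np.einsum('ip,q->ipq', np.eye(n), x)  # immediate dependence on U
        J_W = np.einsum('ij,jpq->ipq', D @ W, J_W) + np.einsum('ij,jpq->ipq', D, dF_dW)
        J_U = np.einsum('ij,jpq->ipq', D @ W, J_U) + np.einsum('ij,jpq->ipq', D, dF_dU)
        return s_new, J_W, J_U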

Unbiased Online Recurrent Optimization

Unbiased Online Recurrent Optimization (UORO) [tallec2017uoro] uses a rank-one trick to approximate the operations needed to make RTRL’s gradient computation work. This trick reduces the overall complexity of RTRL at the price of increased variance in its gradient estimates. When designing an optimizer like UORO, we start from the idea that, for any given unbiased estimate of $\partial s^{t} / \partial \theta$, we can form a stochastic matrix $\tilde{G}^{t}$ such that $\mathbb{E}[\tilde{G}^{t}] = \partial s^{t} / \partial \theta$. Since Equations 11 and 12 are affine in $\partial s / \partial \theta$, the “unbiasedness” (of gradient estimates) is preserved due to the linearity of expectation. Next, we compute the value of $\tilde{G}^{t}$ and plug it into Equations 11 and 12 to calculate the loss gradient and the next-step Jacobian estimate. In a rank-one, unbiased approximation, at time step $t$, $\tilde{G}^{t} = \tilde{s}^{t} \otimes \bar{\theta}^{t}$, the outer product of a state-shaped vector and a parameter-shaped vector. To calculate the estimate at $t+1$, we plug $\tilde{G}^{t}$ into Equation 12. Nonetheless, mathematically, the result is still not a rank-one approximation of RTRL. In order to finally obtain a proper rank-one approximation, one must use an additional, efficient approximation technique, proposed in [ollivier2015training], to rewrite the update as:

$\tilde{s}^{t+1} \otimes \bar{\theta}^{t+1} = \Big(\rho_0 \dfrac{\partial F_{state}}{\partial s^{t}}\tilde{s}^{t} + \rho_1 \nu\Big) \otimes \Big(\dfrac{\bar{\theta}^{t}}{\rho_0} + \dfrac{\nu^{\top}(\partial F_{state}/\partial \theta)}{\rho_1}\Big)$ (13)

Note that $\nu$ is a vector of independent, random signs and $\rho_0, \rho_1$ are positive numbers. Thus, the rank-one trick can be applied at any time step. In UORO, $\rho_0$ and $\rho_1$ are factors meant to control the variance of the estimator’s computed approximate derivatives. In practice, we define $\rho_0$ as:

$\rho_0 = \sqrt{\dfrac{\|\bar{\theta}^{t}\|}{\|(\partial F_{state}/\partial s^{t})\,\tilde{s}^{t}\|}}$ (14)

and $\rho_1$ is defined to be:

$\rho_1 = \sqrt{\dfrac{\|\nu^{\top}(\partial F_{state}/\partial \theta)\|}{\|\nu\|}}$ (15)

Initially, $\tilde{s}^{0} = 0$ and $\bar{\theta}^{0} = 0$, which yields an unbiased estimate at time $t = 0$. Given the construction of the UORO procedure, all subsequent estimates can be shown, by induction, to be unbiased as well.
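
Putting Equations 13-15 together, one UORO update can be sketched as follows; here A stands for $\partial F_{state}/\partial s^{t}$ and dF_dtheta for $\partial F_{state}/\partial \theta$ (both assumed given), and eps guards the norm ratios.

    import numpy as np

    def uoro_step(s_tilde, theta_bar, A, dF_dtheta, rng=None, eps=1e-12):
        # Maintain s_tilde (state-shaped) and theta_bar (parameter-shaped) so
        # that E[outer(s_tilde, theta_bar)] tracks ds/dtheta (Eq. (13)).
        rng = rng or np.random.default_rng()
        nu = rng.choice([-1.0, 1.0], size=s_tilde.shape)  # random signs
        fwd = A @ s_tilde
        back = nu @ dF_dtheta                             # nu^T (dF/dtheta)
        rho0 = np.sqrt((np.linalg.norm(theta_bar) + eps) / (np.linalg.norm(fwd) + eps))
        rho1 = np.sqrt((np.linalg.norm(back) + eps) / (np.linalg.norm(nu) + eps))
        return rho0 * fwd + rho1 * nu, theta_bar / rho0 + back / rho1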
