# Improving the Neural GPU Architecture for Algorithm Learning

## Abstract

Algorithm learning is a core problem in artificial intelligence with significant implications on automation level that can be achieved by machines. Recently deep learning methods are emerging for synthesizing an algorithm from its input-output examples, the most successful being the Neural GPU, capable of learning multiplication. We present several improvements to the Neural GPU that substantially reduces training time and improves generalization. We introduce a technique of general applicability to use hard nonlinearities with saturation cost. We also introduce a technique of diagonal gates that can be applied to active-memory models. The proposed architecture is the first capable of learning decimal multiplication end-to-end.

31pt

## 1Introduction

Deep Neural Networks have achieved state-of-the-art results in a wide range of tasks, notably in computer vision [17], speech recognition [1], and natural language processing [4] and extensive work is performed to broaden its scope of application. A particularly interesting area is algorithm learning i.e. synthesizing an algorithm from its input-output examples. It is a long-standing open problem with theoretical results dating back to [7] but still, the results are far from reaching an industry scale.

With the emergence of powerful deep learning methods, research on algorithm learning has acquired a new dimension. Several architectures have appeared that are capable of learning algorithms of moderate complexity, such as sorting, addition or multiplication. The ultimate goal, of course, is to learn algorithms for tasks with unknown solutions. That is not viable yet; nevertheless, the techniques developed for algorithm learning may yield to progress in practical fields relevant now. For example, the Neural GPU architecture, which was designed primarily for algorithm learning, was recently extended to perform machine translation [13].

Neural GPU [14] is the most promising among the proposed architectures for algorithm learning because it is the only one capable of learning multiplication that generalizing to inputs much longer than the training examples. However, it is fragile since only a tiny fraction of the trained models generalize well.

In this paper, we study ways to improve the Neural GPU to obtain faster training and better generalization. The proposed improvements allow us to achieve substantial gains: the model can learn binary multiplication in 800 steps versus 30000 steps that are needed for the original Neural GPU, and, most importantly all the trained models generalize to 100 times longer inputs with less than 1% error. The model can also learn a wider range of problems with similar generalization performance, e.g. the decimal multiplication, which is the first time it has been learned end-to-end. To learn decimal multiplication we use a different representation where each decimal digit is encoded in binary.

The improvements that achieve these goals are removal of the parameter sharing relaxation, introduction of nonlinearities with saturation cost, introduction of a diagonal gating mechanism. We also improve the training schedule by training on all input lengths simultaneously and use a larger learning rate with AdaMax optimizer [16]. We integrate gradient clipping into AdaMax.

We analyze the impact of each improvement separately and show that all of them are relevant to the achieved performance. We find that using hard nonlinearities with saturation cost is the key factor to achieve good generalization.

## 2Related Work

There are two primary approaches to algorithm learning recurrent networks and reinforcement learning. In the reinforcement learning approach, the algorithmic device called controller is decoupled from the environment where the program executes. The device operates in steps which consist of observing the current state of the environment and issuing commands which change the next state of it. Such approach (at least theoretically) scales well in time and memory domain, but elaborate techniques of training are necessary since the overall structure cannot be differentiated. Simple algorithms such as sequence copying and reversal can be learned with the current reinforcement learning techniques [25].

Recurrent networks have a simple computation cell that is unrolled in time according to the length of an input sequence. Such approach is employed by LSTM[11] and GRU [5] networks for sequence classification. Thy scale with sequence length but each cell has a constant amount of memory that essentially limits the learnable problems to regular languages.

There are several more elaborate architectures proposed for algorithm learning. [8] developed a Neural Turing Machine capable of learning and executing simple programs such as repeat copying, simple priority sorting, and associative recall. They use complicated memory addressing to make the model differentiable.

[12] introduce differentiable stack and double linked list data structures. Pointer Networks [24] use soft attention and generalize to a variable-sized output space depending on the input sequence length. This model was shown to be effective for combinatorial optimization problems such as the traveling salesman and Delaunay triangulation. Neural Random-Access Machines [18] introduce a memory addressing scheme potentially allowing constant time access due to discretization. [9] introduce more neural data structures and evaluate them on several sequence processing tasks. A hierarchical memory layout with logarithmic access time is introduced in [2] with both differentiable and reinforcement learning versions being presented.

Grid LSTM [15] allow explicit unrolling along time and memory dimension and are able to learn such tasks as addition and memorization.

A different setting of algorithm learning is explored in [22] where a model is trained on execution traces instead of input and output pairs; this richer supervision allows to induce higher level programs.

Neural GPU [14] is the current state of the art in deep algorithm learning. It can learn fairly complicated algorithms such as addition and binary multiplication, but only a small fraction of the trained models generalize well. The authors train 729 models to find one that generalizes well. They have no success for training decimal multiplication. [21] is able to train Neural GPU on decimal multiplication by using curriculum learning when the same model is trained at first for binary multiplication then for base-4 and only then for decimal.

The basic idea of Neural GPU architecture is promising, and in this paper, we will we work on making it more powerful.

## 3The Model

Neural GPU was introduced by [14]. It is a recurrent network with a multi-dimensional state where a Convolution Gated Recurrent Unit (CGRU) is applied to the state at every time-step. CGRU is a combination of convolution operation and GRU[5] which computes the state at time from the state at time according to the following rules:

In the above equations, , , are convolution kernel banks, , , are bias vectors; these are the parameters that will be learned. denotes a convolution of a kernel bank with a state ; denotes element-wise vector multiplication and is the sigmoid function.

Given an input of length , it is embedded into the first state, each symbol independently, producing a state with its first dimension equal to , then CGRU is applied to it several times, and output is read from the last state by using a softmax loss for each symbol.

While keeping this general architecture, we introduce some changes that include both simplifications and enhancements. So we will give more details immediately regarding our new architecture while mentioning the points where it deviates from the original.

### 3.1Simplifications

The original architecture by Kaiser and Sutskever uses a 3-dimensional state with its third dimension being of fixed size equal to 4. This extra dimension is not essential for the network’s performance. We use a simpler 2-dimensional state of shape [,] where is the length of input and is the number of maps. To match the 2-dimensional state, our convolution kernel banks are of shape [3,,]. We fix the filter length to 3. We confirmed experimentally that this is the optimal setting for all considered tasks.

We use applications of the convolutional unit, all with the same set of parameters. Original implementation uses applications with two sets of parameters. Hence our network is less deep and contains fewer parameters to be learned.

We do not use parameter sharing relaxation (6 parameter sets that were slowly pulled together during training). With enhancements described below, our model does not need this feature and the learning schedule becomes simpler.

### 3.2Diagonal gates

The gating mechanism incorporated in the CGRU facilitates data copying to the same cell in the next time-step. This is essential for bringing together features separated in time during training. However, for most tasks, it is also required to bring together features from both ends of the input. Therefore, we introduce gates that copy data to a neighboring cell in the next time-step. We call these diagonal gates. We split all maps of a state into 3 parts . The first part has a gate from the same cell in the previous time-step as in a CGRU, the second part uses gate from the left neighbor cell, and the third part uses gate from the right neighbor cell.

To implement the diagonal gates, we need to shift the parts and to the right and left respectively and then apply a CGRU to the result. Shifting can be conveniently expressed as a convolution. Right shift corresponds to convolution with filter [1,0,0], left shift to convolution with [0,0,1] and no shift to convolution with [0,1,0]. We define the Diagonal Convolutional Gated Recurrent Unit (DCGRU), which we will use instead of CGRU, as follows:

Definitions of and are the same as for CGRU. The division of maps into 3 parts is only conceptual; an implementation uses a depthwise convolution operating directly on convolving each map with the required convolution filter independently.

To see that diagonal gates have impact, we can inspect execution trace of a trained model on some input. An execution trace is a collection of state values arising in the computation over all time steps. It is visualized as an image for each map where the input is given at the top, and the result is read from the bottom row of the image. In Figure 1 we can see three maps of an execution trace performing binary multiplication on two 50 digit random numbers where each image is taken from each of the parts with a different gate direction. We can notice computing patterns that are aligned with the gate direction. The full set of maps is given in Appendix 1. Figure 2 shows maps from execution trace of a sorting task where 100 numbers in range 0 to 5 are sorted. These images look different than multiplication images, but patterns aligned with the gate direction are evident as well(see also Appendix 2 for the full set of maps).

Using gates that operate in different directions is not a novel idea. A similar mechanism is used in Grid LSTM [15] where different units perform gating along different dimensions of the grid. But introduction such gates in a convolutional architecture is new.

### 3.3Hard non-linearities

[14] have found that introducing gate cutoff improves performance. We go further and use hard_tanh and hard_sigmoid functions for all nonlinearities in the DCGRU. They are piecewise linear approximations of tanh and , namely

Hard nonlinearities train faster since their gradient does not approach zero. A drawback is the dying neuron problem that was observed for ReLU units where some neurons saturate for essentially all inputs. In the case of hard_tanh and hard_ units, this problem is even more pronounced since the range of the unit is bounded on both sides. Also, we use a relatively high learning rate that amplifies the problem. Therefore, we add an extra cost to the loss function to keep the units out of deadly saturation. For one unit we define

with a parameter slightly less than 1 to keep the unit in its linear range. A value works well in our case. We calculate the saturation cost for each application of hard_tanh() and hard_, sum all of them together and add to the loss function with an appropriately small weight. We choose the weight such that the total saturation cost is 100 times smaller than the error loss.

Application of saturation cost increases the training time slightly, but training becomes more robust. The experimental analysis presented later in this paper shows that hard nonlinearities are essential to obtain good generalization and saturation cost allows to use larger learning rate safely. Note that we cannot use techniques like leaky ReLU [19], parametric ReLU [10] or exponential linear unit[6] since they contain negative parts and the definition of GRU relies on values contained in their defined ranges. In particular, gate values have to be in the range [0,1] for gates to function properly.

This technique of hard nonlinearities together with saturation cost can be applied to ordinary LSTM and GRU networks potentially yielding to improved performance. Also, it can be applied to ReLU networks, although it is not yet analyzed if it gives an advantage to the mentioned alternatives.

### 3.4Training

We train the model on inputs of all lengths simultaneously. As in the original architecture, we instantiate several bins of different lengths and place each training example into the smallest bin it fits and pad the remaining length. However, instead of training each bin separately, we sum their losses together and use one optimizer for the total loss. In this way, we avoid scheduling of bins and obtain faster convergence since, typically, several bins contribute to progress at each training step.

We use a larger learning rate, i.e., for a network with 96 maps; that is about 5 times larger than typically used. Training is not only faster but also more robust since the process can jump out of poor local minima. To keep learning converging, we use AdaMax optimizer [16]. It has a strong guarantee that each parameter value will change by no more than at each step. With more maps, we have to use a proportionally smaller learning rate since the sum over all maps contributes to each particular value in the next time step, and with more maps, the total contribution gets larger. We decrease the learning rate if no progress is made for 600 steps.

We integrate gradient clipping into AdaMax optimizer. We clip each variable separately to the range proportional to its decayed maximum that is used internally by the optimizer. In this way, we do not need to set some predetermined clipping threshold. Gradient clipping is not strictly required for convergence since AdaMax optimizer is able to limit the step size even with high peaks in gradient. However, such high gradient sets a large decayed maximum and slows down further training. Clipping limits the growth of the decayed maximum and speeds up training.

We use gradient noise of magnitude proportional to the learning rate. It was shown to give positive impact [20].

### 3.5Dropout

Dropout improves training and generalization of the Neural GPU. To limit memory loss [14] had to use dropout probability inversely proportional to the sequence length.

We apply dropout only to the update vector of the CGRU as proposed in [23]. That helps to avoid memory loss over many time steps. We chose a small constant dropout probability around 0.1. Practical experiments given in the next section confirms this choice as the best one.

## 4Evaluation

We have implemented the proposed architecture in tensorflow. The code is available on GitHub^{1}

We choose the binary multiplication task as the basis for the evaluation. It is the most complex task which can be learned by NGPU; more complex (of the studied tasks) being only decimal multiplication. Sorting, addition and other considered tasks can be learned more easily.

We use NGPU implementation provided by the authors with settings proposed in the paper [14]. We set the number of maps , dropout probability = 0.09. Other parameters we leave to default values used in the provided code.

For DNGPU we use the number of maps to match the data amount carried in one state of NGPU (which use 24 maps in 4 rows). We use learning rate , and dropout probability = 0.1.

We use essentially the same training set as in the [14] consisting of 10000 examples of every length up to 41 (two 20 bit numbers are multiplied). We trained 5 models with a random initialization and measured their accuracy on a test set containing random inputs of length 401 (two 200 bit numbers are multiplied). In this way, we can show both training speed and generalization in one graph. We used a computer with Intel Xeon E312 2.4 GHz processor, 64GB RAM and a Tesla K40 GPU card for testing.

### 4.1Performance and generalization

To compare DNGPU with NGPU, we plot the accuracy of both models on the test for each step of training, see Figure 3. The solid lines show the average of all runs and the shaded area shows the scatter among different runs. Accuracy is defined as the percentage of correctly predicted output bits over all examples. We can see that the DNGPU converges much faster and achieves near 100% accuracy in all runs. It requires only about 800 steps to reach 99% accuracy.

For a fair comparison, we have to take time per step into account. It is larger for our implementation since we evaluate gradient on all bins instead of one. Figure 4 shows the same data plot depending on the training time. DNGPU still has a significant advantage. Both implementations were run on the same hardware, and we excluded the time spent for accuracy evaluation.

To explore generalization beyond length 401, see Figure 5 which shows the accuracy of both architectures depending on input length. We see that DNGPU generalizes much better. All trained DNGPU models exceeded 90% accuracy, and two out of five exceeded 99% accuracy on length 4001.

Although the accuracy achieved by DNGPU would be perfectly acceptable in other domain, demands for algorithm learning are higher. To speak that we have truly inferred an algorithm, it should generalize to arbitrary input length without a single error. This is not yet achieved. Figure 6 shows the number of incorrectly predicted outputs (maximum value 1024) of the same trained models where we consider the output to be correct only if all its bits are correct. We can see that many of the longer examples contain errors (although the errors are few, the accuracy to be high). Moreover, most models have a few errors even in short examples, sometimes even on examples present in the training set. We do not know the exact causes for this, but some thoughts are given in [21]. Good news is that one model out of 5 gave only 19 outputs containing errors on length 4001, so by training more models it could be possible to find one which generalizes perfectly.

To summarize, our architecture outperforms the original by a wide margin both in terms of training speed and of generalization. Our implementation consistently reaches 99% accuracy on the test set of length 401 inputs in less than 15 minutes. The original NGPU trains slower and achieves 90% accuracy only on some runs. Our findings about NGPU are consistent with a much more massive evaluation in [20] Table 6 which shows that only a small fraction of its trained instances generalize well to length 401. Note that generalization of both models can be improved by increasing dropout probability together with the number of maps.

### 4.2Improvement impact analysis

We tried to understand how much each of the proposed enhancements contributes to the improved performance. Figure 7 shows how the model performs when one of the proposed features is turned off. A model trained without hard nonlinearities leads to especially poor performance. A closer look reveals that these models managed to fit the training set easily but generalized poorly to length 401. So, the key factor for achieving generalization is using hard nonlinearities. The same figure shows that hard nonlinearities without saturation cost also perform poorly. Training is unstable and does not converge to 100% accuracy. Continuing training beyond 6000 steps slowly degrades the performance of the model. We can also see that without the diagonal gates the training becomes slower and more unstable.

In Figure 8 we analyze different dropout options. We can see that dropout suggested by [23] performs better than the one employed in NGPU. Training without dropout is the worst option.

### 4.3Decimal multiplication

Our model can learn base-4 multiplication with consistently good generalization if we increase the number of maps to 192. However, like our predecessors, we did not succeed on the decimal multiplication task in its originally proposed form. But our architecture can learn decimal multiplication if we encode each decimal digit in binary. We use 4 bits per digit and mark the start of each digit with a different encoding of its first bit. Such encoding produces 4 times longer inputs and outputs. We implemented this encoding in input/output data generation part, but equivalently it can be implemented inside the Neural GPU itself by appropriate adjustment of its input and output layers.

For evaluation, we increased the number of maps to 192 and performed training on examples of length 41 (multiplication of two 5 digit decimal numbers) and tested on examples of length 401 (multiplication of two 50 digit decimal numbers) as before. Figure 9 shows the results.

We were surprised to see that it generalized so well despite training only on very short examples containing two 5 digit numbers. Additionally, two models out of 5 generalized to length 401 with less than 1% error. Of course, for better generalization we have to train on longer inputs.

The binary encoding allows easier training. However, it comes with a significant overhead, i.e., a 4x increase in the input length which leads to a 16x increase in the unrolled model and a proportional increase training time and memory requirements.

## 5Conclusions

We have presented several improvements to the Neural GPU architecture that substantially decrease training time and improve generalization. The main improvements are hard nonlinearities with saturation cost and a diagonal gating mechanism. We have shown that the hard nonlinearities with saturation cost contribute the most to obtaining better generalization. They may find further applications also in ordinary reccurent networks such as LSTM and GRU.

A larger learning rate together with AdaMax optimizer also helps the training performance, but the introduced saturation cost is essential to keep the learning convergent.

The improved architecture can easily learn a variety of tasks including the binary multiplication on which other architectures struggle. If we increase the number of maps to 192, we can also learn base-4 multiplication with consistently good generalization. Furthermore, if we encode the decimal input/output digits in binary, the architecture can also learn decimal multiplication end-to-end.

The improved architecture is considerably simpler than the original NeuralGPU, enabling an easier extension to handle harder problems. One such possible extension could be scaling the model to solve tasks requiring more than slots of memory or more than time steps. Simply enlarging the size of the model did not work well. So we leave the question of proper scaling of the model for future work.

The correct generalization of the learned models to arbitrary large inputs is still an open problem, and it is not even clear why some models generalize, and others do not. With the proposed simpler model and faster training, it will be possible to address this question more effectively.

## Acknowledgements

We would like to thank the IMCS UL Scientific Cloud for the computing power and Leo Trukšāns for the technical support.

## Appendix 1

All 96 maps of an execution trace performing binary multiplication on two 50 digit random numbers. We can notice computing patterns that are aligned with the gate direction. Every 4 image rows correspond to maps with a different gate direction.

## Appendix 2

All 48 maps of an execution trace performing sorting where 100 numbers in range 0 to 5 are sorted. We can notice computing patterns that are aligned with the gate direction. Every 16 images correspond to maps with a different gate direction.

### Footnotes

### References

**Deep speech 2: End-to-end speech recognition in english and mandarin.**

Amodei, Dario, Anubhai, Rishita, Battenberg, Eric, Case, Carl, Casper, Jared, Catanzaro, Bryan, Chen, Jingdong, Chrzanowski, Mike, Coates, Adam, Diamos, Greg, et al. arXiv preprint arXiv:1512.02595**Learning efficient algorithms with hierarchical attentive memory.**

Andrychowicz, Marcin and Kurach, Karol. arXiv preprint arXiv:1602.03218**Inductive inference: Theory and methods.**

Angluin, Dana and Smith, Carl H. ACM Computing Surveys (CSUR)**Neural machine translation by jointly learning to align and translate.**

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. arXiv preprint arXiv:1409.0473**Learning phrase representations using rnn encoder-decoder for statistical machine translation.**

Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. arXiv preprint arXiv:1406.1078**Fast and accurate deep network learning by exponential linear units (elus).**

Clevert, Djork-Arné, Unterthiner, Thomas, and Hochreiter, Sepp. arXiv preprint arXiv:1511.07289**Language identification in the limit.**

Gold, E Mark. Information and control**Neural turing machines.**

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. arXiv preprint arXiv:1410.5401**Learning to transduce with unbounded memory.**

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. In*Advances in Neural Information Processing Systems*, pp. 1828–1836, 2015.**Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.**

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. In*Proceedings of the IEEE international conference on computer vision*, pp. 1026–1034, 2015.**Long short-term memory.**

Hochreiter, Sepp and Schmidhuber, Jürgen. Neural computation**Inferring algorithmic patterns with stack-augmented recurrent nets.**

Joulin, Armand and Mikolov, Tomas. In*Advances in neural information processing systems*, pp. 190–198, 2015.**Can active memory replace attention?**

Kaiser, Łukasz and Bengio, Samy. In*Advances in Neural Information Processing Systems*, pp. 3774–3782, 2016.**Neural gpus learn algorithms.**

Kaiser, Łukasz and Sutskever, Ilya. arXiv preprint arXiv:1511.08228**Grid long short-term memory.**

Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. arXiv preprint arXiv:1507.01526**Adam: A method for stochastic optimization.**

Kingma, Diederik and Ba, Jimmy. arXiv preprint arXiv:1412.6980**Imagenet classification with deep convolutional neural networks.**

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. In*Advances in neural information processing systems*, pp. 1097–1105, 2012.**Neural random-access machines.**

Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. ERCIM News**Rectifier nonlinearities improve neural network acoustic models.**

Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew Y. In*Proc. ICML*, volume 30, 2013.**Adding gradient noise improves learning for very deep networks.**

Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V, Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and Martens, James. arXiv preprint arXiv:1511.06807**Extensions and limitations of the neural gpu.**

Price, Eric, Zaremba, Wojciech, and Sutskever, Ilya. arXiv preprint arXiv:1611.00736**Neural programmer-interpreters.**

Reed, Scott and De Freitas, Nando. arXiv preprint arXiv:1511.06279**Recurrent dropout without memory loss.**

Semeniuta, Stanislau, Severyn, Aliaksei, and Barth, Erhardt. In*COLING*, 2016.**Pointer networks.**

Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. In*Advances in Neural Information Processing Systems*, pp. 2692–2700, 2015.**Reinforcement learning neural turing machines-revised.**

Zaremba, Wojciech and Sutskever, Ilya. arXiv preprint arXiv:1505.00521**Learning simple algorithms from examples.**

Zaremba, Wojciech, Mikolov, Tomas, Joulin, Armand, and Fergus, Rob. In*Proceedings of the International Conference on Machine Learning*, 2016.