Learning to Execute
Abstract
Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks’ performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy.
1 Introduction
Execution of computer programs requires dealing with a number of nontrivial concepts. To execute a program, a system has to understand numerical operations, if-statements, variable assignments, the compositionality of operations, and many more.
We show that Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) units can accurately evaluate short simple programs in the sequence-to-sequence framework of Sutskever et al. (2014). The LSTM reads the program character-by-character and computes the program’s output. We consider a constrained set of computer programs that can be evaluated in linear time and constant memory, because the LSTM reads the program only once and its memory capacity is limited (Section 3).
We found it difficult to train LSTMs to execute computer programs, so we used curriculum learning to simplify the learning problem. We design a curriculum procedure which outperforms both conventional training that uses no curriculum learning (baseline) and the naive curriculum learning strategy of Bengio et al. (2009) (Section 4). We provide a plausible explanation for the effectiveness of our procedure relative to naive curriculum learning (Section 7).
Finally, in addition to curriculum learning strategies, we examine two simple input transformations that further simplify the sequence-to-sequence learning problem. We show that, in many cases, reversing the input sequence (Sutskever et al., 2014) and replicating the input sequence improve the LSTM’s performance on a memorization task (Section 3.2).
The code for replicating most of the experiments in this work can be found in https://github.com/wojciechz/learning_to_execute.
2 Related work
There has been related research that used Tree Neural Networks (also known as Recursive Neural Networks) to evaluate symbolic mathematical expressions and logical formulas (Zaremba et al., 2014a; Bowman et al., 2014; Bowman, 2013), which is close in spirit to our work. Computer programs are more complex than mathematical or logical expressions because it is possible to simulate either with an appropriate computer program.
From a methodological perspective, we formulate the program evaluation task as a sequence-to-sequence learning problem with a recurrent neural network (Sutskever et al., 2014) (see also (Mikolov, 2012; Sutskever, 2013; Pascanu et al., 2013)). Other interesting applications of recurrent neural networks include speech recognition (Robinson et al., 1996; Graves et al., 2013), machine translation (Cho et al., 2014; Sutskever et al., 2014), handwriting recognition (Pham et al., 2013; Zaremba et al., 2014b), and many more.
Maddison & Tarlow (2014) trained a language model of program text, and Mou et al. (2014) used a neural network to determine whether two programs are equivalent. Both of these approaches require the parse trees of programs, while the input to our model is a string of characters representing the program.
Predicting program output requires that the model deal with long-term dependencies that arise from variable assignment. For this reason, we chose to use the Long Short-Term Memory model (Hochreiter & Schmidhuber, 1997), although there are many other RNN variants that perform well on tasks with long-term dependencies (Cho et al., 2014; Jaeger et al., 2007; Koutník et al., 2014; Martens, 2010; Bengio et al., 2013).
Initially, we found it difficult to train LSTMs to accurately evaluate programs. The compositional nature of computer programs suggests that the LSTM would learn faster if we first taught it about the individual operators and how to combine them. This approach can be implemented with curriculum learning (Bengio et al., 2009; Kumar et al., 2010; Lee & Grauman, 2011), which prescribes gradually increasing the “difficulty level” of the examples presented to the LSTM. It is partially motivated by the fact that humans and animals learn much faster when they are given hard but manageable tasks. Unfortunately, we found the naive curriculum learning strategy of Bengio et al. (2009) to sometimes be harmful. One of our key contributions is the formulation of a new curriculum learning strategy that substantially improves the speed and the quality of training in every experimental setting that we considered.
3 Program Subclass
We train RNNs on the class of short programs that can be evaluated in linear time and constant memory. This restriction is dictated by the computational structure of the RNN itself, as it can only perform a single pass over the program and its memory is limited. Our programs use the Python syntax and are constructed from a small number of operations and their compositions (nesting). We allow the following operations: addition, subtraction, multiplication, variable assignments, if-statements, and for-loops, but we forbid double loops. Every program ends with a single “print” statement whose output is an integer. Two example programs are shown in Figure 1.
We select our programs from a family of distributions parametrized by their length and nesting. The length parameter is the number of digits in the integers that appear in the programs (so the integers are chosen uniformly from ). The appendix presents the pseudocode 1 of the algorithm used to generate our programs. For example, two programs that are generated with and are shown in Figure 1.
We impose restrictions on the operands of multiplication and on the ranges of for-loops, since they pose a greater difficulty to our model. We constrain one of the arguments of multiplication and the range of for-loops to be chosen uniformly from the much smaller range . We do so since our models are able to perform linear-time computation while generic integer multiplication requires superlinear time. Similar considerations apply to for-loops, since nested for-loops can implement integer multiplication.
The nesting parameter is the number of times we are allowed to combine the operations with each other. Higher values of nesting yield programs with deeper parse trees. Nesting makes the task much harder for the LSTMs, because they do not have a natural way of dealing with compositionality, unlike Tree Neural Networks. It is surprising that the LSTMs can handle nested expressions at all. The programs also do not receive an external input.
It is important to emphasize that the LSTM reads the entire input one character at a time and produces the output one character at a time. The characters are initially meaningless from the model’s perspective; for instance, the model does not know that “+” means addition or that is followed by . In fact, scrambling the input characters (e.g., replacing “a” with “q”, “b” with “w”, etc.) has no effect on the model’s ability to solve this problem. We demonstrate the difficulty of the task by presenting an input-output example with scrambled characters in Figure 2.
Finally, we wanted to verify that our programs are not trivial to evaluate, by ensuring that the bias coming from Benford’s law (Hill, 1995) is not too strong. Our setup has possible output characters, that is digits, the end of sequence character, and minus. Their output distribution is not uniform, which can be seen by noticing that the minus sign and the dot do not occur with the same frequency as the other digits. If we assume that the output characters are independent, the probability of guessing the correct character is . The most common character is which occurs with probability over the entire output.
However, there is a bias in the distribution of the first character. There are possible choices, which can be randomly guessed with a probability of . The most common character is , and it occurs with a probability in its first position, indicating a strong bias. Still, this value is far below our model prediction accuracy. Moreover, the most probable second character in the first position of the output occurs with probability , which is indistinguishable from the probability distribution of digits in the other positions. The last character is always the end of sequence. The most common digit prior to the last character is , and it occurs with probability . These statistics are computed with randomly generated programs with and . The absence of a strong bias for this configuration suggests that there will be even less bias in programs with greater nesting and longer digits, which we have also confirmed numerically.
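The scrambling argument can be illustrated with a fixed character permutation applied to every program. The sketch below is only an illustration: the alphabet, the seed, and the helper name `make_scrambler` are assumptions, not part of the original setup.

```python
import random
import string

# Hypothetical alphabet; the characters actually used by the programs may differ.
ALPHABET = string.ascii_lowercase + string.digits + "+-*/=()<>:"


def make_scrambler(alphabet, seed=0):
    """Build a translation table that applies one fixed random permutation."""
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    return str.maketrans(alphabet, "".join(shuffled))


table = make_scrambler(ALPHABET)
program = "print(12+34)"
# The scrambled string carries the same information, since the mapping is a bijection.
print(program.translate(table))
```

Because the mapping is one-to-one, a model that starts with no knowledge of the symbols faces an equally hard problem on the scrambled and on the original text.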
3.1 Addition Task
It is difficult to intuitively assess the accuracy of an LSTM on a program evaluation task. For example, it is not clear whether an accuracy of is impressive. Thus, we also evaluate our models on a more familiar addition task, where the difficulty is measured by the length of the inputs. We consider the addition of only two numbers of the same length (Figure 3) that are chosen uniformly from . Adding two numbers of the same length is simpler than adding variable-length numbers, because the model does not need to align them.
Input:
Target: 823443
3.2 Memorization Task
In addition to program evaluation and addition, we also investigate the task of memorizing a random sequence of numbers. Given an example input , the LSTM reads it one character at a time, stores it in memory, and then outputs one character at a time. We present and explore two simple performance-enhancing techniques: input reversing (Sutskever et al., 2014) and input doubling.
The idea of input reversing is to reverse the order of the input () while keeping the desired output unchanged (). It may appear to be a neutral operation because the average distance between each input and its corresponding target does not change. However, input reversing introduces many short term dependencies that make it easier for the LSTM to learn to make correct predictions. This strategy was first introduced by Sutskever et al. (2014).
The second performance enhancing technique is input doubling, where we present the input sequence twice (so the example input becomes ), while the output remains unchanged (). This method is meaningless from a probabilistic perspective as RNNs approximate the conditional distribution , yet here we attempt to learn . Still, it gives noticeable performance improvements. By processing the input several times before producing the output, the LSTM is given the opportunity to correct any mistakes or omissions it made before.
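The two input transformations can be sketched as follows. This is a minimal illustration, assuming digit strings as inputs; the separator used between the two copies in the doubled input and the helper names are hypothetical. The second function counts the number of timesteps between each input symbol and its target symbol, assuming the output directly follows the input, which shows why reversal leaves the average distance unchanged while creating some very short dependencies.

```python
def make_memorization_example(digits, reverse=False, double=False):
    """Return (input_string, target_string); only the input is transformed."""
    source = digits[::-1] if reverse else digits
    if double:
        # Present the (possibly reversed) input twice; ";" is an assumed separator.
        source = source + ";" + source
    return source, digits


def dependency_distances(n, reverse=False):
    """Timesteps between the i-th input symbol and the i-th output symbol."""
    distances = []
    for i in range(n):
        input_pos = (n - 1 - i) if reverse else i
        output_pos = n + i  # output is assumed to start right after the input
        distances.append(output_pos - input_pos)
    return distances


print(make_memorization_example("123456789", reverse=True, double=True))
print(dependency_distances(4))                # every distance equals n
print(dependency_distances(4, reverse=True))  # same mean, but starts at 1
```

With reversal, the first output symbol sits right next to its input symbol, which gives the optimizer an easy dependency to latch onto before the longer ones.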
4 Curriculum Learning
Our program generation procedure is parametrized by length and nesting. These two parameters allow us to control the complexity of the program. When length and nesting are large enough, the learning problem becomes nearly intractable. This indicates that in order to learn to evaluate programs of a given and , it may help to first learn to evaluate programs with and . We evaluate the following curriculum learning strategies:
No curriculum learning (baseline) The baseline approach does not use curriculum learning. This means that we generate all the training samples with and . This strategy is the most “sound” from statistical perspective, since it is generally recommended to make the training distribution identical to test distribution.
Naive curriculum strategy (naive) We begin with and . Once learning stops making progress on the validation set, we increase length by 1. We repeat this process until its length reaches , in which case we increase nesting by one and reset length to . We can also choose to first increase nesting and then length. However, it does not make a noticeable difference in performance. We skip this option in the rest of the paper, and increase length first in all our experiments. This strategy has been examined in previous work on curriculum learning (Bengio et al., 2009). However, we show that sometimes it gives even worse performance than baseline.
Mixed strategy (mix) To generate a random sample, we first pick a random length from and a random nesting from independently for every sample. The Mixed strategy uses a balanced mixture of easy and difficult examples, so at every point during training, a sizable fraction of the training samples will have the appropriate difficulty for the LSTM.
Combining the mixed strategy with naive curriculum strategy (combined) This strategy combines the mix strategy with the naive strategy. In this approach, every training case is obtained either by the naive strategy or by the mix strategy. As a result, the combined strategy always exposes the network at least to some difficult examples, which is the key way in which it differs from the naive curriculum strategy. We noticed that it always outperformed the naive strategy and would generally (but not always) outperform the mix strategy. We explain why our new curriculum learning strategies outperform the naive curriculum strategy in Section 7.
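The four strategies can be summarized as a rule for choosing the (length, nesting) pair of each training sample. The sketch below is an illustration under stated assumptions: the mix strategy is taken to sample uniformly between 1 and the target value, the naive schedule is taken to reset length to 1, and the combined strategy is taken to choose between naive and mix with equal probability. All function and parameter names are hypothetical.

```python
import random


def sample_difficulty(strategy, target_length, target_nesting,
                      current_length=1, current_nesting=1, rng=random):
    """Return the (length, nesting) used to generate one training sample."""
    if strategy == "baseline":
        # Always train on the target distribution.
        return target_length, target_nesting
    if strategy == "naive":
        # Train only at the current stage of the schedule.
        return current_length, current_nesting
    if strategy == "mix":
        # Assumed: uniform over all difficulties up to the target.
        return rng.randint(1, target_length), rng.randint(1, target_nesting)
    if strategy == "combined":
        # Assumed: each sample comes from naive or mix with equal probability.
        inner = "naive" if rng.random() < 0.5 else "mix"
        return sample_difficulty(inner, target_length, target_nesting,
                                 current_length, current_nesting, rng)
    raise ValueError("unknown strategy: " + strategy)


def advance_naive(current_length, current_nesting, target_length, target_nesting):
    """Next stage of the naive schedule, applied when validation progress stalls."""
    if current_length < target_length:
        return current_length + 1, current_nesting
    if current_nesting < target_nesting:
        return 1, current_nesting + 1  # assumed reset value for length
    return current_length, current_nesting
```

Under these assumptions, the combined strategy differs from the naive one only in that a fraction of every minibatch is drawn from the full range of difficulties.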
5 LSTM
In this section we briefly describe the deep LSTM (Section 5). All vectors are dimensional unless explicitly stated otherwise. Let be a hidden state in layer in timestep . Let be a biased linear mapping ( for some and ). We let be elementwise multiplication and let be the input to the deep LSTM at timestep . We use the activations at the top layer (namely ) to predict where is the depth of our LSTM.
The structure of the LSTM allows it to train on problems with long term dependencies relatively easily. The “long term” memory is stored in a vector of memory cells . Although many LSTM architectures differ slightly in their connectivity structure and activation functions, all LSTM architectures have additive memory cells that make it easy to learn to store information for long periods of time. We used an LSTM described by the following equations (from Graves et al. (2013)):
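As an illustration of the additive memory cell, the following is a minimal sketch of one timestep of one layer of a standard LSTM, with input, forget, and output gates and a tanh candidate. The weight layout, the gate ordering, and the function names are assumptions made for this sketch and need not match the authors' exact formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Biased linear mapping: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, weights, bias):
    """One timestep of one LSTM layer.

    `weights` has 4n rows of length len(x) + n, ordered (assumed layout) as
    input gate, forget gate, output gate, candidate; `bias` has 4n entries.
    """
    n = len(h_prev)
    pre = affine(weights, bias, list(x) + list(h_prev))
    i = [sigmoid(v) for v in pre[0:n]]
    f = [sigmoid(v) for v in pre[n:2 * n]]
    o = [sigmoid(v) for v in pre[2 * n:3 * n]]
    g = [math.tanh(v) for v in pre[3 * n:4 * n]]
    # Additive cell update: old memory is gated and new content is added to it.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```

In a deep LSTM, the hidden state returned by one layer would serve as the input `x` of the layer above at the same timestep.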
6 Experiments
In this section, we report the results of our curriculum learning strategies on the program evaluation and memorization tasks. In both experiments, we used the same LSTM architecture.
Our LSTM has two layers and is unrolled for steps in both experiments. It has cells per layer and its parameters are initialized uniformly in . This gives total M parameters. We initialize the hidden states to zero. We then use the final hidden states of the current minibatch as the initial hidden state of the subsequent minibatch. Thus it is possible that a program and its output could be separated across different minibatches. The size of minibatch is . We constrain the norm of the gradients (normalized by minibatch size) to be no greater than (Mikolov et al., 2010). We keep the learning rate equal to until we reach the target length and nesting (we only vary the length, i.e., the number of digits, in the memorization task).
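The gradient norm constraint can be sketched as follows. This is a minimal illustration on a flat list of gradient values; `max_norm` and `batch_size` are placeholders for the values used in the experiments, and the function name is hypothetical.

```python
import math


def clip_gradient_norm(gradients, batch_size, max_norm):
    """Average the gradients over the minibatch, then rescale if the norm is too large."""
    averaged = [g / batch_size for g in gradients]
    norm = math.sqrt(sum(g * g for g in averaged))
    if norm > max_norm:
        # Rescaling keeps the direction of the update and bounds its size.
        scale = max_norm / norm
        averaged = [g * scale for g in averaged]
    return averaged
```

Gradients whose normalized norm is already within the bound pass through unchanged, so the constraint only affects the occasional very large update.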
After reaching the target accuracy () we decrease the learning rate by . We then keep the learning rate at the same level until there is no improvement on the training set, at which point we decrease it again. The only difference between experiments is the termination criterion. For the program output prediction, we stop when the learning rate becomes smaller than . For the copying task, we stop training after epochs, where each epoch has M samples.
We begin training with and (or length=1 for the memorization task). We ensure that the training, validation, and test sets are disjoint. This is achieved by computing the hash value of each sample and taking it modulo 3.
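A hash-based split of this kind can be sketched as follows. The choice of MD5 as the hash function and the assignment of residues to splits are assumptions for illustration; any deterministic hash would give the same guarantee.

```python
import hashlib

# Assumed mapping from residue (0, 1, 2) to split name.
SPLITS = ("train", "valid", "test")


def assign_split(sample):
    """Deterministically assign a program string to exactly one split."""
    digest = hashlib.md5(sample.encode("utf-8")).hexdigest()
    return SPLITS[int(digest, 16) % 3]


for program in ("print(1+2)", "print(3*4)", "print(5-6)"):
    print(program, assign_split(program))
```

Because the split depends only on the text of the sample, the same program can never appear in two different sets, even when samples are generated on the fly.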
Important note on error rates: We use teacher forcing when we compute the accuracy of our LSTMs. That is, when predicting the th digit of the target, the LSTM is provided with the correct first digits of the target. This is different from using the LSTM to generate the entire output on its own, as done by Sutskever et al. (2014), which would almost surely result in lower numerical accuracies. To help make intuitive sense of our results, we present a large number of test cases and the outputs computed by the LSTM, albeit with teacher forcing.
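The difference between the two evaluation modes can be sketched as follows. Here `predict_next` stands for any function that maps the source and an output prefix to the next character; it and the toy model are hypothetical stand-ins for a trained LSTM.

```python
def teacher_forced_accuracy(predict_next, source, target):
    """Per-character accuracy when the model always sees the true prefix."""
    correct = 0
    for t in range(len(target)):
        # The prefix is taken from the target, even if earlier predictions were wrong.
        if predict_next(source, target[:t]) == target[t]:
            correct += 1
    return correct / len(target)


def free_running_accuracy(predict_next, source, target):
    """Per-character accuracy when the model consumes its own predictions."""
    generated = ""
    for _ in range(len(target)):
        generated += predict_next(source, generated)
    return sum(a == b for a, b in zip(generated, target)) / len(target)


def toy_model(source, prefix):
    """Copies the source, but errs on the first character and after any wrong prefix."""
    if not prefix:
        return "9"
    if prefix == source[:len(prefix)]:
        return source[len(prefix)]
    return "0"


print(teacher_forced_accuracy(toy_model, "1234", "1234"))
print(free_running_accuracy(toy_model, "1234", "1234"))
```

In this toy case a single early error costs one character under teacher forcing but corrupts the entire output in free-running mode, which is why the two accuracies are not directly comparable.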
6.1 Results on Program Evaluation
We train our LSTMs using the four strategies described in Section 4:
No curriculum learning (baseline),
Naive curriculum strategy (naive),
Mixed strategy (mix), and
Combined strategy (combined).
Figure 4 shows the absolute performance of the baseline strategy (training on the original target distribution), and of the best performing strategy, combined. Moreover, Figure 5 shows the performance of the three curriculum strategies relative to baseline. Finally, we provide several example predictions on test data in the supplementary materials. The accuracy of a random predictor would be , since there are possible output symbols.
6.2 Results on the Addition Task
Figure 6 presents the accuracy achieved by the LSTM with the various curriculum strategies on the addition task. Remarkably, the combined curriculum strategy resulted in 99% accuracy on the addition of 9-digit numbers, which is a massive improvement over the naive curriculum.
6.3 Results on the Memorization Task


Recall that the goal of the memorization task is to read a sequence of digits into the hidden state and then to reconstruct it from the hidden state. Namely, given an input such as , the goal is to produce the output . The model processes the input one character at a time and has to reconstruct the output only after loading the entire input into its memory. This task provides insight into the LSTM’s ability to learn to remember. We have evaluated our model on sequences of lengths ranging from to . We use the four curriculum strategies of Section 4. In addition, we investigate two strategies to modify the input which increase performance:

Inverting input (Sutskever et al., 2014)

Doubling Input
Both strategies are described in Section 3.2. Figure 7 shows the absolute performance of the baseline strategy and of the combined strategy at convergence. The supplementary material additionally presents results after epochs (Figure 8).
For this task, the combined strategy no longer outperforms the mixed strategy in every experimental setting, although both strategies are always better than using no curriculum and the naive curriculum strategy. Each graph contains settings, which correspond to the possible combinations of input inversion and input doubling. The results clearly show that simultaneously doubling and reversing the input achieves the best results. Random guessing would achieve an accuracy of , since there are possible output symbols.
7 Hidden State Allocation Hypothesis
Our experimental results suggest that a proper curriculum learning strategy is critical for achieving good performance on very hard problems where conventional stochastic gradient descent (SGD) performs poorly. The results on both of our problems (Sections 6.3 and 6.1) show that the combined strategy is better than all other curriculum strategies, including both naive curriculum learning, and training on the target distribution. We have a plausible explanation for why this is the case.
It seems natural to train models with examples of increasing difficulty. This way the models have a chance to learn the correct intermediate concepts, and then utilize them for the more difficult problem instances. Otherwise, learning the full task might be just too difficult for SGD from a random initialization. This explanation has been proposed in previous work on curriculum learning (Bengio et al., 2009). However, based on the empirical results, the naive strategy of curriculum learning can sometimes be worse than learning with the target distribution.
In our tasks, the neural network has to perform a lot of memorization. The easier examples usually require less memorization than the hard examples. For instance, in order to add two digit numbers, one has to remember at least digits before producing any output. The best way to accurately memorize numbers could be to spread them over the entire hidden state / memory cell (i.e., use a distributed representation). Indeed, the network has no incentive to utilize only a fraction of its state, and it is always better to make use of its entire memory capacity. This implies that the harder examples would require a restructuring of its memory patterns. It would need to contract its representations of digit numbers in order to free space for the th number. This process of memory pattern restructuring might be difficult to implement, so it could be the reason for the sometimes poor performance of the naive curriculum learning strategy relative to baseline.
The combined strategy reduces the need to restructure the memory patterns. The combined strategy is a combination of the naive curriculum strategy and of the mix strategy, which is a mixture of examples of all difficulties. The examples produced by the naive curriculum strategy help to learn the intermediate inputoutput mapping, which is useful for solving the target task, while the extra samples from the mix strategy prevent the network from utilizing all the memory on the easy examples, thus eliminating the need to restructure its memory patterns.
8 Critique
Perfect prediction of program output requires a complete understanding of all operands and concepts, and of the precise way in which they are combined. However, imperfect prediction might be achieved in a multitude of ways, and could heavily rely on memorization, without a genuine understanding of the underlying concepts. For instance, perfect addition is relatively intricate, as the LSTM needs to know the order of numbers and to correctly compute the carry.
There are many alternatives to the addition algorithm if perfect output is not required. For instance, one can perform elementwise addition, and as long as there is no carry then the output would be perfectly correct. Another alternative, which requires more memory but is also simpler, is to memorize all results of addition for digit numbers. Then multi-digit addition can be broken down into multiple such additions performed elementwise. Once again, such an algorithm would have a reasonably high prediction accuracy, although it would be far from correct.
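The elementwise shortcut can be made concrete with a small sketch. It adds two equal-length digit strings position by position and drops every carry, then measures per-digit agreement with the true sum. The zero-padding convention used to align the two results and the function names are assumptions of this illustration.

```python
def true_sum(a, b):
    """Correct sum, zero-padded to len(a) + 1 characters."""
    return str(int(a) + int(b)).zfill(len(a) + 1)


def carry_free_sum(a, b):
    """Digit-wise sum modulo 10 of two equal-length digit strings; carries are dropped."""
    digits = "".join(str((int(x) + int(y)) % 10) for x, y in zip(a, b))
    return "0" + digits  # leading digit is never produced without carries


def per_digit_accuracy(a, b):
    expected, guessed = true_sum(a, b), carry_free_sum(a, b)
    return sum(e == g for e, g in zip(expected, guessed)) / len(expected)


print(per_digit_accuracy("1234", "4321"))  # no carries occur in this pair
print(per_digit_accuracy("1299", "0001"))  # carries break some of the digits
```

The shortcut is exact whenever no position produces a carry and degrades gradually otherwise, which is the sense in which an incorrect algorithm can still score well on a per-character metric.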
We do not know how heavily our model relies on memorization and how far the learned algorithm is from the actual, correct algorithm. This could be tested by creating a big discrepancy between the training and test data, but in this work, the training and the test distributions are the same. We plan to examine how well our models would generalize on very different new examples in future work.
9 Discussion
We have shown that it is possible to learn to evaluate programs with limited prior knowledge. This work demonstrates the power and expressiveness of sequence-to-sequence LSTMs. We also showed that correct curriculum learning is crucial for achieving good results on very difficult tasks that cannot be optimized with standard SGD. We also found that the general method of doubling the input reliably improves the performance of sequence-to-sequence LSTMs.
Our results are encouraging but they leave many questions open. For example, we are not able to evaluate arbitrary programs (e.g., ones that run in more than time). This cannot be achieved with conventional RNNs or LSTMs due to their runtime restrictions. We also do not know the optimal curriculum learning strategy. To understand it, it may be necessary to identify the training samples that are most beneficial to the model.
10 Acknowledgments
We wish to thank Oriol Vinyals for useful discussions, and Koray Kavukcuoglu for help during code development. Moreover, we wish to acknowledge Marc’Aurelio Ranzato for useful comments on the first version of the paper. Some chunks of our code originate from the Google DeepMind repository. We thank the unknown developers of the LSTM function and auxiliary functions.
References
 Bengio et al. (2009) Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.
 Bengio et al. (2013) Bengio, Yoshua, Boulanger-Lewandowski, Nicolas, and Pascanu, Razvan. Advances in optimizing recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8624–8628. IEEE, 2013.
 Bowman (2013) Bowman, Samuel R. Can recursive neural tensor networks learn logical reasoning? arXiv preprint arXiv:1312.6192, 2013.
 Bowman et al. (2014) Bowman, Samuel R, Potts, Christopher, and Manning, Christopher D. Recursive neural networks for learning logical semantics. arXiv preprint arXiv:1406.1827, 2014.
 Cho et al. (2014) Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Graves et al. (2013) Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE, 2013.
 Hill (1995) Hill, Theodore P. A statistical derivation of the significant-digit law. Statistical Science, pp. 354–363, 1995.
 Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Jaeger et al. (2007) Jaeger, Herbert, Lukoševičius, Mantas, Popovici, Dan, and Siewert, Udo. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3):335–352, 2007.
 Koutník et al. (2014) Koutník, Jan, Greff, Klaus, Gomez, Faustino, and Schmidhuber, Jürgen. A clockwork rnn. arXiv preprint arXiv:1402.3511, 2014.
 Kumar et al. (2010) Kumar, M Pawan, Packer, Benjamin, and Koller, Daphne. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197, 2010.
 Lee & Grauman (2011) Lee, Yong Jae and Grauman, Kristen. Learning the easy things first: Self-paced visual category discovery. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1721–1728. IEEE, 2011.
 Maddison & Tarlow (2014) Maddison, Chris J and Tarlow, Daniel. Structured generative models of natural source code. arXiv preprint arXiv:1401.0514, 2014.
 Martens (2010) Martens, James. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 735–742, 2010.
 Mikolov (2012) Mikolov, Tomáš. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
 Mikolov et al. (2010) Mikolov, Tomas, Karafiát, Martin, Burget, Lukas, Cernockỳ, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010.
 Mou et al. (2014) Mou, Lili, Li, Ge, Liu, Yuxuan, Peng, Hao, Jin, Zhi, Xu, Yan, and Zhang, Lu. Building program vector representations for deep learning. arXiv preprint arXiv:1409.3358, 2014.
 Pascanu et al. (2013) Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Bengio, Yoshua. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.
 Pham et al. (2013) Pham, Vu, Kermorvant, Christopher, and Louradour, Jérôme. Dropout improves recurrent neural networks for handwriting recognition. arXiv preprint arXiv:1312.4569, 2013.
 Robinson et al. (1996) Robinson, Tony, Hochberg, Mike, and Renals, Steve. The use of recurrent neural networks in continuous speech recognition. In Automatic speech and speaker recognition, pp. 233–258. Springer, 1996.
 Sutskever (2013) Sutskever, Ilya. Training Recurrent Neural Networks. PhD thesis, University of Toronto, 2013.
 Sutskever et al. (2014) Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, 2014.
 Zaremba et al. (2014a) Zaremba, Wojciech, Kurach, Karol, and Fergus, Rob. Learning to discover efficient mathematical identities. arXiv preprint arXiv:1406.1584, 2014a.
 Zaremba et al. (2014b) Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014b.
Supplementary material
Appendix A Additional Results on the Memorization Problem




We present the algorithm for generating the training cases, and present an extensive qualitative evaluation of the samples and the kinds of predictions made by the trained LSTMs.
We emphasize that these predictions rely on teacher forcing. That is, even if the LSTM made an incorrect prediction in the th output digit, the LSTM will be provided as input the correct th output digit for predicting the th digit. While teacher forcing has no effect whenever the LSTM makes no errors at all, a sample that makes an early error and gets the remainder of the digits correct needs to be interpreted with care.
Appendix B Qualitative evaluation of the curriculum strategies
B.1 Examples of program evaluation prediction. Length = 4, Nesting = 1
Input:
Target:  6652. 
”Baseline” prediction:  6 6 5 2 . 
”Naive” prediction:  6 6 5 2 . 
”Mix” prediction:  6 6 5 2 . 
”Combined” prediction:  6 6 5 2 . 
Input:
Target:  5259. 
”Baseline” prediction:  5 1 0 1 . 
”Naive” prediction:  5 1 0 1 . 
”Mix” prediction:  5 2 4 9 . 
”Combined” prediction:  5 2 2 9 . 
Input:
Target:  49136. 
”Baseline” prediction:  4 9 3 3 6 . 
”Naive” prediction:  4 8 6 7 6 . 
”Mix” prediction:  5 7 0 2 6 . 
”Combined” prediction:  4 9 6 2 6 . 
Input:
Target:  2327. 
”Baseline” prediction:   2 3 2 0 . 
”Naive” prediction:   2 2 0 1 . 
”Mix” prediction:   2 3 7 7 . 
”Combined” prediction:   2 3 1 7 . 
Input:
Target:  10344. 
”Baseline” prediction:  1 0 3 4 4 . 
”Naive” prediction:  1 0 3 2 4 . 
”Mix” prediction:  1 0 3 4 4 . 
”Combined” prediction:  1 0 3 4 4 . 
Input:
Target:  5176. 
”Baseline” prediction:  5 1 9 6 . 
”Naive” prediction:  5 1 0 4 . 
”Mix” prediction:  4 2 4 6 . 
”Combined” prediction:  5 1 9 6 . 
Input:
Target:  4849. 
”Baseline” prediction:  4 8 4 9 . 
”Naive” prediction:  4 8 4 9 . 
”Mix” prediction:  4 8 4 9 . 
”Combined” prediction:  4 8 4 9 . 
Input:
Target:  28216. 
”Baseline” prediction:  2 8 2 1 6 . 
”Naive” prediction:  2 8 1 1 6 . 
”Mix” prediction:  2 8 2 1 6 . 
”Combined” prediction:  2 8 2 1 6 . 
Input:
Target:  622. 
”Baseline” prediction:   6 8 8 . 
”Naive” prediction:   6 2 8 . 
”Mix” prediction:   6 9 2 . 
”Combined” prediction:   6 3 2 . 
Input:
Target:  48369. 
”Baseline” prediction:  4 8 0 1 7 . 
”Naive” prediction:  4 8 0 1 1 . 
”Mix” prediction:  4 8 1 0 1 . 
”Combined” prediction:  4 8 0 0 9 . 
B.2 Examples of program evaluation prediction. Length = 4, Nesting = 2
Input:
Target:  95007. 
”Baseline” prediction:  9 4 0 9 3 . 
”Naive” prediction:  9 0 0 1 3 . 
”Mix” prediction:  9 5 0 1 5 . 
”Combined” prediction:  9 4 1 0 3 . 
Input:
Target:  14478. 
”Baseline” prediction:  1 4 4 9 8 . 
”Naive” prediction:  1 4 4 4 4 . 
”Mix” prediction:  1 4 4 8 2 . 
”Combined” prediction:  1 4 4 7 8 . 
Input:
Target:  35224. 
”Baseline” prediction:  3 4 0 4 4 . 
”Naive” prediction:  3 2 1 8 0 . 
”Mix” prediction:  3 3 2 8 4 . 
”Combined” prediction:  3 3 0 0 4 . 
Input:
Target:  63179. 
”Baseline” prediction:   6 2 0 4 9 . 
”Naive” prediction:   6 3 1 1 7 . 
”Mix” prediction:   6 2 0 1 3 . 
”Combined” prediction:   6 2 0 0 9 . 
Input:
Target:  7159. 
”Baseline” prediction:  7 0 0 9 . 
”Naive” prediction:  7 0 1 9 . 
”Mix” prediction:  7 9 9 5 . 
”Combined” prediction:  7 0 7 9 . 
Input:
Target:  328468. 
”Baseline” prediction:  3 1 8 0 0 4 . 
”Naive” prediction:  3 3 8 0 8 8 . 
”Mix” prediction:  3 2 9 2 2 0 . 
”Combined” prediction:  3 3 8 0 8 0 . 
Input:
Target:  21096. 
”Baseline” prediction:  2 1 2 6 6 . 
”Naive” prediction:  1 0 0 4 6 . 
”Mix” prediction:  1 0 6 0 6 . 
”Combined” prediction:  2 0 4 0 2 . 
Input:
Target:  119613. 
”Baseline” prediction:  1 1 8 3 1 3 . 
”Naive” prediction:  1 1 8 0 1 1 . 
”Mix” prediction:  1 1 7 6 6 9 . 
”Combined” prediction:  1 1 9 5 3 3 . 
Input:
Target:  5129. 
”Baseline” prediction:  4 0 1 3 . 
”Naive” prediction:  5 0 3 5 . 
”Mix” prediction:  4 0 1 5 . 
”Combined” prediction:  4 0 0 9 . 
Input:
Target:  9948. 
”Baseline” prediction:   1 9 9 6 . 
”Naive” prediction:   1 6 1 0 . 
”Mix” prediction:   1 8 8 2 . 
”Combined” prediction:   1 9 8 0 . 
b.3 Examples of program evaluation prediction. Length = 4, Nesting = 3
Input:
Target:  65958. 
”Baseline” prediction:   1 3 2 6 2 . 
”Naive” prediction:   7 3 1 9 4 . 
”Mix” prediction:   4 0 1 8 8 . 
”Combined” prediction:   1 2 0 0 4 . 
Input:
Target:  36217. 
”Baseline” prediction:   3 7 5 1 5 . 
”Naive” prediction:   3 8 6 0 9 . 
”Mix” prediction:   3 5 8 9 3 . 
”Combined” prediction:   3 5 0 5 5 . 
Input:
Target:  3043. 
”Baseline” prediction:  3 0 4 3 . 
”Naive” prediction:  3 0 4 3 . 
”Mix” prediction:  3 0 4 3 . 
”Combined” prediction:  3 0 4 3 . 
Input:
Target:  6391. 
”Baseline” prediction:   5 5 5 . 
”Naive” prediction:  6 3 2 9 . 
”Mix” prediction:  6 4 6 1 . 
”Combined” prediction:  6 1 0 5 . 
Input:
Target:  7192. 
”Baseline” prediction:  7 1 9 2 . 
”Naive” prediction:  7 1 9 2 . 
”Mix” prediction:  7 1 9 2 . 
”Combined” prediction:  7 1 9 2 . 
Input:
Target:  7200. 
”Baseline” prediction:  7 2 0 0 . 
”Naive” prediction:  7 2 0 0 . 
”Mix” prediction:  7 2 0 0 . 
”Combined” prediction:  7 2 0 0 . 
Input:
Target:  47736. 
”Baseline” prediction:   0 6 6 6 . 
”Naive” prediction:  1 1 2 6 2 . 
”Mix” prediction:  4 8 6 6 6 . 
”Combined” prediction:  4 8 7 6 6 . 
Input:
Target:  13203. 
”Baseline” prediction:  1 3 0 1 5 . 
”Naive” prediction:  1 2 0 0 7 . 
”Mix” prediction:  1 3 3 7 9 . 
”Combined” prediction:  1 3 2 0 5 . 
Input:
Target:  7251. 
”Baseline” prediction:  7 1 1 1 . 
”Naive” prediction:  7 0 9 9 . 
”Mix” prediction:  7 5 9 5 . 
”Combined” prediction:  7 6 9 9 . 
Input:
Target:  97899. 
”Baseline” prediction:   9 6 9 9 1 . 
”Naive” prediction:   1 9 9 5 9 . 
”Mix” prediction:   9 5 5 5 1 . 
”Combined” prediction:   9 6 3 9 7 . 
b.4 Examples of program evaluation prediction. Length = 6, Nesting = 1
Input:
Target:  477319. 
”Baseline” prediction:   4 7 2 1 2 2 . 
”Naive” prediction:   4 7 7 5 9 1 . 
”Mix” prediction:   4 7 9 7 0 5 . 
”Combined” prediction:   4 7 5 0 0 9 . 
Input:
Target:  1508. 
”Baseline” prediction:  1 5 0 8 . 
”Naive” prediction:  1 5 0 8 . 
”Mix” prediction:  1 5 0 8 . 
”Combined” prediction:  1 5 0 8 . 
Input:
Target:  1375853. 
”Baseline” prediction:  1 3 7 9 9 2 0 . 
”Naive” prediction:  1 3 7 8 9 9 1 . 
”Mix” prediction:  1 3 7 5 1 1 9 . 
”Combined” prediction:  1 3 7 5 1 7 3 . 
Input:
Target:  151108. 
”Baseline” prediction:  1 5 4 9 7 3 . 
”Naive” prediction:  1 5 1 1 0 8 . 
”Mix” prediction:  1 5 1 1 0 8 . 
”Combined” prediction:  1 5 1 1 0 8 . 
Input:
Target:  1859300. 
”Baseline” prediction:   1 8 4 0 8 3 1 . 
”Naive” prediction:   1 8 4 0 0 0 0 . 
”Mix” prediction:   1 9 7 9 7 2 0 . 
”Combined” prediction:   1 8 2 0 7 0 0 . 
Input:
Target:  881880. 
”Baseline” prediction:  8 8 0 4 7 5 . 
”Naive” prediction:  8 8 1 6 6 6 . 
”Mix” prediction:  8 8 0 1 9 0 . 
”Combined” prediction:  8 8 5 9 2 0 . 
Input:
Target:  853821. 
”Baseline” prediction:  8 5 1 2 3 3 . 
”Naive” prediction:  8 6 7 1 1 3 . 
”Mix” prediction:  8 5 5 6 1 5 . 
”Combined” prediction:  8 5 3 0 0 9 . 
Input:
Target:  3550354. 
”Baseline” prediction:   3 5 7 1 9 9 8 . 
”Naive” prediction:   3 6 9 9 9 9 3 . 
”Mix” prediction:   3 8 9 9 2 2 0 . 
”Combined” prediction:   3 5 0 7 7 9 0 . 
Input:
Target:  6291704. 
”Baseline” prediction:  6 2 7 0 8 0 4 . 
”Naive” prediction:  6 2 7 1 9 0 4 . 
”Mix” prediction:  6 2 9 7 6 4 4 . 
”Combined” prediction:  6 2 7 0 0 0 4 . 
Input:
Target:  71732. 
”Baseline” prediction:   6 1 0 8 6 . 
”Naive” prediction:   7 3 5 8 2 . 
”Mix” prediction:   1 9 0 0 0 . 
”Combined” prediction:   7 2 8 4 2 . 
b.5 Examples of program evaluation prediction. Length = 6, Nesting = 2
Input:
Target:  455975. 
”Baseline” prediction:  5 5 9 9 1 7 . 
”Naive” prediction:  4 3 8 8 8 7 . 
”Mix” prediction:  4 5 8 9 9 3 . 
”Combined” prediction:  4 5 0 0 3 1 . 
Input:
Target:  1250513. 
”Baseline” prediction:  1 2 5 0 9 3 9 . 
”Naive” prediction:  1 2 4 0 7 1 9 . 
”Mix” prediction:  1 2 3 0 8 8 1 . 
”Combined” prediction:  1 2 4 0 5 5 1 . 
Input:
Target:  948950. 
”Baseline” prediction:  9 4 8 9 5 0 . 
”Naive” prediction:  9 4 8 9 5 0 . 
”Mix” prediction:  9 4 8 9 5 0 . 
”Combined” prediction:  9 4 8 9 5 0 . 
Input:
Target:  7513764. 
”Baseline” prediction:   7 4 2 2 7 5 6 . 
”Naive” prediction:   7 0 1 1 0 4 8 . 
”Mix” prediction:   2 6 1 7 7 7 7 . 
”Combined” prediction:   7 1 0 1 1 4 6 . 
Input:
Target:  116026. 
”Baseline” prediction:  1 3 2 4 4 0 . 
”Naive” prediction:  1 0 1 4 8 8 . 
”Mix” prediction:  1 1 4 9 8 8 . 
”Combined” prediction:  1 2 5 6 8 2 . 
Input:
Target:  267900. 
”Baseline” prediction:  2 6 7 9 0 0 . 
”Naive” prediction:  2 6 7 9 0 0 . 
”Mix” prediction:  2 6 7 9 0 0 . 
”Combined” prediction:  2 6 7 9 0 0 . 
Input:
Target:  597058. 
”Baseline” prediction:  5 9 0 0 0 6 . 
”Naive” prediction:  6 9 0 0 0 4 . 
”Mix” prediction:  5 9 9 8 1 6 . 
”Combined” prediction:  5 9 9 9 9 0 . 
Input:
Target:  3266708. 
”Baseline” prediction:  3 2 4 9 9 9 8 . 
”Naive” prediction:  3 1 3 1 7 9 8 . 
”Mix” prediction:  3 3 9 0 1 5 8 . 
”Combined” prediction:  3 1 0 0 3 8 8 . 
Input:
Target:  449699. 
”Baseline” prediction:  4 4 9 6 9 9 . 
”Naive” prediction:  4 4 9 6 9 9 . 
”Mix” prediction:  4 4 9 6 9 9 . 
”Combined” prediction:  4 4 9 6 9 9 . 
Input:
Target:  11332. 
”Baseline” prediction:  1 1 3 3 2 . 
”Naive” prediction:  1 1 3 3 2 . 
”Mix” prediction:  1 1 3 3 2 . 
”Combined” prediction:  1 1 3 3 2 . 
b.6 Examples of program evaluation prediction. Length = 6, Nesting = 3
Input:
Target:  6953514. 
”Baseline” prediction:  1 0 9 9 5 2 2 . 
”Naive” prediction:  7 7 7 3 3 6 2 . 
”Mix” prediction:  6 9 9 3 1 2 4 . 
”Combined” prediction:  1 0 4 4 4 4 4 . 
Input:
Target:  765618. 
”Baseline” prediction:  8 0 0 9 8 8 . 
”Naive” prediction:  7 6 5 6 4 4 . 
”Mix” prediction:  7 6 5 6 1 6 . 
”Combined” prediction:  8 6 5 6 1 8 . 
Input:
Target:  7292860. 
”Baseline” prediction:  1 7 7 4 6 4 0 . 
”Naive” prediction:  7 1 3 4 6 6 0 . 
”Mix” prediction:  7 2 9 2 8 6 0 . 
”Combined” prediction:  7 2 9 2 8 6 0 . 
Input:
Target:  4074683. 
”Baseline” prediction:  1 3 2 0 5 5 4 4 . 
”Naive” prediction:   4 0 1 1 8 9 9 . 
”Mix” prediction:   4 4 2 2 9 0 9 . 
”Combined” prediction:   4 0 4 8 3 8 1 . 
Input:
Target:  445994. 
”Baseline” prediction:   3 3 3 1 5 3 . 
”Naive” prediction:   4 8 8 7 2 4 . 
”Mix” prediction:   4 4 0 8 8 0 . 
”Combined” prediction:   4 4 7 9 4 4 . 
Input:
Target:  576599. 
”Baseline” prediction:  1 7 6 5 9 9 . 
”Naive” prediction:  5 7 6 5 9 9 . 
”Mix” prediction:  5 7 6 5 9 9 . 
”Combined” prediction:  5 7 6 5 9 9 . 
Input:
Target:  10017. 
”Baseline” prediction:  1 2 1 1 5 . 
”Naive” prediction:   1 1 2 3 . 
”Mix” prediction:   0 0 0 . . 
”Combined” prediction:   0 0 3 3 . 
Input:
Target:  523084. 
”Baseline” prediction:  5 2 3 0 8 4 . 
”Naive” prediction:  5 2 3 0 8 4 . 
”Mix” prediction:  5 2 3 0 8 4 . 
”Combined” prediction:  5 2 3 0 8 4 . 
Input:
Target:  263838. 
”Baseline” prediction:   2 7 8 7 9 7 . 
”Naive” prediction:   2 4 1 1 4 4 . 
”Mix” prediction:   2 5 2 0 8 0 . 
”Combined” prediction:   2 7 7 8 8 2 . 
Input:
Target:  1684940. 
”Baseline” prediction:  1 6 0 2 2 2 1 . 
”Naive” prediction:  1 7 9 9 8 9 2 . 
”Mix” prediction:  1 6 7 7 7 8 8 . 
”Combined” prediction:  1 6 1 1 8 8 8 . 
b.7 Examples of predicting result of addition. Length = 6
Input:
Target:  566171. 
”Baseline” prediction:  5 6 6 1 9 9 . 
”Naive” prediction:  5 6 6 1 5 1 . 
”Mix” prediction:  5 6 6 1 7 1 . 
”Combined” prediction:  5 6 6 1 7 1 . 
Input:
Target:  1039705. 
”Baseline” prediction:  1 0 3 9 7 1 2 . 
”Naive” prediction:  1 0 3 9 6 0 5 . 
”Mix” prediction:  1 0 3 9 6 0 5 . 
”Combined” prediction:  1 0 3 9 7 0 5 . 
Input:
Target:  1397692. 
”Baseline” prediction:  1 3 9 7 6 9 4 . 
”Naive” prediction:  1 3 9 7 6 6 2 . 
”Mix” prediction:  1 3 9 7 7 9 2 . 
”Combined” prediction:  1 3 9 7 6 9 2 . 
Input:
Target:  1381508. 
”Baseline” prediction:  1 3 8 1 4 0 1 . 
”Naive” prediction:  1 3 8 1 5 1 8 . 
”Mix” prediction:  1 3 8 1 5 0 8 . 
”Combined” prediction:  1 3 8 1 5 0 8 . 
Input:
Target:  1126026. 
”Baseline” prediction:  1 1 2 6 0 2 0 . 
”Naive” prediction:  1 1 2 6 0 0 6 . 
”Mix” prediction:  1 1 2 5 0 2 6 . 
”Combined” prediction:  1 1 2 6 0 2 6 . 
Input:
Target:  181257. 
”Baseline” prediction:  1 8 1 3 9 8 . 
”Naive” prediction:  1 8 1 2 8 7 . 
”Mix” prediction:  1 8 1 2 5 7 . 
”Combined” prediction:  1 8 1 2 5 7 . 
Input:
Target:  1099826. 
”Baseline” prediction:  1 0 9 9 7 0 8 . 
”Naive” prediction:  1 0 9 9 8 2 6 . 
”Mix” prediction:  1 0 9 9 8 2 6 . 
”Combined” prediction:  1 0 9 9 8 2 6 . 
Input:
Target:  1292561. 
”Baseline” prediction:  1 2 9 2 5 8 9 . 
”Naive” prediction:  1 2 9 2 5 7 1 . 
”Mix” prediction:  1 2 9 2 5 6 1 . 
”Combined” prediction:  1 2 9 2 5 6 1 . 
Input:
Target:  1564795. 
”Baseline” prediction:  1 5 6 4 7 6 9 . 
”Naive” prediction:  1 5 6 4 7 7 5 . 
”Mix” prediction:  1 5 6 4 7 9 5 . 
”Combined” prediction:  1 5 6 4 7 9 5 . 
Input:
Target:  1183063. 
”Baseline” prediction:  1 1 8 3 0 0 0 . 
”Naive” prediction:  1 1 8 3 0 6 3 . 
”Mix” prediction:  1 1 8 3 0 6 3 . 
”Combined” prediction:  1 1 8 3 0 6 3 . 
b.8 Examples of predicting result of addition. Length = 8
Input:
Target:  128756369. 
”Baseline” prediction:  1 2 8 8 9 9 9 9 7 . 
”Naive” prediction:  1 2 8 7 5 6 6 6 9 . 
”Mix” prediction:  1 2 8 7 5 6 3 6 9 . 
”Combined” prediction:  1 2 8 7 5 6 3 6 9 . 
Input:
Target:  96136550. 
”Baseline” prediction:  9 6 1 2 9 9 9 9 . 
”Naive” prediction:  9 6 1 3 6 0 5 0 . 
”Mix” prediction:  9 6 1 3 6 5 5 0 . 
”Combined” prediction:  9 6 1 3 6 5 5 0 . 
Input:
Target:  139544807. 
”Baseline” prediction:  1 3 9 6 7 9 0 9 0 . 
”Naive” prediction:  1 3 9 5 4 4 7 0 7 . 
”Mix” prediction:  1 3 9 5 4 4 8 0 7 . 
”Combined” prediction:  1 3 9 5 4 4 8 0 7 . 
Input:
Target:  58726235. 
”Baseline” prediction:  5 8 7 9 8 5 2 3 . 
”Naive” prediction:  5 8 7 2 6 0 3 5 . 
”Mix” prediction:  5 8 7 2 6 2 3 5 . 
”Combined” prediction:  5 8 7 2 6 2 3 5 . 
Input:
Target:  70026696. 
”Baseline” prediction:  6 0 0 1 4 0 2 2 . 
”Naive” prediction:  7 0 0 2 6 4 9 6 . 
”Mix” prediction:  6 0 0 2 6 6 9 6 . 
”Combined” prediction:  7 0 0 2 6 6 9 6 . 
Input:
Target:  76598546. 
”Baseline” prediction:  7 6 6 9 9 7 7 7 . 
”Naive” prediction:  7 6 5 9 8 2 4 6 . 
”Mix” prediction:  7 6 5 9 8 5 4 6 . 
”Combined” prediction:  7 6 5 9 8 5 4 6 . 
Input:
Target:  105838392. 
”Baseline” prediction:  1 0 5 9 9 9 8 8 2 . 
”Naive” prediction:  1 0 5 8 3 8 2 9 2 . 
”Mix” prediction:  1 0 5 8 3 8 3 9 2 . 
”Combined” prediction:  1 0 5 8 3 8 3 9 2 . 
Input:
Target:  43112517. 
”Baseline” prediction:  4 3 1 7 8 4 4 1 . 
”Naive” prediction:  4 3 1 1 2 9 1 7 . 
”Mix” prediction:  4 3 1 1 2 5 1 7 . 
”Combined” prediction:  4 3 1 1 2 5 1 7 . 
Input:
Target:  136882728. 
”Baseline” prediction:  1 3 6 8 6 0 0 8 7 . 
”Naive” prediction:  1 3 6 8 8 3 9 2 8 . 
”Mix” prediction:  1 3 6 8 8 2 7 2 8 . 
”Combined” prediction:  1 3 6 8 8 2 7 2 8 . 
Input:
Target:  24017572. 
”Baseline” prediction:  2 4 0 0 0 3 4 9 . 
”Naive” prediction:  2 4 0 1 8 8 7 2 . 
”Mix” prediction:  2 3 0 1 7 5 7 2 . 
”Combined” prediction:  2 4 0 1 7 5 7 2 . 
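The listings above pair each target with four character-level predictions, so one natural way to read them is position by position. The following is a minimal sketch of such a comparison, assuming that a prediction is written as space-separated characters ending in the terminator ".", and that a missing position counts as an error. The helper names (parse_prediction, char_matches, char_accuracy) are hypothetical, and this per-character score is an illustration rather than a statement of the evaluation protocol used for the reported results.

```python
from typing import List


def parse_prediction(line: str) -> str:
    """Collapse a space-separated prediction such as '5 2 4 9 .' into '5249.'."""
    return "".join(line.split())


def char_matches(target: str, prediction: str) -> List[bool]:
    """Compare prediction to target at each target position.

    Assumption: positions the prediction does not reach count as mismatches.
    """
    pred = parse_prediction(prediction)
    return [i < len(pred) and pred[i] == ch for i, ch in enumerate(target)]


def char_accuracy(target: str, prediction: str) -> float:
    """Fraction of target characters (including the final '.') predicted correctly."""
    matches = char_matches(target, prediction)
    return sum(matches) / len(matches)


if __name__ == "__main__":
    # Target 5259. with its four listed predictions.
    target = "5259."
    predictions = {
        "Baseline": "5 1 0 1 .",
        "Naive": "5 1 0 1 .",
        "Mix": "5 2 4 9 .",
        "Combined": "5 2 2 9 .",
    }
    for name, pred in predictions.items():
        print(name, char_accuracy(target, pred))
```

On this example the "Baseline" and "Naive" predictions match two of the five target characters, while "Mix" and "Combined" match four of five. Scoring per character rather than per whole sequence gives partial credit for outputs that get the leading digits right, which is the typical pattern in the listings.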