Neural Arithmetic Expression Calculator
Abstract
This paper presents a pure neural solver for the arithmetic expression calculation (AEC) problem. Previous work leverages the power of deep neural networks and attempts to build end-to-end models to solve this problem. However, most of these methods can only deal with additive operations. Solving complex expression calculation, which involves addition, subtraction, multiplication, division and brackets, remains a challenging problem. In this work, we regard arithmetic expression calculation as a hierarchical reinforcement learning problem. An arithmetic operation is decomposed into a series of sub-tasks, and each sub-task is handled by a skill module. A skill module can be either a basic module performing elementary operations, or an interactive module performing complex operations by invoking other skill modules. With curriculum learning, our model can calculate complex arithmetic expressions with a deep hierarchical structure of skill modules. Experiments show that our model significantly outperforms previous models for arithmetic expression calculation.
Kaiyu Chen, Yihan Dong, Xipeng Qiu† (†corresponding author, xpqiu@fudan.edu.cn), Zitian Chen
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
{15307130233, 15302010054, xpqiu, ztchen13}@fudan.edu.cn
Preprint. Work in progress.
1 Introduction
Developing pure neural models to automatically solve arithmetic expression calculation (AEC) is an interesting and challenging task. Recent research includes Neural GPUs (Kaiser and Sutskever, 2015; Freivalds and Liepins, 2017), Grid LSTM (Kalchbrenner et al., 2015), Neural Turing Machines (Graves et al., 2014), and Neural Random-Access Machines (Kurach et al., 2016). Most of these models can only deal with addition. Although the Neural GPU is able to learn multi-digit binary multiplication, it does not work well on decimal multiplication (Kaiser and Sutskever, 2015). The difficulty of multi-digit decimal multiplication lies in the fact that it involves a complicated structure of arithmetic operations, which is hard for neural networks to learn. Considering how electronic circuits or human beings perform multiplication, multi-digit multiplication can be decomposed into several sub-goals, as shown in Figure 1. High-level arithmetic tasks like multiplication iteratively use low-level operations like addition to accomplish the high-level task.
The incapability of current models to solve arithmetic expressions stems from their failure to exploit two key properties of arithmetic operations: reusability and hierarchy. An arithmetic operation can be decomposed into a series of sub-operations, which form a hierarchical structure, and most of these sub-operations are reusable. When dealing with a complex arithmetic operation, we do not need to train a model from scratch. For the example in Figure 1, multi-digit multiplication involves several reusable sub-operations, such as single-digit multiplication and multi-digit addition.
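To make this decomposition concrete, the following short sketch (our illustration only, not part of the proposed model) computes a multi-digit product using nothing but single-digit multiplication and multi-digit addition, mirroring the sub-operation hierarchy of Figure 1.

```python
def single_digit_multiply(digit_a: int, digit_b: int) -> int:
    """Elementary operation: product of two single digits (0-9)."""
    return digit_a * digit_b

def multi_digit_add(a: int, b: int) -> int:
    """Elementary operation: multi-digit addition."""
    return a + b

def multi_digit_multiply(a: int, b: int) -> int:
    """Multi-digit multiplication decomposed into reusable sub-operations:
    single-digit multiplications plus multi-digit additions of shifted partial products."""
    result = 0
    for i, da in enumerate(reversed(str(a))):        # digits of a, least significant first
        for j, db in enumerate(reversed(str(b))):    # digits of b, least significant first
            partial = single_digit_multiply(int(da), int(db)) * 10 ** (i + j)
            result = multi_digit_add(result, partial)
    return result

assert multi_digit_multiply(123, 45) == 123 * 45
```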
To leverage reusability and hierarchy in arithmetic operations, we formulate this task as a Hierarchical Reinforcement Learning (HRL) problem (Sutton et al., 1999; Dietterich, 2000), in which the task policy can be decomposed into several sub-task policies. Each sub-task policy is implemented by a skill module, which can be used recursively. Skill modules fall into two groups: basic skill modules performing elementary single-digit operations, and interactive skill modules performing complex operations by selectively invoking other skill modules. There are two differences from standard HRL. (1) First, each invoked skill module can be executed given only its own input, regardless of the external environment state. Therefore, we propose Interactive Skill Modules (ISM) that selectively interact with other skill modules by sending a partial expression and receiving the returned answer. (2) Second, the task hierarchy is multi-level, which is difficult to learn from scratch. Therefore, we propose the Curriculum Teacher and Continual-learning Student (CTCS) framework to overcome this problem. The skill modules are trained in a particular order, from easy tasks to difficult ones. The final skill module forms a deep hierarchical structure. Experiments show that our model has a strong capability to calculate arithmetic expressions.
The main contributions of the paper are:

We propose a pure neural model to solve the (decimal) expression calculation problem, involving addition, subtraction, multiplication, division and bracket operations. Both the input and output of our model are character sequences. To the best of our knowledge, this study is the first work to solve this challenging problem.

We regard arithmetic learning as a Multi-level Hierarchical Reinforcement Learning (MHRL) problem, and factorize a complex arithmetic operation into several simpler operations. The main component is the interactive skill module. A high-level interactive skill module can invoke low-level skill modules by sending and receiving messages.

We introduce the Curriculum Teacher and Continual-learning Student (CTCS) framework, an automatic teacher-student framework that makes complex tasks easier for the model to learn.
2 Related Work
Arithmetic Learning
In recent years, several models have attempted to learn arithmetic with deep learning. Grid LSTM (Kalchbrenner et al., 2015) expands LSTM to multiple dimensions and can learn multi-digit addition. Zaremba et al. (2016) use reinforcement learning to learn single-digit multiplication and multi-digit addition. The Neural GPU (Kaiser and Sutskever, 2015) is noticeably promising in arithmetic learning and can learn binary multiplication. Price et al. (2016) and Freivalds and Liepins (2017) improve the Neural GPU to perform multi-digit multiplication with curriculum learning. Nevertheless, there has been no successful attempt to learn division or expression calculation.
Hierarchical Reinforcement Learning
The first popular hierarchical reinforcement learning model may date back to the options framework (Sutton et al., 1999). The options framework considers the problem as a two-level hierarchy. Recent work combines neural networks with this two-level hierarchy and has achieved promising results in challenging environments with sparse rewards, such as Minecraft (Tessler et al., 2017) and ATARI games (Baranes and Oudeyer, 2013). In contrast to the two-level hierarchy, the skill modules in our framework can selectively use other skill modules, which finally form a deep multi-level hierarchical structure.
Curriculum Learning
The work of Bengio et al. (2009) brought general interest to curriculum learning. Recently, it has been widely used in many tasks, such as learning to play first-person shooter games (Wu and Tian, 2017) and helping robots learn object manipulation (Baranes and Oudeyer, 2013). It is noteworthy that the teacher-student curriculum learning framework proposed by Matiisen et al. (2017) can automatically sample tasks according to the student's performance. However, it is limited to sampling data and cannot help the student adapt to task switching through parameter adjustment.
Continual Lifelong Learning
As proposed in Tessler et al. (2017), a continual lifelong learning model needs the ability to choose relevant prior knowledge for solving new tasks, which is called selective transfer. The main issue with continual learning models is that they are prone to catastrophic forgetting (McClelland et al., 1995; Parisi et al., 2018), meaning the model forgets previous knowledge when learning new tasks. To achieve continual lifelong learning, Progressive Neural Networks (PNN) (Rusu et al., 2016) allocate a new module, with access to prior knowledge, to learn each new task. With this approach, prior knowledge can be used and former modules are not influenced. Our model extends PNN with the ability to use helpful modules selectively.
3 Model
Task Definition
We first formalize the task of arithmetic expression calculation (AEC) as follows. Given a character sequence consisting of decimal digits, arithmetic operators (+, -, *, /) and brackets, the goal is to output a sequence of digit characters representing the result. For example, given the input 2*(3+4)-5, the model should output the character sequence 9.
3.1 Multi-level Hierarchical Reinforcement Learning
As analyzed before, arithmetic calculation can be decomposed into several sub-tasks, including single-digit multiplication, multi-digit addition, and more. Assuming we already have modules for the simple arithmetic calculations, the key challenge is how to organize them to solve a more complex arithmetic calculation. In this paper, we propose a multi-level hierarchical reinforcement learning (MHRL) framework to perform this task.
Hierarchical Reinforcement Learning (HRL)
In HRL, the policy $\pi$ of an agent can be decomposed into several sub-policies from the set $\{\pi_1, \ldots, \pi_K\}$. At time $t$, the policy $\pi$ maps the state $s_t$ to a probability distribution over sub-policies. Assuming the $k$-th sub-policy is chosen, the action $a_t$ is determined by $\pi_k(a_t \mid s_t)$.
The arithmetic calculation task is a multi-level hierarchical reinforcement learning problem, in which a sub-policy can be further decomposed into sub-sub-policies. Suppose that each (sub-)policy is implemented by a skill module. There are two kinds of modules: basic skill modules (BSM) and interactive skill modules (ISM). All modules take character sequences as inputs and produce character sequences as outputs.
Basic Skill Modules (BSM)
The basic skill modules perform fundamental arithmetic operations such as single-digit addition or multiplication. The structure of a basic skill module is illustrated in Figure 2(a). Given a sequence of decimal and arithmetic characters of length $T$, we first map the sequence to character embeddings $e_{1:T}$. The embeddings are then fed into a bidirectional RNN (BiRNN). Outputs are generated by choosing, at each position, the character with the maximum probability after the softmax function. Basic skill modules are trained in a supervised manner.
Each BSM provides a deterministic policy $y = \pi_{\mathrm{BSM}}(x)$, where $y$ is the calculated result in the form of a digit character sequence.
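As a rough sketch of such a module (our illustration; the embedding size and per-position decoding are assumptions, with GRU cells and hidden size 100 taken from Section 4), a BSM can be built as character embeddings followed by a bidirectional GRU and a per-position argmax over the output vocabulary:

```python
import torch
import torch.nn as nn

class BasicSkillModule(nn.Module):
    """Character-in, character-out module for an elementary operation
    (e.g. single-digit addition), trained with supervised learning."""
    def __init__(self, vocab_size: int, embed_dim: int = 32, hidden_size: int = 100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.birnn = nn.GRU(embed_dim, hidden_size, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) indices of input characters
        emb = self.embed(char_ids)      # (batch, seq_len, embed_dim)
        states, _ = self.birnn(emb)     # (batch, seq_len, 2 * hidden_size)
        return self.out(states)         # per-position logits over output characters

    def predict(self, char_ids: torch.Tensor) -> torch.Tensor:
        # Deterministic policy: pick the most probable character at each position.
        return self.forward(char_ids).argmax(dim=-1)
```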
Interactive Skill Modules (ISM)
The interactive skill modules perform arithmetic operations by invoking other skill modules. An example of an interactive skill module is shown in Figure 2(b). The policy of an ISM is to select other skill modules to complete partial arithmetic calculations. Different from standard HRL, each skill module performs a local arithmetic operation and does not need to observe the global environment state. Therefore, when a skill module $m_i$ chooses another skill module $m_j$ as a sub-policy, module $m_i$ simply sends a character sequence to module $m_j$ and receives a character sequence back as the answer.
It is hard to train skill modules from scratch, so we use curriculum learning, described in Section 3.3, to train the skill modules in order of increasing difficulty. Suppose that we already have well-trained skill modules $m_1, \ldots, m_{k-1}$; the $k$-th ISM $m_k$ is described as follows.
3.1.1 Structure of Interactive Skill Module
The detailed structure of ISM is shown in Figure 3.
First, each ISM is equipped with a memory to hold temporary information. The memory is composed of character slots of length $L$. When module $m_k$ receives an expression $x$, it first stores $x$ into its memory.
The policy of an ISM can be decomposed into three sub-policies: (1) selecting a skill module, (2) reading from memory, and (3) writing to memory.
At time $t$, the memory $M_t$ contains characters $c_1, \ldots, c_L$; we first use a BiRNN to encode the state of the memory:

$$h_{1:L} = \mathrm{BiRNN}(e_{c_1}, \ldots, e_{c_L}), \qquad (1)$$

where $e_{c_i}$ is the embedding of character $c_i$ for $1 \le i \le L$.
The state of the environment is modeled by a forward RNN,

$$s_t = \mathrm{RNN}(s_{t-1}, \phi(h_{1:L})), \qquad (2)$$

where $\phi$ is a one-layer feed-forward neural network.
Given the state $s_t$, the agent chooses three actions according to the following three sub-policies:

$$g_t \sim \pi_{\mathrm{sel}}(g \mid s_t), \qquad (3)$$

$$r_t \sim \pi_{\mathrm{read}}(r \mid s_t, h_{1:L}), \qquad (4)$$

$$w_t \sim \pi_{\mathrm{write}}(w \mid s_t, h_{1:L}), \qquad (5)$$

where $g_t$, $r_t$ and $w_t$ denote the chosen module, the read pointers and the write pointers at time $t$, respectively. $\pi_{\mathrm{sel}}$, $\pi_{\mathrm{read}}$, and $\pi_{\mathrm{write}}$ are pointer functions as described in Pointer Networks (Vinyals et al., 2015). In practice, there are two pairs of read pointers and one pair of write pointers specifying the start and end positions of reading and writing. Additionally, positional embeddings (Vaswani et al., 2017) are combined with the character embeddings to provide the model with relative positional information.
Then the read pointers $r_t$ extract a sub-expression $x_t$ from the memory, which is sent to the selected module $m_{g_t}$:

$$x_t = \mathrm{read}(M_t, r_t), \qquad (6)$$

$$y_t = m_{g_t}(x_t), \qquad (7)$$

where $y_t$ is the output of module $m_{g_t}$, which is further written back into the memory at the positions given by the write pointers $w_t$:

$$M_{t+1} = \mathrm{write}(M_t, w_t, y_t). \qquad (8)$$
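To make the control flow above concrete, the following sketch paraphrases one ISM step in plain Python (illustrative only: the `policy` interface, the function names, and the treatment of pointers as index pairs are our assumptions, not the authors' implementation):

```python
def ism_step(memory, state, skill_modules, policy):
    """One decision step of an interactive skill module (schematic).
    memory: list of characters; state: recurrent environment state;
    skill_modules: previously trained BSMs/ISMs; policy: the three sub-policies."""
    mem_enc = policy.encode_memory(memory)                   # Eq. (1): BiRNN over memory slots
    state = policy.update_state(state, mem_enc)              # Eq. (2): forward RNN over time steps

    module_id = policy.select_module(state)                  # Eq. (3): which skill module to invoke
    r_start, r_end = policy.read_pointers(state, mem_enc)    # Eq. (4): pointer network over memory
    w_start, w_end = policy.write_pointers(state, mem_enc)   # Eq. (5): pointer network over memory

    sub_expr = memory[r_start:r_end]                         # Eq. (6): read a sub-expression
    answer = skill_modules[module_id](sub_expr)              # Eq. (7): delegate to the chosen module
    memory[w_start:w_end] = answer                           # Eq. (8): write the answer back
    return memory, state
```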
3.2 Optimization
When the ISM has generated the whole action trajectory $(g_1, r_1, w_1), \ldots, (g_T, r_T, w_T)$, where $T$ is the number of selected skill modules, it outputs an answer.
Finally, the ISM receives a reward of 1 when it gives the correct answer. Otherwise, the reward is negative, based on the character-level similarity to the solution.
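A possible reward in this spirit (ours; the paper does not specify the exact similarity measure, so the positional-match similarity below is an assumption) is:

```python
def reward(prediction: str, target: str) -> float:
    """+1 for an exact answer; otherwise a negative reward that is less harsh
    the more characters match the target (one possible character-level similarity)."""
    if prediction == target:
        return 1.0
    matches = sum(p == t for p, t in zip(prediction, target))
    similarity = matches / max(len(prediction), len(target))
    return -1.0 + similarity  # lies in [-1, 0) for incorrect answers

print(reward("63", "63"))   # 1.0
print(reward("64", "63"))   # -0.5
print(reward("7", "63"))    # -1.0
```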
Among reinforcement learning methods, Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an on-policy gradient approach that achieves state-of-the-art results on many benchmark tasks. Therefore, we use PPO to train the ISMs. We sample trajectories from the policy $\pi_\theta$, where $\theta$ denotes the model parameters. With every state $s_t$ and sampled action $a_t$, we compute gradients to maximize the following objective function:
$$J(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(\rho_t(\theta)\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\Big] + \beta H(\pi_\theta), \qquad (9)$$

where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$, $\hat{A}_t$ is the advantage function representing the discounted reward, $H$ is an entropy regularizer used to discourage premature convergence to sub-optimal policies (Mnih et al., 2016), and $\beta$ is the coefficient that balances exploration and exploitation, which will be discussed in the CTCS framework (see Section 3.3).
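For reference, the clipped surrogate with an entropy bonus can be computed as follows (a generic PPO sketch following Schulman et al. (2017); the tensor shapes and the way β is passed in are illustrative, not the authors' exact code):

```python
import torch

def ppo_objective(log_probs, old_log_probs, advantages, entropy, beta, clip_eps=0.2):
    """Clipped PPO surrogate with entropy regularization (Equation 9, schematically).
    log_probs, old_log_probs, advantages, entropy: tensors of shape (num_steps,)."""
    ratio = torch.exp(log_probs - old_log_probs)          # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()      # pessimistic (clipped) bound
    return surrogate + beta * entropy.mean()              # objective to maximize
```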
3.3 Curriculum Teacher and Continual-learning Student (CTCS)
We propose the Curriculum Teacher and Continual-learning Student (CTCS) framework to help the model acquire knowledge efficiently.
The CTCS framework is illustrated in Figure 4. Given a set of tasks $T_1, \ldots, T_N$ ordered by increasing difficulty, each task $T_i$ contains a set of data samples $(x_j, y_j)$. The curriculum teacher presents the tasks in the order $T_1, T_2, \ldots, T_N$, switching to the next task only when the student performs perfectly on the current one. Within each task, the curriculum teacher uses a difficulty sampling strategy to select training samples.
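The teacher's outer loop can be sketched as follows (a minimal sketch; the `student` interface and the mastery check are our assumptions, and the per-sample selection inside `train_on` uses the difficulty sampling described next):

```python
def curriculum_teacher(tasks, student, mastery_threshold=1.0):
    """Present tasks in order of increasing difficulty; switch to the next task
    only when the student answers every sample of the current task correctly.
    `tasks` is a list of lists of (expression, answer) pairs; `student` exposes
    accuracy/train_on (an assumed interface for illustration)."""
    for task in tasks:
        while student.accuracy(task) < mastery_threshold:
            student.train_on(task)   # sample within the task via difficulty sampling (Sec. 3.3)
```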
Difficulty Sampling encourages learning from difficult samples. Unlike most problems, arithmetic learning requires precise calculation and thus complete mastery of the training samples. However, the model tends to reach good, but not perfect, performance. Inspired by deliberate practice, a common learning method for human beings, we use difficulty sampling to help the student achieve complete mastery.
To formalize, the difficulty score $d_i$ of sample $i$ is the total number of incorrect attempts on that sample. The probability of drawing each sample is then determined by a parameterized function of its difficulty score:

$$p_i = \frac{f_\alpha(d_i)}{\sum_j f_\alpha(d_j)}, \qquad (10)$$

where $f_\alpha$ is a monotonically increasing function with hyper-parameter $\alpha$.
Parameter Adjustment encourages or discourages exploration by the student. In reinforcement learning, adding an entropy term controlled by a coefficient to the loss is a commonly used technique (Mnih et al., 2016) to discourage premature convergence to sub-optimal policies. However, to what extent the student should be encouraged to explore is a long-standing issue, the exploration-exploitation dilemma (Kaelbling et al., 1996). Intuitively, exploration should be encouraged when the student has difficulty with some samples. Therefore, we let the teacher adjust the student's exploration strategy according to its performance. Specifically, the entropy coefficient for sample $i$ is

$$\beta_i = \beta_0 \cdot g(d_i), \qquad (11)$$

where $d_i$ is the difficulty score described in difficulty sampling, $g$ is a monotonically increasing function, and $\beta_0$ is the base entropy coefficient. As shown in Section 5, the difficulty sampling and parameter adjustment methods are critical to achieving perfect performance.
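One way to realize both strategies (our reading of Equations 10 and 11; the softmax form for sampling, the linear form for the entropy coefficient, and all constants are assumptions, since the paper leaves the functions parameterized):

```python
import math

def sampling_probabilities(difficulty, alpha=10.0):
    """Difficulty sampling: harder samples (more incorrect attempts) are drawn more often.
    `difficulty` maps sample index -> number of incorrect attempts (assumed softmax form)."""
    scores = {i: math.exp(d / alpha) for i, d in difficulty.items()}
    total = sum(scores.values())
    return {i: s / total for i, s in scores.items()}

def entropy_coefficient(difficulty_score, base_beta=0.01, scale=0.001):
    """Parameter adjustment: explore more on samples the student repeatedly fails
    (assumed linear form; the paper only states that the coefficient grows with difficulty)."""
    return base_beta + scale * difficulty_score
```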
4 Experiments
Arithmetic Expression Calculation
To train our model to calculate arithmetic expressions with curriculum learning, we define several sub-tasks, from basic tasks such as single-digit addition to compositional tasks such as multi-digit division. We then train our model on these tasks in order of increasing difficulty. The full curriculum list is shown in the appendix. The code is available online (anonymized for review).
The arithmetic expression data is generated through a random process. An expression of length 10 contains approximately 3 arithmetic operators (+, -, *, /) on average.
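A possible generator in this spirit (ours, for illustration only: the operand range, the absence of brackets, and the use of integer division to keep answers integral are all assumptions about the data, not the authors' generation script):

```python
import random

OPERATORS = "+-*/"

def random_expression(num_operators: int = 3, max_operand: int = 99):
    """Generate a random arithmetic expression string and its answer string."""
    tokens = [str(random.randint(0, max_operand))]
    for _ in range(num_operators):
        tokens.append(random.choice(OPERATORS))
        tokens.append(str(random.randint(1, max_operand)))   # nonzero to keep division defined
    expression = "".join(tokens)
    answer = eval(expression.replace("/", "//"))              # integer division as a simplifying assumption
    return expression, str(answer)

print(random_expression())   # e.g. ('37+4*81-6', '355')
```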
We compare our model with two baseline models:

Neural GPU: An arithmetic algorithm learning model proposed by Kaiser and Sutskever (2015). We use their open-source implementation posted on GitHub.

LSTM: A sequence-to-sequence model (Sutskever et al., 2014) with LSTM units (Hochreiter and Schmidhuber, 1997).

To make an objective comparison, we also apply the same curriculum learning method to the baseline models. The results are shown in Figure 5 and Figure 6.
As the results show, both baseline models mainly memorize training samples, achieving relatively high accuracy on the training set but nearly zero accuracy on the test set. The LSTM model shows a powerful capability to memorize training samples: every time the task switches, its performance suddenly drops to zero and then climbs back to a high level. Although the Neural GPU seems to have better generalization ability, it still performs poorly on the test set.
In contrast, our model achieves almost 100 percent accuracy in this experiment, which shows its effectiveness and generalization ability.
Subtask Performance
We evaluate our model on different sub-tasks to examine its performance on various arithmetic operations. The results are shown in Table 1. It is noteworthy that the answers for division are relatively small, so the models can guess them, resulting in nearly 20% accuracy on division. As the results show, our model achieves 100% mastery, far surpassing the baseline models, especially on the expression calculation task.
Hyperparameters
Gradient-based optimization is performed using the Adam update rule (Kingma and Ba, 2014). Every RNN in our model is a GRU (Chung et al., 2014) with hidden size 100. The parameter $\alpha$ used in Equation 10 is 10. The number of consecutive samples drawn in difficulty sampling is 64. In PPO, the reward discount factor is 0.99 and the clipping parameter $\epsilon$ is 0.2.
5 Discussion
5.1 Ablation Study
Curriculum Learning and Continual Learning
To test whether our model can make use of prior knowledge when meeting a new task, we challenge it with learning a new arithmetic operation: the modulo operation. We compare our proposed model with a baseline model that learns from scratch. The results are shown in Figure 7. Without curriculum learning and continual learning, the model fails to give any correct solutions, which shows the necessity of both.
Difficulty Sampling and Parameter Adjustment
In the Curriculum Teacher and Continual-learning Student (CTCS) framework, we present difficulty sampling and parameter adjustment to help the model reach perfect performance. Their effectiveness is illustrated in Figure 8. Without difficulty sampling and parameter adjustment, the model converges to a sub-optimal strategy, which shows that both are important in helping the model achieve perfect mastery.
6 Conclusion
In this paper, we propose a pure neural model to solve the arithmetic expression calculation problem. Specifically, we use the Multi-level Hierarchical Reinforcement Learning (MHRL) framework to factorize a complex arithmetic operation into several simpler operations. We also present the Curriculum Teacher and Continual-learning Student (CTCS) framework, in which the teacher adopts difficulty sampling and parameter adjustment strategies to supervise the student. Together, these components solve the arithmetic expression calculation problem. Experiments show that our model significantly outperforms previous methods for arithmetic expression calculation.
References
 Kaiser and Sutskever [2015] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
 Freivalds and Liepins [2017] Karlis Freivalds and Renars Liepins. Improving the neural GPU architecture for algorithm learning. arXiv preprint arXiv:1702.08727, 2017.
 Kalchbrenner et al. [2015] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. CoRR, abs/1507.01526, 2015.
 Graves et al. [2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Kurach et al. [2016] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. ERCIM News, 2016.
 Sutton et al. [1999] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112:181–211, 1999.
 Dietterich [2000] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res., 13:227–303, 2000.
 Zaremba et al. [2016] Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms from examples. In ICML, 2016.
 Price et al. [2016] Eric Price, Wojciech Zaremba, and Ilya Sutskever. Extensions and limitations of the Neural GPU. CoRR, abs/1611.00736, 2016.
 Tessler et al. [2017] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in minecraft. In AAAI, volume 3, page 6, 2017.
 Baranes and Oudeyer [2013] Adrien Baranes and PierreYves Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61:49–73, 2013.
 Bengio et al. [2009] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009.
 Wu and Tian [2017] Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In ICLR, 2017.
 Matiisen et al. [2017] Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacherstudent curriculum learning. arXiv preprint arXiv:1707.00183, 2017.
 Mcclelland et al. [1995] J. L. McClelland, B. L. McNaughton, and Randall C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102 3:419–57, 1995.
 Parisi et al. [2018] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. CoRR, abs/1802.07569, 2018.
 Rusu et al. [2016] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
 Vinyals et al. [2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.
 Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
 Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 Kaelbling et al. [1996] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
 Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9 8:1735–80, 1997.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Chung et al. [2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.