Neural Arithmetic Expression Calculator

Kaiyu Chen, Yihan Dong, Xipeng Qiu, Zitian Chen
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
{15307130233, 15302010054, xpqiu, ztchen13}
Corresponding author: Xipeng Qiu

This paper presents a pure neural solver for the arithmetic expression calculation (AEC) problem. Previous work exploits the capabilities of deep neural networks and attempts to build end-to-end models for this problem. However, most of these methods can only handle addition. Solving complex expressions that involve addition, subtraction, multiplication, division, and brackets remains challenging. In this work, we regard arithmetic expression calculation as a hierarchical reinforcement learning problem. An arithmetic operation is decomposed into a series of sub-tasks, and each sub-task is handled by a skill module. A skill module can be a basic module that performs elementary operations, or an interactive module that performs complex operations by invoking other skill modules. With curriculum learning, our model can handle complex arithmetic expressions through a deep hierarchical structure of skill modules. Experiments show that our model significantly outperforms previous models for arithmetic expression calculation.



Preprint. Work in progress.

1 Introduction

Developing pure neural models that automatically solve arithmetic expression calculation (AEC) is an interesting and challenging task. Recent research includes Neural GPUs (Kaiser and Sutskever, 2015; Freivalds and Liepins, 2017), Grid LSTM (Kalchbrenner et al., 2015), Neural Turing Machines (Graves et al., 2014), and Neural Random-Access Machines (Kurach et al., 2016). Most of these models can only handle addition. Although the Neural GPU can learn multi-digit binary multiplication, it does not work well on decimal multiplication (Kaiser and Sutskever, 2015). The difficulty of multi-digit decimal multiplication lies in the fact that multiplication involves a complicated structure of arithmetic operations, which is hard for neural networks to learn. Considering how electronic circuits or human beings perform multiplication, multi-digit multiplication can be decomposed into several subgoals, as shown in Figure 1. High-level arithmetic tasks such as multiplication iteratively use low-level operations such as addition.

Figure 1: An example of reusability and hierarchy in multiplication. The two operator symbols in the figure denote multi-digit addition and single-digit multiplication, respectively.

Current models struggle with arithmetic expressions because they fail to exploit two key properties of arithmetic operations: reusability and hierarchy. An arithmetic operation can be decomposed into a series of sub-operations, which form a hierarchical structure. Most of the sub-operations are reusable, so when dealing with a complex arithmetic operation we do not need to train a model from scratch. In the example in Figure 1, multi-digit multiplication involves several reusable sub-operations, such as multi-digit addition and single-digit multiplication.

To leverage reusability and hierarchy in arithmetic operations, we formulate this task as a Hierarchical Reinforcement Learning (HRL) problem (Sutton et al., 1999; Dietterich, 2000), in which the task policy can be decomposed into several sub-task policies. Each sub-task policy is implemented by a skill module, which can be used recursively. Skill modules fall into two groups: basic skill modules, which perform elementary single-digit operations, and interactive skill modules, which perform complex operations by selectively invoking other skill modules. Our setting differs from standard HRL in two ways. (1) Each invoked skill module can be executed with only its input, regardless of the external environment state. Therefore, we propose Interactive Skill Modules (ISM) that selectively interact with other skill modules by sending a partial expression and receiving the returned answer. (2) The task hierarchy is multi-level, which is difficult to learn from scratch. Therefore, we propose the Curriculum Teacher and Continual-learning Student (CTCS) framework to overcome this problem. The skill modules are trained in a particular order, from easy to difficult tasks. The final skill module has a deep hierarchical structure. The experiments show that our model has a strong capability to calculate arithmetic expressions.

The main contributions of the paper are:

  • We propose a pure neural model to solve the (decimal) expression calculation problem, involving addition, subtraction, multiplication, division, and bracketing operations. Both the input and the output of our model are character sequences. To the best of our knowledge, this is the first work to solve this challenging problem.

  • We regard arithmetic learning as a Multi-level Hierarchical Reinforcement Learning (MHRL) problem, and factorize a complex arithmetic operation into several simpler operations. The main component is the interactive skill module. A high-level interactive skill module can invoke the low-level skill modules by sending and receiving messages.

  • We introduce Curriculum Teacher and Continual-learning Student (CTCS), an automatic teacher-student framework that makes complex tasks easier for the model to learn.

2 Related Work

Arithmetic Learning

In recent years, several deep learning models have attempted to learn arithmetic. Grid LSTM (Kalchbrenner et al., 2015) extends LSTM to multiple dimensions and can learn multi-digit addition. Zaremba et al. (2016) use reinforcement learning to learn single-digit multiplication and multi-digit addition. The Neural GPU (Kaiser and Sutskever, 2015) is noticeably promising in arithmetic learning and can learn binary multiplication. Price et al. (2016) and Freivalds and Liepins (2017) improve the Neural GPU to do multi-digit multiplication with curriculum learning. Nevertheless, there has been no successful attempt to learn division or expression calculation.

Hierarchical Reinforcement Learning

The first popular hierarchical reinforcement learning model may date back to the options framework (Sutton et al., 1999). The options framework considers the problem to have a two-level hierarchy. Recent work combines neural networks with this two-level hierarchy and has achieved promising results in challenging environments with sparse rewards, such as Minecraft (Tessler et al., 2017) and ATARI games (Baranes and Oudeyer, 2013). In contrast to the two-level hierarchy, the skill modules in our framework can selectively use other skill modules, which finally form a deep multi-level hierarchical structure.

Curriculum Learning

Work by Bengio et al. (2009) brought general interest to curriculum learning. Recently, it has been widely used in many tasks, such as learning to play first-person shooter games (Wu and Tian, 2017) and helping robots learn object manipulation (Baranes and Oudeyer, 2013). Notably, the teacher-student curriculum learning framework proposed by Matiisen et al. (2017) can automatically sample tasks according to the student’s performance. However, it is limited to sampling data and cannot help the student adapt to task switching through parameter adjustment.

Continual Lifelong Learning

As proposed in Tessler et al. (2017), a continual lifelong learning model needs the ability to choose relevant prior knowledge for solving new tasks, which is named selective transfer. The main issue of continual learning models is that they are prone to catastrophic forgetting (Mcclelland et al., 1995; Parisi et al., 2018), which means the model forgets previous knowledge when learning new tasks. To achieve continual lifelong learning, Progressive Neural Networks (PNN; Rusu et al., 2016) allocate a new module with access to prior knowledge to learn a new task. With this approach, prior knowledge can be used, and former modules are not influenced. Our model extends PNN with the ability to use helpful modules selectively.

3 Model

Task Definition

We first formalize the task of arithmetic expression calculation (AEC) as follows. Given a character sequence consisting of decimal digits and arithmetic operators (addition, subtraction, multiplication, division, and brackets), the goal is to output a sequence of digit characters representing the result.
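To make the input/output contract concrete, the following sketch evaluates a character sequence with a conventional recursive-descent parser and returns the result as a character sequence. It is only a reference for what the neural model is expected to produce, not part of the model. The operator characters `+ - * /`, the use of floor division for `/`, and the function name `evaluate` are assumptions made for illustration.

```python
def evaluate(expr: str) -> str:
    """Evaluate an arithmetic expression given as a character sequence.

    Assumptions (not from the paper): operators are written + - * /,
    and / is integer (floor) division so the result is a digit string.
    """
    pos = 0

    def peek() -> str:
        return expr[pos] if pos < len(expr) else ""

    def parse_atom() -> int:
        nonlocal pos
        if peek() == "(":
            pos += 1                      # consume "("
            value = parse_sum()
            if peek() != ")":
                raise ValueError("missing closing bracket")
            pos += 1                      # consume ")"
            return value
        start = pos
        while peek().isdigit():
            pos += 1
        if start == pos:
            raise ValueError(f"digit expected at position {pos}")
        return int(expr[start:pos])

    def parse_product() -> int:
        nonlocal pos
        value = parse_atom()
        while peek() in ("*", "/"):
            op = expr[pos]
            pos += 1
            rhs = parse_atom()
            value = value * rhs if op == "*" else value // rhs
        return value

    def parse_sum() -> int:
        nonlocal pos
        value = parse_product()
        while peek() in ("+", "-"):
            op = expr[pos]
            pos += 1
            rhs = parse_product()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = parse_sum()
    if pos != len(expr):
        raise ValueError(f"unexpected character at position {pos}")
    return str(result)
```

Under these assumptions, `evaluate("12*(3+4)")` returns the character sequence `"84"`.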

3.1 Multi-level Hierarchical Reinforcement Learning

As analyzed before, the arithmetic calculation can be decomposed into several sub-tasks, including single-digit multiplication, multi-digit addition and more. Assuming we already have several modules for the simple arithmetic calculations, the key challenge is how to organize them to solve a more complex arithmetic calculation. In this paper, we propose a multi-level hierarchical reinforcement learning (MHRL) framework to perform this task.

Hierarchical Reinforcement Learning (HRL)

In HRL, the policy $\pi$ of an agent can be decomposed into several sub-policies from the set $\Pi = \{\pi_1, \dots, \pi_K\}$. At time $t$, the policy $\pi$ is a mapping from the state $s_t$ to a probability distribution over sub-policies. Assuming the $k$-th sub-policy is chosen, the action $a_t$ is determined by $\pi_k(s_t)$.

Arithmetic calculation is a multi-level hierarchical reinforcement learning problem, in which a sub-policy can be further decomposed into sub-sub-policies. Suppose that each (sub-)policy is implemented by a skill module. There are two different kinds of modules: basic skill modules (BSM) and interactive skill modules (ISM). All modules use character sequences as inputs and produce character sequences as outputs.

(a) Basic Skill Module (BSM)
(b) Interactive Skill Module (ISM)
Figure 2: Demonstration of skill modules. (a) The basic skill module (BSM) is a neural network with a Bi-RNN, which takes a character sequence as input and outputs the character sequence of the result. (b) The interactive skill module (ISM) interacts with other skill modules to do the calculation. The illustrated module performs multi-digit by single-digit multiplication, using two BSMs (single-digit addition and single-digit multiplication) and another ISM for multi-digit addition.

Basic Skill Modules (BSM)

The basic skill modules perform fundamental arithmetic operations such as single-digit addition or multiplication. The structure of a basic skill module is illustrated in Figure 2(a). Given a sequence $x = (x_1, \dots, x_n)$ of decimal digits and arithmetic characters of length $n$, we first map the sequence with character embeddings to $(e_1, \dots, e_n)$. The embeddings are then fed into a bi-directional RNN (Bi-RNN). Outputs are generated by choosing the characters with the maximum probability after softmax functions. Basic skill modules are trained in a supervised manner.

Each BSM provides a deterministic policy $\pi(x) = y$, where $y$ is the calculated result in the form of a digit character sequence.

Interactive Skill Modules (ISM)

The interactive skill modules perform arithmetic operations by invoking other skill modules. An example of an interactive skill module is shown in Figure 2(b). The policy of an ISM is to select other skill modules to complete partial arithmetic calculations. Unlike in standard HRL, each skill module performs a local arithmetic operation and need not observe the global environment state. Therefore, when a skill module $M_i$ chooses another skill module $M_j$ as its sub-policy, $M_i$ just sends a character sequence to $M_j$ and receives a character sequence as the answer.
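The calling pattern between modules can be illustrated with a hand-written sketch. Here the decomposition is hard-coded, whereas in the model it is a learned policy, and plain Python functions stand in for trained BSMs. All function names and the `*` operator character are hypothetical. The higher-level module for multi-digit by single-digit multiplication only exchanges character sequences with the lower-level modules.

```python
def bsm_add(expr: str) -> str:
    """Stand-in for a trained BSM: single-digit addition, e.g. "7+8" -> "15"."""
    a, b = expr.split("+")
    assert len(a) == 1 and len(b) == 1
    return str(int(a) + int(b))


def bsm_mul(expr: str) -> str:
    """Stand-in for a trained BSM: single-digit multiplication, e.g. "7*8" -> "56"."""
    a, b = expr.split("*")
    assert len(a) == 1 and len(b) == 1
    return str(int(a) * int(b))


def ism_multi_add(expr: str) -> str:
    """Multi-digit addition that only ever calls bsm_add on single digits."""
    a, b = expr.split("+")
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    digits, carry = [], "0"
    for da, db in zip(reversed(a), reversed(b)):
        partial = bsm_add(f"{da}+{db}")          # "0" .. "18"
        low = bsm_add(f"{partial[-1]}+{carry}")  # add the incoming carry
        digits.append(low[-1])
        carry = "1" if len(partial) == 2 or len(low) == 2 else "0"
    if carry == "1":
        digits.append("1")
    return "".join(reversed(digits)).lstrip("0") or "0"


def ism_multi_by_single(expr: str) -> str:
    """Multi-digit by single-digit multiplication built from lower-level modules."""
    a, b = expr.split("*")
    total = "0"
    for shift, digit in enumerate(reversed(a)):
        # One single-digit product per digit, shifted to its decimal place.
        product = bsm_mul(f"{digit}*{b}") + "0" * shift
        total = ism_multi_add(f"{total}+{product}")
    return total
```

For example, `ism_multi_by_single("123*4")` issues three calls to `bsm_mul` and three calls to `ism_multi_add`, each of which in turn calls `bsm_add`, which gives a three-level hierarchy.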

Figure 3: Detailed structure of Interactive Skill Modules (ISM). At each time step, a Bi-RNN encodes the memory and produces a sequence of outputs. The first and last outputs are concatenated as the memory representation and fed into the central RNN. The outputs of the central RNN are used to select modules and to generate read pointers and write pointers. The read pointers are used to read the sub-expression from memory. The selected module deals with the sub-expression and returns an answer. The answer is then written to memory at the positions indicated by the write pointers. The dashed orange lines represent processes related to the policy of the ISM.

It is hard to train skill modules from scratch, so we use curriculum learning, described in Section 3.3, to train skill modules in order of increasing difficulty. Suppose that we already have $k$ well-trained skill modules $M_1, \dots, M_k$; the $(k+1)$-th ISM is described as follows.

3.1.1 Structure of Interactive Skill Module

The detailed structure of ISM is shown in Figure 3.

First, each ISM is equipped with a memory to hold temporary information. The memory is composed of $L$ character slots. When the module receives an expression $x$, it first stores $x$ into memory.

The policy of ISM can be decomposed into three sub-policies: (1) selecting skill module, (2) reading memory, and (3) writing memory.

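At time $t$, the memory contains characters $m^t_1, \dots, m^t_L$, where $L$ is the number of memory slots. We first use a Bi-RNN to encode the state of the memory,

$$(h^t_1, \dots, h^t_L) = \text{Bi-RNN}(e^t_1, \dots, e^t_L),$$

where $e^t_i$ is the embedding of character $m^t_i$ for $1 \le i \le L$.

The state of the environment is modeled by a forward RNN,

$$s_t = \text{RNN}\left(s_{t-1}, f([h^t_1; h^t_L])\right),$$

where $f$ is a one-layer feed-forward neural network.

Given the state $s_t$, the agent chooses three actions according to the following three sub-policies,

$$k_t \sim \pi_{\text{select}}(\cdot \mid s_t), \quad r_t \sim \pi_{\text{read}}(\cdot \mid s_t), \quad w_t \sim \pi_{\text{write}}(\cdot \mid s_t),$$

where $k_t$, $r_t$, and $w_t$ denote the chosen module, the read pointers, and the write pointers at time $t$. The pointers are generated by pointer functions as described in Pointer Networks (Vinyals et al., 2015). In practice, there are two pairs of read pointers and one pair of write pointers, specifying the start and end positions of reading and writing. Additionally, positional embeddings (Vaswani et al., 2017) are combined with the character embeddings to provide the model with relative positional information.

The read pointers then read a sub-expression $x'_t$ from the memory and send it to the selected module $M_{k_t}$,

$$y_t = M_{k_t}(x'_t),$$

where $y_t$ is the output of module $M_{k_t}$, which is then written into memory at the positions given by $w_t$.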
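A single select/read/write interaction on the character memory can be sketched as follows. The module name and pointer positions are supplied by hand here, whereas in the model they are sampled from the sub-policies. The sketch also simplifies the mechanism to one read span, uses inclusive end indices, splices the answer over the write span, and pads unused slots with a blank character; all of these are assumptions, as are the names `ism_step` and `PAD`.

```python
from typing import Callable, Dict, List, Tuple

PAD = " "  # hypothetical filler character for unused memory slots


def ism_step(
    memory: List[str],
    modules: Dict[str, Callable[[str], str]],
    module_name: str,
    read_span: Tuple[int, int],
    write_span: Tuple[int, int],
) -> List[str]:
    """Apply one select/read/write step and return the updated memory.

    Spans are (start, end) slot indices with the end inclusive (an assumption).
    """
    r_start, r_end = read_span
    w_start, w_end = write_span
    # Read the sub-expression and send it to the selected module.
    sub_expression = "".join(memory[r_start:r_end + 1])
    answer = modules[module_name](sub_expression)
    # Write the answer over the write span, then restore the fixed length.
    text = "".join(memory[:w_start]) + answer + "".join(memory[w_end + 1:])
    text = text.rstrip(PAD)
    if len(text) > len(memory):
        raise ValueError("answer does not fit into memory")
    return list(text.ljust(len(memory), PAD))


# Stand-ins for lower-level skill modules (plain functions, not trained networks).
modules = {
    "mul": lambda e: str(int(e[0]) * int(e[2])),
    "add": lambda e: str(sum(int(part) for part in e.split("+"))),
}

memory = list("2+3*4".ljust(8, PAD))
memory = ism_step(memory, modules, "mul", (2, 4), (2, 4))
after_mul = "".join(memory).rstrip(PAD)   # "2+12"
memory = ism_step(memory, modules, "add", (0, 3), (0, 3))
after_add = "".join(memory).rstrip(PAD)   # "14"
```

The memory length stays fixed across steps, so the same encoder can be applied at every time step.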


3.2 Optimization

Once the ISM has generated the whole action trajectory $\tau = (a_1, \dots, a_T)$, where $T$ is the number of selected skill modules, it outputs an answer.

Finally, the ISM gets reward 1 when it gives the correct answer. If not, the reward is negative, based on character-level similarity to the solution.
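A reward of this shape can be sketched as below. The text does not specify the similarity measure, so the sketch assumes the matching ratio from Python's `difflib.SequenceMatcher` and a penalty of `similarity - 1`; both choices and the function name `reward` are illustrative assumptions.

```python
from difflib import SequenceMatcher


def reward(prediction: str, target: str) -> float:
    """Return 1.0 for an exact match, otherwise a negative similarity-based score.

    The penalty (similarity - 1) is an assumed form; it lies in [-1, 0) and is
    closer to zero when the prediction shares more characters with the target.
    """
    if prediction == target:
        return 1.0
    similarity = SequenceMatcher(None, prediction, target).ratio()
    return similarity - 1.0
```

With this choice, a prediction that gets some characters right is penalized less than one that shares nothing with the solution, which gives the policy a graded signal before it reaches exact answers.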

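Among reinforcement learning methods, Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an online policy gradient approach that achieves state-of-the-art results on many benchmark tasks. Therefore, we use PPO to train ISMs. We sample trajectories from the policy $\pi_\theta$, where $\theta$ denotes the model parameters. For every state $s_t$ and sampled action $a_t$, we compute gradients to maximize the following objective function:

$$J(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta) A_t,\ \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\right) + \beta H\left(\pi_\theta(\cdot \mid s_t)\right)\right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

where $A_t$ is the advantage function representing the discounted reward, $\epsilon$ is the clipping parameter, $H$ is an entropy regularizer that discourages premature convergence to suboptimal policies (Mnih et al., 2016), and $\beta$ is the coefficient that balances exploration and exploitation, which is discussed with the CTCS framework (see Section 3.3).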
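As a numeric illustration, the standard PPO clipped surrogate with an entropy bonus can be computed as in the sketch below. It assumes the usual formulation from Schulman et al. (2017) and takes log-probabilities, advantages, and entropies as plain lists. The function name `ppo_objective` and the default value of `beta` are assumptions; the clipping value 0.2 is the one reported in the experiments.

```python
import math
from typing import Sequence


def ppo_objective(
    new_log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    entropies: Sequence[float],
    clip_eps: float = 0.2,
    beta: float = 0.01,  # placeholder value; the entropy coefficient is adjusted by the teacher
) -> float:
    """Mean clipped surrogate plus beta-weighted entropy bonus (to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv, ent in zip(new_log_probs, old_log_probs, advantages, entropies):
        ratio = math.exp(new_lp - old_lp)                     # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv) + beta * ent
    return total / len(advantages)
```

The `min` keeps the update pessimistic: when the probability ratio leaves the interval set by `clip_eps` in the direction that would increase the objective, the clipped term takes over and the sample stops contributing gradient.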

3.3 Curriculum Teacher and Continual-learning Student (CTCS)

We propose the Curriculum Teacher and Continual-learning Student (CTCS) framework to help the model acquire knowledge efficiently.

The CTCS framework is illustrated in Figure 4. We are given a set of tasks $\mathcal{T}_1, \dots, \mathcal{T}_N$ ordered by increasing difficulty, and each task $\mathcal{T}_i$ contains a set of data samples. The curriculum teacher gives tasks in the order $\mathcal{T}_1, \mathcal{T}_2, \dots, \mathcal{T}_N$, switching to the next task only when the student performs perfectly on the current task. While the student learns a task, the curriculum teacher uses a difficulty sampling strategy to draw from its data samples.
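The task-switching rule can be sketched as a simple loop. The student is abstracted as two callables, `train_step` and `solve`, and the epoch budget, the uniform pass over samples (in place of difficulty sampling), and all names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (expression, expected answer)


def run_curriculum(
    tasks: Sequence[Sequence[Sample]],
    train_step: Callable[[Sample], None],
    solve: Callable[[str], str],
    max_epochs_per_task: int = 100,
) -> List[int]:
    """Train on tasks in order; return the epochs spent on each mastered task."""
    epochs_used = []
    for samples in tasks:
        for epoch in range(1, max_epochs_per_task + 1):
            for sample in samples:
                train_step(sample)
            # The teacher switches tasks only after a perfect score.
            if all(solve(x) == y for x, y in samples):
                epochs_used.append(epoch)
                break
        else:
            raise RuntimeError("student did not master the task in time")
    return epochs_used


# Toy student that simply memorizes answers, used only to exercise the loop.
knowledge = {}
epochs = run_curriculum(
    tasks=[[("1+1", "2")], [("12+9", "21"), ("3*4", "12")]],
    train_step=lambda s: knowledge.__setitem__(s[0], s[1]),
    solve=lambda x: knowledge.get(x, ""),
)
```

The strict all-correct condition reflects the requirement that higher-level modules can only rely on lower-level ones that have been fully mastered.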

Figure 4: Curriculum Teacher and Continual-learning Student (CTCS) framework.

Difficulty Sampling encourages learning difficult samples. Unlike most problems, arithmetic learning needs precise calculation, which requires complete mastery of the training samples. However, the model tends to reach good, but not perfect, performance. Inspired by Deliberate Practice, a common learning method for human beings, we use difficulty sampling to help the student achieve complete mastery.

To formalize, the difficulty score $d_i$ of sample $i$ is the total number of incorrect attempts on that sample. The probability of drawing each sample is then determined by a parameterized function:
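As a purely illustrative sketch of sampling in proportion to difficulty, the code below assumes a softmax over difficulty scores with a temperature. This functional form, the temperature default, and the function names are assumptions and should not be read as the paper's stated equation.

```python
import math
import random
from typing import List, Sequence


def sampling_probabilities(difficulty: Sequence[int], temperature: float = 10.0) -> List[float]:
    """Softmax over difficulty scores; an assumed form of the sampling rule."""
    top = max(difficulty)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp((d - top) / temperature) for d in difficulty]
    norm = sum(weights)
    return [w / norm for w in weights]


def draw_sample(difficulty: Sequence[int], rng: random.Random, temperature: float = 10.0) -> int:
    """Draw a sample index, favouring samples with more incorrect attempts."""
    probs = sampling_probabilities(difficulty, temperature)
    return rng.choices(range(len(difficulty)), weights=probs, k=1)[0]
```

Samples that the student keeps getting wrong are drawn more often, while samples with equal difficulty scores remain equally likely.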


Parameter Adjustment encourages or discourages the student’s exploration. In reinforcement learning, adding an entropy term controlled by a coefficient to the loss is a commonly used technique (Mnih et al., 2016) to discourage premature convergence to suboptimal policies. However, the extent to which the student should be encouraged to explore is a long-standing issue known as the exploration-exploitation dilemma (Kaelbling et al., 1996). Intuitively, exploration should be encouraged when the student has difficulty with some samples. Therefore, we employ the teacher to help the student change its exploration strategy in keeping with its performance. To be specific, the entropy coefficient is:


where $d$ is the difficulty score described in difficulty sampling. As shown in Section 5, the difficulty sampling and parameter adjustment methods are critical in achieving perfect performance.
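The intuition that exploration should increase with difficulty can be sketched with a simple monotone schedule. The linear form, the cap, the constants, and the function name below are all assumptions chosen for illustration, not the paper's formula.

```python
def entropy_coefficient(
    difficulty_score: int,
    base: float = 0.01,   # assumed coefficient for samples never answered incorrectly
    gain: float = 0.01,   # assumed increase per incorrect attempt
    cap: float = 0.1,     # assumed upper bound on the coefficient
) -> float:
    """Entropy coefficient that grows with the difficulty score, up to a cap."""
    return min(base + gain * difficulty_score, cap)
```

A larger coefficient weights the entropy term more heavily in the objective, so the student explores more on samples it has repeatedly failed.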

4 Experiments

Arithmetic Expression Calculation

To train our model to calculate arithmetic expressions with curriculum learning, we define several sub-tasks, from basic tasks such as single-digit addition to compositional tasks such as multi-digit division. We then train our model on the tasks in order of increasing difficulty. The full curriculum list is shown in the appendix. The code is available at (Anonymous).

The arithmetic expression data is generated through a random process. An expression of length 10 contains approximately 3 arithmetic operators on average.

We compare our model with two baseline models:

  • Seq2seq LSTM: A sequence-to-sequence model (Sutskever et al., 2014) with LSTM (Hochreiter and Schmidhuber, 1997) as encoder and decoder.

  • Neural GPU: An arithmetic algorithm learning model proposed by Kaiser and Sutskever (2015). We use their open-source implementation posted on GitHub.

To make an objective comparison, we also apply the same curriculum learning method to baseline models. The results are shown in Figure 5 and Figure 6.

Figure 5: Accuracy on the training set of size 1000. The training set changes from easy tasks to difficult tasks as curriculum learning is applied. Every sudden drop indicates a task switch.
Figure 6: Accuracy on the test set of size 1000. The test set contains arithmetic expressions of length 10. As the figure shows, the accuracy of both Neural GPU and LSTM stays close to zero.

Task                     Model        Length 5   Length 10   Length 20
Addition                 Ours         100%       100%        17%
Addition                 Neural GPU   100%       98%         84%
Addition                 LSTM         61%        33%         16%
Subtraction              Ours         100%       100%        19%
Subtraction              Neural GPU   100%       72%         43%
Subtraction              LSTM         95%        48%         20%
Multiplication           Ours         100%       100%        0%
Multiplication           Neural GPU   30%        3%          0%
Multiplication           LSTM         12%        3%          0%
Division                 Ours         100%       27%         15%
Division                 Neural GPU   30%        29%         21%
Division                 LSTM         28%        23%         19%
Expression calculation   Ours         100%       100%        78%
Expression calculation   Neural GPU   48%        2%          0%
Expression calculation   LSTM         5%         0%          0%

Table 1: Results of addition, subtraction, multiplication, division, and expression calculation at different lengths of arithmetic expression in the test set. Task expressions contain one specific arithmetic operator, except for the expression calculation task.

As the results show, both baseline models strive to memorize training samples, achieving relatively high accuracy on the training set but nearly zero accuracy on the test set. The LSTM model shows a powerful capability for memorizing training samples. Every time the task switches, its performance suddenly drops to zero and then rises to a high level again. Although the Neural GPU seems to have better generalization ability, it still performs poorly on the test set.

In contrast, our model achieves almost 100 percent correctness in the experiment, which shows the effectiveness and generalization ability of our model.

Sub-task Performance

We evaluate our model on different sub-tasks to see its performance on various arithmetic operations. The results are shown in Table 1. It is noteworthy that the answers to division problems are relatively small, so the models can guess the answer, resulting in nearly 20% accuracy on division. As the results show, our model reaches 100% accuracy far more often than the baseline models, especially in the expression calculation task.


The gradient-based optimization is performed using the Adam update rule (Kingma and Ba, 2014). Every RNN in our model is a GRU (Chung et al., 2014) with hidden size 100. The parameter used in Equation 10 is 10. The number of consecutive samples described in difficulty sampling is 64. In PPO, the reward discount parameter is 0.99, and the clipping parameter is 0.2.

5 Discussion

5.1 Ablation Study

Curriculum Learning and Continual Learning

To test whether our model can make use of prior knowledge when meeting a new task, we challenge it with learning a new arithmetic operation: modulo. We compare our proposed model with a baseline model that learns from scratch. The results are shown in Figure 7. Without curriculum learning and continual learning, the model fails to give any correct solutions, which shows the necessity of both.

Figure 7: Results for ablation of curriculum learning and continual learning. The test set contains arithmetic expressions of length 10 with the modulo operator.

Difficulty Sampling and Parameter Adjustment

In the Curriculum Teacher and Continual-learning Student (CTCS) framework, we present difficulty sampling and parameter adjustment to help the model reach perfect performance. Their effectiveness is illustrated in Figure 8. Without difficulty sampling and parameter adjustment, the model converges to a suboptimal strategy. This shows that difficulty sampling and parameter adjustment are important in helping the model achieve perfect mastery.

Figure 8: Results for ablation of Difficulty Sampling and Parameter Adjustment. The test set contains arithmetic expressions of length 10.

6 Conclusion

In this paper, we propose a pure neural model to solve the arithmetic expression calculation problem. Specifically, we use the Multi-level Hierarchical Reinforcement Learning (MHRL) framework to factorize a complex arithmetic operation into several simpler operations. We also present the Curriculum Teacher and Continual-learning Student (CTCS) framework, in which the teacher adopts difficulty sampling and parameter adjustment strategies to supervise the student. Together, these contribute to solving the arithmetic expression calculation problem. Experiments show that our model significantly outperforms previous methods for arithmetic expression calculation.


References

  • Kaiser and Sutskever [2015] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
  • Freivalds and Liepins [2017] Karlis Freivalds and Renars Liepins. Improving the neural GPU architecture for algorithm learning. arXiv preprint arXiv:1702.08727, 2017.
  • Kalchbrenner et al. [2015] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. CoRR, abs/1507.01526, 2015.
  • Graves et al. [2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
  • Kurach et al. [2016] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. ERCIM News, 2016, 2016.
  • Sutton et al. [1999] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112:181–211, 1999.
  • Dietterich [2000] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res., 13:227–303, 2000.
  • Zaremba et al. [2016] Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms from examples. In ICML, 2016.
  • Price et al. [2016] Eric Price, Wojciech Zaremba, and Ilya Sutskever. Extensions and limitations of the neural GPU. CoRR, abs/1611.00736, 2016.
  • Tessler et al. [2017] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft. In AAAI, volume 3, page 6, 2017.
  • Baranes and Oudeyer [2013] Adrien Baranes and Pierre-Yves Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61:49–73, 2013.
  • Bengio et al. [2009] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009.
  • Wu and Tian [2017] Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. 2017.
  • Matiisen et al. [2017] Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. arXiv preprint arXiv:1707.00183, 2017.
  • Mcclelland et al. [1995] J L Mcclelland, B L Mcnaughton, and Randall C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102 3:419–57, 1995.
  • Parisi et al. [2018] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. CoRR, abs/1802.07569, 2018.
  • Rusu et al. [2016] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
  • Vinyals et al. [2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • Kaelbling et al. [1996] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9 8:1735–80, 1997.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Chung et al. [2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.