Sequential Coordination of Deep Models for Learning Visual Arithmetic
Abstract
Achieving machine intelligence requires a smooth integration of perception and reasoning, yet models developed to date tend to specialize in one or the other; sophisticated manipulation of symbols acquired from rich perceptual spaces has so far proved elusive. Consider a visual arithmetic task, where the goal is to carry out simple arithmetical algorithms on digits presented under natural conditions (e.g. handwritten, placed randomly). We propose a twotiered architecture for tackling this problem. The lower tier consists of a heterogeneous collection of information processing modules, which can include pretrained deep neural networks for locating and extracting characters from the image, as well as modules performing symbolic transformations on the representations extracted by perception. The higher tier consists of a controller, trained using reinforcement learning, which coordinates the modules in order to solve the highlevel task. For instance, the controller may learn in what contexts to execute the perceptual networks and what symbolic transformations to apply to their outputs. The resulting model is able to solve a variety of tasks in the visual arithmetic domain, and has several advantages over standard, architecturally homogeneous feedforward networks including improved sample efficiency.
1 Introduction
Recent successes in machine learning have shown that difficult perceptual tasks can be tackled efficiently using deep neural networks (LeCun et al., 2015). However, many challenging tasks may be most naturally solved by combining perception with symbol manipulation. The act of grading a question on a mathematics exam, for instance, requires both sophisticated perception (identifying discrete symbols rendered in many different writing styles) and complex symbol manipulation (confirming that the rendered symbols correspond to a correct answer to the given question). In this work, we address the question of creating machine learning systems that can be trained to solve such perceptuosymbolic problems from a small number of examples. In particular, we consider, as a first step toward fullblown exam question grading, the visual arithmetic task, where the goal is to carry out basic arithmetic algorithms on handwritten digits embedded in an image, with the wrinkle that an additional symbol in the image specifies which of a handful of algorithms (e.g. max, min, +, *) should be performed on the provided digits.
One straightforward approach to solving the visual arithmetic task with machine learning would be to formulate it as a simple classification problem, with the image as input, and an integer giving the correct answer to the arithmetic problem posed by the image as the label. A convolutional neural network (CNN; LeCun et al., 1998) could then be trained via stochastic gradient descent to map from input images to correct answers. However, it is clear that there is a great deal of structure in the problem which is not being harnessed by this simple approach, and which would likely improve the sample efficiency of any learning algorithm that was able to exploit it. While the universal approximation theorem (Hornik et al., 1989) suggests that an architecturally homogeneous network such as a CNN should be able to solve any task when it is made large enough and given sufficient data, imposing model structure becomes important when one is aiming to capture humanlike abilities of strong generalization and learning from small datasets (Lake et al., 2016).
In particular, in this instance we would like to provide the learner with access to modules implementing information processing functions that are relevant for the task at hand — for example, modules that classify individual symbols in the image, or modules that perform symbolic computations on stored representations. However, it is not immediately clear how to include such modules in standard deep networks; the classifiers need to somehow be applied to the correct portion of the image, while the symbolic transformations need to be applied to the correct representations at the appropriate time and, moreover, will typically be nondifferentiable, precluding the possibility of training via backpropogation.
In this work we propose an approach that solves this type of task in two steps. First, the machine learning practitioner identifies a collection of modules, each performing an elementary information processing function that is predicted to be useful in the domain of interest, and assembles them into a designed information processing machine called an interface (Zaremba et al., 2016) that is coupled to the external environment. Second, reinforcement learning (RL) is used to train a controller to make use of the interface; use of RL alleviates any need for the interface to be differentiable. For example, in this paper we make use of an interface for the visual arithmetic domain that contains: a discrete attention mechanism; three pretrained perceptual neural networks that classify digits/classify arithmetic symbols/detect salient locations (respectively); several modules performing basic arithmetic operations on stored internal representations. Through the use of RL, a controller learns to sequentially combine these components to solve visual arithmetic tasks.
Contributions. We propose a novel recipe for constructing agents capable of solving complex tasks by sequentially combining provided information processing modules. The role of the system designer is limited to choosing a pool of modules and gathering training data in the form of inputoutput examples for the target task. A controller is then trained by RL to use the provided modules to solve tasks. We evaluate our approach on a family of visual arithmetic tasks wherein the agent is required to perform arithmetical reduction operations on handwritten digits in an image. Our experiments show that the proposed model can learn to solve tasks in this domain using significantly fewer training examples than unstructured feedforward networks.
The remainder of the article is organized as follows. In Section 2 we describe our general approach and lay down the required technical machinery. In Section 3 we describe the visual arithmetic task domain in detail, and show how our approach may be applied there. In Section 4 we present empirical results demonstrating the advantages of our approach as it applies to visual arithmetic, before reviewing related work in Section 5 and concluding with a discussion in Section 6.
2 Technical Approach
2.1 Partially Observable Markov Decision Processes
Our approach makes use of standard reinforcement learning formalisms (Sutton and Barto, 1998). The external world is modelled as a Partially Observable Markov Decision Process (POMDP), . Each time step is in a state , based upon which it emits an observation that is sent to the learning agent. The agent responds with an action , which causes the environment to emit a reward . Finally, the state is stochastically updated according to ’s dynamics, . This process repeats for time steps. The agent is assumed to choose according to a parameterized policy that maps from observationaction histories to distributions over actions, i.e. , where and is a parameter vector.
2.2 Interfaces
We make extensive use of the idea of an interface as proposed in Zaremba et al. (2016). An interface is a designed, domainspecific machine that mediates a learning agent’s interaction with the external world, providing a representation (observation and action spaces) which is intended to be more conducive to learning than the raw representation provided by the external world. In this work we formalize an interface as a POMDP distinct from , with its own state, observation and action spaces. The interface is assumed to be coupled to the external world in a particular way; each time step sends an observation to , which potentially alters its state, after which emits its own observation to the agent. When the agent responds with an action, it is first processed by , which once again has the opportunity to change its state, after which sends an action to . The agent may thus be regarded as interacting with a POMDP comprised of the combination of and . ’s observation and action spaces are the same as those of , its state is the concatenation of the states of and , and its dynamics are determined by the nature of the coupling between and .
Zaremba et al. (2016) learn to control interfaces in order to solve purely algorithmic tasks, such as copying lists of abstractly (rather than perceptually) represented digits. One of the main insights of the current work is that the idea of interfaces can be extended to perceptuallyrich domains by incorporating pretrained deep networks to handle the perceptual components.
2.3 ActorCritic Reinforcement Learning
We train controllers using the actorcritic algorithm (see e.g. (Sutton and Barto, 1998; Degris et al., 2012; Mnih et al., 2016; Schulman et al., 2015)). We model the controller as a policy that is differentiable with respect to its parameters . Assume from the outset that the goal is to maximize the expected sum of discounted rewards when following :
(1) 
where is a trajectory, is the probability of that trajectory under , and is a discount factor. We look to maximize this objective using gradient ascent; however, it is not immediately clear how to compute , since the probability of a trajectory is a function of the environment dynamics, which are generally unknown. Fortunately, it can be shown that an unbiased estimate of can be obtained by differentiating a surrogate objective function that can be estimated from sample trajectories. Letting , the surrogate objective is:
(2) 
The standard REINFORCE algorithm (Williams, 1992) consists in first sampling a batch of trajectories using , then forming an empirical estimate of . is then computed, and the parameter vector updated using standard gradient ascent or one of its more sophisticated counterparts (e.g. ADAM; Kingma and Ba (2014)).
The above gradient estimate is unbiased (i.e. ) but can suffer from high variance. This variance can be somewhat reduced by the introduction of a baseline function into Equation (2):
(3) 
It can be shown that including does not bias the gradient estimate and may lower its variance if chosen appropriately. The value function for the current policy, , is a good choice for (Sutton et al., 2000) — however this will rarely be known. A typical compromise is to train a function , parameterized by a vector , to approximate at the same time as we are training . Specifically, this is achieved by minimizing a samplebased estimate of .
We employ two additional standard techniques (Mnih et al., 2016). First, we have share the majority of its parameters with , i.e. . This allows the controller to learn useful representations even in the absence of reward, which can speed up learning when reward is sparse. Second, we include in the objective a term which encourages to have high entropy, thereby favouring exploration.
Overall, given a batch of sample trajectories from policy , we update in the direction of the gradient of the following surrogate objective:
(4) 
where and are positive hyperparameters, , and care is taken not to backpropagate through as the first term does not provide a useful training signal for .
3 Visual Arithmetic
We now describe the Visual Arithmetic task domain in detail, as well as the steps required to apply our approach there. We begin by describing the external environment , before describing the interface , and conclude the section with a specification of the manner in which and are coupled in order to produce the POMDP with which the controller ultimately interacts.
3.1 External Environment
Tasks in the Visual Arithmetic domain can be cast as image classification problems. For each task, each input image consists of an grid, where each grid cell is either blank or contains a digit or letter from the Extended MNIST dataset (Cohen et al., 2017). Unless indicated otherwise, we use . The correct label corresponding to an input image is the integer that results from applying a specific (taskvarying) reduction operation to the digits present in the image. We consider 5 tasks within this domain, grouped into two kinds. Changing the task may be regarded as changing the external environment .
Single Operation Tasks. In the first kind of task, each input image contains a randomly selected number of digits (2 or 3 unless otherwise stated) placed randomly on the grid, and the agent is required to output an answer that is a function of the both the specific task being performed and the digits displayed in the image. We consider 4 tasks of this kind: Sum, Product, Maximum and Minimum. Example input images are shown in Figure 1, Top Row.
Combined Task. We next consider a task that combines the four Single Operation tasks. Each input example now contains a capital EMNIST letter in addition to 2to3 digits. This letter indicates which reduction operation should be performed on the digits: indicates add/sum, indicates multiplication/product, indicates maximum, indicates minimum. Example input images are shown in Figure 1, Bottom Row. Succeeding on this task requires being able to both carry out all the required arithmetic algorithms and being able to identify, for any given input instance, which of the possible algorithms should be executed.
3.2 Interface
We now describe the interface that is used to solve tasks in this domain. The first step is to identify information processing functions that we expect to be useful. We can immediately see that for Visual Arithmetic, it will be useful to have modules implementing the following functions:

Detect and attend to salient locations in the image.

Classify a digit or letter in the attended region.

Manipulate symbols to produce an answer.
We select modules to perform each of these functions and then assemble them into an interface which will be controlled by an agent trained via reinforcement learning. A single interface, depicted in Figure 2, is used to solve the various Visual Arithmetic tasks described in the previous section.
This interface includes 3 pretrained deep neural networks. Two of these are instances of LeNet (LeCun et al., 1998), each consisting of two convolutional/maxpool layers followed by a fullyconnected layer with 128 hidden units and RELU nonlinearities. One of these LeNets, the op classifier, is pretrained to classify capital letters from the EMNIST dataset. The other LeNet, the digit classifier, is pretrained to classify EMNIST digits. The third network is the salience detector, a multilayer perceptron with 3 hidden layers of 100 units each and RELU nonlinearities. The salience network is pretrained to output a salience map when given as input scenes consisting of randomly scattered EMNIST characters (both letters and digits).
3.3 Coupling and Reinforcement Learning Details
In the Visual Arithmetic setting, may be regarded as a degenerate POMDP which emits the same observation, the image containing the EMNIST letters/digits, every time step. sends the contents of its store field (see Figure 2) to every time step as its action. During training, responds to this action with a reward that depends on both the time step and whether the action sent to corresponds to the correct answer to the arithmetic problem represented by the input image. Specifically, for all but the final time step, a reward of is provided if the answer is correct, and otherwise. On the final time step, a reward of is provided if the answer is correct, and otherwise. Each episode runs for time steps. At test time, no rewards are provided and the contents of the interface’s store field on the final time step is taken as the agent’s guess for the answer to the arithmetic problem posed by the input image. For the controller, we employ a Long ShortTerm Memory (LSTM) (Hochreiter and Schmidhuber, 1997) with 128 hidden units. This network accepts observations provided by the interface (see Figure 2) as input, and yields as output both (specifically a softmax distribution) from which an action is sampled, and (which is only used during training). The weights of the LSTM are updated according to the actorcritic algorithm discussed in Section 2.3.
4 Experiments
In this section, we consider experiments applying our approach to the Visual Arithmetic domain. These experiments involve the highlevel tasks described in Section 3.1. For all tasks, our reinforcement learning approach makes use of the interface described in Section 3.2 and the details provided in Section 3.3.
Our experiments look primarily at how performance is influenced by the number of external environment training samples provided. For all sample sizes, training the controller with reinforcement learning requires many thousands of experiences, but all of those experiences operate on the small provided set of inputoutput training samples from the external environment. In other words, we assume the learner has access to a simulator for the interface, but not one for the external environment. We believe this to be a reasonable assumption given that the interface is designed by the machine learning practitioner and consists of a collection of information processing modules which will, in most cases, take the form of computer programs that can be executed as needed.
We compare our approach against convolutional networks trained using crossentropy loss, the de facto standard when applying deep learning to image classification tasks. These feedforward networks can be seen as interacting directly with the external environment (omitting the interface) and running for a single time step, i.e. . The particular feedforward architecture we experiment with is the LeNet (LeCun et al., 1998), with either 32, 128 or 512 units in the fullyconnected layer. Larger, more complex networks of course exist, but these will likely require much larger amounts of data to train, and here we are primarily concerned with performance at small sample sizes. For training the convolutional networks, we treat all tasks as classification problems with 101 classes. The first 100 classes correspond to the integers 099. All integers greater than or equal to 100 (e.g. when multiplying 3 digits) are subsumed under the 101st class.
Experiments showing the sample efficiency of the candidate models on the Single Operation tasks are shown in Figure 3. Similar results for the Combined task are shown in Figure 4. In both cases, our reinforcement learning approach is able to leverage the inductive bias provided by the interface to achieve good performance at significantly smaller sample sizes than the relatively architecturally homogeneous feedforward convolutional networks.
5 Related Work
Production Systems & Cognitive Architectures. The idea of using a central controller to coordinate the activity of a collection of information processing components has its roots in research on production systems in classical artificial intelligence, and cognitive architectures in cognitive science. Production systems date back to Newell and Simon’s pioneering work on the study of highlevel human problem solving, manifested most clearly in the General Problem Solver (GPS; Newell and Simon, 1963). While production systems have fallen out of favour in mainstream AI, they still enjoy a strong following in the cognitive science community, forming the core of nearly every prominent cognitive architecture including ACTR (Anderson, 2009), SOAR (Newell, 1994; Laird, 2012), and EPIC (Kieras and Meyer, 1997). However, the majority of these systems use handcoded, symbolic controllers; we are not aware of any work that has applied recent advances in reinforcement learning to learn controllers for these systems for difficult tasks.
Sequential Models for Supervised Learning. A related body of work concerns recurrent neural networks applied to supervised learning problems. Mnih et al. (2014), for example, use reinforcement learning to train a recurrent network to control the location of an attentional window in order to classify images containing MNIST digits placed on cluttered backgrounds. Our approach may be regarded as providing a recipe for building similar kinds of models while placing greater emphasis on tasks with a difficult algorithmic components and the use of structured interfaces.
Neural Abstract Machines. One obvious point of comparison to the current work is recent research on deep neural networks designed to learn to carry out algorithms on sequences of discrete symbols. Some of these frameworks, including the Differentiable Forth Interpreter (Riedel and Rocktäschel, 2016) and TerpreT (Gaunt et al., 2016), achieve this by explicitly generating code, while others, including the Neural Turing Machine (NTM; Graves et al., 2014), Neural RandomAccess Machine (NRAM; Kurach et al., 2015), Neural Programmer (NP; Neelakantan et al., 2015), and Neural ProgrammerInterpreter (NPI; Reed and De Freitas, 2015) avoid generating code and generally consist of a controller network that learns to perform actions using a differentiable external computational medium (i.e. a differentiable interface) in order to carry out an algorithm.
Our approach is most similar to the latter category, the main difference being that we have elected not to require the external computational medium to be differentiable, which provides it with greater flexibility in terms of the components that can be included in the interface. In fact, our work is most similar to Zaremba et al. (2016), which also uses reinforcement learning to learn algorithms, and from which we borrowed the idea of an interface, the main difference being that we have included deep networks in our interfaces in order to tackle tasks with nontrivial perceptual components.
Visual Arithmetic. Past work has looked at learning arithmetic operations from visual input. Hoshen and Peleg (2016) train a multilayer perceptron to map from images of two 7digit numbers to an image of a number that is some taskspecific function of the numbers in the input images. Specifically, they look at addition, subtraction, multiplication and RomanNumeral addition. Howeover, they do not probe the sample efficiency of their method, and the digits are represented using a fixed computer font rather than being handwritten, making the perceptual portion of the task significantly easier.
Gaunt et al. (2017) addresses a task domain that is similar to Visual Arithmetic, and makes use of a differentiable codegeneration method built on top of TerpreT. Their work has the advantage that their perceptual modules are learned rather than being pretrained, but is perhaps less general since it requires all components to be differentiable. Moreover, we do not view our reliance on pretrained modules as particularly problematic given the wide array of tasks deep networks have been used for. Indeed, we view our approach as a promising way to make further use of any trained neural network, especially as facilities for sharing neural weights mature and enter the mainstream.
Modular Neural Networks. Additional work has focused more directly on the use of neural modules and adaptively choosing groups of modules to apply depending on the input. EndtoEnd Module Networks (Hu et al., 2017) use reinforcement learning to train a recurrent neural network to lay out a feedforward neural network composed of elements of a stored library of neural modules (which are themselves learnable). Our work differs in that rather than having the layout of the network depend solely on the input, the module applied at each stage (i.e. the topology of the network) may depend on past module applications within the same episode, since a decision about which module to apply is made at every time step. Systems built using our framework can, for example, use their modules to gather information about the environment in order to decide which modules to apply at a later time, a feat that is not possible with Module Networks.
In Deep Sequential Neural Networks (DSNN; Denoyer and Gallinari (2014)), each edge of a fixed directed acyclic graph (DAG) is a trainable neural module. Running the network on an input consists in moving through the DAG starting from the root while maintaining a feature vector which is repeatedly transformed by the neural modules associated with the traversed edges. At each node of the DAG, an outgoing edge is stochastically selected for traversal by a learned controller which takes the current features as input. This differs from our work, where each module may be applied many times rather than just once as is the case for the entirely feedforward DSNNs (where no module appears more than once in any path through the DAG connecting the input to the output). Finally, PathNet is a recent advance in the use of modular neural networks applied to transfer learning in RL (Fernando et al., 2017). An important difference from our work is that modules are recruited for the entire duration of a task, rather than on the more finegrained stepbystep basis used in our approach.
6 Discussion and Conclusion
There are number of possible future directions related to the current work, including potential benefits of our approach that were not explored here. These include the ability to take advantage of conditional computation; in principle, only the subset of the interface needed to carry out the chosen action needs to be executed every time step. If the interface contains many large networks or other computationally intensive modules, large speedups can likely be realized along these lines. A related idea is that of adaptive computation time; in the current work, all episodes ran for a fixed number of time steps, but it should be possible to have the controller decide when it has the correct answer and stop computation at that point, saving valuable computational resources. Furthermore, it may be beneficial to train the perceptual modules and controller simultaneously, allowing the modules to adapt to better perform the uses that the controller finds for them. Finally, the ability of reinforcement learning to make use of discrete and nondifferentiable modules opens up a wide array of possible interface components; for instance, a discrete knowledge base may serve as a long term memory. Any generally intelligent system will need many individual competencies at its disposal, both perceptual and algorithmic; in this work we have proposed one path by which a system may learn to coordinate such competencies.
We have proposed a novel approach for solving tasks that require both sophisticated perception and symbolic computation. This approach consists in first designing an interface that contains information processing modules such as pretrained deep neural networks for processing perceptual data and modules for manipulating stored symbolic representations. Reinforcement learning is then used to train a controller to use the interface to solve tasks. Using the Visual Arithmetic task domain as an example, we demonstrated empirically that the interface acts as a source of inductive bias that allows tasks to be solved using a much smaller number of training examples than required by traditional approaches.
References
 John R Anderson. How can the human mind occur in the physical universe? Oxford University Press, 2009.
 Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
 Thomas Degris, Patrick M Pilarski, and Richard S Sutton. Modelfree reinforcement learning with continuous action in practice. In American Control Conference (ACC), 2012, pages 2177–2182. IEEE, 2012.
 Ludovic Denoyer and Patrick Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014.
 Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
 Alexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. Terpret: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.
 Alexander L Gaunt, Marc Brockschmidt, Nate Kushman, and Daniel Tarlow. Differentiable programs with neural libraries. In International Conference on Machine Learning, pages 1213–1222, 2017.
 Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
 Yedid Hoshen and Shmuel Peleg. Visual learning of arithmetic operation. In AAAI, pages 3733–3739, 2016.
 Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: Endtoend module networks for visual question answering. arXiv preprint arXiv:1704.05526, 2017.
 David E Kieras and David E Meyer. An overview of the EPIC architecture for cognition and performance with application to humancomputer interaction. HumanComputer Interaction, 4(12):391–438, 1997.
 Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural randomaccess machines. arXiv preprint arXiv:1511.06392, 2015.
 John E Laird. The Soar cognitive architecture. MIT Press, 2012.
 Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. arXiv preprint arXiv:1604.00289, 2016.
 Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
 Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
 Allen Newell. Unified theories of cognition. Harvard University Press, 1994.
 Allen Newell and Herbert Alexander Simon. Gps: A program that simulates human thought. In E. A. Feldman and J. Feigenbaum, editors, Computers and Thought, Computers and thought. McGrawHill, New York, 1963.
 Scott Reed and Nando De Freitas. Neural programmerinterpreters. arXiv preprint arXiv:1511.06279, 2015.
 Sebastian Riedel and Tim Rocktäschel. Programming with a differentiable forth interpreter. 2016.
 John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
 Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning. MIT Press, 1998. ISBN 0262193981.
 Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Ronald J Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning, 8(34):229–256, 1992.
 Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms from examples. In Proceedings of the International Conference on Machine Learning, 2016.