Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization
Adversarial self-play in two-player games has delivered impressive results when used with reinforcement learning algorithms that combine deep neural networks and tree search. Algorithms like AlphaZero and Expert Iteration learn tabula-rasa, producing highly informative training data on the fly. However, the self-play training strategy is not directly applicable to single-player games. Recently, several practically important combinatorial optimization problems, such as the traveling salesman problem and the bin packing problem, have been reformulated as reinforcement learning problems, increasing the importance of enabling the benefits of self-play beyond two-player games. We present the Ranked Reward (R2) algorithm which accomplishes this by ranking the rewards obtained by a single agent over multiple games to create a relative performance metric. Results from applying the R2 algorithm to instances of a two-dimensional bin packing problem show that it outperforms generic Monte Carlo tree search, heuristic algorithms and reinforcement learning algorithms not using ranked rewards.
Alexandre Laterre, Yunguan Fu, Mohamed Khalil Jabri, Alain–Sam Cohen, David Kas, Karl Hajjar, Torbjørn S. Dahl, Amine Kerkeni, Karim Beguir
1 Introduction and Motivation
Reinforcement learning (RL) algorithms that combine neural networks and tree search have delivered outstanding successes in two-player games such as Go, Chess, Shogi, and Hex. One of the main strengths of algorithms like AlphaZero  and Expert Iteration  is their capacity to learn tabula rasa through self-play. Historically, self-play has also produced great results in the game of Backgammon . Using this strategy removes the need for training data from human experts and provides an agent with a well-matched adversary, which facilitates learning.
While self-play algorithms have proven successful for two-player games, there has been little work on applying similar principles to single-player games . These games include several well-known combinatorial problems that are particularly relevant to industry and represent real-world optimization challenges, such as the traveling salesman problem (TSP) and the bin packing problem.
This paper describes the Ranked Reward (R2) algorithm and results from its application to a 2D bin packing problem formulated as a single-player Markov decision process (MDP). The R2 algorithm uses a deep neural network to estimate a policy and a value function, as well as Monte Carlo tree search (MCTS) for policy improvement. In addition, it uses a reward ranking mechanism to build a single-player training curriculum that provides advantages comparable to those produced by self-play in competitive multi-agent environments.
The R2 algorithm offers a new generic method for producing approximate solutions to NP-hard optimization problems. Generic optimization approaches are typically based either on algorithms such as integer programming, which provide optimality guarantees at a high computational expense, or on heuristic methods, which are computationally lighter but may produce unsatisfactory, suboptimal solutions. The R2 algorithm has the advantage of outperforming heuristic approaches while scaling better than optimization solvers. We present results showing that it surpasses a range of existing algorithms on a 2D bin packing problem, including MCTS, the Lego heuristic algorithm, as well as RL algorithms such as A3C and PPO.
In Section 2 of this paper, we summarize the current state-of-the-art in deep learning for games with large search spaces. Then, in Section 3, we present a single-player MDP formulation of the 2D bin packing problem. In Section 4 we describe the R2 algorithm using deep RL and tree search along with a reward ranking mechanism. Section 5 presents our experiments and results, and discusses the implications of using different reward ranking thresholds. Finally, Section 6 summarizes current limitations of our algorithm and future research directions.
2 Deep Learning for Combinatorial Optimization
Combinatorial optimization problems are widely studied in computer science and mathematics. A large number of them belong to the NP-hard class of problems. For this reason, they have traditionally been solved using heuristic methods [13, 4]. However, these approaches may need hand-crafted adaptations when applied to new problems because of their problem-specific nature.
Deep learning algorithms potentially offer an improvement on traditional optimization methods as they have provided remarkable results on classification and regression tasks . Nevertheless, their application to combinatorial optimization is not straightforward. A particular challenge is how to represent these problems in ways that allow the deployment of deep learning solutions.
One way to overcome this challenge was introduced by Vinyals et al.  through Pointer Networks, a neural architecture representing combinatorial optimization problems as sequence-to-sequence learning problems. Early Pointer Networks were trained using supervised learning methods and yielded promising results on the TSP but required datasets containing optimal solutions which can be expensive, or even impossible, to build. Using the same network architecture, but training with actor-critic methods, removed this requirement .
Unfortunately, the constraints inherent to the bin packing problem prohibit its representation as a sequence in the same way as the TSP. In order to get around this, Hu et al.  combined a heuristic approach with RL to solve a 3D version of the problem. The main role of the heuristic is to transform the output sequence produced by the RL algorithm into a feasible solution so that its reward signal can be computed. This technique outperformed previous well-designed heuristics.
2.1 Deep Learning with Tree Search and Self-Play
Policy iteration algorithms that combine deep neural networks and tree search in a self-training loop, such as AlphaZero and Expert Iteration, have exceeded human performance on several two-player games. These algorithms use a neural network to provide a policy and/or a state value estimate for every state of the game. The tree search uses the neural network’s output to focus on moves with both high probabilities according to the policy and high value estimates. The value function also removes any need for Monte Carlo roll-outs when evaluating leaf nodes. Therefore, using a neural network to guide the search reduces both the breadth and the depth of the searches required, leading to a significant speedup. The tree search, in turn, helps to raise the performance of the neural network by providing improved MCTS-based policies during training.
Self-play allows these algorithms to learn from the games played by both players. It also removes the need for potentially expensive training data, often produced by human experts. Such data may be biased towards human strategies, possibly away from better solutions. Another significant benefit of self-play is that an agent will always face an opponent with a similar performance level. This facilitates learning by providing the agent with just the right curriculum in order for it to keep improving . If the opponent is too weak, anything the agent does will result in a win and it will not learn to get better. If the opponent is too strong, anything the agent does will result in a loss and it will never know what changes in its strategy could produce an improvement.
The main contribution of the R2 algorithm is a relative reward mechanism for single-player games, providing the benefits of self-play in single-player MDPs and potentially making policy iteration algorithms with deep neural networks and tree search effective on a range of combinatorial optimization problems.
3 Bin Packing as a Markov Decision Problem
The bin packing problem consists of a set of items to be packed into fixed-sized bins in a way that minimizes a cost function, e.g., the number of bins required. The work presented here considers an alternative version of the 2D bin packing problem. As in the work of Hu et al., this problem involves a set of rectangular items, each described by its width and height. Items can be rotated by 90 degrees, and a binary flag records whether each item is rotated or not. The position of an item placed inside the bin is given by the coordinates of its bottom-left corner, with the bottom-left corner of the bin taken as the origin. The problem also includes additional constraints that make the environment more complex and reduce the number of available positions in which an item can be placed. In particular, items may not overlap and an item’s center of gravity needs physical support. A solution to this problem is a sequence of triplets, one per item, giving the two coordinates of the item’s bottom-left corner and its rotation flag, such that all items are placed inside the bin and all constraints are satisfied. An example of how the solution is constructed is shown in Figure 2.
We formulate the problem as an MDP in which the state encodes the items and their current placement while the actions encode the possible positions and rotations of the unplaced items. The goal of the agent is to select actions in a way that minimizes the side of the minimal square bounding box enclosing the placed items. This is reflected in the terminal reward, received after all items have been placed. As defined in Equation 1 and illustrated in Figure 1, all non-terminal states receive the same fixed reward, while terminal states receive a reward that is a function of three quantities: the side of the optimal bounding box, the side of the minimal square bounding box, and the side of the bin.
Note that we use only the side of the optimal packing to compute the reward; information about the optimal positions of the items is not exploited by the algorithm. Knowing the optimal side allows us to calculate how close a given solution is to the optimum. When the size of the optimal solution is not known, the algorithm can be made generally applicable by changing the reward function, e.g. to a function of the percentage of empty space.
An initial analysis of the problem shows its exponential complexity in the number of items. Figure 3 illustrates how the number of legal moves changes at each step of the game and how the number of possible games grows with the number of items. A conservative upper bound for the number of possible games is:
The term represents the number of items left to play, while the term stands for the maximum number of playable positions and the factor accounts for the possible rotations. Decision problems with large branching factors cannot be solved optimally by brute force search. Tree search algorithms have thus emerged as a general method for identifying the best possible solution.
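The verbal description above suggests a product over the steps of the game. As a rough sketch only, assuming N items, at most P playable positions per move and two rotations per item, one bound consistent with that description is N! · (2P)^N. The function name and the parameter `max_positions` are hypothetical, and this is not claimed to be the exact expression of the original equation.

```python
from math import factorial


def games_upper_bound(n_items: int, max_positions: int) -> int:
    """Illustrative upper bound on the number of possible games.

    Assumption: at step t there are (n_items - t) items left to choose
    from, at most `max_positions` places for the chosen item, and
    2 rotations. Multiplying over all steps gives
    n_items! * (2 * max_positions) ** n_items.
    """
    return factorial(n_items) * (2 * max_positions) ** n_items


# Ten items as in the experiments; 100 positions is an arbitrary example value.
print(games_upper_bound(10, 100))
```

Even for small instances the count grows quickly, which is the motivation for guided tree search rather than exhaustive enumeration.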
4 The Ranked Reward Algorithm
When using self-play in two-player games, an agent faces a perfectly suited adversary at all times because no matter how weak or strong it is, the opponent always provides just the right level of opposition for the agent to learn from . The R2 algorithm reproduces the benefits of self-play for generic single-player MDPs by reshaping the rewards of a single agent according to its relative performance over recent games. A detailed description is given by Algorithm 1.
4.1 Ranked Rewards
The ranked reward mechanism compares each of the agent’s solutions to its recent performance so that no matter how good it gets, it will have to surpass itself to get a positive reward. Recent MDP rewards, as given in Equation 1, are used to compute a threshold value. This value is based on a given percentile of the recent rewards, i.e. the threshold is the reward value at the chosen percentile of the recent rewards. Each of the agent’s recent solutions is then given one of two ranked reward values according to whether or not it surpasses the threshold value. Doing this ensures that a fixed proportion of the games used to compute the threshold, determined by the percentile, receives the higher ranked reward and the rest the lower one. This way, the player is provided with samples of recent games labeled relative to the agent’s current performance, providing information on which policies will improve its present capabilities.
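The thresholding step can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the ranked reward values are taken to be +1 and -1, the comparison is strict (ties count as not surpassing), the percentile uses the nearest-rank definition, and the buffer length of 250 and the default percentile of 75 are arbitrary example values.

```python
import math
from collections import deque


def percentile(values, alpha):
    """Nearest-rank percentile of `values`, for alpha in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(alpha / 100 * len(ordered)))
    return ordered[rank - 1]


def ranked_reward(reward, recent_rewards, alpha=75.0):
    """Return +1 if `reward` exceeds the alpha-th percentile of recent rewards, else -1.

    Assumptions: binary values +1 / -1 and a strict comparison.
    """
    threshold = percentile(recent_rewards, alpha)
    return 1 if reward > threshold else -1


# Fixed-size buffer of recent MDP rewards (the length is an assumed value).
recent = deque(maxlen=250)
recent.extend([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# Relabel every recent game relative to the agent's current performance.
labels = [ranked_reward(r, recent, alpha=75.0) for r in recent]
```

Because the threshold is recomputed from the buffer, the same MDP reward can earn the higher label early in training and the lower label later, once the agent's typical performance has improved.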
The ranked rewards are then used as targets for the value head of a policy-value network and as the value of the end-game nodes of the MCTS. More precisely, we consider a parameterized policy-value network and an MCTS that uses this network to guide move selection during the search and to evaluate states without performing Monte Carlo roll-outs. The network takes a state as input and outputs probabilities over the action space as well as an estimate of the ranked reward of the current game. Finally, the neural network is updated to minimize the cross-entropy loss between the predicted ranked reward and the true ranked reward, as well as the cross-entropy loss between the neural network policy and the MCTS-based improved policy, plus a regularization term.
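The structure of this loss can be illustrated for a single training sample. The sketch below makes several assumptions that are not stated in the text: the value head is taken to output the probability that the ranked reward is +1, the ranked reward in {-1, +1} is mapped to a target in {0, 1}, and the regularizer is a squared-weight penalty with a hypothetical coefficient `l2_coeff`.

```python
import math

EPS = 1e-12  # guards against log(0)


def cross_entropy(target, predicted):
    """Cross-entropy between two discrete distributions given as lists."""
    return -sum(t * math.log(max(p, EPS)) for t, p in zip(target, predicted))


def r2_loss(policy, mcts_policy, win_prob, ranked_reward, weights, l2_coeff=1e-4):
    """Value loss + policy loss + weight penalty for one sample (illustrative).

    policy        -- network probabilities over actions
    mcts_policy   -- improved policy produced by the tree search
    win_prob      -- predicted probability that the ranked reward is +1 (assumed)
    ranked_reward -- +1 or -1 (assumed encoding)
    weights       -- flat list of network weights
    """
    target = (ranked_reward + 1) / 2  # map {-1, +1} to {0, 1}
    value_loss = cross_entropy([target, 1 - target], [win_prob, 1 - win_prob])
    policy_loss = cross_entropy(mcts_policy, policy)
    penalty = l2_coeff * sum(w * w for w in weights)
    return value_loss + policy_loss + penalty
```

In a real training loop these terms would be averaged over a mini-batch and differentiated by a deep learning framework; the plain-Python version only shows how the three terms combine.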
4.2 Neural Network Architecture
The neural network architecture used in this work has been kept general to emphasize the wider applicability of our approach. This was in spite of more problem-specific architectures performing better and converging faster on the problem considered.
Our network uses a visual representation of the bin and items. To represent the bin we use a binary occupancy grid indicating the presence or absence of items at discrete locations, as illustrated in Figure 4. Similarly, each item is represented by two binary feature planes, one for each rotation, also illustrated in Figure 4. If an item has already been placed in the bin, both planes are set to zero. The complete network input consists of the bin representation and a feature stack representing the individual items. Historical features (previous bin states) are not necessary as the environment is fully observable and strictly Markov.
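A small sketch may help make this encoding concrete. It assumes integer coordinates, item planes of the same size as the bin grid, and each unplaced item drawn anchored at the plane's origin; the exact plane layout used in the paper is shown only in Figure 4, so these choices and the function names are hypothetical.

```python
def encode_bin(side, placed):
    """Binary occupancy grid; `placed` holds (x, y, width, height) rectangles."""
    grid = [[0] * side for _ in range(side)]
    for x, y, w, h in placed:
        for row in range(y, y + h):
            for col in range(x, x + w):
                grid[row][col] = 1
    return grid


def encode_item(side, width, height, is_placed):
    """Two binary planes, one per rotation; all zeros once the item is placed."""
    planes = []
    for w, h in ((width, height), (height, width)):
        plane = [[0] * side for _ in range(side)]
        if not is_placed:
            # Assumed layout: the item's footprint anchored at the plane origin.
            for row in range(h):
                for col in range(w):
                    plane[row][col] = 1
        planes.append(plane)
    return planes


bin_grid = encode_bin(4, [(0, 0, 2, 1)])        # one 2x1 item in the corner
item_planes = encode_item(4, 2, 3, is_placed=False)
```

Stacking `bin_grid` with the planes of every item yields the kind of image-like input that convolutional layers can process.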
An embedding of the bin representation is produced by feeding it to a number of convolutional layers and the item features are processed by multiple in-plane convolutional layers, with each item and its rotation processed independently. This is followed by aggregate operations ensuring that the embedding doesn’t depend on the order of the items. The embeddings of the bin and of the items are then concatenated and fed to a residual tower followed by separate policy and value heads representing the full joint probability distribution over the action space and a state value estimate. (One residual block applies the following transformations sequentially to the input: a convolution of 64 filters of kernel size 5x5 with stride 1, batch normalization, an ELU non-linearity, a convolution of 64 filters of kernel size 5x5 with stride 1, batch normalization, a skip connection that adds the input to the layer and an ELU non-linearity.) This architecture contains approximately parameters.
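The order-invariance property of the aggregation step can be demonstrated with a toy example. The text does not say which aggregate operations are used, so the element-wise maximum and sum below are assumptions chosen only because they are symmetric in their arguments.

```python
def aggregate(item_embeddings):
    """Pool a list of equal-length embedding vectors into one vector.

    Assumed operations: element-wise max and element-wise sum, both of
    which ignore the order of the items (up to floating-point rounding
    for the sum).
    """
    columns = list(zip(*item_embeddings))
    pooled_max = [max(col) for col in columns]
    pooled_sum = [sum(col) for col in columns]
    return pooled_max + pooled_sum


embeddings = [[1, 4], [3, 2], [5, 0]]
pooled = aggregate(embeddings)
shuffled = aggregate([embeddings[2], embeddings[0], embeddings[1]])
```

Here `pooled` and `shuffled` are identical, so a network consuming the pooled vector sees the same input regardless of how the items are listed.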
5 Experiments and Results
In order to evaluate the effectiveness of our approach, we considered the 2D bin packing problem described above, with ten items and a bin of side . Problem instances were created by progressively and randomly dividing a quarter of the bin area into items to produce an optimal solution with no empty spaces and side .
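One way to generate such instances is to start from a single square and repeatedly cut one rectangle in two. The sketch below assumes integer side lengths and axis-aligned cuts at uniformly random positions; the exact splitting rule is not given in the text, and the function name and the example side length are hypothetical.

```python
import random


def generate_instance(side, n_items, rng=None):
    """Split a side x side square into n_items rectangles (width, height).

    The resulting items tile the square exactly, so a packing with no
    empty space and bounding box of side `side` is known to exist.
    """
    rng = rng or random.Random()
    rects = [(side, side)]
    while len(rects) < n_items:
        splittable = [i for i, (w, h) in enumerate(rects) if w > 1 or h > 1]
        if not splittable:
            raise ValueError("square too small for the requested number of items")
        w, h = rects.pop(rng.choice(splittable))
        # Only cut along a dimension that is longer than one unit.
        axes = [axis for axis, length in (("w", w), ("h", h)) if length > 1]
        if rng.choice(axes) == "w":
            cut = rng.randint(1, w - 1)
            rects += [(cut, h), (w - cut, h)]
        else:
            cut = rng.randint(1, h - 1)
            rects += [(w, cut), (w, h - cut)]
    return rects


items = generate_instance(5, 10, random.Random(0))  # example values only
```

Since the optimal side is known by construction, the reward can measure how far a solution is from the optimum without the optimal item positions ever being shown to the agent.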
For each experiment, we ran the R2 algorithm for iterations. (Each experiment was run using an NVIDIA V100 GPU for the training and inference of the neural network, and an Intel Xeon to execute the search algorithm.) At each iteration, new games were randomly generated. The neural network parameters were optimized using the Adam optimizer with a learning rate of . Mini-batches of size were sampled from a buffer of size . At each step of a game, MCTS used simulations to select moves. The algorithm was then evaluated on a set of new games. To ensure the diversity during training, actions were sampled from , whereas, during evaluation, they were selected greedily, i.e. the action with the largest visit count was executed. Since the problem is deterministic, when evaluating the algorithm, the tree search returned the sequence of actions leading to the best game outcome reached during the entire search rather than the best outcome from the last 300 simulations only.
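The difference between training-time and evaluation-time move selection can be sketched as follows. The sketch assumes that, during training, actions are drawn with probability proportional to the MCTS visit counts; the exact sampling distribution is not recoverable from the text, and `select_action` is a hypothetical helper.

```python
import random


def select_action(visit_counts, training, rng=None):
    """Pick an action index from the root visit counts of a tree search.

    Training:   sample proportionally to visit counts (assumed rule),
                which keeps the generated games diverse.
    Evaluation: take the most visited action (greedy).
    """
    actions = list(range(len(visit_counts)))
    if training:
        rng = rng or random.Random()
        # random.choices raises ValueError if all weights are zero.
        return rng.choices(actions, weights=visit_counts, k=1)[0]
    return max(actions, key=lambda a: visit_counts[a])


greedy = select_action([3, 10, 7], training=False)
sampled = select_action([0, 5, 0], training=True, rng=random.Random(0))
```

With counts `[3, 10, 7]` the greedy choice is always action 1, whereas sampling would also pick actions 0 and 2 some of the time.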
5.1 Ranked Reward Performance
Our experiments compare the performance of the R2 algorithm for -percentiles of , and . The experiments also include a version of the algorithm that used the MDP-reward without ranking as the target for the value estimate. The performances of the different algorithms are presented in Table 1 and the learning curves are displayed in Figure 5. (The learning curve for the reward threshold is not included in Figure 5 to improve the readability of the graph.)
The results show that R2 outperforms its rank-free counterpart. The latter quickly plateaued at a value close to 0.88, whereas R2 surpassed that, with the threshold version reaching . This represents an improvement of , with more than half of the problems solved optimally. In addition, faster and more stable learning is observed for R2 compared to its rank-free version. These results validate the importance of the ranking mechanism within the algorithm.
| Algorithm | Mean (± std) | Median | Optimality |
In order to compare the performance of the R2 algorithm to existing approaches, our experiments also included a plain MCTS agent using Monte-Carlo roll-outs for state value estimation ; the Lego heuristic search algorithm ; two successful reinforcement learning methods: the asynchronous advantage actor-critic (A3C) algorithm  and the proximal policy optimization (PPO) algorithm ; and a supervised learning algorithm:
Plain MCTS The plain MCTS agent used simulations per move just like R2 and executed a single Monte Carlo roll-out per simulation to estimate state values.
Lego Heuristic The Lego algorithm worked sequentially by first selecting the item minimizing the wasted space in the bin, and then selecting the orientation and position of the chosen item to minimize the bin size.
Reinforcement Learning We considered the A3C  and PPO algorithms , and adapted the implementations provided in the Ray package  to our problem. In each experiment, we used exactly the same network as in the R2 algorithm. We ran iterations for both A3C and PPO, and each iteration performed steps of optimization with a mini-batch size of .
Supervised Learning Because the bin packing problem instances are generated in a way that provides a known optimal solution for each problem, we designed a Lego-like heuristic algorithm defining a corresponding optimal sequence of actions resulting in this optimal solution. The state-action pairs were used to train the policy-head of the R2 neural network as a one-class classification problem: given state , the policy network should choose action with maximum probability, i.e. the target is a one-hot encoding of the action .
The performances of these algorithms are also given in Table 1 and in Figure 5. Both A3C and PPO reached a significantly lower performance level than R2 and MCTS, suggesting there is a clear advantage in using a tree search algorithm as a policy improvement mechanism. The same neural network was used in A3C, PPO, and R2, and was also trained in a supervised fashion as described above. The supervised learning policy was superior in performance to A3C and PPO, but relies on knowledge of optimal sequences of actions which are in practice unavailable.
Lego is faster to run than the other algorithms but performs worse than R2. The rank-free version of R2 achieves the same level of performance as MCTS, which suggests that the combination of its trained neural network with tree search provides neither an advantage nor a disadvantage. On the other hand, the neural network trained using ranked rewards as target for the value head leads to a significant improvement in the MCTS performance.
5.2 The Effects of Ranking Thresholds on Learning
The performance level and the learning behavior of R2 are both sensitive to the -percentile value. Figure 6 illustrates the effect of different reward thresholds on the distribution of rewards received across 50 games.
The impact of the percentile on the performance follows our intuitive understanding of human learning. Setting the threshold at is equivalent to making the agent play against an opponent of exactly the same level, as it has a predetermined chance of winning. Increasing the percentile value corresponds to improving the opponent’s level, as it makes it harder to obtain a reward of . In our context, when the percentile changes from to , the probability of winning falls to . Interestingly, this threshold produces faster learning and attains a better final level of performance. Taking inspiration from sports, we can expect that the learning process would be improved by playing against a slightly stronger adversary, because it would push learners to the limit of their abilities.
In general, higher thresholds lead to faster learning, i.e. the proportion of high-reward games increases faster. However, Figure 6d shows that, for a threshold of , large numbers of low-reward games reappear, especially during the last parameter updates. These instabilities result in weaker final performance despite some good short-lived peaks. To explain this, we hypothesize that if the opponent is too strong, the learning process suffers because the agent can very rarely affect the outcome of the game, even when it manages to play significantly better than its current mean performance level.
6 Discussion and Future Work
The results presented above show that R2 outperforms the selected alternatives on the given problem. Yet, these results have limitations that we discuss here.
First, our implementation of the 2D bin packing problem only produces problems with known optimal solutions that do not contain any empty space, i.e., square packings with no gaps. Even though this helps us to evaluate the algorithm’s performance, it introduces an undesirable bias. Future research should evaluate the algorithm on a wider range of problems, for which the optimal solution is unknown and not necessarily square.
Secondly, our results are presented for instances of ten items only. Although this represents a problem space of possible solutions, more than this can be handled by current optimizers used in industry, such as the IBM CPLEX Optimizer (https://www.ibm.com/analytics/cplex-optimizer). Therefore, experimenting on larger problems is a necessary step towards demonstrating the superiority of R2 over the other algorithms from Section 5.1.
Furthermore, regarding the scalability of our approach, the capacity of our network can be increased at an acceptable computational cost. In particular, we only use two residual blocks for the policy-value network which is significantly less than what was used to master the game of Go . A more thorough exploration of the threshold space may also improve performance and scalability.
In this paper, we introduced the R2 algorithm and compared its performance to other algorithms on a bin packing problem of ten items. By ranking the rewards obtained over recent games, R2 provides a threshold-based relative performance metric. This enables it to reproduce the benefits of self-play for single-player games, removing the requirement for training data and providing a well-suited adversary throughout the learning process.
Consequently, R2 outperforms the selected alternatives as well as its rank-free counterpart, improving on the performance of the best alternative, plain MCTS, by more than when using a threshold value of . An analysis of the effects of different percentiles has shown that higher thresholds perform better up to a point after which learning becomes unstable and performance decreases.
The R2 algorithm is potentially applicable to a wide range of optimization tasks, though it has so far been used only on the bin packing problem. In the future, we will consider other optimization problems to further evaluate its effectiveness.
[1] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems (NIPS) 30, pages 5360–5370. 2017.
[2] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. arXiv:1710.03748, 2017.
[3] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2016.
[4] V. Boyer, M. Elkihel, and D. El Baz. Heuristics for the 0–1 multidimensional knapsack problem. European Journal of Operational Research, 199(3):658–664, 2009.
[5] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA, June 27-30 2016.
[7] Haoyuan Hu, Lu Duan, Xiaodong Zhang, Yinghui Xu, and Jiangwen Wei. A multi-task selected learning approach for solving new type 3D bin packing problem. arXiv:1804.06896, 2018.
[8] Haoyuan Hu, Xiaodong Zhang, Xiaowei Yan, Longfei Wang, and Yinghui Xu. Solving a new 3D bin packing problem with deep reinforcement learning method. arXiv:1708.05930, 2017.
[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[10] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1928–1937, New York City, NY, USA, June 19-24 2016.
[11] Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. A0C: Alpha zero in continuous action space. arXiv:1805.09613, 2018.
[12] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, William Paul, Michael I Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. arXiv:1712.05889, 2017.
[13] César Rego, Dorabela Gamboa, Fred Glover, and Colin Osterman. Traveling salesman problem heuristics: Leading methods, implementations and latest advances. European Journal of Operational Research, 211(3):427–441, 2011.
[14] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[15] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
[16] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815, 2017.
[17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.
[18] Gerald Tesauro. TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
[19] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems (NIPS) 28, pages 2692–2700, Montreal, Quebec, Canada, December 7-12 2015.
[20] L. A. Wolsey. Integer programming. Wiley-Interscience, New York, NY, USA, 1998.