Reinforced Continual Learning
Most artificial intelligence models have limited ability to solve new tasks quickly without forgetting previously acquired knowledge. The recently emerging paradigm of continual learning, in which a model learns various tasks sequentially, aims to address this issue. In this work, we propose a novel approach to continual learning that searches for the best neural architecture for each incoming task via carefully designed reinforcement learning strategies. We name it Reinforced Continual Learning. Our method not only prevents catastrophic forgetting effectively but also fits new tasks well. Experiments on sequential classification tasks for variants of the MNIST and CIFAR-100 datasets demonstrate that the proposed approach outperforms existing continual learning alternatives for deep networks.
Ju Xu, Center for Data Science, Peking University, Beijing, China, firstname.lastname@example.org
Zhanxing Zhu (corresponding author), Center for Data Science, Peking University & Beijing Institute of Big Data Research (BIBDR), Beijing, China, email@example.com
Preprint. Work in progress.
1 Introduction
Continual learning, or lifelong learning thrun1 (), the ability to learn consecutive tasks without forgetting how to perform previously trained tasks, is an important topic for developing artificial intelligence. The primary goal of continual learning is to overcome the forgetting of learned tasks and to leverage earlier knowledge to obtain better performance or faster convergence on newly arriving tasks.
In the deep learning community, two groups of strategies have been developed to alleviate the problem of forgetting previously trained tasks, distinguished by whether the network architecture changes during learning.
The first category of approaches maintains a fixed network architecture with large capacity. When training the network on consecutive tasks, a regularization term is enforced to prevent the model parameters from deviating too much from the previously learned parameters, weighted by their significance to old tasks kirkpatrick1 (); zenke1 (). In lee2017overcoming (), the authors proposed to incrementally match the moments of the posterior distributions of the neural networks trained on the first and the second task, respectively. Alternatively, an episodic memory GradientEpisodicMemory () is budgeted to store subsets of previous datasets, which are then used for training together with the new task. Fernando et al. fernando1 () proposed PathNet, in which a neural network has ten or twenty modules in each layer, and three or four modules are picked for one task in each layer by an evolutionary approach. However, these methods typically require unnecessarily large-capacity networks, particularly when the number of tasks is large, since the network architecture is never dynamically adjusted during training.
The other group of methods for overcoming catastrophic forgetting dynamically expands the network to accommodate each new task while keeping the parameters of the previous architecture unchanged. Progressive networks rusu1 () expand the architecture with a fixed number of nodes or layers, leading to an extremely large network structure, particularly when faced with a large number of sequential tasks. The resulting complex architecture might be expensive to store and even unnecessary due to its high redundancy. Dynamically Expandable Network (DEN) yoon1 () alleviated this issue slightly by introducing group sparsity regularization when adding new parameters to the original network; unfortunately, DEN involves many hyperparameters, including various regularization and thresholding ones, which need to be tuned carefully because model performance is highly sensitive to them.
In this work, in order to better facilitate knowledge transfer and avoid catastrophic forgetting, we provide a novel framework to adaptively expand the network. Faced with a new task, deciding the optimal number of nodes/filters to add to each layer is posed as a combinatorial optimization problem. We provide a carefully designed reinforcement learning method to solve this problem, and thus name it Reinforced Continual Learning (RCL). In RCL, a controller implemented as a recurrent neural network is adopted to determine the best architectural hyperparameters of the neural network for each task. We train the controller by an actor-critic strategy guided by a reward signal derived from both validation accuracy and network complexity. This maintains the prediction accuracy on older tasks as much as possible while reducing the overall model complexity. To the best of our knowledge, this is the first attempt to employ reinforcement learning for solving continual learning problems.
RCL differs both from adding a fixed number of units to the old network for each new task rusu1 (), which might be suboptimal and computationally expensive, and from yoon1 (), which performs group sparsity regularization on the added parameters. We validate the effectiveness of RCL on various sequential tasks, and the results show that RCL can obtain better performance than existing methods while adding far fewer units.
The rest of this paper is organized as follows. In Section 2, we introduce the preliminary knowledge on reinforcement learning. In Section 3, we propose the new method RCL, a model that learns a sequence of tasks dynamically based on reinforcement learning. In Section 4, we conduct various experiments to demonstrate the superiority of RCL over other state-of-the-art methods. Finally, we conclude the paper in Section 5 and provide some directions for future research.
2 Preliminaries of Reinforcement Learning
Reinforcement learning sutton1 () deals with learning a policy for an agent interacting with an unknown environment. It has been applied successfully to various problems, such as games minh1 (); silver1 (), natural language processing yu1 (), and neural architecture/optimizer search zoph1 (); bello1 (). At each step, an agent observes the current state $s_t$ of the environment, decides on an action $a_t$ according to a policy $\pi(a_t \mid s_t)$, and observes a reward signal $r_{t+1}$. The goal of the agent is to find a policy that maximizes the expected sum of discounted rewards $R_t = \sum_{t'=t+1}^{\infty} \gamma^{t'-t-1} r_{t'}$, where $\gamma \in (0, 1]$ is a discount factor that determines the importance of future rewards. The value function of a policy $\pi$ is defined as the expected return $V_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\right]$ and its action-value function as $Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right]$.
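As a small illustration of the discounted sum of rewards, the following Python sketch computes the return for a finite reward sequence. The function name `discounted_return` and the example numbers are illustrative assumptions, not part of the original method.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards, so every earlier step
    # multiplies the running total by gamma exactly once.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards observed after three consecutive steps.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

A smaller discount factor shrinks the contribution of later rewards; with a discount factor of 1 the return is simply the plain sum.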
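Policy gradient methods address the problem of finding a good policy by performing stochastic gradient descent to optimize a performance objective over a given family of stochastic policies $\pi_\theta$ parameterized by $\theta$. The policy gradient theorem sutton2 () provides expressions for the gradient of the average reward and discounted reward objectives with respect to $\theta$. In the discounted setting, the objective is defined with respect to a designated start state (or distribution) $s_0$: $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\right]$. The policy gradient theorem shows that:
$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_{s} \mu_{\pi_\theta}(s \mid s_0) \sum_{a} \frac{\partial \pi_\theta(a \mid s)}{\partial \theta} Q_{\pi_\theta}(s, a), \tag{1}$$
where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$.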
3 Our Proposal: Reinforced Continual Learning
In this section, we elaborate on the new framework for continual learning, Reinforced Continual Learning (RCL). RCL consists of three networks: a controller, a value network, and a task network. The controller is implemented as a Long Short-Term Memory network (LSTM) that generates policies and determines how many filters or nodes will be added for each task. We design the value network as a fully-connected network, which approximates the value of the state. The task network can be any network of interest for solving a particular task, such as image classification or object detection. In this paper, we use a convolutional neural network (CNN) as the task network to demonstrate how RCL adaptively expands this CNN to prevent forgetting, though our method applies to fully-connected networks as well as convolutional ones.
3.1 The Controller
Figure 1(a) shows how RCL expands the network when a new task arrives. After the learning process for task $\mathcal{T}_{t-1}$ finishes and task $\mathcal{T}_t$ arrives, we use a controller to decide how many filters or nodes should be added to each layer. In order to prevent semantic drift, we do not modify the network weights learned for previous tasks and train only the newly added filters. After we have trained the model for task $\mathcal{T}_t$, we timestamp each newly added filter by recording the shape of every layer at that stage. At inference time, each task $\mathcal{T}_t$ employs only the parameters available at stage $t$ and ignores the filters added for later tasks.
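The task-specific inference described above can be sketched for a single fully-connected layer. This is a simplified illustration under stated assumptions: the helper `forward_for_task` and the dictionary `units_by_task` (the recorded layer width per task) are hypothetical names, and a real multi-layer network would also have to restrict the input side of the following layer in the same way.

```python
def forward_for_task(x, weights, units_by_task, task_id):
    """Linear layer restricted to the units that existed when `task_id` was trained.

    weights[i][j] connects input i to output unit j; units_by_task[task_id]
    is the recorded layer width (the "timestamp") for that task.
    """
    n_units = units_by_task[task_id]
    return [
        sum(x_i * row[j] for x_i, row in zip(x, weights))
        for j in range(n_units)
    ]


# Task 1 was trained with 2 units; task 2 added a third unit (last column).
weights = [[1.0, 2.0, 5.0],
           [3.0, 4.0, 6.0]]
units_by_task = {1: 2, 2: 3}

print(forward_for_task([1.0, 1.0], weights, units_by_task, task_id=1))  # [4.0, 6.0]
print(forward_for_task([1.0, 1.0], weights, units_by_task, task_id=2))  # [4.0, 6.0, 11.0]
```

Because the first two columns are never changed after task 1, the output for task 1 is identical before and after the expansion.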
Suppose the task network has $m$ layers. When faced with a new task, for each layer $i$ we specify the number of filters to add in the range between $0$ and $n_i - 1$. A straightforward way to obtain the optimal configuration of added filters for the $m$ layers is to traverse all combinations of actions. However, for an $m$-layer network, the time complexity of finding the best action combination is $\mathcal{O}\left(\prod_{i=1}^{m} n_i\right)$, which grows exponentially with $m$ and is unacceptable for very deep architectures such as VGG and ResNet.
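To make the size of this search space concrete, the sketch below multiplies the number of choices per layer. The per-layer counts are taken from the search-space sizes reported in Appendix A, under the assumption (not stated in the text) that three layers of the fully-connected network and all five layers of LeNet are expandable.

```python
from math import prod


def num_configurations(choices_per_layer):
    """Number of distinct expansion configurations for an exhaustive search."""
    return prod(choices_per_layer)


# Assumed: three expandable layers with 30 choices each (MNIST setting).
print(num_configurations([30, 30, 30]))  # 27000

# Assumed: two convolutional layers with 5 choices and three
# fully-connected layers with 25 choices each (CIFAR-100 setting).
print(num_configurations([5, 5, 25, 25, 25]))  # 390625
```

Since every configuration requires training a child network, even these small counts are costly to enumerate, which motivates sampling configurations with a learned controller instead.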
To deal with this issue, we treat a series of actions as a fixed-length string. A controller can be used to generate such a string, representing how many filters should be added in each layer. Since there is a recurrent relationship between consecutive layers, the controller can be naturally designed as an LSTM network. At the first step, the controller network receives an empty embedding as input (i.e. the state $s$) for the current task, which is fixed during training. For each task $\mathcal{T}_t$, we equip the network with a softmax output $p_{t,i} \in \mathbb{R}^{n_i}$, representing the probabilities of sampling each action for layer $i$, i.e. the number of filters to be added. We design the LSTM in an autoregressive manner: as Figure 1(b) shows, the probability $p_{t,i}$ from the previous step is fed as input into the next step. This process is repeated until we obtain the actions and probabilities for all $m$ layers. The policy probability of the sequence of actions $a_{1:m}$ follows the product rule,
$$\pi(a_{1:m} \mid s; \theta_c) = \prod_{i=1}^{m} p_{t,i,a_i}, \tag{2}$$
where $\theta_c$ denotes the parameters of the controller network.
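The autoregressive sampling and the product rule can be sketched as follows. This is not the LSTM controller itself: `toy_step` is a hypothetical stand-in for one recurrent step that simply adds the previous step's probabilities to fixed per-layer logits, and all layers are assumed to share the same number of choices so that the previous probabilities can be fed forward directly.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def toy_step(prev_probs, layer_logits):
    """Hypothetical stand-in for one controller step (not an LSTM cell)."""
    return softmax([l + p for l, p in zip(layer_logits, prev_probs)])


def sample_actions(logits_per_layer, rng):
    """Sample one action per layer and return the log-probability of the sequence."""
    n = len(logits_per_layer[0])
    prev = [0.0] * n  # "empty" input at the first step
    actions, log_prob = [], 0.0
    for layer_logits in logits_per_layer:
        probs = toy_step(prev, layer_logits)
        a = rng.choices(range(n), weights=probs)[0]
        actions.append(a)
        log_prob += math.log(probs[a])  # product rule, accumulated in log space
        prev = probs  # feed this step's probabilities into the next step
    return actions, log_prob


rng = random.Random(0)
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]  # 3 layers, 4 choices each
actions, log_prob = sample_actions(logits, rng)
print(actions, math.exp(log_prob))
```

With uniform logits every step yields a uniform distribution, so the sequence probability is the product of three factors of 0.25 regardless of which actions are drawn.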
3.2 The Task Network
We deal with $T$ tasks arriving in a sequential manner, with training dataset $\mathcal{D}_t$, validation dataset $\mathcal{V}_t$, and test dataset $\mathcal{D}_t^{\text{test}}$ at time $t$. For the first task, we train a basic task network that performs well enough by solving a standard supervised learning problem,
$$W_1^a = \arg\min_{W_1} L_1(W_1; \mathcal{D}_1). \tag{3}$$
We denote the well-trained parameters for task $\mathcal{T}_t$ by $W_t^a$. When the $t$-th task arrives, we already know the best parameters $W_{t-1}^a$ for task $\mathcal{T}_{t-1}$. We then use the controller to decide how many filters should be added to each layer, obtaining an expanded child network whose parameters are denoted as $W_t$ (including $W_{t-1}^a$). The training procedure for the new task keeps $W_{t-1}^a$ fixed and back-propagates only through the newly added parameters $W_t \setminus W_{t-1}^a$. Thus, the optimization problem for the new task is
$$\overline{W}_t = \arg\min_{W_t \setminus W_{t-1}^a} L_t(W_t; \mathcal{D}_t). \tag{4}$$
We use stochastic gradient descent to learn the newly added filters with learning rate $\eta$,
$$W_t \setminus W_{t-1}^a \leftarrow W_t \setminus W_{t-1}^a - \eta \nabla_{W_t \setminus W_{t-1}^a} L_t. \tag{5}$$
The expanded child network is trained until the required number of epochs or convergence is reached. We then evaluate the child network on the validation dataset and return the corresponding accuracy. The parameters of the expanded network achieving the maximal reward (described in Section 3.3) are taken as the optimal ones for task $\mathcal{T}_t$, and we store them for later tasks.
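The idea of updating only the newly added parameters can be sketched with a masked gradient step on one weight matrix. This is a minimal sketch under simplifying assumptions: `expand_layer` and `sgd_step_new_only` are hypothetical helpers, new units are appended as extra output columns, and the gradients are supplied by hand rather than computed by back-propagation.

```python
def expand_layer(weights, num_new_units, init=0.0):
    """Append `num_new_units` output columns to a weight matrix (list of rows).

    Returns the expanded matrix and the number of previously existing columns.
    """
    old_cols = len(weights[0])
    expanded = [row + [init] * num_new_units for row in weights]
    return expanded, old_cols


def sgd_step_new_only(weights, grads, old_cols, lr):
    """Apply w <- w - lr * g only to columns added after `old_cols`."""
    return [
        [w if j < old_cols else w - lr * g
         for j, (w, g) in enumerate(zip(w_row, g_row))]
        for w_row, g_row in zip(weights, grads)
    ]


old_weights = [[1.0, 2.0],
               [3.0, 4.0]]
expanded, old_cols = expand_layer(old_weights, num_new_units=1)
grads = [[1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0]]  # hand-written gradients for illustration
updated = sgd_step_new_only(expanded, grads, old_cols, lr=0.1)
print(updated)  # [[1.0, 2.0, -0.1], [3.0, 4.0, -0.1]]
```

The old columns are returned untouched even though their gradients are non-zero, which is what keeps the predictions for earlier tasks unchanged.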
3.3 Reward Design
To help the controller generate better actions over time, we need to design a reward function that reflects the quality of its actions. Considering both the validation accuracy and the complexity of the expanded network, we design the reward for task $\mathcal{T}_t$ as a combination of the two terms,
$$R_t = A_t(s, a_{1:m}) + \alpha\, C_t(s, a_{1:m}), \tag{6}$$
where $A_t$ represents the validation accuracy on $\mathcal{V}_t$, the network complexity is $C_t = -\sum_{i=1}^{m} k_i$, $k_i$ is the number of filters added in layer $i$, and $\alpha$ is a parameter that balances prediction performance and model complexity. Since $R_t$ is non-differentiable, we use policy gradient to update the controller, as described in the following section.
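A minimal sketch of such a reward is given below, assuming the specific form "validation accuracy minus the balancing parameter times the total number of added filters". The function name `reward` and the example accuracies are hypothetical; the value 0.0003 is the balancing parameter reported for MNIST permutations in Appendix A.

```python
def reward(val_accuracy, filters_added, alpha):
    """Validation accuracy penalised by the number of newly added filters."""
    complexity = -sum(filters_added)  # assumed complexity term: minus the added filters
    return val_accuracy + alpha * complexity


# Two hypothetical candidates: a larger expansion with slightly higher accuracy
# and a smaller expansion with slightly lower accuracy.
big = reward(0.950, [30, 30], alpha=0.0003)
small = reward(0.945, [10, 5], alpha=0.0003)
print(big, small)
```

In this made-up example the larger expansion pays a penalty of about 0.018 and the smaller one about 0.0045, so the smaller network obtains the higher reward despite its lower accuracy.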
3.4 Training Procedures
The controller's prediction can be viewed as a list of actions $a_{1:m}$, i.e. the number of filters added in each of the $m$ layers, which defines a new architecture for a child network that is then trained on the new task. At convergence, this child network achieves an accuracy $A_t$ on the validation dataset and has model complexity $C_t$, from which we obtain the reward $R_t$ as defined in Eq. (6). We use this reward and reinforcement learning to train the controller.
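To find the optimal incremental architecture for the new task $\mathcal{T}_t$, the controller aims to maximize its expected reward,
$$J(\theta_c) = V_{\theta_c}(s), \tag{7}$$
where $V_{\theta_c}$ is the true value function. In order to accelerate policy gradient training over $\theta_c$, we use actor-critic methods with a value network parameterized by $\theta_v$ to approximate the state value $V(s; \theta_v)$. The REINFORCE algorithm william1 () can be used to learn $\theta_c$,
$$\nabla_{\theta_c} J(\theta_c) = \mathbb{E}\left[\sum_{a_{1:m}} \pi(a_{1:m} \mid s; \theta_c)\left(R_t(s, a_{1:m}) - V(s; \theta_v)\right) \frac{\nabla_{\theta_c} \pi(a_{1:m} \mid s; \theta_c)}{\pi(a_{1:m} \mid s; \theta_c)}\right]. \tag{8}$$
A Monte Carlo approximation for the above quantity is
$$\frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta_c} \log \pi\left(a_{1:m}^{(i)} \mid s; \theta_c\right)\left(R_t\left(s, a_{1:m}^{(i)}\right) - V(s; \theta_v)\right), \tag{9}$$
where $N$ is the batch size. For the value network, we use a gradient-based method to update $\theta_v$ by minimizing the squared error $L_v = \frac{1}{N} \sum_{i=1}^{N} \left(V(s; \theta_v) - R_t\left(s, a_{1:m}^{(i)}\right)\right)^2$, the gradient of which can be evaluated as follows,
$$\nabla_{\theta_v} L_v = \frac{2}{N} \sum_{i=1}^{N} \left(V(s; \theta_v) - R_t\left(s, a_{1:m}^{(i)}\right)\right) \frac{\partial V(s; \theta_v)}{\partial \theta_v}. \tag{10}$$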
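The Monte Carlo policy gradient with a value baseline can be sketched for the simplest possible case. This is an illustration under strong assumptions: a single categorical decision parameterised directly by softmax logits replaces the LSTM controller, and a scalar `baseline` replaces the value network, so the gradient of the log-probability with respect to the logits is the one-hot vector of the sampled action minus the probabilities. All names are hypothetical.

```python
def policy_gradient_estimate(probs, actions, rewards, baseline):
    """Batch estimate of the gradient of expected reward w.r.t. softmax logits."""
    n = len(actions)
    grad = [0.0] * len(probs)
    for a, r in zip(actions, rewards):
        advantage = r - baseline
        for k, p in enumerate(probs):
            # d log pi(a) / d logit_k = 1[k == a] - p_k for a softmax policy.
            grad[k] += advantage * ((1.0 if k == a else 0.0) - p) / n
    return grad


def value_gradient(baseline, rewards):
    """Gradient of the mean squared error between a scalar baseline and the rewards."""
    n = len(rewards)
    return 2.0 / n * sum(baseline - r for r in rewards)


probs = [0.5, 0.5]    # current policy over two candidate actions
actions = [0, 1]      # one sampled batch
rewards = [1.0, 0.0]  # action 0 was rewarded, action 1 was not
print(policy_gradient_estimate(probs, actions, rewards, baseline=0.5))  # [0.25, -0.25]
print(value_gradient(0.5, rewards))  # 0.0
```

Ascending this gradient raises the logit of the rewarded action, while the baseline of 0.5 already equals the mean reward, so its own gradient vanishes.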
3.5 Comparison with Other Approaches
Compared with DEN, instead of performing selective retraining and network splitting, RCL keeps the learned parameters for previous tasks fixed and updates only the added parameters. Through this training strategy, RCL can completely prevent catastrophic forgetting, because the parameters for earlier tasks are frozen.
Progressive neural networks expand the architecture with a fixed number of units or filters. To obtain satisfactory model accuracy when the number of sequential tasks is large, the final complexity of progressive nets must be extremely high. This directly leads to a high computational burden in both training and inference, and can even make the entire model difficult to store. To handle this issue, both RCL and DEN dynamically adjust the networks to reach a more economical architecture.
While DEN achieves the expandable network by sparse regularization, RCL adaptively expands the network by reinforcement learning. The performance of DEN is quite sensitive to its various hyperparameters, including regularization parameters and thresholding coefficients. RCL largely reduces the number of hyperparameters, and tuning boils down to balancing the average validation accuracy and model complexity when designing the reward function. Through the experiments in Section 4, we demonstrate that RCL achieves more stable results and better model performance with far fewer neurons than DEN.
4 Experiments
We perform a variety of experiments to assess the performance of RCL in continual learning. We report the accuracy, the model complexity, and the training time of RCL and the state-of-the-art baselines. We implemented all the experiments in the TensorFlow framework on a Tesla K80 GPU.
We evaluate on three datasets: (1) MNIST Permutations kirkpatrick1 (). Ten variants of the MNIST data, where each task is transformed by a fixed permutation of pixels. In this dataset, the samples from different tasks are not independent and identically distributed. (2) MNIST Mix. Five MNIST permutations ($P_1, \dots, P_5$) and five variants of the MNIST dataset ($R_1, \dots, R_5$), each of which contains digits rotated by a fixed angle between 0 and 180 degrees. These tasks are arranged in the order $P_1, R_1, P_2, \dots, P_5, R_5$. (3) Incremental CIFAR-100 icart (). Different from the original CIFAR-100, each task introduces a new set of classes. For a total number of tasks $T$, each new task contains images from a subset of $100/T$ classes. In this dataset, the distribution of the input is similar for all tasks, but the distribution of the output is different.
For all of the above datasets, we set the number of tasks to be learned as $T = 10$. For the MNIST datasets, each task contains 60000 training examples and 1000 test examples from 10 different classes. For the CIFAR-100 dataset, each task contains 5000 training examples and 1000 test examples from 10 different classes. The model observes the tasks one by one, and once a task has been observed, it is not observed again during training.
We compare RCL with the following baselines: (1) SN, a single network trained across all tasks; (2) EWC, a deep network trained with elastic weight consolidation kirkpatrick1 () for regularization; (3) GEM, gradient episodic memory GradientEpisodicMemory (); (4) PGN, the progressive neural network proposed in rusu1 (); (5) DEN, the dynamically expandable network yoon1 ().
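The construction of a pixel-permuted task can be sketched as follows: one fixed permutation is drawn per task and applied to every flattened image of that task. The helper names and the use of the task index as a random seed are illustrative assumptions, not details given in the text.

```python
import random


def make_permutation(num_pixels, seed):
    """Draw one fixed pixel permutation for a task."""
    rng = random.Random(seed)
    perm = list(range(num_pixels))
    rng.shuffle(perm)
    return perm


def permute_image(pixels, perm):
    """Reorder a flattened image according to the task's permutation."""
    return [pixels[j] for j in perm]


# One permutation per task for 28x28 images (784 pixels), 10 tasks.
task_perms = [make_permutation(784, seed=task) for task in range(10)]

# Tiny 4-pixel example so the effect is easy to inspect.
tiny_perm = make_permutation(4, seed=0)
print(tiny_perm, permute_image([10, 20, 30, 40], tiny_perm))
```

Every image of a task is reordered in the same way, so the pixel values are preserved while the spatial layout differs from task to task.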
Base network settings
(1) Fully connected networks for the MNIST Permutations and MNIST Mix datasets. We use a three-layer network with 784-312-128-10 neurons and ReLU activations; (2) LeNet is used for Incremental CIFAR-100. LeNet has two convolutional layers and three fully-connected layers; its detailed structure can be found in lenet1 ().
We evaluate each compared approach by considering average test accuracy on all the tasks, model complexity and training time. Model complexity is measured via the number of model parameters after training all the tasks. We first report the test accuracy and model complexity of baselines and our proposed RCL for the three datasets in Figure 2.
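As a reference point for the complexity measure, the sketch below counts the parameters of the 784-312-128-10 base network before any expansion. Whether bias terms are included in the reported counts is not stated, so the `biases` flag is an assumption, and `count_dense_parameters` is a hypothetical helper.

```python
def count_dense_parameters(layer_sizes, biases=True):
    """Number of weights (and optionally biases) in a fully-connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + (n_out if biases else 0)
    return total


print(count_dense_parameters([784, 312, 128, 10]))                # 286274
print(count_dense_parameters([784, 312, 128, 10], biases=False))  # 285824
```

Each unit added to a hidden layer increases this count by its incoming and outgoing connections, which is why the number of added units is used as the complexity term in the reward.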
Comparison between fixed-size and expandable networks.
From Figure 2, we observe that the approaches with fixed-size network architectures, such as SN, EWC and GEM, have low model complexity, but their prediction accuracy is much worse than that of the methods with expandable networks, including PGN, DEN and RCL. This shows that dynamically expanding the network can indeed improve model performance by a large margin.
Comparison between PGN, DEN and RCL.
Among the expandable networks, RCL outperforms PGN and DEN on both test accuracy and model complexity. In particular, RCL achieves a significant reduction in the number of parameters compared with both PGN and DEN, e.g. on the incremental CIFAR-100 data.
To further examine the difference between the three methods, we vary the hyperparameter settings, train the networks accordingly, and record how test accuracy changes with the number of parameters, as shown in Figure 3. We clearly observe that RCL achieves a significant model reduction at the same test accuracy as PGN and DEN, and a remarkable accuracy improvement at the same network size. This demonstrates the benefits of employing reinforcement learning to adaptively control the complexity of the entire model architecture.
Evaluating the forgetting behavior.
Figure 4 shows the evolution of the test accuracy on the first task as more tasks are learned. RCL and PGN exhibit no forgetting, while the approaches that do not expand the network suffer from catastrophic forgetting. Moreover, DEN cannot completely prevent forgetting since it retrains the previous parameters when learning new tasks.
We report the wall-clock training time for each compared method in Table 1. Since RCL is based on reinforcement learning, a large number of trials is typically required, which leads to more training time than other methods. Improving the training efficiency of reinforcement learning is still an open problem, and we leave it as future work.
Balance between test accuracy and model complexity.
We control the tradeoff between model performance and complexity through the coefficient $\alpha$ in the reward function (6). Figure 5 shows how varying $\alpha$ affects the test accuracy and the number of model parameters. As expected, with increasing $\alpha$ the model complexity drops significantly while the model performance also deteriorates gradually. Interestingly, when $\alpha$ is small, accuracy drops much more slowly than the number of parameters. This observation can help to choose a suitable $\alpha$ such that a medium-sized network still achieves relatively good model performance.
5 Conclusion
We have proposed a novel framework for continual learning, Reinforced Continual Learning. Our method searches for the best neural architecture for each incoming task by reinforcement learning, which increases the network's capacity when necessary and effectively prevents semantic drift. We implement both fully connected and convolutional neural networks as our task networks, and validate them on different datasets. The experiments demonstrate that our proposal significantly outperforms the existing baselines on both prediction accuracy and model complexity.
As for future work, two directions are worthy of consideration. Firstly, we will develop new strategies for RCL to facilitate backward transfer, i.e. improving the performance on previous tasks by learning new tasks. Secondly, reducing the training time of RCL is particularly important for large networks with more layers.
- (1) Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning (ICML), pages 459–468, 2017.
- (2) Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
- (3) James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
- (4) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- (5) Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4655–4665, 2017.
- (6) David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In NIPS, 2017.
- (7) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- (8) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, pages 5533–5542. IEEE Computer Society, 2017.
- (9) Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
- (10) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- (11) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge: MIT press, 1998.
- (12) Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 1999.
- (13) Sebastian Thrun. A lifelong learning perspective for mobile robot control. In International Conference on Intelligent Robots and Systems, 1995.
- (14) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
- (15) J. Yoon and E. Yang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
- (16) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
- (17) Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning (ICML), 2017.
- (18) Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix A Experiment settings
In this section, we present the experimental details of our model and the baselines. For the MNIST permutations and MNIST mix datasets, we use a three-layer network with 784-312-128-10 neurons; the learning rate is 0.001, the batch size is 32, and all models are trained for 15 epochs. When expanding the network, the size of the search space is 30 across all layers for RCL, DEN and PGN. For CIFAR-100, we use LeNet as our task network. The number of training epochs is 20 and the learning rate is 0.001. The search space is 5 in convolutional layers and 25 in fully-connected layers for RCL, DEN and PGN.
Our controller is implemented as an LSTM network with two layers and a hidden size of 100. Our value network is implemented as a fully-connected network with only one layer. The learning rate is 0.001 for the controller and 0.005 for the value network.
The $\alpha$ in our reward design is 0.0003 for MNIST permutations, 0.0002 for MNIST mix, and 0.001 for CIFAR-100. For DEN on MNIST permutations and MNIST mix, l1_lambda is 0.00001, l2_lambda is 0.0001, gl_lambda is 0.001, regular_lambda is 0.5, loss_thr is 0.01, and spl_thr is 0.05. For CIFAR-100, the hyperparameters in DEN are the same except that regular_lambda is 5.