Large Scale Incremental Learning
Modern machine learning suffers from catastrophic forgetting when learning new classes incrementally: performance degrades dramatically due to the missing data of old classes. Incremental learning methods have been proposed to retain the knowledge acquired from the old classes, by using knowledge distillation and by keeping a few exemplars from the old classes. However, these methods struggle to scale up to a large number of classes. We believe this is because of the combination of two factors: (a) the data imbalance between the old and new classes, and (b) the increasing number of visually similar classes. Distinguishing between an increasing number of visually similar classes is particularly challenging when the training data is unbalanced. We propose a simple and effective method to address this data imbalance issue. We found that the last fully connected layer has a strong bias towards the new classes, and this bias can be corrected by a linear model. With two bias parameters, our method performs remarkably well on two large datasets: ImageNet (1000 classes) and MS-Celeb-1M (10000 classes), outperforming the state-of-the-art algorithms by 11.1% and 13.2% respectively.
1 Introduction
Natural learning systems are inherently incremental: new knowledge is continuously learned over time while existing knowledge is maintained [19, 13]. Many computer vision applications in the real world require incremental learning capabilities. For example, a face recognition system should be able to add new persons without forgetting the faces already learned. However, most deep learning approaches suffer from catastrophic forgetting: a significant performance degradation when the past data are not available.
The missing data for old classes introduce two challenges: (a) maintaining the classification performance on old classes, and (b) balancing between old and new classes. Distillation [13, 19, 2] has been used to effectively address the former challenge. Recent studies [19, 2] also show that keeping a few exemplars from the old classes can alleviate the imbalance problem. These methods perform well on small datasets. However, they suffer a significant performance degradation when the number of classes becomes large (e.g. thousands of classes). Fig. 1 demonstrates the performance degradation of these state-of-the-art algorithms, using a non-incremental classifier as the reference. When the number of classes increases from 100 to 1000, both iCaRL [19] and EEIL [2] degrade substantially more.
Why is it more challenging to handle a large number of classes in incremental learning? We believe this is due to the coupling of two factors. First, the training data are unbalanced. Second, as the number of classes increases, it is more likely to have visually similar classes (e.g. multiple dog classes in ImageNet) across different incremental steps. Under the incremental constraint with data imbalance, the increasing number of visually similar classes is particularly challenging, since the small margin around the boundary between such classes is too sensitive to the data imbalance: the boundary is pushed to favor classes with more samples.
In this work, we present a method to address the data imbalance problem in large scale incremental learning. First, we found a strong bias towards the new classes in the classifier layer (i.e. the last fully connected layer) of the convolutional neural network (CNN). Based upon this finding, we propose a simple and effective method, called BiC (bias correction), to correct the bias. We add a bias correction layer after the last fully connected (FC) layer (shown in Fig. 2), which is a simple linear model with two parameters. The bias correction layer is learned in a second stage, after learning the convolution layers and the FC layer in the first stage. The data, including exemplars from the old classes and samples from the new classes, are split into a training set for the first stage and a validation set for the second stage. The validation set helps approximate the real distribution of both old and new classes in the feature space, allowing us to estimate the bias in the FC layer. We found that the bias can be effectively corrected with a small validation set.
Our BiC method achieves remarkably good performance, especially on large scale datasets. The experimental results show that our method outperforms the state-of-the-art algorithms (iCaRL [19] and EEIL [2]) on two large datasets (ImageNet ILSVRC 2012 and MS-Celeb-1M) by a large margin. Our BiC method gains 11.1% on ImageNet and 13.2% on MS-Celeb-1M, respectively.
2 Related Work
Incremental learning has been a long-standing problem in machine learning [3, 17, 16, 12]. Before deep learning took off, incremental learning techniques were developed on top of linear classifiers, ensembles of weak classifiers, nearest neighbor classifiers, etc. Recently, thanks to the exciting progress in deep learning, there has been a lot of research on incremental learning with deep neural network models. The work can be roughly divided into three categories, depending on whether it uses real data, synthetic data, or no data from the old classes.
Without using old data: Methods in the first category do not require any old data. Jung et al. [9] presented a method for domain transfer learning that tries to maintain the performance on old tasks by freezing the final layer and discouraging changes to the shared weights in the feature extraction layers. Kirkpatrick et al. [10] proposed a technique to remember old tasks by constraining the important weights when optimizing a new task; one limitation of this approach is that the old and new tasks may conflict on these important weights. Li and Hoiem [13] presented a method that applies knowledge distillation [8] to maintain the performance on old tasks. Other work separated the old and new tasks in multi-task learning, which is different from learning a classifier incrementally. Shmelkov et al. [23] applied knowledge distillation to learn object detectors incrementally. Rannen et al. [18] utilized an autoencoder to retain the knowledge from old tasks. [25, 26] updated a knowledge dictionary for new tasks and kept the dictionary coefficients for old tasks.
Using synthetic data: Both [22] and [27] employed GANs [4] to replay synthetic data for old tasks: one applies a cross-entropy loss on the synthetic data with the old solver's responses as the targets, the other a root mean-squared error to learn the responses of old tasks on synthetic data. These approaches [22, 27] depend heavily on the capability of the generative model and struggle with complex objects and scenes.
Using exemplars from old data: Methods in the third category require part of the old data. iCaRL [19] proposed a method to select a small number of exemplars from each old class. EEIL [2] keeps the classifiers of all incremental steps and uses them for distillation; it introduces balanced fine-tuning and temporary distillation to alleviate the imbalance between the old and new classes. [14] proposed a continual learning framework in which the training samples of different tasks are used one by one during training; it constrains the cross entropy loss on the softmax outputs of old tasks when a new task comes. Xiao et al. [28] proposed a training method that grows a network hierarchically as new training data are added. Similarly, [21] increases the number of layers in the network to handle newly arriving data.
Our BiC method belongs to the third category: we keep exemplars from the old classes in a similar manner to [19, 2]. However, we handle the data imbalance differently. We first locate a strong bias in the classifier layer (the last fully connected layer), and then apply a linear model to correct the bias using a small validation set. The validation set is a small subset of the exemplars that is excluded from training and used for bias correction alone. Compared with the state of the art ([19, 2]), our BiC method is more effective on large datasets with 1000+ classes.
3 Baseline: Incremental Learning using Knowledge Distillation
In this section, we introduce a baseline solution for incremental learning using knowledge distillation [8]. This corresponds to the first stage in Fig. 2. For an incremental step with $n$ old classes and $m$ new classes, we learn a new model to perform classification on all $n + m$ classes, using knowledge distillation from an old model that classifies the $n$ old classes (illustrated in Fig. 3). The new model is learned with a distilling loss and a classification loss.
Let us denote the samples of the new classes as $X^{new} = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of new samples, and $x_i$ and $y_i$ are the image and the label, respectively. The selected exemplars from the old classes are denoted as $X^{old} = \{(\hat{x}_j, \hat{y}_j)\}_{j=1}^{M}$, where $M$ is the number of selected old images ($M \ll N$). Let us also denote the output logits of the old and new classifiers as $\hat{o}(x) = [\hat{o}_1(x), \dots, \hat{o}_n(x)]$ and $o(x) = [o_1(x), \dots, o_{n+m}(x)]$, respectively. The distilling loss is formulated as follows:

$$\mathcal{L}_d = -\sum_{x} \sum_{k=1}^{n} \hat{\pi}_k(x) \log \pi_k(x), \qquad \hat{\pi}_k(x) = \frac{e^{\hat{o}_k(x)/T}}{\sum_{j=1}^{n} e^{\hat{o}_j(x)/T}}, \quad \pi_k(x) = \frac{e^{o_k(x)/T}}{\sum_{j=1}^{n} e^{o_j(x)/T}} \quad (1)$$
where $T$ is the temperature scalar. The distilling loss is computed for all samples from the new classes and the exemplars from the old classes (i.e. $x \in X^{old} \cup X^{new}$).
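To make the temperature-scaled formulation concrete, here is a minimal NumPy sketch of the distilling loss; the function and variable names are illustrative, not taken from the paper's released code.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; a higher T yields softer probabilities.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(old_logits, new_logits, T=2.0):
    # Cross-entropy between the old model's softened outputs (targets)
    # and the new model's outputs, restricted to the n old classes.
    # old_logits: (batch, n) from the frozen old model.
    # new_logits: (batch, n) the first n logits of the new model.
    soft_targets = softmax(old_logits, T)
    log_student = np.log(softmax(new_logits, T))
    return float(-(soft_targets * log_student).sum(axis=-1).mean())
```

When the new model reproduces the old model's logits exactly, the loss reduces to the entropy of the soft targets; any mismatch increases it (Gibbs' inequality).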
We use the softmax cross entropy as the classification loss, which is computed as follows:

$$\mathcal{L}_c = -\sum_{(x, y) \in X^{old} \cup X^{new}} \sum_{k=1}^{n+m} \delta_{y=k} \log p_k(x) \quad (2)$$
where $\delta_{y=k}$ is the indicator function and $p_k(x)$ is the output probability (i.e. softmax of logits) of the $k$-th class among the old and new classes.
The overall loss combines the distilling loss and the classification loss as follows:

$$\mathcal{L} = \lambda \mathcal{L}_d + (1 - \lambda) \mathcal{L}_c \quad (3)$$
where the scalar $\lambda$ is used to balance between the two terms. It is set to $\lambda = \frac{n}{n+m}$, where $n$ and $m$ are the number of old and new classes, respectively. $\lambda$ is $0$ for the first batch since all classes are new. For the extreme case where $m \ll n$, $\lambda$ is nearly $1$, indicating the importance of maintaining the old classes.
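The weighting can be written directly (a one-line sketch; the function name is ours):

```python
def total_loss(loss_d, loss_c, n_old, n_new):
    # lambda = n / (n + m): the distillation weight grows with the
    # fraction of old classes, so old knowledge matters more over time.
    lam = n_old / (n_old + n_new)
    return lam * loss_d + (1.0 - lam) * loss_c
```

For the first batch (`n_old = 0`) only the classification loss remains, while late steps such as `n_old = 980, n_new = 20` weight distillation at 0.98.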
4 Diagnosis: FC Layer is Biased
The baseline model has a bias towards the new classes, due to the imbalance between the number of samples from the new classes and the number of exemplars from the old classes. Our hypothesis is that the last fully connected layer is biased, as its weights are not shared across classes. To validate this hypothesis, we design an experiment on the CIFAR-100 dataset with five incremental batches (each with 20 classes).
First, we train a set of incremental classifiers using the baseline method. The classification accuracy quickly drops as more incremental steps arrive (shown as the bottom curve in Fig. 4-(a)). For the last incremental step (class 81-100), we observe a strong bias towards the newest 20 classes in the confusion matrix (Fig. 4-(b)). Compared to the upper bound, i.e. the classifiers learned using all training data (the top curve in Fig. 4-(a)), the baseline model has a performance degradation.
Then, we conduct another experiment to evaluate whether the fully connected layer is heavily biased. This experiment has two steps for each incremental batch: (a) applying the baseline method to learn both the feature and fully connected layers, and (b) freezing the feature layers and retraining the fully connected layer alone using all training samples from both old and new classes. Compared to the baseline, the accuracy improves (the second curve from the top in Fig. 4-(a)); the accuracy of the final classifier on 100 classes improves by 20%. These results validate our hypothesis that the fully connected layer is heavily biased. We also observe a gap between this result and the upper bound, which reflects the bias within the feature layers. In this paper, we focus on correcting the bias in the fully connected layer.
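Retraining the FC layer on frozen features amounts to softmax regression on fixed feature vectors. The following NumPy sketch is a hypothetical stand-in for that diagnostic, with function name and hyperparameters of our own choosing:

```python
import numpy as np

def retrain_fc(features, labels, num_classes, lr=0.1, steps=500, seed=0):
    # Refit only the final linear classifier on fixed (frozen) features
    # by gradient descent on the softmax cross-entropy loss.
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    W = 0.01 * rng.standard_normal((dim, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        z = features @ W + b
        z -= z.max(axis=1, keepdims=True)      # numerical stability
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - onehot) / len(features)       # softmax cross-entropy gradient
        W -= lr * features.T @ g               # only W and b are updated;
        b -= lr * g.sum(axis=0)                # the features stay frozen
    return W, b
```

Since only `W` and `b` change, any accuracy gained this way is attributable to the classifier layer alone, which is the point of the diagnostic.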
5 Bias Correction (BiC) Method
Based upon our finding that the fully connected layer is heavily biased, we propose a simple and effective bias correction method (BiC). Our method includes two stages in training (shown in Fig. 2). Firstly, we train the convolution layers and the fully connected layer by following the baseline method. At the second stage, we freeze both the convolution and the fully connected layers, and estimate two bias parameters by using a small validation set. In this section, we discuss how the validation set is generated and the details of the bias correction layer.
5.1 Validation Set
We estimate the bias by using a small validation set. The basic idea is to exclude the validation samples from the training of the feature representation, so that they reflect the unbiased distribution of both old and new classes in the feature space (shown in Fig. 5). Therefore, we split the exemplars from the old classes and the samples from the new classes into a training set and a validation set. The training set is used to learn the convolution and fully connected layers (see Fig. 2), while the validation set is used for the bias correction.
Fig. 2 illustrates the generation of the validation set. The stored exemplars from the old classes are split into a training subset (referred to as $train^{old}$) and a validation subset (referred to as $val^{old}$). The samples of the new classes are also split into a training subset (referred to as $train^{new}$) and a validation subset (referred to as $val^{new}$). $train^{old}$ and $train^{new}$ are used to learn the convolution and FC layers (see Fig. 2). $val^{old}$ and $val^{new}$ are used to estimate the parameters of the bias correction layer. Note that $val^{old}$ and $val^{new}$ are balanced, with the same number of samples per class.
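As a small illustration of the split, the balanced hold-out can be built by reserving an equal number of samples per class. The function name and the deterministic slicing are ours; a real implementation would sample randomly.

```python
def split_train_val(samples_by_class, val_per_class):
    # Hold out val_per_class samples from every class so the validation
    # set is balanced across old and new classes; the remainder forms the
    # training subset used to learn the convolution and FC layers.
    train, val = {}, {}
    for cls, samples in samples_by_class.items():
        val[cls] = samples[:val_per_class]
        train[cls] = samples[val_per_class:]
    return train, val
```

Because every class contributes the same number of held-out samples, the validation set does not itself favor the (much larger) new classes.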
5.2 Bias Correction Layer
The bias correction layer should be simple, with a small number of parameters, since $val^{old}$ and $val^{new}$ are small. Thus, we use a linear model (with two parameters) to correct the bias. This is achieved by adding a bias correction layer in the network (shown in Fig. 2). We keep the output logits for the old classes ($1 \le k \le n$) and apply a linear model to correct the bias on the output logits for the new classes ($n < k \le n+m$) as follows:

$$q_k = \begin{cases} o_k & 1 \le k \le n \\ \alpha o_k + \beta & n < k \le n+m \end{cases} \quad (4)$$
where $\alpha$ and $\beta$ are the bias parameters for the new classes and $o_k$ (defined in Section 3) is the output logit of the $k$-th class. Note that the bias parameters ($\alpha$, $\beta$) are shared by all new classes, which allows us to estimate them with a small validation set. When optimizing the bias parameters, the convolution and fully connected layers are frozen. The classification loss (softmax with cross entropy) is used to optimize the bias parameters as follows:

$$\mathcal{L}_b = -\sum_{(x, y)} \sum_{k=1}^{n+m} \delta_{y=k} \log \frac{e^{q_k(x)}}{\sum_{j=1}^{n+m} e^{q_j(x)}} \quad (5)$$
We found that this simple linear model is effective to correct the bias introduced in the fully connected layer.
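As a concrete illustration of the two-parameter correction and its optimization on the held-out set, here is a minimal NumPy sketch (a hypothetical re-implementation; variable names, learning rate and step count are ours, not from the paper's code):

```python
import numpy as np

def bias_correct(logits, n_old, alpha, beta):
    # Keep old-class logits; apply the linear correction to the
    # new-class logits, as in Eq. 4.
    q = logits.copy()
    q[:, n_old:] = alpha * q[:, n_old:] + beta
    return q

def fit_bias(logits, labels, n_old, lr=0.03, steps=8000):
    # Fit (alpha, beta) by softmax cross-entropy on a held-out
    # validation set; all other layers stay frozen, so only the
    # cached logits are needed.
    alpha, beta = 1.0, 0.0
    onehot = np.eye(logits.shape[1])[labels]
    for _ in range(steps):
        q = bias_correct(logits, n_old, alpha, beta)
        q -= q.max(axis=1, keepdims=True)      # numerical stability
        p = np.exp(q)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - onehot) / len(logits)         # d(loss)/d(corrected logits)
        # Gradients flow only through the corrected new-class logits.
        alpha -= lr * (g[:, n_old:] * logits[:, n_old:]).sum()
        beta -= lr * g[:, n_old:].sum()
    return alpha, beta
```

Because the corrected logits are affine in (alpha, beta), this loss is convex in the two parameters, so plain gradient descent on the small validation set suffices.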
6 Experiments
We compare our BiC method to the state-of-the-art methods on two large datasets (ImageNet ILSVRC 2012 [20] and MS-Celeb-1M [6]) and one small dataset (CIFAR-100 [11]). We also perform ablation experiments to analyze the different components of our approach.
6.1 Datasets
We use all data in CIFAR-100 and ImageNet ILSVRC 2012 (referred to as ImageNet-1000), and randomly choose 10,000 classes from MS-Celeb-1M (referred to as Celeb-10000). We follow the iCaRL benchmark protocol [19] to select exemplars. The total number of exemplars for the old classes is fixed. The details of the three datasets are as follows:
CIFAR-100: contains 60k RGB images of 100 object classes. Each class has 500 training images and 100 testing images. 100 classes are split into 5, 10, 20 and 50 incremental batches. 2,000 samples are stored as exemplars.
ImageNet-1000: includes 1,281,167 images for training and 50,000 images for validation. 1000 classes are split into 10 incremental batches. 20,000 samples are stored as exemplars.
Celeb-10000: a random subset of 10,000 classes selected from the MS-Celeb-1M-base [5] face dataset, which has 20,000 classes. MS-Celeb-1M-base is a smaller yet nearly noise-free version of MS-Celeb-1M [6], which has nearly 100,000 classes with a total of 1.2 million aligned face images. For the randomly selected 10,000 classes, there are 293,052 images for training and 141,984 images for validation. The 10,000 classes are split into 10 incremental batches (1,000 classes per batch). 50,000 samples are stored as exemplars.
For our BiC method, the train/validation split ratio on the exemplars is 9:1 for CIFAR-100 and ImageNet-1000. This ratio is obtained from the ablation study (see Section 6.6). We change the split ratio to 4:1 on Celeb-10000, so that at least one validation image is kept per person.
6.2 Implementation Details
Our implementation uses TensorFlow [1]. We use an 18-layer ResNet [7] for ImageNet-1000 and Celeb-10000, and a 32-layer ResNet for CIFAR-100. The ResNet implementation is from the TensorFlow official models (https://github.com/tensorflow/models/tree/master/official/resnet). The training details for each dataset are as follows:
ImageNet-1000 and Celeb-10000: Each incremental training runs for 100 epochs. The learning rate starts at 0.1 and is reduced to 1/10 of the previous learning rate after 30, 60, 80 and 90 epochs. The weight decay is set to 0.0001 and the batch size is 256. Image pre-processing follows the VGG pre-processing steps [24], including random cropping, horizontal flipping, aspect-preserving resizing and mean subtraction.
CIFAR-100: Each incremental training runs for 250 epochs. The learning rate starts at 0.1 and is reduced to 0.01, 0.001 and 0.0001 after 100, 150 and 200 epochs, respectively. The weight decay is set to 0.0002 and the batch size is 128. Random cropping and horizontal flipping are adopted for data augmentation, following the original ResNet implementation [7].
For a fair comparison with iCaRL [19] and EEIL [2], we use the same networks, keep the same number of exemplars and follow the same protocols for splitting classes into incremental batches. We use the identical class order generated by the iCaRL implementation for CIFAR-100 and ImageNet-1000. On Celeb-10000, the class order is randomly generated and identical for all comparisons. The temperature scalar $T$ in Eq. 1 is set to 2, following [13, 2].
6.3 Comparison on Large Datasets
In this section, we compare our BiC method with the state-of-the-art methods on two large datasets (ImageNet-1000 and Celeb-10000). The state-of-the-art methods include LwF [13], iCaRL [19] and EEIL [2]. All of them utilize knowledge distillation to prevent catastrophic forgetting. iCaRL and EEIL keep exemplars for the old classes, while LwF does not use any old data.
The incremental learning results on ImageNet-1000 are shown in Table 1 and Figure 6-(a). Our BiC method outperforms both EEIL [2] and iCaRL [19] by a large margin. BiC has only a small gain over iCaRL for the first couple of incremental batches and is worse than EEIL in the first two increments. However, the gain of BiC increases as more incremental batches arrive. On the final incremental classifier over all classes, our BiC method outperforms EEIL and iCaRL by 18.5% and 26.5%, respectively. On average over 10 incremental batches, BiC outperforms EEIL and iCaRL by 11.1% and 19.7%, respectively.
Note that the data imbalance increases as more incremental steps arrive: the number of exemplars per old class decreases with each incremental step, since the total number of exemplars is fixed (following the fixed-memory protocol of EEIL [2] and iCaRL [19]). The gap between our BiC method and the other methods widens as the incremental step increases and the data become more imbalanced. This demonstrates the advantage of our BiC method.
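This shrinking per-class memory under the fixed budget can be made explicit (a small illustration; the function name is ours). For the ImageNet-1000 protocol with a 20,000-exemplar budget and 100 new classes per step, the per-class share drops from 200 at the first step to 20 at the tenth.

```python
def exemplars_per_class(total_budget, new_classes_per_step, step):
    # Under a fixed total memory budget, the per-class share of
    # exemplars shrinks as classes accumulate over incremental steps.
    return total_budget // (new_classes_per_step * step)
```

The same arithmetic on CIFAR-100 (2,000 exemplars, 20 classes per step) gives 20 exemplars per old class at the final step.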
We also observe that EEIL performs better on the second batch (even higher than on the first) on ImageNet-1000. This is mostly due to the enhanced data augmentation (EDA) in EEIL, which includes random brightness shift, contrast normalization, random cropping and horizontal flipping; EEIL [2] shows that EDA is effective for the first couple of incremental batches, when the data imbalance is still mild. In contrast, BiC only applies random cropping and horizontal flipping. Even without the enhanced data augmentation, our BiC still outperforms EEIL by a large margin on ImageNet-1000 from the third batch onward.
The incremental learning results on Celeb-10000 are shown in Table 2 and Figure 6-(b). To the best of our knowledge, no incremental learning method has reported results on 10,000 or more classes. The results for iCaRL are generated by applying its GitHub implementation (https://github.com/srebuffi/iCaRL) to the Celeb-10000 dataset. For the first couple of incremental steps, our BiC method is slightly better than iCaRL. From the third incremental step on, the gap becomes wider. At the last incremental step, BiC outperforms iCaRL by 22.4%. The average gain over 10 incremental batches is 13.2%.
These results demonstrate that our BiC method is more effective and robust in dealing with a large number of classes. As the number of classes increases, visually similar classes appear more frequently across different incremental batches with unbalanced data. This introduces a strong bias towards the new classes and causes visually similar old classes to be misclassified. Our BiC method effectively reduces this bias and improves the classification accuracy.
6.4 Comparison between Different Scales
In this section, we compare our BiC method with the state of the art at two different scales on ImageNet. The small scale uses 100 randomly selected classes (referred to as ImageNet-100), while the large scale involves all 1000 classes (referred to as ImageNet-1000). Both scales have 10 incremental batches. This follows the same protocol as EEIL [2] and iCaRL [19]. The results for ImageNet-1000 are the same as in the previous section.
The incremental learning results on ImageNet-100 and ImageNet-1000 are shown in Fig. 7. Our BiC method outperforms the state of the art at both scales in terms of both the final incremental accuracy and the average incremental accuracy, but the gain at the large scale is bigger. We also compare the final incremental accuracy (the last step) to the upper bound, which is obtained by training a non-incremental model using all classes and their training data (shown at the last step in Fig. 7). Compared to the upper bound, our BiC method degrades 10.5% and 16.0% on ImageNet-100 and ImageNet-1000, respectively, whereas EEIL [2] degrades 15.1% and 37.2%, and iCaRL [19] degrades 31.1% and 45.2%. Compared with EEIL and iCaRL, which degrade much more from the small scale to the large scale, our BiC method is much more consistent. This demonstrates that BiC is better able to handle the large scale.
6.5 Comparison on a Small Dataset
We also compare our BiC method with the state-of-the-art algorithms on a small dataset, CIFAR-100 [11]. The incremental learning results with four different splits of 5, 10, 20 and 50 classes are shown in Fig. 8. Our BiC method has similar performance to iCaRL [19] and EEIL [2]. BiC is better on the splits of 50 and 20 classes, but is slightly behind EEIL on the splits of 10 and 5 classes. The margins are small for all splits.
Although our method focuses on large scale incremental learning, it is also competitive at the small scale. Note that EEIL uses additional data augmentation, such as brightness augmentation and contrast normalization, which is not utilized in LwF, iCaRL or BiC.
6.6 Ablation Study
We now analyze the components of our BiC method and demonstrate their impact. The ablation study is performed on CIFAR-100 [11] with increments of 20 classes, as incremental learning on large datasets is time consuming. The size of the stored exemplars from the old classes is 2,000. In the following, we analyze (a) the impact of bias correction, (b) the split of the validation set, and (c) the sensitivity to exemplar selection.
The Impact of Bias Correction
[Table 3: variations (cls loss, distilling loss, bias removal, FC retrain) and accuracy after 20, 40, 60, 80 and 100 classes.]
We compare our BiC method with two variations of baselines and the upper bound, to analyze the impact of bias correction. The baselines and the upper bound are explained as follows:
baseline-1: the model is trained using the classification loss alone (Eq. 2).
baseline-2: the model is trained using both the distilling loss and the classification loss (Eq. 3). Compared to the baseline-1, the distilling loss is added.
BiC: the model is trained using both the distilling loss and the classification loss, with the bias correction.
upper bound: the model is firstly trained using both the distilling loss and classification loss. Then, the feature layers are frozen and the classifier layer (i.e. the fully connected layer) is retrained using all training data (including the samples from the old classes that are not stored). Although it is infeasible to have all training samples from the old classes, it shows the upper bound for the bias correction in the fully connected layer.
The incremental learning results are shown in Table 3. With the help of knowledge distillation, baseline-2 is slightly better than baseline-1 since it retains the classification capability on the old classes. However, both baseline-1 and baseline-2 have low accuracy at the final step of classifying all 100 classes (about 40%). This is mainly because of the data imbalance between the old and new classes. With the bias correction, BiC improves the accuracy at all incremental steps. The classification accuracy at the final step (100 classes) is boosted from 40.34% to 56.69%. This demonstrates that the bias is a big issue and that our method effectively addresses it. Furthermore, our method is close to the upper bound: the small gap (4.24%) between our approach (56.69%) and the upper bound (60.93%) shows the effectiveness of our method.
The confusion matrices of these four variations are shown in Fig. 9. Clearly, baseline-1 and baseline-2 suffer from the bias towards the new classes (strong confusions on the last 20 classes). BiC reduces the bias and has similar confusion matrix to the upper bound.
These results validate our hypothesis that there exists a strong bias towards the new classes in the last fully connected layer. In addition, the results demonstrate that the proposed bias correction, using a linear model on a small validation set, is capable of correcting the bias.
The Split of Validation Set We study the impact of different splits of the validation set (see Section 5.1). As illustrated in Fig. 2, our BiC splits the stored exemplars from the old classes into a training set ($train^{old}$) and a validation set ($val^{old}$). The samples from the new classes also have a train/val split ($train^{new}$ and $val^{new}$). $train^{old}$ and $train^{new}$ are used to learn the convolution layers and the fully connected layer, while $val^{old}$ and $val^{new}$ are used to learn the bias correction layer. Note that $val^{old}$ and $val^{new}$ are balanced, having the same number of samples per class. Since only a few exemplars are stored for the old classes, it is critical to find a split that handles the trade-off between training the feature representation and correcting the bias in the fully connected layer.
Table 4 shows the incremental learning results for four different train/validation splits of the exemplars. The 9:1 split has the best classification accuracy for all four incremental steps. The column 20 refers to learning a classifier for the first 20 classes, without incremental learning. As the portion for the validation set increases, the performance drops consistently due to the lack of exemplars (from the old classes) to train the feature layers. A small validation set (10% of the exemplars) is good enough to estimate the bias parameters ($\alpha$ and $\beta$ in Eq. 4). In this paper, we use the 9:1 split for all experiments except Celeb-10000. The 4:1 split is adopted for Celeb-10000, as each old class only has 5 exemplars at the last incremental step.
The Sensitivity of Exemplar Selection We also study the impact of different exemplar management strategies. We compare two strategies: (a) random selection, and (b) the exemplar management strategy proposed by iCaRL [19], which keeps the samples that are closest to the class center in the feature space. Both strategies store 2,000 exemplars from the old classes. The incremental learning results are shown in Table 5. The iCaRL exemplar management strategy performs slightly better than random selection; the gap is about 1%. This demonstrates that our method is not sensitive to the exemplar selection.
7 Conclusion
In this paper, we proposed a new method to address the imbalance issue in incremental learning, which is critical when the number of classes becomes large. First, we validated our hypothesis that the classifier layer (the last fully connected layer) has a strong bias towards the new classes, which have substantially more training data than the old classes. Second, we found that this bias can be effectively corrected by applying a linear model estimated on a small validation set. Our method achieves excellent results on two large datasets with 1,000+ classes (ImageNet ILSVRC 2012 and MS-Celeb-1M), outperforming the state of the art by a large margin (11.1% on ImageNet ILSVRC 2012 and 13.2% on MS-Celeb-1M).
Acknowledgments
Part of the work was done when Yue Wu was an intern at Microsoft. This research is supported in part by the NSF IIS Award 1651902.
References
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
-  Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In The European Conference on Computer Vision (ECCV), September 2018.
-  Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning. In Advances in neural information processing systems, pages 409–415, 2001.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Yandong Guo and Lei Zhang. One-shot face recognition by promoting underrepresented classes. 2017.
-  Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In ECCV, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
-  Heechul Jung, Jeongwoo Ju, Minju Jung, and Junmo Kim. Less-forgetting learning in deep neural networks. arXiv preprint arXiv:1607.00122, 2016.
-  James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
-  Ilja Kuzborskij, Francesco Orabona, and Barbara Caputo. From n to n+ 1: Multiclass transfer incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3358–3365, 2013.
-  Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.
-  David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6470–6479, 2017.
-  Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.
-  Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE transactions on pattern analysis and machine intelligence, 35(11):2624–2637, 2013.
-  Robi Polikar, Lalita Upda, Satish S Upda, and Vasant Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE transactions on systems, man, and cybernetics, part C (applications and reviews), 31(4):497–508, 2001.
-  Amal Rannen Ep Triki, Rahaf Aljundi, Matthew Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. In Proceedings ICCV 2017, pages 1320–1328, 2017.
-  Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl: Incremental classifier and representation learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
-  Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2994–3003, 2017.
-  Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the International Conference on Computer Vision, 2017.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Gan Sun, Yang Cong, Ji Liu, Lianqing Liu, Xiaowei Xu, and Haibin Yu. Lifelong metric learning. IEEE transactions on cybernetics, (99):1–12, 2018.
-  Gan Sun, Yang Cong, and Xiaowei Xu. Active lifelong learning with "watchdog". In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Ragav Venkatesan, Hemanth Venkateswara, Sethuraman Panchanathan, and Baoxin Li. A strategy for an uncompromising incremental learner. arXiv preprint arXiv:1705.00744, 2017.
-  Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, Yuxin Peng, and Zheng Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM international conference on Multimedia, pages 177–186. ACM, 2014.