Less-forgetful Learning for Domain Expansion in Deep Neural Networks
Expanding the domain that a deep neural network has already learned without accessing old domain data is a challenging task because deep neural networks forget previously learned information when learning new data from a new domain. In this paper, we propose a less-forgetful learning method for the domain expansion scenario. While existing domain adaptation techniques focus solely on adapting to new domains, the proposed technique focuses on working well with both old and new domains without needing to know whether the input comes from the old or the new domain. First, we present two naive approaches and show why they are problematic; we then propose a new method based on two properties for less-forgetful learning. Finally, we demonstrate the effectiveness of our method through experiments on image classification tasks. All datasets used in the paper will be released on our website for follow-up studies.
Deep neural networks (DNNs) have advanced to nearly human levels of object, face, and speech recognition [\citeauthoryearTaigman et al.2014] [\citeauthoryearGraves, Mohamed, and Hinton2013] [\citeauthoryearSzegedy et al.2014] [\citeauthoryearSimonyan and Zisserman2014] [\citeauthoryearZhang et al.2016] [\citeauthoryearRichardson, Reynolds, and Dehak2015]. Despite these advances, issues still remain. Domain adaptation (the same tasks but in different domains) is one of these remaining issues [\citeauthoryearGanin and Lempitsky2014] [\citeauthoryearGanin et al.2016] [\citeauthoryearLong and Wang2015]. The domain adaptation problem concerns how well a DNN works in a new domain that has not been learned. In other words, these domain adaptation techniques focus on adapting only to new domains, but in an actual situation, applications often need to remember old domains as well without seeing the old domain data again. We call this the DNN domain expansion problem. Its concept is illustrated in Figure 1.
For example, suppose you have an object recognition system mounted on a robot or a smartphone that has been trained with the ImageNet dataset [\citeauthoryearRussakovsky et al.2015]. The real-world environment is so diverse (e.g., with various lighting changes) that the system will sometimes fail. Learning from the failure cases collected in the real-world environment could prevent the same failure from recurring when the DNN encounters the same situation. Unfortunately, the DNN forgets the information previously learned from the ImageNet dataset while learning the collected failure data. In other words, the object recognition system gradually loses its original ability; hence, it requires domain expansion functionality to preserve its ability on the ImageNet domain while adapting to the new domain that the ImageNet dataset did not cover.
The DNN domain expansion problem is particularly important for the following three main reasons:
It enables the DNNs to continually learn from sequentially incoming data.
In practice, users can fine-tune their DNNs using only new data collected from new environments without access to data from the old domain.
It makes it possible to build a single unified network that performs well in several domains.
In this paper, we propose a method to enable DNNs to achieve domain expansion functionality by alleviating the forgetting problem.
Domain Expansion Problem
We define the domain expansion problem as the problem of creating a network that works well on both an old domain and a new domain even after it is trained in a supervised way using only the data from the new domain, without accessing the data from the old domain. Two challenging issues must be faced in solving the domain expansion problem. First, the performance of the network on the old domain should not degrade when the new domain data are learned without seeing the old domain data (commonly known as the catastrophic forgetting problem). Second, the DNN should work well without any prior knowledge of which domain the input data come from. Figures 2 (a) and (b) show existing techniques that preserve the ability for the old domain but require prior knowledge about the data domain. Figure 2 (c) shows our proposed method, which preserves the old domain and does not require prior knowledge about the input data. Therefore, we focus on developing a new method that alleviates the catastrophic forgetting problem without any prior knowledge (e.g., old or new domain) about the input data.
The domain expansion problem is a special case of the continual learning problem: continual learning generally considers multiple tasks or a sequence of more than two domains, whereas domain expansion considers only two domains, an old one and a new one.
In this section, we review state-of-the-art techniques for addressing the catastrophic forgetting problem. Srivastava et al. proposed a local winner-take-all (LWTA) activation function that helps prevent the forgetting problem [\citeauthoryearSrivastava et al.2013]. This activation function is effective because it implements implicit long-term memory. Subsequently, several experiments on the forgetting problem in DNNs were performed empirically in [\citeauthoryearGoodfellow et al.2013a]. The results showed that the dropout method [\citeauthoryearHinton et al.2012] [\citeauthoryearSrivastava et al.2014] with a maxout [\citeauthoryearGoodfellow et al.2013b] activation function was helpful in reducing forgetting of the learned information. In addition, [\citeauthoryearGoodfellow et al.2013a] stated that a large DNN with dropout can address the catastrophic forgetting problem.
An unsupervised approach was also proposed in [\citeauthoryearGoodrich and Arel2014]. Goodrich et al. extended this method to a recurrent neural network [\citeauthoryearGoodrich and Arel2015]. These methods used an online clustering method that helps mitigate forgetting in a data-driven manner: cluster centroids are computed while learning the training data of the old domain, and the computed centroids are then used when learning the new domain.
The learning without forgetting (LwF) method [\citeauthoryearLi and Hoiem2016] was also proposed to improve the DNN performance in a new task (Figure 2 (a)). This method utilizes the knowledge distillation loss method to maintain the performance on the old data. Google DeepMind [\citeauthoryearRusu et al.2016] proposed a unified DNN based on progressive learning (PL) (Figure 2 (b)). The PL method enables one network to operate several tasks. (The applications in [\citeauthoryearRusu et al.2016] were Atari and three-dimensional maze games.) The idea is to use previously learned features when performing a new task via lateral connections. As mentioned in Section Domain Expansion Problem, these methods are difficult to directly apply to the domain expansion problem without any modification because they need to know information about the input data domain.
Elastic weight consolidation (EWC) is one of the methods used to solve the catastrophic forgetting problem [\citeauthoryearKirkpatrick et al.2017]. This technique uses a Fisher information matrix computed from the old domain training data, and uses its diagonal elements as coefficients of regularization to obtain similar weight parameters between the old and new networks when learning the new domain data. Furthermore, generative adversarial networks are also used for generating old domain data while learning new domain data [\citeauthoryearShin et al.2017].
|Fig. 3 type \ Fig. 2 type||Type A||Type B||Type C|
|Type A||-||-||EWC, ReplayGAN|
|Type B||LwF||PL||Proposed Method|
State-of-the-art algorithms can be classified into two types, as shown in Figure 3. The algorithms shown in Figure 3 (a) go through an ad-hoc training process to extract useful information from the old domain data; this extracted information is then used to alleviate the catastrophic forgetting problem when the network learns new domain data. Figure 3 (b) shows the proposed method, which trains the network on the old domain data in the usual way. This has the benefit that our method can be applied directly to any pre-trained model downloaded from the Internet, without access to the old domain training data. Table 1 summarizes state-of-the-art algorithms for each type shown in Figures 2 and 3.
Reformulation of Forgetting Problem
We denote the dataset for the old domain as $\mathcal{D}_o = \{(x_i, y_i)\}_{i=1}^{N_o}$ and the dataset for the new domain as $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{N_n}$, where $N_o$ and $N_n$ are the numbers of data points of the old and new domains, respectively. Furthermore, $x_i$ is a training sample, and $y_i$ is its corresponding label. These two datasets are mutually exclusive. Each dataset is split into a training set and a validation set: $\mathcal{D}_o = \mathcal{D}_o^{train} \cup \mathcal{D}_o^{val}$ and $\mathcal{D}_n = \mathcal{D}_n^{train} \cup \mathcal{D}_n^{val}$, where the superscripts $train$ and $val$ denote the training and validation subsets, respectively.
The old network $f(x; W_o)$ for the old domain is trained using $\mathcal{D}_o^{train}$, where $W_o$ is the weight parameter set for the old domain. The initial values of the weights are randomly drawn from a normal distribution. The trained weight parameters $W_o$ for the old domain are obtained using dataset $\mathcal{D}_o^{train}$. The new network $f(x; W_n)$ for the expanded domain, which is the union of the old domain and the new domain, is trained using dataset $\mathcal{D}_n^{train}$ without access to the old domain training data $\mathcal{D}_o^{train}$. Finally, we obtain updated weight parameters $W_n$ satisfying the less-forgetful condition $f(x; W_n) \approx f(x; W_o)$ for $x$ from $\mathcal{D}_o$. Our goal is to develop a method that satisfies this condition.
Fine-tuning only the softmax classifier layer
The most common method to use, such that the DNN does not forget what it has learned, is to freeze lower layers and fine-tune the final softmax classifier layer. This method regards the lower layer as a feature extractor and updates the linear classifier to adapt to new domain data. In other words, the feature extractor is shared between the old and new domains, and the method seems to preserve the old domain information.
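As a toy illustration of this baseline, the NumPy sketch below (our own construction, not the paper's Caffe setup) freezes a stand-in feature extractor and runs gradient descent on the softmax classifier alone; the layer sizes, data, and learning rate are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Frozen lower layers: a fixed random projection stands in for the feature extractor.
W_feat = rng.normal(size=(20, 8)) / np.sqrt(20)   # never updated
W_cls = np.zeros((8, 3))                          # softmax classifier: the only trainable part

def features(x):
    return np.maximum(x @ W_feat, 0.0)            # ReLU features from the frozen layers

x_new = rng.normal(size=(32, 20))                 # toy "new domain" batch
y_new = rng.integers(0, 3, size=32)
onehot = np.eye(3)[y_new]

def xent(x, y):
    p = softmax(features(x) @ W_cls)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

loss_before = xent(x_new, y_new)
for _ in range(200):                                  # gradient descent on the classifier only
    f = features(x_new)
    p = softmax(f @ W_cls)
    W_cls -= 0.1 * f.T @ (p - onehot) / len(y_new)    # lower layers stay untouched
loss_after = xent(x_new, y_new)
```

Because only `W_cls` moves, the representation shared with the old domain is preserved by construction; the trade-off, as discussed later, is limited capacity to adapt to the new domain.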
Weight constraint approach
The weight constraint method uses regularization to keep the weight parameters of the new network similar to those of the old network when learning the new data:

$\mathcal{L}_w(x; W_o, W_n) = \lambda_c \mathcal{L}_c(x; W_n) + \lambda_w \lVert W_o - W_n \rVert_2^2$, (1)

where $\lambda_c$ and $\lambda_w$ control the weight of each term, and $x$ comes from $\mathcal{D}_n^{train}$. The cross-entropy loss is defined as follows:

$\mathcal{L}_c(x; W_n) = -\sum_{i=1}^{C} t_i \log o_i(x)$, (2)

where $t_i$ is the $i$-th value of the ground-truth label; $o_i(x)$ is the $i$-th output value of the softmax of the network; and $C$ is the total number of classes. The parameter set $W_n$ is initialized to $W_o$. We then compute the new weight parameters by minimizing the loss function $\mathcal{L}_w$. This method was designed with the expectation that the learned information will be preserved if the weight parameters do not change much.
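The weight-constraint objective described above can be sketched as follows; the loss form mirrors the description (cross-entropy on the new domain plus an L2 penalty between old and new weights), while the λ values and toy weight matrices are purely illustrative.

```python
import numpy as np

def weight_constraint_loss(ce_loss, W_old, W_new, lam_c=1.0, lam_w=0.5):
    """Cross-entropy on the new domain plus an L2 pull toward the old weights."""
    penalty = sum(np.sum((wo - wn) ** 2) for wo, wn in zip(W_old, W_new))
    return lam_c * ce_loss + lam_w * penalty

# Toy example: one 2x2 weight matrix that has drifted from all-ones to all-zeros,
# giving a squared-distance penalty of 4.0.
W_old = [np.ones((2, 2))]
W_new = [np.zeros((2, 2))]
total = weight_constraint_loss(0.3, W_old, W_new)  # 1.0 * 0.3 + 0.5 * 4.0
```

In practice the penalty is added to the mini-batch loss at every step, so large deviations from the old weights are discouraged throughout training.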
In general, the lower layer in DNNs is considered as a feature extractor, while the top layer is regarded as a linear classifier, which means that the weights of the softmax classifier represent a decision boundary for classifying the features.
The features extracted from the top hidden layer are usually linearly separable because of the linear nature of the top layer classifier.
Using this knowledge, we propose a new learning scheme that satisfies the following two properties to reduce the tendency of the DNN to forget information learned from the old domain:
Property 1. The decision boundaries should be unchanged.
Property 2. The features extracted by the new network from the data of the old domain should be present in a position close to the features extracted by the old network from the data of the old domain.
We build the less-forgetful learning algorithm based on these two properties. The first property is easily implemented by setting the learning rate of the decision boundary (the softmax classifier layer) to zero. However, satisfying the second property is not trivial because we cannot access the old domain data. Therefore, instead of using the old domain data, we use the training data of the new domain and show that this is also helpful in satisfying Property 2.
Figure 4 briefly shows our algorithm. The details are as follows: as in the traditional fine-tuning method, we initially reuse the weights of the old network, which was trained using the training data of the old domain, as the initial weights of the new network. Next, we freeze the weights of the softmax classifier layer to preserve the boundaries of the classifier, and then we train the network to minimize the total loss function:

$\mathcal{L}_t(x; W_o, W_n) = \lambda_c \mathcal{L}_c(x; W_n) + \lambda_e \mathcal{L}_e(x; W_o, W_n)$, (3)

where $\mathcal{L}_t$, $\mathcal{L}_c$, and $\mathcal{L}_e$ are the total, cross-entropy, and Euclidean loss functions, respectively; $\lambda_c$ and $\lambda_e$ are tuning parameters for adjusting the scale between the two loss values; and $x$ comes from $\mathcal{D}_n^{train}$. Parameter $\lambda_e$ usually has a smaller value than $\lambda_c$; $\lambda_c$ is set to one for all the experiments in this paper.

The cross-entropy loss function defined in Eq. (2) helps the network correctly classify the input data $x$. $\mathcal{L}_e$ is defined as follows to satisfy the proposed second property:

$\mathcal{L}_e(x; W_o, W_n) = \frac{1}{2} \lVert f_{L-1}(x; W_o) - f_{L-1}(x; W_n) \rVert_2^2$, (4)

where $L$ is the total number of layers, and $f_{L-1}$ is the feature vector of layer $L-1$, which is just before the softmax classifier layer. Using this loss function, the new network learns to extract features similar to those extracted by the old network. Finally, we obtain the following equation:

$W_n = \operatorname*{arg\,min}_{W_n} \; \mathcal{L}_t(x; W_o, W_n) + \mathcal{R}(W_n)$, (5)

where $\mathcal{R}$ denotes a general regularization term, such as weight decay.
Finally, we build the less-forgetful learning algorithm, as shown in Algorithm 1. Parameters $N$ and $N_b$ in the algorithm denote the number of iterations and the size of the mini-batches, respectively.
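The two properties can be made concrete with a minimal NumPy sketch under simplifying assumptions (a single linear feature layer, toy data, illustrative λ values, plain gradient descent): the classifier weights are frozen (Property 1), and a Euclidean gradient term pulls the new features toward the old network's features (Property 2).

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, C, N = 10, 6, 3, 64          # input dim, feature dim, classes, batch size

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

W1_old = rng.normal(size=(D, H)) / np.sqrt(D)  # old feature layer (linear, for simplicity)
W2 = rng.normal(size=(H, C)) / np.sqrt(H)      # softmax classifier: frozen (Property 1)
x = rng.normal(size=(N, D))                    # toy new-domain batch
y = rng.integers(0, C, size=N)
onehot = np.eye(C)[y]

def train(lam_e, lam_c=1.0, lr=0.1, steps=100):
    W1 = W1_old.copy()                          # start from the old network's weights
    for _ in range(steps):
        f_new, f_old = x @ W1, x @ W1_old
        p = softmax(f_new @ W2)
        g_ce = x.T @ ((p - onehot) @ W2.T) / N  # cross-entropy gradient w.r.t. features
        g_e = x.T @ (f_new - f_old) / N         # Property 2: pull features toward old ones
        W1 -= lr * (lam_c * g_ce + lam_e * g_e) # W2 is never updated
    return W1

def feature_drift(W1):
    return np.linalg.norm(x @ W1 - x @ W1_old) / np.linalg.norm(x @ W1_old)

drift_lf = feature_drift(train(lam_e=1.0))  # with the Euclidean (less-forgetful) term
drift_ft = feature_drift(train(lam_e=0.0))  # plain fine-tuning of the feature layer
```

With the Euclidean term active, the features of the new network stay measurably closer to those of the old network, which is exactly the behavior the second property asks for.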
Details of Datasets
We conducted two different experiments for image classification: one using datasets consisting of tiny images (CIFAR-10 [\citeauthoryearKrizhevsky and Hinton2009], MNIST [\citeauthoryearLeCun et al.1998], SVHN [\citeauthoryearNetzer et al.2011]) and one using a dataset made up of large images (ImageNet [\citeauthoryearRussakovsky et al.2015]). Figure 5 shows example images from the datasets used in the experiments, and Table 2 presents the number of images for each dataset. The original training and test sets for the SVHN dataset contain 73,257 and 26,032 images, respectively. However, we randomly selected images from the dataset to match the number of images in the MNIST dataset.
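The subsampling step for SVHN can be sketched as below; `match_count` is a hypothetical helper name, and the small toy arrays stand in for the real SVHN data.

```python
import numpy as np

def match_count(data, labels, n_target, seed=0):
    """Randomly subsample (data, labels) down to n_target examples without replacement."""
    idx = np.random.default_rng(seed).choice(len(data), size=n_target, replace=False)
    return data[idx], labels[idx]

# Toy stand-ins; the real SVHN training set has 73,257 images, subsampled to MNIST's size.
data = np.arange(100).reshape(100, 1)
labels = np.arange(100) % 10
sub_data, sub_labels = match_count(data, labels, n_target=60)
```

Sampling without replacement keeps the subsampled set a proper subset, so per-class proportions are preserved in expectation.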
|Old domain||MNIST||CIFAR-10 Color||ImageNet Normal|
|New domain||SVHN||CIFAR-10 Gray||ImageNet Dark & Bright|
Details of Comparison Methods
Next, we compare the classification performance of the proposed algorithm with that of state-of-the-art methods. First, we test two naive approaches, the weight constraint method and fine-tuning only the softmax classifier layer (Fine-tuning (Linear)), and we use them as baselines. Fine-tuning with various activation functions, such as ReLU, Maxout [\citeauthoryearGoodfellow et al.2013a], and LWTA [\citeauthoryearSrivastava et al.2013], is also used for performance comparison. Further, we show the classification rates of recent works such as LwF [\citeauthoryearLi and Hoiem2016] and EWC [\citeauthoryearKirkpatrick et al.2017].
We used the Caffe framework [\citeauthoryearJia et al.2014] to implement our algorithm and the baseline methods. The architectures for the tiny image classification experiment are shown in Table 3. Three consecutive convolutional layers and a fully connected layer were used with ReLU, Maxout, or LWTA, and the last softmax classifier layer comprised 10 nodes. We used GoogleNet [\citeauthoryearSzegedy et al.2014] for the ImageNet dataset, and the number of nodes of its softmax classifier layer was set to 50. The parameters for the solvers are listed in Table 4. All the experiments, such as fine-tuning, weight constraint, modified LwF, and LF, used the same parameters and architectures.
|Dataset||MNIST → SVHN||CIFAR-10 Color → Gray|
|Layers||INPUT (28×28×3)||INPUT (32×32×3)|
|ReLU or Maxout or LWTA||ReLU or Maxout or LWTA|
|MAXPOOL (3×3, 2)||MAXPOOL (3×3, 2)|
|ReLU or Maxout or LWTA||ReLU or Maxout or LWTA|
|MAXPOOL (3×3, 2)||MAXPOOL (3×3, 2)|
|ReLU or Maxout or LWTA||ReLU or Maxout or LWTA|
|MAXPOOL (3×3, 2)||MAXPOOL (3×3, 2)|
|FC (200)||FC (200)|
|ReLU or Maxout or LWTA||ReLU or Maxout or LWTA|
|FC (10)||FC (10)|
|learning rate (lr)||0.01||0.0001||0.01||0.001|
Tiny image classification (MNIST, SVHN, and CIFAR-10)
We built two experimental scenarios to evaluate our method using the tiny image datasets. The first scenario was the domain expansion from MNIST to SVHN (MNIST → SVHN), while the second was the domain expansion from color to grayscale images using the CIFAR-10 dataset (CIFAR Color → CIFAR Gray). We also compared the proposed method with various existing methods, such as traditional fine-tuning, fine-tuning only the softmax classifier layer (Linear), the weight constraint method, and the modified LwF, to demonstrate the superiority of our method. Please see the supplementary material for details on the modified LwF method.
|Methods||Old (%)||New (%)||Avg. (%)|
|Old network (ReLU)||99.32||31.04||65.14|
|Old network (Maxout)||99.50||29.07||64.29|
|Old network (LWTA)||99.50||27.50||63.50|
|Modified LwF||94.78||83.77||89.28|
|Old network (ReLU)||77.84||64.09||70.96|
|Old network (Maxout)||78.64||64.90||71.77|
|Old network (LWTA)||76.04||65.72||70.88|
|Gray||Modified LwF||75.87||72.79||74.33|
Table 5 shows the classification rates obtained on the test sets of each dataset. The "old network" rows in Table 5 indicate training using only the training data of the old domain. The rest of the table shows the results of further training using each method with the training data of the new domain. In addition, the columns "old" and "new" in Table 5 represent the classification rates for each domain, while "avg." represents the average of the two classification rates. The modified LwF and EWC each have a hyper-parameter: the modified LwF hyper-parameter is explained in the supplementary material, and the EWC hyper-parameter corresponds to $\lambda$ in the original EWC paper [\citeauthoryearKirkpatrick et al.2017].
Our method outperformed state-of-the-art methods such as the modified LwF and EWC. The method that only fine-tuned the linear classifier failed to adapt to the new domain because only a few learnable parameters were available to learn the new domain. Meanwhile, the weight constraint method forgot the old domain information much more than our method did.
To examine the results more closely, we present the classification rate curves of each domain and the average classification rate for various values of $\lambda_e$, where $\lambda_c = 1$, in Figures 6 and 7, respectively. Figure 8 shows the experimental results for the case where parts of the old domain data can be accessed. This figure illustrates that our method was significantly more effective than the traditional fine-tuning method when the old domain data were partially accessible.
Realistic dataset (ImageNet)
The second experiment used the ImageNet 2012 dataset. This dataset is more realistic because the resolution of its training images is much higher than that of the other datasets, such as CIFAR-10, MNIST, and SVHN. The dataset also contains realistic variations, such as lighting changes and background clutter. We used a subset of the dataset, randomly choosing 50 classes from the original 1,000 classes to save training time. We used image brightness to divide the images into old and new domains: images of normal brightness were put in the old domain, while relatively bright or dark images were put in the new domain.
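A brightness-based split of this kind might be sketched as follows; the mean-intensity measure and the percentile thresholds are our own illustrative assumptions, since the exact criterion is not specified here.

```python
import numpy as np

def split_by_brightness(images, low_pct=20, high_pct=80):
    """Indices of mid-brightness (old domain) vs. dark/bright (new domain) images."""
    brightness = images.reshape(len(images), -1).mean(axis=1)  # mean pixel intensity
    lo, hi = np.percentile(brightness, [low_pct, high_pct])
    normal = (brightness >= lo) & (brightness <= hi)
    return np.where(normal)[0], np.where(~normal)[0]

rng = np.random.default_rng(2)
images = rng.random((100, 8, 8, 3))  # toy stand-ins for ImageNet images (H, W, RGB)
old_idx, new_idx = split_by_brightness(images)
```

Every image lands in exactly one domain, so the old and new domains stay disjoint, as the experimental setup requires.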
|Methods||Old (%)||New (%)||Avg. (%)|
|Old network (ReLU)||85.53||76.44||80.99|
|Dark & Bright||Modified LwF||80.46||85.54||83.00|
Table 6 shows the experimental results for the ImageNet dataset. As in the previous section, the traditional fine-tuning technique forgot much of the old domain. The modified LwF also mitigated the forgetting problem, but our method remembered more information from the old domain than the modified LwF. On average, our method improved the recognition rate by about 1.8% compared to the traditional fine-tuning method.
Are Maxout and LWTA activation functions helpful for mitigating catastrophic forgetting?
From the experimental results shown in Table 5, we conclude that the effect is not significant. Maxout showed the best performance among the three activation functions, and LWTA showed performance similar to that of ReLU. This might be caused by an increase in learnable parameters, because Maxout uses additional parameters to learn piecewise linear activation functions. Nevertheless, Maxout shows relatively low accuracy compared to state-of-the-art techniques such as EWC, the modified LwF, and our proposed method. This implies that simply changing activation functions is not very helpful in mitigating the catastrophic forgetting problem.
Limitation of the EWC
Our experimental results revealed a limitation of the EWC method. The problem emerges when some diagonal elements of the Fisher information matrix are very close to zero. In this case, even if the regularization coefficient $\lambda$ is made very large, forgetting still occurs because the EWC loss has no effect along directions where the Fisher information values are extremely small.
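This failure mode is easy to reproduce numerically with the standard diagonal EWC penalty; the Fisher values below are made up precisely to expose the (near-)zero directions.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher_diag, lam):
    """Diagonal EWC penalty: (lam / 2) * sum_i F_ii * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_old) ** 2)

theta_old = np.zeros(4)
theta = np.ones(4)                          # every weight moved by the same amount
fisher = np.array([2.0, 0.5, 1e-12, 0.0])   # last two directions: (near-)zero Fisher values

lam = 1e6                                   # even a huge coefficient cannot help here
per_dim = 0.5 * lam * fisher * (theta - theta_old) ** 2  # per-weight contribution
total = ewc_penalty(theta, theta_old, fisher, lam)
```

Although all four weights drifted equally far from the old solution, the penalty on the last two is negligible or exactly zero, so those weights can move freely and forgetting along those directions goes unpunished.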
Another problem arises from the fact that the Fisher information matrix is computed using the training data of the old domain. The Fisher information matrix is the key quantity used to alleviate the catastrophic forgetting problem in the EWC method, and it may be inaccurate for the test data of the old domain. Therefore, the method may fail on the test data of the old domain, causing the new network to forget a lot.
Effectiveness of the LF
Figures 9 (a) and (b) show the feature spaces after the traditional fine-tuning method and our proposed method are executed, respectively. With the proposed method, the high-level features of the old domain data extracted by each network (old and new) remain well clustered, even after re-training using only the new data is finished. Moreover, the old domain features extracted by the two networks are well mixed and are not distinguishable from each other. This is probably due to the $\mathcal{L}_e$ loss, which prevents significant changes in the feature space.
Further Analysis of Scratch Learning, Fine-tuning, and LF Learning
For further analysis, we conducted an additional experiment in which a network was learned from scratch on the new domain. We initialized the networks with random weights, trained them using only the data from the new domain, and compared the three different methods.
In the case of MNIST → SVHN, shown in Figure 10 (a), a new network trained from scratch achieves the best performance in the new domain (indicated by the orange color). On the other hand, its performance on the old domain is poor. This is natural because the network never saw any old domain data. Furthermore, we observed no improvement from the fine-tuning method on the new domain, because the amount of data in both MNIST and SVHN is large enough to learn the new domain. The positive effect of fine-tuning may appear when the amount of new domain data is small, as in the CIFAR Color → Gray and ImageNet Normal → Dark & Bright experiments. One interesting point in this experiment is that the average performance of "Scratch" (trained only using SVHN) over both domains is better than that of the "Old network" (trained using MNIST). From this observation, we infer that a network trained with more complex data generalizes better to other domains.
In the CIFAR Color → Gray experiment, the numbers of training images differ between the domains: 10,000 in the new domain and 50,000 in the old domain. The training images of the new domain are a disjoint set of the old domain training images converted into grayscale. Interestingly, the network trained only on the new domain training images shows no performance gap between the old and new domains, as shown in Figure 10 (b). This means that weights computed from grayscale images are also useful for distinguishing color images. We also observe that the performance of scratch learning on the new domain is significantly lower than that of the conventional fine-tuning method, because the number of training images in the new domain is small.
In the ImageNet Normal → Dark & Bright experiment, the number of training images in the new domain is much smaller than that in the old domain (52,503 vs. 5,978). Similar to the CIFAR experiment, the fine-tuning method outperforms learning from scratch on the new domain, as shown in Figure 10 (c). Moreover, unlike the CIFAR experiment, the classification rate on the old domain under scratch learning is the lowest among the three methods. In this case, we think an overfitting problem occurred because there are few training images in the new domain.
Feasibility for Continual Learning
To show the feasibility of our algorithm for the continual learning problem, we conducted further experiments using the CIFAR-10 dataset. Our experimental protocol is as follows. The CIFAR-10 training set is manually separated into ten disjoint groups, and each group is fed sequentially to the network. We assume that previous groups are not accessible. Each group is trained for a fixed number of iterations, and the same total number of iterations was used for both fine-tuning and LF learning. The offline learning baseline was trained on the whole training set. From the results in Table 7, we conclude that fine-tuning is not effective in the continual learning case, whereas our proposed LF method shows good results. As verified in the previous section, our method remembers the information of old data sets and hence achieves better results. From these results, we believe our LF method can be applied to the continual learning problem.
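The protocol above (ten disjoint groups presented to the network one at a time) can be set up as in this sketch; the 50,000 figure is the CIFAR-10 training-set size, and the helper name is our own.

```python
import numpy as np

def make_sequential_groups(n_samples, n_groups=10, seed=0):
    """Shuffle sample indices and split them into disjoint groups, fed one at a time."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, n_groups)

groups = make_sequential_groups(50_000, n_groups=10)  # CIFAR-10 training set
```

Because the groups partition a single permutation, they are disjoint and jointly cover the whole training set, matching the assumption that earlier groups become inaccessible once training moves on.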
In this paper, we introduced the domain expansion problem and proposed a new method, called less-forgetful learning, to solve it. Our method is effective in preserving the information of the old domain while adapting to the new domain, and it outperformed existing techniques such as fine-tuning with different activation functions, the modified LwF method, and the EWC method. In the experiments, our learning method was applied to image classification tasks, but it is flexible enough to be applied to other tasks, such as speech and text recognition.
- [\citeauthoryearGanin and Lempitsky2014] Ganin, Y., and Lempitsky, V. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.
- [\citeauthoryearGanin et al.2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(59):1–35.
- [\citeauthoryearGoodfellow et al.2013a] Goodfellow, I. J.; Mirza, M.; Xiao, D.; Courville, A.; and Bengio, Y. 2013a. An empirical investigation of catastrophic forgeting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
- [\citeauthoryearGoodfellow et al.2013b] Goodfellow, I. J.; Warde-Farley, D.; Mirza, M.; Courville, A.; and Bengio, Y. 2013b. Maxout networks. In International Conference on Machine Learning (ICML).
- [\citeauthoryearGoodrich and Arel2014] Goodrich, B., and Arel, I. 2014. Unsupervised neuron selection for mitigating catastrophic forgetting in neural networks. In Circuits and Systems (MWSCAS), 2014 IEEE 57th International Midwest Symposium on, 997–1000. IEEE.
- [\citeauthoryearGoodrich and Arel2015] Goodrich, B., and Arel, I. 2015. Mitigating catastrophic forgetting in temporal difference learning with function approximation.
- [\citeauthoryearGraves, Mohamed, and Hinton2013] Graves, A.; Mohamed, A.-r.; and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 6645–6649. IEEE.
- [\citeauthoryearHinton et al.2012] Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. Technical Report arXiv:1207.0580.
- [\citeauthoryearJia et al.2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
- [\citeauthoryearKirkpatrick et al.2017] Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 201611835.
- [\citeauthoryearKrizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
- [\citeauthoryearLeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
- [\citeauthoryearLi and Hoiem2016] Li, Z., and Hoiem, D. 2016. Learning without forgetting. In European Conference on Computer Vision, 614–629. Springer.
- [\citeauthoryearLong and Wang2015] Long, M., and Wang, J. 2015. Learning transferable features with deep adaptation networks. CoRR, abs/1502.02791 1:2.
- [\citeauthoryearNetzer et al.2011] Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, 4. Granada, Spain.
- [\citeauthoryearRichardson, Reynolds, and Dehak2015] Richardson, F.; Reynolds, D.; and Dehak, N. 2015. Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters 22(10):1671–1675.
- [\citeauthoryearRussakovsky et al.2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 1–42.
- [\citeauthoryearRusu et al.2016] Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.
- [\citeauthoryearShin et al.2017] Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual learning with deep generative replay. arXiv preprint arXiv:1705.08690.
- [\citeauthoryearSimonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [\citeauthoryearSrivastava et al.2013] Srivastava, R. K.; Masci, J.; Kazerounian, S.; Gomez, F.; and Schmidhuber, J. 2013. Compete to compute. In Advances in Neural Information Processing Systems (NIPS), 2310–2318.
- [\citeauthoryearSrivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
- [\citeauthoryearSzegedy et al.2014] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2014. Going deeper with convolutions. arXiv preprint arXiv:1409.4842.
- [\citeauthoryearTaigman et al.2014] Taigman, Y.; Yang, M.; Ranzato, M.; and Wolf, L. 2014. Deepface: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 1701–1708. IEEE.
- [\citeauthoryearVan der Maaten and Hinton2008] Van der Maaten, L., and Hinton, G. 2008. Visualizing data using t-sne. Journal of Machine Learning Research 9(2579-2605):85.
- [\citeauthoryearZhang et al.2016] Zhang, K.; Zhang, Z.; Li, Z.; and Qiao, Y. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10):1499–1503.