Positive-Unlabeled Compression on the Cloud
Many attempts have been made to extend the great success of convolutional neural networks (CNNs) achieved on high-end GPU servers to portable devices such as smartphones. Providing compression and acceleration services for deep learning models on the cloud is therefore significant and attractive for end users. However, existing network compression and acceleration approaches usually fine-tune the compact model by requesting the entire original training data (e.g. ImageNet), which can be more cumbersome than the network itself and cannot be easily uploaded to the cloud. In this paper, we present a novel positive-unlabeled (PU) setting for addressing this problem. In practice, only a small portion of the original training set is required as positive examples, and more useful training examples can be obtained from the massive unlabeled data on the cloud through a PU classifier with an attention-based multi-scale feature extractor. We further introduce a robust knowledge distillation (RKD) scheme to deal with the class imbalance problem of these newly augmented training examples. The superiority of the proposed method is verified through experiments conducted on benchmark models and datasets. Using only a small fraction of uniformly selected data from ImageNet, we obtain an efficient model with performance comparable to the baseline ResNet-34.
1 Introduction
Convolutional neural networks (CNNs) have been widely used in a variety of computer vision applications such as image classification Krizhevsky et al.; Oguntola et al. (2018); Sánchez et al. (2013), object detection Girshick (2015), semantic segmentation Noh et al. (2015), clustering Zhou et al. (2018), multi-label learning Shen et al. (2017), etc. CNNs are often over-parameterized to achieve good recognition performance. However, many empirical studies suggest that these redundant parameters or filters can be eliminated without affecting the performance of the network. To be compatible with various running environments (e.g. cell phones and autonomous driving) in real-world applications, well-trained neural networks need to be further compressed and accelerated accordingly. Considering the scalable computation resources (e.g. GPU and RAM) offered by the cloud, it is therefore promising to provide a network compression service for end users.
Compared with the model compression service offered by the cloud, it would be much harder for end users to compress the cumbersome network by themselves. On the one hand, GPUs are essential for effective deep learning. Rather than setting up their own servers, many users prefer to spin up cloud instances with GPUs to balance flexibility and investment, especially when the GPUs are only needed for several hours. On the other hand, not every user is a deep learning expert, and a cloud service would be expected to produce efficient deep neural networks according to users’ needs.
Existing methods such as quantization Gong et al. (2014), pruning Denton et al. (2014) and knowledge distillation Hinton et al. (2015) cannot be easily deployed on the cloud to compress the cumbersome network submitted by end customers. The major reason is that most of these methods require users to provide the original training data for fine-tuning the compressed network to avoid a large drop in accuracy. However, compared with the model size of modern CNNs, the size of the entire training data is much larger. For example, ResNet-50 He et al. (2016) occupies only about 95MB for storing its parameters, while its training dataset (i.e. ImageNet Krizhevsky et al.) contains more than one million images with a total file size of over 120GB. Therefore, given the limitation of transmission speed (e.g. 10MB/s), users have to wait for a long period of time before launching the compression methods, which harms the user experience of the service.
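The gap between the two upload times can be made concrete with a few lines of arithmetic. The sketch below assumes decimal units (1GB = 1000MB) and the 10MB/s transmission speed quoted above; the function name is illustrative only.

```python
def upload_seconds(size_mb: float, speed_mb_per_s: float) -> float:
    """Time needed to upload `size_mb` megabytes at a constant speed."""
    return size_mb / speed_mb_per_s


MODEL_MB = 95.0              # ResNet-50 parameters, as quoted in the text
DATASET_MB = 120.0 * 1000.0  # 120GB, assuming 1GB = 1000MB
SPEED_MB_PER_S = 10.0        # example transmission speed from the text

model_time = upload_seconds(MODEL_MB, SPEED_MB_PER_S)
dataset_time = upload_seconds(DATASET_MB, SPEED_MB_PER_S)

print(f"model:   {model_time:.1f} s")
print(f"dataset: {dataset_time / 3600:.2f} h")
```

Under these assumptions the network uploads in under ten seconds, whereas the dataset takes more than three hours.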
In this paper, we suggest a two-stage pipeline to leverage easily accessible unlabeled data for training compact neural networks, as shown in Fig. 1. Users are required to upload the pre-trained deep network and a small portion of the original training data. Taking the scarce labeled data as ‘positive’, the unlabeled pool (e.g. Flickr Jegou et al. (2008)) may contain ‘positive’ data that follow a similar distribution (e.g. of the same concept), while the remaining data are treated as ‘negative’. A binary PU classifier learned from these positive-unlabeled data can then be employed to identify the most related unlabeled data to augment the training set for our compression task. In order to correct the biased labels contained in the augmented dataset, we further develop a robust knowledge distiller (RKD) to address the problem of noisy and imbalanced labels. Experiments conducted on several benchmark datasets and deep models demonstrate that, with the help of massive unlabeled data, the proposed method is effective for learning efficient networks with only a small proportion of the original training data.
2 Positive-Unlabeled Classifier for More Data
Here we first present some preliminaries for learning efficient neural networks, and then develop a novel framework to effectively utilize massive unlabeled data on the cloud for training.
2.1 Knowledge for Compressing Neural Networks
Conventional deep model compression algorithms aim to eliminate redundant weights or filters in pre-trained deep neural networks. The resulting networks often have specific structures, such as sparse matrices and mixed-bit multiplications, which need additional technical support. In contrast, the knowledge distillation (KD) method Hinton et al. (2015) directly learns student networks with fewer parameters and lower computational complexity by inheriting feature information from the given teacher network.
Denoting the pre-trained teacher network and the desired efficient student network as $\mathcal{N}_T$ and $\mathcal{N}_S$, respectively, the student network is trained using the following objective function:

$$\mathcal{L}_{KD}(\mathcal{N}_S) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{H}_{cross}\left(\mathbf{y}_i^S, \mathbf{y}_i^T\right), \tag{1}$$

where $\mathcal{H}_{cross}$ is the cross-entropy loss, $n$ is the number of samples in the training set, and $\mathbf{y}_i^T$ and $\mathbf{y}_i^S$ are the output responses of the teacher network $\mathcal{N}_T$ and the student network $\mathcal{N}_S$, respectively, for the same input data $\mathbf{x}_i$. By using the KD method described in Eq. 1, the student network is able to generalize in the same way as the teacher network, and can empirically obtain a much better result than training it from scratch.
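As an illustration of the objective in Eq. 1, the following sketch computes the average cross-entropy between teacher and student outputs for a small batch. It assumes both networks emit raw scores that are turned into probabilities by a softmax; the function names and toy scores are illustrative and not part of the original method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, prediction, eps=1e-12):
    """Cross-entropy of `prediction` measured against the `target` distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, prediction))


def kd_loss(student_scores, teacher_scores):
    """Average cross-entropy between teacher and student outputs over a batch."""
    total = 0.0
    for s, t in zip(student_scores, teacher_scores):
        total += cross_entropy(softmax(t), softmax(s))
    return total / len(student_scores)


teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]]
agreeing_student = [[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]]
disagreeing_student = [[-1.0, 0.5, 2.0], [3.0, 1.0, 0.0]]

print(kd_loss(agreeing_student, teacher))
print(kd_loss(disagreeing_student, teacher))
```

The loss is smallest when the student reproduces the teacher's output distribution, which is the behaviour the distillation objective rewards.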
However, the number of samples in the training dataset of the teacher network is often extremely large, e.g. there are over 1.2 million images in the ILSVRC 2012 dataset with a file size of 120GB. In contrast, modern CNN architectures are increasingly lightweight, e.g. the model size of MobileNet-v2 Sandler et al. (2018) is only about 15MB. Thus, the time spent uploading such huge datasets degrades the user’s experience of the model compression service on the cloud.
2.2 Positive-Unlabeled Classifier for Selecting Data
In order to reduce the required number of samples in the training dataset, we propose to look for alternative data. Actually, there are massive datasets on the cloud servers for conducting different tasks (e.g. CIFAR, ImageNet and Flickr). We can regard them as unlabeled data, and a small proportion of the original samples as positive data. Thus, the data selection task is exactly a positive-unlabeled (PU) learning problem Kiryo et al. (2017); Xu et al. (2017).
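PU learning focuses on learning a classifier from positive and unlabeled data. Let $\mathbf{x} \in \mathbb{R}^d$ and $y \in \{+1, -1\}$ be the input samples and output labels, and let $n_l$ and $n_u$ be the number of labeled and unlabeled samples, respectively. The training set $\mathcal{T}$ of the PU classifier can be formulated as:

$$\mathcal{T} = \mathcal{L} \cup \mathcal{U}, \quad \mathcal{L} = \{(\mathbf{x}_i^l, +1)\}_{i=1}^{n_l}, \quad \mathcal{U} = \{\mathbf{x}_i^u\}_{i=1}^{n_u}, \tag{2}$$

where $\mathcal{L}$ is the labeled set and $\mathcal{U}$ is the unlabeled set, respectively.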
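Denote the desired decision function as $f$, built on a discriminant function that maps the input data to a real number, and let $\ell$ be an arbitrary loss function. The decision function can be optimized with the non-negative risk estimator of Kiryo et al. (2017):

$$\tilde{R}_{pu}(f) = \pi_p \hat{R}_p^+(f) + \max\left\{0,\ \hat{R}_u^-(f) - \pi_p \hat{R}_p^-(f)\right\}, \tag{3}$$

where $\hat{R}_p^+(f) = \frac{1}{n_l}\sum_{i=1}^{n_l}\ell(f(\mathbf{x}_i^l), +1)$, $\hat{R}_p^-(f) = \frac{1}{n_l}\sum_{i=1}^{n_l}\ell(f(\mathbf{x}_i^l), -1)$ and $\hat{R}_u^-(f) = \frac{1}{n_u}\sum_{i=1}^{n_u}\ell(f(\mathbf{x}_i^u), -1)$ are the corresponding risk functions, computed over the $n_l$ labeled samples $\mathbf{x}_i^l$ and the $n_u$ unlabeled samples $\mathbf{x}_i^u$, and $\pi_p$ is the class prior. $\ell(t, y)$ is an arbitrary loss function between the target $t$ and the ground truth label $y$.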
For the given pre-trained teacher network, we can ask the user to provide a tiny labeled dataset consisting of a small proportion of the original training set. We can then collect an unlabeled dataset on the cloud, and Eq. 3 can be further utilized to select more positive data from it to construct another training dataset for the subsequent model compression task.
Since the pre-trained teacher network is designed to solve the original task, such as an ordinary classification, it is infeasible to directly use the same architecture for PU classification Xu et al. (2019). Therefore, we introduce an attention-based multi-scale feature extractor for extracting features of the input data. Note that deep features transition from general to specific along the network, and the transferability of the features drops rapidly in higher layers. Simply using the feature produced by the last layer will produce a large transferability gap, while using combined features from layers at different locations of the network will reduce the gap.
Specifically, let $\mathbf{f}^l \in \mathbb{R}^{H_l \times W_l \times C_l}$ be the features extracted in the $l$-th layer. Note that these outputs cannot be directly concatenated because their heights and widths differ. A common way to mitigate this problem is global average pooling. Given $\mathbf{f}^l$, the global spatial information is compressed into a channel-wise descriptor $\mathbf{z}^l \in \mathbb{R}^{C_l}$ Hu et al. (2018), where the $c$-th element of $\mathbf{z}^l$ is calculated by:

$$z_c^l = \frac{1}{H_l \times W_l}\sum_{i=1}^{H_l}\sum_{j=1}^{W_l} f_c^l(i, j). \tag{4}$$

Given the compressed channel-wise descriptors, the simplest way is to directly concatenate them into a single vector. However, such a vector is not flexible enough to reflect the importance of the input signals, which represent features from general to specific. Generally, inputs containing more information should have a larger weight. Thus, we add attention on top of these descriptors for adaptation between modalities. Attention can be viewed as a way to allocate the input signal so that more informative components get more attention from the next layer, and it has been widely used in CNNs across a range of tasks Cao et al. (2015); Jaderberg et al. (2015). Specifically, given the concatenated channel-wise descriptor $\mathbf{z}$ of dimension $C$, we opt to employ a gating mechanism as suggested in Hu et al. (2018):

$$\mathbf{a} = \sigma\left(\mathbf{W}_2\, \delta(\mathbf{W}_1 \mathbf{z})\right), \tag{5}$$

in which $\delta$ is the ReLU transformation, and $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the parameters of two FC layers that reduce the dimensionality of the input by a ratio $r$, apply a non-linearity, and then increase the dimensionality back to the original. A sigmoid activation $\sigma$ is used to produce the attention weight $\mathbf{a}$. The final output is obtained by simply re-scaling the channel-wise descriptor:

$$\tilde{\mathbf{z}} = \mathbf{a} \odot \mathbf{z}. \tag{6}$$
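The pooling, gating and re-scaling steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: feature maps are nested lists indexed as channel, row, column; the two FC layers have no bias terms; and the random weights, layer shapes and helper names are hypothetical rather than taken from the paper.

```python
import math
import random


def global_average_pool(feature_map):
    """Return one mean value per channel of a C x H x W nested list."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention_descriptor(feature_maps, w1, w2):
    """Pool each layer, concatenate, gate with two FC layers, then re-scale."""
    z = []
    for feature_map in feature_maps:
        z.extend(global_average_pool(feature_map))
    hidden = [max(0.0, h) for h in matvec(w1, z)]                         # ReLU
    attention = [1.0 / (1.0 + math.exp(-s)) for s in matvec(w2, hidden)]  # sigmoid
    return [a * v for a, v in zip(attention, z)]


rng = random.Random(0)
shallow = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]  # 2 channels, 4 x 4
deep = [[[2.0] * 2 for _ in range(2)] for _ in range(3)]     # 3 channels, 2 x 2
channels, reduced = 5, 2                                     # hypothetical sizes
w1 = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(reduced)]
w2 = [[rng.uniform(-1, 1) for _ in range(reduced)] for _ in range(channels)]

descriptor = attention_descriptor([shallow, deep], w1, w2)
print(len(descriptor))
```

Because pooling removes the spatial dimensions, layers with different heights and widths can be concatenated, and each entry of the result is the pooled value scaled by an attention weight between 0 and 1.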
Based on the proposed feature extractor, we train the PU classifier on the unlabeled dataset and the tiny labeled dataset, which is randomly sampled from the original dataset. The labeled dataset is then expanded with the data from the unlabeled dataset that are classified as positive, which yields a larger positive dataset under the non-negative PU loss in Eq. 3. Specifically, we minimize the non-negative PU loss with stochastic gradient descent (SGD) and stochastic gradient ascent (SGA). Let $r$ denote the difference between the negative-class risk on the unlabeled data and the class-prior-weighted negative-class risk on the positive data, evaluated on the current mini-batch. When $r \geq 0$, we minimize Eq. 3 with SGD. Otherwise, the gradient of $r$ is computed and we update the parameters of the network with SGA. That is, we go along the gradient of $r$ in order to alleviate over-fitting to the current mini-batch. A more specific procedure is presented in Algorithm 1.
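The switch between descent and ascent can be illustrated on a single mini-batch. The sketch below follows the non-negative risk estimator of Kiryo et al. (2017) with a sigmoid loss; the choice of loss, the threshold of zero, the toy scores and the class prior values are assumptions made for illustration, and no gradients are computed here.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid loss 1 / (1 + exp(label * score)), written in a stable form."""
    return 0.5 * (1.0 - math.tanh(0.5 * label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_step(pos_scores, unl_scores, prior):
    """Return the non-negative PU risk, the term r, and the update mode."""
    risk_p_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    risk_p_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    risk_u_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    r = risk_u_neg - prior * risk_p_neg
    risk = prior * risk_p_pos + max(0.0, r)
    mode = "descent_on_risk" if r >= 0.0 else "ascent_on_r"
    return risk, r, mode


# Hypothetical classifier scores for one mini-batch.
print(nn_pu_step([2.0, 1.0], [-1.0, 0.5, -2.0], prior=0.3))
print(nn_pu_step([3.0], [-5.0, -6.0], prior=0.5))
```

In the first batch the term r is positive, so the full risk would be minimized; in the second batch r is negative, which signals over-fitting and triggers the ascent step on r instead.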
3 Robust Knowledge Distillation
The number of training examples in each class is usually balanced for better training of deep neural networks. However, the dataset generated by PU learning may suffer from a data imbalance problem. For example, in the ImageNet dataset the number of samples in category ‘dog’ is far larger than that in category ‘plane’, and there are no samples from category ‘deer’. When the labeled dataset is randomly sampled from CIFAR-10 and ImageNet is treated as the unlabeled dataset, the number of ‘dog’ samples will dominate the expanded dataset. Therefore, it is unsuitable to directly adopt the KD method on this imbalanced dataset.
Many works have focused on the data imbalance problem. However, they cannot be directly used in our problem, since the number of samples in each category of the expanded dataset is unknown. The PU learning method only distinguishes whether the images in the unlabeled dataset belong to the given dataset, but never deals with the specific classes of the input images.
In practice, we utilize the output of the teacher network. Note that instead of treating the output class label as the ground truth of the input sample, we treat the softened output response of the teacher as the pseudo ground truth vector. It is obtained from the final score output with a temperature parameter, which helps soften the output when the probability for one class is close to 1 and the others are close to 0. To this end, we propose a robust knowledge distillation (RKD) method to solve the data imbalance problem.
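The effect of the temperature on the pseudo ground truth vector can be seen in a short sketch. It assumes the standard temperature-scaled softmax of Hinton et al. (2015); the scores and temperature values below are hypothetical and are not the settings used in the experiments.

```python
import math


def soften(scores, temperature):
    """Temperature-scaled softmax: larger temperatures give flatter outputs."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_scores = [8.0, 2.0, 0.0]          # hypothetical final score output
sharp = soften(teacher_scores, temperature=1.0)
soft = soften(teacher_scores, temperature=4.0)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

With a temperature of 1 almost all probability mass sits on the first class, while a larger temperature spreads mass over the remaining classes and exposes their relative similarity.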
Specifically, we assign weight to each category of the samples, where categories with fewer samples will have larger weights. Based on this principle, defining , we have the weight vector , in which:
and is the number of categories in the original dataset. When training the student network, the weight of the input sample is defined as in which is the index of the largest element in the ground truth vector .
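The principle that categories with fewer samples receive larger weights can be sketched as follows. This is only one possible realization: the inverse-frequency form and the normalization that makes the weights sum to the number of categories are assumptions made here for illustration, and the exact weight definition used by the method may differ. The soft teacher outputs are toy values.

```python
def class_weights(soft_targets):
    """Inverse-frequency weights computed from summed teacher outputs.

    Assumption: weights are normalized so that they sum to the number of
    categories K; this normalization is illustrative.
    """
    num_classes = len(soft_targets[0])
    mass = [sum(y[k] for y in soft_targets) for k in range(num_classes)]
    inverse = [1.0 / max(m, 1e-12) for m in mass]
    scale = num_classes / sum(inverse)
    return [scale * v for v in inverse]


def sample_weight(weights, soft_target):
    """Weight of a sample: the weight of its highest-scoring category."""
    k = max(range(len(soft_target)), key=lambda i: soft_target[i])
    return weights[k]


# Three samples lean towards class 0 and one towards class 1.
soft_targets = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
weights = class_weights(soft_targets)
print([round(w, 3) for w in weights])
print(round(sample_weight(weights, [0.2, 0.8]), 3))
```

The rare class ends up with the larger weight, so its samples contribute more to the weighted distillation loss.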
Therefore, the surrogate KD loss can be derived based on Eq. 1:
Note that the derivation of is not optimal, since the predicted output response is not optimal and is contaminated with noise. However, we assume that the teacher network is well-trained, and there is only a slight difference between the elements in and the optimal weight vector :
Thus, we apply a random perturbation to each element of the original weight vector and obtain a finite set of possible weight vectors. Note that this is similar to cost-sensitive learning with multiple cost matrices. Based on these weight vectors, we are able to train the student network with the following equation:
in which is the hypothesis space. This is similar to the method proposed in Wang and Tang (2012). However, different from the cost matrix, the weight vector in Eq. 7 is only related to the proportion of the samples in each category and has nothing to do with the classification result, which is suitable for our learning problem. Besides, we solve a multi-class problem rather than a binary class problem.
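The idea of training against a finite set of perturbed weight vectors can be illustrated by evaluating the worst-case weighted loss over that set. The sketch below is an assumption-laden illustration: the uniform perturbation, its magnitude, the number of candidates and the function names are hypothetical, and only the loss evaluation is shown, not the optimization of the student network.

```python
import random


def perturbed_weight_set(weights, epsilon, count, seed=0):
    """Original weights plus `count` randomly perturbed copies (illustrative)."""
    rng = random.Random(seed)
    candidates = [list(weights)]
    for _ in range(count):
        candidates.append(
            [max(0.0, w + rng.uniform(-epsilon, epsilon)) for w in weights]
        )
    return candidates


def weighted_loss(per_sample_losses, classes, weights):
    """Average of per-sample losses, each scaled by its category weight."""
    total = sum(weights[k] * loss for loss, k in zip(per_sample_losses, classes))
    return total / len(per_sample_losses)


def worst_case_loss(per_sample_losses, classes, weight_set):
    """Largest weighted loss over all candidate weight vectors."""
    return max(weighted_loss(per_sample_losses, classes, w) for w in weight_set)


base_weights = [0.7, 1.3]               # hypothetical category weights
losses = [0.2, 0.4, 0.1, 0.9]           # hypothetical per-sample KD losses
classes = [0, 0, 0, 1]                  # pseudo category of each sample
weight_set = perturbed_weight_set(base_weights, epsilon=0.1, count=5)

print(weighted_loss(losses, classes, base_weights))
print(worst_case_loss(losses, classes, weight_set))
```

Since the unperturbed weights are part of the candidate set, the worst-case value is never smaller than the loss under the original weights; minimizing it guards the student against errors in the estimated weights.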
4 Experiments
The widely used CIFAR-10 benchmark is first selected as the original dataset, which is composed of images from 10 categories. We randomly select samples from each class to form the tiny labeled dataset of positive samples. The benchmark dataset ImageNet contains over 1.2 million images from 1,000 classes, but it is treated as the unlabeled dataset in our experiment. In this setting, ‘positive’ indicates that the category of the input sample belongs to one of the categories of the original dataset CIFAR-10. Recall that the class prior in Eq. 3 indicates the proportion of positive samples in the unlabeled dataset, which is assumed to be known in the following experiments. In practice, it can be estimated with the method in Ramaswamy et al. (2016). In this experiment, we manually select positive data from the unlabeled dataset based on the category names provided by the ImageNet 2012 classification dataset Krizhevsky et al., and train the student network on the manually selected data using the proposed RKD method as the baseline. The manually selected positive data amount to 269,427 samples (Baseline-1 in Tab. 1), and we set the class prior accordingly in the following experiments.
The model used in the first step is an attention based multi-scale feature extractor based on ResNet-34. Specifically, the channel-wise descriptor in Eq. 4 is derived from the outputs of groups in ResNet-34. The network is trained for epochs using SGD. We use a weight decay of and momentum of . We start with a learning rate of and divide it by every epochs. Data in ImageNet is resized to rather than in our experiment. Random flipping, random crop and zero-padding are used for data augmentation. In the second step, the teacher network is a pre-trained ResNet-34, and ResNet-18 is used as the student network. A weight decay of and momentum of is used. We optimized the student network using SGD by starting with a learning rate of and divide it by every epochs. is used in the following experiments.
Note that in the first step of our algorithm, the positive samples are automatically selected by the PU method. Thus, the number of training samples for the second step is not fixed, and can be influenced by the architecture of the network, the hyper-parameters used in the experiment, etc. In these circumstances, it is difficult to judge whether a good result benefits from the quality or the number of the training data. Therefore, there are two settings in our experiment. The first setting is to feed all the positive data selected by the PU method to the second step to train the student network. The other setting is to randomly select a subset of the data with the same size as the original training dataset (50,000 for CIFAR-10).
The experimental results are shown in Tab. 1. Here, the ‘Baseline-1’ method directly feeds the manually selected positive data to the second step. The ‘Baseline-2’ method randomly selects 50,000 samples and feeds them to the second step; these inevitably contain many negative data and should result in poor performance. ‘PU-s1’ is the setting of feeding all the positive data selected by the PU method to the second step, and ‘PU-s2’ is the setting of randomly feeding 50,000 positive data to the second step. In Tab. 1, the second column is the number of samples selected from each class in CIFAR-10, and the third column is the number of training samples used to train the student network. In the PU-s1 setting, the training set consists of the given labeled samples together with the positive samples selected from the unlabeled dataset by the PU method.
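The two ways of building the second-step training set can be sketched as below. This is a simplified illustration: samples are represented by plain identifiers, a positive classifier score is taken to mean "selected", and whether the labeled samples take part in the random subsampling is an assumption made here; the function names are hypothetical.

```python
import random


def select_positive(unlabeled, scores):
    """Keep the unlabeled samples that the PU classifier scores as positive."""
    return [x for x, s in zip(unlabeled, scores) if s > 0.0]


def build_training_set(labeled, selected, fixed_size=None, seed=0):
    """Use every sample, or draw a random subset of `fixed_size` samples."""
    pool = list(labeled) + list(selected)
    if fixed_size is None or fixed_size >= len(pool):
        return pool
    return random.Random(seed).sample(pool, fixed_size)


unlabeled = ["u0", "u1", "u2", "u3", "u4"]
scores = [0.8, -0.3, 1.2, -1.5, 0.1]      # hypothetical PU classifier outputs
labeled = ["l0", "l1"]

selected = select_positive(unlabeled, scores)
all_data = build_training_set(labeled, selected)                  # use everything
fixed_data = build_training_set(labeled, selected, fixed_size=3)  # fixed budget

print(all_data)
print(len(fixed_data))
```

Fixing the size of the training set separates the effect of data quality from the effect of data quantity when comparing against the baselines.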
The results show that the performance of the proposed method is even better than that of the baseline method. With only a small number of labeled samples from CIFAR-10 and the training samples selected from ImageNet, it achieves a higher accuracy than the baseline method with manually selected training data. This shows the superiority of the proposed method in selecting high-quality positive samples from the unlabeled dataset. In fact, manually selecting positive samples from ImageNet requires a huge effort, and our manual selection was not careful enough to exclude all the negative data from the manually selected dataset.
In the previous experiments the class prior is assumed to be known. In practice we may suffer from errors in estimating the class prior. Thus, a number of different class priors are given to the proposed algorithm in order to test its robustness to this parameter. All the experimental settings are exactly the same except for the value of the class prior. Fig. 3 shows the classification accuracies obtained with different class priors. A fixed number of training samples is randomly selected in the second step to alleviate the influence of the number of training samples. The same experiments are conducted on both ResNet-34 and the attention-based multi-scale feature extractor with the traditional KD and RKD methods to show the superiority of the proposed architecture and RKD method. The results show that the proposed architecture with the RKD method performs best, and is more robust to under- and over-estimation of the true class prior.
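| Algorithm | Labeled samples per class | Training samples | Data source | FLOPs | params | acc (%) |
|---|---|---|---|---|---|---|
| KD Hinton et al. (2015) | - | 50,000 | Original Data | 557M | 11M | 94.40 |
| Baseline-1 | - | 269,427 | Manually selected data | 557M | 11M | 93.44 |
| Baseline-2 | - | 50,000 | Randomly selected data | 557M | 11M | 87.02 |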
The experimental results show that although there are many negative data in the ImageNet dataset, the PU classifier can successfully pick a large amount of positive data whose categories are the same as those of the given data. Therefore, the extended dataset with the given data and the selected data can be used to train a portable student network.
Then, we conduct experiments on the ImageNet dataset, which is treated as the original dataset. The Flickr1M dataset (http://press.liacs.nl/mirflickr/mirdownload.html) is used as the unlabeled dataset. The experimental setting is the same as in the CIFAR-10 experiments, except for the number of training epochs and the learning-rate schedule used in both steps. The class prior is fixed in the following experiments. The experimental results are shown in Tab. 2.
| Algorithm | Data source | FLOPs | params | top-1 acc (%) | top-5 acc (%) |
In order to make a fair comparison, we randomly select a subset of samples from ImageNet and treat KD-500k as the baseline method. In the proposed method, we randomly select samples from each category in ImageNet to form a tiny labeled dataset, and then the PU method is used to select positive data from the Flickr1M dataset. The results show that when feeding all the positive samples to the second step, the top-5 accuracy is even better than that of the baseline method. The reason that the top-1 accuracy is worse than the baseline while the top-5 accuracy is better is that we do not distinguish the specific category when using the PU method. Thus, the proposed method is better at learning meta knowledge than the specific label. When using the same number of training samples, the proposed method has only a small top-5 accuracy drop compared to the baseline method while using only a fraction of the samples in the original dataset.
Fig. 3 shows the relationship between the number of samples selected from each category in ImageNet and the accuracy of the proposed method. It is obvious that our method still achieves a promising result when using only about samples of the original dataset.
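| Algorithm | 1 | 2 | 5 | 10 | 20 | data-free |
|---|---|---|---|---|---|---|
| data-free KD Lopes et al. (2017) | - | - | - | - | - | 92.5 |
| FitNet Romero et al. (2014) | 90.3 | 94.2 | 96.1 | 96.7 | 97.3 | - |
| FSKD Li et al. (2018) | 95.5 | 97.2 | 97.6 | 98.0 | 98.1 | - |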
Since most of the experiments in existing methods are conducted on the MNIST dataset, we further conduct experiments on this dataset in order to compare our method to state-of-the-art methods including FitNet Romero et al. (2014), FSKD Li et al. (2018) and the data-free KD method Lopes et al. (2017). The EMNIST dataset (https://www.westernsydney.edu.au/bens/home/reproducibleresearch/emnist), which contains hand-written letters and digits, is used as the unlabeled dataset. We randomly select 1, 2, 5, 10 and 20 samples from each category in MNIST to form the tiny labeled set. We use a standard LeNet-5 as the teacher network, and the student network is ‘half-size’ relative to the teacher network in terms of the number of feature map channels per convolutional layer. The class prior is fixed in the following experiments.
Detailed classification results are shown in Tab. 3. It is clear that the proposed method outperforms FitNet and FSKD by a notable margin and is more robust when the number of labeled samples in each category is extremely small.
5 Related Works
In this section, we briefly review related work on model compression.
Many algorithms have been designed for learning efficient neural networks with lower memory usage and computational complexity Han et al. (2019); Wang et al. (2018). For example, Gong et al. (2014) investigated the vector quantization approach for representing similar weights for smaller CNNs. Denton et al. (2014) exploited the redundancy within convolutional filters to derive approximations and significantly reduced the required computational costs. Chen et al. (2015) compressed the weights in neural networks using the hashing trick Shen et al. (2018, 2016). Hinton et al. (2015) presented the knowledge distillation approach for transferring information from the pre-trained teacher network to a compressed student network.
So far, there have been only a few attempts to learn efficient neural networks with some meta-data of the training set or without using the original training data. For instance, Srinivas and Babu (2015) directly removed redundant similar neurons in a systematic way. Based on knowledge distillation, Lopes et al. (2017) used some extra meta-data to learn smaller deep neural networks. However, the performance of the resulting networks learned through these methods is often much worse than that of the baseline network. This is because the amount of available data and information is extremely small. More recently, Chen et al. (2019) designed a generator for generating data with properties similar to those of the original dataset, which obtained promising performance but lacked efficiency in generating images.
6 Conclusion
Most existing network compression methods require the original dataset to achieve acceptable performance. However, the huge size of the training dataset leads to an unacceptable transmission cost from the end user to the cloud. Therefore, we propose a two-step framework to compress the given neural network using only a small portion of the training data. First, a PU classifier with an attention-based multi-scale feature extractor is trained with the given labeled data and massive unlabeled data on the cloud. Then, a new dataset is constructed by combining the given data and the ‘positive’ data selected by the PU classifier. Second, we develop a robust knowledge distillation (RKD) method to address the class imbalance problem with noise in the augmented dataset. Experiments on the MNIST, CIFAR-10 and ImageNet datasets demonstrate that the proposed method can successfully dig out more useful training samples using only a small amount of original data, and achieves state-of-the-art performance compared to other few-shot model compression methods.
Acknowledgments
We thank the anonymous area chair and reviewers for their helpful comments. Chang Xu was supported by the Australian Research Council under Project DE180101438.
References
- Cao et al. (2015). Look and think twice: capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2956–2964.
- Chen et al. (2019). Data-free learning of student networks. arXiv preprint arXiv:1904.01186.
- Chen et al. (2015). Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294.
- Denton et al. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277.
- Girshick (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
- Gong et al. (2014). Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
- Han et al. (2019). Full-stack filters to build minimum viable CNNs. arXiv preprint arXiv:1908.02023.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Hinton et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Hu et al. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
- Jaderberg et al. (2015). Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
- Jegou et al. (2008). Hamming embedding and weak geometric consistency for large scale image search. In European Conference on Computer Vision, pp. 304–317.
- Kiryo et al. (2017). Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, pp. 1675–1685.
- Krizhevsky et al. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- Li et al. (2018). Knowledge distillation from few samples. arXiv preprint arXiv:1812.01839.
- Lopes et al. (2017). Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535.
- Noh et al. (2015). Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528.
- Oguntola et al. (2018). SlimNets: an exploration of deep model compression and acceleration. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–6.
- Ramaswamy et al. (2016). Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pp. 2052–2060.
- Romero et al. (2014). FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- Sánchez et al. (2013). Image classification with the Fisher vector: theory and practice. International Journal of Computer Vision 105 (3), pp. 222–245.
- Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- Shen et al. (2017). Multilabel prediction via cross-view search. IEEE Transactions on Neural Networks and Learning Systems 29 (9), pp. 4324–4338.
- Shen et al. (2018). Multiview discrete hashing for scalable multimedia search. ACM Transactions on Intelligent Systems and Technology (TIST) 9 (5), pp. 53.
- Shen et al. (2016). Semi-paired discrete hashing: learning latent hash codes for semi-paired cross-view retrieval. IEEE Transactions on Cybernetics 47 (12), pp. 4275–4288.
- Srinivas and Babu (2015). Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149.
- Wang and Tang (2012). Minimax classifier for uncertain costs. arXiv preprint arXiv:1205.0406.
- Wang et al. (2018). Towards evolutionary compression. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2476–2485.
- Xu et al. (2019). Revisiting sample selection approach to positive-unlabeled learning: turning unlabeled data into positive rather than negative. arXiv preprint arXiv:1901.10155.
- Xu et al. (2017). Multi-positive and unlabeled learning. In IJCAI, pp. 3182–3188.
- Zhou et al. (2018). Deep adversarial subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1596–1604.