In Teacher We Trust:
Learning Compressed Models for Pedestrian Detection
Deep convolutional neural networks continue to advance the state-of-the-art in many domains as they grow bigger and more complex. It has been observed that many of the parameters of a large network are redundant, allowing for the possibility of learning a smaller network that mimics the outputs of the large network through a process called Knowledge Distillation. We show, however, that standard Knowledge Distillation is not effective for learning small models for the task of pedestrian detection. To improve this process, we introduce a higher-dimensional hint layer to increase information flow. We also estimate the variance in the outputs of the large network and propose a loss function to incorporate this uncertainty. Finally, we attempt to boost the complexity of the small network without increasing its size by using as input hand-designed features that have been demonstrated to be effective for pedestrian detection. We succeed in training a model that contains fewer parameters than the large network while outperforming AlexNet on the Caltech Pedestrian Dataset.
State-of-the-art deep convolutional neural networks are extremely large and require a vast amount of resources. For example, the classic VGG-16 image classification network  contains 138 million parameters, and the more recent ResNet-200  still contains over 60 million parameters.
This holds true not just for image classification. For instance, at the time of writing, the top three approaches for pedestrian detection as measured on the Caltech Pedestrian Dataset  consist of MSCNN , RPN+BF , both built upon the Faster-RCNN  architecture containing over 100 million parameters, and SA-FastRCNN  which features a network with over 30 million parameters. The larger a network is, the more disk space, memory, and energy it consumes, and the slower it is to use.
These large networks contain many redundant parameters [20, 7], so in theory they could be much smaller. To demonstrate this, we adopt Knowledge Distillation (KD)  to train a small student network to mimic the large teacher network.
KD was developed for classification on the ImageNet dataset  with the idea that the 1000-dimensional prediction from the teacher is much more informative than the single ground truth label. But for pedestrian classification where there are only 2 outputs (pedestrian / no pedestrian), this difference is much less pronounced. To increase the dimensionality of the data that the student tries to learn, we propose introducing just before the final layer a reasonably sized fully-connected hint layer whose outputs we try to match.
The teacher network does not always predict the correct label. The policy behind KD is to train the student to mimic the teacher regardless of the mistakes. However, if the teacher has an estimate of how confident its prediction is, then the student could make more informed decisions. Intuitively, if the teacher reports that it is very confident about its prediction, then the student should trust the teacher more. To produce a measure of teacher confidence, we follow the insights in  and utilize dropout at test time to fit a Gaussian distribution to the teacher outputs. We then propose a loss function that incorporates this information when learning from the teacher.
Hand-designed features are very popular in traditional computer vision. However, these features have not seen much success when used with deep learning . One theory is that deep networks are able to learn features internally that outperform the hand-designed features. If this is indeed the case, we hypothesize that small networks may not have the capacity to do so, and traditional feature extraction may improve small networks. We investigate this using Aggregate Channel Features (ACF)  which are popular and proven features for pedestrian detection.
In this paper we propose to use Knowledge Distillation to compress a large network for pedestrian classification. We explore variations on the training process by learning from the outputs of a hint layer inserted before the final fully-connected layer, introducing a loss function that takes into account output covariances, and using Aggregate Channel Features as input.
We show the effect of these modifications on student networks of various sizes, and show that they outperform both training from scratch and standard KD. We produce a model with only 157K parameters that outperforms AlexNet  which has over 57M parameters, 360 as many.
2 Related Works
Network Pruning The earliest studies into network size reduction came in the form of weight pruning, motivated by the need for regularization. These methods use the magnitude of the weights  or the Hessian of the loss function [20, 14] to prune away less useful weights. Apart from pruning weights, Srinivas and Babu  devised a method for pruning neurons directly without the use of any training data. These pruning approaches remove a significant amount of the uninformative parts of the network and results in lower computation costs and storage requirements.
Parameter Sharing Han et al.  introduced a multi-step pipeline with pruning, weight clustering and Huffman encoding. An orthogonal approach uses hashing or bucketing to quantize various parts of the model [4, 22]. Cheng et al.  enforce a circulant matrix model on the fully-connected layers to exploit faster computation and smaller model size via Fast Fourier Transforms. By quantizing and sharing parameters, the amount of space needed to store the network representation is reduced.
Matrix Decomposition Neural network weights can be treated as matrices and compressed through matrix decomposition. Denil et al.  use a low rank decomposition of the weight matrices together with a sparse dictionary learned from an autoencoder to reduce the number of parameters. Novikov et al.  apply the Tensor-Train decomposition  to compress the weight matrices in the fully-connected layers.
Transfer Learning. While the above methods compress an existing network directly, the underlying architecture remains bulky with the same wide and depth as before. An alternative is to consider transferring the knowledge to a new smaller network. This produces a much more compact model with dense weights instead of sparse weights. Moreover, it is possible to then apply the above methods on top of the new network.
Ba and Caruana  showed that it is possible to train a shallower but wider student network to mimic a teacher network, performing almost as well as the teacher. Hinton et al.  generalized this idea by training the student to learn from both the teacher and from the training data, naming this process Knowledge Distillation (KD). They demonstrated that students trained this way outperform those trained directly using only the training data. FitNets  use Knowledge Distillation with intermediate hint layers to train a thinner but deeper student network containing fewer parameters that outperforms even the teacher network. However, to the best of our knowledge, this approach has yet to be applied to training a network that is both thinner and shallower.
3 Knowledge Distillation
The process of Knowledge Distillation (KD) for classification networks is to train the student from the predictions of the teacher network in addition to the ground truth hard targets (Figure 0(a)). However, with a standard soft maximum (softmax) classification layer, the teacher predictions will often be very similar to the hard targets with one class having probability close to 1 and the other classes having probabilities close to 0. So, instead, a variant of the softmax function which includes a temperature parameter is used instead to produce soft targets.
When , this is the standard softmax function, while higher values of produce a smoother probability distribution over the classes. are the input logits to the softmax layer, and are also the outputs of the fully-connected layer before it.
The loss function used for training the student is a combination of the soft loss , the cross-entropy loss between the soft outputs of the student and teacher, as well as the hard loss , the standard classification cross-entropy loss between the student outputs and the ground truth labels.
4 Augmenting Knowledge Distillation
A graphical outline of our pipeline can be found in Figure 0(b). Here we explain the various parts of the pipeline and the motivations behind them.
4.1 Hint Layer
KD was developed for ImageNet classification with the idea that the 1000-dimensional prediction from the teacher is much more informative than the single ground truth label. But for pedestrian classification where there are only 2 outputs (pedestrian / no pedestrian), this difference is much less pronounced. Since the output of a softmax function sums up to 1 for every value of , the soft targets are actually only 1-dimensional.
To increase the dimensionality of the data that the student learns from, we introduce a hint layer, a fully-connected (FC) layer with 64 outputs in front of the final FC layer, and train the student to match the outputs of the hint layer instead. If the student network can perfectly match the hint layer outputs, then just by copying over the teacher’s final FC layer, the student will be able to mimic the teacher’s outputs. Even if the student cannot perfectly match the hint layer outputs, the weights from the teacher’s final FC layer still serve as a good initialization for the student’s final FC layer, which will be fine-tuned through the hard loss coming from the ground truth labels. In this work, we assume that the hint layer is the same size for both the teacher and student so that interpolation is not required.
This idea of matching earlier hint layers has been explored in FitNets . However, they choose to match a layer in the middle of the model to provide additional guidance in training a thinner but deeper student network. We instead propose this idea in order to increase the amount of information obtained from the teacher model in cases where there are only a small number of output classes.
The outputs of this hint layer cannot be interpreted as a probability distribution, so cross-entropy loss is not applicable. Instead, we use mean squared error loss as the soft loss.
Care must be taken when the activation for the hint layer is a rectified linear (ReLU) nonlinearity, in which case it is advised to match the values before passing them through the ReLU function. This is because the ReLU function discards information of negative values, and also because the gradient for where the student predicts a negative value is ignored, leading to instabilities in training.
4.2 Learning With Confidence
There will be cases where the teacher makes mistakes and predicts differently from the ground truth. The policy behind KD is to train the student to mimic the teacher regardless of the mistakes, relying on the hard losses to nudge the outputs towards the correct label. This results in a tension between the soft and hard losses, each producing a gradient for the opposite label.
This tension can be relaxed slightly if the teacher has an estimate of how confident its prediction is. Intuitively, if the teacher reports that it is very confident about its prediction, then the student should trust the teacher more, and if the teacher instead reports that it is not confident about its prediction, then the student should balance mimicking the teacher with predicting the correct label. The underlying assumption is that the teacher is more likely to be confident about examples that they predict correctly. There will be cases where the teacher is very confident yet mistaken, but we believe that it is important for the student not to disregard the teacher in these cases.
In , the authors draw a theoretical link casting dropout as a Bayesian approximation of Gaussian Processes. Following their ideas, we enable dropout during test time and forward the same input through the model times. Each pass can be thought of as the output of a single model sampled from an ensemble. From this, the sample mean and covariance of the outputs of the ensemble can be estimated.
By doing so, we are fitting a multivariate Gaussian distribution to the teacher outputs, from which it is possible to measure the likelihood of the student output as being drawn from the distribution. In particular, the likelihood of the student output is:
Maximizing the log-likelihood of Equation 5 is equivalent to minimizing the following loss function:
This function is the square of the Mahalanobis distance. Compared to the mean-square distance, it is smaller along dimensions of high variability, consistent with our idea of reporting smaller gradients for outputs that the teacher is not confident in.
One limitation of our method lies in the dimension of the covariance matrix. Since the loss function requires the inversion of covariance matrix, the number of samples must be larger than the dimensionality of the teacher’s output. However, the output from the time consuming convolutional layers can be cached, and only the last few layers with dropout need multiple passes, so the additional overhead during training is low.
4.3 Hand-designed Features as Input
Before deep learning became mainstream, computer vision was dominated by the use of specialized features for each task discovered through extensive experimentation. For example, the advent of HOG features  was ground-breaking in the development of pedestrian detection, and the introduction of Integral Channel Features  brought about another revolution, leading to the discovery of many derivative features such as Aggregate Channel Features  and Checkerboards features , the latter of which is competitive with state-of-the-art.
These hand-designed features are largely ignored in deep learning. In , the authors found that there was no improvement in neural networks trained using hand-designed features compared to those trained using raw RGB images as input. However, their model was large, so it is possible that it was able to learn features internally that outperform the hand-designed features. The same might not be true for a small model, in which case it may be reasonable to expect that by using these hand-designed features as input, the small model can be improved. The use of hand-designed features as input can also be thought of as attaching a fixed layer to the front of the network, pre-trained through years of human research.
For this reason, we explore training our student networks using Aggregate Channel Features (ACF) as input. We choose ACF because it offers a good trade-off between detection accuracy and speed, taking less than 10ms to compute for a image on a single CPU .
ACF consist of 10 channels: the LUV color channels, gradient magnitude, and six oriented gradient bins. The input image is first converted into these 10 channels, then, within each channel, pixels are divided into 4x4 blocks and summed.
Note that when we train the student using ACF features as input, the input to the teacher remains the original RGB image. Whether the student is trained on RGB or ACF, they learn from the exact same teacher.
5 Experimental setup
We perform all training and evaluation on the Caltech Pedestrian Dataset . Following standard practice, we use the first 5 sequences as the training set, the 6th sequence as the validation set, and the last 5 sequences as the test set.
We follow the setup of Caltech10x in  and sample every 3rd frame for training. We use the Reasonable configuration when testing on the Caltech test set, which samples every 30th frame and includes only pedestrians without significant occlusion with a minimum height of 50 pixels and excludes the labels “people” and “person?”. Evaluation is performed using the official evaluation script, which computes a curve of the logarithm of the number of false positives per image versus the miss rate. A value for the log-average miss rate is also calculated, and a lower value indicates a better result.
Our training set uses ground truth patches as well as patches with Intersection-over-Union (IoU) greater than 0.5 as positive patches, and patches with IoU less than 0.5 as negative patches. There are 31,129 positive patches and 748,139 negative patches in the training set.
5.2 Region Proposal
Teacher Network For our teacher network, We use pre-activation ResNet-200  pre-trained on ImageNet, augmented with dropout and a 64-dimensional hint layer, then fine-tuned on our training set.
Student Network We use pre-activation ResNet-18  pre-trained on ImageNet augmented with a 64-dimensional hint layer as the basis for our student networks. We experiment with three versions: unmodified ResNet-18, ResNet-18-Thin which cuts the number of channels for every layer in half, and ResNet-18-Small which fixes every layer at 32 channels. The compression rates of these models can be found in Table 2.
|conv1||, stride 2||, stride 2|
|pool, stride 2||pool, stride 2|
|dropout||FC(512, 64, ReLU)|
|FC(2048, 64, ReLU)||FC(64, 2, softmax)|
|FC(64, 2, softmax)|
5.4 Training configuration
Training is performed through stochastic gradient descent with Nesterov Momentum 0.9 and weight decay 0.0005. We use a batch size of 16, an epoch size of 1000 iterations, a learning rate of 0.01 dropping by a factor of 5 every 20 epochs, and a total of 70 epochs. Since there are many more negatives patches than positive patches in our training set, we force a positive to negative ratio of 1:3 for each training batch.
The inputs to the teacher network are 224x224x3 RGB patches, and the inputs to the student networks are either 224x224x3 RGB patches or 224x224x10 ACF patches. Patches are scaled by warping them to fit the input size, and RGB inputs are normalized using ImageNet mean and standard deviation. During training, patches are randomly flipped horizontally. The extraction of ACF features occur after the flip.
When combining soft and hard losses, hard losses are weighted with a lambda of 0.5. For dropout during testing, we use a probability of 0.5. When estimating the covariance of teacher output, we forward each input 200 times.
The models are trained with the Torch framework on a NVIDIA Titan X GPU with 12GB memory.
6 Experimental Evaluation
All of the evaluation results are reported on the Reasonable subset of the Caltech test set using the model at the epoch with the lowest log-average miss rate on the Reasonable subset of the Caltech validation set.
For all of the student models tested, the best performing configuration was to add a hint layer, use teacher confidence, and train with RGB inputs. A summary of our results can be found in Table 9. In the following sections, we break down the contribution from each of our innovations.
Table 3 compares the various models trained directly from ground truth as well as the student models trained using standard Knowledge Distillation on the softmax logits. The temperature used for KD, , was picked after testing multiple values.
Standard KD does not seem to work for the smaller models. In Figure 3, we visualize the histogram of the distribution of the two output logits on the entire test set for the teacher (ResNet-200), large (ResNet-18) and small (ResNet-18-Small) students trained via standard KD, and the small student trained with our full pipeline. Only the small student trained via standard KD is visibly different.
This difference highlights an important property of standard KD. With or without a temperature, the softmax function normalizes its outputs to sum to 1, and two different inputs can result in the same output. This is not necessarily a problem, but we hypothesize that with very small student models, for problems with very few output labels, standard KD does not offer enough guidance to be superior to training from ground truth labels.
6.2 Hint Layer
As reported in Table 4, adding a hint layer improves training for all student models. This enforces the idea that increasing the amount of information used for training models is beneficial.
It would be interesting to explore the effect of varying the size of the hint layer, but that is unfortunately outside the scope of this work. Is there a point where the hint layer is too large and dominated by noise instead of useful data?
6.3 Learning With Confidence
|Model||Direct||Conf||Hint + Conf (Ours)|
Table 5 shows that estimating output covariances and training with our proposed loss function improves the student models. There is slightly more improvement if the covariances are estimated from the 64-dimensional hint layer outputs instead of the 2-dimensional softmax logits.
This shows that there is indeed more information in the teacher network that can be extracted when treated as an ensemble using dropout.
6.4 Hand-designed Features as Input
|Model||RGB||ACF||ACF + Hint + Conf|
Our results in Table 6 are consistent with  in that a network trained directly using features like ACF as input is worse than if it were trained with raw RGB inputs. This holds true even as the network size is significantly reduced, though the drop in performance for smaller models is not as severe.
Similar to the results for our previous models taking RGB inputs, introducing a hint layer and factoring in teacher confidence offers similar amounts of improvement to the models taking ACF input. However, the models with ACF inputs still fall short compared to those that take RGB inputs, and we can only conclude that ACF is worse than RGB as inputs to pedestrian detection networks.
7 Network Analysis
In this section, we compare ResNet-200 (teacher) with ResNet-18-Small-RGB-Hint-Conf (student).
Model Similarity We visualize the weights of the first convolutional layer in Figure 4. The first layer weights are slightly different, however, we can see similar shapes in the patterns.
Failure Cases We tabulate the correct and incorrect predictions for each model in Table 7. The teacher and student networks predict the same label 78.28% of the time. We sample some example patches where the teacher predicts correctly but the student fails in Figure 4(a), 4(b) and the reverse in Figure 4(c). The student did not predict any positive patches correctly that the teacher had predicted incorrectly. The patches where the student outperformed the teacher are indeed harder to classify, and could be a result of the student model being much smaller and thus more regularized.
7.1 Resource Usage
We report the resource usage during test time on a NVIDIA Titan X in Table 8.
|Single Identity Layer||0.02ms||325MB|
It appears that modern GPUs are not affected very much by the number of channels in convolutional layers, so while ResNet-18-Thin and ResNet-18-Small are much smaller in terms of the number of parameters, they are not significantly faster than ResNet-18. However, the memory usage is significantly decreased. Ignoring the fixed amount of memory used by the inputs and the system measured using a model with a single identity layer, ResNet-18-Small uses less memory.
|#Parameters||63M (1)||11M (6)||2.8M (22)||157K (400)|
|Speed||24ms (1)||3ms (8)||3ms (8)||3ms (8)|
|Memory||5052MB (1)||612MB (8)||308MB (16)||240MB (21)|
We have shown that there is indeed a lot of redundancy in large deep neural networks. We have shown that it is possible to train a student network that contains 400 times fewer parameters while only observing a drop in log-average miss rate of 4.9%. The main gains of our approach utilizes the dimensionality of our new hint layers. We also described a method of obtaining a measure of confidence from the teacher network, and demonstrated that taking this information into account during training can lead to considerable gains. Our student models perform 8x faster than the teacher with 21x less memory usage.
-  J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
-  R. Benenson, M. Omran, J. Hosang, , and B. Schiele. Ten years of pedestrian detection, what have we learned? In ECCV, CVRSUAD workshop, 2014.
-  Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
-  W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural networks. arXiv preprint arXiv:1506.04449, 2015.
-  Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang. An exploration of parameter redundancy in deep networks with circulant projections. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2857–2865, 2015.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
-  M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2148–2156. Curran Associates, Inc., 2013.
-  P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1532–1545, 2014.
-  P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. Proc. British Machine Vision Conf, 2009.
-  P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 34, 2012.
-  Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. CoRR, abs/1506.02142, 2015.
-  S. Han, H. Mao, and W. J. Dally. A deep neural network compression pipeline: Pruning, quantization, huffman encoding. arXiv preprint arXiv:1510.00149, 2015.
-  S. J. Hanson and L. Y. Pratt. Comparing biases for minimal network construction with back-propagation. In Advances in neural information processing systems, pages 177–185, 1989.
-  B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In Neural Networks, 1993., IEEE International Conference on, pages 293–299. IEEE, 1993.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  J. H. Hosang, M. Omran, R. Benenson, and B. Schiele. Taking a deeper look at pedestrians. CoRR, abs/1501.05790, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
-  Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
-  J. Li, X. Liang, S. Shen, T. Xu, and S. Yan. Scale-aware fast r-cnn for pedestrian detection. arXiv preprint arXiv:1510.08160, 2015.
-  B. Moons, B. De Brabandere, L. Van Gool, and M. Verhelst. Energy-efficient convnets through approximate computing. arXiv preprint arXiv:1603.06777, 2016.
-  A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. Tensorizing neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 442–450. Curran Associates, Inc., 2015.
-  I. V. Oseledets. Tensor-train decomposition. SIAM J. Sci. Comput., 33(5):2295–2317, Sept. 2011.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pages 31.1–31.12, 2015.
-  L. Zhang, L. Lin, X. Liang, and K. He. Is faster r-cnn doing well for pedestrian detection? arXiv preprint arXiv:1607.07032, 2016.
-  S. Zhang, R. Benenson, and B. Schiele. Filtered channel features for pedestrian detection. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1751–1760. IEEE, 2015.