Structured Knowledge Distillation for Dense Prediction
In this paper, we consider transferring the structure information from large networks to small ones for dense prediction tasks. Previous knowledge distillation strategies used for dense prediction tasks often directly borrow the distillation scheme for image classification and perform knowledge distillation for each pixel separately, leading to sub-optimal performance. Here we propose to distill structured knowledge from large networks to small networks, taking into account the fact that dense prediction is a structured prediction problem. Specifically, we study two structured distillation schemes: i) pair-wise distillation that distills the pairwise similarities by building a static graph; and ii) holistic distillation that uses adversarial training to distill holistic knowledge. The effectiveness of our knowledge distillation approaches is demonstrated by extensive experiments on three dense prediction tasks: semantic segmentation, depth estimation and object detection.
Dense prediction is a category of fundamental problems in computer vision that learn a mapping from input objects to complex output structures, including semantic segmentation, depth estimation and object detection, among many others. One needs to assign category labels or regress specific values for each pixel given an input image to form the structured outputs. In general, these tasks are significantly more challenging to solve than image-level prediction problems, and thus often require networks with large capacity in order to achieve satisfactory accuracy. On the other hand, compact models are desirable for enabling edge computing with limited computation resources.
Deep neural networks have been the dominant solutions since the invention of fully-convolutional networks (FCNs). Subsequent approaches, e.g., DeepLab [7, 6, 8, 75], PSPNet, RefineNet, and FCOS, follow the design of FCNs to optimize energy-based objective functions related to different tasks, achieving significant improvements in accuracy, often with cumbersome models and expensive computation.
Recently, the design of neural networks with compact model size, light computation cost and high performance has attracted much attention due to the demand for applications on mobile devices. Most current efforts have been devoted to designing lightweight networks specifically for dense prediction tasks or borrowing designs from classification networks, e.g., ENet, ESPNet and ICNet for semantic segmentation, YOLO and SSD for object detection, and FastDepth for depth estimation. Advanced strategies, like pruning and knowledge distillation [34, 73], are applied to help train compact networks by making use of cumbersome networks.
The knowledge distillation strategy has been verified to be effective in classification tasks [25, 58]. Most previous works [34, 73] directly apply the distillation scheme to each pixel separately, transferring the class probability or the extracted feature embedding of the corresponding pixel produced from the cumbersome network (teacher) to the compact network (student) for dense prediction tasks. However, such a pixel-wise distillation scheme neglects the important structure information.
Considering the characteristics of dense prediction problems, we present structured knowledge distillation and transfer the structure information with two schemes: pair-wise distillation and holistic distillation. The pair-wise distillation scheme is motivated by the widely-studied pair-wise Markov random field framework for enforcing spatial labeling contiguity. The goal is to align the static affinity graphs, which capture both short- and long-range structure information among different locations, computed from the compact network and the cumbersome network.
The holistic distillation scheme aims to align higher-order consistencies, which are not characterized by the pixel-wise and pair-wise distillation, between the output structures produced from the compact network and the cumbersome network. We adopt the adversarial training scheme: a fully convolutional network, a.k.a. the discriminator, considers both the input image and the output structures to produce a holistic embedding which represents the quality of the structure. The compact network is encouraged to generate structures with embeddings similar to those of the cumbersome segmentation network. We distill the knowledge of structure quality into the weights of the discriminator.
To this end, we optimize an objective function that combines a conventional task loss with the distillation terms. The main contributions of this paper can be summarized as follows.
We study the knowledge distillation strategy for training accurate compact dense prediction networks.
We present two structured knowledge distillation schemes, pair-wise distillation and holistic distillation, enforcing pair-wise and high-order consistency between the outputs of the compact and cumbersome networks.
We demonstrate the effectiveness of our approach by improving recently-developed state-of-the-art compact networks on three different dense prediction tasks: semantic segmentation, depth estimation and object detection. Taking semantic segmentation as an example, the performance gain is illustrated in Figure 1. Code is available at: https://github.com/irfanICMLL/structure_knowledge_distillation.
2 Related Work
Semantic segmentation. Semantic segmentation is a typical pixel classification problem, which requires a high-level understanding of the whole scene as well as discriminative features for pixels from different classes. Deep convolutional neural networks have been the dominant solution to semantic segmentation since the pioneering works: the fully-convolutional network, DeConvNet, and U-Net. Various schemes have been developed for improving the network capability and accordingly the segmentation performance. For example, stronger backbone networks, e.g., GoogLeNets, ResNets, and DenseNets, have shown better segmentation performance. Improving the resolution through dilated convolutions [7, 6, 8, 75] or multi-path refinement networks leads to significant performance gains. Exploiting multi-scale context, e.g., dilated convolutions, pyramid pooling modules in PSPNet, atrous spatial pyramid pooling in DeepLab, and object context, also benefits the segmentation. Lin et al. combine deep models with structured output learning for semantic segmentation.
In addition to cumbersome networks for highly accurate segmentation, highly efficient segmentation networks have been attracting increasing interest due to the needs of real applications, e.g., mobile applications. Most works focus on lightweight network design, accelerating the convolution operations with factorization techniques. ENet, inspired by prior work, integrates several acceleration factors, including multi-branch modules, early feature-map resolution down-sampling, a small decoder size, filter tensor factorization, and so on. SQ adopts the SqueezeNet fire modules and parallel dilated convolution layers for efficient segmentation. ESPNet proposes an efficient spatial pyramid, based on filter factorization techniques (point-wise convolutions and a spatial pyramid of dilated convolutions), to replace the standard convolution. Efficient classification networks, e.g., MobileNet, ShuffleNet, and IGCNet, are also applied to accelerate segmentation. In addition, ICNet (image cascade network) exploits the efficiency of processing low-resolution images and the high inference quality of high-resolution ones, achieving a trade-off between efficiency and accuracy.
Depth estimation. Depth estimation from a monocular image is essentially an ill-posed problem, which requires an expressive model with strong reasoning ability. Previous works depend heavily on hand-crafted features or extra processing, like CRFs [11, 22] and super-pixel over-segmentation [37, 61], to capture the structure information. Since Laina et al. constructed a fully convolutional architecture to predict the depth map, subsequent works [35, 16] have benefited from the increasing ability of FCNs and achieved promising results. Besides, Fei et al. proposed a semantically informed geometric loss, while Wei et al. use a virtual normal loss to constrain the structure information. As in semantic segmentation, some works try to replace the encoder with efficient backbones [70, 63, 71] to decrease the computational cost, but suffer from training difficulties limited by the capacity of the compact network. The most similar work applies a pruning method to training the compact depth network. In contrast, we focus on structured knowledge distillation, inspired by the structure constraints widely used in depth estimation tasks.
Object detection. Object detection is a fundamental task in computer vision, in which one needs to regress a bounding box and predict a category label for each instance of interest in an image. The pioneering work R-CNN defined a two-stage fashion of object detection. Subsequent works [18, 56, 23] achieve significant performance by first predicting proposals and then refining the bounding boxes as well as predicting category labels. Many works also pay attention to detection efficiency, like YOLO and SSD [42, 15], which use one-stage methods and design lightweight network structures. However, the performance of efficient one-stage methods cannot match that of two-stage ones, because of the sample imbalance between objects of interest and background. RetinaNet alleviates the sample imbalance problem to some extent by proposing the focal loss, which makes the results of one-stage methods comparable to two-stage ones. However, most detectors rely on a set of pre-defined anchor boxes, which decreases the training samples and makes the detection network sensitive to hyper-parameters. Recently, anchor-free methods have become popular in object detection, like FCOS and CornerNet. FCOS employs a fully convolutional framework and predicts bounding boxes at every pixel, as in semantic segmentation, solving object detection as a dense prediction problem. In this work, we apply the structured knowledge distillation method with the FCOS framework, as it is simple and achieves good performance.
Knowledge distillation. Knowledge distillation is a way of transferring knowledge from a cumbersome model to a compact model to improve the performance of compact networks. It has been applied to image classification by using the class probabilities produced from the cumbersome model as "soft targets" for training the compact model [2, 25, 68] or by transferring the intermediate feature maps [58, 78].
There are also other applications, including object detection, pedestrian re-identification and so on. The MIMIC method distills a compact object detection network by making use of the two-stage Faster-RCNN: it aligns the feature maps at the pixel level and ignores the structure information among pixels.
The very recent and independently-developed application to semantic segmentation is related to our approach. It mainly distills the class probabilities for each pixel separately (like our pixel-wise distillation) and the center-surrounding differences of labels for each local patch (termed a local relation). In contrast, we focus on distilling structured knowledge: pair-wise distillation, which transfers the relations among different locations by building a static affinity graph, and holistic distillation, which transfers the holistic knowledge that captures high-order information. That local scheme can be seen as a special case of our pair-wise distillation.
This paper is a substantial extension of our previous conference paper. The main differences are threefold. 1) We extend the pair-wise distillation to a more general case in Section 3.1 and build a static graph with nodes and connections. We explore the influence of the graph size, and find that it is important to keep a global connection range. 2) We provide more explanations and ablations on the adversarial training for holistic distillation. 3) We also extend our method to two recently released strong baselines in depth estimation and object detection, by replacing the backbone with MobileNetV2, and further improve their performance.
Adversarial learning. Generative adversarial networks (GANs) have been widely studied in text generation [69, 76] and image synthesis [19, 31]. The conditional version  is successfully applied to image-to-image translation, including style transfer , image inpainting , image coloring  and text-to-image .
The idea of adversarial learning is also adopted in pose estimation, encouraging the estimated human pose not to be distinguishable from the ground truth, and in semantic segmentation, encouraging the estimated segmentation map not to be distinguishable from the ground-truth map. One challenge in such settings is the mismatch between the generator's continuous output and the discrete true labels, which makes the discriminator in the GAN of very limited success. Different from these methods, in our approach the employed GAN does not have this problem, as the ground truth for the discriminator is the teacher network's logits, which are real valued. We use adversarial learning to encourage the alignment between the output maps produced from the cumbersome network and the compact network. In the depth prediction task, however, the ground-truth maps are not discrete labels, and prior work uses the ground-truth maps as the real samples. Different from their method, our distillation method aligns the output of the cumbersome network and the compact network; the task loss calculated with the ground truth is optional. When the labelled data is limited, given a well-trained teacher, our method can be applied to unlabelled data to further improve accuracy.
In this section, we first introduce the structured knowledge distillation method based on semantic segmentation, the task of predicting a category label for each pixel in an image from $C$ categories. A segmentation network takes an RGB image $I$ of size $W \times H \times 3$ as the input and computes a feature map $F$ of size $W' \times H' \times N$, where $N$ is the number of channels. Then, a classifier is applied to compute the segmentation map $Q$ of size $W' \times H' \times C$ from $F$, which is upsampled to the spatial size $W \times H$ of the input image to form the segmentation result. We then extend our method to two other dense prediction tasks, depth estimation and object detection, under the FCN framework.
Pixel-wise distillation. We apply the knowledge distillation strategy to transfer the knowledge of the cumbersome segmentation network to the compact segmentation network for better training. We view the segmentation problem as a collection of separate pixel-labeling problems, and directly use knowledge distillation to align the class probabilities of each pixel produced from the compact network. We adopt the straightforward scheme of using the class probabilities produced from the cumbersome model as soft targets for training the compact network.
The loss function is given as follows,
$$\ell_{\mathrm{pi}} = \frac{1}{|\mathcal{R}|} \sum_{i \in \mathcal{R}} \mathrm{KL}\big(\mathbf{q}_i^{t} \,\|\, \mathbf{q}_i^{s}\big),$$
where $\mathbf{q}_i^{s}$ represents the class probabilities of the $i$th pixel produced from the compact network $S$, $\mathbf{q}_i^{t}$ represents the class probabilities of the $i$th pixel produced from the cumbersome network $T$, $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence between two probabilities, and $\mathcal{R}$ denotes the set of all pixels.
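As a concrete illustration, the per-pixel distillation loss can be sketched in NumPy as follows; the function and variable names are ours, not taken from the paper's code:

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_wise_distillation_loss(student_logits, teacher_logits):
    """Average per-pixel KL divergence KL(q^t || q^s).

    Both inputs are (C, H, W) score maps; the teacher's class
    probabilities act as soft targets for the student.
    """
    q_s = softmax(student_logits, axis=0)
    q_t = softmax(teacher_logits, axis=0)
    # KL divergence per pixel, summed over classes.
    kl_map = (q_t * (np.log(q_t + 1e-12) - np.log(q_s + 1e-12))).sum(axis=0)
    return kl_map.mean()  # average over all pixels
```

The loss is zero when student and teacher produce identical logits and grows as their per-pixel distributions diverge.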
3.1 Structured Knowledge Distillation
In addition to a straightforward scheme, pixel-wise distillation, we present two structured knowledge distillation schemes, pair-wise distillation and holistic distillation, to transfer structured knowledge from the cumbersome network to the compact network. The pipeline is illustrated in Figure 2.
Pair-wise distillation. Inspired by the pair-wise Markov random field framework that is widely adopted for improving spatial labeling contiguity, we propose to transfer the pair-wise relations, specifically the pair-wise similarities in our approach, among spatial locations.
We build a static affinity graph to denote the spatial pair-wise relations, in which the nodes represent different spatial locations and the connection between two nodes represents their similarity. We use the connection range $\beta$ and the granularity $\alpha$ of each node to control the size of the static affinity graph. For each node, we only consider the similarities with the top-$\beta$ nearest nodes according to spatial distance (here we use the Chebyshev distance), and aggregate the pixels in a spatial local patch to represent the feature of this node, as illustrated in Figure 3.
For a feature map of spatial size $W' \times H'$, there are $W' \times H'$ pixels. With the granularity $\alpha$ and the connection range $\beta$, the static affinity graph contains $\frac{W' \times H'}{\alpha}$ nodes with $\frac{W' \times H'}{\alpha} \times \beta$ connections.
Let $a_{ij}^{t}$ denote the similarity between the $i$th node and the $j$th node produced from the cumbersome network $T$, and $a_{ij}^{s}$ denote the similarity between the $i$th node and the $j$th node produced from the compact network $S$. We adopt the squared difference to formulate the pair-wise similarity distillation loss,
$$\ell_{\mathrm{pa}} = \frac{\alpha}{W' H' \beta} \sum_{i} \sum_{j \in \mathcal{N}(i)} \big(a_{ij}^{s} - a_{ij}^{t}\big)^{2},$$
where the outer sum runs over all nodes and $\mathcal{N}(i)$ denotes the top-$\beta$ neighbors of node $i$. In our implementation, we use average pooling to aggregate the pixels in one node into a feature $\mathbf{f}_i$, and the similarity between two nodes is simply computed from the aggregated features $\mathbf{f}_i$ and $\mathbf{f}_j$ as
$$a_{ij} = \frac{\mathbf{f}_i^{\top} \mathbf{f}_j}{\|\mathbf{f}_i\|_2 \, \|\mathbf{f}_j\|_2},$$
which empirically works well.
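A minimal NumPy sketch of the pair-wise scheme follows, using the full connection range for brevity (the top-β truncation is omitted) and helper names of our own choosing:

```python
import numpy as np

def node_features(feat, alpha):
    """Average-pool alpha x alpha local patches: (C, H, W) -> (C, H//alpha, W//alpha)."""
    c, h, w = feat.shape
    a = alpha
    return feat[:, :h // a * a, :w // a * a].reshape(
        c, h // a, a, w // a, a).mean(axis=(2, 4))

def affinity_graph(feat):
    """Cosine similarity a_ij between every pair of aggregated node features."""
    c = feat.shape[0]
    f = feat.reshape(c, -1)                                   # (C, nodes)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-12)
    return f.T @ f                                            # (nodes, nodes)

def pair_wise_distillation_loss(student_feat, teacher_feat, alpha=2):
    """Mean squared difference between student and teacher affinity graphs."""
    a_s = affinity_graph(node_features(student_feat, alpha))
    a_t = affinity_graph(node_features(teacher_feat, alpha))
    return ((a_s - a_t) ** 2).mean()
```

Larger `alpha` coarsens the graph (fewer nodes), trading fidelity for efficiency, which mirrors the granularity ablation discussed later.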
Holistic distillation. We align the high-order relations between the segmentation maps produced from the cumbersome and compact networks. The holistic embeddings of the segmentation maps are computed as the representations.
We adopt conditional generative adversarial learning for formulating the holistic distillation problem. The compact net is regarded as a generator conditioned on the input RGB image $I$, and the predicted segmentation map $Q^{s}$ is regarded as a fake sample. We expect $Q^{s}$ to be as similar as possible to $Q^{t}$, the segmentation map predicted by the teacher, which is regarded as the real sample. GANs usually suffer from unstable gradients when training the generator, due to the discontinuity of the Jensen-Shannon (JS) divergence, along with other common distances and divergences. The Wasserstein distance (also known as the Earth Mover's distance), defined as the minimum cost of transporting the model distribution $p_{s}$ to the real distribution $p_{t}$, can instead be used to measure the difference between the two distributions and helps overcome gradient vanishing or explosion in neural networks. The resulting holistic distillation loss is written as follows,
$$\ell_{\mathrm{ho}} = \mathbb{E}_{Q^{s} \sim p_{s}}\big[D(Q^{s} \,|\, I)\big] - \mathbb{E}_{Q^{t} \sim p_{t}}\big[D(Q^{t} \,|\, I)\big],$$
where $\mathbb{E}[\cdot]$ is the expectation operator, and $D(\cdot \,|\, I)$ is an embedding network, acting as the discriminator in the GAN, which projects $Q$ and $I$ together into a holistic embedding score. The Lipschitz requirement is satisfied by applying a gradient penalty.
The segmentation map and the conditional RGB image are concatenated as the input of the embedding network $D$, a fully convolutional neural network with five convolutions. Two self-attention modules are inserted between the final three layers to capture the structure information. A batch normalization layer is added before the concatenated input to handle the different scales of the RGB image and the logits.
Such a discriminator is able to produce a holistic embedding representing how well the input image and the segmentation map match. We further add a pooling layer to pool the holistic embedding into a score. As we employ the Wasserstein distance in the adversarial training, the discriminator is trained to give higher scores to the output segmentation maps from the teacher net and lower scores to those from the student. In this process, we distill the knowledge of evaluating the quality of a segmentation map into the discriminator. The student is trained with the regularization of achieving a higher score under the discriminator, which helps improve its performance.
The whole objective function consists of a conventional multi-class cross-entropy loss $\ell_{\mathrm{mc}}$ together with the pixel-wise and structured distillation terms:
$$\ell = \ell_{\mathrm{mc}} + \lambda_{1}\big(\ell_{\mathrm{pi}} + \ell_{\mathrm{pa}}\big) - \lambda_{2}\,\ell_{\mathrm{ho}},$$
where $\lambda_{1}$ and $\lambda_{2}$ are set to make these loss value ranges comparable. We minimize the objective function with respect to the parameters of the compact segmentation network $S$, while maximizing it with respect to the parameters of the discriminator $D$, which is implemented by iterating the following two steps:
Train the discriminator $D$. Training the discriminator is equivalent to minimizing the holistic distillation loss: $D$ aims to give a high embedding score to the real samples from the teacher net and a low embedding score to the fake samples from the student net.
Train the compact segmentation network $S$. Given the discriminator network, the goal is to minimize the multi-class cross-entropy loss and the distillation losses relevant to the compact segmentation network:
$$\ell_{\mathrm{mc}} + \lambda_{1}\big(\ell_{\mathrm{pi}} + \ell_{\mathrm{pa}}\big) - \lambda_{2}\,\ell_{\mathrm{ho}}^{s},$$
where $\ell_{\mathrm{ho}}^{s} = \mathbb{E}_{Q^{s} \sim p_{s}}\big[D(Q^{s} \,|\, I)\big]$ is the part of $\ell_{\mathrm{ho}}$ given in Equation 3 that depends on $S$, and we expect $Q^{s}$ to achieve a higher score under the evaluation of $D$.
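The combined objective and the two alternating steps can be sketched as follows; the weight values `lam1` and `lam2` are placeholders of our own, not the ones used in the paper:

```python
def student_loss(l_mc, l_pi, l_pa, d_score_student, lam1=1.0, lam2=0.1):
    """Step 2: the student minimizes the task loss plus distillation
    terms, while being pushed toward a higher discriminator score."""
    return l_mc + lam1 * (l_pi + l_pa) - lam2 * d_score_student

def discriminator_loss(d_score_teacher, d_score_student):
    """Step 1: the discriminator widens the Wasserstein gap, i.e.
    minimizes E[D(student)] - E[D(teacher)] (gradient penalty omitted)."""
    return d_score_student - d_score_teacher
```

Note that a higher discriminator score for the student strictly lowers the student's loss, which is what drives the holistic alignment.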
3.3 Extension to Dense Prediction Tasks
Dense prediction learns a mapping from an input RGB image of size $W \times H \times 3$ to a per-pixel output of size $W \times H \times C$. In semantic segmentation, the output has $C$ channels, equal to the number of semantic classes.
For the object detection task, for each pixel we predict the class probabilities, including one background class, as well as a 4D tensor representing the distances from the location to the four sides of the bounding box. We follow the task loss in FCOS, and combine it with the distillation terms as regularization.
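For illustration, the 4D regression target at one location can be computed as in the following sketch (a helper of our own, following the FCOS-style formulation described above):

```python
def fcos_regression_target(x, y, box):
    """Distances (l, t, r, b) from pixel location (x, y) to the four
    sides of a ground-truth box (x0, y0, x1, y1). All four values are
    non-negative only when the location falls inside the box."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)
```

A location outside the box yields at least one negative distance, which is how such locations can be filtered out as negatives.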
In the depth estimation task, depth estimation can be solved as a classification task, as the continuous depth values can be divided into discrete categories. After we get the predicted probabilities, we apply a weighted cross-entropy loss following prior work. The pair-wise distillation can be directly applied to the intermediate feature map. The holistic distillation uses the depth map as input. We could use the ground truth as the real samples of the GAN in the depth estimation task, because it is a continuous map. However, in order to apply our method to unlabelled data, we still use the depth maps from the teacher as the real samples.
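One common way to discretize continuous depth for such a classification formulation is log-space binning; the sketch below is illustrative (the exact discretization and the `d_min`/`d_max`/`n_bins` values in the cited work may differ):

```python
import numpy as np

def discretize_depth(depth, d_min=0.5, d_max=10.0, n_bins=80):
    """Map continuous depth values to discrete class indices, with bin
    edges spaced uniformly in log-depth so that near-range depths get
    finer bins. Parameter values here are assumptions, not the paper's."""
    d = np.clip(depth, d_min, d_max)
    t = (np.log(d) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return np.minimum((t * n_bins).astype(int), n_bins - 1)
```

The predicted class distribution over these bins is what the weighted cross-entropy loss and the pixel-wise distillation would operate on.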
In this section, we choose the typical structured output prediction task, semantic segmentation, as an example to verify the effectiveness of the structured knowledge distillation. We discuss and explore how and how well structured knowledge distillation works in Section 4.1.4 and Section 4.1.5.
The structured knowledge distillation can be applied to other structured output prediction tasks under the FCN framework. In Section 4.3 and Section 4.2, we apply our distillation method, with minor variations, to strong baselines in object detection and depth estimation tasks, and further improve their performance.
4.1 Semantic segmentation
We study recent public compact networks, and employ several different architectures to verify the effectiveness of the distillation framework. We first consider ResNet18 as a basic student network and conduct ablation studies on it. Then, we employ the open-source MobileNetV2Plus, which is based on a MobileNetV2 model pretrained on the ImageNet dataset. We also test the structures of ESPNet-C and ESPNet, which are very compact and have low complexity.
Training setup. Most segmentation networks in this paper are trained by mini-batch stochastic gradient descent (SGD) with momentum and weight decay. The learning rate is initialized to a fixed value and gradually decayed during training. We randomly crop the images as the training input. Normal data augmentation methods are applied during training, such as random scaling and random flipping. Otherwise, we follow the settings in the corresponding publications to reproduce the results of ESPNet and ESPNet-C, and train the compact networks under our distillation framework.
Cityscapes. The Cityscapes dataset is collected for urban scene understanding; only a subset of its classes is used for evaluation. The dataset contains high-quality, finely annotated images as well as coarsely annotated ones, with the finely annotated images divided into training, validation and testing sets. We only use the finely annotated dataset in our experiments.
CamVid. The CamVid dataset is an automotive dataset containing training and testing images. We evaluate the performance over different classes such as building, tree, sky, car, road, etc., and ignore the class that contains unlabeled data.
ADE20K. The ADE20K dataset is used in the ImageNet scene parsing challenge. It contains a large number of classes under diverse scenes, and is divided into training, validation and testing sets.
We use the following metrics to evaluate the segmentation accuracy, as well as the model size and the efficiency.
The Intersection over Union (IoU) score is calculated as the ratio of the intersection and the union between the ground-truth mask and the predicted segmentation mask for each class. We use the mean IoU of all classes (mIoU) to study the distillation effectiveness. We also report the class IoU to study the effect of distillation on different classes. Pixel accuracy is the ratio of pixels with correct semantic labels to the overall pixels.
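These metrics can all be computed from a single confusion matrix; a compact sketch:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Return (per-class IoU, mIoU, pixel accuracy) for integer label maps."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)     # rows: gt, cols: pred
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm)
    iou = np.where(union > 0, inter / np.maximum(union, 1), np.nan)
    miou = np.nanmean(iou)                           # mean over present classes
    pixel_acc = inter.sum() / cm.sum()
    return iou, miou, pixel_acc
```

Classes absent from both prediction and ground truth are excluded from the mean, matching the usual mIoU convention.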
The model size is represented by the number of network parameters, and the complexity is evaluated by the number of floating-point operations (FLOPs) in one forward pass on a fixed input size.
The effectiveness of distillation. We look into the effect of enabling and disabling different components of our distillation system. The experiments are conducted on ResNet18 and a width-halved variant of ResNet18 on the Cityscapes dataset. In Table I, the results of the different settings for the student net are the average results of the final epoch from three runs.
|Method||Validation mIoU (%)||Training mIoU (%)|
|+ PI + PA|
|+ PI + PA + HO|
|+ PI + PA|
|+ PI + PA + HO|
|+ PI + PA|
|+ PI + PA + HO|
From Table I, we can see that distillation improves the performance of the student network, and that distilling the structure information helps the student learn better. With the three distillation terms, improvements are observed for all the student variants, and the effect of distillation is more pronounced for the smaller student network and for networks not initialized with ImageNet-pretrained weights. Such an initialization is itself a way of transferring knowledge from another source (ImageNet). The holistic distillation scheme yields the best mIoU on the validation set.
On the other hand, one can see that each distillation scheme leads to a higher mIoU score. This implies that the three distillation schemes make complementary contributions to better training the compact network.
The affinity graph in pair-wise distillation. In this section, we discuss the impact of the connection range and the granularity of each node in building the affinity graph. Calculating the pair-wise similarity between every pair of pixels in the feature map forms the most complete affinity graph, but leads to high computational complexity. We first fix each node to be one pixel, and change the connection range from the global feature map to local patches. Then, we keep the connection range as the global feature map, and use a local patch to denote each node, changing the granularity from fine to coarse. The results are shown in Table II; the results of the different settings for the pair-wise distillation are the averages from three runs. We employ ResNet18 (1.0), with weights pretrained from ImageNet, as the student network. All the experiments are performed with both pixel-wise and pair-wise distillation, but the size of the affinity graph in the pair-wise distillation varies. From Table II, we can see that increasing the connection range helps improve the distillation performance. The best result is slightly better than that of the complete affinity graph, while the number of connections is significantly decreased. Using a small local patch to denote a node when calculating the affinity graph may form a more stable correlation between different locations. One can choose to use local patches to cut down the number of nodes, instead of decreasing the connection range, for a better trade-off between efficiency and accuracy.
The adversarial training in holistic distillation. In this section, we illustrate that the GAN is able to distill the holistic knowledge. As described in Section 3.1, we employ a fully convolutional network with five convolution blocks as our discriminator. Each convolution block has ReLU and batch normalization layers, except for the final output convolution layer. We also insert two self-attention blocks in the discriminator to capture the structure information. The capability of the discriminator affects the adversarial training, and we conduct experiments to discuss the impact of the discriminator's architecture. The results are shown in Table III, where we use the notation AnLm to represent a discriminator architecture with n self-attention layers and m convolution blocks with batch normalization layers. The detailed structure can be seen in Figure 4, where the red arrow represents a self-attention layer.
From Table III, we can see that adding self-attention layers improves the average mIoU, and that adding more self-attention layers stabilizes the results, as the variance is smaller. We choose to add two self-attention blocks considering the performance, stability and computational cost. With the same number of self-attention layers, a deeper discriminator helps the adversarial training.
|Architecture Index||Validation mIoU (%)|
|Change self-attention layers|
|Remove convolution blocks|
|Class||mIoU||road||sidewalk||building||wall||fence||pole||Traffic light||Traffic sign||vegetation|
To verify the effectiveness of adversarial training, we further explore the learning ability of three typical discriminators: the shallowest one (A2L2), the one without attention layers (A0L4), and ours (A2L4). The IoU scores for different classes are listed in Table IV. It is clear that the self-attention layers help the discriminator capture the structure, and therefore the accuracy of the students on structured objects is improved.
In the adversarial training, the student, a.k.a. the generator, tries to learn the distribution of the real samples (the output of the teacher). Because we apply the Wasserstein distance to transfer the difference between the two distributions into a more intuitive score, the scores are highly correlated with the quality of the segmentation maps. We use a well-trained discriminator (A2L4) to evaluate the score of a segmentation map. For each image, we feed five segmentation maps into the discriminator: the one output by the teacher net, the one from the student net w/o holistic distillation, and those from the student nets w/ holistic distillation under three different discriminator architectures (listed in Table IV), and compare the distributions of the embedding scores. We evaluate on the validation set and calculate the average score difference between the different student nets and the teacher net; the results are shown in Table V. With holistic distillation, the segmentation maps produced from the student net achieve scores similar to the teacher's, indicating that the GAN helps distill the holistic structure knowledge.
|Student w/o D||2.28||69.21|
We also draw a histogram showing the score distributions of the segmentation maps across the validation set in Figure 5. The well-trained discriminator can assign higher scores to high-quality segmentation maps, and the three student nets with holistic distillation generate segmentation maps with higher scores and better quality. Adding self-attention layers and more convolution blocks helps the student net imitate the distribution of the teacher net and achieve better performance.
Feature and local pair-wise distillation. We compare the variants of the pair-wise distillation:
Feature distillation by attention transfer : We aggregate the response maps into a so-called attention map (single channel), and then transfer the attention map from the teacher to the student.
Local pair-wise distillation : We distill a local similarity map, which represents the similarities between each pixel and the -neighborhood pixels.
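As a minimal sketch of the attention-transfer variant (names and the nested-list representation are ours for illustration, not the released implementation), the multi-channel response map is collapsed into a single-channel attention map by summing squared activations over channels, normalised, and then matched between teacher and student:

```python
import math

def attention_map(feat):
    """Collapse a C x H x W response map (nested lists) into a single-channel
    attention map by summing squared activations over the channel axis."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    att = [[sum(feat[c][i][j] ** 2 for c in range(C)) for j in range(W)]
           for i in range(H)]
    # L2-normalise so teacher and student maps live on a comparable scale.
    norm = math.sqrt(sum(v * v for row in att for v in row)) or 1.0
    return [[v / norm for v in row] for row in att]

def at_loss(feat_teacher, feat_student):
    """Squared L2 distance between the normalised attention maps."""
    at_t = attention_map(feat_teacher)
    at_s = attention_map(feat_student)
    return sum((a - b) ** 2
               for row_t, row_s in zip(at_t, at_s)
               for a, b in zip(row_t, row_s))
```

Note that this objective still compares the two maps position by position; only the channel dimension is aggregated.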
|Method||ResNet ()||ResNet () + ImN|
|+ PI + MIMIC|
|+ PI + AT|
|+ PI + LOCAL|
|+ PI + PA|
We replace our pair-wise distillation with the above distillation schemes to verify the effectiveness of our global pair-wise distillation. From Table VI, we can see that our pair-wise distillation method outperforms all the other distillation methods. Its superiority over the feature distillation schemes, MIMIC  and attention transfer , which transfer the knowledge for each pixel separately, comes from the fact that we transfer structured knowledge rather than aligning the features of each individual pixel. Its superiority over the local pair-wise distillation shows the effectiveness of our global pair-wise distillation, which is able to transfer the whole structure information rather than only local boundary information.
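The global pair-wise term can be sketched as follows, assuming `feats` is a list of per-pixel feature vectors (the function names and the plain-Python representation are ours, not the paper's code): the teacher's pair-wise cosine similarities form a static affinity graph that the student is trained to match.

```python
import math

def pairwise_similarity(feats):
    """Cosine similarity between every pair of per-pixel feature vectors."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    norms = [math.sqrt(dot(f, f)) or 1.0 for f in feats]
    n = len(feats)
    return [[dot(feats[i], feats[j]) / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]

def pairwise_distillation_loss(feats_teacher, feats_student):
    """Mean squared difference between the teacher's and the student's
    pair-wise similarity matrices (the static affinity graph)."""
    s_t = pairwise_similarity(feats_teacher)
    s_s = pairwise_similarity(feats_student)
    n = len(s_t)
    return sum((s_t[i][j] - s_s[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

Unlike MIMIC, the loss is invariant to per-pixel feature transformations that preserve pair-wise similarities, so it constrains the relational structure rather than individual features.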
|Method||#Params (M)||FLOPs (B)||Test §||Val.|
|Current state-of-the-art results|
|ENet  †||n/a|
|ERFNet  ‡||n/a|
|FCN  ‡||n/a|
|RefineNet  ‡||n/a|
|PSPNet  ‡||n/a|
|Results w/ and w/o distillation schemes|
|MD  ‡||5||n/a|
|MD (Enhanced)  ‡||n/a|
|ESPNet-C  †|
|ESPNet-C (ours) †|
|ESPNet  †|
|ESPNet (ours) †|
|ResNet () †|
|ResNet () (ours) †|
|ResNet18 (1.0) ‡|
|ResNet18 (1.0) (ours) ‡|
|MobileNetVPlus  ‡|
|MobileNetVPlus (ours) ‡|
Train from scratch
Initialized from the weights pre-trained on ImageNet
We select the best model during training on the validation set to submit to the leaderboard. All our models are tested at a single scale. Some cumbersome networks, such as OCNet and PSPNet, are tested at multiple scales.
We apply our structure distillation method to several compact networks:
MobileNetV2Plus  which is based on a MobileNetV2 model,
ESPNet-C  and ESPNet 
which are carefully designed for mobile applications.
Table VII presents the segmentation accuracy, the model complexity and the model size.
Figure 7 shows the IoU scores for each class for MobileNetV2Plus. Both the pixel-wise and structured distillation schemes improve the performance, especially for the categories with low IoU scores. In particular, the structured distillation (pair-wise and holistic) yields significant improvement for structured objects, e.g., for Bus and Truck. The qualitative segmentation results in Figure 8 visually demonstrate the effectiveness of our structured distillation for structured objects, such as trucks, buses, persons, and traffic signs.
CamVid. Table VIII shows the performance of the student networks w/o and w/ our distillation schemes and state-of-the-art results. We train and evaluate the student networks w/ and w/o distillation at the resolution following the setting of ENet. Again we can see that the distillation scheme improves the performance. Figure 9 shows some samples on the CamVid test set w/o and w/ the distillation produced from ESPNet.
|Method||Extra data||mIoU (%)||#Params (M)|
We also conduct an experiment using an extra unlabeled dataset, which contains unlabeled street-scene images collected from the Cityscapes dataset, to show that the distillation schemes can transfer knowledge from unlabeled images. The experiments are done with ESPNet and ESPNet-C. The loss function is almost the same, except that there is no cross-entropy loss over the unlabeled dataset. The results are shown in Figure 10. We can see that our distillation method with the extra unlabeled data can significantly improve the mIoU of ESPNet-C and ESPNet by and , respectively.
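The mixed labeled/unlabeled objective can be sketched as below, where `ce_loss` and `distill_loss` are stand-ins for the cross-entropy and the combined distillation terms (the names and the unit weighting are assumptions for illustration): distillation applies to every image, while cross-entropy is only computed where a ground-truth label exists.

```python
def batch_loss(batch, ce_loss, distill_loss):
    """Average loss over a batch of (image, label) pairs.
    Unlabeled images (label is None) contribute only distillation terms."""
    total = 0.0
    for image, label in batch:
        total += distill_loss(image)      # teacher supervision, always available
        if label is not None:             # cross-entropy only for labeled images
            total += ce_loss(image, label)
    return total / len(batch)
```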
ADE20K. The ADE20K dataset is very challenging and contains categories of objects. The frequency of objects appearing in scenes and the pixel ratios of different objects follow a long-tail distribution. For example, stuff classes such as wall, building, floor, and sky occupy more than of all the annotated pixels, while the discrete objects, such as vase and microwave at the tail of the distribution, occupy only of the annotated pixels.
We report the results for ResNet and MobileNetV, which are trained with initial weights pretrained on the ImageNet dataset, and for ESPNet, which is trained from scratch, in Table IX. We follow the same training scheme as in . All results are tested at a single scale. For ESPNet, with our distillation, the mIoU score is improved by , and it achieves higher accuracy with a smaller number of parameters compared to SegNet. For ResNet and MobileNetV, after the distillation, we obtain a improvement over the one without distillation reported in . We check the result for each class and find that the improvements mainly come from the discrete objects.
|Method||mIoU(%)||Pixel Acc. (%)||#Params (M)|
|PSPNet (teacher) |
4.2 Depth Estimation
|Lower is better||Higher is better|
|Laina et al. ||ResNet50||60.62||0.127||0.055||0.573||0.811||0.953||0.988|
|VNL (teacher) ||ResNext101||86.24||0.108||0.048||0.416||0.875||0.976||0.994|
|VNL (student) w/ distillation||MobileNetV2||2.7||0.130||0.055||0.544||0.838||0.971||0.994|
Network structures We use the same model described in  with the ResNext101 backbone as our cumbersome model, and replace the backbone with MobileNetV2 as the compact model. Training details We train the student net at the crop size by mini-batch stochastic gradient descent (SGD) with a batch size of . The initial learning rate is and is multiplied by . For both the student models w/ and w/o the distillation methods, the number of training epochs is .
NYUD-V2. The NYUD-V2 dataset contains annotated indoor images, of which are split for training and the rest for testing. The image size is . Some methods sample more images from the video sequences of NYUD-v2 to form a Large-NYUD-v2 to further improve performance. Following , we perform ablation studies on the small dataset and also apply the distillation method to current state-of-the-art real-time depth models trained on Large-NYUD-v2 to verify the effectiveness of the structured knowledge distillation.
We follow previous methods  to quantitatively evaluate monocular depth estimation based on the following metrics: mean absolute relative error (rel), mean  error (), root mean squared error (rms), and the accuracy under threshold ().
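These metrics can be computed as in the plain-Python sketch below (positive depths are assumed; the thresholds 1.25, 1.25^2, 1.25^3 follow the standard protocol, and the function name is ours):

```python
import math

def depth_metrics(pred, gt):
    """Standard monocular depth-estimation metrics over flattened depth lists.
    Assumes every predicted and ground-truth depth is positive."""
    n = len(pred)
    # mean absolute relative error
    rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # mean log10 error
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in zip(pred, gt)) / n
    # root mean squared error
    rms = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # accuracy under thresholds: fraction of pixels with max(p/g, g/p) < t
    deltas = [sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < t) / n
              for t in (1.25, 1.25 ** 2, 1.25 ** 3)]
    return rel, log10, rms, deltas
```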
Ablation studies We compare the pixel-wise distillation and the structured knowledge distillation in this section. In dense-classification problems, e.g., semantic segmentation, the output logits of the teacher form a soft distribution over all the classes, which encodes the relations among different classes. Therefore, directly transferring the logits from the cumbersome model to the compact one at the pixel level can help improve the performance. Different from semantic segmentation, as the depth map consists of continuous values, the output of the teacher is not as accurate as the ground truth. In the experiments, we found that adding pixel-level distillation does not improve the accuracy in the depth estimation task, so we only use the structured knowledge distillation for depth estimation.
To verify that the distillation method can further improve the accuracy with unlabeled data, we use images sampled from the video sequences of NYUD-v2 without depth maps. The results are shown in Table XI; we can see that the structured knowledge distillation performs better than pixel-wise distillation, and adding extra unlabeled data further improves the accuracy.
|Method||Baseline||PI||+PA||+PA +HO||+PA +HO +Unl|
Comparison with current state-of-the-art We apply the distillation method on top of current state-of-the-art mobile models for depth estimation. Following , we train the student net on Large-NYUD-v2 with the same constraints as in  as our baseline, achieving on rel. With the same training setup plus the structured knowledge distillation terms, we further improve the strong baseline and achieve on rel. In Table X, we list the model parameters and the accuracy of current state-of-the-art large models along with some real-time models, indicating that the structured knowledge distillation works on a strong baseline.
4.3 Object Detection
Network structures We adopt the recent one-stage architecture FCOS  with the backbone ResNeXt-32x8d-101-FPN as the cumbersome network (teacher). The channel in the detector towers is set to . It is a simple anchor-free model, but can achieve comparable performance with state-of-the-art two-stage methods.
We choose two different models based on the MobileNetV2 backbone: c128-MNV2 and c256-MNV2, released by FCOS , as our student nets, where c represents the number of channels in the detector towers. We apply the distillation loss on all the output levels of the feature pyramid network.
Training setup We follow the training schedule in FCOS . For ablation studies, all the teacher, the student w/ and w/o distillation are trained with stochastic gradient descent (SGD) for 90K iterations with the initial learning rate being 0.01 and a mini batch of 16 images. The learning rate is reduced by a factor of 10 at iteration 60K and 80K, respectively. Weight decay and momentum are set as 0.0001 and 0.9, respectively. To compare with other state-of-the-art real-time detectors, we double the training iterations and the batch size, and the distillation method can further improve the results on the strong baselines.
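The step schedule described above can be written out as a small sketch (the function name is ours; the constants are the ones stated in the training setup):

```python
def learning_rate(iteration, base_lr=0.01, milestones=(60000, 80000), gamma=0.1):
    """Step learning-rate schedule: the rate is divided by 10 at each
    milestone iteration (60K and 80K in the setup above)."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed
```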
COCO Microsoft Common Objects in Context (COCO)  is a large-scale benchmark in object detection. With the commonly used COCO trainval35k split [56, 39], there are images for training and images for validation. We evaluate the ablation results on the validation set, and we also submit the final results to the COCO test-dev. The dataset contains different classes and million object instances.
AP (average precision) is a popular metric for measuring the accuracy of object detectors [56, 42]. Average precision computes the average precision value over recall values from 0 to 1. The mAP on the COCO dataset differs slightly from traditional metrics: it is the AP averaged over multiple intersection-over-union (IoU) thresholds, from 0.5 to 0.95 with a step of 0.05. Averaging over IoUs rewards detectors with better localization. We also report AP50 and AP75, which represent the AP at a single IoU threshold of 0.5 and 0.75, respectively. APs, APm and APl are the AP across different scales for small, medium and large objects.
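For concreteness, the per-threshold AP can be computed with the 101-point interpolation COCO uses, and the reported mAP is the mean of these APs over the ten IoU thresholds. The sketch below is simplified (real COCO evaluation also handles score sorting, detection matching, and area ranges):

```python
def average_precision(recalls, precisions):
    """101-point interpolated AP: the precision at recall level r is the best
    precision achieved at any recall >= r, averaged over r = 0.00 ... 1.00."""
    ap = 0.0
    for i in range(101):
        r = i / 100.0
        p = max((prec for rec, prec in zip(recalls, precisions) if rec >= r),
                default=0.0)
        ap += p / 101.0
    return ap

def coco_map(ap_per_iou_threshold):
    """COCO mAP: mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    assert len(ap_per_iou_threshold) == 10
    return sum(ap_per_iou_threshold) / 10.0
```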
Comparison with different distillation methods To demonstrate the effectiveness of the structured knowledge distillation, we compare the pair-wise distillation method with the previous MIMIC  method, which aligns the feature map at the pixel level. We use c256-MNV2 as the student net; the results are shown in Table XII. Adding the pixel-wise MIMIC distillation improves the detector by on mAP, while our structured knowledge distillation method improves it by on mAP. Under all evaluation metrics, the structured knowledge distillation method performs better than MIMIC. By combining the structured knowledge distillation with the pixel-wise distillation, the results can be further improved to an mAP of . Compared to the baseline without distillation, the improvements on AP75, APs and APl are more pronounced, indicating that the distillation method helps the detector produce more accurate results and handle extremely small and large objects better. As illustrated in Figure 12, the detector trained with the distillation method can find more small objects, such as the person and the bird, and assigns higher scores to the predicted labels.
Results of different student nets We follow the same training steps (90K) and batch size (32) as in FCOS  and apply the distillation method to two different released structures: C256-MV2 and C128-MV2. The results w/ and w/o distillation are shown in Table XIII: by applying the structured knowledge distillation combined with pixel-wise distillation, the mAP of C128-MV2 and C256-MV2 is improved by and , respectively.
|FCOS  (teacher)||ResNeXt-101-FPN||42.7||62.2||46.1||26.0||45.6||52.6||130|
|FCOS (student) ||MobileNetV2-FPN||31.4||49.2||33.3||17.1||33.5||38.8||45|
|FCOS (student) w/ distillation||MobileNetV2-FPN||34.1||52.2||36.4||19.0||36.2||42.0||45|
Results on the test-dev The original mAP on the validation set of C128-MV2 reported by FCOS is with 90K iterations. We double the training iterations and train with the distillation method; the final mAP on minival is . We submit this result to the COCO test-dev to show the position of our method compared against the state-of-the-art. To make a fair comparison, we also double the training iterations without any distillation and obtain an mAP of on minival. The test results are in Table XIV, where we also list the AP and inference time of some state-of-the-art one-stage detectors to show the position of the baseline we chose and of our detectors trained with the structured knowledge distillation method.
We have studied knowledge distillation for training compact dense prediction networks with the help of cumbersome/teacher networks. By considering the structure information in dense prediction problems, we have presented two structured distillation schemes: pair-wise distillation and holistic distillation. We demonstrate the effectiveness of our proposed distillation schemes on several recently developed compact networks on three dense prediction tasks: semantic segmentation, depth estimation and object detection. Our structured knowledge distillation methods are complementary to traditional pixel-wise distillation methods.
Yifan Liu obtained her B.S. and M.Sc. in Artificial Intelligence from Beihang University, and is now a Ph.D. candidate in Computer Science at The University of Adelaide, supervised by Professor Chunhua Shen. Her research interests include image processing, dense prediction and real-time applications of deep learning.
Changyong Shu received the B.S. degree in engineering mechanics from China University of Mining and Technology, Xuzhou, China in 2011, and the Ph.D. degree in flight vehicle design and engineering from Beihang University, Beijing, China, in 2017. He has been with the Nanjing Institute of Advanced Artificial Intelligence since 2018. His current research interest focuses on knowledge distillation.
Jingdong Wang is a Senior Principal Research Manager with the Visual Computing Group, Microsoft Research, Beijing, China. He received the B.Eng. and M.Eng. degrees from the Department of Automation, Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the PhD degree from the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology, Hong Kong, in 2007. His areas of interest include deep learning, large-scale indexing, human understanding, and person re-identification. He is an Associate Editor of IEEE TPAMI, IEEE TMM and IEEE TCSVT, and is an area chair (or SPC) of some prestigious conferences, such as CVPR, ICCV, ECCV, ACM MM, IJCAI, and AAAI. He is a Fellow of IAPR and an ACM Distinguished Member.
Chunhua Shen is a Professor at School of Computer Science, The University of Adelaide, Australia.
- The objective function is the summation of the losses over the mini-batch of training samples. For description clarity, we omit the summation operation.
- The FLOPs is calculated with the pytorch version implementation .
- (2018) Note: https://github.com/warmspringwinds/pytorch-segmentation-detection/blob/master/pytorch_segmentation_detection/utils/flops_benchmark.py Cited by: footnote 2.
- (2014) Do deep nets really need to be deep?. In Proc. Advances in Neural Inf. Process. Syst., pp. 2654–2662. Cited by: §2.
- (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (12), pp. 2481–2495. Cited by: Fig. 1, TABLE VIII, TABLE IX.
- (2008) Segmentation and recognition using structure from motion point clouds. In Proc. Eur. Conf. Comp. Vis., pp. 44–57. Cited by: §4.1.2.
- (2017) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol. 28 (11), pp. 3174–3182. Cited by: §3.3.
- (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848. Cited by: §1, §2.
- (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In Proc. Int. Conf. Learn. Representations, Cited by: §1, §2, TABLE VIII.
- (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. Proc. Eur. Conf. Comp. Vis.. Cited by: §1, §2.
- (2017) Adversarial PoseNet: a structure-aware convolutional network for human pose estimation. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1212–1221. Cited by: §2.
- (2018) DarkRank: accelerating deep metric learning via cross sample similarities transfer. Proc. Eur. Conf. Comp. Vis.. Cited by: §2.
- (2015) Depth analogy: data-driven approach for single image depth estimation using gradient samples. IEEE Trans. Image Process. 24 (12), pp. 5953–5966. Cited by: §2.
- (2016) The cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §4.1.2.
- (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. Proc. Workshop of IEEE Conf. Comp. Vis. Patt. Recogn.. Cited by: TABLE VIII.
- (2018) Geo-supervised visual depth prediction. In arXiv: Comp. Res. Repository, Vol. abs/1807.11130. Cited by: §2.
- (2017) Dssd: deconvolutional single shot detector. In arXiv: Comp. Res. Repository, Vol. abs/1701.06659. Cited by: §2, TABLE XIV.
- (2018) Deep ordinal regression network for monocular depth estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2002–2011. Cited by: §2, TABLE X.
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 580–587. Cited by: §2.
- (2015) Fast r-cnn. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1440–1448. Cited by: §2.
- (2014) Generative adversarial nets. Proc. Advances in Neural Inf. Process. Syst. 3, pp. 2672–2680. Cited by: §2.
- (2017) Improved training of wasserstein gans. In Proc. Advances in Neural Inf. Process. Syst., pp. 5767–5777. Cited by: §3.1.
- (2018) Generative adversarial networks for depth map estimation from rgb video. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1177–1185. Cited by: §2.
- (2015) Direction matters: depth estimation with a surface normal classifier. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 381–389. Cited by: §2.
- (2017) Mask r-cnn. In Proc. IEEE Int. Conf. Comp. Vis., pp. 2961–2969. Cited by: §2.
- (2016) Deep residual learning for image recognition. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 770–778. Cited by: §2, §4.1.1.
- (2015) Distilling the knowledge in a neural network. arXiv: Comp. Res. Repository abs/1503.02531. Cited by: §1, §2, §3.
- (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv: Comp. Res. Repository abs/1704.04861. Cited by: §2.
- (2019) Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In Proc. Winter Conf. on Appl. of Comp0 Vis., pp. 1043–1051. Cited by: TABLE X.
- (2017) Densely connected convolutional networks. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2261–2269. Cited by: §2.
- (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <1mb model size. arXiv: Comp. Res. Repository abs/1602.07360. Cited by: §2.
- (2016) Perceptual losses for real-time style transfer and super-resolution. Proc. Eur. Conf. Comp. Vis., pp. 694–711. Cited by: §2.
- (2018) Progressive growing of gans for improved quality, stability, and variation. Proc. Int. Conf. Learn. Representations. Cited by: §2.
- (2016) Deeper depth prediction with fully convolutional residual networks. In 3D Vision, pp. 239–248. Cited by: §2, TABLE X.
- (2018) Cornernet: detecting objects as paired keypoints. In Proc. Eur. Conf. Comp. Vis., pp. 734–750. Cited by: §2.
- (2017) Mimicking very efficient network for object detection. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7341–7349. Cited by: §1, §1, §2, 1st item, §4.1.4, §4.3.4, TABLE XII, TABLE VI.
- (2018) Deep attention-based classification network for robust depth prediction. In arXiv: Comp. Res. Repository, Vol. abs/1807.03959. Cited by: §2.
- (2009) Markov random field modeling in image analysis. Springer Science & Business Media. Cited by: §1.
- (2014) DEPT: depth estimation by parameter transfer for single still images. In Proc. Asian Conf. Comp. Vis., pp. 45–58. Cited by: §2.
- (2019) Refinenet: multi-path refinement networks for dense prediction. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Fig. 1, §1, §2, TABLE VII.
- (2017) Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comp. Vis., pp. 2980–2988. Cited by: §2, §4.3.2, TABLE XIV.
- (2014) Microsoft coco: common objects in context. In Proc. Eur. Conf. Comp. Vis., pp. 740–755. Cited by: §4.3.2.
- (2018) LightNet: light-weight networks for semantic image segmentation. Note: https://github.com/ansleliu/LightNet Cited by: Fig. 1, Fig. 7, §4.1.1, §4.1.5, TABLE VII.
- (2016) Ssd: single shot multibox detector. In Proc. Eur. Conf. Comp. Vis., pp. 21–37. Cited by: §1, §2, §4.3.3, TABLE XIV.
- (2019) Structured knowledge distillation for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2604–2613. Cited by: §2.
- (2018) Auto-painter: cartoon image generation from sketch by using conditional wasserstein generative adversarial networks. Neurocomputing 311, pp. 78–87. Cited by: §2.
- (2016) Semantic segmentation using adversarial networks. arXiv: Comp. Res. Repository abs/1611.08408. Cited by: §2.
- (2018) ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. Proc. Eur. Conf. Comp. Vis.. Cited by: Fig. 1, §2, §4.1.1, §4.1.1, §4.1.5, TABLE VII, TABLE VIII, TABLE IX.
- (2014) Conditional generative adversarial nets. arXiv: Comp. Res. Repository abs/1411.1784. Cited by: §2, §3.1.
- (2018) Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In arXiv: Comp. Res. Repository, Vol. abs/1809.04766. Cited by: TABLE X.
- (2015) Learning deconvolution network for semantic segmentation. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1520–1528. Cited by: §2.
- (2016) ENet: a deep neural network architecture for real-time semantic segmentation. arXiv: Comp. Res. Repository abs/1606.02147. Cited by: Fig. 1, §1, §2, TABLE VII, TABLE VIII.
- (2016) Context encoders: feature learning by inpainting. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2536–2544. Cited by: §2.
- (2016) You only look once: unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 779–788. Cited by: §1, §2.
- (2017) YOLO9000: better, faster, stronger. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7263–7271. Cited by: TABLE XIV.
- (2018) Yolov3: an incremental improvement. In arXiv: Comp. Res. Repository, Vol. abs/1804.02767. Cited by: TABLE XIV.
- (2016) Generative adversarial text to image synthesis. Proc. Int. Conf. Mach. Learn., pp. 1060–1069. Cited by: §2.
- (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proc. Advances in Neural Inf. Process. Syst., pp. 91–99. Cited by: §2, §2, §4.3.2, §4.3.3.
- (2017) Efficient convnet for real-time semantic segmentation. In IEEE Intelligent Vehicles Symp., pp. 1789–1794. Cited by: Fig. 1.
- (2014) Fitnets: hints for thin deep nets. arXiv: Comp. Res. Repository abs/1412.6550. Cited by: §1, §2, 1st item, TABLE VI.
- (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §2.
- (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2, §4.1.1.
- (2007) Learning 3-d scene structure from a single still image. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1–8. Cited by: §2.
- (2017) Fully convolutional networks for semantic segmentation.. IEEE Trans. Pattern Anal. Mach. Intell. 39 (4), pp. 640. Cited by: Fig. 1, §1, §2, TABLE VII, TABLE VIII, TABLE IX.
- (2018) CReaM: condensed real-time models for depth prediction using convolutional neural networks. In Int. Conf. on Intell. Robots and Sys., pp. 540–547. Cited by: §2, TABLE X.
- (2015) Going deeper with convolutions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1–9. Cited by: §2.
- (2016) Rethinking the inception architecture for computer vision. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2818–2826. Cited by: §2.
- (2019) FCOS: fully convolutional one-stage object detection. Proc. IEEE Int. Conf. Comp. Vis.. Cited by: §1, §2, §2, §3.3, §4.3.1, §4.3.1, §4.3.1, §4.3.4, TABLE XIV.
- (2016) Speeding up semantic segmentation for autonomous driving. In Proc. Workshop of Advances in Neural Inf. Process. Syst., Cited by: §2.
- (2016) Do deep convolutional nets really need to be deep (or even convolutional)?. In Proc. Int. Conf. Learn. Representations, Cited by: §2.
- (2018) Text generation based on generative adversarial nets with latent variables. In Proc. Pacific-Asia Conf. Knowledge discovery & data mining, pp. 92–103. Cited by: §2.
- (2019) Enforcing geometric constraints of virtual normal for depth prediction. Proc. IEEE Int. Conf. Comp. Vis.. Cited by: §2, §2, §3.3, §4.2.1, §4.2.2, §4.2.3, §4.2.4, TABLE X.
- (2019) FastDepth: fast monocular depth estimation on embedded systems. Int. Conf. on Robotics and Automation. Cited by: §1, §2.
- (2018) Unified perceptual parsing for scene understanding. In Proc. Eur. Conf. Comp. Vis., Cited by: §4.1.5, TABLE IX.
- (2018) Improving fast segmentation with teacher-student learning. Proc. British Machine Vis. Conf.. Cited by: §1, §1, §2, 3rd item, §4.1.4, §4.1.5, TABLE VI, TABLE VII.
- (2018) Learning a discriminative feature network for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
- (2016) Multi-scale context aggregation by dilated convolutions. Proc. Int. Conf. Learn. Representations. Cited by: §1, §2, TABLE VII.
- (2017) SeqGAN: sequence generative adversarial nets with policy gradient.. In Proc. AAAI Conf. Artificial Intell., pp. 2852–2858. Cited by: §2.
- (2018) OCNet: object context network for scene parsing. In arXiv: Comp. Res. Repository, Vol. abs/1809.00916. Cited by: Fig. 1, §2, TABLE VII.
- (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. Proc. Int. Conf. Learn. Representations. Cited by: §2, 2nd item, §4.1.4, TABLE VI.
- (2018) Self-attention generative adversarial networks. In arXiv: Comp. Res. Repository, Vol. abs/1805.08318. Cited by: §3.1.
- (2017) Interleaved group convolutions. In Proc. IEEE Int. Conf. Comp. Vis., pp. 4383–4392. Cited by: §2.
- (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. Proc. IEEE Conf. Comp. Vis. Patt. Recogn.. Cited by: §2.
- (2018) Icnet for real-time semantic segmentation on high-resolution images. Proc. Eur. Conf. Comp. Vis.. Cited by: §1, §2.
- (2017) Pyramid scene parsing network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2881–2890. Cited by: Fig. 1, §1, §2, §4.1.1, TABLE VII, TABLE IX.
- (2017) Scene parsing through ade20k dataset. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §4.1.2.