Structured Knowledge Distillation for Dense Prediction



In this paper, we consider transferring the structure information from large networks to small ones for dense prediction tasks. Previous knowledge distillation strategies used for dense prediction tasks often directly borrow the distillation scheme for image classification and perform knowledge distillation for each pixel separately, leading to sub-optimal performance. Here we propose to distill structured knowledge from large networks to small networks, taking into account the fact that dense prediction is a structured prediction problem. Specifically, we study two structured distillation schemes: i) pair-wise distillation that distills the pairwise similarities by building a static graph; and ii) holistic distillation that uses adversarial training to distill holistic knowledge. The effectiveness of our knowledge distillation approaches is demonstrated by extensive experiments on three dense prediction tasks: semantic segmentation, depth estimation and object detection.

Structured knowledge distillation, adversarial training, knowledge transfer, dense prediction.

1 Introduction

Dense prediction is a category of fundamental problems in computer vision, which learns a mapping from input objects to complex output structures, including semantic segmentation, depth estimation and object detection, among many others. One needs to assign category labels or regress specific values for each pixel given an input image to form the structured outputs. In general these tasks are significantly more challenging to solve than image-level prediction problems, thus often requiring networks with large capacity in order to achieve satisfactory accuracy. On the other hand, compact models are desirable for enabling edge computing with limited computation resources.

Deep neural networks have been the dominant solutions since the invention of fully-convolutional neural networks (FCNs) [62]. Subsequent approaches, e.g., DeepLab [7, 6, 8, 75], PSPNet [83], RefineNet [38], and FCOS [66] follow the design of FCNs to optimize energy-based objective functions related to different tasks, having achieved significant improvement in accuracy, often with cumbersome models and expensive computation.

Recently, design of neural networks with compact model sizes, light computation cost and high performance has attracted much attention due to the need of applications on mobile devices. Most current efforts have been devoted to designing lightweight networks specially for dense prediction tasks or borrowing the design from classification networks, e.g., ENet [50], ESPNet [46] and ICNet [82] for semantic segmentation, YOLO [52] and SSD [42] for object detection, and FastDepth [71] for depth estimation. Advanced strategies, such as pruning [71] and knowledge distillation [34, 73], are applied to help the training of compact networks by making use of cumbersome networks.

Fig. 1: An example on the semantic segmentation task shows comparisons in terms of computation complexity, number of parameters and mIoU for different networks on the Cityscapes test set. The FLOPs are calculated at a fixed input resolution. The red triangles are the results of our distillation method while the others are without distillation. Blue circles are collected from FCN* [62], RefineNet [38], SegNet [3], ENet [50], PSPNet [83], ERFNet [57], ESPNet [46], MobileNetV2Plus [41], and OCNet [77]. With our proposed distillation method, we can achieve a higher mIoU, with no extra FLOPs or parameters.

The knowledge distillation strategy has been shown to be effective in classification tasks [25, 58]. Most previous works [34, 73] directly apply the distillation scheme to each pixel separately, transferring the class probability or extracted feature embedding of the corresponding pixel produced from the cumbersome network (teacher) to the compact network (student) for dense prediction tasks. However, such a pixel-wise distillation scheme neglects the important structure information.

Considering the characteristics of dense prediction problems, we present structured knowledge distillation and transfer the structure information with two schemes, pair-wise distillation and holistic distillation. The pair-wise distillation scheme is motivated by the widely-studied pair-wise Markov random field framework [36] for enforcing spatial labeling contiguity. The goal is to align static affinity graphs, computed to capture both short- and long-range structure information among different locations, between the compact network and the cumbersome network.

The holistic distillation scheme aims to align higher-order consistencies, which are not characterized by the pixel-wise and pair-wise distillation, between output structures produced from the compact network and the cumbersome network. We adopt the adversarial training scheme: a fully convolutional network, a.k.a. the discriminator, considers both the input image and the output structures to produce a holistic embedding which represents the quality of the structure. The compact network is encouraged to generate structures with embeddings similar to those of the cumbersome segmentation network. We distill the knowledge of structure quality into the weights of the discriminator.

To this end, we optimize an objective function that combines a conventional task loss with the distillation terms. The main contributions of this paper can be summarized as follows.

  • We study the knowledge distillation strategy for training accurate compact dense prediction networks.

  • We present two structured knowledge distillation schemes, pair-wise distillation and holistic distillation, enforcing pair-wise and high-order consistency between the outputs of the compact and cumbersome networks.

  • We demonstrate the effectiveness of our approach by improving recently-developed state-of-the-art compact networks on three different dense prediction tasks: semantic segmentation, depth estimation and object detection. Taking semantic segmentation as an example, the performance gain is illustrated in Figure  1. Code is available at:

2 Related Work

Semantic segmentation. Semantic segmentation is a typical pixel classification problem, which requires a high level understanding of the whole scene as well as discriminative features for pixels from different classes. Deep convolutional neural networks have been the dominant solution to semantic segmentation since the pioneering works, fully-convolutional network [62], DeConvNet [49], U-Net [59]. Various schemes [74] have been developed for improving the network capability and accordingly the segmentation performance. For example, stronger backbone networks, e.g., GoogleNets [64], ResNets [24], and DenseNets [28], have shown better segmentation performance. Improving the resolution through dilated convolutions [7, 6, 8, 75] or multi-path refine networks [38] leads to significant performance gain. Exploiting multi-scale context, e.g., dilated convolutions [75], pyramid pooling modules in PSPNet [83], atrous spatial pyramid pooling in DeepLab [6], object context [77], also benefits the segmentation. Lin et al. [38] combine deep models with structured output learning for semantic segmentation.

In addition to cumbersome networks for highly accurate segmentation, highly efficient segmentation networks have been attracting increasingly more interests due to the need of real applications, e.g., mobile applications. Most works focus on lightweight network design by accelerating the convolution operations with factorization techniques. ENet [50], inspired by [65], integrates several acceleration factors, including multi-branch modules, early feature map resolution down-sampling, small decoder size, filter tensor factorization, and so on. SQ [67] adopts the SqueezeNet [29] fire modules and parallel dilated convolution layers for efficient segmentation. ESPNet [46] proposes an efficient spatial pyramid, which is based on filter factorization techniques: point-wise convolutions and spatial pyramid of dilated convolutions, to replace the standard convolution. The efficient classification networks, e.g., MobileNet [26], ShuffleNet [81], and IGCNet [80], are also applied to accelerate segmentation. In addition, ICNet (image cascade network) [82] exploits the efficiency of processing low-resolution images and high inference quality of high-resolution ones, achieving a trade-off between efficiency and accuracy.

Depth estimation. Depth estimation from a monocular image is essentially an ill-posed problem, which requires an expressive model with high reasoning ability. Previous works highly depend on hand-crafted features or extra processing such as CRFs [11, 22] and super-pixel over-segmentation [37, 61] to capture the structure information. Since Laina et al. [32] constructed a fully convolutional architecture to predict the depth map, following works [35, 16] have benefited from the increasing ability of FCNs and achieved promising results. Besides, Fei et al. [14] proposed a semantically informed geometric loss, while Wei et al. [70] use a virtual normal loss to constrain the structure information. As in semantic segmentation, some works try to replace the encoder with efficient backbones [70, 63, 71] to decrease the computational cost, but suffer from training difficulties due to the limited capacity of the compact network. The most similar work [71] applies a pruning method to training the compact depth network. In contrast, we focus on structured knowledge distillation, inspired by the structure constraints widely used in depth estimation tasks.

Object detection. Object detection is a fundamental task in computer vision, in which one needs to regress a bounding box as well as predict a category label for each instance of interest in an image. The pioneering work R-CNN [17] defined a two-stage paradigm for object detection. Following works [18, 56, 23] reach significant performance by first predicting proposals and then refining the bounding boxes as well as predicting category labels. Many works also pay attention to detection efficiency, e.g., YOLO [52] and SSD [42, 15], which adopt one-stage methods and design lightweight network structures. However, the performance of the efficient one-stage methods cannot match that of the two-stage ones, because of the sample imbalance between objects of interest and backgrounds. RetinaNet [39] alleviates the sample-imbalance problem to some extent by proposing the focal loss, which makes the results of one-stage methods comparable to two-stage ones. However, most detectors rely on a set of pre-defined anchor boxes, which decreases the training samples and makes the detection network sensitive to hyper-parameters. Recently, anchor-free methods have become popular in object detection, e.g., FCOS [66] and CornerNet [33]. FCOS employs a fully convolutional framework and predicts bounding boxes based on every pixel, as in semantic segmentation, thus solving object detection as a dense prediction problem. In this work, we apply the structured knowledge distillation method with the FCOS framework, as it is simple and achieves good performance.

Knowledge distillation. Knowledge distillation [25] is a way of transferring knowledge from the cumbersome model to a compact model to improve the performance of compact networks. It has been applied to image classification by using the class probabilities produced from the cumbersome model as “soft targets” for training the compact model [2, 25, 68] or transferring the intermediate feature maps [58, 78].

There are also other applications, including object detection [34], pedestrian re-identification [10] and so on. The MIMIC [34] method distills a compact object detection network by making use of a two-stage Faster-RCNN [56]. It aligns the feature maps at the pixel level and ignores the structure information among pixels.

The very recent and independently-developed application to semantic segmentation [73] is related to our approach. It mainly distills the class probabilities for each pixel separately (like our pixel-wise distillation) and the center-surrounding differences of labels for each local patch (termed a local relation in [73]). In contrast, we focus on distilling structured knowledge: pair-wise distillation, which transfers the relation among different locations by building a static affinity graph, and holistic distillation, which transfers the holistic knowledge that captures high-order information. [73] can be seen as a special case of the pair-wise distillation.

This paper is a substantial extension of our previous conference paper [43]. The main differences compared with [43] are threefold. 1) We extend the pair-wise distillation to a more general case in Section 3.1 and build a static graph with controllable numbers of nodes and connections. We explore the influence of the graph size, and find that it is important to keep a global connection range. 2) We provide more explanations and ablations on the adversarial training for holistic distillation. 3) We also extend our method to two recently released strong baselines in depth estimation [70] and object detection [66], by replacing the backbone with MobileNetV2 [60], and further improve their performance.

Adversarial learning. Generative adversarial networks (GANs) have been widely studied in text generation [69, 76] and image synthesis [19, 31]. The conditional version [47] is successfully applied to image-to-image translation, including style transfer [30], image inpainting [51], image coloring [44] and text-to-image [55].

The idea of adversarial learning is also adopted in pose estimation [9], encouraging the human pose estimation result not to be distinguished from the ground-truth, and in semantic segmentation [45], encouraging the estimated segmentation map not to be distinguished from the ground-truth map. One challenge in [45] is the mismatch between the generator's continuous output and the discrete true labels, which limits the success of the discriminator in the GAN. Different from [45], the GAN employed in our approach does not have this problem, as the ground truth for the discriminator is the teacher network's logits, which are real valued. We use adversarial learning to encourage the alignment between the output maps produced from the cumbersome network and the compact network. In the depth prediction task, the ground truth maps are not discrete labels, and [21] use them as the real samples. Different from their method, our distillation method tries to align the output of the cumbersome network with that of the compact network, so the task loss calculated with the ground truth is optional. When the labelled data is limited, given a well-trained teacher, our method can be applied to unlabelled data to further improve the accuracy.

Fig. 2: Our distillation framework with the semantic segmentation task as an example. (a) Pair-wise distillation; (b) Pixel-wise distillation; (c) Holistic distillation. In the training process, we keep the cumbersome network fixed as our teacher net, and only the student net and the discriminator net will be optimized. The student net with a compact architecture will be trained with three distillation terms and a task-specific loss, e.g., the cross entropy loss for semantic segmentation.

3 Approach

In this section, we first introduce the structured knowledge distillation method based on semantic segmentation, a task of assigning a category label from C categories to each pixel in the image. A segmentation network takes an RGB image I of size W × H × 3 as the input and computes a feature map F of size W' × H' × N, where N is the number of channels. Then, a classifier is applied to compute the segmentation map Q of size W' × H' × C from F, which is upsampled to the spatial size W × H of the input image to form the segmentation result. In addition, we extend our method to two other dense prediction tasks, depth estimation and object detection, under the FCN framework.

Pixel-wise distillation. We apply the knowledge distillation strategy [25] to transfer the knowledge of the cumbersome segmentation network T to the compact segmentation network S for better training the latter. We view the segmentation problem as a collection of separate pixel labeling problems, and directly use knowledge distillation to align the class probabilities of each pixel produced from the compact network. We adopt the standard way [25]: use the class probabilities produced from the cumbersome model as soft targets for training the compact network.

The loss function is given as follows,

\ell_{pi} = \frac{1}{W' \times H'} \sum_{i \in \mathcal{R}} \mathrm{KL}(q_i^s \,\|\, q_i^t),

where q_i^s represents the class probabilities of the i-th pixel produced from the compact network S, q_i^t represents the class probabilities of the i-th pixel produced from the cumbersome network T, \mathrm{KL}(\cdot \,\|\, \cdot) is the Kullback-Leibler divergence between two probabilities, and \mathcal{R} = \{1, 2, \dots, W' \times H'\} denotes all the pixels.
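As a concrete illustration of the pixel-wise term, the NumPy sketch below computes the per-pixel KL divergence between the softmaxed class probabilities of the two networks. The function names and array layout are our own illustration, not the authors' implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # numerically stable softmax over the class axis
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pixel_wise_distillation_loss(student_logits, teacher_logits, eps=1e-12):
    """Average per-pixel KL(q^s || q^t); inputs have shape (H', W', C)."""
    qs = softmax(student_logits)
    qt = softmax(teacher_logits)
    kl = (qs * (np.log(qs + eps) - np.log(qt + eps))).sum(axis=-1)
    return kl.mean()
```

In a real training pipeline the same computation would be expressed in an autodiff framework so gradients flow into the student logits; the teacher logits are treated as constants.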

3.1 Structured Knowledge Distillation

In addition to a straightforward scheme, pixel-wise distillation, we present two structured knowledge distillation schemes, pair-wise distillation and holistic distillation, to transfer structured knowledge from the cumbersome network to the compact network. The pipeline is illustrated in Figure 2.

Pair-wise distillation. Inspired by the pair-wise Markov random field framework that is widely adopted for improving spatial labeling contiguity, we propose to transfer the pair-wise relations, specifically the pair-wise similarities in our approach, among spatial locations.

We build a static affinity graph to denote the spatial pair-wise relations, in which the nodes represent different spatial locations and the connection between two nodes represents their similarity. We use the connection range β and the granularity α of each node to control the size of the static affinity graph. For each node, we only consider the similarities with the top-β nearest nodes according to spatial distance (here we use the Chebyshev distance) and aggregate α pixels in a spatial local patch to represent the feature of this node, as illustrated in Figure 3.

Fig. 3: Illustrations of the connection range β and the granularity α of each node.

For a feature map of spatial size W' × H', there are W' × H' pixels. With the granularity α and the connection range β, the static affinity graph contains (W' × H')/α nodes with (W' × H')/α × β connections.

Let a_{ij}^t denote the similarity between the i-th node and the j-th node produced from the cumbersome network T, and a_{ij}^s denote the similarity between the i-th node and the j-th node produced from the compact network S. We adopt the squared difference to formulate the pair-wise similarity distillation loss,

\ell_{pa} = \frac{1}{(W' \times H'/\alpha) \times \beta} \sum_{i \in \mathcal{R}'} \sum_{j \in \mathcal{N}_i} (a_{ij}^s - a_{ij}^t)^2,

where \mathcal{R}' denotes all the nodes and \mathcal{N}_i denotes the top-β nearest nodes of the i-th node. In our implementation, we use average pooling to aggregate the α pixels in one node into a feature f_i, and the similarity between two nodes is simply computed from the aggregated features f_i and f_j as

a_{ij} = f_i^\top f_j \,/\, (\|f_i\|_2 \, \|f_j\|_2),

which empirically works well.
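A minimal NumPy sketch of the pair-wise term, assuming a global connection range (every node connects to every other node) and square non-overlapping pooling patches for the node granularity; all helper names here are illustrative.

```python
import numpy as np

def node_features(feat, patch=2):
    # feat: (H, W, C); average-pool non-overlapping patch x patch regions,
    # so each pooled patch becomes one node of the affinity graph
    H, W, C = feat.shape
    h, w = H // patch, W // patch
    f = feat[:h * patch, :w * patch].reshape(h, patch, w, patch, C).mean(axis=(1, 3))
    return f.reshape(-1, C)            # one row per node

def affinity(nodes, eps=1e-8):
    # cosine similarity a_ij between every pair of nodes
    n = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + eps)
    return n @ n.T

def pair_wise_distillation_loss(student_feat, teacher_feat, patch=2):
    a_s = affinity(node_features(student_feat, patch))
    a_t = affinity(node_features(teacher_feat, patch))
    return ((a_s - a_t) ** 2).mean()   # squared difference over all connections
```

Restricting each node to its top-β neighbors under the Chebyshev distance, as described above, would simply mask the affinity matrices before taking the mean.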

Holistic distillation. We align the high-order relations between the segmentation maps produced from the cumbersome and compact networks. The holistic embeddings of the segmentation maps are computed as the representations.

We adopt conditional generative adversarial learning [47] for formulating the holistic distillation problem. The compact net is regarded as a generator conditioned on the input RGB image I, and the predicted segmentation map Q^s is regarded as a fake sample. We expect Q^s to be as similar as possible to Q^t, the segmentation map predicted by the teacher, which is regarded as the real sample. GANs usually suffer from unstable gradients in training the generator due to the discontinuity of the Jensen-Shannon (JS) divergence, along with other common distances and divergences. The Wasserstein distance [20] (also known as the Earth Mover distance), defined as the minimum cost of transporting the model distribution p_s(Q^s) to the real distribution p_t(Q^t), can be used to measure the difference between the two distributions and helps to overcome gradient vanishing or explosion for neural networks. The resulting holistic distillation loss is written as

\ell_{ho} = \mathbb{E}_{Q^s \sim p_s(Q^s)}[D(Q^s \,|\, I)] - \mathbb{E}_{Q^t \sim p_t(Q^t)}[D(Q^t \,|\, I)],

where \mathbb{E}[\cdot] is the expectation operator, and D(\cdot \,|\, I) is an embedding network, acting as the discriminator in the GAN, which projects Q and I together into a holistic embedding score. The Lipschitz requirement is satisfied by a gradient penalty.
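Given precomputed discriminator scores, the two expectation terms can be sketched as below. This helper is our own illustration of the sign conventions only; it omits the gradient penalty and the networks themselves.

```python
import numpy as np

def wasserstein_losses(d_scores_teacher, d_scores_student):
    """Given discriminator scores D(Q|I) for teacher (real) and student
    (fake) samples, return the discriminator loss and the student term.

    The discriminator minimizes d_loss, i.e. raises teacher scores and
    lowers student scores; the student minimizes g_loss, i.e. raises its
    own score under the fixed discriminator.
    """
    d_loss = d_scores_student.mean() - d_scores_teacher.mean()
    g_loss = -d_scores_student.mean()
    return d_loss, g_loss
```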

The segmentation map and the conditional RGB image are concatenated as the input of the embedding network D. D is a fully convolutional neural network with five convolutions. Two self-attention modules are inserted between the final three layers to capture the structure information [79]. A batch normalization layer is added before the concatenated input to handle the different scales of the RGB image and the logits.

Such a discriminator is able to produce a holistic embedding representing how well the input image and the segmentation map match. We further add a pooling layer to pool the holistic embedding into a score. As we employ the Wasserstein distance in the adversarial training, the discriminator is trained to give higher scores to the output segmentation maps from the teacher net and lower scores to the ones from the student. In this process, we distill the knowledge of evaluating the quality of a segmentation map into the discriminator. The student is trained with the regularization of achieving a higher score under the discriminator, which helps improve the performance of the student.

3.2 Optimization

The whole objective function consists of a conventional multi-class cross-entropy loss \ell_{mc} with the pixel-wise and structured distillation terms,

\ell = \ell_{mc} + \lambda_1 (\ell_{pi} + \ell_{pa}) - \lambda_2 \ell_{ho},

where \lambda_1 and \lambda_2 are set to make these loss value ranges comparable. We minimize the objective function with respect to the parameters of the compact segmentation network S, while maximizing it with respect to the parameters of the discriminator D, which is implemented by iterating the following two steps:

  • Train the discriminator D. Training the discriminator is equivalent to minimizing \ell_{ho}. D aims to give a high embedding score to the real samples from the teacher net and low embedding scores to the fake samples from the student net.

  • Train the compact segmentation network S. Given the discriminator network, the goal is to minimize the multi-class cross-entropy loss and the distillation losses relevant to the compact segmentation network:

    \ell_{mc} + \lambda_1 (\ell_{pi} + \ell_{pa}) - \lambda_2 \mathbb{E}_{Q^s \sim p_s(Q^s)}[D(Q^s \,|\, I)],

    where \mathbb{E}_{Q^s \sim p_s(Q^s)}[D(Q^s \,|\, I)] is the part of \ell_{ho} that depends on S, and we expect Q^s to achieve a higher score under the evaluation of D.
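The alternation above can be illustrated with scalar stand-ins. The toy functions below are our own and show only the two-step schedule (fit the discriminator, then raise the student's score under the fixed discriminator), not real networks.

```python
import numpy as np

def toy_alternating_step(s, d, teacher=1.0, lr=0.1):
    """One round of the two-step schedule on scalar stand-ins.

    The 'discriminator' d is a linear scorer score(x) = d * x, updated to
    score the teacher output higher than the student output s; the
    'student' s is then updated to raise its own score under the fixed d.
    Purely illustrative of the alternation.
    """
    # Step 1: minimize ell_ho = d*s - d*teacher with respect to d.
    grad_d = s - teacher
    d = d - lr * grad_d
    # Step 2: minimize -d*s with respect to s (raise the student's score).
    grad_s = -d
    s = s - lr * grad_s
    return s, d
```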

3.3 Extension to Dense Prediction Tasks

Dense prediction learns a mapping from an input RGB image of size W × H × 3 to a per-pixel output of size W × H × C. In semantic segmentation, the output has C channels, where C is the number of semantic classes.

For the object detection task, for each pixel we predict the category among C classes plus one background class, as well as a 4D vector representing the distances from the location to the four sides of the bounding box. We follow the task loss in FCOS [66], and combine it with the distillation terms as regularization.

The depth estimation task can be solved as a classification task, as the continuous depth values can be divided into discrete categories [5]. After obtaining the predicted probability, we apply a weighted cross-entropy loss following [70]. The pair-wise distillation can be directly applied to the intermediate feature map. The holistic distillation uses the depth map as input. We could use the ground truth as the real samples of the GAN in the depth estimation task, because the depth map is continuous. However, in order to apply our method to unlabelled data, we still use the depth map from the teacher as the real sample.
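Such a discretization of continuous depth into classes can be sketched as follows; the log-spaced bin edges and the parameter defaults are hypothetical choices for illustration, and the actual binning scheme of [5] may differ.

```python
import numpy as np

def depth_to_class(depth, n_bins=10, d_min=0.1, d_max=10.0):
    """Quantize continuous depth values into n_bins discrete categories
    using log-spaced bin edges (illustrative only)."""
    edges = np.logspace(np.log10(d_min), np.log10(d_max), n_bins + 1)
    # clip so every value falls inside [d_min, d_max]
    d = np.clip(depth, d_min, d_max)
    # np.digitize against the inner edges yields 0-based class indices
    return np.digitize(d, edges[1:-1])
```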

4 Experiments

In this section, we choose a typical structured output prediction task, semantic segmentation, as an example to verify the effectiveness of the structured knowledge distillation. We discuss and explore how and how well structured knowledge distillation works in Section 4.1.4 and Section 4.1.5.

The structured knowledge distillation can be applied to other structured output prediction tasks under the FCN framework. In Section 4.3 and Section 4.2, we apply our distillation method, with minor modifications, to strong baselines in object detection and depth estimation tasks, and further improve their performance.

4.1 Semantic segmentation

Implementation Details

Network structures. We adopt the state-of-the-art segmentation architecture PSPNet [83] with a ResNet [24] backbone as the cumbersome network (teacher).

We study recent public compact networks, and employ several different architectures to verify the effectiveness of the distillation framework. We first consider ResNet18 as a basic student network and conduct ablation studies on it. Then, we employ the open-source MobileNetV2Plus [41], which is based on a MobileNetV2 [60] model pretrained on the ImageNet dataset. We also test the structures of ESPNet-C [46] and ESPNet [46], which are very compact and have low complexity.

Training setup. Most segmentation networks in this paper are trained by mini-batch stochastic gradient descent (SGD) with momentum and weight decay. The learning rate is initialized and then decayed during training. We randomly crop the images into fixed-size patches as the training input. Normal data augmentation methods are applied during training, such as random scaling and random flipping. Otherwise, we follow the settings in the corresponding publications [46] to reproduce the results of ESPNet and ESPNet-C, and train the compact networks under our distillation framework.


Cityscapes. The Cityscapes dataset [12] is collected for urban scene understanding; a subset of its annotated classes is used for evaluation. The dataset contains high-quality finely annotated images and coarsely annotated images, with the finely annotated images divided into training, validation and testing sets. We only use the finely annotated images in our experiments.

CamVid. The CamVid dataset [4] is an automotive dataset containing training and testing images. We evaluate the performance over different classes such as building, tree, sky, car and road, and ignore the class that contains unlabeled data.

ADE20K. The ADE20K dataset [84] is used in the ImageNet scene parsing challenge. It covers diverse scenes and is divided into training, validation and testing sets.

Evaluation Metrics

We use the following metrics to evaluate the segmentation accuracy, as well as the model size and the efficiency.

The Intersection over Union (IoU) score is calculated as the ratio of intersection and union between the ground-truth mask and the predicted segmentation mask for each class. We use the mean IoU of all classes (mIoU) to study the distillation effectiveness. We also report the class IoU to study the effect of distillation on different classes. Pixel accuracy is the ratio of the pixels with correct semantic labels to the overall pixels.
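The mIoU metric can be sketched as follows, computing the per-class IoU from a confusion matrix over flattened label maps; the helper names are ours.

```python
import numpy as np

def confusion_matrix(gt, pred, n_classes):
    # gt, pred: 1-D arrays of class indices
    idx = gt * n_classes + pred
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def mean_iou(gt, pred, n_classes):
    cm = confusion_matrix(gt, pred, n_classes)
    inter = np.diag(cm).astype(float)                 # true positives per class
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter   # gt + pred - overlap
    iou = np.where(union > 0, inter / np.maximum(union, 1), np.nan)
    return np.nanmean(iou), iou                       # mIoU, per-class IoU
```

Classes absent from both the ground truth and the prediction are excluded from the mean via NaN masking.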

The model size is represented by the number of network parameters, and the complexity is evaluated by the number of floating point operations (FLOPs) in one forward pass on a fixed input size.

Ablation Study

The effectiveness of distillations. We look into the effect of enabling and disabling different components of our distillation system. The experiments are conducted on ResNet18 and a width-halved variant of ResNet18 on the Cityscapes dataset. In Table I, the results of different settings for the student net are the average results of the final epoch over three runs.

Method Validation mIoU (%) Training mIoU (%)
ResNet ()
+ PI
+ PI + PA
+ PI + PA + HO
ResNet ()
+ PI
+ PI + PA
+ PI + PA + HO
ResNet ()
+ PI
+ PI + PA
+ PI + PA + HO
TABLE I: The effect of different components of the loss in the proposed method. PI: pixel-wise distillation; PA: pair-wise distillation; HO: holistic distillation; ImN: initialized from the weights pretrained on ImageNet.

From Table I, we can see that distillation can improve the performance of the student network, and that distilling the structure information helps the student learn better. With the three distillation terms, the improvements are larger for the smaller student networks and for the networks not initialized with ImageNet-pretrained weights, which indicates that the effect of distillation is more pronounced in these settings. Such an initialization is also a way of transferring knowledge from another source (ImageNet).

On the other hand, one can see that each distillation scheme leads to a higher mIoU score. This implies that the three distillation schemes make complementary contributions to better training the compact network.

The affinity graph in pair-wise distillation. In this section, we discuss the impact of the connection range β and the granularity α of each node in building the affinity graph. Calculating the pair-wise similarities among all pixels in the feature map forms the most complete affinity graph, but leads to high computational complexity. We first fix each node to be one pixel, and change the connection range from the global feature map to local patches. Then, we keep the connection range to be the global feature map, and use a local patch to denote each node, changing the granularity from fine to coarse. The results, shown in Table II, are the average results of the pair-wise distillation over three runs. We employ a ResNet18 (1.0) with weights pretrained on ImageNet as the student network. All the experiments are performed with both pixel-wise distillation and pair-wise distillation, but the size of the affinity graph in the pair-wise distillation varies. From Table II, we can see that increasing the connection range helps improve the distillation performance. The best result is obtained with the global connection range and is slightly better than the complete affinity graph, while the number of connections is significantly decreased. Using a small local patch to denote a node when calculating the affinity graph may form a more stable correlation between different locations. One can choose to use the local patch to cut down the number of nodes, instead of decreasing the connection range, for a better trade-off between efficiency and accuracy.

TABLE II: The impact of the connection range and node granularity. We can see that keeping a global connection range is more helpful in pair-wise distillation.
Method Validation mIoU(%) Connections
Resnet18 (1.0)

The adversarial training in holistic distillation. In this section, we illustrate that the GAN is able to distill the holistic knowledge. As described in Section 3.1, we employ a fully convolutional network with five convolution blocks as our discriminator. Each convolution block has ReLU and batch-normalization (BN) layers, except for the final output convolution layer. We also insert two self-attention blocks in the discriminator to capture the structure information. The capacity of the discriminator affects the adversarial training, and we conduct experiments to discuss the impact of the discriminator's architecture. The results are shown in Table III, where AnLm represents the architecture of a discriminator with n self-attention layers and m convolution blocks with BN layers. The detailed structures can be seen in Figure 4, where the red arrow represents a self-attention layer.

Fig. 4: Different architectures of the discriminator we discussed. The red arrow represents a self-attention layer. The orange block denotes a residual block with stride 2. We add an average pooling layer in the output block to get the final score.

From Table III, we can see that adding self-attention layers improves the average mIoU, and adding more self-attention layers stabilizes the results, as the variance becomes smaller. We choose A2L4 considering the performance, stability and computational cost. With the same number of self-attention layers, a deeper discriminator helps the adversarial training.

Architecture Index Validation mIoU (%)
Change self-attention layers
A2L4 (ours)
Remove convolution blocks
A2L4 (ours)
TABLE III: We choose ResNet18 (1.0) as the example student net. An index AnLm represents n attention blocks and m residual blocks in the discriminator. The capacity of the discriminator affects the adversarial training.
Class mIoU road sidewalk building wall fence pole traffic light traffic sign vegetation
D_Shallow 72.28 97.31 80.07 91.08 36.86 50.93 62.13 66.4 76.77 91.73
D_no_attention 72.69 97.36 80.22 91.62 45.16 56.97 62.23 68.09 76.38 91.94
Our method 74.10 97.15 79.17 91.60 44.84 56.61 62.37 67.37 76.34 91.91
class terrain sky person rider car truck bus train motorcycle bicycle
D_Shallow 60.14 93.76 79.89 55.32 93.45 69.44 73.83 69.54 48.98 75.78
D_no_attention 62.98 93.84 80.1 57.35 93.45 68.71 75.26 56.28 47.66 75.59
Our method 58.67 93.79 79.9 56.61 94.3 75.83 82.87 72.03 50.89 75.72
TABLE IV: We choose ResNet18 (1.0) as the example student net. Class IoU with three different discriminator architectures. The self-attention layer can significantly improve the accuracy of structured objects, such as truck, bus, train and motorcycle.

To verify the effectiveness of adversarial training, we further explore the learning ability of three typical discriminators: the shallowest one (A2L2), the one without attention layers (A0L4) and ours (A2L4). The IoU for each class is listed in Table IV. It is clear that the self-attention layers help the discriminator capture the structure; therefore the accuracy of the student on structured objects is improved.

In the adversarial training, the student, a.k.a. the generator, tries to learn the distribution of the real samples (the outputs of the teacher). Because we apply the Wasserstein distance to turn the difference between the two distributions into a more intuitive score, the scores are highly correlated with the quality of the segmentation maps. We use a well-trained discriminator (A2L4) to evaluate the score of a segmentation map. For each image, we feed five segmentation maps into the discriminator: the output of the teacher net, the output of the student net w/o holistic distillation, and the outputs of the student nets w/ holistic distillation under the three different discriminator architectures (listed in Table IV), and compare the distributions of the embedding scores. We evaluate on the validation set and calculate the average score difference between the different student nets and the teacher net; the results are shown in Table V. With holistic distillation, the segmentation maps produced by the student net achieve a score similar to the teacher's, indicating that the GAN helps distill the holistic structure knowledge.
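The score-difference evaluation above can be sketched as follows. The discriminator interface (one scalar embedding score per segmentation map) is an assumption for illustration.

```python
import torch

@torch.no_grad()
def average_score_difference(discriminator, teacher_maps, student_maps):
    """Mean embedding-score gap between paired teacher and student segmentation maps.

    Assumes the discriminator returns one scalar score per input map, as in the
    Wasserstein-style setup described in the text.
    """
    diffs = []
    for t_map, s_map in zip(teacher_maps, student_maps):
        diffs.append((discriminator(t_map) - discriminator(s_map)).abs().mean())
    return torch.stack(diffs).mean().item()
```

A student whose maps score close to the teacher's (small average difference, as in Table V) has learned a distribution the discriminator can barely separate from the teacher's.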

method Score difference mIoU
Teacher 0 78.56
Student w/o D 2.28 69.21
w/ D_no_attention 0.23 72.69
w/ D_shallow 0.46 72.28
w/ D_ours 0.14 74.10
TABLE V: We choose ResNet18 (1.0) as the example student net. The embedding score difference and mIoU on the validation set of Cityscapes.

We also draw a histogram in Figure 5 to show the distribution of the scores of the segmentation maps across the validation set. The well-trained discriminator assigns higher scores to high-quality segmentation maps, and the three student nets with holistic distillation generate segmentation maps with higher scores and better quality. Adding self-attention layers and more convolution blocks helps the student net imitate the distribution of the teacher net and achieve better performance.

Fig. 5: The score distributions of segmentation maps generated by different student nets, evaluated by a well-trained discriminator. With adversarial training, the distributions of the segmentation maps become more similar to the teacher's; our method (red) is the most similar to the teacher (orange).
(a) Bus
(b) Truck
Fig. 6: Segmentation results for structured objects with ResNet18 (1.0) trained with different discriminators. (a) W/o holistic distillation, (b) w/ D_shallow, (c) w/ D_no_attention, (d) our method, (e) teacher net, (f) ground truth, (g) image. One can see that a strong discriminator helps the student learn structured objects better. With the attention layers, the labels of the objects are more consistent.

Feature and local pair-wise distillation. We compare the variants of the pair-wise distillation:

  • Feature distillation by MIMIC [58, 34]: We follow [34] to align the features of each pixel between the teacher and the student through a convolution layer to match the feature dimensions.

  • Feature distillation by attention transfer [78]: We aggregate the response maps into a so-called attention map (single channel), and then transfer the attention map from the teacher to the student.

  • Local pair-wise distillation [73]: We distill a local similarity map, which represents the similarities between each pixel and its neighboring pixels.
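For concreteness, the attention-transfer baseline can be sketched as below. The squared-channel aggregation and L2 normalization follow the common formulation of attention transfer; the mean-squared reduction is our assumption.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # Aggregate the C response maps into one single-channel attention map by
    # averaging squared activations over channels, then flatten and L2-normalize.
    am = feat.pow(2).mean(dim=1).flatten(1)  # (B, H*W)
    return F.normalize(am, p=2, dim=1)

def at_loss(feat_s, feat_t):
    # Match the student's attention map to the teacher's.
    return (attention_map(feat_s) - attention_map(feat_t)).pow(2).mean()
```

Unlike the pair-wise scheme, this transfers a per-pixel saliency signal rather than inter-pixel similarities, which is the distinction the comparison in Table VI probes.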

Method ResNet () ResNet () + ImN
w/o distillation
+ PI
+ PI + AT
+ PI + PA
TABLE VI: Empirical comparison of feature transfer by MIMIC [58, 34], attention transfer [78], and local pair-wise distillation [73] against our global pair-wise distillation. The segmentation is evaluated by mIoU (%). PI: pixel-wise distillation. MIMIC: using a convolution layer for feature distillation. AT: attention transfer for feature distillation. LOCAL: the local similarity distillation method. PA: our pair-wise distillation. ImN: initializing the network from weights pretrained on the ImageNet dataset.

We replace our pair-wise distillation with each of the above three distillation schemes to verify the effectiveness of our global pair-wise distillation. From Table VI, we can see that our pair-wise distillation method outperforms all the other distillation methods. Its superiority over the feature distillation schemes, MIMIC [34] and attention transfer [78], which transfer the knowledge for each pixel separately, comes from the fact that we transfer structured knowledge rather than aligning the features of each individual pixel. Its superiority over the local pair-wise distillation shows the effectiveness of our global pair-wise distillation, which is able to transfer the whole structure information rather than only local boundary information [73].

Method #Params (M) FLOPs (B) Test § Val.
Current state-of-the-art results
ENet [50] n/a
ERFNet [75] n/a
FCN [62] n/a
RefineNet [38] n/a
OCNet [77] n/a
PSPNet [83] n/a
Results w/ and w/o distillation schemes
MD [73] 5 n/a
MD (Enhanced) [73] n/a
ESPNet-C [46]
ESPNet-C (ours)
ESPNet [46]
ESPNet (ours)
ResNet ()
ResNet () (ours)
ResNet18 (1.0)
ResNet18 (1.0) (ours)
MobileNetVPlus [41]
MobileNetVPlus (ours)

  • Train from scratch

  • Initialized from the weights pre-trained on ImageNet

  • We select the best model during training on the validation set to submit to the leaderboard. All our models are tested at a single scale. Some cumbersome networks, such as OCNet and PSPNet, are tested at multiple scales.

TABLE VII: The segmentation results on the testing, validation (Val.), training (Tra.) set of Cityscapes.
Fig. 7: Illustrations of the effectiveness of pixel-wise and structured distillation schemes in terms of class IoU scores on the network MobileNetV2Plus [41] over the Cityscapes test set. Both pixel-level and structured distillation are helpful for improving the performance especially for the hard classes with low IoU scores. The improvement from structured distillation is more significant for structured objects, such as bus and truck.
(a) Image
(b) W/o distillation
(c) Pixel-wise distillation
(d) Our method
(e) Ground truth
Fig. 8: Qualitative results on the Cityscapes testing set produced from MobileNetV2Plus: (a) initial images, (b) w/o distillation, (c) only w/ pixel-wise distillation, (d) Our distillation schemes: both pixel-wise and structured distillation schemes. The segmentation map in the red box about four structured objects: trunk, person, bus and traffic sign are zoomed in. One can see that the structured distillation method (ours) produces more consistent labels.

Segmentation Results

Cityscapes. We apply our structured distillation method to several compact networks: MobileNetV2Plus [41], which is based on a MobileNetV2 model, and ESPNet-C [46] and ESPNet [46], which are carefully designed for mobile applications. Table VII presents the segmentation accuracy, the model complexity and the model size. FLOPs2 is calculated at a fixed input resolution to evaluate the complexity, and #Params is the number of network parameters. We can see that our distillation approach improves the results over the compact networks: ESPNet-C and ESPNet [46], ResNet (), ResNet (), and MobileNetV2Plus [41]. For the networks without pre-training, such as ResNet () and ESPNet-C, the improvements are very significant. Compared with MD (Enhanced) [73], which uses the pixel-wise and local pair-wise distillation schemes over MobileNet, our approach with the similar network MobileNetV2Plus achieves higher segmentation quality on the validation set, with slightly higher computational complexity and a much smaller model size.

Figure 7 shows the IoU scores for each class over MobileNetV2Plus. Both the pixel-wise and structured distillation schemes improve the performance, especially for the categories with low IoU scores. In particular, the structured distillation (pair-wise and holistic) brings significant improvements for structured objects, e.g., bus and truck. The qualitative segmentation results in Figure 8 visually demonstrate the effectiveness of our structured distillation for structured objects, such as trucks, buses, persons, and traffic signs.

CamVid. Table VIII shows the performance of the student networks w/o and w/ our distillation schemes, along with state-of-the-art results. We train and evaluate the student networks w/ and w/o distillation at the same resolution as ENet, following its setting. Again, we can see that the distillation scheme improves the performance. Figure 9 shows some samples on the CamVid test set w/o and w/ the distillation, produced by ESPNet.

Method Extra data mIoU (%) #Params (M)
ENet [50] no
FC-DenseNet [13] no
SegNet [3] ImN
DeepLab-LFOV [7] ImN
FCN-s [62] ImN

ESPNet-C (ours) no
ESPNet-C (ours) unl
ESPNet[46] no
ESPNet (ours) no
ESPNet (ours) unl
ResNet ImN
ResNet (ours) ImN
ResNet (ours) ImN+unl
TABLE VIII: The segmentation performance on the test set of CamVid. ImN = ImageNet dataset, and unl = unlabeled street scene dataset sampled from Cityscapes.
(a) Image
(b) W/o dis.
(c) Our method
(d) Ground truth
Fig. 9: Qualitative results on the CamVid test set produced by ESPNet. W/o dis. denotes the baseline student network trained without distillation.

We also conduct an experiment using an extra unlabeled dataset, containing unlabeled street scene images collected from the Cityscapes dataset, to show that the distillation schemes can transfer knowledge from unlabeled images. The experiments are done with ESPNet and ESPNet-C. The loss function is almost the same, except that there is no cross-entropy loss over the unlabeled dataset. The results are shown in Figure 10. We can see that our distillation method with the extra unlabeled data significantly improves the mIoU of ESPNet-C and ESPNet.

Fig. 10: The effect of structured distillation on CamVid. This figure shows that distillation can improve the results in two cases: trained over only the labeled data and over both the labeled and extra unlabeled data.

ADE20K. The ADE20K dataset is a very challenging dataset containing a large number of object categories. The frequency of objects appearing in scenes and the pixel ratios of different objects follow a long-tail distribution. For example, the stuff classes, such as wall, building, floor, and sky, occupy the majority of all the annotated pixels, while discrete objects, such as vase and microwave at the tail of the distribution, occupy only a small fraction of the annotated pixels.

We report the results in Table IX for ResNet and MobileNetV2, which are trained with initial weights pretrained on the ImageNet dataset, and for ESPNet, which is trained from scratch. We follow the same training scheme as in [72]. All the results are tested at a single scale. For ESPNet, with our distillation, the mIoU score is improved, and it achieves higher accuracy with fewer parameters than SegNet. For ResNet and MobileNetV2, after the distillation, we obtain an improvement over the counterparts without distillation reported in [72]. We check the result for each class and find that the improvements mainly come from the discrete objects.

Method mIoU(%) Pixel Acc. (%) #Params (M)

SegNet [3]
DilatedNet [72]
PSPNet (teacher) [83]
FCN [62]
ESPNet [46]
ESPNet (ours)

MobileNetV2 [72]
MobileNetV2 (ours)
ResNet [72]
ResNet (ours)
TABLE IX: mIoU and pixel accuracy on the validation set of ADE20K.
(a) Image
(b) W/o dis.
(c) Our method
Fig. 11: Qualitative results on ADE20K produced by MobileNetV2. W/o dis. denotes the baseline student network trained without distillation.

4.2 Depth Estimation

Implementation Details

Method backbone #Params (M) rel log10 rms
Lower is better Higher is better
Laina et al. [32] ResNet50 60.62 0.127 0.055 0.573 0.811 0.953 0.988
DORN [16] ResNet101 105.17 0.115 0.051 0.509 0.828 0.965 0.992
AOB [27] SENET-154 149.82 0.115 0.050 0.530 0.866 0.975 0.993
VNL (teacher) [70] ResNext101 86.24 0.108 0.048 0.416 0.875 0.976 0.994
CReaM [63] - 1.5 0.190 - 0.687 0.704 0.917 0.977
RF-LW [48] MobileNetV2 3.0 0.149 - 0.565 0.790 0.955 0.990
VNL (student) MobileNetV2 2.7 0.135 0.060 0.576 0.813 0.958 0.991
VNL (student) w/ distillation MobileNetV2 2.7 0.130 0.055 0.544 0.838 0.971 0.994

TABLE X: Depth estimation results and model parameters on NYUD-v2 test dataset. With the structured knowledge distillation, the performance is improved over all evaluation metrics.

Network structures: We use the same model described in [70] with the ResNext101 backbone as our cumbersome model, and replace the backbone with MobileNetV2 to obtain the compact model. Training details: We train the student net by mini-batch stochastic gradient descent (SGD), with the learning rate decayed by a fixed factor during training. The student models w/ and w/o distillation are trained for the same number of epochs.


NYUD-V2. The NYUD-V2 dataset contains annotated indoor images, split into a training set and a test set. The image size is . Some methods sample more images from the video sequences of NYUD-v2 to form a Large-NYUD-v2 dataset to further improve performance. Following [70], we conduct ablation studies on the small dataset and also apply the distillation method to current state-of-the-art real-time depth models trained with Large-NYUD-v2 to verify the effectiveness of the structured knowledge distillation.

Evaluation Metrics

We follow previous methods [70] to quantitatively evaluate the performance of monocular depth estimation with the following metrics: mean absolute relative error (rel), mean log10 error (log10), root mean squared error (rms), and the accuracy under threshold (δ < 1.25^i, i = 1, 2, 3).
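These metrics can be computed as in the following NumPy sketch; the threshold form δ < 1.25^i is the standard convention, and we treat it as an assumption about the exact protocol of [70].

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid (positive) depth values."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    rel = np.mean(np.abs(pred - gt) / gt)                   # mean absolute relative error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))  # mean log10 error
    rms = np.sqrt(np.mean((pred - gt) ** 2))                # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)                # symmetric ratio
    acc = {f"delta{i}": np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return {"rel": rel, "log10": log10, "rms": rms, **acc}
```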


Ablation studies: We compare the pixel-wise distillation and the structured knowledge distillation in this section. In dense classification problems, e.g., semantic segmentation, the output logits of the teacher form a soft distribution over all the classes, which contains the relations among different classes. Therefore, directly transferring the logits from the cumbersome model to the compact one at the pixel level can help improve the performance. Different from semantic segmentation, because the depth map consists of continuous values, the output of the teacher is not as accurate as the ground truth. In our experiments, we found that adding pixel-level distillation does not help improve the accuracy in the depth estimation task, so we only use the structured knowledge distillation for depth estimation.
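The pixel-wise (logit) distillation discussed above, which works for segmentation but not for continuous depth values, can be sketched as a per-pixel KL divergence between softened class distributions; the temperature and the reduction are our assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_wise_distill(logits_s, logits_t, T=1.0):
    # logits: (B, K, H, W); treat each pixel as an independent K-way classification.
    p_t = F.softmax(logits_t / T, dim=1)        # teacher's soft class distribution
    log_p_s = F.log_softmax(logits_s / T, dim=1)
    # KL(p_t || p_s), averaged over batch and spatial positions.
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean")
    return kl * (T * T) / (logits_s.shape[2] * logits_s.shape[3])
```

For depth, the teacher's output is a single continuous value per pixel with no class distribution to soften, which is why this term carries no extra relational information there.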

To verify that the distillation method can further improve the accuracy with unlabeled data, we use images sampled from the video sequences of NYUD-v2 without depth maps. The results are shown in Table XI. We can see that the structured knowledge distillation performs better than pixel-wise distillation, and adding extra unlabeled data further improves the accuracy.

Method Baseline +PI +PA +PA+HO +PA+HO+Unl
rel 0.181 0.183 0.175 0.173 0.160
TABLE XI: Relative error on the NYUD-V2 test dataset. Unl = unlabeled data sampled from the large video sequence. Pixel-level distillation does not help improve the accuracy, as the ground truth of depth estimation consists of continuous values; therefore we only use structured knowledge distillation in the depth estimation task.

Comparison with current state-of-the-art: We apply the distillation method on top of current state-of-the-art mobile models for depth estimation. Following [70], we train the student net on Large-NYUD-v2 with the same constraints as in [70] as our baseline. Under the same training setup, adding the structured knowledge distillation terms further improves this strong baseline on rel. In Table X, we list the model parameters and accuracy of current state-of-the-art large models along with some real-time models, indicating that structured knowledge distillation works on a strong baseline.

4.3 Object Detection

Implementation Details

Network structures: We adopt the recent one-stage architecture FCOS [66] with the ResNeXt-32x8d-101-FPN backbone as the cumbersome network (teacher). The number of channels in the detector towers is set to . It is a simple anchor-free model that achieves performance comparable to state-of-the-art two-stage detection methods.

We choose two different models based on the MobileNetV2 backbone released by FCOS [66] as our student nets: c128-MNV2 and c256-MNV2, where c128/c256 denote the number of channels in the detector towers. We apply the distillation loss on all the output levels of the feature pyramid network.

Training setup: We follow the training schedule in FCOS [66]. For ablation studies, the teacher and the students w/ and w/o distillation are all trained with stochastic gradient descent (SGD) for 90K iterations with an initial learning rate of 0.01 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at iterations 60K and 80K. Weight decay and momentum are set to 0.0001 and 0.9, respectively. To compare with other state-of-the-art real-time detectors, we double the training iterations and the batch size, and the distillation method can still improve the results on these strong baselines.


COCO: Microsoft Common Objects in Context (COCO) [40] is a large-scale object detection benchmark. With the commonly used COCO trainval35k split [56, 39], the images are divided into a training set and a validation set (minival). We evaluate the ablation results on the validation set, and we also submit the final results to the COCO test-dev. The dataset contains 80 object classes and more than a million object instances.

Evaluation Metrics

AP (average precision) is a popular metric for measuring the accuracy of object detectors [56, 42]. Average precision computes the average precision value for recall values from 0 to 1. The mAP on the COCO dataset differs slightly from traditional metrics: it is the AP averaged over multiple intersection-over-union (IoU) thresholds, from 0.5 to 0.95 with a step of 0.05. Averaging over IoU thresholds rewards detectors with better localization. We also report AP50 and AP75, which represent the AP at single IoU thresholds of 0.5 and 0.75, respectively. APs, APm and APl are the AP across different scales, for small, medium and large objects respectively.
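The averaging over IoU thresholds can be sketched as follows; `box_iou` and the `ap_at_iou` callback are illustrative helpers, not part of the official COCO evaluation code.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def coco_map(ap_at_iou):
    """COCO-style mAP: average the AP over IoU thresholds 0.5, 0.55, ..., 0.95.

    `ap_at_iou` maps an IoU threshold to the AP computed at that threshold.
    """
    thresholds = np.arange(0.5, 1.0, 0.05)
    return float(np.mean([ap_at_iou(t) for t in thresholds]))
```

Because a detection must match the ground truth even at IoU 0.95 to contribute everywhere, the averaged metric rewards tight localization, which is the behavior noted in the text.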

Method mAP AP50 AP75 APs APm APl
Teacher 42.5 61.7 45.9 26.0 46.2 54.3
Student 31.0 48.5 32.7 17.1 34.2 39.7
+MIMIC [34] 31.4 48.1 33.4 16.5 34.3 41.0
+PA 31.9 49.2 33.7 17.7 34.7 41.3
Ours 32.1 49.5 34.2 18.5 35.3 41.2
TABLE XII: PA vs. MIMIC on the minival split with MobileNetV2-c256 as the student net. Both distillation methods improve the accuracy of the detector, and the structured knowledge distillation performs better than the pixel-wise MIMIC. By applying all the distillation terms together, the results are further improved.


Comparison with different distillation methods: To demonstrate the effectiveness of the structured knowledge distillation, we compare the pair-wise distillation method with the previous MIMIC [34] method, which aligns the feature maps at the pixel level. We use c256-MNV2 as the student net, and the results are shown in Table XII. Adding the pixel-wise MIMIC distillation improves the detector by 0.4 mAP, while our structured knowledge distillation improves it by 0.9 mAP. Under all evaluation metrics, the structured knowledge distillation performs better than MIMIC. By combining the structured knowledge distillation with the pixel-wise distillation, the results are further improved to 32.1 mAP. Compared to the baseline without distillation, the improvements in AP75, APs and APl are more pronounced, indicating that the distillation method helps the detector produce more accurate results and handle extremely small and large objects better. As illustrated in Figure 12, the detector trained with distillation can find more small objects, such as persons and birds, and provides higher scores for the predicted labels.

Method mAP AP50 AP75 APs APm APl
Teacher 42.5 61.7 45.9 26.0 46.2 54.3
C128-MV2 30.9 48.5 32.7 17.1 34.2 39.7
W/ distillation 31.8 49.2 33.8 17.8 35.0 40.4
C256-MV2 33.1 51.1 35.0 18.5 36.1 43.4
W/ distillation 33.9 51.8 35.7 19.7 37.3 43.4
TABLE XIII: Detection accuracy with and without distillation on COCO-minival.

Results with different student nets: We follow the same training steps (90K) and batch size (32) as in FCOS [66] and apply the distillation method on two different released structures: C256-MV2 and C128-MV2. The results w/ and w/o distillation are shown in Table XIII: by applying the structured knowledge distillation combined with pixel-wise distillation, the mAP of C128-MV2 and C256-MV2 is improved by 0.9 and 0.8, respectively.

backbone AP AP AP AP AP AP time (ms/img)
 RetinaNet [39] ResNet-101-FPN 39.1 59.1 42.3 21.8 42.7 50.2 198
 RetinaNet [39] ResNeXt-101-FPN 40.8 61.1 44.1 24.1 44.2 51.2 -
 FCOS [66] (teacher) ResNeXt-101-FPN 42.7 62.2 46.1 26.0 45.6 52.6 130
 YOLOv2 [53] DarkNet-19 21.6 44.0 19.2 5.0 22.4 35.5 25
 SSD513 [42] ResNet-101-SSD 31.2 50.4 33.3 10.2 34.5 49.8 125
 DSSD513 [15] ResNet-101-DSSD 33.2 53.3 35.2 13.0 35.4 51.1 156
 YOLOv3 [54] Darknet-53 33.0 57.9 34.4 18.3 35.4 41.9 51
 FCOS (student) [66] MobileNetV2-FPN 31.4 49.2 33.3 17.1 33.5 38.8 45
 FCOS (student) w/ distillation MobileNetV2-FPN 34.1 52.2 36.4 19.0 36.2 42.0 45
TABLE XIV: Detection results and inference time on the COCO test-dev. The inference times are those reported in the original papers [66, 39], for reference. Our distillation method improves the accuracy of a strong baseline with no extra inference time.
(a) Detection results W/o distillation
(b) Detection results w/ distillation
Fig. 12: Detection results on COCO dataset. With the structured knowledge distillation, the detector can improve the results with occluded, highly overlapped and extremely small objects. It can also produce a higher classification score compared to the baseline.

Results on the test-dev: The original mAP on the validation set of C128-MV2 reported by FCOS is 30.9 with 90K iterations. We double the training iterations and train with the distillation method, then submit the results to the COCO test-dev to show the position of our method against the state-of-the-art. To make a fair comparison, we also double the training iterations without any distillation method and evaluate on minival. The test results are shown in Table XIV, where we also list the AP and inference time of some state-of-the-art one-stage detectors to show the positions of the baseline we chose and of our detector trained with the structured knowledge distillation method.

5 Conclusion

We have studied knowledge distillation for training compact dense prediction networks with the help of cumbersome/teacher networks. By considering the structure information in dense prediction problems, we have presented two structured distillation schemes: pair-wise distillation and holistic distillation. We demonstrated the effectiveness of our proposed distillation schemes on several recently developed compact networks on three dense prediction tasks: semantic segmentation, depth estimation and object detection. Our structured knowledge distillation methods are complementary to traditional pixel-wise distillation methods.

Yifan Liu obtained her B.S. and M.Sc. in Artificial Intelligence from Beihang University, and now is working as a Ph.D candidate in Computer Science at The University of Adelaide, supervised by Professor Chunhua Shen. Her research interests include image processing, dense prediction and real-time application in deep learning.

Changyong Shu received the B.S. degree in engineering mechanics from China University of Mining and Technology, Xuzhou, China in 2011, and the Ph.D. degree in flight vehicle design and engineering from Beihang University, Beijing, China, in 2017. He has been with the Nanjing Institute of Advanced Artificial Intelligence since 2018. His current research interest focuses on knowledge distillation.

Jingdong Wang is a Senior Principal Research Manager with the Visual Computing Group, Microsoft Research, Beijing, China. He received the B.Eng. and M.Eng. degrees from the Department of Automation, Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the PhD degree from the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology, Hong Kong, in 2007. His areas of interest include deep learning, large-scale indexing, human understanding, and person re-identification. He is an Associate Editor of IEEE TPAMI, IEEE TMM and IEEE TCSVT, and is an area chair (or SPC) of some prestigious conferences, such as CVPR, ICCV, ECCV, ACM MM, IJCAI, and AAAI. He is a Fellow of IAPR and an ACM Distinguished Member.

Chunhua Shen is a Professor at School of Computer Science, The University of Adelaide, Australia.


  1. The objective function is the summation of the losses over the mini-batch of training samples. For description clarity, we omit the summation operation.
  2. The FLOPs is calculated with the pytorch version implementation [1].


  1. (2018) PyTorch implementation for computing FLOPs. Cited by: footnote 2.
  2. J. Ba and R. Caruana (2014) Do deep nets really need to be deep?. In Proc. Advances in Neural Inf. Process. Syst., pp. 2654–2662. Cited by: §2.
  3. V. Badrinarayanan, A. Kendall and R. Cipolla (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (12), pp. 2481–2495. Cited by: Fig. 1, TABLE VIII, TABLE IX.
  4. G. J. Brostow, J. Shotton, J. Fauqueur and R. Cipolla (2008) Segmentation and recognition using structure from motion point clouds. In Proc. Eur. Conf. Comp. Vis., pp. 44–57. Cited by: §4.1.2.
  5. Y. Cao, Z. Wu and C. Shen (2017) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol. 28 (11), pp. 3174–3182. Cited by: §3.3.
  6. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848. Cited by: §1, §2.
  7. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In Proc. Int. Conf. Learn. Representations, Cited by: §1, §2, TABLE VIII.
  8. L. Chen, Y. Zhu, G. Papandreou, F. Schroff and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. Proc. Eur. Conf. Comp. Vis.. Cited by: §1, §2.
  9. Y. Chen, C. Shen, X. Wei, L. Liu and J. Yang (2017) Adversarial PoseNet: a structure-aware convolutional network for human pose estimation. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1212–1221. Cited by: §2.
  10. Y. Chen, N. Wang and Z. Zhang (2018) DarkRank: accelerating deep metric learning via cross sample similarities transfer. Proc. Eur. Conf. Comp. Vis.. Cited by: §2.
  11. S. Choi, D. Min, B. Ham, Y. Kim, C. Oh and K. Sohn (2015) Depth analogy: data-driven approach for single image depth estimation using gradient samples. IEEE Trans. Image Process. 24 (12), pp. 5953–5966. Cited by: §2.
  12. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §4.1.2.
  13. S. J. M. Drozdzal, D. Vazquez and A. R. Y. Bengio (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. Proc. Workshop of IEEE Conf. Comp. Vis. Patt. Recogn.. Cited by: TABLE VIII.
  14. X. Fei, A. Wang and S. Soatto (2018) Geo-supervised visual depth prediction. In arXiv: Comp. Res. Repository, Vol. abs/1807.11130. Cited by: §2.
  15. C. Fu, W. Liu, A. Ranga, A. Tyagi and A. C. Berg (2017) Dssd: deconvolutional single shot detector. In arXiv: Comp. Res. Repository, Vol. abs/1701.06659. Cited by: §2, TABLE XIV.
  16. H. Fu, M. Gong, C. Wang, K. Batmanghelich and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2002–2011. Cited by: §2, TABLE X.
  17. R. Girshick, J. Donahue, T. Darrell and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 580–587. Cited by: §2.
  18. R. Girshick (2015) Fast r-cnn. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1440–1448. Cited by: §2.
  19. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. Proc. Advances in Neural Inf. Process. Syst. 3, pp. 2672–2680. Cited by: §2.
  20. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin and A. C. Courville (2017) Improved training of wasserstein gans. In Proc. Advances in Neural Inf. Process. Syst., pp. 5767–5777. Cited by: §3.1.
  21. K. Gwn Lore, K. Reddy, M. Giering and E. A. Bernal (2018) Generative adversarial networks for depth map estimation from rgb video. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1177–1185. Cited by: §2.
  22. C. Hane, L. Ladicky and M. Pollefeys (2015) Direction matters: depth estimation with a surface normal classifier. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 381–389. Cited by: §2.
  23. K. He, G. Gkioxari, P. Dollár and R. Girshick (2017) Mask r-cnn. In Proc. IEEE Int. Conf. Comp. Vis., pp. 2961–2969. Cited by: §2.
  24. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 770–778. Cited by: §2, §4.1.1.
  25. G. E. Hinton, O. Vinyals and J. Dean (2015) Distilling the knowledge in a neural network. arXiv: Comp. Res. Repository abs/1503.02531. Cited by: §1, §2, §3.
  26. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv: Comp. Res. Repository abs/1704.04861. Cited by: §2.
  27. J. Hu, M. Ozay, Y. Zhang and T. Okatani (2019) Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In Proc. Winter Conf. on Appl. of Comp. Vis., pp. 1043–1051. Cited by: TABLE X.
  28. G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger (2017) Densely connected convolutional networks. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2261–2269. Cited by: §2.
29. F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <1MB model size. arXiv: Comp. Res. Repository abs/1602.07360. Cited by: §2.
  30. J. Johnson, A. Alahi and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. Proc. Eur. Conf. Comp. Vis., pp. 694–711. Cited by: §2.
31. T. Karras, T. Aila, S. Laine and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. Proc. Int. Conf. Learn. Representations. Cited by: §2.
  32. I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 3D Vision, pp. 239–248. Cited by: §2, TABLE X.
33. H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In Proc. Eur. Conf. Comp. Vis., pp. 734–750. Cited by: §2.
  34. Q. Li, S. Jin and J. Yan (2017) Mimicking very efficient network for object detection. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7341–7349. Cited by: §1, §1, §2, 1st item, §4.1.4, §4.3.4, TABLE XII, TABLE VI.
  35. R. Li, K. Xian, C. Shen, Z. Cao, H. Lu and L. Hang (2018) Deep attention-based classification network for robust depth prediction. In arXiv: Comp. Res. Repository, Vol. abs/1807.03959. Cited by: §2.
  36. S. Z. Li (2009) Markov random field modeling in image analysis. Springer Science & Business Media. Cited by: §1.
  37. X. Li, H. Qin, Y. Wang, Y. Zhang and Q. Dai (2014) DEPT: depth estimation by parameter transfer for single still images. In Proc. Asian Conf. Comp. Vis., pp. 45–58. Cited by: §2.
38. G. Lin, F. Liu, A. Milan, C. Shen and I. Reid (2019) RefineNet: multi-path refinement networks for dense prediction. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Fig. 1, §1, §2, TABLE VII.
  39. T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comp. Vis., pp. 2980–2988. Cited by: §2, §4.3.2, TABLE XIV.
  40. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proc. Eur. Conf. Comp. Vis., pp. 740–755. Cited by: §4.3.2.
41. H. Liu (2018) LightNet: light-weight networks for semantic image segmentation. Cited by: Fig. 1, Fig. 7, §4.1.1, §4.1.5, TABLE VII.
42. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu and A. C. Berg (2016) SSD: single shot multibox detector. In Proc. Eur. Conf. Comp. Vis., pp. 21–37. Cited by: §1, §2, §4.3.3, TABLE XIV.
  43. Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo and J. Wang (2019) Structured knowledge distillation for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2604–2613. Cited by: §2.
44. Y. Liu, Z. Qin, T. Wan and Z. Luo (2018) Auto-painter: cartoon image generation from sketch by using conditional Wasserstein generative adversarial networks. Neurocomputing 311, pp. 78–87. Cited by: §2.
  45. P. Luc, C. Couprie, S. Chintala and J. Verbeek (2016) Semantic segmentation using adversarial networks. arXiv: Comp. Res. Repository abs/1611.08408. Cited by: §2.
  46. S. Mehta, M. Rastegari, A. Caspi, L. Shapiro and H. Hajishirzi (2018) ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. Proc. Eur. Conf. Comp. Vis.. Cited by: Fig. 1, §2, §4.1.1, §4.1.1, §4.1.5, TABLE VII, TABLE VIII, TABLE IX.
  47. M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv: Comp. Res. Repository abs/1411.1784. Cited by: §2, §3.1.
  48. V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen and I. Reid (2018) Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In arXiv: Comp. Res. Repository, Vol. abs/1809.04766. Cited by: TABLE X.
  49. H. Noh, S. Hong and B. Han (2015) Learning deconvolution network for semantic segmentation. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1520–1528. Cited by: §2.
  50. A. Paszke, A. Chaurasia, S. Kim and E. Culurciello (2016) ENet: a deep neural network architecture for real-time semantic segmentation. arXiv: Comp. Res. Repository abs/1606.02147. Cited by: Fig. 1, §1, §2, TABLE VII, TABLE VIII.
  51. D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell and A. A. Efros (2016) Context encoders: feature learning by inpainting. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2536–2544. Cited by: §2.
  52. J. Redmon, S. Divvala, R. Girshick and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 779–788. Cited by: §1, §2.
  53. J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7263–7271. Cited by: TABLE XIV.
54. J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. In arXiv: Comp. Res. Repository, Vol. abs/1804.02767. Cited by: TABLE XIV.
  55. S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele and H. Lee (2016) Generative adversarial text to image synthesis. Proc. Int. Conf. Mach. Learn., pp. 1060–1069. Cited by: §2.
56. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Proc. Advances in Neural Inf. Process. Syst., pp. 91–99. Cited by: §2, §2, §4.3.2, §4.3.3.
  57. E. Romera, J. M. Alvarez, L. M. Bergasa and R. Arroyo (2017) Efficient convnet for real-time semantic segmentation. In IEEE Intelligent Vehicles Symp., pp. 1789–1794. Cited by: Fig. 1.
  58. A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv: Comp. Res. Repository abs/1412.6550. Cited by: §1, §2, 1st item, TABLE VI.
59. O. Ronneberger, P. Fischer and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §2.
  60. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2, §4.1.1.
  61. A. Saxena, M. Sun and A. Y. Ng (2007) Learning 3-d scene structure from a single still image. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1–8. Cited by: §2.
62. E. Shelhamer, J. Long and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (4), pp. 640–651. Cited by: Fig. 1, §1, §2, TABLE VII, TABLE VIII, TABLE IX.
  63. A. Spek, T. Dharmasiri and T. Drummond (2018) CReaM: condensed real-time models for depth prediction using convolutional neural networks. In Int. Conf. on Intell. Robots and Sys., pp. 540–547. Cited by: §2, TABLE X.
  64. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1–9. Cited by: §2.
  65. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna (2016) Rethinking the inception architecture for computer vision. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2818–2826. Cited by: §2.
  66. Z. Tian, C. Shen, H. Chen and T. He (2019) FCOS: fully convolutional one-stage object detection. Proc. IEEE Int. Conf. Comp. Vis.. Cited by: §1, §2, §2, §3.3, §4.3.1, §4.3.1, §4.3.1, §4.3.4, TABLE XIV.
  67. M. Treml, J. Arjona-Medina, T. Unterthiner, R. Durgesh, F. Friedmann, P. Schuberth, A. Mayr, M. Heusel, M. Hofmarcher and M. Widrich (2016) Speeding up semantic segmentation for autonomous driving. In Proc. Workshop of Advances in Neural Inf. Process. Syst., Cited by: §2.
  68. G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose and M. Richardson (2016) Do deep convolutional nets really need to be deep (or even convolutional)?. In Proc. Int. Conf. Learn. Representations, Cited by: §2.
  69. H. Wang, Z. Qin and T. Wan (2018) Text generation based on generative adversarial nets with latent variables. In Proc. Pacific-Asia Conf. Knowledge discovery & data mining, pp. 92–103. Cited by: §2.
  70. Y. Wei, Y. Liu, C. Shen and Y. Yan (2019) Enforcing geometric constraints of virtual normal for depth prediction. Proc. IEEE Int. Conf. Comp. Vis.. Cited by: §2, §2, §3.3, §4.2.1, §4.2.2, §4.2.3, §4.2.4, TABLE X.
  71. D. Wofk, F. Ma, T. Yang, S. Karaman and V. Sze (2019) FastDepth: fast monocular depth estimation on embedded systems. Int. Conf. on Robotics and Automation. Cited by: §1, §2.
  72. T. Xiao, Y. Liu, B. Zhou, Y. Jiang and J. Sun (2018) Unified perceptual parsing for scene understanding. In Proc. Eur. Conf. Comp. Vis., Cited by: §4.1.5, TABLE IX.
  73. J. Xie, B. Shuai, J. Hu, J. Lin and W. Zheng (2018) Improving fast segmentation with teacher-student learning. Proc. British Machine Vis. Conf.. Cited by: §1, §1, §2, 3rd item, §4.1.4, §4.1.5, TABLE VI, TABLE VII.
  74. C. Yu, J. Wang, C. Peng, C. Gao, G. Yu and N. Sang (2018) Learning a discriminative feature network for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  75. F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. Proc. Int. Conf. Learn. Representations. Cited by: §1, §2, TABLE VII.
76. L. Yu, W. Zhang, J. Wang and Y. Yu (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In Proc. AAAI Conf. Artificial Intell., pp. 2852–2858. Cited by: §2.
  77. Y. Yuan and J. Wang (2018) OCNet: object context network for scene parsing. In arXiv: Comp. Res. Repository, Vol. abs/1809.00916. Cited by: Fig. 1, §2, TABLE VII.
  78. S. Zagoruyko and N. Komodakis (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. Proc. Int. Conf. Learn. Representations. Cited by: §2, 2nd item, §4.1.4, TABLE VI.
  79. H. Zhang, I. Goodfellow, D. Metaxas and A. Odena (2018) Self-attention generative adversarial networks. In arXiv: Comp. Res. Repository, Vol. abs/1805.08318. Cited by: §3.1.
  80. T. Zhang, G. Qi, B. Xiao and J. Wang (2017) Interleaved group convolutions. In Proc. IEEE Int. Conf. Comp. Vis., pp. 4383–4392. Cited by: §2.
  81. X. Zhang, X. Zhou, M. Lin and J. Sun (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. Proc. IEEE Conf. Comp. Vis. Patt. Recogn.. Cited by: §2.
82. H. Zhao, X. Qi, X. Shen, J. Shi and J. Jia (2018) ICNet for real-time semantic segmentation on high-resolution images. Proc. Eur. Conf. Comp. Vis.. Cited by: §1, §2.
  83. H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia (2017) Pyramid scene parsing network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2881–2890. Cited by: Fig. 1, §1, §2, §4.1.1, TABLE VII, TABLE IX.
  84. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba (2017) Scene parsing through ade20k dataset. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §4.1.2.