DifNet: Semantic Segmentation by Diffusion Networks
Abstract
Deep Neural Networks (DNNs) have recently shown state-of-the-art performance on semantic segmentation tasks; however, they still suffer from poor boundary localization and spatially fragmented predictions. The difficulty lies in the requirement of making dense predictions from a long-path model all at once, since details are hard to preserve as data passes through deeper layers. In this work, we instead decompose this difficult task into two relatively simple subtasks: seed detection, which only has to produce initial predictions that need not be complete or precise, and similarity estimation, which measures the probability that any two nodes belong to the same class without needing to know which class that is. We use one branch for each subtask, and apply a cascade of random walks based on hierarchical semantics to approximate a complex diffusion process that propagates seed information to the whole image according to the estimated similarities.
The proposed DifNet consistently improves over baseline models of the same depth and with an equivalent number of parameters, and also achieves promising performance on the Pascal VOC and Pascal Context datasets. DifNet is trained end-to-end without complex loss functions.
Peng Jiang (Shandong University, sdujump@gmail.com), Fanglin Gu (Shandong University, fanglin.gu@gmail.com), Changhe Tu (Shandong University, chtu@sdu.edu.cn), Baoquan Chen (Shandong University, baoquan.chen@gmail.com)
Preprint. Work in progress.
1 Introduction
Semantic segmentation, which aims to give dense label predictions for the pixels in an image, is one of the fundamental topics in computer vision. Recently, fully convolutional networks (FCNs), proposed in (1), have proved to be much more powerful than schemes that rely on hand-crafted features. Following FCNs, subsequent works (2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16) have continued to improve results by further introducing atrous convolution, shortcuts between layers and CRF post-processing.
Even with these refinements, current FCN-based semantic segmentation methods still suffer from poor boundary localization and spatially fragmented predictions, owing to the following challenges. First, deeper models are preferred for abstracting invariant high-level feature representations; however, this invariance and the increasing depth may cause detailed spatial information to be lost. Second, given such a long-path model, the requirement of making dense predictions all at once makes these problems more severe. Third, the inability to capture long-range dependencies makes it hard for the model to generate accurate and uniform predictions (17).
To address these challenges, we relieve the burden on the semantic segmentation model by decomposing the task into two relatively simple subtasks, seed detection and similarity estimation, and then diffusing seed information to the whole image according to the estimated similarities. We train one branch network for each subtask, and train both simultaneously, so our model has two branches: a seed branch and a similarity branch. The simplicity and motivation lie in the following aspects. For seed detection, we aim to give initial predictions that need not be complete or precise; this requirement matches the properties of the high-level features generated by deep neural networks. For similarity estimation, we intend to estimate the probability that any two nodes belong to the same class; for this purpose, relatively low-level features are already adequate.
Based on these motivations, we let the seed branch produce initial predictions and the similarity branch estimate similarities. Specifically, the seed branch first predicts initial dense predictions and then learns an importance map that is used to re-weight each channel of the initial dense predictions. At the same time, the similarity branch extracts features at different levels, from which transition matrices are computed. A transition matrix measures the probability of a random walk between any two nodes; in our implementation, the transition matrices also reflect similarities at different semantic levels. Finally, we apply a cascade of random walks based on these transition matrices to approximate a complex diffusion process, in order to propagate seed information to the whole image according to the hierarchical similarities. In this way, the inversion of a dense matrix in the diffusion process can be avoided. Our diffusion process by cascaded random walks shares a similar idea with the residual learning framework (18), which eases the approximation of a complex objective by learning residuals. Moreover, each random walk computes the final response at a position as a weighted sum of all the seed values, which is a non-local operation that can capture long-range dependencies regardless of positional distance. As Fig. 1 shows, the cascaded random walks also increase the flexibility and diversity of information propagation paths.
Our proposed DifNet is trained end-to-end with a common loss function and no post-processing. In experiments, our model consistently outperforms baseline models of the same depth and with an equivalent number of parameters, and also achieves promising performance on the Pascal VOC 2012 and Pascal Context datasets. In summary, our contributions are:

We decompose the semantic segmentation task into two simple subtasks.

We approximate a complex diffusion process by cascaded random walks.

We provide comprehensive mechanism studies.

Our model can capture long-range dependencies and has more information propagation paths.

Our model demonstrates consistent improvements over various baseline models.
2 Related Work
Many works (2; 8; 9; 10; 11; 12; 13; 14; 15; 19; 20; 21) have approached the problems of poor boundary localization and spatially fragmented predictions in semantic segmentation. (Here, we mainly focus on methods that are based on deep neural networks, as these represent the state of the art and are the most relevant to our scheme.) Among them, the conditional random field (CRF) is one of the major methods used to refine results according to graphical structures.
Works such as (2) use a CRF as a disjoint post-processing module on top of the segmentation model. Because training and post-processing are disjoint, such approaches often fail to accurately capture semantic relationships between objects and thus still produce spatially disjoint segmentation results. Instead, works (8; 10; 12; 9; 20) propose to integrate the CRF into the network, thereby enabling end-to-end training of the joint model. However, this integration may lead to a dramatic increase in complexity and in the number of parameters: optimizing such CNN-CRF models usually requires many iterations of mean-field inference or a recurrent neural network (12). To avoid iterative optimization, works (13; 14) employ Gaussian conditional random fields, which can be optimized by solving only a system of linear equations, but at the cost of more complex gradient computation. Apart from the CRF view, work (15) utilizes graphical structures to refine results by random walks, whose final prediction formula contains a dense matrix inversion term that is ill-suited to a network. Unlike the methods above, work (21) does not compute global pairwise relations directly; it predicts four local pairwise relations along different directions to approximate the global pairwise relations. However, this also increases the complexity of the model.
To integrate a CRF into the model, several works also employ networks with two branches, one for the pairwise term and one for the unary term. However, the usage and form of the outputs differ from ours: we require the pairwise term to be a transition matrix in which the values in each row or column sum to one, whereas other methods do not. For measuring similarity, different metrics have been proposed: ours and (14) compute similarities by inner product, while most others use the Mahalanobis distance. It is important to note that our DifNet comprises several transition matrices, which are generated from different levels of features. Each random walk operation is conducted on one transition matrix, see Fig. 1. In this way, we do not require each transition matrix to contain all the similarity information, which relieves the burden on the similarity (pairwise) branch, and the cascaded random walks also increase the flexibility and diversity of information propagation paths.
For supervised semantic segmentation tasks, cross-entropy is the most commonly used loss function. However, in some of the CNN-CRF models mentioned above, such as (15; 20), an additional loss for similarity estimation is also applied, where the similarity ground truth is derived from the label ground truth. According to (22), the optimal pairwise term and the optimal unary term are mutually influenced, so a strict constraint on only one of the two terms may not lead to a good result. Consequently, in our DifNet we only penalize the final predictions with a cross-entropy loss. As for the training strategy, we train our two branches simultaneously, while some works, for instance (14), train the two branches iteratively.
3 Methodology
Given the input image of size , the pairwise relationship can be expressed as affinity matrix , where and each element encodes the similarity between node and node . As mentioned in Sec. 1, our seed vector is defined as where is the initial dense predictions of size ( is the number of classes) and importance map is the diagonal matrix with value in .
Assuming the final predictions are , in order to diffuse seed value to all other nodes according to affinity matrix, we can optimize the following equation:
(1) 
Eq. 1 is convex and has a closed-form solution, without loss of generality:
(2) 
where is the degree matrix, defined as , . Eq. 2 is usually regarded as a diffusion process , where is the seed vector and is usually called the diffusion matrix ( equals the inverse of the normalized graph Laplacian). Works such as (15; 14) propose to use networks with two branches to predict these two components respectively. However, to compute the final predictions , they must invert a dense matrix or solve a system of linear equations, which is time-consuming and unstable (the matrix to be inverted may be singular). To tackle this problem, we propose to use a cascade of random walks to approximate the diffusion process, where a random walk with the seed vector as its initial state is defined as:
(3) 
where is a parameter in that controls the degree of random walk to other states from the initial state, and is the transition matrix, whose elements measure the probability that a random walk occurs between the corresponding positions and have values in . It is important to note that Eq. 3 no longer contains a dense matrix inversion and can be proved to equal Eq. 2 when , as shown in the following proof.
Proof.
Eq. 3 can be unfolded to . When and since , , thereby . By and set , we could get and finally ∎
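The convergence claim can also be checked numerically. The sketch below assumes the usual random-walk-with-restart form of Eq. 3, $y^{t+1} = \alpha P y^{t} + (1-\alpha)s$ with $y^{0} = s$, and a row-stochastic $P = D^{-1}W$; the symbol names, the toy affinity matrix and the value of `alpha` are illustrative choices, not values from the paper. Under these assumptions the iterates approach $(1-\alpha)(I-\alpha P)^{-1}s$, which is the kind of matrix-inversion expression the cascade is meant to avoid.

```python
import numpy as np

def random_walk_with_restart(P, s, alpha, steps):
    """Iterate y <- alpha * P @ y + (1 - alpha) * s, starting from y = s."""
    y = s.copy()
    for _ in range(steps):
        y = alpha * (P @ y) + (1.0 - alpha) * s
    return y

rng = np.random.default_rng(0)
n_nodes, n_classes = 6, 3

# Toy symmetric, non-negative affinity matrix W and its row-normalised form P = D^{-1} W.
W = rng.random((n_nodes, n_nodes))
W = 0.5 * (W + W.T)
P = W / W.sum(axis=1, keepdims=True)

s = rng.random((n_nodes, n_classes))  # seed values, one row per node
alpha = 0.8                           # assumed walk weight; must be < 1 for convergence

y_iter = random_walk_with_restart(P, s, alpha, steps=200)
# Closed form of the limit, obtained by solving a linear system instead of inverting.
y_closed = (1.0 - alpha) * np.linalg.solve(np.eye(n_nodes) - alpha * P, s)

max_gap = float(np.abs(y_iter - y_closed).max())
print(max_gap)
```

The gap shrinks geometrically with the number of steps (at rate `alpha`), so a short cascade can only approximate the limit; this is one way to read the motivation for learning the per-step parameters rather than fixing them.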
4 Implementation
In this section, we describe how we implement a cascade of random walks by DifNet to approximate the diffusion process.
The key part of a random walk is computing the transition matrix , which is equivalent to applying a softmax along each row of , . To measure similarities, we compute the inner product of intermediate features from the similarity branch. In our implementation, we first reduce the channels of the feature by one layer , so that . In this way, the networks involve only matrix multiplication. Because encodes the similarity between any pair of nodes and each of our random walks is a non-local operation, long-range dependencies can be captured.
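A minimal NumPy sketch of this step follows. It assumes the features arrive as a [C, h, w] array and stands in for the channel-reducing layer with a plain matrix `reduce_weight` (a hypothetical name; in the network this would be a learned layer). The affinity is taken to be the inner product between reduced node features, and the row-wise softmax turns it into a row-stochastic transition matrix.

```python
import numpy as np

def transition_matrix(features, reduce_weight):
    """Row-stochastic transition matrix from a [C, h, w] feature map.

    reduce_weight is a hypothetical [C_reduced, C] matrix standing in for the
    learned channel-reduction layer.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)          # [C, N], one column per node
    reduced = reduce_weight @ flat             # [C_reduced, N]
    affinity = reduced.T @ reduced             # [N, N] inner products between nodes
    # Softmax along each row; subtracting the row maximum keeps exp() stable.
    affinity = affinity - affinity.max(axis=1, keepdims=True)
    expd = np.exp(affinity)
    return expd / expd.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.standard_normal((16, 4, 5))     # toy feature map, N = 4 * 5 = 20 nodes
reduce_weight = rng.standard_normal((8, 16)) * 0.1
P = transition_matrix(features, reduce_weight)
print(P.shape, P.sum(axis=1)[:3])
```

Note that the matrix has N x N entries for N = h x w nodes, so in practice the spatial resolution at which it is computed matters for memory.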
To let DifNet approximate the diffusion process by learning, we should take full advantage of the learning capacity of the model. Instead of using a predefined, fixed parameter , our model learns this parameter and determines the degree of each random walk adaptively. Furthermore, for each random walk, we compute the transition matrix from a different level of intermediate features, which means that information is propagated gradually according to different levels of semantic similarity. In this manner, the diffusion process is approximated not merely by cascaded random walks, but by cascaded random walks on hierarchical semantics. We show what these transition matrices look like in Sec. 5.3. Our random walks can therefore be defined as:
(4) 
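The cascade can be sketched as follows, assuming each step keeps the restart form $y^{t+1} = \alpha_t P_t y^{t} + (1-\alpha_t)s$ with its own transition matrix $P_t$ and its own weight $\alpha_t$. The notation, the random transition matrices and the `alphas` values are placeholders for illustration; in the network the matrices come from the similarity branch and the weights are learned.

```python
import numpy as np

def cascaded_random_walks(seed, transitions, alphas):
    """Apply one random-walk step per transition matrix, starting from the seed.

    Assumed update: y <- alpha_t * P_t @ y + (1 - alpha_t) * seed.
    Returns the output of every step so intermediate results can be inspected.
    """
    y = seed
    outputs = []
    for P_t, alpha_t in zip(transitions, alphas):
        y = alpha_t * (P_t @ y) + (1.0 - alpha_t) * seed
        outputs.append(y)
    return outputs

def row_softmax(a):
    """Softmax along each row, used here only to build toy transition matrices."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n_nodes, n_classes, n_walks = 12, 4, 5
seed = rng.random((n_nodes, n_classes))
transitions = [row_softmax(rng.standard_normal((n_nodes, n_nodes))) for _ in range(n_walks)]
alphas = [0.2, 0.4, 0.5, 0.7, 0.9]   # placeholder values; learned in the network

outputs = cascaded_random_walks(seed, transitions, alphas)
print(len(outputs), outputs[-1].shape)
```

Because every step is a convex combination of row-stochastic averages and the seed, the outputs stay within the range of the seed values.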
As mentioned before, our seed is expressed as the product of the importance map and the initial dense predictions, , and this operation is denoted as in Fig. 1. The initial dense predictions are the direct output of the seed branch and have size . In our implementation, is the confidence with value in ; therefore, our DifNet diffuses confidence. In this way, the influence of a node on others in a certain channel is defined as . To further adjust the influence of nodes, we introduce several layers on top of to predict an importance map with size of , such that , where is layers with as the last activation. We then transform into a diagonal matrix with size of . From experiments, we observe that the importance map adjusts the influence of nodes based on the confidence of their neighborhood. Fig. 2 shows examples of the influence of and the importance map . Clearly, reduces the influence of overemphasized nodes and outliers. Please see Sec. 5.3 for details.
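As a small illustration of the seed construction, the sketch below assumes the initial predictions `x` are per-node class confidences of shape [N, K] and that the importance map holds one value per node in (0, 1). A sigmoid of random numbers is used purely as a stand-in for the learned importance layers. It also shows that multiplying by the diagonal matrix is the same as scaling each row, so the N x N matrix never has to be built.

```python
import numpy as np

rng = np.random.default_rng(2)
n_nodes, n_classes = 6, 3

# Per-node class confidences (softmax over classes is an assumption for this sketch).
logits = rng.standard_normal((n_nodes, n_classes))
x = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Hypothetical per-node importance in (0, 1); in the network it is predicted from x.
importance = 1.0 / (1.0 + np.exp(-rng.standard_normal(n_nodes)))

M = np.diag(importance)                # [N, N] diagonal importance matrix
seed_dense = M @ x                     # seed as a matrix product
seed_fast = importance[:, None] * x    # same result via row-wise scaling
print(np.allclose(seed_dense, seed_fast))
```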
As mentioned in Sec. 2, the optimal seed and the optimal diffusion matrix are mutually determined. Accordingly, we let the model learn the seed and the affinity on its own instead of providing supervision on the affinity as in (15). However, the lack of supervision for the cascaded random walks may also cause a problem. By the definition of Eq. 4, if cannot gain useful similarity information from certain intermediate features, will be set small by the model during training, so that ; in this case, all the previous results are discarded. To preserve useful information from preceding random walks, we further employ an adaptive identity mapping term. By reformulating Eq. 4 as , the final operation can be defined as:
(5) 
where is another parameter to be learned, which controls the degree of identity mapping. In our experiments, DifNet occasionally assigns a very small value to a certain , but also sets small at the same time. The effect is then equivalent to omitting that random walk , so the information from preceding random walks is preserved and passed on to the following random walks.
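The behaviour described here can be reproduced with a small sketch. It assumes the final operation blends the random-walk result with the previous output, $y^{t+1} = \beta_t\,(\alpha_t P_t y^{t} + (1-\alpha_t)s) + (1-\beta_t)\,y^{t}$; this form and the names `alpha` and `beta` are assumptions chosen to match the description, not a transcription of Eq. 5.

```python
import numpy as np

def walk_step(y_prev, seed, P, alpha, beta):
    """One random-walk step with an adaptive identity term (assumed form).

    walked = alpha * P @ y_prev + (1 - alpha) * seed
    output = beta * walked + (1 - beta) * y_prev
    """
    walked = alpha * (P @ y_prev) + (1.0 - alpha) * seed
    return beta * walked + (1.0 - beta) * y_prev

rng = np.random.default_rng(3)
n_nodes, n_classes = 8, 3
seed = rng.random((n_nodes, n_classes))
y_prev = rng.random((n_nodes, n_classes))        # output of the preceding walks
P = np.full((n_nodes, n_nodes), 1.0 / n_nodes)   # uninformative, uniform transitions

# Small alpha without the identity term: the output collapses back to the seed.
no_identity = walk_step(y_prev, seed, P, alpha=0.0, beta=1.0)
# Small alpha together with small beta: the step is effectively skipped.
skipped = walk_step(y_prev, seed, P, alpha=0.0, beta=0.0)
print(np.allclose(no_identity, seed), np.allclose(skipped, y_prev))
```

With the identity term, an uninformative transition matrix no longer forces the cascade to forget what earlier walks have propagated.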
Fig. 1 shows the whole framework of our DifNet: the upper branch is the seed branch and the lower branch is the similarity branch. We use to represent the random walk operation (Fig. 1). For each , the inputs are: (1) features from a certain intermediate layer of the similarity branch ; (2) the seed vector from the seed branch; (3) the output of the previous random walk. Given these inputs, computes , determines and , and finally gives the output according to Eq. 5.
5 Experiments
5.1 Experimental Settings
Our DifNet can be built on any FCN-like model. In this paper, we choose DeeplabV2 (3) as our backbone. The original DeeplabV2 reported promising performance by introducing atrous convolution, the ASPP module, multi-scale inputs with max fusion, CRF post-processing and MS-COCO pretraining. However, to better study the diffusion property of our DifNet, we design our backbone with atrous convolution only (we also employ the ASPP module for the seed branch).
We study the performance and mechanism of our DifNet on the widely used Augmented Pascal VOC 2012 dataset (23; 24) and the Pascal Context dataset (25). The Augmented Pascal VOC 2012 dataset has 10,582 training, 1,449 validation, and 1,456 testing images with pixel-level labels in 20 foreground object classes and one background class, while Pascal Context has 4,998 training and 5,105 validation images with pixel-level labels in 59 classes and one background category. Performance is measured in terms of pixel intersection-over-union (IOU) averaged across all classes. To train our model and the baseline models, we use a mini-batch of 16 images for 200 epochs and set the learning rate, learning policy, momentum and weight decay the same as in (3). We also augment the training dataset by flipping, scaling and finally cropping to due to limited computing resources.
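For reference, class-averaged IOU can be computed from a confusion matrix as in the sketch below. The function name `mean_iou` and the `ignore_index=255` convention for unlabeled pixels are assumptions of this example; benchmark tooling typically accumulates the confusion matrix over the whole dataset before averaging, whereas this toy call uses a single small label map.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes that occur in pred or target."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0                 # skip classes absent from both maps
    return float((intersection[present] / union[present]).mean())

target = np.array([[0, 0, 1], [1, 2, 255]])   # 255 marks an ignored pixel
pred = np.array([[0, 1, 1], [1, 2, 2]])
score = mean_iou(pred, target, num_classes=3)
print(score)
```

In this toy case the per-class IOUs are 1/2, 2/3 and 1, so the mean is 13/18.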
5.2 Performance Study
Pascal VOC
For quantitative comparison, we use a simplified DeeplabV2, which has only atrous convolution and the ASPP module, as our baseline. Our DifNet and the simplified DeeplabV2 are both based on the ResNet (18) architecture and trained from scratch. In our model, we conduct five random walks on transition matrices that are computed from the features of the last layer of each ResNet block as well as from the input. Although we have two branches, the depth of our model equals that of the deepest branch, because data flows through the two branches in parallel rather than in sequence during inference. To be fairer, in Table 1 we compare not only models of the same depth but also models with an equivalent number of parameters. For example, DifNet50 has the same depth as SimDeeplab50 and an equivalent number of parameters to SimDeeplab101. In experiments, our models achieve consistent improvement over the SimDeeplab models on the Pascal VOC validation dataset, and the performance is also verified on the testing dataset. (Because of limited computing resources, we could not test DifNet101.) To further verify the effectiveness of our diffuse module, we also conduct experiments without any ASPP module (DifNet50noASPP), which also demonstrates superior performance.
Pascal Context
In Table 2, we further compare DifNet50 with the original DeeplabV2 (3) under different combinations of components, and with other methods, on the Pascal Context dataset. Our DifNet50 models achieve promising performance with our default setting, and the performance remains comparable when only the diffuse module is used.
Table 1: mIOU on Pascal VOC 2012.

Model            mIOU (Val)   mIOU (Test)
DifNet18         70.17%       70.46%
SimDeeplab18     66.33%       -
SimDeeplab34     69.76%       -
DifNet34         71.84%       71.62%
DifNet50noASPP   72.52%       -
DifNet50         72.57%       72.55%
SimDeeplab50     70.78%       -
SimDeeplab101    71.83%       -

Table 2: mIOU on the Pascal Context validation set.

Method                    mIOU (Val)
FCN-8s (1)                39.1%
CRF-RNN (12)              39.3%
ParseNet (6)              40.4%
ConvPP-8s (16)            41.0%
UoA-Context+CRF (8)       43.3%

Backbone    Method                      Components (MSC, COCO, ASPP, CRF, Diffuse)   mIOU (Val)
ResNet101   Deeplab (3)                 ✓                                            41.4%
ResNet101   Deeplab (3)                 ✓ ✓                                          42.9%
ResNet101   Deeplab (3) (SimDeeplab)    ✓                                            43.6%
ResNet101   Deeplab (3)                 ✓ ✓ ✓                                        44.7%
ResNet101   Deeplab (3)                 ✓ ✓ ✓ ✓                                      45.7%
ResNet50    DifNet (our model)          ✓                                            44.7%
ResNet50    DifNet (our model)          ✓ ✓                                          45.1%
5.3 Mechanism Study
In this section, we focus on the mechanism and effects of the components in our model. We use DifNet50 trained on the Pascal VOC dataset to carry out the following experiments.
Seed Branch
We compute the seed as , where is the confidence and is the importance map learned from the neighborhood of . The influence of each node in the diffusion process is . To visualize the influence of , we define the influence map as . We show our , and in Fig. 2. Evidently, contains many outliers and suffers from poor boundary localization and spatially fragmented predictions. However, as the influence map shows, most of these outliers have little influence on the diffusion process. The importance map further reduces or increases the influence of certain regions, such as columns where the keyboard is suppressed and columns where the sofa is enhanced, to refine the diffusion process.
Similarity Branch
Our cascaded random walks are conducted on a sequence of transition matrices , each of which measures similarities at a different level of semantics. To visualize these hierarchical semantics, in Fig. 3 we show, for each selected node, the corresponding row of each by color coding. This row represents the probabilities that other nodes random walk to , and . As shown in Fig. 3, from to , similarities are measured from low-level features such as color and texture to high-level features such as objects. In particular, can measure similarities of objects that have no labels, such as nodes where the table mat and the painting are highlighted. These results agree with our assumption that the similarity branch estimates the probability that any two nodes belong to the same class without knowing which class that is. Finally, the examples of nodes demonstrate the ability of our model to capture long-range dependencies.
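Extracting such a row for display is straightforward. The sketch below assumes the nodes are the pixels of an h x w grid flattened in row-major order, and uses a random row-stochastic matrix as a stand-in for a learned transition matrix; `similarity_map` is a hypothetical helper name.

```python
import numpy as np

def similarity_map(P, node_yx, height, width):
    """Return the row of P for one node, reshaped to the image grid."""
    y, x = node_yx
    row = P[y * width + x]            # row-major flattening of the h x w grid is assumed
    return row.reshape(height, width)

height, width = 4, 5
rng = np.random.default_rng(4)
scores = rng.standard_normal((height * width, height * width))
scores -= scores.max(axis=1, keepdims=True)
P = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # toy transition matrix

heat = similarity_map(P, node_yx=(2, 3), height=height, width=width)
print(heat.shape, float(heat.sum()))
```

The resulting array can be passed to any image plotting routine with a color map to obtain a visualization of the kind described above.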
Diffusion
We show the outputs of the random walks in Fig. 4, where represents the output after the th random walk. The outputs are clearly refined gradually after each random walk. We also report the learned and in Table 4. The increase of these parameters means that, as data flows through our model, the outputs depend more on information transited from other nodes than on the initial seed and previous results. To validate the effectiveness of the cascaded transition matrices built on all ResNet blocks, we delete the th and th random walks in DifNet50 and find that the performance drops by 1 percent on the Pascal VOC validation dataset.
5.4 Efficiency Study
In Table 4, we report the time consumption for inference with inputs of size on one GTX 1080 GPU. For inference, the total time consumption of our DifNet50 is equivalent to that of SimDeeplab101. However, during inference the data can flow through the two branches of our model in parallel, so our model can be further accelerated, to two times faster, by model parallelism. The diffusion process involves only matrix multiplication (five random walks) and can be implemented efficiently with little extra time.
In contrast, backpropagation in our model requires considerably more computation than in the vanilla model. Since the outputs of the two branches jointly determine the final results through a large number of information propagation paths, the parameters of the two branches strongly influence each other during optimization. The time consumption of backpropagation in our model is about 1.3 times that of SimDeeplab101. However, in view of the benefits of model parallelism during inference, we consider the extra time spent on training acceptable.
6 Conclusion
We present DifNet for the semantic segmentation task. Our model applies cascaded random walks to approximate a complex diffusion process. With these cascaded random walks, more details can be recovered according to the hierarchical semantic similarities, while long-range dependencies are captured at the same time. Our model achieves promising performance compared with various baseline models, and the comprehensive mechanism studies also demonstrate the usefulness and effectiveness of our components.
References
 (1) Long, J., E. Shelhamer, T. Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
 (2) Chen, L.-C., G. Papandreou, I. Kokkinos, et al. Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations (ICLR). 2015.
 (3) —. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 (4) Chen, L.-C., G. Papandreou, F. Schroff, et al. Rethinking atrous convolution for semantic image segmentation. In arXiv preprint arXiv:1706.05587. 2017.
 (5) Zhao, H., J. Shi, X. Qi, et al. Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
 (6) Liu, W., A. Rabinovich, A. Berg. Parsenet: Looking wider to see better. In arXiv preprint arXiv:1506.04579. 2015.
 (7) Liu, S., L. Qi, H. Qin, et al. Path aggregation network for instance segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
 (8) Lin, G., C. Shen, A. van den Hengel, et al. Efficient piecewise training of deep structured models for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
 (9) Vemulapalli, R., O. Tuzel, M.-Y. Liu, et al. Gaussian conditional random field network for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
 (10) Liu, Z., X. Li, P. Luo, et al. Semantic image segmentation via deep parsing network. In The IEEE International Conference on Computer Vision (ICCV). 2015.
 (11) Jampani, V., M. Kiefel, P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
 (12) Zheng, S., S. Jayasumana, B. Romera-Paredes, et al. Conditional random fields as recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV). 2015.
 (13) Chandra, S., I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. In European Conference on Computer Vision (ECCV). 2016.
 (14) Chandra, S., N. Usunier, I. Kokkinos. Dense and low-rank gaussian crfs using deep embeddings. In The IEEE International Conference on Computer Vision (ICCV). 2017.
 (15) Bertasius, G., L. Torresani, S. X. Yu, et al. Convolutional random walk networks for semantic image segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
 (16) Xie, S., X. Huang, Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. In European Conference on Computer Vision (ECCV). 2016.
 (17) Wang, X., R. Girshick, A. Gupta, et al. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
 (18) He, K., X. Zhang, S. Ren, et al. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
 (19) Bertasius, G., J. Shi, L. Torresani. Semantic segmentation with boundary neural fields. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
 (20) Harley, A. W., K. G. Derpanis, I. Kokkinos. Segmentation-aware convolutional networks using local attention masks. In The IEEE International Conference on Computer Vision (ICCV). 2017.
 (21) Liu, S., S. De Mello, J. Gu, et al. Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems (NIPS). 2017.
 (22) Jiang, P., N. Vasconcelos, J. Peng. Generic promotion of diffusion-based salient object detection. In The IEEE International Conference on Computer Vision (ICCV). 2015.
 (23) Everingham, M., S. M. A. Eslami, L. Van Gool, et al. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 2015.
 (24) Hariharan, B., P. Arbelaez, L. Bourdev, et al. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV). 2011.
 (25) Mottaghi, R., X. Chen, X. Liu, et al. The role of context for object detection and semantic segmentation in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2014.