A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D images
Semantic segmentation is the pixel-wise labelling of an image. Since the problem is defined at the pixel level, determining image-level class labels alone is not sufficient; the labels must also be localised at the original image pixel resolution. Boosted by the extraordinary ability of convolutional neural networks (CNN) to create semantic, high-level and hierarchical image features, a large number of deep learning-based 2D semantic segmentation approaches have been proposed within the last decade. In this survey, we mainly focus on the recent scientific developments in semantic segmentation, specifically on deep learning-based methods using 2D images. We start with an analysis of the public image sets and leaderboards for 2D semantic segmentation, with an overview of the techniques employed in performance evaluation. In examining the evolution of the field, we chronologically categorise the approaches into three main periods, namely the pre- and early deep learning era, the fully convolutional era, and the post-FCN era. We technically analyse the solutions put forward in terms of solving the fundamental problems of the field, such as fine-grained localisation and scale invariance. Before drawing our conclusions, we present a table of methods from all mentioned eras, with a brief summary of each approach that explains its contribution to the field. We conclude the survey by discussing the current challenges of the field and to what extent they have been solved.
Semantic segmentation has recently become one of the fundamental problems, and accordingly a hot topic, in the fields of computer vision and machine learning. Assigning a separate class label to each pixel of an image is one of the important steps in building complex robotic systems such as driverless cars/drones, human-friendly robots, robot-assisted surgery, and intelligent military systems. Thus, it is no wonder that in addition to scientific institutions, industry-leading companies studying artificial intelligence are now actively confronting this problem.
The simplest problem definition for semantic segmentation is pixel-wise labelling. Because the problem is defined at the pixel level, finding only the class labels that the scene includes is considered insufficient; localising those labels at the original image pixel resolution is also a fundamental goal. Depending on the context, class labels may change. For example, in a driverless car, the pixel labels may be human, road and car [SiamEJY17] whereas for a medical system [SahaTIP2018, Jiangmedical2017], they could be cancer cells, muscle tissue, aorta wall etc.
The recent increase in interest in this topic has been undeniably caused by the extraordinary success seen with convolutional neural networks [lecun1989generalization] (CNN) that have been brought to semantic segmentation. Understanding a scene at the semantic level has long been one of the main topics of computer vision, but it is only now that we have seen actual solutions to the problem.
In this paper, our primary motivation is to focus on the recent scientific developments in semantic segmentation, specifically on deep learning-based methods using 2D images. The reason we narrowed down our survey to techniques that utilise only 2D visible imagery is that, in our opinion, the scale of the problem in the literature is so vast and widespread that it would be impractical to analyse and categorise all semantic segmentation modalities (such as 3D point clouds, hyper-spectral data, MRI etc.) found in journal articles to any degree of detail. In addition to analysing the techniques which make semantic segmentation possible and accurate, we also examine the most popular image sets created for this problem. Additionally, we review the performance measures used for evaluating the success of semantic segmentation. Most importantly, we propose a taxonomy of methods, which we believe is novel in the sense that it provides insight into the existing deficiencies and suggests future directions for the field.
The remainder of the paper is organised as follows: in the following subsection we refer to other survey studies on the subject and underline our contribution. Section 2 presents information about the different image sets, the challenges, and how to measure the performance of semantic segmentation. Starting with Section 3, we chronologically scrutinise semantic segmentation methods under three main titles, hence in three separate sections. Section 3 covers the methods of the pre- and early deep convolutional neural networks era. Section 4 provides details on the fully convolutional neural networks, which we consider to be a milestone for the semantic segmentation literature. Section 5 covers the state-of-the-art methods on the problem and provides details on both the architectures and the success of these methods. Finally, Section 6 provides a conclusion to the paper.
1.1 Surveys on Semantic Segmentation
Very recently, driven by both academia and industry, the rapid increase of interest in semantic segmentation has inevitably led to a number of survey studies being published [Ahmad_2017, Jiangmedical2017, SiamEJY17, thoma2016, saffar2018semantic, YU201882, Guo2018, Garcia2017].
Some of these surveys focus on a specific problem such as the comparison of semantic segmentation approaches for horizon/skyline detection [Ahmad_2017], whilst others deal with relatively broader problems related to industrial challenges, such as semantic segmentation for driverless cars [SiamEJY17] or medical systems [Jiangmedical2017]. These studies are useful if working on the same specific problem, but they lack an overarching vision that may ‘technically’ contribute to the future directions of the field.
Another group [thoma2016, saffar2018semantic, YU201882, Guo2018] of survey studies on semantic segmentation have provided a general overview of the subject, but they lack the necessary depth of analysis regarding deep learning-based methods. Whilst semantic segmentation was studied for two decades prior to deep learning, actual contribution to the field has only been achieved very recently, particularly following a revolutionary paper on fully convolutional networks (FCN) [Shelhamer2017] (which has also been thoroughly analysed in this paper). It could be said that most state-of-the-art studies are in fact extensions of that same [Shelhamer2017] study. For this reason, without scrupulous analysis of FCNs and the direction of the subsequent papers, survey studies will lack the necessary academic rigour in examining semantic segmentation using deep learning.
On the other hand, most state-of-the-art studies [Chen2017, Lin_2018_ECCV] on semantic segmentation provide solid analysis of the literature within a separate section. Since these studies are principally about the proposal of a new method, the analysis is usually brief and somewhat biased in defending the paper’s own contribution or position. Therefore, such papers do not adequately match or satisfy the depth offered by a survey, which logically covers all the related techniques through an unbiased examination and outlook.
A recent review of deep semantic segmentation by Garcia-Garcia et al. [Garcia2017] provided a comprehensive survey on the subject. They covered almost all the popular semantic segmentation image sets and methods, and for all modalities such as 2D, RGB, 2.5D, RGB-D, and 3D data. Although the study is inclusive in the sense that most related material on deep semantic segmentation has been included, the categorisation of the methods is coarse, since the survey attempts to cover almost everything umbrellaed under the topic of semantic segmentation literature. Recent deep learning studies on semantic segmentation follow a number of fundamental directions and labour with tackling the varied corresponding issues. In this survey paper, we define and describe these new challenges, and present a novel, consistent categorisation of all the studies within this proposed context. This way, we believe that readers will better understand the current state-of-the-art, as well as the future directions seen for 2D semantic segmentation.
2 Image Sets, Challenges and Performance Evaluation
2.1 Image Sets and Challenges
The level of success for any machine-learning application is undoubtedly determined by the quality and the depth of the data being used for training. When it comes to deep learning, data is even more important since most systems are termed end-to-end; thus even the features are determined by the data, not for the data. Therefore, data is no longer the object, but becomes the actual subject in the case of deep learning.
In this section, we scrutinise the most popular large-scale 2D image sets that have been utilised for the semantic segmentation problem. The image sets are categorised into two main branches, namely general purpose image sets, with generic class labels including almost every type of object or background, and urban street image sets, which include class labels such as car and person, and are generally created for the training of driverless car systems. There are many other unresolved 2D semantic segmentation problem domains such as medical imaging, satellite imagery, or infrared imagery. However, urban street imagery is currently driving scientific development in the field because it attracts more attention from industry; as a result, very large-scale image sets and challenges with crowded leaderboards exist, albeit specifically for industrial users. Scientific interest for depth-based semantic segmentation is growing rapidly; however, as mentioned in the Introduction, we have excluded depth-based and 3D-based segmentation datasets from the current study in order to focus with sufficient detail on the novel categorisation of recent techniques pertinent to 2D semantic segmentation.
General Purpose Semantic Segmentation Image Sets
PASCAL Visual Object Classes (VOC) [Everingham2010]: This image set includes image annotations not only for semantic segmentation, but also for classification, detection, action classification, and person layout tasks. The image set and annotations are regularly updated and the leaderboard of the challenge is public (with more than 100 submissions for the segmentation challenge alone). It is the most popular among the semantic segmentation challenges, and is still active following its initial release in 2005. The PASCAL VOC semantic segmentation challenge image set includes 20 foreground object classes and one background class. The original data consisted of 1,464 images for the purposes of training, plus 1,449 images for validation. The 1,456 test images are kept private for the challenge. The image set includes all types of indoor and outdoor images, and is generic across all categories.
The PASCAL VOC image set has a number of extension image sets, the most popular among these being PASCAL Context [mottaghi_cvpr14] and PASCAL Parts [chen_cvpr14]. The first [mottaghi_cvpr14] is a set of additional annotations for PASCAL VOC 2010, which goes beyond the original PASCAL semantic segmentation task by providing annotations for the whole scene. These annotations cover more than 400 labels (compared to the original 21 labels). The second [chen_cvpr14] is also a set of additional annotations for PASCAL VOC 2010. It provides segmentation masks for each body part of the object, such as the separately labelled limbs and body of an animal. For these extensions, the training and validation set contains 10,103 images, while the test set contains 9,637 images. There are other extensions to PASCAL VOC using other functional annotations such as the Semantic Parts (PASParts) [Wangiccv2015] image set and the Semantic Boundaries Dataset (SBD) [Hariharan2011]. For example, PASParts [Wangiccv2015] additionally provides ‘instance’ labels, such that two instances of an object within an image are labelled separately, rather than using a single class label. However, unlike the former two extensions [chen_cvpr14, mottaghi_cvpr14], these further extensions [Wangiccv2015, Hariharan2011] have proven less popular, as their challenges have attracted much less attention in state-of-the-art semantic segmentation studies; thus their leaderboards are less crowded. In Figure 1, a sample object, parts and instance segmentation is depicted.
Common Objects in Context (COCO) [lin2014microsoft]: With 200K labelled images, 1.5 million object instances, and 80 object categories, COCO is a very large-scale object detection, semantic segmentation, and captioning image set, including almost every possible type of scene. COCO provides challenges not only for instance-level and pixel-level (which they refer to as stuff) semantic segmentation, but also introduces a novel task, namely that of panoptic segmentation [Kirillov18], which aims at unifying instance-level and pixel-level segmentation tasks. Their leaderboards are relatively less crowded because of the scale of the data. On the other hand, for the same reason, their challenges are attempted only by the most ambitious scientific and industrial groups, and thus the entries on their leaderboards are considered the state-of-the-art.
Other General Purpose Semantic Segmentation Image Sets: Although less popular than either PASCAL VOC or COCO, there are also some other image sets in the same domain. Introduced in [Prest2012], YouTube-Objects is a set of low-resolution (480×360) video clips with more than 10k pixel-wise annotated frames. Similarly, SIFT-flow [SIFTFlow] is another low-resolution (256×256) semantic segmentation image set with 33 class labels for a total of 2,688 images. These and other relatively primitive image sets have been mostly abandoned in the semantic segmentation literature due to their limited resolution and low volume.
Urban Street Semantic Segmentation Image Sets
Cityscapes [cordts2016cityscapes]: This is a large-scale image set with a focus on the semantic understanding of urban street scenes. It contains annotations for high-resolution images from 50 different cities, taken at different hours of the day and from all seasons of the year, and also with varying background and scene layout. The annotations are carried out at two quality levels: fine for 5,000 images and coarse for 20,000 images. There are 30 different class labels, some of which also have instance annotations (vehicles, people, riders etc.). Consequently, there are two challenges with separate public leaderboards: one for pixel-level semantic segmentation, and a second for instance-level semantic segmentation. There are more than 100 entries to the challenge, making it the most popular regarding semantic segmentation of urban street scenes.
Other Urban Street Semantic Segmentation Image Sets: There are a number of alternative image sets for urban street semantic segmentation, such as CamVid [Brostow2009SemanticOC], KITTI [Geiger2013], and SYNTHIA [RosCVPR16]. These are generally overshadowed by the Cityscapes image set [cordts2016cityscapes] for several reasons. Principally, their scale is relatively low. Only the SYNTHIA image set [RosCVPR16] can be considered as largescale (with more than 13k annotated images); however, it is an artificially generated image set, and this is considered a major limitation for security-critical systems like driverless cars.
2.2 Performance Evaluation
There are two main criteria in evaluating the performance of semantic segmentation: accuracy, or in other words, the success of an algorithm; and computational complexity in terms of speed and memory requirements. In this section, we analyse these two criteria separately.
Measuring the performance of segmentation can be complicated, mainly because there are two distinct values to measure. The first is classification, which is simply determining the pixel-wise class labels; and the second is localisation, or finding the correct set of pixels that enclose the object. Different metrics can be found in the literature to measure one or both of these values. The following is a brief explanation of the principal measures most commonly used in evaluating semantic segmentation performance.
ROC-AUC: ROC stands for the Receiver Operating Characteristic curve, which summarises the trade-off between true positive rate and false positive rate for a predictive model using different probability thresholds; whereas AUC stands for the area under this curve, which is 1 at maximum. This tool is useful in the interpretation of binary classification problems, and is appropriate when observations are balanced between classes. However, since most semantic segmentation image sets [Everingham2010, mottaghi_cvpr14, chen_cvpr14, Wangiccv2015, Hariharan2011, lin2014microsoft, cordts2016cityscapes] are not balanced between the classes, this metric is no longer used by the most popular challenges.
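As a minimal sketch of how this area can be obtained for a single binary (one class versus the rest) problem, the following pure-Python example sweeps every observed score as a threshold, traces the (false positive rate, true positive rate) points, and integrates them with the trapezoid rule. The function name `roc_auc` and the toy scores are hypothetical and chosen only for illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Assumes at least one positive and one negative sample are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    # Trapezoid rule over consecutive (FPR, TPR) points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical per-pixel scores for one class and their ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 1]
auc = roc_auc(scores, labels)
```

For a classifier that ranks every positive pixel above every negative one, the same function returns the maximum value of 1.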
Pixel Accuracy: Also known as global accuracy [BadrinarayananK15], pixel accuracy (PA) is a very simple metric which calculates the ratio between the number of properly classified pixels and their total number. Mean pixel accuracy (mPA) is a version of this metric which computes the ratio of correct pixels on a per-class basis. mPA is also referred to as class average accuracy [BadrinarayananK15].
$$\mathrm{PA} = \frac{\sum_{j=1}^{n_c} n_{jj}}{\sum_{j=1}^{n_c} t_j}, \qquad \mathrm{mPA} = \frac{1}{n_c}\sum_{j=1}^{n_c}\frac{n_{jj}}{t_j} \tag{1}$$
where $n_{jj}$ is the total number of pixels both classified and labelled as class j. In other words, $n_{jj}$ corresponds to the total number of True Positives for class j. $t_j$ is the total number of pixels labelled as class j, and $n_c$ is the number of classes.
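The two measures can be sketched from a confusion matrix as follows. This is an illustrative example only: the matrix layout (rows are ground-truth labels, columns are predictions), the function names and the toy counts are assumptions, as is the choice to skip classes that never occur in the ground truth when averaging.

```python
def pixel_accuracy(conf):
    """PA: correctly classified pixels divided by all labelled pixels."""
    correct = sum(conf[j][j] for j in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total


def mean_pixel_accuracy(conf):
    """mPA: per-class ratio of correct pixels, averaged over the classes."""
    ratios = []
    for j, row in enumerate(conf):
        labelled_as_j = sum(row)  # all pixels whose ground-truth label is class j
        if labelled_as_j > 0:     # assumption: absent classes are skipped
            ratios.append(conf[j][j] / labelled_as_j)
    return sum(ratios) / len(ratios)


# Hypothetical 3-class confusion matrix: conf[i][j] counts the pixels that are
# labelled as class i and classified as class j.
conf = [
    [50, 2, 3],
    [4, 30, 1],
    [6, 0, 4],
]
pa = pixel_accuracy(conf)        # 84 correct pixels out of 100
mpa = mean_pixel_accuracy(conf)  # pulled down by the rare, poorly segmented class 2
```

In this toy case mPA is lower than PA, because the small third class is segmented poorly but contributes little to the global pixel count.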
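Intersection over Union (IoU): Also known as the Jaccard Index, IoU is a statistic used for comparing the similarity and diversity of sample sets. In semantic segmentation, it is the ratio of the intersection of the pixel-wise classification results with the ground truth, to their union. For a given class j, it is computed as in (2):
$$\mathrm{IoU}_j = \frac{n_{jj}}{n_{jj} + \sum_{i \neq j} n_{ij} + \sum_{i \neq j} n_{ji}} \tag{2}$$
where $n_{jj}$ is the number of True Positives for class j, as in (1), and $n_{ij}$ is the number of pixels which are labelled as class i, but classified as class j. In other words, they are False Positives (false alarms) for class j. Similarly, $n_{ji}$, the total number of pixels labelled as class j, but classified as class i, are the False Negatives (misses) for class j.
Two extended versions of IoU are also widely in use:
Mean Intersection over Union (mIoU): mIoU is the class-averaged IoU over the $n_c$ classes, as in (3).
$$\mathrm{mIoU} = \frac{1}{n_c}\sum_{j=1}^{n_c}\frac{n_{jj}}{n_{jj} + \sum_{i \neq j} n_{ij} + \sum_{i \neq j} n_{ji}} \tag{3}$$
Frequency-weighted Intersection over Union (FwIoU): This is an improved version of mIoU that weighs the importance of each class depending on its appearance frequency by using $t_j$ (the total number of pixels labelled as class j, as also defined in (1)). The formula of FwIoU is given in (4):
$$\mathrm{FwIoU} = \frac{1}{\sum_{j=1}^{n_c} t_j}\sum_{j=1}^{n_c} t_j\,\frac{n_{jj}}{n_{jj} + \sum_{i \neq j} n_{ij} + \sum_{i \neq j} n_{ji}} \tag{4}$$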
IoU and its extensions compute the ratio of true positives (hits) to the sum of false positives (false alarms), false negatives (misses) and true positives (hits). Thereby, the IoU measure is more informative when compared to pixel accuracy simply because it takes false alarms into consideration, whereas PA does not. However, since false alarms and misses are summed up in the denominator, the balance between them is not captured by this metric, which is considered its primary drawback. In addition, IoU only measures the number of pixels correctly labelled without considering how accurate the segmentation boundaries are.
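The following sketch computes the per-class IoU, mIoU and FwIoU from a confusion matrix. The layout (rows are ground-truth labels, columns are predictions), the function names and the toy counts are assumptions made for illustration; classes with an empty union are skipped when averaging, which is one possible convention among several.

```python
def per_class_iou(conf):
    """IoU for every class; None where a class has no hits, false alarms or misses."""
    n = len(conf)
    ious = []
    for j in range(n):
        hits = conf[j][j]
        misses = sum(conf[j]) - hits                           # labelled j, classified otherwise
        false_alarms = sum(conf[i][j] for i in range(n)) - hits  # classified j, labelled otherwise
        union = hits + false_alarms + misses
        ious.append(hits / union if union > 0 else None)
    return ious


def mean_iou(conf):
    """mIoU: class-averaged IoU."""
    valid = [v for v in per_class_iou(conf) if v is not None]
    return sum(valid) / len(valid)


def frequency_weighted_iou(conf):
    """FwIoU: per-class IoU weighted by the number of pixels labelled as that class."""
    total = sum(sum(row) for row in conf)
    ious = per_class_iou(conf)
    weighted = sum(sum(conf[j]) * ious[j] for j in range(len(conf)) if ious[j] is not None)
    return weighted / total


# Hypothetical 3-class confusion matrix: conf[i][j] counts the pixels that are
# labelled as class i and classified as class j.
conf = [
    [50, 2, 3],
    [4, 30, 1],
    [6, 0, 4],
]
ious = per_class_iou(conf)
miou = mean_iou(conf)
fwiou = frequency_weighted_iou(conf)
```

Here FwIoU exceeds mIoU, since the poorly segmented third class holds only a tenth of the pixels and is therefore down-weighted.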
Precision-Recall Curve (PRC)-based metrics: Precision (ratio of hits over summation of hits and false alarms) and recall (ratio of hits over summation of hits and misses) are the two axes of the PRC used to depict the trade-off between precision and recall, under a varying threshold for the task of binary classification. PRC is very similar to ROC. However, PRC is more powerful in discriminating the effects between the false positives (alarms) and false negatives (misses). That is predominantly why PRC-based metrics are commonly used for evaluating the performance of semantic segmentation. The formulas for Precision (also called positive predictive value) and Recall (also called Sensitivity) for a given class j are provided in (5), where $n_{jj}$, $n_{ij}$ and $n_{ji}$ denote the hits, false alarms and misses for class j, respectively:
$$\mathrm{Precision}_j = \frac{n_{jj}}{n_{jj} + \sum_{i \neq j} n_{ij}}, \qquad \mathrm{Recall}_j = \frac{n_{jj}}{n_{jj} + \sum_{i \neq j} n_{ji}} \tag{5}$$
There are three main PRC-based metrics:
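F-score: Also known as the ‘Dice coefficient’, this measure is the harmonic mean of the precision and recall for a given threshold. It is a normalised measure of similarity, and ranges between 0 and 1 (please see (6)).
$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$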
PRC-AUC: This is similar to the ROC-AUC metric. It is simply the area under the PRC. This metric refers to information about the precision-recall trade-off for different thresholds, but not the shape of the PR curve.
Average Precision (AP): This metric is a single value which summarises both the shape and the AUC of PRC. In order to calculate AP, using the PRC, for uniformly sampled recall values (e.g., 0.0, 0.1, 0.2, …, 1.0), precision values are recorded. The average of these precision values is referred to as the average precision. This is the most commonly used single value metric for semantic segmentation. Similarly, mean average precision (mAP) is the mean of the AP values, calculated on a per-class basis.
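The PRC-based measures above can be sketched for a single class as follows. The example is illustrative only: the function names and toy scores are hypothetical, and the AP routine assumes the interpolated convention in which the precision recorded at each sampled recall level is the highest precision reached at that recall or beyond, which is a choice made here rather than one stated in the text.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when pixels scoring at or above the threshold are positive."""
    hits = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    misses = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = hits / (hits + false_alarms) if hits + false_alarms > 0 else 1.0
    recall = hits / (hits + misses) if hits + misses > 0 else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision_11pt(scores, labels):
    """AP over the recall levels 0.0, 0.1, ..., 1.0 (interpolated precision assumed)."""
    curve = [precision_recall(scores, labels, t) for t in sorted(set(scores))]
    total = 0.0
    for k in range(11):
        level = k / 10
        reachable = [p for p, r in curve if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / 11


# Hypothetical per-pixel scores for one class and their ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 1]
precision, recall = precision_recall(scores, labels, threshold=0.6)
f = f_score(precision, recall)
ap = average_precision_11pt(scores, labels)
```

Averaging the resulting AP values over all classes would give mAP.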
IoU and its variants, along with AP, are the most commonly used accuracy evaluation metrics in the most popular semantic segmentation challenges [Everingham2010, mottaghi_cvpr14, chen_cvpr14, Wangiccv2015, Hariharan2011, lin2014microsoft, cordts2016cityscapes].
The burden of computation is evaluated using two main metrics: how fast the algorithm completes, and how much computational memory is demanded.
Execution time: This is measured as the whole processing time, starting from the instant a single image is introduced to the system/algorithm right through until the pixel-wise semantic segmentation results are obtained. The performance of this metric significantly depends on the hardware utilised. Thus, for an algorithm, any execution time metric should be accompanied by a thorough description of the hardware used. There are notations such as Big-O, which provide a complexity measure independent of the implementation domain. However, these notations are highly theoretical and are predominantly not preferred for extremely complex algorithms such as deep semantic segmentation as they are simple and largely inaccurate.
For a deep learning-based algorithm, the offline (i.e., training) and online (i.e., testing) operation may last for considerably different time intervals. Technically, the execution time refers only to the online operation or, academically speaking, the test duration for a single image. Although this metric is extremely important for industrial applications, academic studies refrain from publishing exact execution times, and none of the aforementioned challenges were found to have provided this metric. In a recent study, [zhao2018icnet] provided a 2D histogram of Accuracy (mIoU%) vs. frames-per-second, in which some of the state-of-the-art methods with open-source code (including their proposed structure, namely image cascade network – ICNet), were benchmarked using the Cityscapes [cordts2016cityscapes] image set.
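A minimal sketch of measuring the online execution time for a single image is given below. The `dummy_segment` function is a hypothetical stand-in for a real network, and the warm-up and repetition counts are arbitrary assumptions; for GPU-based frameworks, which often execute asynchronously, a device synchronisation step would additionally be needed before reading the clock.

```python
import statistics
import time


def measure_execution_time(segment, image, warmup=2, runs=10):
    """Return (median seconds per image, frames per second) for a segmentation callable."""
    for _ in range(warmup):  # warm-up runs are excluded from the measurement
        segment(image)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        segment(image)
        timings.append(time.perf_counter() - start)
    seconds = statistics.median(timings)
    fps = 1.0 / seconds if seconds > 0 else float("inf")
    return seconds, fps


def dummy_segment(image):
    """Hypothetical placeholder 'model': labels each pixel by thresholding its value."""
    return [[1 if pixel > 127 else 0 for pixel in row] for row in image]


# Synthetic 64x64 single-channel image used only to exercise the timer.
image = [[(x * y) % 256 for x in range(64)] for y in range(64)]
seconds, fps = measure_execution_time(dummy_segment, image)
```

Reporting the median rather than the mean is a design choice that makes the figure less sensitive to occasional slow runs; as noted above, the value is only meaningful alongside a description of the hardware.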
Memory Usage: Memory usage is specifically important when semantic segmentation is utilised in limited performance devices such as smartphones, digital cameras, or when the requirements of the system are extremely restrictive. The prime examples of these would be military systems or security-critical systems such as self-driving cars.
The usage of memory for a complex algorithm like semantic segmentation may change drastically during operation. That is why a common metric for this purpose is peak memory usage, which is simply the maximum memory required for the entire segmentation operation for a single image. The metric may apply to computer (data) memory or the GPU memory depending on the hardware design.
Although critical for industrial applications, this metric is not usually made available for any of the aforementioned challenges.
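As an illustration of the peak memory metric, the sketch below uses Python's standard `tracemalloc` module to record the maximum memory allocated while one image is processed. The `dummy_segment` function is a hypothetical placeholder, and the measurement covers only allocations made through the Python interpreter; GPU memory would have to be queried through the relevant framework instead.

```python
import tracemalloc


def peak_memory_bytes(segment, image):
    """Peak Python-heap memory (in bytes) allocated while segmenting a single image."""
    tracemalloc.start()
    try:
        segment(image)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def dummy_segment(image):
    """Hypothetical placeholder 'model': labels each pixel by thresholding its value."""
    return [[1 if pixel > 127 else 0 for pixel in row] for row in image]


# Synthetic 64x64 single-channel image used only to exercise the measurement.
image = [[(x * y) % 256 for x in range(64)] for y in range(64)]
peak = peak_memory_bytes(dummy_segment, image)
```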
3 Before Fully Convolutional Networks
As mentioned in the Introduction, the utilisation of FCNs is a turning point for the semantic segmentation literature. Efforts in the semantic segmentation literature prior to FCNs [Shelhamer2017] can be analysed in two separate branches, as pre-deep learning and early deep learning approaches. In this section, we briefly discuss both sets of approaches.
3.1 Pre-Deep Learning Approaches
The differentiating factor between conventional image segmentation and semantic segmentation is the utilisation of semantic features in the process. Conventional methods for image segmentation such as thresholding, clustering, and region growing, etc. (please see [ZAITOUN2015] for a survey on conventional image segmentation techniques) utilise handcrafted low-level features (i.e., edges, blobs) to locate object boundaries in images. Thus, in situations where the semantic information of an image is necessary for pixel-wise segmentation, such as in similar objects occluding each other, these methods usually return a poor performance.
Regarding semantic segmentation efforts prior to deep CNNs becoming popular, a wide variety of approaches [HeNIPS2008, UlusoyCVPR05, LadickICCV2009, Bjorn2013, Montillo2011, Ravi2016, Vezhnevets2011, Shotton2008, Yao2012, XiaoICCV209, Micuslik2009, PylonModel2011, krahenbuhl2011] utilised graphical models, such as Markov Random Fields (MRF), Conditional Random Fields (CRF) or forest-based (or sometimes referred to as ‘holistic’) methods, in order to find scene labels at the pixel level. The main idea was to find an inference by observing the dependencies between neighbouring pixels. In other words, these methods modelled semantics of the image as a kind of ‘a priori’ information among adjacent pixels. Thanks to deep learning, today we know that image semantics require abstract exploitation of largescale data. Initially, graph-based approaches were thought to have this potential. The so-called ‘super-pixelisation’, which is usually the term applied in these studies, was a process of modelling abstract regions. However, a practical and feasible implementation for largescale data processing was never achieved for these methods, while it was accomplished for deep CNNs, first by [AlexNet2012] and then in many other studies.
Another group of studies, sometimes referred to as the ‘Layered models’ [yang2012, arbelaez2012semantic, LadickyECCV2010], used a composition of pretrained and separate object detectors so as to extract the semantic information from the image. Because the individual object detectors failed to classify regions properly, or because the methods were limited by the finite number of object classes provided by the ‘hand-selected’ bank of detectors in general, their performance was seen as relatively low compared to today’s state-of-the-art methods.
Although the aforementioned methods of the pre-deep learning era are no longer preferred as segmentation methods, some of the graphical models, especially CRFs, are currently being utilised by the state-of-the-art methods as post-processing (refinement) layers, with the purpose of improving the semantic segmentation performance, the details of which are discussed in following section.
Deep neural networks are powerful in extracting abstract local features. However, they lack the capability to utilise global context information, and accordingly cannot model interactions between adjacent pixel predictions [Marvin2018]. On the other hand, the popular segmentation methods of the pre-deep learning era, the graphical models, are highly suited to this sort of task. That is why they are currently being used as a refinement layer on many deep CNN-based semantic segmentation architectures.
As also mentioned in the previous section, the idea behind using graphical models for segmentation is finding an inference by observing the low-level relations between neighbouring pixels. In Figure 2, the effect of using a graphical model-based refinement on segmentation results can be seen. The classifier (see Figure 2.b) cannot correctly segment pixels where different class labels are adjacent. In this example, a CRF-based refinement [krahenbuhl2011] is applied to improve the pixel-wise segmentation results. CRF-based methods are widely used for the refinement of deep semantic segmentation methods, although some alternative graphical model-based refinement methods also exist in the literature [Liu2015Semantic, pmlr-v78-zuo17a].
CRFs [lafferty2001] are a type of discriminative undirected probabilistic graphical model. They are used to encode known relationships between observations and to construct consistent interpretations. Their usage as a refinement layer comes from the fact that, unlike a discrete classifier, which does not consider the similarity of adjacent pixels, a CRF can utilise this information. The main advantage of CRFs over other graphical models (such as Hidden Markov Models) is their conditional nature and their ability to avoid the problem of label bias [lafferty2001]. Even though a considerable number of methods (see Table 1) utilise CRFs for refinement, these models started to lose popularity in relatively recent approaches because they are notoriously slow and very difficult to optimise [Marvin2018].
3.2 Early Deep Learning Approaches
Before FCNs first appeared in 2014
However, the first mature approaches were just simple attempts to convert classification networks such as AlexNet and VGG to segmentation networks by fine-tuning the fully connected layers [Ning2005, Ganin2014, Ciresan2012]. They suffered from the overfitting and time-consuming nature of their fully connected layers in the training phase. Moreover, the CNNs used were not sufficiently deep so as to create abstract features, which would relate to the semantics of the image.
There were a few early deep learning studies in which the researchers declined to use fully connected layers for their decisioning, but utilised different structures such as a recurrent architecture [Pinheiro2014] or using labelling from a family of separately computed segmentations [Farabet2013]. By proposing alternative solutions to fully connected layers, these early studies showed the first traces of the necessity for a structure like the FCN, and unsurprisingly they were succeeded by [Shelhamer2017].
Since their segmentation results were deemed to be unsatisfactory, these studies generally utilised a refinement process, either as a post-processing layer [Ning2005, Ganin2014, Ciresan2012, Hariharan2014] or as an alternative architecture to fully connected decision layers [Farabet2013, Pinheiro2014]. The refinement methods varied, and included Markov random fields [Ning2005], a nearest neighbour-based approach [Ganin2014], the use of a calibration layer [Ciresan2012], super-pixels [Farabet2013, Hariharan2014], and a recurrent network of plain CNNs [Pinheiro2014]. Refinement layers, as discussed in the previous section, are still being utilised by post-FCN methods, with the purpose of increasing the pixel-wise labelling performance around regions where class intersections occur.
4 Fully Convolutional Networks for Semantic Segmentation
In [Shelhamer2017], the idea of dismantling fully connected layers from deep CNNs (DCNN) was proposed and, to reflect this idea, the proposed architecture was named ‘Fully Convolutional Networks’ (see Figure 3). The main objective was to create semantic segmentation networks by adapting classification networks such as AlexNet [Krizhevsky2012], VGG [Simonyan15], and GoogLeNet [Szegedy2015] into fully convolutional networks, and then transferring their learnt representations by fine-tuning. The most widely used architectures obtained from the study [Shelhamer2017] are known as ‘FCN-32s’, ‘FCN-16s’, and ‘FCN-8s’, which are all transfer-learnt using the VGG architecture [Simonyan15].
FCN architecture was considered revolutionary in many aspects. First of all, since FCNs did not include fully connected layers, inference per image was seen to be considerably faster. This was mainly because convolutional layers, when compared to fully connected layers, had a marginal number of weights. Second, and maybe more significant, the structure allowed segmentation maps to be generated for images of any resolution. In order to achieve this, FCNs used deconvolutional layers that can upsample coarse deep convolutional layer outputs to dense pixels of any desired resolution. Finally, and most importantly, they proposed the skip architecture for DCNNs.
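The resolution independence described above can be illustrated with a small sketch in which the classifier is a 1×1 convolution, that is, the same linear scoring applied at every spatial position, followed by upsampling of the coarse score map. This is not the implementation of [Shelhamer2017]: the function names and toy weights are hypothetical, and nearest-neighbour upsampling is used here as a simple stand-in for the learnt deconvolutional layers.

```python
def conv1x1(features, weights, biases):
    """Apply the same linear classifier at every position of an H x W x C feature map.

    weights has one row of C values per class; the result is an H x W x K score map.
    """
    return [[[sum(w * f for w, f in zip(class_weights, pixel)) + bias
              for class_weights, bias in zip(weights, biases)]
             for pixel in row]
            for row in features]


def upsample_nearest(scores, factor):
    """Enlarge a score map by repeating every position factor x factor times."""
    out = []
    for row in scores:
        wide = [pixel for pixel in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out


def argmax_labels(scores):
    """Pick the highest-scoring class index at every position."""
    return [[max(range(len(pixel)), key=lambda k: pixel[k]) for pixel in row]
            for row in scores]


# Hypothetical 2-class classifier over 2-channel features (identity weights).
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]

# A coarse 2 x 3 feature map; any other height and width would work equally well.
features = [
    [[2.0, 1.0], [0.0, 3.0], [1.0, 0.5]],
    [[0.2, 0.9], [4.0, 1.0], [0.0, 0.1]],
]
coarse_scores = conv1x1(features, weights, biases)
labels = argmax_labels(upsample_nearest(coarse_scores, factor=2))  # 4 x 6 label map
```

Because no layer fixes the spatial size of its input, the same weights produce a score map for a feature map of any height and width.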
Skip architectures (or connections) provide links between nonadjacent layers in DCNNs. Simply by summing or concatenating outputs of unconnected layers, these connections enable information to flow, which would otherwise be lost because of an architectural choice such as max-pooling layers or dropouts. The most common practice is to use skip connections preceding a max-pooling layer, which downsamples layer output by choosing the maximum value in a specific region. Pooling layers help the architecture create feature hierarchies, but also cause loss of localised information which could be valuable for semantic segmentation, especially at object borders. Skip connections preserve and forward this information to deeper layers by way of bypassing the pooling layers. In fact, the usage of skip connections in [Shelhamer2017] was perceived as being rather primitive. The ‘FCN-8s’ and ‘FCN-16s’ networks included these skip connections at different layers. Denser skip connections for the same architecture, namely ‘FCN-4s’ and ‘FCN-2s’, were also utilised for various applications [Zhong2016, Lee2017]. This idea eventually evolved into the encoder-decoder structures [Ronneberger2015, BadrinarayananK15] for semantic segmentation, which are presented in the following section.
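To make the loss and recovery of localised information concrete, the following toy sketch pools a 4x4 activation map, upsamples the pooled result, and sums it with the pre-pooling map in the manner of a skip connection. Nearest-neighbour upsampling is used here as a simplifying assumption in place of a learnt deconvolutional layer, and the function names are hypothetical.

```python
def max_pool2x2(x):
    """2x2 max-pooling with stride 2 on a list-of-lists map (even sizes)."""
    h, w = len(x), len(x[0])
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of two."""
    return [[x[i // 2][j // 2] for j in range(2 * len(x[0]))]
            for i in range(2 * len(x))]


def skip_sum(fine, coarse):
    """Fuse a fine map with an upsampled coarse map by element-wise sum."""
    up = upsample2x(coarse)
    return [[f + u for f, u in zip(f_row, u_row)]
            for f_row, u_row in zip(fine, up)]


fine = [[0, 0, 0, 0],
        [0, 0, 1, 0],   # a single active pixel at row 1, column 2
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
coarse = max_pool2x2(fine)       # the exact position inside the 2x2 block is lost
fused = skip_sum(fine, coarse)   # the skip connection restores it
```

After pooling and upsampling alone, all four pixels of the upper-right block carry the same response. In the fused map, the original pixel has the strongest response (2 against 1 for its block neighbours), which is the border-level detail that skip connections are meant to preserve.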
5 Post-FCN Approaches
The past five years have seen a dramatic increase in global interest in the subject of semantic segmentation. Almost all subsequent approaches to semantic segmentation have followed the idea of FCNs; thus, it would not be wrong to state that fully connected layers effectively ceased to exist.
On the other hand, the idea of FCNs also created new opportunities to further improve deep semantic segmentation architectures. Generally speaking, the main drawbacks of FCNs can be summarised as the loss of label localisation within the feature hierarchy, the inability to process global context knowledge, and the lack of a mechanism for multiscale processing. Thus, most subsequent studies have been principally aimed at solving these issues through the proposal of various architectures or techniques. For the remainder of this paper, we analyse these issues under the title, ‘fine-grained localisation’. Consequently, before presenting a list of the post-FCN state-of-the-art methods, we focus on this categorisation of techniques and examine different approaches that aim at solving these main issues. In the following, we also discuss scale invariance in the semantic segmentation context, and finish with object detection-based approaches, which are a new breed of solution that aims at resolving the semantic segmentation problem simultaneously with detecting object instances.
5.1 Techniques for Fine-grained Localisation
Semantic segmentation is, by definition, a dense procedure, hence it requires fine-grained localisation of class labels at the pixel level. For example, in robotic surgery, pixel errors in semantic segmentation can lead to life-or-death situations. Hierarchical features created by pooling (i.e., max-pooling) layers can partially lose localisation. Moreover, due to their fully convolutional nature, FCNs do not inherently possess the ability to model global context information in an image, which is also very effective in the localisation of class labels. Thus, these two issues are intertwined in nature, and in the following we discuss different approaches that aim at overcoming these problems and at providing finer localisation of class labels.
The so-called Encoder-Decoder (ED) architectures (also known as the U-nets, referring to the pioneering study of [Ronneberger2015]) are comprised of two parts. The encoder gradually reduces the spatial dimension with pooling layers, whilst the decoder gradually recovers the object details and spatial dimension. Each feature map of the decoder part only directly receives the information from the feature map at the same level of the encoder part using skip connections, thus EDs can create abstract hierarchical features with fine localisation (see Figure 4.a). U-Net [Ronneberger2015] and SegNet [BadrinarayananK15] are very well-known examples. In this architecture, the strongly correlated semantic information, which is provided by the adjacent lower-resolution feature map of the encoder part, has to pass through additional intermediate layers in order to reach the same decoder layer. This usually results in a level of information decay. However, U-Net architectures have proven very useful for segmentation in different applications, such as satellite images [Ulku2019].
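The same-level pairing of encoder and decoder feature maps can be traced with a short shape calculation. The sketch below assumes a U-Net-like layout in which each encoder level halves the resolution and doubles the channel count, convolutions preserve resolution ('same' padding), and each decoder level concatenates the upsampled features with the encoder features of the same level. The function `ed_shapes` and its default values are hypothetical and chosen for illustration only.

```python
def ed_shapes(size, base_channels=64, depth=4):
    """Trace (channels, side length) through a U-Net-like encoder-decoder."""
    if size % (2 ** depth) != 0:
        raise ValueError("size must be divisible by 2**depth in this sketch")

    encoder = []
    channels, side = base_channels, size
    for _ in range(depth):
        encoder.append((channels, side))  # kept for the skip connection
        side //= 2                        # pooling halves the resolution
        channels *= 2                     # channel count doubles

    bottleneck = (channels, side)

    decoder = []
    for skip_channels, skip_side in reversed(encoder):
        side *= 2                         # upsampling doubles the resolution
        channels //= 2
        decoder.append({
            "side": side,                 # equals skip_side at this level
            "concat_channels": channels + skip_channels,
            "out_channels": channels,
        })
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = ed_shapes(256)
```

For a 256-pixel input, the bottleneck in this sketch is a 16-pixel map with 1024 channels, and the last decoder level concatenates 64 upsampled channels with the 64 channels of the first encoder level at full resolution.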
Spatial Pyramid Pooling
The idea of constructing a fixed-sized spatial pyramid was first proposed by [LazebnikSPP2006], in order to prevent a Bag-of-Words system losing spatial relations among features. Later, the approach was adapted to CNNs by [KaimingHeSPP], in that, regardless of the input size, a spatial pyramid representation of deep features could be created in a Spatial Pyramid Pooling Network (SPP-Net). The most important contribution of the SPP-Net was that it allowed inputs of different sizes to be fed into CNNs. Images of different sizes fed into convolutional layers inevitably create different-sized feature maps. However, if a pooling layer, just prior to a decision layer, has stride values proportional to the input size, the feature map created by that layer would be of fixed size (see Figure 4.b).
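The fixed-size property can be sketched for a single-channel feature map as follows. Each pyramid level divides the map into n x n bins whose extent is proportional to the map size, and takes the maximum in each bin. The pyramid levels (1, 2, 4) and the function names are assumptions made for illustration and are not taken from [KaimingHeSPP].

```python
def adaptive_max_pool(x, n):
    """Max-pool a 2D map into n x n bins whose size scales with the input."""
    h, w = len(x), len(x[0])
    pooled = []
    for i in range(n):
        r0 = (i * h) // n                  # floor of the bin start
        r1 = ((i + 1) * h + n - 1) // n    # ceiling of the bin end
        for j in range(n):
            c0 = (j * w) // n
            c1 = ((j + 1) * w + n - 1) // n
            pooled.append(max(x[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return pooled


def spp(x, levels=(1, 2, 4)):
    """Concatenate the pooled bins of all pyramid levels into one vector."""
    features = []
    for n in levels:
        features.extend(adaptive_max_pool(x, n))
    return features


small = [[r * 5 + c for c in range(5)] for r in range(7)]           # 7 x 5 map
large = [[(r * 9 + c) % 11 for c in range(9)] for r in range(13)]   # 13 x 9 map
print(len(spp(small)), len(spp(large)))
```

Both maps yield a vector of 1 + 4 + 16 = 21 values per channel, so a subsequent decision layer sees the same input length regardless of the image size.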
There is a common misconception that SPP-Net structure carries an inherent scale-invariance property, which is incorrect. SPP-Net allows the efficient training of images at different scales/resolutions by allowing different input sizes to the CNN. However, the trained CNN with SPP is scale-invariant if, and only if, the training set includes images with different scales/resolutions. This fact is also true for a CNN without SPP layers.
However, similar to the original idea proposed in [LazebnikSPP2006], the SPP layer in a CNN constructs relations among the features of different hierarchies. Thus, it is quite similar to skip connections in ED structures, which also allow information flow between feature hierarchies.
The most common utilisation of an SPP layer for semantic segmentation is that proposed in [KaimingHeSPP], such that the SPP layer is appended to the last convolutional layer and fed to the pixel-wise classifier.
Another family of techniques is based on fusing features extracted from different sources. For example, in [Pinheiro2015] the so-called ‘DeepMask’ network utilises skip connections in a feed-forward manner, so that an architecture partially similar to both the SPP layer and the ED is obtained. The same group extends this idea with a top-down refinement approach of the feed-forward module and proposes the so-called ‘SharpMask’ network, which has proven to be more efficient and accurate in segmentation performance. Another approach from this category is the so-called ‘ParseNet’ [liu2015parsenet], which fuses CNN features with external global features from previous layers in order to provide context knowledge. Although a novel idea in principle, feature fusion approaches (including SPP) create hybrid structures, therefore they are relatively difficult to train.
The idea of dilated (atrous) convolutions is actually quite simple: with contiguous convolutional filters, an effective receptive field of units can only grow linearly with layers; whereas with dilated convolution, which has gaps in the filter (see Figure 4.c), the effective receptive field would grow much more quickly [Chen18]. Thus, with no pooling or subsampling, a rectangular prism of convolutional layers is created. Dilated convolution is a very effective and powerful method for the detailed preservation of feature map resolutions. The negative aspect of the technique, compared to other techniques, concerns its higher demand for GPU storage and computation, since the feature map resolutions do not shrink within the feature hierarchy [He2016ResNet].
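The difference in growth rate can be checked with the standard receptive field recurrence for stride-1 convolutions, in which each layer adds (kernel - 1) x dilation pixels. The sketch below compares five contiguous 3x3 layers with five layers whose dilation doubles at each layer. The doubling schedule is an illustrative assumption, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (one axis) of stacked stride-1 convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel - 1) * dilation
    return field


contiguous = receptive_field([1, 1, 1, 1, 1])   # linear growth: 11
dilated = receptive_field([1, 2, 4, 8, 16])     # exponential growth: 63
print(contiguous, dilated)
```

Both stacks have the same number of weights and, without pooling, the same feature map resolution at every layer, which is the source of the higher memory demand noted above.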
Conditional Random Fields
As also discussed in Section 3.1.1, CNNs naturally lack mechanisms to specifically ‘focus’ on regions where class intersections occur. Around these regions, graphical models are used to perform inference by observing low-level relations between neighbouring feature maps of CNN layers. Consequently, graphical models, mainly CRFs, are utilised as refinement layers in deep semantic segmentation architectures. As in [rother2004], CRFs connect low-level interactions with output from multiclass interactions and in this way global context knowledge is constructed.
As a refinement layer, various methods exist that apply CRFs to deep CNNs, such as the Convolutional CRFs [Marvin2018], the Dense CRF [krahenbuhl2011], and CRF-as-RNN [ZhengICCV2015]. Although CRFs help build context knowledge and thus a finer level of localisation in class labels, Table 1 shows CRFs categorised under the ‘CRF Model’ tab, so as to differentiate them from actual CNN architectural extensions.
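As an illustration of how such a refinement layer relates neighbouring pixels, the sketch below evaluates the pairwise kernel commonly used in the Dense CRF [krahenbuhl2011] for a single pair of pixels: an appearance term that depends on both position and colour, and a smoothness term that depends on position only. The weights and bandwidths given as default arguments are placeholder values assumed for this example, and the Potts compatibility function is the simplest common choice rather than a requirement of the model.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dense_crf_kernel(pos_i, pos_j, col_i, col_j,
                     w_appearance=1.0, w_smoothness=1.0,
                     theta_alpha=60.0, theta_beta=10.0, theta_gamma=3.0):
    """Appearance kernel plus smoothness kernel for one pixel pair."""
    d_pos = squared_distance(pos_i, pos_j)
    d_col = squared_distance(col_i, col_j)
    appearance = w_appearance * math.exp(
        -d_pos / (2 * theta_alpha ** 2) - d_col / (2 * theta_beta ** 2))
    smoothness = w_smoothness * math.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness


def pairwise_penalty(label_i, label_j, kernel_value):
    """Potts model: only pixel pairs with different labels are penalised."""
    return kernel_value if label_i != label_j else 0.0


grey = (120, 120, 120)
near = dense_crf_kernel((0, 0), (1, 0), grey, grey)
far = dense_crf_kernel((0, 0), (50, 0), grey, grey)
```

Nearby pixels of similar colour receive a large kernel value, so assigning them different labels is costly. This is how the refinement step sharpens label boundaries along image edges, where colour changes abruptly and the penalty for a label change is small.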
The ability of Recurrent Neural Networks (RNNs) to handle temporal information can help improve segmentation accuracy. For example, [Pfeuffer2019] used ConvLSTM layers to improve their semantic segmentation results in image sequences. However, there are also methods that use recurrent structures on still images. In [Lin_2018_ECCV], the researchers utilised LSTM-chains in order to intertwine multiple scales, resulting in pixel-wise segmentation improvements. There are also hybrid approaches where CNNs and RNNs are fused. A good example of this is the so-called ReSeg model [ReSeg2016], in which the input image is fed to a VGG-like CNN encoder, and is then processed afterwards by recurrent layers (namely the ReNet architecture) in order to better localise the pixel labels. To the best of our knowledge, no purely recurrent structures for semantic segmentation exist, mainly because semantic segmentation requires a preliminary CNN-based feature encoding scheme.
There is currently an increasing trend in one specific type of RNN, namely ‘attention modules’. In these modules, attention [Vaswani2017] is technically fused in the RNN, providing a focus on certain regions of the input when predicting a certain part of the output sequence. Consequently, they are also being utilised in semantic segmentation [li2019Emanet, Zhao2018, Oktay2018].
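The core operation of such modules can be sketched as scaled dot-product attention [Vaswani2017] applied to a small set of pixel features, so that every output feature becomes a weighted mixture of all input features. For brevity, the sketch omits the learnt query, key and value projections that practical attention modules use; this omission and the function names are assumptions of the example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention without learnt projections."""
    dim = len(features[0])
    outputs = []
    for query in features:
        # Similarity of this pixel to every pixel, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        weights = softmax(scores)
        # Weighted average of all pixel features (global context).
        outputs.append([sum(w * value[c] for w, value in zip(weights, features))
                        for c in range(dim)])
    return outputs


pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = self_attention(pixels)
```

Because the weights of each pixel sum to one over all positions, every output depends on the entire input, which is how attention supplies the global context that purely local convolutions lack.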
5.2 Scale Invariance
Scale invariance is, by definition, the ability of a method to process input independent of the relative scale (i.e., the scale of an object to its scene) or image resolution. Although it is crucial for certain applications, this ability is usually overlooked or is confused with a method’s ability to include multiscale information. A method may use multiscale information to improve its pixel-wise segmentation ability, but can still be dependent on scale or resolution. That is why we find it necessary to discuss this issue under a different title, and to provide information on the techniques that provide scale and/or resolution invariance.
In computer vision, any method can become scale invariant if trained with multiple scales of the training set. Some semantic segmentation methods utilise this strategy, such as [Farabet2013, Eigen2014, Pinheiro2014, Lin2016Efficient, Yu15]. However, these methods do not possess an inherent scale-invariance property, which is usually obtained by normalisation with a global scale factor (such as in SIFT [Lowe2004]). This approach is not usually preferred in the literature on semantic segmentation. The image sets that exist in semantic segmentation literature are extremely large in size. Thus, the methods are trained to memorise that training set, because in principle, overfitting a large-scale training set is actually tantamount to solving the entire problem space.
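Training with multiple scales amounts to resizing each image together with its label map before it is fed to the network. The following is a minimal sketch of this augmentation, in which nearest-neighbour sampling is used for both arrays so that the resized label map contains only existing class indices. The function names and the choice of interpolation are assumptions of this example rather than the procedure of any of the cited methods.

```python
def resize_nearest(grid, scale):
    """Resize a 2D list-of-lists by nearest-neighbour sampling."""
    h, w = len(grid), len(grid[0])
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    return [[grid[min(h - 1, int(i / scale))][min(w - 1, int(j / scale))]
             for j in range(new_w)]
            for i in range(new_h)]


def rescale_pair(image, labels, scale):
    """Apply the same scale to an image and its pixel-wise label map."""
    return resize_nearest(image, scale), resize_nearest(labels, scale)


labels = [[1, 2],
          [3, 4]]
enlarged = resize_nearest(labels, 2.0)   # each label becomes a 2x2 block
restored = resize_nearest(enlarged, 0.5)
```

Sampling a different scale for each training sample exposes the network to objects at several relative sizes, which is the sense in which a method "becomes" scale invariant through training rather than by construction.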
5.3 Object Detection-based Methods
There has been a recent growing trend in computer vision which aims at specifically resolving the problem of object detection, that is, establishing a bounding box around all objects within an image. Given that the image may or may not contain any number of objects, the architectures utilised to tackle such a problem differ from the existing fully connected/convolutional classification or segmentation models.
The pioneering study that represents this idea is the renowned ‘Regions with CNN features’ (RCNN) network [RCNN2014]. Standard CNNs with fully convolutional and fully connected layers lack the ability to provide varying length output, which is a major flaw for an object detection algorithm that aims to detect an unknown number of objects within an image. The simplest way to resolve this problem is to take different regions of interest from the image, and then to employ a CNN in order to detect objects within each region separately. This region selection architecture is called the ‘Region Proposal Network’ (RPN) and is the fundamental structure used to construct the RCNN network (see Figure 5.a). Improved versions of RCNN, namely ‘Fast-RCNN’ [RCNN2014] and ‘Faster-RCNN’ [NIPS2015_FasterRCNN] were subsequently also proposed by the same research group. Because these networks allow for the separate detection of all objects within the image, the idea was easily implemented for instance segmentation, as the ‘Mask-RCNN’ [He2017MaskR].
The basic structure of RCNNs included the RPN, which is the combination of CNN layers and a fully connected structure in order to decide the object categories and bounding box positions. As discussed within the previous sections of this paper, due to their cumbersome structure fully connected layers were largely abandoned with FCNs. RCNNs shared a similar fate when the ‘You-Only-Look-Once’ (YOLO) [YOLO2016] and ‘Single Shot Detector’ (SSD) [SSD16] architectures were proposed. YOLO utilises a single convolutional network that predicts the bounding boxes and the class probabilities for these boxes. It consists of no fully connected layers, and consequently provides real-time performance. SSD proposed a similar idea, in which bounding boxes were predicted after multiple convolutional layers. Since each convolutional layer operates at a different scale, the architecture is able to detect objects of various scales. Whilst slower than YOLO, it is still considered to be faster than RCNNs. This new breed of object detection techniques was immediately applied to semantic segmentation. Similar to Mask-RCNN, the ‘Mask-YOLO’ [maskyolo2019] and ‘YOLACT’ [YOLACT2019] architectures were implementations of these object detectors for the problem of instance segmentation.
Finding objects within an image prior to segmenting them at the pixel-level is both intuitive and natural, as that is effectively how the human brain supposedly accomplishes this task. Consequently, employing object detection-based methods for semantic segmentation is an area significantly prone to further development in the near future.
|Method||Method Summary||Seg. Type||Refinement|
|Hier. Feat. [Farabet2013] (2013)||Multiscale convolutional network fused parallel with a segmentation framework (either superpixel or CRF-based).||Object||“Parallel” CRF [Farabet2013]|
|Recurr. CNN [Pinheiro2014] (2014)||Recurrent architecture constructed by using different instances of a CNN, in which each network instance is fed with previous label predictions (obtained from the previous instance).||Object||None|
|(2014)||Fully convolutional encoder structure (i.e., no fully connected layers) with skip connections that fuse multiscale activations at the final decision layer.||Object||None|
|(2014)||CNN with dilated convolutions, succeeded by a fully-connected (i.e. Dense) CRF.||Object||Dense CRF [krahenbuhl2011]|
|(2015)||Layers of a pyramidal input are fed to separate FCNs for different scales in parallel. These multiscale FCNs are also connected in series to provide pixel-wise category, depth and normal output, simultaneously.||Object||None|
|(2015)||Encoder/decoder structure with skip connections that connect same levels of ED and final input-sized classification layer.||Object||None|
|(2015)||Encoder/decoder structure (similar to U-Net) with skip connections that transmit only pooling indices (unlike U-Net, for which skip connections concatenate same-level activations).||Object||None|
|(2015)||Encoder/decoder structure (namely ‘the Conv./Deconv. Network’) without skip connections. The encoder (convolutional) part of the network is transferred from the VGG-VD-16L [Simonyan15].||Object||None|
|(2015)||Multiscale context aggregation using only a rectangular prism of dilated convolutional layers, without pooling or subsampling layers, to perform pixel-wise labelling.||Object||None|
|(2015)||Fully convolutional CNN (i.e., FCN) followed by a CRF-as-RNN layer, in which an iterative CRF algorithm is formulated as an RNN.||Object||CRF-as-RNN [ZhengICCV2015]|
|(2016)||Layers of a pyramidal input fed to parallel multiscale feature maps (i.e., CNNs), and later fused in an upsample/concatenation layer to provide the final feature map fed to a Dense CRF Layer.||Object||Dense CRF [krahenbuhl2011]|
|(2016)||Improved version of DeepLab.v1, with additional ‘dilated (atrous) spatial pyramid pooling’ (ASPP) layer.||Object||Dense CRF [krahenbuhl2011]|
|(2017)||CNN followed by a pyramid pooling layer similar to [KaimingHeSPP], but without a fully connected decision layer.||Object||None|
|(2017)||Improved version of DeepLab.v2, with optimisation of ASPP layer hyperparameters and without a Dense CRF layer, for faster operation.||Object||None|
|(2017)||One network predicts labelmaps/tags, while another performs semantic segmentation using these predictions. Both networks use ResNet101 [He2016ResNet] for preliminary feature extraction.||Object||None|
|(2017)||Object Detector Fast-RCNN followed by ROI-pooling and Convolutional layers, applied to instance segmentation (see Figure 5.a).||Instance||None|
|(2017)||Fed by an initial ResNet-based [He2016ResNet] encoder, GCN uses large kernels to fuse high- and low-level features in a multiscale manner, followed by a convolutional Border Refinement (BR) module.||Object||Conv. BR|
|(2018)||Consists of two sub-networks: Smooth Net (SN) and Border Net (BN). SN utilises an attention module and handles global context, whereas BN employs a refinement block to handle borders.||Object||Refin. Resid. Block (RRB) [Yu2018CVPR]|
|(2018)||Aggregates features from different scales via connections between Long Short-term Memory (LSTM) chains.||Object||None|
|(2018)||Improved version of DeepLab.v3, using special encoder-decoder structure with dilated convolutions (with no Dense CRF employed for faster operation).||Object||None|
|(2018)||Followed by a convolutional ‘Appearance Feature Encoder’, a ‘Contextual Feature Encoder’ consisting of LSTMs generates super-pixel features fed to a Softmax-based classification layer.||Object||None|
|(2018)||Fully connected structure to extract context is fed by dense feature maps (obtained from ResNet [He2016ResNet]) and followed by a convolutional prediction layer.||Object||None|
|(2018)||Using an attention module between two convolutional structures, pixels are interconnected through a self-adaptively learnt attention map to provide global context.||Object||None|
|(2018)||Improved version of GCN [Peng2017GCN] for feature fusing which introduces more semantic information into low-level features and more spatial details into high-level features.||Object||Conv. BR|
|(2019)||Novel attention module between two CNN structures converts input feature maps to output feature maps, thus providing global context.||Object||None|
|(2019)||Allows branches of different receptive fields to share the same kernel to facilitate communication among branches and perform feature augmentation inside the network.||Object||None|
|(2019)||Using a distribution of co-occurrent features for a given target in an image, a fine-grained spatial invariant representation is learnt and the CFNet is constructed.||Object||None|
|(2019)||Consists of multiple shallow deconvolutional networks, called SDN units, stacked one by one to integrate contextual information and guarantee fine recovery of localised information.||Object||None|
|(2019)||Object Detector YOLO followed by Class Probability and Convolutional layers, applied to instance segmentation (see Figure 5.b).||Instance||None|
5.4 Proposed Methods
In this section, we present some of the state-of-the-art methods used for semantic segmentation. In this survey paper, we avoid providing a performance-based comparison, as such benchmarking is deemed unnecessary, given that these methods have already presented their success rates in various challenges. On this issue, we would suggest that readers refer to the leaderboards mentioned in Section 2.
In Table 1, we present several semantic segmentation methods, each with a brief summary explaining the fundamental idea that represents the proposed solutions, the problem type they aim to resolve (such as object, instance or parts segmentation), and whether or not they include a refinement step. The intention is for readers to gain a better evolutionary understanding of the methods and architectures in this field, and a clearer conception of how the field may subsequently progress in the future. Regarding the brief summaries of the listed methods, please refer to the categorisations provided earlier in this section.
Table 1 includes 29 methods spanning a seven-year period, starting with early deep learning approaches through to the most recent state-of-the-art techniques. Most of the listed studies have been quite successful and have significantly high rankings in the previously mentioned leaderboards. Whilst there are many other methods, we believe this list to be a clear depiction of the advances in deep learning-based semantic segmentation approaches. Judging by the picture it portrays, the evolution of the literature reveals a number of important implications. First, graphical model-based refinement modules are being abandoned due to their slow nature. A good example of this trend would be the evolution of DeepLab from [Chen14] to [Liang2018a] (see Table 1). Notably, no significant study published in 2019 employed a CRF-based or similar module to refine its segmentation results. Second, studies published in the past two years show no significant leap in performance rates. For this reason, researchers have tended to focus on experimental solutions such as object detection-based or attention-based approaches. Considering the studies of the post-FCN era, the main problem of the field remains efficiently integrating global context with localisation information, which still does not appear to have an off-the-shelf solution.
In this survey, we aimed at reviewing the current developments in the literature regarding deep learning-based 2D image semantic segmentation. We commenced with an analysis of the public image sets and leaderboards for 2D semantic segmentation, and then continued by providing an overview of the techniques for performance evaluation. Following this introduction, our focus shifted to the 10-year evolution seen in this field under three chronological titles, namely the pre- and early deep learning era, the fully convolutional era, and the post-FCN era. After a technical analysis of the approaches of each period, we presented a table of methods spanning all three eras, with a brief summary of each technique that explicates its contribution to the field.
In our review, we paid particular attention to the key technical challenges of the 2D semantic segmentation problem, the deep learning-based solutions that were proposed, and how these solutions evolved as they shaped the advancements in the field. To this end, we observed that the fine-grained localisation of pixel labels is clearly the definitive challenge to the overall problem. Although the title may imply a more ‘local’ interest, the research published in this field evidently shows that it is the global context that determines the actual performance of a method. Thus, it is understandable that the literature is rich with approaches that attempt to bridge local information with a more global context, such as graphical models, context aggregating networks, recurrent approaches, and attention-based modules. It is also clear that efforts to close this local-global semantics gap at the pixel level will continue for the foreseeable future.
Another important revelation from this review has been the profound effect seen from public challenges to the field. Academic and industrial groups alike are in a constant struggle to top these public leaderboards, which has an obvious effect of accelerating development in this field. Therefore, it would be prudent to promote or even contribute to creating similar public image sets and challenges affiliated to more specific subjects of the semantic segmentation problem, such as 2D medical images.
Considering the rapid and continuing development seen in this field, there is an irrefutable need for an update on the surveys regarding the semantic segmentation problem. However, we believe that the current survey may be considered as a milestone in measuring how much the field has progressed thus far, and where the future directions possibly lie.
- FCN [Shelhamer2017] was officially published in 2017. However, the same group first shared the idea online as pre-printed literature in 2014 [Long2014Arxiv].
- Many methods utilise fully connected layers such as RCNN [Girshick_2015_ICCV], which are discussed in the following sections. However, this and other similar methods that include fully connected layers have mostly been succeeded by fully convolutional versions for the sake of computational efficiency.