On Symbiosis of Attribute Prediction and Semantic Segmentation
Attributes are semantically meaningful characteristics whose applicability widely crosses category boundaries. They are particularly important in describing and recognizing concepts for which no explicit training example is given, e.g., zero-shot learning. Additionally, since attributes are human describable, they can be used for efficient human-computer interaction. In this paper, we propose to employ semantic segmentation to improve person-related attribute prediction. The core idea lies in the fact that many attributes describe local properties. In other words, the probability of an attribute appearing in an image is far from uniform over the spatial domain. We build our attribute prediction model jointly with a deep semantic segmentation network. This harnesses the localization cues learned by the semantic segmentation to guide the attention of the attribute prediction to the regions where different attributes naturally show up. As a result of this approach, in addition to prediction, we are able to localize the attributes despite merely having access to image-level labels (weak supervision) during training. We first propose semantic segmentation-based pooling and gating, respectively denoted as SSP and SSG. In the former, the estimated segmentation masks are used to pool the final activations of the attribute prediction network from multiple semantically homogeneous regions. This is in contrast to global average pooling, which is agnostic to where in the spatial domain activations occur. In SSG, the same idea is applied to the intermediate layers of the network. Specifically, we create multiple copies of the internal activations. In each copy, only values that fall within a certain semantic region are preserved, while activations outside it are suppressed. This mechanism prevents the pooling operation from blending activations that are associated with semantically different regions.
SSP and SSG, while effective, impose heavy memory utilization since each channel of the activations is pooled/gated with all the semantic segmentation masks. To circumvent this, we propose Symbiotic Augmentation (SA), where we learn only one mask per activation channel. SA allows the model to either pick one or combine (as a weighted superposition) multiple semantic maps in order to generate the proper mask for each channel. SA simultaneously applies the same mechanism to the reverse problem, leveraging the output logits of attribute prediction to guide the semantic segmentation task. We evaluate our proposed methods for facial attributes on the CelebA and LFWA datasets, and for whole-body attributes on the WIDER Attribute and Berkeley Attributes of People benchmarks. Our proposed methods achieve superior results compared to previous works. Furthermore, we show that in the reverse problem, semantic face parsing significantly improves when it is jointly learned with facial attribute prediction through our proposed Symbiotic Augmentation (SA). We confirm that when few training instances are available, image-level facial attribute labels can indeed serve as an effective source of weak supervision to improve semantic face parsing. This reaffirms the need to jointly model these two interconnected tasks.
Nowadays, state-of-the-art computer vision techniques allow us to teach machines different classes of objects, actions, scenes, and even fine-grained categories. However, to learn a certain notion, we usually need positive and negative examples from the concept of interest. This creates a set of challenges as the instances of different concepts are not equally easy to collect. Also, the number of learnable concepts is linearly capped by the cardinality of the training data. Therefore, being able to robustly learn a set of sharable concepts that go beyond rigid category boundaries is of tremendous importance. Visual attributes are one particular type of these sharable concepts. They are human describable and machine detectable. We can use attributes to describe a variety of objects, scenes, actions, and events. For example, we associate a person who is lying on a beach with the attribute relaxed or a cat that is chasing after a wool ball with the attribute playing.
Attributes are different from category labels in three major aspects. First, category labels are agnostic to the intra-class variations that exist among different instances of a single category. Such a flat representation cannot distinguish between a grumpy cat and a joyful one, as it only sees them as cats. Second, attributes go across category boundaries. Hence, they can be used to potentially describe an exponential number of object categories (via different combinations) even if the associated category has never been observed before (e.g., zero-shot learning). Third, unlike category labels, which can be effectively inferred from the object itself, humans heavily rely on contextual cues for attribute prediction. Take the examples shown in Figure 1. If we only consider the bounding box around the dog, one would not assign the attribute catching to it. Instead, running may even be a valid attribute. However, leveraging the contextual layout, where the dog is above the ground and close to a frisbee, provides humans with sufficient indication to not only rule out the attribute running but also confidently label the dog with the attribute catching. Similarly, the table, food, and plate collectively serve as the context, building the ground for associating the attribute eating with the person.
Considering the aforementioned characteristics of attributes, we hypothesize that the attribute prediction task would benefit from contextual cues if they are properly represented. One can organize context supervision into three levels: image-level, instance-level, and pixel-level. Image-level supervision represents the context as a binary vector indicating whether an instance of a certain category appears somewhere in the context. Therefore, it is blind to the spatial relationships that exist between the underlying components, i.e., the object instances in the scene. In instance-level supervision, context is available as a set of category label and bounding box tuples. That is, unlike image-level supervision, instance-level context supervision can model the spatial relationships in the scene. Lastly, with pixel-level context supervision, we have access to the category labels in a per-pixel fashion. Obviously, this provides a much stronger supervision signal than the other two alternatives. In this work, we propose augmenting attribute prediction by weakly transferring pixel-level context supervision from an auxiliary semantic segmentation task.
So far, we have explained attributes in general, where they describe an instance of an object in a scene. However, the same holds when attributes characterize variations of a certain object category. In this paper, we are interested in person-related, specifically facial and full-body, attributes. We view the concept of contextual cues, previously detailed for attributes of objects in the scene, as the natural correspondence of object attributes to the object parts and their associated layout within the object boundary.
Naturally, attributes are “additive” to objects (e.g., glasses for person). That is, an instance of an object may or may not take a certain attribute, while in either case the category label is preserved (e.g., a person with or without glasses is still labeled as person). Hence, attributes are especially useful in problems that aim at modeling intra-category variations, such as fine-grained classification. Despite their additive character, attributes do not appear in arbitrary regions of objects (e.g., a hat, if it appears, is highly likely to show up on top of a person’s head). This notion is the basis of our work. We hypothesize that attribute prediction can benefit from localization cues. Specifically, to detect an attribute, instead of processing the entire spatial domain at once, one should focus on the region in which that attribute naturally shows up. However, not all attributes have precise correspondences. For example, it is ambiguous where in the face we, as humans, look to infer whether a person is young or attractive. Hence, instead of hard-coding the correspondences, even where they seem clear (e.g., glasses with nose and eyes), we allow the model to learn how to leverage the localization cues that are transferred from a relevant auxiliary task to the attribute prediction problem.
Using bounding boxes to delimit objects is a common practice in computer vision. However, the regions to which different attributes are associated vary drastically in appearance. For example, in a face image, one cannot effectively put a bounding box around the region associated with “hair”. In fact, the shape of the region can itself be an indicative signal for the attribute. On top of that, partial occlusion of object parts introduces further challenges by arbitrarily deforming the visible regions. Therefore, we need an auxiliary task that learns detailed pixel-wise localization information without restricting the corresponding regions to be of certain pre-defined shapes. Semantic segmentation has all the aforementioned characteristics. It is the problem of assigning class labels to every pixel in an image. As a result, a successful semantic segmentation approach has to learn pixel-level localization cues which implicitly encode color, structure, and geometric characteristics in fine detail. In this work, since we are interested in person-related attributes, we take face  and human body  semantic parsing problems as auxiliary tasks to steer the spatial focus of the attribute prediction methods accordingly.
To perform attribute prediction, we feed an image to a fully convolutional neural network which generates feature maps that are ready to be aggregated and passed to the classifier. However, global pooling  is agnostic to where, in the spatial domain, the attribute-discriminative activations occur. Hence, instead of propagating the attribute signal to the entire spatial domain, we funnel it into the semantic regions. By doing so, our model learns where to attend and how to aggregate the feature map activations. We refer to this approach as Semantic Segmentation-based Pooling (SSP), where the activations at the end of the attribute prediction pipeline are pooled within different semantic regions.
Alternatively, we can incorporate the semantic regions into earlier layers of the attribute prediction network with a gating mechanism. Specifically, we propose augmenting max pooling operations such that they do not mix activations that reside in different semantic regions. Our approach generates multiple versions of the activation maps that are masked differently and presumably discriminative for various attributes. We refer to this approach as Semantic Segmentation-based Gating (SSG).
Since semantic regions are not available for the attribute benchmarks, we learn to estimate them using a deep semantic segmentation network. In our earlier work , we took a conceptually similar approach to , in which an encoder-decoder model was built using convolution and deconvolution layers. However, considering the relatively small amount of available data for the auxiliary segmentation task, we had to modify the network architecture. Despite being much simpler than , we found our semantic segmentation network  to be very effective in solving the auxiliary task of semantic face parsing. Examples of the segmentation masks generated for previously unseen images are illustrated in Figure 2. Once trained, such a network is able to provide localization cues in the form of masks (decoder output) that decompose the spatial domain of an image into mutually exclusive semantic regions. We show that both SSP and SSG mechanisms outperform almost all the existing state-of-the-art facial attribute prediction techniques, while employing them together results in further improvements.
One issue with SSP and SSG is their memory utilization. Both architectures use the output of semantic segmentation to create one copy of the preceding convolution activations per semantic region. Given a limited GPU memory budget, this can restrict the application of these layers when the number of semantic regions grows large. We can circumvent this challenge by learning the proper mask per channel. In contrast to SSP and SSG, which mask each and every channel of the activations with all the semantic probability maps, in this paper we propose to learn one mask per channel as a weighted superposition of the different semantic probability maps (the output of the semantic segmentation network). Such a workaround, which can be simply implemented by a convolution, adds minimal memory overhead and also allows us to simplify SSP and SSG, yielding a single unified architecture that, based on where it is applied in the network, can mimic the behavior of either.
Following the recent trend in semantic segmentation, instead of an encoder-decoder as in , here we utilize a fully convolutional architecture, specifically Inception-V3. Hence, we can unify the attribute prediction and semantic segmentation networks by full weight sharing. As a result, unlike , we do not need to pre-train the semantic segmentation network before deploying it in the attribute prediction pipeline. Instead, both tasks are learned simultaneously in an end-to-end fashion within a single architecture. We go beyond facial attributes  and demonstrate the effectiveness of employing semantic segmentation for person-related attributes on multiple benchmarks. Finally, we provide a comprehensive quantitative evaluation for the case where attributes are jointly trained with semantic segmentation with the aim of boosting the latter task.
In summary, the contributions of this work are as follows:
We demonstrate the effectiveness of employing semantic segmentation to improve person-related attribute prediction.
We propose a simple alternative to Semantic Segmentation-based Pooling and Semantic Segmentation-based Gating with a focus on minimal memory utilization overhead.
We unify semantic segmentation and attribute prediction through multi-tasking a single network and training it in an end-to-end fashion.
We achieve state-of-the-art results in person-related attribute prediction on CelebA, LFWA, WIDER Attributes, and Berkeley Attributes of People datasets.
We provide comprehensive experiments, detailing how to improve semantic segmentation task by leveraging image-level attribute annotations.
The remainder of this paper is organized as follows. Section 2 offers a detailed review of the attribute prediction and semantic segmentation literature. In Section 3, we propose semantic segmentation-based pooling and gating, followed by a simple unifying view of them that benefits from a considerably lighter memory footprint. We end this section by providing details of our architectures. Experimental results are presented in Section 4. This includes the evaluation of facial and person attributes on four datasets, along with comprehensive experiments on the effectiveness of leveraging image-level facial attribute annotations to boost semantic face parsing. Finally, we conclude the paper in Section 5.
2 Related Work
2.1 Attribute Prediction
Early works in modeling attributes  emerged with the intention of changing the recognition paradigm from naming objects to describing them. Therefore, instead of directly learning the object categories, one begins with learning a set of attributes that are shared among different categories. Object recognition can then be built upon the attribute scores. Hence, novel categories are seamlessly integrated, via attributes, with previously observed ones. This can be used to ameliorate label misalignment between training and test data.
Considering the importance of the human category, research in person-related attribute prediction  has flourished over the years. To perform attribute prediction, some previous works have invested in modeling the correlation among attributes , while others have focused on leveraging category information . There have also been efforts to exploit the context .
Another way to view the attribute prediction literature is to divide it into holistic versus part-based methods. The common theme among holistic models is to take the entire spatial domain into account when extracting features from images. On the other hand, part-based methods begin with attribute-related part detection and then use the located parts, in isolation from the rest of the spatial domain, to extract features. It has been shown that part-based models generally outperform holistic methods. However, they are prone to localization error, as it can affect the quality of the extracted features. There are, however, works that have taken a hybrid approach, benefiting from both holistic and part-based cues . Our proposed methods fall in between the two ends of this spectrum. While we process the image in a holistic fashion, we employ localization cues in the form of pixel-level semantic representations.
Among earlier works, we refer to  as successful examples of part-based attribute prediction models. More recently, in an effort to combine part-based models with deep learning, Zhang et al.  proposed PANDA, a pose-normalized convolutional neural network (CNN) to infer human attributes from images. PANDA employs poselets  to localize body parts and then extracts CNN features from the located regions. These features are later used to train SVM classifiers for attribute prediction. Inspired by , while seeking to also leverage holistic cues, Gkioxari et al.  proposed a unified framework that benefits from both holistic and part-based models by utilizing a deep version of poselets  as part detectors. Liu et al.  have taken a relatively different approach. They show that pre-training on a massive number of object categories and then fine-tuning on image-level attributes is sufficiently effective in localizing the entire face region. Such a weakly supervised method provides them with a localized region in which they perform facial attribute prediction. In another part-based approach, Singh et al.  use spatial transformer networks  to localize the most relevant region associated with a given attribute. They encode this localization cue in a Siamese architecture to perform localization and ranking for relative attributes. Rudd et al.  have addressed the widely recognized data imbalance issue in attribute prediction by introducing the mixed objective optimization network (MOON). The proposed loss function mixes multiple task objectives with domain-adaptive re-weighting of the propagated loss.  and  are further examples of recent works that have similarly tried to address class imbalance in the multi-label problem of attribute prediction. Li et al.  have recently proposed lAndmark Free Face AttrIbute pRediction (AFFAIR), a hierarchy of spatial transformation networks that initially crops and aligns the face region from the entire (assumed to be in-the-wild) input image and then localizes relevant parts associated with different attributes. Separate neural network architectures then extract feature representations from the global and part-based regions, and their fusion is used to predict different facial attributes.
In our earlier work , we proposed employing semantic segmentation to capture local characteristics for facial attribute prediction. We utilized semantic masks, obtained from a separate pre-trained semantic segmentation network, to gate and pool the activations, respectively in the middle and at the end of the attribute prediction architecture. In this journal version of the paper, we extend and improve the framework proposed in  beyond the face to the human body, within the context of person-related attribute prediction. Our driving force in obtaining local cues is semantic parsing of the face and human body. Meanwhile, unlike , which uses two separate networks for the main and auxiliary tasks, here we employ a heavy weight-sharing strategy, unifying the semantic segmentation and attribute prediction architectures into one. Next, we discuss the semantic segmentation literature.
2.2 Semantic Segmentation
Semantic segmentation can be seen as a dense pixel-level multi-class classification problem, where the spatial (spatio-temporal) domain of images (videos) is partitioned along fine contours (volumes) into clusters of pixels (voxels) with homogeneous class labels. Prior to the widespread popularity of deep convolutional neural networks (CNNs), semantic segmentation used to be solved via traditional classifiers, such as the Support Vector Machine (SVM) or Random Forest, applied to super-pixels . A Conditional Random Field (CRF) was often used in these methods as a post-processing technique to smooth the segmentation results, based on the assumption that pixels which fall within a certain vicinity, with similar color intensity, tend to be associated with the same class labels.
Among the earlier efforts in using deep convolutional neural networks for semantic segmentation, we can refer to the work of Ciresan et al.  on automatic segmentation of neuronal structures in electron microscopy images. However, since the number of classes was limited to membrane and non-membrane, their problem in fact reduces to a foreground detection task. Later, upon the tremendous success of deep convolutional neural networks in image classification, researchers began designing semantic segmentation models on top of CNN models previously trained for other tasks, mainly image classification . These methods, by leveraging supervised pre-training on strongly correlated tasks (e.g., where labels in the two tasks often have some overlap), were able to facilitate the training procedure for semantic segmentation. However, such an adoption introduces its own challenges.
Unlike image classification, where the activations just before the classifier are flattened via a fully connected layer or global average pooling, the semantic segmentation task requires the spatial domain to be maintained; specifically, the output segmentation maps should be at least the same size as the input image. Fully Convolutional Networks popularized CNN architectures for semantic segmentation. Long et al.  proposed transforming fully connected layers into convolution layers, along with up-sampling intermediate and final activations whose spatial resolution has been reduced by pooling layers throughout the network. These techniques enable a classification model to output segmentation maps of arbitrary size when operating on input images of any size. Almost all subsequent state-of-the-art semantic segmentation methods adopted this paradigm. The performance of the semantic segmentation task is compromised if spatial information is not well preserved through the network. In contrast, architectures designed for image classification very often use pooling layers to aggregate context activations while discarding precise spatial coordinates. To alleviate this conceptual discrepancy, two different classes of architectures have evolved.
The first is the encoder-decoder approach , in which the encoder gradually reduces the spatial domain through successive convolution and pooling layers to generate a bottleneck representation. The decoder then recovers the spatial domain by applying multiple layers of deconvolution, or convolution followed by up-sampling, to the aforementioned bottleneck representation. There are usually shortcut connections from the encoder to the decoder, leveraging details at multiple scales, to help the decoder recover fine characteristics more accurately. U-Net, SegNet, and RefineNet are popular architectures from this class.
The second class of architectures developed around the idea of dilated, or Atrous, convolutions . Specifically, one can avoid using pooling layers in order to preserve detailed spatial information, but this dramatically increases the computation cost, as the following layers must operate on larger activation maps. However, using Atrous convolution  with a dilation rate equal to the stride of the avoided pooling layer results in exactly the same number of operations as a regular convolution operating on pooled activations. (It is worth pointing out that while the computation cost remains the same, employing dilated convolution demands more memory, since the size of the activation maps remains intact.) In other words, a dilated or Atrous convolution layer allows for an exponential increase in the effective receptive field without reducing the spatial resolution. In a series of works , Chen et al. demonstrated how Atrous convolution and its multi-scale variation, namely Atrous spatial pyramid pooling (ASPP), can be utilized within the framework of fully convolutional neural networks to improve the performance of the semantic segmentation task. While earlier efforts  used a Dense CRF  for post-processing, more recent works  have shown comparable results without such a technique.
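To make the receptive-field argument concrete, the following minimal NumPy sketch (the 1-D setting and function name are ours, purely for illustration) implements a dilated convolution: with dilation rate r, a kernel of size k spans r(k-1)+1 input samples, yet each output still costs only k multiply-adds, exactly as in a regular convolution.

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    """1-D dilated (Atrous) convolution with 'valid' padding.

    A kernel of size k with dilation rate r covers a receptive field
    of r*(k-1)+1 samples, yet each output costs only k multiply-adds,
    the same as a regular (rate=1) convolution.
    """
    k = len(w)
    span = rate * (k - 1) + 1  # effective receptive field
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
dense = dilated_conv1d(x, w, rate=1)   # receptive field 3
atrous = dilated_conv1d(x, w, rate=2)  # receptive field 5, same cost per output
```

Note that the output resolution stays (near) the input resolution in both cases; the dilated variant simply sees a wider context per output sample.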
Semantic segmentation can be applied at a finer granularity where instead of the entire scene, an object is semantically parsed into its parts. Among popular examples, readers are encouraged to refer to  for face,  for general objects, and  for human body and clothing semantic parsing.
In this work, since we are interested in attributes describing humans, when alluding to semantic segmentation we specifically mean face and human body semantic parsing. Our semantic segmentation model is a fully convolutional neural network based on the Inception-V3  architecture, where following  we have also incorporated Atrous spatial pyramid pooling (ASPP). In addition to utilizing semantic parsing for person-related attribute prediction, we will provide results on semantic face parsing as well. We show that training an attribute prediction network with image-level supervision can effectively serve as an initialization for the semantic parsing task when the number of training instances is limited.
The underlying idea of this work is to exploit semantic segmentation in order to improve person-related attribute prediction. To do so, we first revisit semantic segmentation-based pooling (SSP) and gating (SSG), initially proposed in our earlier work . Then, we propose a considerably simpler architecture, which unifies the SSP and SSG designs while approximately mimicking their behavior with a drastically smaller memory footprint. Furthermore, unlike , where there were two networks, one for semantic segmentation and the other for attribute prediction, here we unify the two networks by fully sharing weights between the tasks, and train them in an end-to-end fashion. Note that in , once trained independently, the semantic segmentation network was frozen during the attribute prediction task. Moving towards more modern architectures than those used in , we describe our new models with Inception-V3  as their backbone. This choice allows us to further push the performance boundaries of the person-related attribute prediction task.
3.1 SSP: Semantic Segmentation-based Pooling
We argue that attributes usually have a natural correspondence to certain regions within the object boundary. Hence, aggregating visual information from the entire spatial domain of an image would not capture this property. This is the case for the global average pooling used in our baseline, as it is agnostic to where, in the spatial domain, activations occur. Instead of pooling from the entire activation map, we propose to first decompose the activations of the last convolution layer into different semantic regions, and then aggregate only those that reside in the same region. Hence, rather than a single vector representation, we obtain multiple features, each representing only one semantic region. This approach has an interesting intuition behind it. In effect, SSP funnels the back-propagation of the label signals through the entire network via multiple paths, each associated with a different semantic region. This is in contrast with global average pooling, which affects different locations in the spatial domain rather equally. We later explore this by visualizing the activation maps of the final convolution layer.
We could simply concatenate the representations associated with different regions and pass them to the classifier; however, it is interesting to observe whether attributes indeed prefer one semantic region over another, and whether what our model learns matches human expectations of which attribute corresponds to which region. To do so, we take an approach similar to , where Bilen and Vedaldi employed a two-branch network for weakly supervised object detection. We pass the vector representations, each associated with a different semantic region, to two branches, one for recognition and one for localization. We implement these branches as linear classifiers that map the vector representations to the number of attributes. Hence, we have multiple detection scores for an attribute, each inferred from one and only one semantic region. To combine these detection scores, we normalize the outputs of the localization branch using a softmax non-linearity across the different semantic regions. This is a per-attribute operation, not an across-attribute one. We then compute the final attribute detection scores as a weighted sum of the per-region logits (i.e., the outputs of the recognition branch), using the weights generated by the localization branch. Figure 3 (Left) shows the SSP architecture.
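The per-region pooling and two-branch combination above can be sketched in NumPy as follows; all shapes and names here are illustrative assumptions, not those of our actual implementation, which operates inside a deep CNN.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ssp_forward(feats, masks, w_rec, w_loc):
    """SSP head: per-region pooling plus two-branch score combination.

    feats: (C, H, W) last-conv activations
    masks: (R, H, W) semantic-region probabilities
    w_rec, w_loc: (R, A, C) per-region linear classifiers for the
    recognition and localization branches (hypothetical shapes)
    Returns (A,) attribute detection scores.
    """
    R, H, W = masks.shape
    # mask-weighted average pooling: one C-dim vector per semantic region
    denom = masks.reshape(R, -1).sum(axis=1, keepdims=True) + 1e-8
    pooled = (masks.reshape(R, 1, H * W) * feats.reshape(1, -1, H * W)).sum(-1) / denom

    rec = np.einsum('rac,rc->ra', w_rec, pooled)  # per-region attribute logits
    loc = np.einsum('rac,rc->ra', w_loc, pooled)
    # softmax across regions, computed independently for each attribute
    weights = softmax(loc, axis=0)
    # final score: per-region recognition logits weighted by localization
    return (weights * rec).sum(axis=0)
```

The softmax over the region axis is what lets us read off, per attribute, which semantic region the model relied on.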
3.2 SSG: Semantic Segmentation-based Gating
Max pooling is used to compress the visual information in the activation maps of convolution layers. Its efficacy has been proven in many computer vision tasks, such as image classification and object detection. However, attribute prediction is inherently different from image classification. In image classification, we want to aggregate the visual information across the entire spatial domain to come up with a single label for the image. In contrast, many attributes are inherently localized to specific image regions. Consequently, aggregating activations that reside in the “mouth” region with those that correspond to “hair” would confuse the model in detecting the “smiling” and “wavy hair” attributes. We propose SSG to cope with this challenge.
Figure 3 (Right) shows our proposed SSG architecture, where the spatial resolution of the output may or may not match that of the input. To gate the output activations of the convolution layer, we broadcast an element-wise multiplication of each semantic region mask with the entire activation maps. This generates multiple copies of the activations, each masked differently. In other words, this mechanism spatially decomposes the activations into multiple copies, one per semantic region, such that large values cannot simultaneously occur in two semantically different regions. For example, gating with the semantic mask that corresponds to the “mouth” region suppresses the activations falling outside its area while preserving those that reside inside it. However, the area a semantic region occupies varies from one image to another.
We observed that directly applying the output of the semantic segmentation network results in instabilities in the middle of the network. To alleviate this, prior to the gating procedure, we normalize the semantic masks such that the values of each channel sum up to 1. We then gate the activations right after the convolution and before the batch normalization . This is very important, since batch normalization  enforces a normal distribution on the output of the gating procedure. We can then apply max pooling to these gated activation maps. Since, for a given channel, activations can only occur within a single semantic region, the max pooling operation cannot blend activation values that reside in different semantic regions. We later restore the number of channels using a convolution. It is worth noting that SSG can potentially mimic standard max pooling by learning a sparse set of weights for this convolution. In a nutshell, semantic segmentation-based gating allows us to process the activations of convolution layers in a per-semantic-region fashion, while also learning how to blend the pooled values back in.
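The normalize-gate-pool-restore sequence can be sketched as follows; this is a simplified NumPy illustration (batch normalization omitted, and the shapes of `acts`, `masks`, and `w_restore` are our assumptions), not our actual implementation.

```python
import numpy as np

def ssg_block(acts, masks, w_restore):
    """SSG sketch: normalize masks, gate, max-pool per region, restore channels.

    acts: (C, H, W) convolution output (batch norm omitted in this sketch)
    masks: (R, H, W) semantic probabilities; each mask is normalized to
    sum to 1 so that region size does not change the activation scale
    w_restore: (C, R*C) hypothetical 1x1-conv weights restoring C channels
    """
    R, H, W = masks.shape
    C = acts.shape[0]
    # per-channel normalization: each semantic mask sums to 1
    masks = masks / (masks.reshape(R, -1).sum(1).reshape(R, 1, 1) + 1e-8)

    gated = masks[:, None] * acts[None]  # (R, C, H, W): R differently masked copies
    # 2x2 max pooling applied per copy: within one channel of one copy,
    # large values can only come from a single semantic region
    pooled = gated.reshape(R * C, H // 2, 2, W // 2, 2).max(axis=(2, 4))
    # 1x1 convolution learns how to blend the pooled values back into C channels
    return np.einsum('ck,khw->chw', w_restore, pooled)  # (C, H//2, W//2)
```

Choosing `w_restore` sparsely (one nonzero weight per output channel) would recover something close to standard max pooling, which is the fallback behavior mentioned above.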
3.3 A Simple Unified View to SSP and SSG
In both SSP and SSG architectures, we use the output of semantic segmentation to create multiple copies of the activations. Each copy, assuming the semantic parsing outputs are perfect, preserves the activation values residing in one semantic region while suppressing those that fall outside it. Hence, both SSP and SSG must maintain one copy of the activation maps per semantic region in memory. As the number of semantic regions grows, this can certainly become problematic due to a limited GPU memory budget. A simple workaround is to learn the masks per channel. Specifically, instead of masking each and every channel of the previous convolution activations by all the semantic probability maps, we learn one mask per channel (ref. Figure 4). This can be simply implemented via a 1×1 convolution on top of the semantic segmentation probability maps. However, in practice, we observed that larger kernels can result in a slight performance gain. Similar to SSG, the output logits of the semantic segmentation classifier must be normalized, via batch normalization, prior to being passed to the embedding convolution layer. The output of the embedding should also be spatially normalized. Such an embedding allows the model to either pick one or combine (as a weighted superposition) multiple semantic maps in order to generate a proper mask for each channel. We initialize the convolution kernels of the embedding layers with zeros and no bias. This is inspired by the idea that each channel should start by using all the semantic regions equally. However, through training, it has the freedom to move towards combining only a selected number of regions. We later visualize what the learned convolution kernels of the embedding look like in Figures 9 and 7(a).
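The per-channel mask embedding can be sketched as follows, assuming a 1×1 embedding kernel and a spatial softmax as the spatial normalization (the function and variable names are illustrative, not taken from the implementation). With the zero initialization described above, every channel mask comes out spatially uniform, i.e., each channel starts by weighting all semantic regions and locations equally.

```python
import numpy as np

def channel_masks(seg_logits, w_embed):
    """Sketch of the per-channel mask embedding.

    seg_logits: (S, H, W) batch-normalized semantic segmentation logits
    w_embed:    (C, S) 1x1-conv weights mapping S maps to C channel masks
    """
    S, H, W = seg_logits.shape
    # 1x1 convolution: each channel mask is a weighted superposition of maps.
    mixed = np.einsum('cs,shw->chw', w_embed, seg_logits)     # (C, H, W)
    # Spatial normalization (softmax over H*W): each mask sums to 1.
    flat = mixed.reshape(w_embed.shape[0], -1)
    flat = np.exp(flat - flat.max(axis=1, keepdims=True))
    flat /= flat.sum(axis=1, keepdims=True)
    return flat.reshape(-1, H, W)
```

With `w_embed` initialized to zeros, each mask equals 1/(H·W) everywhere; training can then move the kernels towards combining only a selected subset of semantic maps.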
We now go one step further, as the same idea can be used when we reverse the roles of the two tasks. In particular, we can use the output of attribute prediction to guide the semantic segmentation task. We refer to this joint semantic augmenting model, illustrated in Figure 4, as Symbiotic Augmentation (SA). The architecture of the embedding module in this case is the same as before, except that the normalization operations are done differently. Figure 4 shows that in Symbiotic Augmentation, each task augments the other task’s representation, through its corresponding output logits, while simultaneously being trained in an end-to-end fashion. This is different from SSP and SSG, where only a pre-trained semantic segmentation model, frozen at deployment, augments the attribute prediction task. Note that, in addition to a lower memory footprint (SSP must keep one masked copy of the final activation maps, with their full channel count and final spatial resolution, per semantic region, whereas SA keeps a single set), this approach allows us to simplify SSP by unifying the recognition and localization branches. That is because the learned masks can properly weigh each channel, and the order of consecutive linear operations (matrix multiplication through the fully connected layer and scaling through the weights of the localization branch) is interchangeable.
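The interchangeability of the two linear operations can be checked directly: scaling the pooled features per channel and then applying a fully connected classifier is identical to folding the scaling into the classifier weights. A minimal NumPy check (all shapes and names illustrative):

```python
import numpy as np

# Per-channel scaling (localization branch) and the fully connected
# classifier are consecutive linear maps, so the scaling can be folded in.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)        # pooled per-channel features
s = rng.standard_normal(8)        # learned per-channel mask weights
W = rng.standard_normal((3, 8))   # fully connected classifier weights
y1 = W @ (s * x)                  # scale, then classify
y2 = (W * s) @ x                  # classify with scaling folded into W
assert np.allclose(y1, y2)
```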
3.4 Network Architectures
We use Inception-V3 as the convolutional backbone of Symbiotic Augmentation (SA), for both the semantic segmentation and attribute prediction models. Its architecture is 48 layers deep and uses global average pooling instead of fully connected layers, which allows operating on arbitrary input image sizes. Inception-V3 has a total output stride of 32. However, to maintain low computation cost and memory utilization, the size of the activation maps is quickly reduced by a factor of 8 in only the first seven layers, referred to as STEM in Figure 5. This is done by one convolution and two max pooling layers that operate with a stride of 2. The network continues with three blocks of Inception layers separated by two grid reduction modules. The spatial resolution of the activations remains intact within the Inception blocks, while the grid reduction modules halve the activation size and increase the number of channels. For more details on the Inception-V3 architecture, readers are encouraged to refer to the original publication. Note that for the SSP, SSG and SSP+SSG experiments, which were initially reported in our earlier work, a VGG16-like backbone architecture has been used; further details are provided there.
In this work, we use a single architecture to simultaneously learn the semantic parsing and attribute prediction tasks. This is different from our earlier work, where the semantic segmentation model was pre-trained and then deployed (with frozen weights) into the attribute prediction pipeline. Specifically, we share the weights of the Inception-V3 backbone while training with a mixed minibatch comprised of equal numbers of instances associated with the attribute prediction and semantic segmentation tasks. Figure 5 illustrates how we obtain feature representations for both tasks using a single architecture. Note that each element in the minibatch has only one type of annotation, either attribute or semantic segmentation labels. Hence, when the feature representations are passed to Symbiotic Augmentation (SA), shown in Figure 4, only the loss corresponding to each element’s annotation type is calculated.
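The mixed-minibatch objective can be sketched as follows, with toy stand-ins for the two task losses; the dictionary keys and loss definitions are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def attribute_loss(logits, labels):
    # Toy stand-in: mean binary cross-entropy over attribute logits.
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(-np.mean(labels * np.log(p + 1e-8)
                          + (1 - labels) * np.log(1 - p + 1e-8)))

def segmentation_loss(logits, labels):
    # Toy stand-in: mean per-pixel cross-entropy; logits are (classes, H, W).
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)
    h, w = labels.shape
    picked = p[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float(-np.mean(np.log(picked + 1e-8)))

def mixed_minibatch_loss(batch):
    """Each minibatch element carries exactly one annotation type, so only
    the matching objective is evaluated for it; the two task losses are
    then combined into a single end-to-end update."""
    attr, seg = [], []
    for ex in batch:
        if ex['kind'] == 'attribute':
            attr.append(attribute_loss(ex['logits'], ex['labels']))
        else:
            seg.append(segmentation_loss(ex['logits'], ex['labels']))
    return float(np.mean(attr) + np.mean(seg))
```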
4.1 Datasets and Evaluation Measures
We evaluate our proposed attribute prediction models on multiple benchmarks. Specifically, we use CelebA and LFWA  for facial attributes, while benchmarking on WIDER Attribute  and Berkeley Attributes of People  for person attribute prediction.
Liu et al. have used classification accuracy/error as the evaluation measure on CelebA and LFWA. However, we believe that, due to the significant imbalance between the numbers of positive and negative instances per attribute, such a measure cannot appropriately evaluate the quality of different methods. A similar point has been raised by [52, 23, 12] as well. Therefore, in addition to the classification error, we also report the average precision (AP) of the prediction scores. Following the literature, we solely report AP for WIDER Attribute and Berkeley Attributes of People. Since attribute benchmarks do not come with pixel-level labels, we train our semantic segmentation model on auxiliary datasets. For the experiments corresponding to facial attributes, we use Helen Face along with the supplemented segment label annotations. For the person attribute prediction experiments, we train the semantic parsing model on the Look into Person (LIP) dataset. We use the standard data split of each corresponding dataset.
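Why accuracy misleads under imbalance, and how AP differs, can be seen in a small NumPy sketch. The AP definition below (mean of the precision values at each positive, in score-descending order) is one common variant, assumed here for illustration:

```python
import numpy as np

def average_precision(scores, labels):
    """AP as the mean of precision values at each positive, ranked by score."""
    order = np.argsort(-scores)
    ranked = labels[order]
    precision = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return float(precision[ranked == 1].mean())

# With 95% negatives, a constant "attribute absent" predictor is 95%
# accurate while carrying no ranking information about the attribute.
labels = np.array([1] * 5 + [0] * 95)
always_negative_acc = (labels == 0).mean()                    # 0.95
perfect_ap = average_precision(labels.astype(float), labels)  # 1.0
```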
CelebA consists of 202,599 images partitioned into training, validation and test splits with approximately 162K, 20K and 20K images in the respective splits. There are a total of 10K identities (20 images per identity) with no identity overlap between the evaluation splits. However, we do not use the identity annotations. Images are annotated with 40 facial attributes such as “wavy hair”, “mouth slightly open”, and “big lips”. In addition to the original images, CelebA provides a set of pre-cropped images. We report our results on both of these image sets.
LFWA  has a total of 13,143 images of 5,749 identities with pre-defined train and test splits, which divide the entire dataset into two approximately equal partitions. Each image is annotated with the same 40 attributes used in CelebA.
WIDER Attribute is collected from 13,789 WIDER images, each usually containing many people with large human variation. Each person in these images is annotated with a bounding box and 14 distinct human attributes such as “longhair”, “sunglasses”, “hat”, “skirt”, and “facemask”. This results in a total of 57,524 boxes. The 13,789 images of WIDER Attribute are split into 5,509 training, 1,362 validation and 6,918 test images. The images are additionally annotated with 30 scene-level labels; however, we do not use them and solely train and evaluate on the bounding boxes of people. We evaluate on the 29,179 bounding boxes from the test images, after training on 28,345 person boxes extracted from the aggregation of the training and validation images. Unlike CelebA and LFWA, missing attributes are allowed in the WIDER Attribute dataset.
Berkeley Attributes of People contains 4,013 training and 4,022 test instances. The example images are centered on the person and labeled with 9 attributes, namely “is male”, “has long hair”, “has glasses”, “has hat”, “has tshirt”, “has long sleeves”, “has shorts”, “has jeans”, and “has long pants”. Similar to WIDER Attribute, unspecified attributes are also allowed here.
Helen Face consists of 2,330 images with highly accurate and consistent annotations of the primary facial components. Smith et al. have supplemented Helen Face with 11 segment labels per image: “background”, “face skin” (excluding ears and neck), “left eyebrow”, “right eyebrow”, “left eye”, “right eye”, “nose”, “upper lip”, “inner mouth”, “lower lip” and “hair”. The images are divided into splits of 2,000, 230 and 100, respectively, for training, validation and test. We train our semantic segmentation model on the aggregation of the training and validation splits and evaluate on the test split.
LIP consists of 30,000 and 10,000 images, respectively, for training and validation. Each image is annotated with 20 semantic labels: “Background”, “Hat”, “Hair”, “Glove”, “Sunglasses”, “Upper-clothes”, “Dress”, “Coat”, “Socks”, “Pants”, “Jumpsuits”, “Scarf”, “Skirt”, “Face”, “Right-arm”, “Left-arm”, “Right-leg”, “Left-leg”, “Right-shoe” and “Left-shoe”.
4.2 Evaluation of Facial Attribute Prediction
For all the numbers reported here, we want to point out that FaceTracer and PANDA use groundtruth landmark points to obtain face parts. Wang et al. use 5 million auxiliary image pairs, collected by the authors, to pre-train their model. Wang et al. also use state-of-the-art face detection and alignment to extract the face region from CelebA and LFWA images. In contrast, we train all our models with only attribute and auxiliary face/human parsing labels.
We compare our proposed method with the existing state-of-the-art attribute prediction techniques on CelebA. To prevent any confusion and ensure a fair comparison, Table I reports the performances in two separate columns, distinguishing the experiments that are conducted on the original image set from those where the pre-cropped image set has been used.
Experimental results indicate that, under different settings and evaluation protocols, our proposed semantic segmentation-based pooling and gating mechanisms can effectively boost facial attribute prediction performance. That is particularly important given that our global average pooling baselines already beat almost all the existing state-of-the-art methods. To see whether SSP and SSG are complementary to each other, we also report their combination, where the corresponding predictions are simply averaged. We observe that such a combination further boosts the performance.
Table I (excerpt), CelebA; columns correspond to the original and pre-cropped image sets.

Classification error (%):
|Method||Original||Pre-cropped|
|Liu et al.||12.70||–|
|Wang et al.||12.00||–|
|Zhong et al.||10.20||–|
|Rudd et al.: Separate||–||9.78|
|Rudd et al.: MOON||–||9.06|
|SSP + SSG||8.84||8.20|
|Symbiotic Augmentation (SA)||8.53||–|

Average precision (%):
|SSP + SSG||78.74||81.45|
|Symbiotic Augmentation (SA)||80.10||–|

Balanced accuracy (%):
|Huang et al.||–||84.00|
To investigate the importance of aggregating features within semantic regions, we replace the global average pooling in our basic model with a spatial pyramid pooling layer. We use a pyramid of two levels and refer to this baseline as SPPNet. While aggregating the output activations at different locations, SPPNet does not align its pooling regions according to the semantic content of the image. This is in direct contrast with the intuition behind our proposed methods. The experimental results shown in Table I confirm that simply pooling the output activations at multiple locations is not sufficient. In fact, it results in lower performance than global average pooling. This verifies that the improvement obtained by our proposed models is due to their content-aware pooling/gating mechanisms.
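For reference, a two-level spatial pyramid pooling layer of the kind used in this baseline can be sketched as below (a simplified average-pooling variant with illustrative names). Note how the pooling grid is fixed by image coordinates rather than by semantic content:

```python
import numpy as np

def spp(activations, levels=(1, 2)):
    """Two-level spatial pyramid pooling sketch: average-pool the maps over
    a 1x1 and a 2x2 grid of fixed regions, ignoring image content."""
    C, H, W = activations.shape
    feats = []
    for n in levels:
        hs, ws = H // n, W // n
        for i in range(n):
            for j in range(n):
                region = activations[:, i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
                feats.append(region.mean(axis=(1, 2)))
    return np.concatenate(feats)   # (C * 5,) for levels (1, 2)
```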
Naive Approach A naive alternative approach is to consider the segmentation maps as additional input channels. To evaluate its effectiveness, we feed the average pooling basic model with 10 input channels: 3 for the RGB colors and 7 for the different semantic segmentation maps. The input is normalized using batch normalization. We train the network using the same settings as the other aforementioned models. Our experimental results indicate that such a naive approach cannot leverage the localization cues as well as our proposed methods. Table I shows that, at best, the naive approach is on par with the average pooling basic model. We emphasize that feeding semantic segmentation maps along with RGB color channels to a convolutional network blends the two modalities in an additive fashion. Instead, our proposed mechanisms take a multiplicative approach by masking the activations using the semantic segmentation probability maps.
Semantic Masks vs. Bounding Boxes To analyze the necessity of semantic segmentation, we create a baseline, namely BBox, which is similar to SSP. However, we replace the semantic masks in SSP with bounding boxes around the facial landmarks. Note that we use the groundtruth locations of the facial landmarks, provided in the CelebA dataset, to construct the bounding boxes. Hence, to some extent, the performance of BBox is an upper bound for the bounding box experiment. There are 5 facial landmarks: left eye, right eye, nose, left mouth and right mouth. We use boxes of a fixed area (other box sizes give similar results) with 1:1, 1:2 and 2:1 aspect ratios. Thus, there are a total of 16 regions including the whole image itself. From Table I, we see that our proposed models, regardless of the evaluation measure, outperform the bounding box alternative, suggesting that semantic masks should be favored over bounding boxes on facial landmarks.
Balanced Classification Accuracy Given the significant imbalance in the attribute classes, also noted by [23, 52, 12], we suggest using average precision instead of classification accuracy/error to evaluate attribute prediction. Instead, Huang et al. and later work have adopted the balanced accuracy measure. To evaluate our proposed approach under this measure, we fine-tuned our models with a binary cross-entropy loss weighted according to the imbalance level. From Table I, we observe that under such a measure, all variations of our proposed model outperform both prior methods by large margins.
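The weighted loss used for this fine-tuning can be sketched as a positive-class-weighted binary cross-entropy; the exact weighting scheme shown here (a single scalar per attribute, e.g. the negative-to-positive ratio) is an assumption for illustration:

```python
import numpy as np

def weighted_bce(logits, labels, pos_weight):
    """Binary cross-entropy with a positive-class weight, a common recipe
    when optimizing for balanced accuracy under class imbalance."""
    p = 1.0 / (1.0 + np.exp(-logits))
    loss = -(pos_weight * labels * np.log(p + 1e-8)
             + (1 - labels) * np.log(1 - p + 1e-8))
    return float(loss.mean())
```

Setting `pos_weight` to the ratio of negatives to positives makes a missed rare positive cost as much as many false alarms on the common negative class.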
Table II (excerpt), LFWA:
|Method||Classification Error (%)||Average Precision (%)|
|Liu et al.||16.00||–|
|Zhong et al.||14.10||–|
|Wang et al.||13.00||–|
|SSP + SSG||12.87||85.28|
To better understand the effectiveness of our proposed approach on facial attributes, we also report experimental results on the LFWA dataset in Table II. Here, we observe a trend similar to the one on CelebA, where all the proposed models that exploit localization cues successfully improve over the baseline. Specifically, SSP + SSG achieves considerably better performance than the average pooling model, with margins of 1.86% in classification accuracy and 2.59% in average precision. Our best model also outperforms all other state-of-the-art methods.
Symbiotic Augmentation (SA) All the results reported so far used a VGG16-like architecture for attribute prediction and a separate pre-trained encoder-decoder architecture for semantic segmentation. In SA-based models, however, we have unified the two architectures and train simultaneously with two objective functions. Table I shows that simply using a stronger convolutional backbone like Inception-V3 boosts the performance on the CelebA original image set. Furthermore, the SA-based model built on top of such a backbone, despite heavily sharing all the weights across the two tasks, achieves even better results, outperforming SSP+SSG and the current state-of-the-art AFFAIR. However, on the LFWA dataset, we observed that the Inception-V3 baseline performs on par with the Avg. Pooling baseline reported in Table II, and SA cannot obtain a meaningful gain over its global average pooling counterpart. We also tried (not reported here) solely using the LFWA training instances, without pre-training on CelebA, and observed that SA was indeed effective; however, it was not able to reach the performance of the model initialized with CelebA. Detailed per-attribute results of our top models for both the CelebA and LFWA datasets are shown in Table V.
Table III (excerpt), WIDER Attribute, mAP (%):
|Fast R-CNN||80.00|
|Deep Hierarchical Contexts||81.30|
|Sarafianos et al.||86.40|
|Symbiotic Augmentation (SA)||87.58|
Table IV (excerpt), Berkeley Attributes of People, mAP (%):
|Fast R-CNN||87.80|
|Gkioxari et al.||89.50|
|Deep Hierarchical Contexts||92.20|
|Symbiotic Augmentation (SA)||94.80|
4.3 Evaluation of Person Attribute Prediction
Table III compares our proposed method with the state-of-the-art on the WIDER Attribute dataset. We observe that the Inception-V3 baseline, despite being considerably shallower, performs on par with ResNet-101. Symbiotic Augmentation (SA), which employs semantic segmentation, yields a 2% performance gain over our Inception-V3 baseline, surpassing the current state-of-the-art. For a detailed performance comparison between varieties of ResNet and DenseNet architectures on the WIDER Attribute dataset, readers are encouraged to refer to the corresponding benchmark study.
Table IV compares our proposed method with the state-of-the-art on the Berkeley Attributes of People dataset. Note that the best prior method leverages the context in the image, while our method solely operates on the bounding box of each person; yet it still outperforms that method by a 2.6% margin. Similar to the WIDER Attribute dataset, utilizing semantic segmentation through our proposed Symbiotic Augmentation (SA) results in a 2% gain in AP over our already very competitive Inception-V3 baseline. Detailed per-attribute results of our models are shown in Table VI.
[Table V: per-attribute classification accuracy (%) and average precision (%) for the Inception-V3 baseline and Symbiotic Augmentation (SA) on CelebA and LFWA.]
Table VI (excerpt), Berkeley Attributes of People, AP (%):
|Attribute||Inception-V3: baseline||Symbiotic Augmentation (SA)|
|Has Long Hair||93.71||94.41|
|Has Long Sleeves||96.96||98.01|
|Has Long Pants||99.33||99.55|
Unlike global average pooling, which equally affects a rather large spatial domain, we expect SSP to generate activations that are semantically aligned. To test this hypothesis, in Figure 6 we show the activations of the top fifty channels of the last convolution layer. The top row corresponds to our basic network with global average pooling, while the bottom row is generated when we replace global average pooling with SSP. We observe that the activations generated by SSP are clearly more localized than those obtained with global average pooling.
To better understand how the attribute prediction and semantic segmentation models have learned their corresponding tasks, we visualize the embedding convolution layers (ref. Figure 4) for the simultaneous training of CelebA (original image set) with Helen Face, and of WIDER Attribute with LIP. Figure 9 shows how, for each facial attribute (vertical axis), the network has learned to employ different semantic regions of the face (horizontal axis) in order to predict attributes. Note that these weights are learned through back-propagation and are not hard coded, yet they reveal very interesting observations. First, almost all the attributes give “background” the lowest importance, except the attribute “Wearing Necklace”, which makes sense as the neck falls outside the face region and is counted as background in the Helen Face dataset. Second, the learned importance for the majority of attributes is aligned with human expectations. For instance, all the hair-related attributes are inferred with most of the model’s attention being paid to the “Hair” region. The same is true for “Big Nose”, “Pointy Nose” and “Eyeglasses”, as the model learns to focus on the “Nose” region. Figure 7 illustrates the reverse problem, where attributes are supposed to improve semantic face parsing. Figures 7(a) and 7(b) show the learned weights of the embedding convolution layers for the person attribute prediction and human semantic parsing tasks.
We observe that simultaneously training for attribute prediction and semantic segmentation within Symbiotic Augmentation framework, in addition to the performance gains, provides us with meaningful tools to study how a complex deep neural network infers and relates different semantic labels across multiple tasks.
4.5 Attribute Prediction for Semantic Segmentation
In this work, we have established how semantic segmentation can be used to improve person-related attribute prediction. What if we reverse the roles: can attributes improve semantic parsing? To evaluate this, we focus on facial attributes and compare the performance of semantic face parsing on Helen Face. We consider three scenarios: first, initializing the Inception-V3 backbone with ImageNet pre-trained weights; second, training a baseline attribute prediction network on CelebA and using the corresponding weights, once training has finished, to initialize the semantic face parsing network; third, training facial attribute prediction and semantic face parsing simultaneously through the Symbiotic Augmentation (SA) framework. For the sake of simplicity, solely in this experiment, SA only uses the final activations of the CNN backbone instead of concatenating them with intermediate feature maps as shown in Figure 5. We observed that upgrading to the full SA model boosts the mean class accuracy by 5% while achieving a similar mean IoU. Table VII shows that pre-training on image-level facial attribute annotations delivers a large performance gain over ImageNet-based initialization. This shows that there exists an interrelatedness between attribute prediction and semantic segmentation. Furthermore, it suggests that while collecting annotations for semantic parsing is laborious and expensive, one can instead use relevant image-level attribute annotations to initialize a semantic parsing model. The last row in each block of Table VII demonstrates how training facial attribute prediction and semantic face parsing jointly, through our proposed Symbiotic Augmentation (SA), can push the performance boundary further by a significant margin. Therefore, when few training instances are available, image-level facial attribute labels can indeed serve as an effective source of weak supervision to improve the semantic face parsing task.
In fact, such interrelatedness plays a major role in allowing us to successfully unify the semantic segmentation and attribute prediction networks (ref. Section 3) without sacrificing performance. Jointly training on LIP and WIDER Attribute, we did not observe a meaningful gain in the semantic segmentation task on LIP. We hypothesize that this is due to the fact that LIP itself already has a huge number of training annotations (30,000 instances). Confirming this would require an experiment in which only a small portion of the LIP training instances is used.
[Table VII: mean class accuracy and intersection over union (%) for semantic face parsing on Helen Face under the three initialization scenarios.]
Aligned with the trend of part-based attribute prediction methods, we proposed employing semantic segmentation to improve person-related attribute prediction. Specifically, we jointly learn attribute prediction and semantic segmentation in order to transfer localization cues from the latter task to the former. To guide the attention of our attribute prediction model to the regions where different attributes naturally show up, we introduced SSP and SSG. While SSP is used to restrict the aggregation of the final activations to regions that are semantically consistent, SSG carries the same notion but applies it to the earlier layers. We then demonstrated that a single unified architecture can mimic the behavior of both SSP and SSG, depending on where in the network it is used. We evaluated our proposed methods on the CelebA, LFWA, WIDER Attribute and Berkeley Attributes of People datasets and achieved state-of-the-art performance. We also showed that attributes can improve semantic segmentation (in the case of few training instances) when properly used through our Symbiotic Augmentation (SA) framework. We hope to encourage future research to invest more in the interrelatedness of these two problems.
This material is based upon work supported by the National Science Foundation under Grant No. 174143 and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
-  (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §2.2, §2.2.
-  (2013) Poof: part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 955–962. Cited by: §2.1, §2.1.
-  (2016) Weakly supervised deep detection networks. In CVPR, Cited by: §3.1.
-  (2011) Describing people: a poselet-based approach to attribute classification. In 2011 International Conference on Computer Vision, pp. 1543–1550. Cited by: §2.1, §2.1, §4.1, §4.1, §4.1, §4.3, TABLE IV, TABLE VI.
-  (2012) Describing clothing by semantic attributes. In European conference on computer vision, pp. 609–623. Cited by: §2.1.
-  (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2.2, §2.2, §2.2.
-  (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §2.2, §2.2.
-  (2016) Attention to scale: scale-aware semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3640–3649. Cited by: §2.2.
-  (2014) Detect what you can: detecting and representing objects using holistic models and body parts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
-  (2012) Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in neural information processing systems, pp. 2843–2851. Cited by: §2.2.
-  (2013) A deformable mixture parsing model with parselets. In Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 3408–3415. Cited by: §2.2.
-  (2017) Class rectification hard mining for imbalanced deep learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1851–1860. Cited by: §2.1, §4.1, §4.2, TABLE I.
-  (2009) Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1778–1785. Cited by: §2.1.
-  (2010) Attribute-centric recognition for cross-category generalization. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2352–2359. Cited by: §2.1.
-  (2016) Learning attributes equals multi-source domain generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 87–97. Cited by: §2.1.
-  (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: TABLE III, TABLE IV.
-  (2015) Actions and attributes from wholes and parts. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2470–2478. Cited by: §2.1, §2.1, TABLE IV.
-  (2015) Contextual action recognition with r* cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1080–1088. Cited by: TABLE III, TABLE IV.
-  (2017) Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 932–940. Cited by: §1, §2.2, §4.1, §4.1, §4.4, §4.5.
-  (2015) Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 447–456. Cited by: §2.2.
-  (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pp. 346–361. Cited by: §4.2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.3.
-  (2016) Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384. Cited by: §2.1, §4.1, §4.2, TABLE I.
-  (2017) Densely connected convolutional networks.. In CVPR, Vol. 1, pp. 3. Cited by: §4.3.
-  (2011) Sharing features between objects and their attributes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1761–1768. Cited by: §2.1.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.2, §4.2.
-  (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025. Cited by: §2.1.
-  (2014) Decorrelating semantic visual attributes by resisting the urge to share. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1629–1636. Cited by: §2.1.
-  (2013) Augmenting crfs with boltzmann machine shape priors for image labeling. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 2019–2026. Cited by: §2.2.
-  (2017) Improving facial attribute prediction using semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6942–6950. Cited by: On Symbiosis of Attribute Prediction and Semantic Segmentation, Fig. 2, §1, §1, §2.1, §3.4, §3.4, §3, §4.2.
-  (2008) Facetracer: a search engine for large collections of images with faces. In European conference on computer vision, pp. 340–353. Cited by: §2.1, §4.2, TABLE I, TABLE II.
-  (2009) Attribute and simile classifiers for face verification. In 2009 IEEE 12th International Conference on Computer Vision, pp. 365–372. Cited by: §2.1, §2.1.
-  (2009) Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 951–958. Cited by: §2.1.
-  (2012) Interactive facial feature localization. In European Conference on Computer Vision, pp. 679–692. Cited by: §2.2, §4.1, §4.1, §4.4, §4.5, TABLE VII.
-  (2018) Landmark free face attribute prediction. IEEE Transactions on Image Processing 27 (9), pp. 4651–4662. Cited by: §2.1, §4.2, TABLE I, TABLE II.
-  (2016) Human attribute recognition by deep hierarchical contexts. In European Conference on Computer Vision, pp. 684–700. Cited by: §2.1, §4.1, §4.1, §4.1, §4.1, §4.3, §4.3, §4.4, §4.5, TABLE III, TABLE IV, TABLE VI.
-  (2015) Deep human parsing with active template regression. IEEE transactions on pattern analysis and machine intelligence 37 (12), pp. 2402–2414. Cited by: §2.2.
-  (2016) Semantic object parsing with graph lstm. In European Conference on Computer Vision, pp. 125–143. Cited by: §2.2.
-  (2015) Human parsing with contextualized convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1386–1394. Cited by: §2.2.
-  (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations, Cited by: §2.2, §2.2.
-  (2016) RefineNet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv preprint arXiv:1611.06612. Cited by: §2.2.
-  (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §1.
-  (2011) Recognizing human actions by attributes. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3337–3344. Cited by: §2.1.
-  (2015) Fashion parsing with video context. IEEE Transactions on Multimedia 17 (8), pp. 1347–1358. Cited by: §2.2.
-  (2015) Matching-CNN meets KNN: quasi-parametric human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1419–1427. Cited by: §2.2.
-  (2015) Multi-objective convolutional learning for face labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3451–3459. Cited by: §2.2.
-  (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV). Cited by: §2.1, §2.1, §2.1, §4.1, §4.1, §4.1, §4.1, §4.1, §4.2, §4.2, §4.2, §4.2, §4.4, §4.5, TABLE I, TABLE II.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §2.2, §2.2.
-  (2015) Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528. Cited by: §1, §2.2.
-  (2011) Interactively building a discriminative vocabulary of nameable attributes. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1681–1688. Cited by: §2.1.
-  (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §2.2.
-  (2016) MOON: a mixed objective optimization network for the recognition of facial attributes. arXiv preprint arXiv:1603.07027. Cited by: §2.1, §4.1, §4.2, TABLE I.
-  (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.5.
-  (2018) Deep imbalanced attribute classification using visual attention aggregation. In The European Conference on Computer Vision (ECCV). Cited by: §4.3, TABLE III.
-  (2017) Deep view-sensitive pedestrian attribute inference in an end-to-end model. arXiv preprint arXiv:1707.06089. Cited by: TABLE III.
-  (2008) Semantic texton forests for image categorization and segmentation. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8. Cited by: §2.2.
-  (2013) Real-time human pose recognition in parts from single depth images. Communications of the ACM 56 (1), pp. 116–124. Cited by: §2.2.
-  (2016) End-to-end localization and ranking for relative attributes. In European Conference on Computer Vision, pp. 753–769. Cited by: §2.1.
-  (2013) Exemplar-based face parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3484–3491. Cited by: §1, §2.2, §4.1, §4.1, TABLE VII.
-  (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. Cited by: §1, §2.2, Fig. 5, §3.4, §3.4, §3, §4.2, §4.3, §4.3, §4.5.
-  (2014) Understanding objects in detail with fine-grained attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3622–3629. Cited by: §2.1.
-  (2016) Walk and learn: facial attribute representation learning from egocentric video and contextual data. arXiv preprint arXiv:1604.06433. Cited by: §4.2, TABLE I, TABLE II.
-  (2015) Joint object and part segmentation using deep learned potentials. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1573–1581. Cited by: §2.2.
-  (2010) A discriminative latent model of object classes and attributes. In European Conference on Computer Vision, pp. 155–168. Cited by: §2.1.
-  (2016) Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net. In European Conference on Computer Vision, pp. 648–663. Cited by: §2.2.
-  (2015) Recognize complex events from static images by fusing deep channels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1600–1609. Cited by: §4.1.
-  (2013) Paper doll parsing: retrieving similar styles to parse clothing items. In 2013 IEEE International Conference on Computer Vision (ICCV), pp. 3519–3526. Cited by: §2.2.
-  (2012) Parsing clothing in fashion photographs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3570–3577. Cited by: §2.2.
-  (2014) Clothing co-parsing by joint image segmentation and labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3182–3189. Cited by: §2.2.
-  (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §2.2, §2.2.
-  (2014) PANDA: pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637–1644. Cited by: §2.1, §4.2, TABLE I, TABLE II.
-  (2016) Leveraging mid-level deep representations for predicting face attributes in the wild. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3239–3243. Cited by: TABLE I, TABLE II.
-  (2017) Learning spatial regularization with image-level supervisions for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5513–5522. Cited by: TABLE III.
Mahdi M. Kalayeh received his B.Sc. from Tehran Polytechnic (Amirkabir University of Technology) in 2009 and his M.Sc. from the Illinois Institute of Technology (IIT) in 2010, both in Electrical Engineering. In 2019, he received his Ph.D. in Computer Science from the Center for Research in Computer Vision (CRCV) at the University of Central Florida. His research lies at the intersection of computer vision and machine learning, including deep learning, visual attribute prediction, semantic segmentation, complex event and action recognition, object recognition, and scene understanding. He has published several papers in conferences and journals such as CVPR, ACM MM, and PAMI, and has served as a reviewer for peer-reviewed conferences and journals including CVPR, ICCV, ECCV, ACCV, IJCV, IEEE Transactions on Image Processing, and IEEE Transactions on Multimedia. He is currently a Senior Research Scientist at Netflix.
Mubarak Shah, the Trustee Chair Professor of Computer Science, is the founding director of the Center for Research in Computer Vision at the University of Central Florida (UCF). He is an editor of an international book series on video computing, was editor-in-chief of the Machine Vision and Applications journal, and an associate editor of the ACM Computing Surveys journal. He was the program co-chair of CVPR 2008, an associate editor of IEEE T-PAMI, and a guest editor of the special issue of the International Journal of Computer Vision on Video Computing. His research interests include video surveillance, visual tracking, human activity recognition, visual analysis of crowded scenes, video registration, and UAV video analysis. He is an ACM distinguished speaker. He was an IEEE distinguished visitor speaker for 1997–2000 and received the IEEE Outstanding Engineering Educator Award in 1997. In 2006, he was awarded the Pegasus Professor Award, the highest award at UCF. He received the Harris Corporation's Engineering Achievement Award in 1999, TOKTEN awards from UNDP in 1995, 1997, and 2000, Teaching Incentive Program Awards in 1995 and 2003, Research Incentive Awards in 2003 and 2009, Millionaires Club Awards in 2005 and 2006, and the University Distinguished Researcher Award in 2007. He received an honorable mention for the ICCV 2005 Where Am I? Challenge Problem and was nominated for the Best Paper Award at the ACM Multimedia Conference in 2005. He is a fellow of the IEEE, AAAS, IAPR, and SPIE.