Attentive Single-Tasking of Multiple Tasks
In this work we address task interference in universal networks by considering that a network is trained on multiple tasks, but performs one task at a time, an approach we refer to as “single-tasking multiple tasks”. The network thus modifies its behaviour through task-dependent feature adaptation, or task attention. This gives the network the ability to accentuate the features that are adapted to a task, while shunning irrelevant ones. We further reduce task interference by forcing the task gradients to be statistically indistinguishable through adversarial training, ensuring that the common backbone architecture serving all tasks is not dominated by any of the task-specific gradients.
Results in three multi-task dense labelling problems consistently show: (i) a large reduction in the number of parameters while preserving, or even improving performance and (ii) a smooth trade-off between computation and multi-task accuracy. We provide our system’s code and pre-trained models at http://vision.ee.ethz.ch/~kmaninis/astmt/.
Real-world problems involve a multitude of visual tasks that call for multi-tasking, universal vision systems. For instance autonomous driving requires detecting pedestrians, estimating velocities and reading traffic signs, while identity recognition, pose, face and hand tracking are required for human-computer interaction.
A thread of works has introduced multi-task networks [Ser+14, EiFe15, He+17, Kokk17] handling an increasingly large number of tasks. Still, it is common practice to train devoted networks for individual tasks when single-task performance is critical. This is supported by negative results from recent works that have aimed at addressing multiple problems with a single network [He+17, Kokk17]: these have shown that there is a limit on performance imposed by the capacity of the network, manifested as a drop in performance when loading a single network with more tasks. Stronger backbones can uniformly improve multi-task performance, but the per-task performance can still be lower than the single-task performance with the same backbone.
This problem, known as task interference, can be understood as facing the dilemma of invariance versus sensitivity: the most crucial information for one task can be a nuisance parameter for another, which leads to potentially conflicting objectives when training multi-task networks. An example of such a task pair is pose estimation and object detection: when detecting or counting people the detailed pose information is a nuisance parameter that should be eliminated at some point from the representation of a network aiming at pose invariance [He+17]. At the same time, when watching a dance performance, one needs the detailed pose of the dancers, while ignoring the large majority of spectators. More generally this is observed when combining a task that is detail-oriented and requires high spatial acuity with a task that requires abstraction from spatial details, e.g. when one wants to jointly do low- and high-level vision. In other words, one task’s noise is another one’s signal.
We argue that this dilemma can be addressed by single-tasking, namely executing one task at a time, rather than getting all task responses in a single forward pass through the network. This reflects many practical setups, for instance when one sees the results of a single computational photography task at a time on the screen of a mobile phone, rather than all of them jointly. Operating in this setup allows us to implement an “attention to task” mechanism that changes the network’s behaviour in a task-adapted manner, as shown in Fig. 1. We use the exact same network backbone in all cases, but we modify the network’s behaviour according to the executed task by relying on the most task-appropriate features. For instance when performing a low-level task such as boundary detection or normal estimation, the network can retain and elaborate on fine image structures, while shunning them for a high-level task that requires spatial abstraction.
We explore two different task attention mechanisms, as shown in Fig. 2. Firstly, we use data-dependent modulation signals [Per+17] that enhance or suppress neuronal activity in a task-specific manner. Secondly, we use task-specific Residual Adapter [RBV18] blocks that latch on to a larger architecture in order to extract task-specific information which is fused with the representations extracted by a generic backbone. This allows us to learn a shared backbone representation that serves all tasks but collaborates with task-specific processing to build more elaborate task-specific features.
These two extensions can be understood as promoting a disentanglement between the shared representation learned across all tasks and the task-specific parts of the network. Still, if the loss of a single task is substantially larger, its gradients will overwhelm those of the others and disrupt the training of the shared representation. In order to make sure that no task abuses the shared resources we impose a task-adversarial loss on the network gradients, requiring that these are statistically indistinguishable across tasks. This loss is minimized during training through double back-propagation [DrLe91], and leads to an automatic balancing of loss terms, while promoting compartmentalization between task-specific and shared blocks.
2 Related Work
Our work draws ideas from several research threads. Multiple Task Learning (MTL): Several works have shown that jointly learning pairs of tasks yields fruitful results in computer vision. Successful pairs include detection and classification [Girs15, Ren+15], detection and segmentation [He+17, Dvo+17], or monocular depth and segmentation [EiFe15, Xu+18]. Joint learning is beneficial for unsupervised learning [Ran+18], when tasks provide complementary information (e.g. depth boundaries and motion boundaries [ZLH18]), in cases where task A acts as a regularizer for task B due to limited data [LQH17], or in order to learn more generic representations from synthetic data [ReLe18]. Xiao et al. [Xia+18] unify inhomogeneous datasets in order to train for multiple tasks, while [Zam+18] explore relationships among a large number of tasks for transfer learning, reporting improvements when transferring across particular task pairs.
Despite these positive results, joint learning can be harmful in the absence of a direct relationship between task pairs. This was reported clearly in [Kokk17] where the joint learning of low-, mid- and high-level tasks was explored, reporting that the improvement of one task (e.g. normal detection) was to the detriment of another (e.g. object detection). Similarly, when jointly training for human pose estimation on top of detection and segmentation, Mask R-CNN performs worse than its two-task counterpart [He+17].
This negative result first requires carefully calibrating the relative losses of the different tasks, so that none of them deteriorates excessively. To address this problem, GradNorm [Che+17a] provides a method to adapt the weights such that each task contributes in a balanced way to the loss, by normalizing the gradients of their losses; a more recent work [Sin+18] extends this approach to homogenize the task gradients based on adversarial training. Following a probabilistic treatment, [KGC18] re-weigh the losses according to each task’s uncertainty, while Sener and Koltun [SeKo18] estimate an adaptive weighting of the different task losses based on a Pareto-optimal formulation of MTL. Similarly, [Guo+18] provide an MTL framework where tasks are dynamically sorted by difficulty and the hardest are learned first.
A second approach to mitigate task interference consists in avoiding the ‘spillover’ of gradients from one task’s loss to the common features serving all tasks. One way of doing this is explicitly constructing complementary task-specific feature representations [PNN, RoTs18], but this results in an increase of complexity that is linear in the number of tasks. An alternative, adopted in the related problem of lifelong learning, consists in removing from the gradient of a task’s loss those components that would incur an increase in the loss of previous tasks [EWC, GEM]. For domain adaptation, [Bou+16] disentangle the representations learned by shared/task-specific parts of networks by enforcing similarity/orthogonality constraints. Adversarial training has been applied to the feature space in the context of domain adaptation [GaLe15, LQH17], in order to fool a discriminator about the source domain of the features.
In our understanding these losses promote a compartmental operation of a network, achieved for instance when a block-structured weight matrix prevents the interference of features for tasks that should not be connected. A deep single-task implementation of this would be the gating mechanism of [AhTo18]. For multi-tasking, Cross Stitch Networks [Mis+16] automatically learn to split/fuse two independent networks in different depths according to their learned tasks, while [Mur+16] estimate a block-structured weight matrix during CNN training.
As we show in the next section, a soft “compartmentalization” effect that does not interfere with the network’s weight matrix can be achieved in the case of multi-task learning through a soft, learnable form of task-specific feature masking. This shares several similarities with attention mechanisms, briefly described below.
Attention mechanisms: Attention has often been used in deep learning to visualize and interpret the inner workings of CNNs [SVZ13, ZeFe14, Sel+17], but has also been employed to improve the learned representations of convolutional networks for scale-aware semantic segmentation [Che+17], fine-grained image recognition [FZM17] or caption generation [Xu+15, Lu+16, And+18]. Squeeze and Excitation Networks [HSS18] and their variants [Woo+18, Hu+18a] modulate the information of intermediate spatial features according to a global representation and can be understood as implementing attention to different channels. Deep Residual Adapters [RBV17, RBV18] modulate learned representations depending on their source domain. Several works study modulation for image retrieval [Zha+18] or classification tasks [Per+17, Mud+19], and embeddings for different artistic styles [DSK17]. [Yan+18] learns object-specific modulation signals for video object segmentation, and [RBT18] modulates features according to given priors for detection and segmentation. In our case we learn task-specific modulation functions that allow us to drastically change the network’s behaviour while using identical backbone weights.
3 Attentive Single-Tasking Mechanisms
Having a shared representation for multiple tasks can be efficient from the standpoint of memory and sample complexity, but can result in worse performance if the same resources are serving tasks with unrelated, or even conflicting objectives, as described above.
Our proposed remedy to this problem consists in learning a shared representation for all tasks, while allowing each task to use this shared representation differently for the construction of its own features.
3.1 Task-specific feature modulation
In order to justify our approach we start with a minimal example. We consider two tasks A and B that share a common feature tensor $F(x,y,c)$ at a given network layer, where $x,y$ are spatial coordinates and $c = 1,\ldots,C$ are the tensor channels. We further assume that a subset $S_A$ of the channels is better suited for task A, while $S_B$ is better for B. For instance if A is invariant to deformations (detection) while B is sensitive (pose estimation), $S_A$ could be features obtained by taking more context into account, while $S_B$ would be more localized.
One simple way of ensuring that tasks A and B do not interfere while using a shared feature tensor is to hide the features of task B when training for task A:
$$F_A(x,y,c) = m_A[c] \cdot F(x,y,c)$$
where $m_A[c] \in \{0,1\}$ is the indicator function of set $S_A$. If $c \notin S_A$ then $F_A(x,y,c) = 0$, which means that the gradient $\partial \mathcal{L}_A / \partial F(x,y,c)$ sent by the loss $\mathcal{L}_A$ of task A to $F(x,y,c)$ will be zero. We thereby avoid task interference since task A neither influences nor uses features that it does not need.
Instead of this hard choice of features per task we opt for a soft, differentiable membership function that is learned in tandem with the network and allows the different tasks to discover during training which features to use. Instead of a constant membership function per channel we opt for an image-adaptive term that allows one to exploit the power of the squeeze-and-excitation block [HSS18].
In particular we adopt the squeeze-and-excitation (SE) block (also shown in Fig. 2), combining a global average pooling operation of the previous layer with a fully-connected layer that feeds into a sigmoid function, yielding a differentiable, image-dependent channel gating function. We set the parameters of this layer to be task-dependent, allowing every task to modulate the available channels differently. As shown in Section 5, this can result in substantial improvements when compared to a baseline that uses the same SE block for all tasks.
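The difference between a hard per-task channel mask and a learned, image-dependent gate can be illustrated with a minimal NumPy sketch. It is not the released implementation: the class name `TaskSE`, the helper `hard_mask`, the task names, the random initialisation and the use of a single fully-connected layer (as described in the text above, without any channel reduction) are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_mask(features, channels_for_task):
    """Zero every channel that is not in the task's channel subset."""
    mask = np.zeros(features.shape[0])
    mask[list(channels_for_task)] = 1.0
    return features * mask[:, None, None]

class TaskSE:
    """Squeeze-and-excitation style gate with one parameter set per task."""

    def __init__(self, num_channels, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical initialisation: one fully-connected layer per task.
        self.weights = {t: rng.normal(0.0, 0.1, (num_channels, num_channels))
                        for t in tasks}
        self.biases = {t: np.zeros(num_channels) for t in tasks}

    def gate(self, features, task):
        # Squeeze: global average pooling over the spatial dimensions -> (C,)
        squeezed = features.mean(axis=(1, 2))
        # Excite: task-specific fully-connected layer followed by a sigmoid
        return sigmoid(self.weights[task] @ squeezed + self.biases[task])

    def __call__(self, features, task):
        # Soft, image-dependent channel gating in (0, 1)
        return features * self.gate(features, task)[:, None, None]

# Toy feature tensor with layout (channels, height, width)
features = np.random.default_rng(1).normal(size=(4, 8, 8))
se = TaskSE(num_channels=4, tasks=["edges", "semseg"])
edges_out = se(features, "edges")      # same backbone features, edge-specific gate
semseg_out = se(features, "semseg")    # same backbone features, semseg-specific gate
masked = hard_mask(features, channels_for_task={0, 1})
```

The same `features` tensor yields different outputs per task because only the gate parameters differ, whereas the hard mask removes channels outright and cannot be learned by gradient descent.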
3.2 Residual Adapters
The feature modulation described above can be understood as shunning those features that do not contribute to the task while focusing on the more relevant ones. Intuitively, this does not add capacity to the network but rather cleans the signal that flows through it from information that the task should be invariant to. We propose to complement this by appending task-specific sub-networks that adapt and refine the shared features in terms of residual operations of the following form:
$$L_A(x) = x + L(x) + RA_A(x)$$
where $L(\cdot)$ denotes the default behaviour of a residual layer, $RA_A(\cdot)$ is the task-specific residual adapter of task A, and $L_A(\cdot)$ is the modified layer. We note that if $L(\cdot)$ and $RA_A(\cdot)$ were linear layers this would amount to the classical regularized multi-task learning of [EvPo04].
These adapters introduce a task-specific parameter and computation budget that is used in tandem with that of the shared backbone network. We show in Section 5 that this is typically a small fraction of the budget used for the shared network, but improves accuracy substantially.
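As a rough illustration of how a parallel adapter can sit next to a shared residual branch, the NumPy sketch below adds a task-specific branch to the output of a shared one. It assumes that the shared branch is a single 3×3 convolution with a ReLU and that each adapter is a 1×1 convolution, as is common in the residual-adapter literature; the class `AdaptedResidualLayer`, the helper functions and the task names are hypothetical and not taken from the released code.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv3x3(x, w):
    """Zero-padded 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], height, width))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], window)
    return out

def conv1x1(x, w):
    """1x1 convolution. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

class AdaptedResidualLayer:
    """Shared residual branch plus one parallel adapter per task."""

    def __init__(self, channels, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # Shared by all tasks
        self.shared = rng.normal(0.0, 0.1, (channels, channels, 3, 3))
        # One small adapter per task (hypothetical initialisation)
        self.adapters = {t: rng.normal(0.0, 0.1, (channels, channels))
                         for t in tasks}

    def forward(self, x, task):
        # identity + shared residual branch + task-specific adapter branch
        return x + relu(conv3x3(x, self.shared)) + conv1x1(x, self.adapters[task])

layer = AdaptedResidualLayer(channels=4, tasks=["edges", "normals"])
x = np.random.default_rng(1).normal(size=(4, 6, 6))
out_edges = layer.forward(x, "edges")
out_normals = layer.forward(x, "normals")
# Parameters of one adapter relative to the shared branch of this toy layer
adapter_fraction = layer.adapters["edges"].size / layer.shared.size
```

In this toy layer a 1×1 adapter holds one ninth of the parameters of the shared 3×3 branch; the fraction in a full network depends on the architecture and is not implied by this sketch.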
When employing disentangled computation graphs with feature modulation through SE modules and/or residual adapters, we also use task-specific batch-normalization layers, which come with a trivial increase in parameters (while the computational cost remains the same).
4 Adversarial Task Disentanglement
The idea behind the task-specific adaptation mechanisms described above is that even though a shared representation has better memory/computation complexity, every task can profit by having its own ‘space’, i.e. separate modelling capacity to make the best use of the representation - by modulating the features or adapting them with residual blocks.
Pushing this idea further we enforce a strict separation of the shared and task-specific processing, by requiring that the gradients used to train the shared parameters are statistically indistinguishable across tasks. This ensures that the shared backbone serves all tasks equally well, and is not disrupted e.g. by one task that has larger gradients.
We enforce this constraint through adversarial learning. Several methods, starting from Adversarial Domain Adaptation [Gan+16], use adversarial learning to remove any trace of a given domain from the learned mid-level features in a network; a technique called Adversarial multi-task training [LQH17] falls in the same category.
Instead of removing domain-specific information from the features of a network (which serves domain adaptation), we remove any task-specific trace from the gradients sent from different tasks to the shared backbone (which serves a division between shared and task-specific processing). A concurrent work [Sin+18] has independently proposed this idea.
| Database | Type | # Train Im. | # Test Im. | Edge | S.Seg | H. Parts | Normals | Saliency | Depth | Albedo |
As shown in Fig. 4 we use double back-propagation [DrLe91] to ‘expose’ the gradient sent from a task $t$ to a shared layer $l$, say $g_t(l)$. By exposing the variable $g_t(l)$ we mean that we unfold its computation graph, which in turn allows us to back-propagate through its computation. By back-propagating on the gradients we can force them to be statistically indistinguishable across tasks through adversarial training.
In particular we train a task classifier on top of the gradients lying at the interface of the task-specific and shared networks and use sign negation to make the task classifier fail [GaLe15]. This amounts to solving the following optimization problem in terms of the discriminator weights, $\theta_D$, and the network weights, $w$:
$$\min_{\theta_D} \max_{w} \; \mathcal{L}\big(D(g_t(w), \theta_D),\, t\big)$$
where $g_t(w)$ is the gradient of task $t$ computed with $w$, $D(\cdot, \theta_D)$ is the discriminator’s output for input $\cdot$, and $\mathcal{L}(\cdot, t)$ is the cross-entropy loss for label $t$ that indicates the source task of the gradient.
Intuitively this forces every task to do its own processing within its own blocks, so that it does not need from the shared network anything different from the other tasks. This results in a separation of the network into disentangled task-specific and shared compartments.
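One common way to optimise an adversarial objective of this kind is a gradient reversal step in the spirit of [GaLe15], which is what the sign negation refers to. The update below is only a sketch under assumed notation: $\theta_D$ are the discriminator weights, $w$ the network weights, $g_t(w)$ the gradient of task $t$, $D$ the discriminator and $\mathcal{L}$ the cross-entropy loss, while the learning rate $\eta$ and the reversal strength $\lambda$ are symbols introduced here for illustration.

```latex
\begin{aligned}
\theta_D &\leftarrow \theta_D - \eta \, \frac{\partial}{\partial \theta_D}
            \mathcal{L}\big(D(g_t(w), \theta_D),\, t\big), \\
w        &\leftarrow w + \eta \lambda \, \frac{\partial}{\partial w}
            \mathcal{L}\big(D(g_t(w), \theta_D),\, t\big).
\end{aligned}
```

The discriminator descends on the classification loss while the network ascends on it, in addition to its usual task-loss update. Because $g_t(w)$ is itself a gradient, the second line differentiates through a gradient computation, which is why double back-propagation is needed.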
| H. Parts | P. Context | mIoU | 64.3 | 64.9* [Che+17] |
5 Experimental Evaluation
Datasets We validate our approach on different datasets and tasks. We focus on dense prediction tasks that can be approached with fully convolutional architectures. Most of the experiments are carried out on the PASCAL [Eve+12] benchmark, which is popular for dense prediction tasks. We also conduct experiments on the smaller NYUD [Sil+12] dataset of indoor scenes, and the recent, large scale FSV [Krah18] dataset of synthetic images. Statistics, as well as the different tasks used for each dataset are presented in Table 1.
Base architecture: We use our re-implementation of Deeplab-v3+ [Che+18] as the base architecture of our method, due to its success on dense semantic tasks. Its architecture is based on a strong ResNet encoder, with a-trous convolutions to preserve reasonable spatial dimensions for dense prediction. We use the latest version that is enhanced with a parallel a-trous pyramid classifier (ASPP) and a powerful decoder. We refer the reader to [Che+18] for more details. The ResNet-101 backbone used in the original work is replaced with its Squeeze-and-Excitation counterpart (Fig. 3), pre-trained on ImageNet [Rus+15]. The pre-trained SE modules serve as an initialization point for the task-specific modulators for multi-tasking experiments.
The architecture is tested for a single task in various competitive benchmarks for dense prediction: edge detection, semantic segmentation, human part segmentation, surface normal estimation, saliency, and monocular depth estimation. We compare the results obtained with various competitive architectures. For edge detection we use the BSDS500 [Mar+01] benchmark and its optimal dataset F-measure (odsF) [MFM04]. For semantic segmentation we train on PASCAL VOC trainaug [Eve+12, Har+11] (10582 images), and evaluate on the validation set of PASCAL using mean intersection over union (mIoU). For human part segmentation we use PASCAL-Context [Che+14] and mIoU. For surface normals we train on the raw data of NYUD [Sil+12] and evaluate on the test set using mean error (mErr) in the predicted angles as the evaluation metric. For saliency we follow [Kokk17] by training on MSRA-10K [Che+15], testing on PASCAL-S [Li+14] and using the maximal F-measure (maxF) metric. Finally, for depth estimation we train and test on the fully annotated training set of NYUD using root mean squared error (RMSE) as the evaluation metric. For implementation details, and hyper-parameters, please refer to the Appendix.
Table 2 benchmarks our architecture against popular state-of-the-art methods. We obtain competitive results for all tasks. We emphasize that these benchmarks are inhomogeneous, i.e. their images are not annotated with all tasks, and they include domain shifts when training for multi-tasking (e.g. NYUD contains only indoor images). In order to isolate performance gains/drops as a result of multi-task learning (and not domain adaptation, or catastrophic forgetting), in the experiments that follow, we use homogeneous datasets.
Multi-task learning setup: We proceed to multi-tasking experiments on PASCAL. We keep the splits of PASCAL-Context, which provides labels for edge detection, semantic segmentation, and human part segmentation. In order to keep the dataset homogeneous and the architecture identical for all tasks, we did not use instance level tasks (detection, pose estimation) that are provided with the dataset. To increase the number of tasks we automatically obtained ground-truth for surface normals and saliency through label-distillation using pre-trained state-of-the-art models ([Ban+17] and [Che+18], respectively), since PASCAL is not annotated with those tasks. For surface normals, we masked out predictions from unknown and/or invalid classes (e.g. sky) during both training and testing. In short, our benchmark consists of 5 diverse tasks, ranging from low-level (edge detection, surface normals), to mid-level (saliency) and high-level (semantic segmentation, human part segmentation) tasks.
Evaluation metric: We compute the multi-tasking performance of method $m$ as the average per-task drop with respect to the single-tasking baseline $b$ (i.e. different networks trained for a single task each):
$$\Delta_m = \frac{1}{T} \sum_{i=1}^{T} (-1)^{l_i} \, \frac{M_{m,i} - M_{b,i}}{M_{b,i}}$$
where $T$ is the number of tasks, $l_i = 1$ if a lower value means better for measure $M_i$ of task $i$, and $l_i = 0$ otherwise. The average relative drop is computed against the baseline that uses the same backbone.
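The metric can be sketched in a few lines of Python, assuming it is the mean over tasks of the relative difference to the single-task baseline, with the sign flipped for measures where lower is better. The function name `average_relative_drop`, the dictionary layout and all numbers below are made up for illustration and are not results from the paper.

```python
def average_relative_drop(method, baseline, lower_is_better):
    """Mean signed relative change of `method` w.r.t. `baseline`, in percent.

    Negative values mean the method is worse than the baseline on average.
    """
    total = 0.0
    for task, base in baseline.items():
        relative = (method[task] - base) / base
        # Flip the sign for error-type measures, where lower means better
        sign = -1.0 if lower_is_better[task] else 1.0
        total += sign * relative
    return 100.0 * total / len(baseline)

# Made-up illustrative numbers, not results from the paper
baseline = {"semseg_mIoU": 60.0, "normals_mErr": 15.0}
method = {"semseg_mIoU": 57.0, "normals_mErr": 15.75}
lower_is_better = {"semseg_mIoU": False, "normals_mErr": True}

drop = average_relative_drop(method, baseline, lower_is_better)
# Both tasks are 5% worse than their baseline, so drop is approximately -5.0
```

Normalising by the baseline value makes measures with different units (mIoU, angular error, RMSE) comparable before they are averaged.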
To better understand the effect of different aspects of our method, we conduct a number of ablation studies and present the results in Table 3.
We construct a second baseline, which tries to learn all tasks simultaneously with a single network, by connecting task-specific convolutional classifiers (conv layers) at the end of the network. As also reported by [Kokk17], a non-negligible average performance drop can be observed (-6.6% per task for R-26 with SE). We argue that this drop is mainly triggered by conflicting gradients during learning.
Effects of modulation and adversarial training: Next, we introduce the modulation layers described in Section 3. We compare parallel residual adapters to SE (Table 3b) when used for task modulation. Performance per task recovers immediately by separating the computation used by each task during learning (-1.4 and -0.6 vs. -6.6 for adapters and SE, respectively). SE modulation results in better performance, while using slightly fewer parameters per task. We train a second variant where we keep the computation graph identical for all tasks in the encoder, while using SE modulation only in the decoder (Table 3c). Interestingly, this variant reaches the performance of residual adapters (-1.4), while being much more efficient in terms of number of parameters and computation, as only one forward pass of the encoder is necessary for all tasks.
In a separate experiment, we study the effects of adversarial training described in Section 4. We use a simple, fully convolutional discriminator to classify the source of the gradients. Results in Table 3d show that adversarial training is beneficial for multi-tasking, increasing performance compared to standard multi-tasking (-4.4 vs. -6.6). Even though the improvements are smaller than those obtained with modulation, they come without extra parameters or computational cost at inference, since the discriminator is used only during training.
The combination of SE modulation with adversarial training (Table 3d) leads to additional improvements (a drop of only 0.1% with respect to the single-task baseline), while further adding residual adapters surpasses single-tasking (+0.45%), at the cost of 12.3% more parameters per task (Fig. 5).
Deeper Architectures: Table 3e shows how modulation and adversarial training perform when coupled with deeper architectures (R-50 and R-101). The results show that the gains do not depend on the depth of the backbone: our method consistently improves on the standard multi-tasking results.
Resource Analysis: Figure 5 illustrates the performance of each variant as a function of the number of parameters, as well as the FLOPS (multiply-adds) used during inference. We plot the relative average per-task performance compared to the single-tasking R-101 variant (blue cross), for the 5 tasks of PASCAL. Different colors indicate different backbone architectures. We see a clear drop in performance by standard multi-tasking (crosses vs. circles), but with fewer parameters and FLOPS. Improvements due to adversarial training come free of cost (triangles) with only a small overhead for the discriminator during training.
Including modulation comes with significant improvements, but also with a very slight increase of parameters and a slight increase of computational cost when including the modules on the decoder (rectangles). The increase becomes more apparent when including those modules in the encoder as well (diamonds). Our most accurate variant using all of the above (hexagons) outperforms the single-tasking baselines by using only a fraction of their parameters.
We note that the memory and computational complexities of the SE blocks and the adapters themselves are negligible, but since they affect the outputs of the layer we cannot share the computation of the ensuing layers across tasks, which explains the increased number of multiply-adds.
Learned Disentangled Representations: In order to highlight the importance of task modulation, we plot the learned representations for different tasks in various depths of our network. Figure 6 shows the t-SNE representations [MaHi08] of the SE activations in equally spaced levels of the network. The activations are averaged for the first 32 samples of the validation set, following [HSS18], and they are sorted per task. The resulting plots show that in the early stages of the network the learned representations are almost identical. They gradually become increasingly different as depth grows, until they are completely different for different tasks at the level of the classifier. We argue that this disentanglement of learned representations also translates to performance gains, as shown in Table 3.
Validation on additional datasets: We validate our approach on two additional datasets, NYUD [Sil+12] and FSV [Krah18]. NYUD is an indoor dataset, annotated with labels for instance edge detection, semantic segmentation into 41 classes, surface normals, and depth. FSV is a large-scale synthetic dataset, labelled with semantic segmentation (3 classes), albedo, and depth (disparity).
Table 4 presents our findings for both datasets. As in PASCAL, when we try to learn all tasks together, we observe a non-negligible drop compared to the single-tasking baseline. Performance recovers when we plug in modulation and adversarial training. Interestingly, in NYUD and FSV we observe larger improvements compared to PASCAL. Our findings are consistent with related works [Xu+18, EiFe15] which report improved results for multi-tasking when using depth and semantics.
Figures 7 and 8 illustrate some qualitative examples, obtained by our method on PASCAL and NYUD, respectively. Results in each row are obtained with a single network. We compare our best model to the baseline architecture for multi-tasking (without per-task modulation or adversarial training). We observe a quality degradation in the results of the baseline. Interestingly, some errors are clearly a result of standard multi-tasking: edge features appear during saliency estimation in Fig. 7, and predicted semantic labels change on the pillows, in areas where the surface normals change, in Fig. 8. In contrast, our method provides disentangled predictions that are able to recover from such issues, and that reach or even surpass the single-tasking baselines.