Modeling Camera Effects to Improve Deep Vision for Real and Synthetic Data
Recent work has focused on generating synthetic imagery and augmenting real imagery to increase the size and variability of training data for learning visual tasks in urban scenes. This includes increasing the occurrence of occlusions or varying environmental and weather effects. However, few have addressed modeling the variation in the sensor domain. Unfortunately, varying sensor effects can degrade performance and generalizability of results for visual tasks trained on human annotated datasets. This paper proposes an efficient, automated physically-based augmentation pipeline to vary sensor effects – specifically, chromatic aberration, blur, exposure, noise, and color cast – across both real and synthetic imagery. In particular, this paper illustrates that augmenting training datasets with the proposed pipeline improves the robustness and generalizability of object detection on a variety of benchmark vehicle datasets.
Keywords: Deep learning, image augmentation, object detection.
1 Introduction
Deep learning has enabled performance increases across a range of computer vision tasks. Increasing size and variation of labeled training datasets has contributed to these improvements, with some benchmark datasets providing millions of hand-labeled images for training deep neural networks (DNNs) [1, 2]. Ideally, we could compile a large, comprehensive training set that is representative of all domains and is labelled for all visual tasks. Unfortunately, it is expensive and time-consuming to collect and label large amounts of training data. Furthermore, it is impossible to gather a single real dataset that captures all of the variability that exists in the real world.
Recent work has shown success in training DNNs with synthetic data and testing on real data. Rendering engines can be used to generate large amounts of synthetic data that look highly photorealistic. Pixel-wise labels can be generated automatically, greatly reducing the cost and effort it takes to create ground truth labels for different tasks. Augmenting real data is another way to increase dataset size without requiring additional manual labels. Both synthetic rendering and augmentation pipelines seek to increase variability of scene features across an image set. In particular, recent research has focused on modeling environmental effects such as scene lighting, time of day, scene background, weather, and occlusions in an effort to increase the representation of these factors in training sets, thereby increasing robustness to these cases during test time [5, 6]. Another approach is to increase the occurrence of objects of interest, in order to provide more examples during training of those objects in different scenes and spatial configurations [4, 7].
However, even with varying spatial layout and environmental factors there remain challenges to achieving robustness and generalizability of results. To further understand the gaps between synthetic and real datasets, and even between different real datasets, it is worthwhile to consider the failure modes of DNNs in learning visual tasks. One factor that has been shown to contribute to degradation of performance and cross-dataset generalization for various benchmark datasets is sensor bias [8, 9, 10, 11]. The interaction between the camera model and lighting in the environment can greatly influence the pixel-level artifacts, distortions, and dynamic range induced in each image [12, 13, 14]. Sensor effects including blur and overexposure have been shown to decrease performance of object detection networks in urban driving scenes, as shown in the left column of Figure 1. Still, there exists a gap in the literature in how to mitigate failure modes due to sensor effects for learned visual tasks in the wild.
In this work, we explore the influence of varying sensor models on DNN performance on computer vision tasks for autonomous driving in urban scenes. We propose to model information loss due to sensor effects through a novel image augmentation pipeline. Our augmentation pipeline is based on effects that occur in image formation and processing that can produce failure modes in learning frameworks – chromatic aberration, blur, exposure, noise and color cast. We aim to achieve robustness to these effects in our learning framework by training on data that contains a representative set of realistic sensor effects. We augment both real and synthetic data to show that our proposed method improves performance for object detection in vehicle datasets (Figure 1). Software and datasets will be made publicly available upon completion of blind review.
2 Related Work
2.0.1 Synthetic datasets:
Rendering and gaming engines have been used to synthesize large, labelled datasets that contain a wide variety of environmental factors that could not be feasibly captured during real data collection [3, 16]. Such factors include time of day, weather, and community architecture. However, synthetic datasets need to be orders of magnitude larger than real datasets to achieve the same level of performance achieved when training on real data, and often require fine-tuning on real data regardless. In comparison, the method we propose can augment synthetic data to achieve higher performance with smaller datasets.
2.0.2 Augmenting synthetic data:
Shrivastava et al. recently developed SimGAN, a generative adversarial network (GAN) to augment synthetic data to appear more realistic. They evaluated their method on the tasks of gaze estimation and hand pose estimation. Similarly, Sixt et al. proposed RenderGAN, a generative network that uses structured augmentation functions to augment synthetic images of markers attached to honeybees. The augmented images are used to train a detection network to track the honeybees. Both of these approaches focus on image sets that are homogeneously structured and low resolution. We instead focus on the application of autonomous driving, which features highly varied scenes and environmental conditions. Our approach also works on high resolution imagery.
2.0.3 Augmenting real data:
Alhaija et al. demonstrate that augmenting real data with rendered cars improves results for object detection with Faster R-CNN. Although not the focus of their work, their results show that augmented images that are post-processed with hand-tuned chromatic aberration, color curve shifts, and motion blur effects yield a significant performance boost. Our work builds on this result and exploits variation in sensor models to boost both performance and cross-dataset generalization.
2.0.4 Geometric and photometric data augmentation:
Geometric augmentations such as rotation, translation, and mirroring have become commonplace in deep learning for achieving invariance to spatial factors that do not affect an object’s classification. Krizhevsky et al. introduced fancy PCA to perform color jittering, in which RGB intensities are slightly changed to achieve invariance to differing illumination color and intensity. These augmentations induce small changes that do not produce the loss of information that can occur due to varying sensor effects. In contrast, our augmentations are modeled directly from real sensor effects and can induce large changes in the input data that mimic the loss of information that occurs in real data.
2.0.5 Sensor effects in learning:
More generally, recent work has demonstrated that elements of the image formation and processing pipeline can have a large impact upon learned representation [21, 22, 10]. Andreopoulos and Tsotsos demonstrate the sensitivities of popular vision algorithms under variable illumination, shutter speed, and gain. Doersch et al. show there is dataset bias introduced by chromatic aberration in visual context prediction and object recognition tasks. They correct for chromatic aberration to eliminate this bias. Diamond et al. demonstrate that blur and noise degrade neural network performance on classification tasks. They propose an end-to-end denoising and deblurring neural network framework that operates directly on raw image data. Rather than correcting for the effects of the camera during image formation, we propose to augment both real and synthetic images to simulate these effects. As many of these effects can lead to loss of information, correcting for them is non-trivial and may result in the hallucination of visual information in the restored image.
3 Sensor-based Image Augmentation
Figure 2 shows the architecture of the proposed sensor-based image augmentation pipeline. We consider a general camera framework, which transforms radiant light captured from the environment into an image. There are several stages that comprise the process of image formation and post-processing, as shown in the first row of Figure 2. The incoming light is first focused by the camera lens to be incident upon the camera sensor. Then the camera sensor transforms the incident light into RGB pixel intensity. On-board camera software manipulates the image (e.g., color space conversion and dynamic range compression) to produce the final output image. At each stage of the image formation pipeline, loss of information can occur and degrade the image. Lens effects can introduce visual distortions in an image, such as chromatic aberration and blur. Sensor effects can introduce over- or under-saturation, depending on exposure, and high-frequency pixel artifacts, based on characteristic sensor noise. Lastly, post-processing effects are implemented to shift the color cast to create a desirable output. Our image augmentation pipeline focuses on five sensor effect augmentations to model the loss of information that can occur at each stage during image formation and post-processing: chromatic aberration, blur, exposure, noise, and color shift. We implement the image processing pipeline as a composition of physically-based augmentation functions across these five effects:
$$I_{\mathrm{aug}} = \phi_{\mathrm{color}}(\phi_{\mathrm{noise}}(\phi_{\mathrm{exposure}}(\phi_{\mathrm{blur}}(\phi_{\mathrm{chrom.ab.}}(I)))))$$
where $I$ is the input image and $I_{\mathrm{aug}}$ is the augmented image.
Note that these chosen augmentation functions are not exhaustive, and are meant to approximate the camera image formation pipeline. Each augmentation function is described in detail in the following subsections.
3.1 Chromatic Aberration
Chromatic aberration is a lens effect that causes color distortions, or fringes, along edges that separate dark and light regions within an image. There are two types of chromatic aberration, longitudinal and lateral, both of which can be modeled by geometrically warping the color channels with respect to one another. Longitudinal chromatic aberration occurs when different wavelengths of light converge on different points along the optical axis, effectively magnifying the RGB channels relative to one another. We model this aberration type by scaling the green color channel of an image by a value $S$. Lateral chromatic aberration occurs when different wavelengths of light converge to different points within the image plane. We model this by applying translations $(t_x, t_y)$ to each of the color channels of an image. We combine these two effects into the following affine transformation, which is applied to each pixel location $(x, y)$ in a given color channel $C$ of the image, with $S = 1$ for the red and blue channels:
$$\begin{bmatrix} x_{ab}^{C} \\ y_{ab}^{C} \\ 1 \end{bmatrix} = \begin{bmatrix} S & 0 & t_x \\ 0 & S & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x^{C} \\ y^{C} \\ 1 \end{bmatrix}$$
This image warping is implemented using the spatial transformer layer of Jaderberg et al.
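As a rough illustration of the channel warping described above, the following NumPy sketch scales the green channel and translates each channel using inverse nearest-neighbor mapping. The function names, the choice to scale about the image center, and the nearest-neighbor sampling are assumptions made for this sketch; the paper itself relies on a spatial transformer layer.

```python
import numpy as np


def warp_channel(channel, scale, tx, ty):
    """Warp one 2D channel by x' = scale * (x - cx) + cx + tx (same for y)."""
    h, w = channel.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0  # assumption: scale about the center
    # Inverse mapping: for every output pixel, find its source location.
    src_x = (xs - tx - cx) / scale + cx
    src_y = (ys - ty - cy) / scale + cy
    # Nearest-neighbor sampling with edge clamping keeps the sketch short.
    src_x = np.clip(np.rint(src_x), 0, w - 1).astype(np.intp)
    src_y = np.clip(np.rint(src_y), 0, h - 1).astype(np.intp)
    return channel[src_y, src_x]


def chromatic_aberration(img, scale_g=1.0, shifts=((0, 0), (0, 0), (0, 0))):
    """img: H x W x 3 RGB array; shifts: (tx, ty) per channel in R, G, B order."""
    out = np.empty_like(img)
    for c, (tx, ty) in enumerate(shifts):
        scale = scale_g if c == 1 else 1.0  # only the green channel is scaled
        out[..., c] = warp_channel(img[..., c], scale, tx, ty)
    return out
```

With the default arguments the image is returned unchanged; a call such as `chromatic_aberration(img, scale_g=1.002, shifts=((1, 0), (0, 0), (-1, 0)))` would produce mild color fringing, although these particular values are illustrative only.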
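3.2 Blur
While there are several types of blur that occur in image-based datasets, we focus on out-of-focus blur, which can be modeled using a Gaussian filter:
$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$
where $x$ and $y$ are spatial coordinates of the filter and $\sigma$ is the standard deviation. The output image is given by:
$$I_{\mathrm{blur}} = I * G$$
where $*$ denotes convolution.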
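A minimal NumPy sketch of out-of-focus blur as convolution with a Gaussian filter follows. It exploits the separability of the 2D Gaussian and filters along each spatial axis in turn. The truncation radius of about three standard deviations, the edge padding, and the function names are assumptions of this sketch rather than details given in the text.

```python
import numpy as np


def gaussian_kernel_1d(sigma, radius=None):
    """Sampled, normalized 1D Gaussian; radius defaults to about 3 sigma."""
    if radius is None:
        radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(img, sigma):
    """Blur an H x W or H x W x C array; returns float64 of the same shape."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    out = np.asarray(img, dtype=np.float64)
    for axis in (0, 1):  # the 2D Gaussian is separable: filter each spatial axis
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")  # assumption: replicate borders
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out
```

Because the kernel is normalized to sum to one, a constant image passes through unchanged, while larger values of `sigma` remove progressively more high-frequency detail.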
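3.3 Exposure
To model exposure, we use a sigmoidal exposure density function:
$$I = f(S) = \frac{255}{1 + e^{-A \cdot S}}$$
where $I$ is image intensity, $S$ indicates incoming light intensity, or exposure, and $A$ is a constant value for contrast. We use this model to re-expose an image as follows:
$$S' = f^{-1}(I) + \Delta S$$
$$I_{\mathrm{exp}} = f(S')$$
We vary $\Delta S$ to model changing exposure, where a positive $\Delta S$ relates to increasing the exposure, which can lead to over-saturation, and a negative value indicates decreasing exposure.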
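The re-exposure step can be sketched as follows, assuming a sigmoid mapping from exposure to intensity, f(S) = 255 / (1 + exp(-A S)), of the kind used in the exposure-correction literature listed in the references. The sigmoid form, the function name `reexpose`, the default contrast value, and the clipping bounds are all assumptions of this sketch, not settings confirmed by the text.

```python
import numpy as np


def reexpose(img, delta_s, contrast=0.85):
    """Shift exposure by delta_s under the assumed sigmoid model.

    img holds intensities in [0, 255]; contrast plays the role of the
    contrast constant in the text (0.85 is only a placeholder default).
    """
    # Keep intensities strictly inside (0, 255) so the inverse stays finite.
    i = np.clip(np.asarray(img, dtype=np.float64), 1.0, 254.0)
    # Invert the sigmoid to recover the exposure of each pixel.
    s = -np.log(255.0 / i - 1.0) / contrast
    # Apply the exposure shift and map back to intensity.
    return 255.0 / (1.0 + np.exp(-contrast * (s + delta_s)))
```

A positive `delta_s` pushes intensities toward 255 (over-saturation) and a negative value darkens the image, matching the behavior described above.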
3.4 Noise
The sources of image noise caused by elements of the sensor array can be modeled as either signal-dependent or signal-independent noise. Therefore, we use the Poisson-Gaussian noise model proposed by Foi et al.:
$$I_{\mathrm{noise}}(x, y) = I(x, y) + \eta_{\mathrm{poiss}}(I(x, y)) + \eta_{\mathrm{gauss}}$$
where $I(x, y)$ is the ground truth image at pixel location $(x, y)$, $\eta_{\mathrm{poiss}}$ is the signal-dependent Poisson noise, and $\eta_{\mathrm{gauss}}$ is the signal-independent Gaussian noise. We sample the noise for each pixel based upon its location in a GBRG Bayer grid array, assuming bilinear interpolation as the demosaicing function.
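A simplified sketch of Poisson-Gaussian noise is given below. It adds a scaled Poisson term (signal-dependent) and a zero-mean Gaussian term (signal-independent) directly to each pixel, so the noise variance is approximately `a * I + b`. It does not reproduce the Bayer-grid sampling or demosaicing step mentioned above, and the parameter names `a` and `b` and their default values are placeholders chosen for illustration.

```python
import numpy as np


def poisson_gaussian_noise(img, a=0.5, b=4.0, rng=None):
    """Add noise with variance a * I + b to intensities in [0, 255].

    a scales the signal-dependent (Poisson) part and b is the variance of
    the signal-independent (Gaussian) part; both defaults are placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.clip(np.asarray(img, dtype=np.float64), 0.0, 255.0)
    # a * Poisson(I / a) has mean I and variance a * I.
    shot = a * rng.poisson(clean / a)
    # Zero-mean Gaussian noise with variance b, independent of the signal.
    read = rng.normal(0.0, np.sqrt(b), size=clean.shape)
    return np.clip(shot + read, 0.0, 255.0)
```

Passing a seeded generator, for example `rng=np.random.default_rng(0)`, makes the augmentation reproducible.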
3.5 Post-processing
In standard camera pipelines, post-processing techniques, such as white balancing or gamma transformation, are nonlinear color corrections performed on the image to compensate for the presence of different environmental illuminants. These post-processing methods are generally proprietary and cannot be easily characterized. We model these effects by performing translations in the CIELAB color space, also known as L*a*b* space, to remap the image tonality to a different range. Given that our chosen datasets are all taken outdoors during the day, we assume a D65 illuminant in our L*a*b* color space conversion.
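The color shift can be sketched as a translation in L*a*b* space using the standard sRGB to XYZ to L*a*b* conversion with a D65 white point. The function names and the per-image constant offsets `d_l`, `d_a`, `d_b` are assumptions of this sketch; the text does not specify how the translations are parameterized.

```python
import numpy as np

# Linear sRGB -> CIE XYZ matrix and the D65 reference white.
RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                       [0.2126729, 0.7151522, 0.0721750],
                       [0.0193339, 0.1191920, 0.9503041]])
XYZ_TO_RGB = np.linalg.inv(RGB_TO_XYZ)
WHITE_D65 = np.array([0.95047, 1.0, 1.08883])
DELTA = 6.0 / 29.0


def rgb_to_lab(rgb):
    """rgb: (..., 3) sRGB values in [0, 255] -> L*a*b* values."""
    c = np.asarray(rgb, dtype=np.float64) / 255.0
    lin = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    t = (lin @ RGB_TO_XYZ.T) / WHITE_D65
    f = np.where(t > DELTA ** 3, np.cbrt(t), t / (3 * DELTA ** 2) + 4.0 / 29.0)
    return np.stack([116 * f[..., 1] - 16,
                     500 * (f[..., 0] - f[..., 1]),
                     200 * (f[..., 1] - f[..., 2])], axis=-1)


def lab_to_rgb(lab):
    """Inverse of rgb_to_lab; out-of-gamut values are clipped to [0, 255]."""
    fy = (lab[..., 0] + 16) / 116
    f = np.stack([fy + lab[..., 1] / 500, fy, fy - lab[..., 2] / 200], axis=-1)
    t = np.where(f > DELTA, f ** 3, 3 * DELTA ** 2 * (f - 4.0 / 29.0))
    lin = np.clip((t * WHITE_D65) @ XYZ_TO_RGB.T, 0.0, 1.0)
    c = np.where(lin <= 0.0031308, 12.92 * lin, 1.055 * lin ** (1 / 2.4) - 0.055)
    return np.clip(c, 0.0, 1.0) * 255.0


def color_shift(rgb, d_l=0.0, d_a=0.0, d_b=0.0):
    """Translate an image in L*a*b* space and convert back to sRGB."""
    return lab_to_rgb(rgb_to_lab(rgb) + np.array([d_l, d_a, d_b]))
```

For example, a positive `d_b` moves the image toward yellow and a negative `d_b` toward blue, while `d_l` changes lightness.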
3.6 Generating Augmented Training Data
Using the above augmentation pipeline, we manipulated the input parameters by hand to determine parameter ranges that yield visually realistic images. To augment images, we randomly sample in these visually realistic parameter ranges and input an unaugmented image to the augmentation pipeline. Figures 3, 4 and 5 show sample images augmented with individual sensor effects as well as our full proposed sensor-based image augmentation pipeline.
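The sampling procedure described above might look like the following sketch, which draws each parameter uniformly from a fixed range and applies effects in a fixed order. The parameter names, the ranges, the uniform distribution, and the two stand-in effects (a multiplicative gain and additive Gaussian noise) are hypothetical; the hand-tuned ranges are not given in the text.

```python
import numpy as np

# Hypothetical parameter ranges; the hand-tuned ranges are not given in the text.
PARAM_RANGES = {
    "exposure_gain": (0.7, 1.3),
    "noise_std": (0.0, 5.0),
}


def sample_params(rng, ranges=PARAM_RANGES):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def augment(img, params, rng):
    """Apply stand-in effects in a fixed order, mirroring a composed pipeline."""
    out = np.asarray(img, dtype=np.float64)
    out = out * params["exposure_gain"]  # stand-in for the exposure effect
    out = out + rng.normal(0.0, params["noise_std"], out.shape)  # stand-in noise
    return np.clip(out, 0.0, 255.0)


def augment_dataset(images, seed=0):
    """Augment every image with freshly sampled parameters."""
    rng = np.random.default_rng(seed)
    return [augment(img, sample_params(rng), rng) for img in images]
```

Labels are left untouched by this kind of augmentation, which is why the augmented images can be added to a training set without further annotation.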
4 Experiments
We focus on the task of object detection on benchmark vehicle datasets to evaluate the validity of our sensor-based image augmentation pipeline. We evaluate our image augmentation pipeline on multiple real and synthetic training datasets, each of which contains a different distribution of specific sensor effects and image quality. To evaluate our method in the real domain, we augment two benchmark vehicle datasets, KITTI and Cityscapes, using the proposed method. These datasets share many spatial and environmental visual features: both are captured during similar times of day, in similar weather conditions, and in cities regionally close together, with the camera located on a car pointing at the road. In spite of these similarities, images from these datasets are visibly different (a side-by-side comparison can be found in Figure 6). This suggests that these two real datasets differ in their global pixel statistics. Qualitatively, KITTI images feature more pronounced effects due to blur and over-exposure. Cityscapes has a distinct color cast compared to KITTI. For synthetic data, we use Virtual KITTI (VKITTI), which features over 21000 images and is designed to model the spatial layout of KITTI with varying environmental factors such as weather and time of day. We also augment Grand Theft Auto (GTA), which features 25000 images and is noted for its high quality and increased photorealism compared to VKITTI.
4.1 Performance on Object Detection Benchmarks
To evaluate the proposed augmentation method for 2D object detection, we used Faster R-CNN as our base network. Faster R-CNN achieves relatively high performance on the KITTI benchmark test dataset, and many state-of-the-art object detection networks that improve upon these results use Faster R-CNN as their base architecture. For all datasets, we trained each Faster R-CNN network for 10 epochs using four Titan X Pascal GPUs in order to control for potential confounds between performance and training time. We provide experiments trained on the full training datasets, as well as experiments trained on subsets of 2975 images to allow comparison of performance across different datasets. All of the trained networks are tested on a held-out validation set of 1480 images from the KITTI training data, and we report the Pascal VOC average precision (AP) value for the car class. We also report the gain in AP, which is the difference in performance relative to the baseline (unaugmented) dataset. Table 2 shows results trained on synthetic data (VKITTI and GTA) and Table 1 shows results trained on real data (KITTI and Cityscapes).
For both synthetic and real training sets, data augmented with the proposed method sees significant performance gains over the baseline (unaugmented) datasets. The gain is higher when augmenting synthetic data with sensor effects compared to augmenting real data. This is expected because, in general, synthetic data does not model sensor effects such as noise, blur, and chromatic aberration as accurately as our proposed approach.
Another important result for the synthetic datasets (both VKITTI and GTA) is that, by leveraging our approach, we are able to outperform the networks trained on over 20000 unaugmented images with a small subset of 2975 images augmented using our approach. This means not only that networks can be trained faster, but also that, when training with synthetic data, varying camera effects can outweigh the value of simply generating more data with varied spatial features.
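For readers unfamiliar with the metric, the sketch below computes average precision from a precision-recall curve using the all-points interpolation adopted in later versions of the Pascal VOC development kit, together with the gain relative to a baseline. The text does not state which VOC variant or IoU threshold was used, so this choice, like the function names, is an assumption.

```python
import numpy as np


def average_precision(recall, precision):
    """Area under the precision-recall curve with a monotone precision envelope.

    recall and precision are sequences ordered by decreasing detection score.
    """
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    # Make precision non-increasing when read from left to right.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas at the points where recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))


def ap_gain(ap_augmented, ap_baseline):
    """Gain is the AP of the augmented run minus the AP of the baseline."""
    return ap_augmented - ap_baseline
```

As a small worked case, a detector with recall `[0.5, 1.0]` and precision `[1.0, 0.5]` has an AP of 0.5 × 1.0 + 0.5 × 0.5 = 0.75 under this definition.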
The VKITTI baseline dataset tested on KITTI performs relatively well compared to GTA, even though GTA is a more photorealistic dataset. This can most likely be attributed to the similarity in spatial layout and image features between VKITTI and KITTI. With our proposed approach, VKITTI gives comparable performance to the network trained on the Cityscapes baseline, showing that synthetic data augmented with our proposed sensor-based image pipeline can perform comparably to real data for cross-dataset generalization.
4.2 Performance on Semantic Segmentation Benchmarks and Comparison to Domain Transfer
We also tested our proposed method on a semantic segmentation network to demonstrate that sensor-based image augmentations impact other visual tasks beyond object detection. For training, we use the baseline Cityscapes dataset and the baseline GTA5 dataset. We augment each baseline dataset with our proposed image augmentation pipeline. We used the Fully Convolutional Network FCN-8s as our base network structure. We initialized the FCN-8s network with weights from the VGG-16 model trained on ImageNet, and fine-tuned on the baseline data and our augmented data. While this paper does not directly address the domain transfer problem, we compare against state-of-the-art domain transfer approaches to contextualize our results. Table 3 shows semantic segmentation results. Note that the mean IOU gain of our approach is similar to that achieved by state-of-the-art domain transfer methods. These works alter the underlying network structure to shift the learned representation of the source and target datasets. These approaches have access to both source and target data during training and are explicitly optimizing to improve results in the new domain. Our work focuses on augmenting the training data directly so that it can be input into any network regardless of the architecture. Our approach and a direct domain adaptation approach could also be used in series, which may further boost performance. Future work will conduct more extensive experiments on semantic segmentation and the combination of both paradigms.
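The mean IOU metric referred to above can be computed from a confusion matrix, as in the following sketch. The function name and the convention of skipping classes that appear in neither the prediction nor the ground truth are assumptions; benchmark toolkits may handle ignored labels differently.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for integer label maps."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0  # skip classes absent from both prediction and ground truth
    return float(np.mean(inter[valid] / union[valid]))
```

For instance, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, giving a mean of 7/12.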
4.3 Ablation Study
Table 4: Ablation results for networks trained on synthetic data and tested on KITTI (AP for the car class; gain is relative to the unaugmented baseline).
Trained on VKITTI:
| Training Set | Augmentation Type | AP | Gain |
| 2975 Prop. Method | Chrom. Ab. | 61.08 | 6.48 |
| 2975 Prop. Method | Blur | 59.72 | 5.12 |
| 2975 Prop. Method | Exposure | 57.37 | 2.77 |
| 2975 Prop. Method | Sensor Noise | 58.60 | 4.00 |
| 2975 Prop. Method | Color Shift | 58.59 | 3.99 |
Trained on GTA:
| Training Set | Augmentation Type | AP | Gain |
| 2975 Prop. Method | Chrom. Ab. | 48.92 | 2.09 |
| 2975 Prop. Method | Blur | 49.17 | 2.34 |
| 2975 Prop. Method | Exposure | 47.95 | 1.12 |
| 2975 Prop. Method | Sensor Noise | 48.09 | 1.26 |
| 2975 Prop. Method | Color Shift | 48.61 | 1.78 |
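Table 5: Ablation results for networks trained on real data and tested on KITTI (AP for the car class; gain is relative to the unaugmented baseline).
Trained on KITTI:
| Training Set | Augmentation Type | AP | Gain |
| 2975 Prop. Method | Chrom. Ab. | 79.72 | 0.60 |
| 2975 Prop. Method | Blur | 79.97 | 0.85 |
| 2975 Prop. Method | Exposure | 80.35 | 1.23 |
| 2975 Prop. Method | Sensor Noise | 80.74 | 1.62 |
| 2975 Prop. Method | Color Shift | 80.23 | 1.11 |
Trained on Cityscapes:
| Training Set | Augmentation Type | AP | Gain |
| 2975 Prop. Method | Chrom. Ab. | 63.25 | 0.56 |
| 2975 Prop. Method | Blur | 62.10 | -0.59 |
| 2975 Prop. Method | Exposure | 62.28 | -0.41 |
| 2975 Prop. Method | Sensor Noise | 62.81 | 0.12 |
| 2975 Prop. Method | Color Shift | 64.14 | 1.45 |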
To evaluate the contribution of each sensor effect augmentation to performance, we used the proposed method to generate datasets that had only one sensor effect augmentation. We trained Faster R-CNN on each of these datasets augmented with single augmentation functions. Tables 4 and 5 show the resulting AP for each network for synthetic and real data, respectively.
Performance increases across all ablation experiments for training on synthetic data. This further supports our hypothesis that sensor effects are important for closing the gap between synthetic and real data. For ablation experiments trained on real data, in general, the performance increases, although blur and exposure show slight decreases in performance for networks trained on Cityscapes and tested on KITTI. Note that for augmenting Cityscapes, the color shift contributes the most to improving performance. As mentioned above, baseline Cityscapes has a noticeably distinct color cast from KITTI, and the color cast across the Cityscapes dataset is less varied in general. This result demonstrates that increasing variation in color cast across the Cityscapes dataset has a relatively high impact on cross-dataset generalization.
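Since gain is defined as the difference from the unaugmented baseline, each ablation row implies a baseline value of AP minus gain, and all rows for one training set should agree. The sketch below checks this for the ablation rows in the 62 to 64 AP range, which are assumed here to be the Cityscapes-trained rows; the blur and exposure gains are entered as negative, following the slight decreases reported in the text. The dictionary and function names are illustrative.

```python
# (AP, gain) pairs for the rows assumed to be the Cityscapes-trained ablation;
# blur and exposure gains are negative, following the decreases noted in the text.
cityscapes_rows = {
    "Chrom. Ab.": (63.25, 0.56),
    "Blur": (62.10, -0.59),
    "Exposure": (62.28, -0.41),
    "Sensor Noise": (62.81, 0.12),
    "Color Shift": (64.14, 1.45),
}


def implied_baselines(rows):
    """Baseline AP implied by each row: AP minus gain, rounded to 2 decimals."""
    return {name: round(ap - gain, 2) for name, (ap, gain) in rows.items()}
```

With these signs every row implies the same baseline AP, which is the consistency one would expect if all five networks are compared against a single unaugmented model.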
4.4 Failure Mode Analysis
Figures 7 and 8 show qualitative results of failure modes across each training dataset, where a blue bounding box indicates a correct detection and a red bounding box indicates a missed detection for the baseline that was correctly detected by our proposed method. Qualitatively, it appears that our method more reliably detects instances of cars that are small in the image, in particular in the far background, at a scale in which the pixel statistics of the image are more pronounced. Note that our method also improves performance on car detections for cases where the image is over-saturated due to increased exposure, which we are directly modeling through our proposed augmentation pipeline. Additionally, our method produces improved detections for other effects that obscure the presence of a car, such as occlusion and shadows, even though we do not directly model these effects. This may be attributed to increased robustness to effects that lead to loss of information about an object in general.
5 Conclusion
We have proposed a novel sensor-based image augmentation pipeline for augmenting training data input to DNNs for the task of object detection in urban driving scenes. Our augmentation pipeline models a range of physically-realistic sensor effects that occur throughout the image formation and post-processing pipeline. These effects were chosen as they lead to loss of information or distortion of a scene, which degrades network performance on learned visual tasks. By training on our augmented datasets, we can effectively increase dataset size and variation in the sensor domain, without the need for further labeling, in order to improve robustness and generalizability of resulting object detection networks. We achieve significantly improved performance across a range of benchmark vehicle datasets, including training with both real and synthetic data. Overall, our results reveal insight into the importance of modeling sensor effects for the specific problem of training on synthetic data and testing on real data.
References
- [1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012) 1097–1105
- [2] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
- [3] Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.M.: The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 3234–3243
- [4] Alhaija, H.A., Mustikovela, S.K., Mescheder, L., Geiger, A., Rother, C.: Augmented reality meets deep learning for car instance segmentation in urban scenes. In: Proceedings of the British Machine Vision Conference. Volume 3. (2017)
- [5] Zhang, H., Sindagi, V., Patel, V.M.: Image de-raining using a conditional generative adversarial network. arXiv preprint arXiv:1701.05957 (2017)
- [6] Veeravasarapu, V., Rothkopf, C., Visvanathan, R.: Adversarially tuned scene generation. arXiv preprint arXiv:1701.00405 (2017)
- [7] Huang, S., Ramanan, D.: Expecting the unexpected: Training detectors for unusual pedestrians with adversarial imposters. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 4664–4673
- [8] Andreopoulos, A., Tsotsos, J.K.: On sensor bias in experimental methods for comparing interest-point, saliency, and recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 110–126
- [9] Song, S., Lichtenberg, S.P., Xiao, J.: Sun rgb-d: A rgb-d scene understanding benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 567–576
- [10] Dodge, S., Karam, L.: Understanding how image quality affects deep neural networks. In: Quality of Multimedia Experience (QoMEX), 2016 Eighth International Conference on, IEEE (2016) 1–6
- [11] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1422–1430
- [12] Grossberg, M.D., Nayar, S.K.: Modeling the space of camera response functions. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 1272–1282
- [13] Couzinie-Devy, F., Sun, J., Alahari, K., Ponce, J.: Learning to estimate and remove non-uniform image blur. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2013) 1075–1082
- [14] Foi, A., Trimeche, M., Katkovnik, V., Egiazarian, K.: Practical poissonian-gaussian noise modeling and fitting for single-image raw-data. IEEE Transactions on Image Processing 17 (2008) 1737–1754
- [15] Ramanagopal, M.S., Anderson, C., Vasudevan, R., Johnson-Roberson, M.: Failing to learn: Autonomously identifying perception failures for self-driving cars. CoRR abs/1707.00051 (2017)
- [16] Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4340–4349
- [17] Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S.N., Rosaen, K., Vasudevan, R.: Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE (2017) 746–753
- [18] Sankaranarayanan, S., Balaji, Y., Jain, A., Lim, S., Chellappa, R.: Unsupervised domain adaptation for semantic segmentation with gans. CoRR abs/1711.06969 (2017)
- [19] Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. arXiv preprint arXiv:1612.07828 (2016)
- [20] Sixt, L., Wild, B., Landgraf, T.: Rendergan: Generating realistic labeled data. arXiv preprint arXiv:1611.01331 (2016)
- [21] Kanan, C., Cottrell, G.W.: Color-to-grayscale: does the method matter in image recognition? PloS one 7(1) (2012) e29740
- [22] Diamond, S., Sitzmann, V., Boyd, S., Wetzstein, G., Heide, F.: Dirty pixels: Optimizing image classification architectures for raw sensor data. (2017)
- [23] Karaimer, H.C., Brown, M.S.: A software platform for manipulating the camera imaging pipeline. In: European Conference on Computer Vision, Springer (2016) 429–444
- [24] Kang, S.B.: Automatic removal of chromatic aberration from a single image. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE (2007) 1–8
- [25] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. (2015) 2017–2025
- [26] Cheong, H., Chae, E., Lee, E., Jo, G., Paik, J.: Fast image restoration for spatially varying defocus blur of imaging sensor. Sensors 15(1) (2015) 880–898
- [27] Bhukhanwala, S.A., Ramabadran, T.V.: Automated global enhancement of digitized photographs. IEEE Transactions on Consumer Electronics 40(1) (Feb 1994) 1–10
- [28] Messina, G., Castorina, A., Battiato, S., Bosco, A.: Image quality improvement by adaptive exposure correction techniques. In: Multimedia and Expo, 2003. ICME ’03. Proceedings. 2003 International Conference on. Volume 1. (July 2003) I–549–52
- [29] Hunter, R.S.: Accuracy, precision, and stability of new photoelectric color-difference meter. In: Journal of the Optical Society of America. Volume 38. (1948) 1094–1094
- [30] Annadurai, S.: Fundamentals of digital image processing, Pearson Education India (2007)
- [31] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 3354–3361
- [32] Fritsch, J., Kuehnl, T., Geiger, A.: A new performance measure and evaluation benchmark for road detection algorithms. In: International Conference on Intelligent Transportation Systems (ITSC). (2013)
- [33] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 3213–3223
- [34] Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In Leibe, B., Matas, J., Sebe, N., Welling, M., eds.: European Conference on Computer Vision (ECCV). Volume 9906 of LNCS., Springer International Publishing (2016) 102–118
- [35] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. (2015) 91–99
- [36] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015)
- [37] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
- [38] Zhang, Y., David, P., Gong, B.: Curriculum domain adaptation for semantic segmentation of urban scenes. In: The IEEE International Conference on Computer Vision (ICCV). Volume 2. (2017) 6
- [39] Hoffman, J., Wang, D., Yu, F., Darrell, T.: Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. CoRR abs/1612.02649 (2016)