Automated Augmentation with Reinforcement Learning and GANs for Robust Identification of Traffic Signs using Front Camera Images
Traffic sign identification using camera images from vehicles plays a critical role in autonomous driving and path planning. However, front camera images can be distorted by blur, lighting variations and vandalism, which degrades detection performance. As a solution, machine learning models must be trained with data from multiple domains, but collecting and labeling more data in each new domain is time consuming and expensive. In this work, we present an end-to-end framework that augments traffic sign training data using optimal reinforcement learning policies and a variety of Generative Adversarial Network (GAN) models; the augmented data can then be used to train traffic sign detector modules. Our automated augmenter enables learning from transformed nighttime, poor lighting, and varying degrees of occlusion using the LISA Traffic Sign and BDD-Nexar datasets. The proposed method enables mapping training data from one domain to another, thereby improving traffic sign detection precision/recall from 0.70/0.66 to 0.83/0.71 for nighttime images.
Vehicle path planning and autonomous functionalities rely heavily on timely and robust identification of traffic signs regardless of visibility-related challenges. Several works to date have focused on classification of cropped traffic signs using deep learning models, but there continues to be a lack of end-to-end generalizable methods that treat complete front camera images. Since data augmentation is a key module for robust training of deep learning detectors for traffic signs, we present a novel automated augmenter that can map labelled training data from the day time to the night time domain while ensuring classification performance enhancement. Moreover, the day and night-time images do not require paired labeling for model training, and the approach can be further extended to weather condition variations as well. Thus, the proposed method can significantly increase the volume of annotated data for machine learning applications such as robust traffic sign identification.
Several Generative Adversarial Network (GAN) models have been developed to date for style and textural transformations that can aid daytime to night-time image transformation while preserving features that are crucial for classification of a specific region of interest (ROI). One such implementation, the deep convolutional GAN (DCGAN), uses transposed convolutions to generate fake night-time images from a random noise vector, followed by a convolutional neural network (CNN) discriminator model that aims at separating real and fake images. Another implementation, the super resolution GAN (SRGAN), uses low resolution night images as generator input and CNNs with residual layers for the generator and discriminator design to yield high resolution daytime images. StyleGAN further changes the generator model significantly through a mapping network that uses an intermediate latent space to control style and introduces noise variations at each point in the generator model. CycleGAN eliminates the need for paired images for the CNN-based generator model and relies on cyclic consistencies between day to night to day transformations, followed by a CNN-based discriminator model. Though all these models produce realistic scenery and face/car images to varying degrees, they suffer from poor resolution of generated ROIs for traffic sign regions in large field of view images acquired from automotive grade front cameras. In this work, we implement a bounding box GAN (BBGAN) that minimizes transformations around the ROIs, i.e., traffic sign bounding boxes, using a feed-forward generator with U-net architecture and a CNN-based discriminator. Additionally, reinforcement learning (RL) models are invoked to identify optimal transformation policies for the traffic sign bounding boxes from a set of 20 policies that involve shear, color transformations and occlusions, in order to generate occlusion-based transformations around traffic signs.
Our analysis shows that the optimal RL-based traffic sign modification policies, together with the BBGAN generated images, generalize day-to-night time images and are robust to vandalism and weather-related occlusions as well. Finally, we evaluate the usefulness of the automated augmenter by comparing the performance of an object detection system (ODS) trained without and with augmented data in the ODS training set.
This paper makes two key contributions. First, a combination of RL and BBGAN models is presented to generate pose and lighting variations along with day-to-night time transformations for traffic sign identification datasets, thereby enabling 4 times data augmentation for a single annotated front camera image. Second, the proposed framework operates on automotive grade wide field camera images to conserve the ROI-specific structural and textural information that is significant to traffic sign classification tasks, rather than focusing on cropped traffic sign images only. In addition, the manually annotated images for variations in image blurriness, orientation and lighting condition used in this work will enable generalizable benchmarking for new methodologies. The proposed automated augmenter is compared against several image transformation strategies that vary in training complexity, to assess its generalizability for data augmentation in ROI-specific classification and detection tasks.
II Data Augmentation Methods
In this work we improve the ODS performance on data that is out-of-distribution (OOD) with respect to the original training data. The proposed method significantly reduces annotation costs by generating illumination and structural variations of the annotated images, thereby allowing the same annotations to serve multiple images. The architecture of the proposed automated augmenter is shown in Fig. 1. Here, an ODS (YOLOv3 network) is trained on daytime images and the corresponding artificially transformed nighttime images that are derived through the augmentation methods described below. To establish the performance of the proposed automated augmenter and for baselining purposes, the ODS is trained on 80% of the daytime images from the LISA Traffic Sign Dataset (LISA), and tested on the remaining 20% of daytime images from the same data set as well as on annotated real night time images from the Berkeley DeepDrive and Nexar data sets.
The baseline performance of the ODS without data augmentation is a precision/recall of 0.897/0.883 and 0.70/0.662 on daytime and night time images, respectively. The discrepancy in ODS performance between the day and night time images occurs because the night time images are OOD. The primary purpose of the automated augmentation methods described below is to increase the performance of the ODS on the night time test data while preserving the detection performance in the day time domain.
II-A Easy Augmentation Methods
For comparative assessment of automated augmenters with the baseline (no augmentation) ODS performance, we use two easy augmentation methods described below.
II-A1 Blender (BLEND)
3D-modelling software has previously been used to successfully implement automated pipelines for generating annotated training data for classifiers. Inspired by this, we generate traffic signs using Blender. We render traffic signs from randomly chosen angles and against randomly chosen backgrounds taken from collected images of night time traffic scenarios. The world space coordinates of the sign model are automatically transformed to screen space and used as annotations for each rendered image. Examples of BLEND rendered night time images are shown in Fig. 2(a).
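The world-to-screen step can be illustrated with a simple pinhole camera model. This is only a sketch under stated assumptions, not the Blender pipeline used in this work: the function name `project_to_bbox`, the camera convention (looking along +z, x to the right, y up) and the focal length in pixels are all hypothetical choices made for illustration.

```python
import numpy as np


def project_to_bbox(points_cam, focal_px, width, height):
    """Project 3D sign corners (camera coordinates) to a 2D bounding box.

    Assumed convention (hypothetical): the camera looks along +z,
    x points right, y points up, and the principal point is the image center.
    Returns (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    pts = np.asarray(points_cam, dtype=float)
    # Perspective division, then shift to the image center.
    u = focal_px * pts[:, 0] / pts[:, 2] + width / 2.0
    # Image rows grow downwards, so the y axis is flipped.
    v = -focal_px * pts[:, 1] / pts[:, 2] + height / 2.0
    return (float(u.min()), float(v.min()), float(u.max()), float(v.max()))


# A 1 m x 1 m sign facing the camera at a distance of 10 m.
corners = [(-0.5, -0.5, 10.0), (0.5, -0.5, 10.0), (0.5, 0.5, 10.0), (-0.5, 0.5, 10.0)]
bbox = project_to_bbox(corners, focal_px=1000.0, width=640, height=480)
```

The resulting box can be stored directly as the annotation of the rendered image, which is what makes the rendering pipeline label-free.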
II-A2 SimpleAugment (SAUG)
This method augments the day time images from the LISA dataset using three simple pixel-based transformation steps. First, the pixels corresponding to the blue color-plane are decreased based on the initial RGB-vector values per pixel. Here, brighter pixels with higher values are decreased exponentially more than darker pixels with lower values. This process creates a darker version of the input image. Second, the pixels corresponding to the top half of the image are further decreased in intensity to make the sky region appear darker. Third, the bounding box regions corresponding to the traffic signs are retained from the original image to highlight the ROIs. An example of the output of SAUG is shown in Fig. 2(b).
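The three steps above can be sketched as follows. The exact darkening function is not specified in the text, so the exponential attenuation `v * exp(-k * v)`, its application to all color channels, and the parameters `k` and `sky_factor` are assumptions chosen only to illustrate the idea; `simple_augment` is a hypothetical name.

```python
import numpy as np


def simple_augment(image, boxes, k=1.0, sky_factor=0.6):
    """Darken a daytime image while keeping traffic sign ROIs unchanged.

    image: HxWx3 uint8 array; boxes: list of (x0, y0, x1, y1) pixel boxes.
    k and sky_factor are illustrative assumptions, not values from the paper.
    """
    img = image.astype(np.float32) / 255.0
    # Step 1: brighter pixels are attenuated more strongly than darker ones.
    dark = img * np.exp(-k * img)
    # Step 2: darken the top half further so the sky region looks like night.
    half = dark.shape[0] // 2
    dark[:half] *= sky_factor
    out = np.clip(np.rint(dark * 255.0), 0, 255).astype(np.uint8)
    # Step 3: paste the original ROI pixels back to keep the signs legible.
    for x0, y0, x1, y1 in boxes:
        out[y0:y1, x0:x1] = image[y0:y1, x0:x1]
    return out


day = np.full((8, 8, 3), 200, dtype=np.uint8)
night = simple_augment(day, boxes=[(2, 5, 4, 7)])
```

Because the bounding boxes are copied unchanged, the original annotations remain valid for the darkened image.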
II-B GAN Models for Domain Transfer and Augmentation
One significant class of methods used for data augmentation is GANs and their variants. GANs have been used to demonstrate that artificial images can be automatically generated to appear significantly similar to actual camera-acquired or hand-painted images. This is achieved through a training process that learns an implicit distribution from a training set of images. This adversarial training process involves two steps. The first step is the generation of fake images following a distribution that is minimally dissimilar from that of real images. This is performed using a trained generator with parameters that accepts inputs corresponding to image structure and/or image noise. The second step is the maximization of the discriminatory performance of a classifier with parameters towards the real and fake images/ROIs in images. The loss which is minimized by the GAN optimization routine is given by
The GAN models analyzed in this work and described below are trained to take a day time image as input and generate a corresponding night time image, as shown in Fig. 2, where a daytime image of a stop sign is converted into its night time equivalent.
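For reference, the two-step adversarial training described in this section is commonly written as a minimax objective. The block below is the standard GAN formulation adapted to a day time input, not necessarily the exact loss used in this work, and all symbols are notation assumed here for illustration: $G_{\theta_G}$ is the generator with parameters $\theta_G$, $D_{\theta_D}$ is the discriminator with parameters $\theta_D$, $x$ is a day time input image and $y$ is a real night time image.

```latex
\min_{\theta_G} \max_{\theta_D} \;
\mathbb{E}_{y \sim p_{\text{night}}}\big[\log D_{\theta_D}(y)\big]
+ \mathbb{E}_{x \sim p_{\text{day}}}\big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(x))\big)\big]
```

The first term rewards the discriminator for recognizing real night time images, and the second rewards it for rejecting generated ones; the generator is trained to make the second term small.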
II-B1 CycleGAN (CG)
This method was developed as a tool for domain transfer without the need for paired images from the different domains. At inference, the CG model takes an image as input and outputs the same image in a different style. The key difference from a traditional GAN is that it preserves the content of the input image instead of creating new content from noise. CG offers a choice of generative models, either a residual network or a U-net architecture. The U-net generates different sections of the image at a time and therefore does not get the full context of the image, potentially preventing it from learning certain features. In this work a residual network is used, which has significantly higher memory requirements. Due to this high memory usage, the CG is trained on reduced field of view images (further described in Sec. III). We observe that CG often generates very dark night-time images in which the traffic sign is very hard to detect and classify, even for a human. To circumvent this problem, we implement a version of CG followed by insertion of the traffic signs directly from the daytime image. In this way the scene is converted to night time while the content of the ROI is left unaltered, as can be seen in Fig. 2(c).
II-B2 Bounding Box GAN (BBGAN)
Although CG is one of the state-of-the-art methods for domain transfer, the dark traffic signs in the resulting images limit the performance of the ODS with this augmentation method (as discussed in Sec. III). To preserve the appearance of the traffic signs in the night time images, we leverage the fact that we know the location of the traffic sign in the input image. Hence a customized BBGAN, inspired by prior work on GAN-based image synthesis, is developed to transfer style from day to night time while preserving the content of the bounding box part of the image that contains the traffic sign. The BBGAN minimizes the loss
where the weight on the last term is a trainable parameter and the ROI terms denote the subsets of the images corresponding to the ROIs. The last term in Eq. (2) represents a content preserving loss that penalizes the pixel by pixel difference between the input and output image in the ROI. An example of style transfer using BBGAN can be seen in Fig. 2(d).
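The content preserving term can be sketched numerically as follows. This is a minimal illustration under stated assumptions rather than the loss implemented in this work: the mean absolute difference is one possible choice of pixel by pixel penalty, the weight is treated as a fixed scalar instead of a trainable parameter, and the names `roi_content_loss` and `bbgan_generator_loss` are hypothetical.

```python
import numpy as np


def roi_content_loss(day_image, generated_image, boxes):
    """Mean absolute pixel difference inside the ROI boxes (assumed L1 form).

    day_image, generated_image: HxWx3 arrays; boxes: list of (x0, y0, x1, y1).
    """
    mask = np.zeros(day_image.shape[:2], dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    if not mask.any():
        return 0.0
    diff = np.abs(day_image.astype(np.float32) - generated_image.astype(np.float32))
    return float(diff[mask].mean())


def bbgan_generator_loss(adversarial_loss, day_image, generated_image, boxes, weight=10.0):
    """Adversarial loss plus the weighted ROI content term.

    weight is a fixed scalar here (assumption); the text describes it as trainable.
    """
    return adversarial_loss + weight * roi_content_loss(day_image, generated_image, boxes)


day = np.zeros((4, 4, 3), dtype=np.uint8)
fake = day.copy()
fake[0, 0] = 255   # change outside the ROI: not penalized by the content term
fake[1, 1] = 10    # change inside the ROI: penalized
content = roi_content_loss(day, fake, boxes=[(1, 1, 3, 3)])
total = bbgan_generator_loss(1.0, day, fake, boxes=[(1, 1, 3, 3)], weight=2.0)
```

Only pixels inside the bounding boxes contribute to the content term, so the generator remains free to restyle the rest of the scene.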
II-C Reinforcement Learning based Augmentation
The RL based data augmentation method (RLAUG) of AutoAugment automatically searches for image processing policies or operations that can improve ODS performance by data augmentation. This method relies on altering the existing image quality/structure, as opposed to the generative networks that directly generate new images as an output. Here, a policy is defined as a sequence of image processing operations that modify an existing image, such as rotation, shear and color contrast transformations. In AutoAugment, a policy consists of 5 sub-policies, and a sub-policy comprises two operations; a search algorithm is applied to find the best set of policies that allow an ODS to yield the best validation accuracy on a target dataset. The search algorithm uses a recurrent neural network (RNN) controller, which samples a policy, and a child neural network, which is trained with the policy to produce a reward signal that updates the controller. This augmentation algorithm applies 16 operations, with each operation's probability and magnitude discretized into 11 and 10 uniformly spaced values, respectively. Thus, the search space for finding 1 policy (containing 5 sub-policies of 2 operations each) has about (16 x 10 x 11)^10 ≈ 2.9 x 10^32 possibilities. The process proceeds as follows: for every image in each batch, one sub-policy is randomly chosen to produce a transformed image to train the child model. In each RNN controller training epoch, the training set is augmented by applying 5 sub-policies to train the child model. The child model is then evaluated to measure the ODS accuracy, which is used as the reward signal to train the RNN controller, which in turn gets updated to predict better policies. In total, the controller samples about 15000 policies for each dataset. Here, a modified policy is created to generate varying degrees of image blurriness and occlusions to learn from vandalized traffic signs, as shown in Fig. 2(e).
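The policy structure and the size of the search space described above can be sketched in a few lines. Note that this sketch samples policies uniformly at random in place of the RNN controller, and the operations are passed in as placeholder callables; the function names and the tuple layout `(operation index, probability, magnitude)` are assumptions made for illustration.

```python
import random

NUM_OPERATIONS = 16
PROBABILITY_LEVELS = [i / 10 for i in range(11)]  # 11 uniformly spaced values
MAGNITUDE_LEVELS = list(range(10))                # 10 uniformly spaced values
SUB_POLICIES_PER_POLICY = 5
OPS_PER_SUB_POLICY = 2


def search_space_size():
    """Number of distinct policies: (16 x 10 x 11) ** (5 x 2)."""
    per_operation = NUM_OPERATIONS * len(MAGNITUDE_LEVELS) * len(PROBABILITY_LEVELS)
    return per_operation ** (SUB_POLICIES_PER_POLICY * OPS_PER_SUB_POLICY)


def sample_policy(rng):
    """Uniform random stand-in for the RNN controller (assumption)."""
    return [
        [
            (rng.randrange(NUM_OPERATIONS), rng.choice(PROBABILITY_LEVELS), rng.choice(MAGNITUDE_LEVELS))
            for _ in range(OPS_PER_SUB_POLICY)
        ]
        for _ in range(SUB_POLICIES_PER_POLICY)
    ]


def apply_sub_policy(image, sub_policy, operations, rng):
    """Apply each operation of a sub-policy with its own probability."""
    for op_index, probability, magnitude in sub_policy:
        if rng.random() < probability:
            image = operations[op_index](image, magnitude)
    return image


rng = random.Random(0)
policy = sample_policy(rng)
size = search_space_size()
```

In the full method the reward from the child model would drive how the controller samples the next policy; the uniform sampler above only shows the shape of what is being searched.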
RLAUG and BBGAN are two standalone automated augmentation methods. In this work they are also combined into one RLAUGBBGAN augmentation method, where the BBGAN first generates night time transformations from the daytime images, followed by RLAUG, which converts the day and night time images to their best RL transformed augmented versions. Thus, for each daytime image, as shown in Fig. 2(f), we obtain its RL transformed version, its night time equivalent and the RL transformed version of the night time image, resulting in 4 times data augmentation.
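The 4 times augmentation described above amounts to a small composition of the two transforms. In the sketch below, `to_night` and `rl_transform` are hypothetical callables standing in for the trained BBGAN generator and the selected RL policy, and the bounding box annotations are assumed to stay valid for all four outputs.

```python
def augment_fourfold(day_image, boxes, to_night, rl_transform):
    """Return the four training variants derived from one annotated day image.

    to_night and rl_transform are placeholders (assumptions) for the BBGAN
    generator and the chosen RL augmentation policy, respectively.
    """
    night_image = to_night(day_image, boxes)
    return [
        ("day", day_image),
        ("day_rl", rl_transform(day_image, boxes)),
        ("night", night_image),
        ("night_rl", rl_transform(night_image, boxes)),
    ]


# Toy stand-ins that tag a string instead of transforming pixels.
variants = augment_fourfold(
    "img",
    boxes=[(10, 10, 40, 40)],
    to_night=lambda image, boxes: image + "+night",
    rl_transform=lambda image, boxes: image + "+rl",
)
```

Each variant reuses the same annotation, which is where the reduction in labeling cost comes from.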
III Experiments and Results
As discussed in Sec. II-B, training GAN-based methods requires significant computing power to process large images. The size of images in the LISA dataset varies from [640x480] to [1280x960] pixels. For image standardization, all images are cropped to [256x256] pixels while retaining most of the traffic signs within the field of view. The ODS is trained on the LISA training data set containing 7819 images to set a performance baseline against which the different augmentation methods can then be compared.
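One way to crop to a fixed size while keeping a sign in view is to center the crop window on the bounding box and clamp it to the image borders. The text does not state how the crop window is chosen, so this centering strategy and the name `crop_around_box` are assumptions; the sketch also assumes the image is at least 256 pixels in each dimension, which holds for the LISA image sizes quoted above.

```python
def crop_around_box(image_width, image_height, box, crop_size=256):
    """Choose a crop window that contains the sign and remap its box.

    box: (x0, y0, x1, y1) in full-image pixels.
    Returns (window, new_box), both as (x0, y0, x1, y1) tuples.
    """
    x0, y0, x1, y1 = box
    center_x = (x0 + x1) // 2
    center_y = (y0 + y1) // 2
    # Center the window on the sign, then clamp it inside the image.
    left = min(max(center_x - crop_size // 2, 0), image_width - crop_size)
    top = min(max(center_y - crop_size // 2, 0), image_height - crop_size)
    # Express the annotation in the coordinates of the cropped image.
    new_box = (
        max(x0 - left, 0),
        max(y0 - top, 0),
        min(x1 - left, crop_size),
        min(y1 - top, crop_size),
    )
    return (left, top, left + crop_size, top + crop_size), new_box


window, new_box = crop_around_box(640, 480, (600, 10, 630, 40))
```

Signs larger than the crop window would be truncated by the clamping in `new_box`, which matches the statement that most, but not all, signs are retained.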
The LISA data set contains both the LISA-TS data set and the LISA Extension data set. To test the performance on night time images, the trained ODS is tested on 1992 manually annotated real night-time images from the Berkeley DeepDrive and Nexar datasets, with images containing the same traffic signs as in the test LISA dataset. We have made this annotated data set publicly available (https://sites.google.com/site/sohiniroychowdhury/research-autonomous-driving) and believe that it makes an important contribution to the development of robust perception systems for autonomous vehicle technology. The traffic sign detector is further tested on 2105 day-time LISA images to ensure that the daytime performance remains high despite the additional night-time training data. The data set composition is summarized in Table I.
Table I: Data set composition.
|LISA||LISA TS + LISA Extension (256x256)||Day||-||10503||9924|
|LISA test||LISA test split (20%)||Day||Test||2161||2105|
|LISA training||LISA training split (80%)||Day||Train||8342||7819|
|BDDNex||BDD + Nexar (256x256)||Night||Test||2248||1992|
In the adversarial training process, the mapping between day and night time images requires a training set of night time images, specifically for the discriminator. This training set is extracted from both the Nexar and the BDD datasets. As these data sets contain both night and daytime images, the separation of night and day time images is included in the pre-processing part of the data pipeline. The resulting data set for training all GANs consists of 9652 night time images. The various augmentation methods are compared on their capability to improve traffic sign detection and classification in terms of precision, which represents the ratio between the number of correctly classified traffic signs and the number of positively classified traffic signs per class, and recall, which represents the fraction of actual traffic signs per class that are correctly classified. The training and test data sets are randomly sampled to gauge the sensitivity of the classifier to the data.
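The per-class precision and recall defined above can be computed from matched detections as in the following sketch. The matching of detections to ground truth boxes (for example by overlap) is assumed to have been done already, and the input format, a list of `(true_label, predicted_label)` pairs with `None` for a missing side, is a hypothetical convention chosen for illustration.

```python
from collections import Counter


def per_class_precision_recall(pairs):
    """Compute {class: (precision, recall)} from matched label pairs.

    pairs: iterable of (true_label, predicted_label). A true_label of None
    marks a spurious detection; a predicted_label of None marks a missed sign.
    """
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for true_label, predicted_label in pairs:
        if true_label is not None and true_label == predicted_label:
            true_pos[true_label] += 1
            continue
        if predicted_label is not None:
            false_pos[predicted_label] += 1
        if true_label is not None:
            false_neg[true_label] += 1
    result = {}
    for label in set(true_pos) | set(false_pos) | set(false_neg):
        predicted = true_pos[label] + false_pos[label]
        actual = true_pos[label] + false_neg[label]
        precision = true_pos[label] / predicted if predicted else 0.0
        recall = true_pos[label] / actual if actual else 0.0
        result[label] = (precision, recall)
    return result


pairs = [
    ("stop", "stop"), ("stop", "stop"), ("stop", "yield"),
    ("yield", "yield"), (None, "stop"), ("yield", None),
]
scores = per_class_precision_recall(pairs)
```

In this toy example the "stop" class has two correct detections, one spurious detection and one misclassification, which gives a precision and recall of 2/3 each.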
The comparative performance of the ODS without (No Aug) and with various augmentation methods is shown in Table II. Here, we observe that the BLEND method yields the best night time traffic sign classification recall. It is noteworthy that the BLEND method uses high-quality image textures for each sign, as opposed to the low-quality real-world examples used by the other methods. This allows the ODS to learn intricate details of each sign and to distinguish similar signs. However, the BLEND method requires an unscalable amount of manual labor to set up the 3D Blender models, and may not necessarily generalize well to other types of data augmentation needs.
|Augmentation Method||Test Data||Precision||Recall|
With respect to precision, the increase in performance is significant for all methods except CG. We attribute the low performance of CG to the fact that the content of the traffic signs obtained by this method is too dark to read in many cases. CG produces low quality images because it relies on the ability to map back and forth between the daytime and nighttime domains. The problem is that there is very little information in the night time images, which in turn makes a night to day transformation difficult.
The method with the best improvement in precision is RLAUGBBGAN, where night time recall improved from 0.662 to 0.913. The use of RLAUG to apply policies to the bounding box part of the image does not necessarily provide examples of dark signs. It does, however, allow the ODS to further generalize its identification of well lit signs. This, in combination with the ability of BBGAN to increase the night time performance, makes it a scalable method that allows for a significant increase in traffic sign classification performance.
IV Conclusions and Discussion
In this work we demonstrate that for automotive grade front camera images, optimal RL-based modification policies along with GAN generated images collectively generalize day to night time images for traffic sign identification tasks. The proposed automated augmenter aids training object detectors that are robust to image blurriness and vandalism-related occlusions as well. The proposed augmenter (RLAUGBBGAN) enhances precision and recall for traffic sign classification by 3-7%, while eliminating any test time processing overheads. Examples of improvements in the object detector without and with the proposed augmenter are shown in Fig. 4.
Thus, the proposed automated augmenter can be robustly used to train other object detector modules related to autonomous driving and path planning functionalities.
References
- Home of the blender project - free and open 3d creation software. [Online; accessed 14-February-2019].
- (2018-10-09) AutoAugment: learning augmentation policies from data. Google Brain.
- (2018-06-26) Dataset augmentation with synthetic images improves semantic segmentation. arXiv e-prints.
- (2013) Detection of traffic signs in real-world images: the german traffic sign detection benchmark. In The 2013 international joint conference on neural networks (IJCNN), pp. 1–8.
- (2018) An introduction to image synthesis with generative adversarial nets. arXiv preprint arXiv:1803.04469.
- (2018) A style-based generator architecture for generative adversarial networks. CoRR abs/1812.04948.
- (2016) Photo-realistic single image super-resolution using a generative adversarial network. CoRR abs/1609.04802.
- (2012) Vision based traffic sign detection and analysis for intelligent driver assistance systems: perspectives and survey. IEEE Transactions on Intelligent Transportation Systems.
- (2017) NEXAR challenge ii.
- (2015-11) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv e-prints, arXiv:1511.06434.
- (2018-04-08) YOLOv3: an incremental improvement. University of Washington.
- (2015) U-net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597.
- (2018) Conditional transfer with dense residual attention: synthesizing traffic signs from street-view imagery. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 553–559.
- (2018) BDD100K: A diverse driving video database with scalable annotation tooling. CoRR abs/1805.04687.
- (2018) Style separation and synthesis via generative adversarial networks. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 183–191.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR abs/1703.10593.