EventGAN: Leveraging Large Scale Image Datasets for Event Cameras

Abstract

Event cameras provide a number of benefits over traditional cameras, such as the ability to track incredibly fast motions, high dynamic range, and low power consumption. However, their application to computer vision problems, many of which are primarily dominated by deep learning solutions, has been limited by the lack of labeled training data for events. In this work, we propose a method which leverages the existing labeled data for images by simulating events from a pair of temporal image frames, using a convolutional neural network. We train this network on pairs of images and events, using an adversarial discriminator loss and a pair of cycle consistency losses. The cycle consistency losses utilize a pair of pre-trained self-supervised networks which perform optical flow estimation and image reconstruction from events, and constrain our network to generate events from which both of these networks produce accurate outputs. Trained fully end-to-end, our network learns a generative model for events from images without the explicit scene-motion modeling required by model-based methods, while also implicitly modeling event noise. Using this simulator, we train a pair of downstream networks on object detection and 2D human pose estimation from events, using simulated data from large scale image datasets, and demonstrate the networks' abilities to generalize to datasets with real events.


1 Introduction

Deep learning has driven a revolution in many computer vision tasks which had been considered incredibly challenging. The ability to leverage immense amounts of data to train neural networks has resulted in significant improvements in performance for many tasks. As a vision modality, event cameras have much to gain from deep learning. By combining neural networks with the advantages of event cameras, we stand to extend the operating range of speeds and lighting conditions significantly beyond what is achievable by traditional cameras.

However, these networks for events are limited by the amount of labeled training data available, due to the camera's relative infancy and the cost of acquiring accurate ground truth labels. While some works have been able to bypass this issue with self-supervised approaches [44, 30, 39], some problems, such as detection and classification, cannot currently be solved without a large corpus of labeled training data. In this work, we focus on an alternative to costly data labeling, by leveraging the large set of labeled image datasets via image-to-event simulation.

The highest-fidelity event camera simulators today [29, 27, 21] all operate within a similar framework: they simulate optical flow in the image, either through 3D camera motion or a parametrized warping (e.g. affine) of the image, in order to precisely track the generation of events as each point in the image moves to a new pixel. However, these scenarios either require simulation of the full 3D scene, or severely constrain the motion in the image. In addition, modeling event noise, both in terms of erroneous events and noise in the event measurements, is a challenging open problem.

In this work, we present EventGAN, a novel method for image-to-event simulation, where we apply a convolutional neural network as the function between images and events. By learning this function from data, our method does not require any explicit knowledge of the scene or of the relationship between images and events, but is instead able to regress a realistic set of events given only images as input. In addition, our network is able to learn the noise distribution over the events, which is not modeled by competing methods. Finally, our proposed method offers fast, constant-time simulation which is easily parallelizable on GPUs and can be integrated into any modern neural network architecture, as opposed to prior work which requires 3D simulation of the scene.

Our network is trained on a set of image and event pairs, which are directly output by event cameras such as the DAVIS [3]. At training time, we apply an adversarial loss to align the generated events with the real events. In addition, we pre-train a pair of CNNs to perform optical flow estimation and image reconstruction from real events, and constrain our generator to produce events which allow these pre-trained networks to generate accurate outputs. In other words, we constrain the generated events to retain the motion and appearance information present in the real data.

Using this event simulation network, we train a set of downstream networks to perform object detection on cars and 2D human pose estimation, given images and labels from large scale image datasets such as KITTI [9], MPII [1] and Human3.6M [17]. We then evaluate performance on these downstream tasks on real event datasets, MVSEC [42] for car detection, and DHP19 [5] for human pose, demonstrating the generalization ability of these networks despite having mostly seen simulated data at training time. All data and code will be released at a later date.

Our main contributions can be summarized as:

  • A novel pipeline for supervised training of deep neural networks for events, by simulating events from existing large scale image datasets and training on the simulated events and image labels.

  • A novel network, EventGAN, for event simulation from a pair of images, trained using an adversarial loss and cycle consistency losses which constrain the generator network to generate events from which pre-trained networks are able to extract accurate optical flow and image reconstructions.

  • A test dataset for car detection, with manually labeled bounding boxes for cars from the MVSEC [42] dataset.

  • Experiments demonstrating the generalizability of the networks trained on simulated data to real event data, by training object detection and human pose networks on simulated data, and evaluating on real data.

Figure 1: Overview of the EventGAN pipeline. A pair of grayscale images are passed into the generator, which predicts a corresponding event volume. This output is constrained by an adversarial loss, as well as a pair of cycle consistency losses which constrain the generated volume to encode image and flow information.

2 Related Work

2.1 Event Simulation

Prior works on event simulation have focused on differencing log intensity frames, in order to simulate the condition required to trigger an event:

$\left| \log I(x, y, t + \Delta t) - \log I(x, y, t) \right| \geq \theta$   (1)

Earlier works by Bi et al. [2] and Kaiser et al. [20] simulated events by directly applying this equation to the log intensity difference between each pair of successive images. These methods were limited by the temporal resolution of the images, and as such could only handle relatively slowly moving scenes. To improve fidelity, Rebecq et al. [29], Mueggler et al. [27] and Li et al. [21] perform full 3D simulations of a scene. This allows them to simulate images at arbitrary temporal resolution, while also having access to the optical flow within the scene, allowing for accurate event trajectories. However, these methods are limited to fully simulated scenes, or to images where the motion is known (or where a simplified motion model such as an affine transform is applied). Performing 3D simulations is also a relatively expensive procedure, requiring complex rendering engines. In addition, these methods do not properly model the noise properties of the sensor. Rebecq et al. [29] apply Gaussian noise to the trigger threshold, $\theta$, as an approximation, but to our knowledge no true model of the event noise distribution exists.
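For concreteness, the following is a minimal sketch of this frame-differencing approach (our own illustration, not the code of [2] or [20]; `theta` stands in for the contrast threshold, and per-event timestamps cannot be recovered without intermediate frames, which is the key limitation of this class of methods):

```python
import numpy as np

def naive_event_simulation(frame_prev, frame_next, theta=0.2, eps=1e-3):
    """Simulate events by thresholding the log-intensity difference
    between two successive frames, as in frame-differencing simulators.
    Returns (x, y, polarity) arrays; no per-event timestamps are produced."""
    log_diff = np.log(frame_next.astype(np.float32) + eps) - \
               np.log(frame_prev.astype(np.float32) + eps)
    # Signed number of threshold crossings at each pixel.
    num_events = np.fix(log_diff / theta).astype(np.int32)
    ys, xs = np.nonzero(num_events)
    polarities = np.sign(num_events[ys, xs])
    counts = np.abs(num_events[ys, xs])
    # Repeat pixels that crossed the threshold multiple times.
    xs = np.repeat(xs, counts)
    ys = np.repeat(ys, counts)
    polarities = np.repeat(polarities, counts)
    return xs, ys, polarities
```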

Our work, in contrast, runs in constant time using a CNN which is easily parallelizable and optimized for modern GPUs. The network learns both the motion information in the scene, as well as the noise distribution of the events.

2.2 Sim2Real/Domain Adaptation

Learning from simulation and other modalities has been a rapidly growing topic, with deep learning approaches for many robotics problems in particular requiring far more training data than is practical to collect on a physical platform. However, this remains a challenging open problem, as conventional simulators often cannot perfectly model the data distribution of the real world, and many methods attempt to bridge this gap [28, 19]. One popular approach to this problem in the image space is the use of Generative Adversarial Networks (GANs) [10], which consist of a generator trained to model the data distribution of the training set and a discriminator trained to differentiate between fake and real data. With particular relevance to this work, conditional GANs [25, 18] are able to model relationships between data distributions, while CycleGANs apply additional cycle consistency losses [45].

A successful application of cross-modality transfer is in the field of image-to-lidar translation. A number of recent works [40, 36, 37] have approached the problem of simulating lidar measurements from images, which allows networks to reason about 3D scenes more efficiently.

With a similar motivation to our work, Iacono et al. [15] and Zanardi et al. [41] address the issue of transferring learning from images to events by running a network trained on images on the grayscale frames produced by some event cameras such as the DAVIS [3], and using these outputs as ground truth to train a similar network for events. However, these methods treat the frame-based outputs as ground truth, and so will learn the biases and mistakes made by the frame-based network (e.g. the best mAP of the grayscale network in Zanardi et al. [41] is 0.59, resulting in an mAP for the event-based network of 0.26).

As an alternative approach, our work follows the philosophy of using GANs for image-to-event simulation. We then use the simulated events to train directly on the ground truth labels for the corresponding images, which should be at least as accurate as, if not better than, the outputs of a frame-based network trained on these labels.

3 Method

The generative portion of our pipeline consists of a U-Net [32] encoder-decoder network, as used in Zhu et al. [43] and Rebecq et al. [30]. The generator takes as input a pair of grayscale images, concatenated along the channel dimension, and outputs a volumetric representation of the events, described in Section 3.1. To constrain this output, we apply an adversarial loss, described in Section 3.2, as well as a pair of cycle consistency losses, described in Section 3.3. The full pipeline for our method can be found in Figure 1.

3.1 Event Representation

The most compact way to represent a set of events is as a set of 4-tuples, $(x_i, y_i, t_i, p_i)$, consisting of the position, timestamp, and polarity of each event. However, regressing points in general is a difficult task, and faces challenges such as varying numbers of events and permutation invariance.

In this work, we bypass this issue by instead regressing an intermediate representation of the events, as proposed by Zhu et al. [43]. In this representation, the events are scattered into a fixed-size 3D spatiotemporal volume, where each event, $(x_i, y_i, t_i, p_i)$, is inserted into a volume with $B$ temporal channels using a linear kernel:

$t_i^* = \frac{(B - 1)(t_i - t_0)}{t_N - t_0}$   (2)
$V(x, y, t) = \sum_i \max(0, 1 - |x - x_i|)\, \max(0, 1 - |y - y_i|)\, \max(0, 1 - |t - t_i^*|)$   (3)

This retains the distribution of the events in x-y-t space, and has shown success in a number of tasks  [43, 6, 30].

However, we deviate from the prior work in that we generate separate volumes for each polarity, and concatenate them along the time dimension. This results in a volume which is strictly non-negative, allowing for a ReLU as the final activation of the network, such that the sparsity in the volume is easily preserved.
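As a reference, a minimal sketch of this discretization is given below (our own illustration; the function name, tensor layout, and the assumption of time-sorted torch tensors with polarities in {-1, +1} are ours, and the released implementation may differ):

```python
import torch

def events_to_volume(xs, ys, ts, ps, num_bins, height, width):
    """Scatter events into a spatiotemporal volume with `num_bins` temporal
    channels per polarity, using a linear kernel in time (Eqs. (2)-(3)).
    Positive and negative events go into separate, strictly non-negative
    volumes, concatenated along the channel dimension."""
    volume = torch.zeros(2 * num_bins, height, width)
    t_star = (ts - ts[0]) / (ts[-1] - ts[0]).clamp(min=1e-6) * (num_bins - 1)
    for pol_idx, pol in enumerate((1, -1)):
        mask = ps == pol
        x, y, t = xs[mask].long(), ys[mask].long(), t_star[mask]
        t0 = t.floor().long()
        # Each event contributes to its two neighboring temporal bins.
        for offset in (0, 1):
            bin_idx = t0 + offset
            in_range = (bin_idx < num_bins).float()
            weight = (1.0 - (bin_idx.float() - t).abs()).clamp(min=0) * in_range
            bin_idx = bin_idx.clamp(max=num_bins - 1)
            volume.index_put_((pol_idx * num_bins + bin_idx, y, x),
                              weight, accumulate=True)
    return volume
```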

In addition, we normalize this volume similarly to Rebecq et al. [30], with an additional clipping step, as follows:

$\hat{V} = \frac{\min(V, V_{98})}{V_{98}}$   (4)

where $V_{98}$ is the 98th percentile value in the set of non-zero values of $V$. This equates to a clipping operation, followed by a normalization such that the volume lies in $[0, 1]$. The clipping is designed to reduce the effect of hot pixels, which have an erroneously low contrast threshold and thus generate disproportionately many events, skewing the range.
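A minimal sketch of this normalization, assuming a recent PyTorch with `torch.quantile`:

```python
import torch

def normalize_volume(volume):
    """Clip the event volume at the 98th percentile of its non-zero values,
    then scale so the result lies in [0, 1]. This limits the influence of
    hot pixels that fire disproportionately many events."""
    nonzero = volume[volume > 0]
    if nonzero.numel() == 0:
        return volume
    v98 = torch.quantile(nonzero, 0.98)
    return volume.clamp(max=v98) / v98
```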

Figure 2: Outputs from models trained with subsets of our proposed loss (panels: (a) l1-flow-recons, (b) adv., (c) adv.-recons, (d) adv.-flow, (e) adv-flow-rec, (f) real); all models are trained with the same hyperparameters. Events are visualized as average timestamp images, i.e. the average timestamp at each pixel. Any voxel with a non-zero value generates a color in the average timestamp image, allowing us to see the sparsity of the volume. (a): L1 reconstruction loss in place of the adversarial loss, causing artifacts in the events and no sparsity, as observed in the interior of the 'LOVE' symbol in the time image. (b): Adversarial loss only. The model struggles to converge, and requires significant hyperparameter tuning to achieve good results. (c): Adversarial and reconstruction losses. The model is now stable, but the events carry no motion information; the image should have a gradient in the motion direction. (d): Adversarial and flow losses. The motion direction can now be seen in the time image, but events are missing in many areas. (e): Adversarial, flow and reconstruction losses. Motion trails are now clearly visible in the time image (see letters). (f): Real events. Note that our method typically underestimates the amount of motion in the scene.

3.2 Adversarial Loss

Perhaps the most direct way to supervise this network is to apply a direct numerical error, such as an L1 or L2 loss, between the predicted and real events. However, given a pair of images, the number of plausible event distributions between them is extremely large (two images cannot constrain the exact motion between them). Such a direct loss would likely cause the network to overfit to the trajectories observed in the training set and fail to generalize.

Instead, we apply an adversarial loss [10]. This loss simply constrains the generated events to follow the same distribution as the real ones, and avoids directly constraining the network to memorize the trajectories seen at training time. For each event-image pair, we regress a generated event volume, $G(I)$, from the images using our network, and then pass both the generated events and the real events, $x$, through a discriminator network, $D$, which predicts whether its input is from real data. Our discriminator is a 4-layer PatchGAN classifier [18]. We alternately train the generator and discriminator, with two discriminator steps for every generator step, using the hinge adversarial loss [22, 35]:

$\mathcal{L}_D = \mathbb{E}_{x \sim p_{\mathrm{real}}}\left[\max(0, 1 - D(x))\right] + \mathbb{E}_{I \sim p_{\mathrm{data}}}\left[\max(0, 1 + D(G(I)))\right]$   (5)
$\mathcal{L}_G = -\,\mathbb{E}_{I \sim p_{\mathrm{data}}}\left[D(G(I))\right]$   (6)
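A sketch of these hinge losses, assuming the discriminator returns a raw (PatchGAN-style) score map:

```python
import torch
import torch.nn.functional as F

def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss for the discriminator (Eq. (5)): push real scores
    above +1 and fake scores below -1."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def generator_hinge_loss(d_fake):
    """Hinge loss for the generator (Eq. (6)): raise the discriminator's
    score on generated event volumes."""
    return -d_fake.mean()
```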

3.3 Cycle Consistency Losses

However, GANs are typically difficult to train, especially with a high dimensional output space such as an event volume. In addition, there is no guarantee that the simulated events retain the salient information in the images, such as accurate motion and intensity information.

To this end, we apply an additional pair of losses which constrain the generated events to encode this motion and intensity information. In particular, we pre-train a pair of networks for optical flow estimation and image reconstruction from real events, using the pipeline in EV-FlowNet [44].

The flow network takes as input the event volume and outputs a per-pixel optical flow. Supervision is applied by warping the previous image to the time of the next image using the predicted flow, and applying an L1 loss between the warped image and the real next image, together with a local smoothness constraint.

The image reconstruction network takes as input the previous image and the event volume, and outputs the predicted next image; it is directly supervised with an L1 loss between the reconstructed and real next image. The previous image is provided as input because we found that the image reconstruction network tended to overfit to the training set without it. Prior work by Rebecq et al. [30] circumvented this by training in a recurrent fashion, but doing so would require multiple passes through the recurrent network, which is undesirably expensive when the goal is to train the generator network. In addition, we summarize the event volume by summing along the time dimension. This maintains invariance to the temporal permutation of the events: for example, two events occurring at the start of the window and two events at the end of the window should generate the same output image. The input to the reconstruction network is thus a 2-channel image consisting of the previous image and the summed event volume.

In summary, the cycle consistency losses are:

$\mathcal{L}_{\mathrm{flow}} = \sum_{x, y} \left| I_0\big(x + u(x, y),\, y + v(x, y)\big) - I_1(x, y) \right|$   (7)
$\mathcal{L}_{\mathrm{smooth}} = \sum_{x, y} \sum_{(x', y') \in \mathcal{N}(x, y)} \big( |u(x, y) - u(x', y')| + |v(x, y) - v(x', y')| \big)$   (8)
$\mathcal{L}_{\mathrm{recons}} = \sum_{x, y} \left| \hat{I}_1(x, y) - I_1(x, y) \right|$   (9)
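The following sketch illustrates the two cycle consistency losses under simplifying assumptions (the smoothness term is omitted, the warping direction convention is ours, and a recent PyTorch with `grid_sample` and `meshgrid(indexing="ij")` is assumed):

```python
import torch
import torch.nn.functional as F

def photometric_flow_loss(flow, image_prev, image_next):
    """Photometric cycle consistency (sketch): warp one image toward the
    other with the flow predicted from the generated events, and penalize
    the L1 difference. flow has shape (B, 2, H, W); images are (B, 1, H, W)."""
    b, _, h, w = image_prev.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32, device=flow.device),
        torch.arange(w, dtype=torch.float32, device=flow.device),
        indexing="ij")
    # Sampling grid displaced by the flow, normalized to [-1, 1].
    gx = 2.0 * (xs + flow[:, 0]) / max(w - 1, 1) - 1.0
    gy = 2.0 * (ys + flow[:, 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    warped = F.grid_sample(image_prev, grid, align_corners=True)
    return (warped - image_next).abs().mean()

def reconstruction_loss(image_pred, image_next):
    """L1 loss between the reconstructed next image and the real one."""
    return (image_pred - image_next).abs().mean()
```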

When training the generator network, we pass the output from the generator as input to each of the pre-trained networks, and apply the same losses used to train each. However, in this case, we freeze the weights of each pre-trained network, such that the generator must tune its output to generate the best input for each pre-trained network. Both cycle consistency networks share the same architecture as the generator network, with the losses applied each time the generator is updated in the adversarial framework. The final losses at each step are:

Generator step: $\mathcal{L} = \mathcal{L}_G + \mathcal{L}_{\mathrm{flow}} + \mathcal{L}_{\mathrm{smooth}} + \mathcal{L}_{\mathrm{recons}}$   (10)
Discriminator step: $\mathcal{L} = \mathcal{L}_D$   (11)
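Putting the pieces together, a schematic training step is sketched below (reusing the loss sketches above; `generator`, `discriminator`, `flow_net`, `recons_net`, and the optimizers are placeholders, and the pre-trained networks are assumed frozen):

```python
import torch

def train_step(batch, generator, discriminator, flow_net, recons_net,
               g_optim, d_optim):
    """One schematic EventGAN iteration: two discriminator updates, then
    one generator update with the adversarial and cycle consistency losses."""
    images_prev, images_next, real_volumes = batch
    image_pair = torch.cat([images_prev, images_next], dim=1)

    # Discriminator: two steps per generator step, hinge loss.
    for _ in range(2):
        fake_volumes = generator(image_pair).detach()
        d_loss = discriminator_hinge_loss(
            discriminator(real_volumes), discriminator(fake_volumes))
        d_optim.zero_grad()
        d_loss.backward()
        d_optim.step()

    # Generator: adversarial loss plus the two cycle consistency losses.
    # recons_net is assumed to sum the event volume over time internally.
    fake_volumes = generator(image_pair)
    g_loss = (generator_hinge_loss(discriminator(fake_volumes))
              + photometric_flow_loss(flow_net(fake_volumes),
                                      images_prev, images_next)
              + reconstruction_loss(recons_net(images_prev, fake_volumes),
                                    images_next))
    g_optim.zero_grad()
    g_loss.backward()
    g_optim.step()
    return g_loss.item(), d_loss.item()
```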

These losses provide useful gradients early in training, when the adversarial loss is typically unstable, and embed motion and appearance information in the predicted event volumes. Figure 2 shows the effect of each loss on the output of the generator.

In summary, the adversarial loss enforces sparsity in the event volume and similarity between the fake and real event distributions. The flow loss enforces motion information to be present within the volume, while the reconstruction loss enforces regularity in the number of events generated by the same point. This is particularly evident when one visualizes the image of the average timestamp at each pixel, where extremely low (but non-zero) values may be hidden in the count image, and where motion trails are clearly visible.

We also implement the training recommendations of Gulrajani et al. [12] and Brock et al. [4]. In particular, we apply spectral normalization [26] in the encoder of the generator and batch normalization [16] throughout the generator, while the discriminator uses neither type of normalization. We also add noise to the labels seen by the discriminator by randomly flipping labels from real to fake 10% of the time, as recommended by Chintala et al. [7].
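As an illustration of how these recommendations might be wired in (a sketch, not the paper's code; the 10% label flip is shown only for the real branch of the hinge loss):

```python
import random
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization on a generator encoder convolution (sketch).
encoder_conv = spectral_norm(
    nn.Conv2d(in_channels=2, out_channels=64, kernel_size=3, padding=1))

def real_branch_with_label_noise(real_score, flip_prob=0.1):
    """With 10% probability, treat a real sample as fake when computing
    the discriminator's hinge loss (label noise)."""
    if random.random() < flip_prob:
        return torch.relu(1.0 + real_score).mean()  # treated as fake
    return torch.relu(1.0 - real_score).mean()      # treated as real
```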

Figure 3: Sample outputs generated by EventGAN compared to ESIM [30] (panels: (a) input frame, (b) EventGAN, (c) ESIM), visualized as images of the average timestamp at each pixel. Top images are from the KITTI dataset [9], bottom from MPII [1]. Compared to ESIM, our method more accurately captures the motion in the scene, as well as fine-grained detail.

Figure 4: Selected qualitative results of our car detection pipeline using the YOLOv3 network [31] (rows: EventGAN, ESIM, Frame). Detections are in blue, ground truth labels in green, and don't care regions in red. For an explanation of the methods, please see Table 1.

4 Experiments

We train our network using the RAdam optimizer [24] for 100 epochs on events and images from the indoor_flying and outdoor_day sequences of the MVSEC dataset [42], as well as a newly collected dataset of short (60 s) recordings from a DAVIS-346b camera [3], covering a number of different scenes and motions in order to capture a large range of event distributions. As the objective of this work is to produce an event simulator which operates well on existing image datasets, we did not train on scenes which are challenging for images (e.g. night-time driving). In total, the training set consists of around 30 minutes of data. During training, we perform weighted sampling from this dataset, with an 80%/20% split between the new data and MVSEC. Each input to the network consists of a pair of images, picked randomly between 1 and 6 frames apart, and the events between them.
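A sketch of this sampling scheme (the `is_new_data` flag is a hypothetical per-sample annotation of our own, marking samples from the new recordings versus MVSEC):

```python
import random
import torch
from torch.utils.data import WeightedRandomSampler

def make_sampler(is_new_data, new_frac=0.8):
    """Weighted sampling so roughly 80% of draws come from the new
    DAVIS-346b recordings and 20% from MVSEC. `is_new_data` is a boolean
    tensor with one entry per training sample."""
    n_new = is_new_data.sum().item()
    n_old = (~is_new_data).sum().item()
    weights = torch.where(is_new_data,
                          torch.tensor(new_frac / n_new),
                          torch.tensor((1.0 - new_frac) / n_old))
    return WeightedRandomSampler(weights, num_samples=len(weights))

def sample_frame_gap():
    """Each training input pairs frame i with frame i + gap,
    with the gap chosen uniformly between 1 and 6 frames."""
    return random.randint(1, 6)
```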

Quantitative evaluation of generative models is difficult, as measuring how well the predicted events fit the true event distribution requires knowledge of that distribution. For images, networks trained on a large corpus of image data are used to model these distributions, and metrics such as the Inception Score [34] or the Fréchet Inception Distance [14] are computed using these networks. However, this presents a chicken-and-egg problem for events, as no such corpus of event data currently exists.

Instead, we evaluate our method directly on a set of downstream tasks, and demonstrate that our simulated events are able to train networks for complex tasks which generalize to data with real events. In Sections 4.1 and 4.2 we describe our experiments for 2D human pose estimation and object detection, respectively.

4.1 2D Human Pose Estimation

We train a 2D human pose detector for events based on the publicly available code from Xiao et al. [38], which uses an encoder-decoder style network to regress a heatmap for each desired joint. We use a ResNet-50 [13] encoder, pretrained on ImageNet [33]. For event inputs, we modify the number of input channels in the first layer and randomly initialize the weights of this layer, as sketched below. The network is then trained on an 80%/20% split between the MPII [1] and Human3.6M [17] datasets. For each ground truth pose, a pair of images either 1 or 2 frames before and after the target frame is selected at random and passed into the generator network to produce a simulated event volume.
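A sketch of the input-layer modification (using torchvision's ResNet-50 as an assumed stand-in for the encoder of Xiao et al.; the deconvolutional head is omitted):

```python
import torch.nn as nn
import torchvision

def event_pose_backbone(num_event_channels):
    """ResNet-50 backbone pretrained on ImageNet, with the first convolution
    replaced (and randomly initialized) to accept the event volume's channel
    count instead of 3 RGB channels."""
    backbone = torchvision.models.resnet50(pretrained=True)
    backbone.conv1 = nn.Conv2d(num_event_channels, 64, kernel_size=7,
                               stride=2, padding=3, bias=False)
    return backbone
```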

We evaluate our method on the DHP19 [5] dataset, which consists of 3D joint positions of a human subject, recorded with motion capture, together with events from four cameras surrounding the subject. Using the camera calibrations, we project these 3D joint positions into 2D image positions for each camera. Following the experimental protocol of Calabrese et al. [5], we use data from subjects 13-17 and cameras 2-3 as the test set. As our method does not include any temporal consistency, we remove sequences with hand motions only, where most of the body is static and does not generate any events. This results in 16 motions across 5 subjects and 2 cameras. Following Calabrese et al. [5], we divide each sequence into chunks of 7500 events per camera, and evaluate on the average pose within each window.
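A minimal sketch of this chunking (events assumed to be a time-sorted N x 4 array per camera):

```python
import numpy as np

def chunk_events(events, events_per_chunk=7500):
    """Split an event stream (N x 4 array of x, y, t, p) into consecutive
    chunks of a fixed number of events; the ground truth for each chunk is
    the average pose over the chunk's time window."""
    num_chunks = len(events) // events_per_chunk
    return [events[i * events_per_chunk:(i + 1) * events_per_chunk]
            for i in range(num_chunks)]
```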

| Training Data | Precision | Easy recall | Hard recall | Comb. recall | AP | F-1 |
| EventGAN | 0.42 | 0.57 | 0.34 | 0.45 | 0.30 | 0.44 |
| ESIM | 0.23 | 0.08 | 0.02 | 0.05 | 0.02 | 0.09 |
| Frame | 0.57 | 0.48 | 0.27 | 0.37 | 0.29 | 0.45 |
Table 1: Object detection results on the Event Car Detection dataset. Metrics adopted from the PASCAL VOC challenge [8]. The EventGAN and ESIM models are trained on simulated events from the KITTI dataset, while the Frame model is trained on the real image frames from the KITTI dataset.

One issue with this direct evaluation is that the marker positions for DHP19 differ significantly from those in MPII and H36M. To overcome this offset between the joint positions, we freeze all but the final linear layer of our network and fine-tune this layer on the DHP19 training set (subjects 1-12, cameras 2-3). This is equivalent to training a linear model on the activations from the second-to-last layer, as is common in the self-supervised learning literature [11].
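A sketch of this fine-tuning setup (the output layer's attribute name is a placeholder):

```python
def freeze_all_but_last(model, last_layer_name="final_layer"):
    """Freeze every parameter except those of the final linear layer, so
    fine-tuning on DHP19 is equivalent to fitting a linear model on the
    penultimate activations. `last_layer_name` is a placeholder for the
    actual attribute name of the output layer."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(last_layer_name)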

4.2 Object Detection

We train a detection network using the YOLOv3 pipeline [31]. We initialize the network from a pretrained YOLOv3 network with spatial pyramid pooling, with the first input layer randomly initialized. The network is trained on simulated events from the KITTI Object Detection dataset [9], generated from the target frame and a frame either one or two frames prior.

Figure 5: Qualitative results of our human pose estimation on real event data (columns: Real Events, EventGAN-fine-30, ESIM-fine-30, EventGAN evaluated on custom data). The first three sets are evaluated on samples from the DHP19 dataset [5], where ground truth is in white and predictions are in blue. Our model achieves accuracy on par with a model trained directly on the real data, after 30 epochs of fine-tuning only the last linear layer. The last set shows our YOLOv3 detection pipeline combined with our human pose estimator: the detection network, trained on MPII, detects the human in the scene (blue box), which is fed into the human pose estimator to estimate the 2D joint positions (MPII format). Best viewed in color.
| Fine-tuning | Pretrained only | | 1 Epoch | | | 30 Epochs | | | 140 Epochs |
| Model | EventGAN | ESIM | EventGAN | ESIM | Real | EventGAN | ESIM | Real | Real |
| MPJPE (pix.) | 14.55 | 19.57 | 6.76 | 7.58 | 8.94 | 6.44 | 6.54 | 6.75 | 6.39 |
| PCKh@50 | 45.47 | 40.53 | 87.70 | 85.89 | 80.55 | 90.19 | 89.93 | 87.53 | 89.86 |
Table 2: Human pose estimation results in MPJPE (pixels, lower is better) and PCKh@50 (higher is better). All EventGAN and ESIM models are first pretrained on simulated events from the MPII and H36M datasets, after which the final linear layer is fine-tuned on the DHP19 training set for the specified number of epochs. The Real models are trained directly (whole model) on the DHP19 training set for the specified number of epochs.

4.3 The Event Car Detection Dataset

For evaluation, we generated a novel dataset for car bounding box annotations for event data. Our dataset consists of 250 labeled images from the MVSEC [42] outdoor driving dataset, with corresponding timestamps. For each image, raters label bounding boxes for all cars within the scene, while also separating the cars into easy (large, no occlusion), hard (medium, or partial occlusion) or don’t care (mostly occluded or too small) categories. In total, there are 451 easy instances, 506 hard instances and 959 don’t care instances. This dataset will be publicly available.

4.4 Competing Methods

We additionally simulate the MPII, H36M and KITTI datasets using ESIM [29], by simulating a random affine transform of each image in the dataset, similar to the method used by Rebecq et al. [30]. Using this simulated data, we train the same networks described in Sections 4.1 and 4.2. For both experiments, we also train networks on real data as a baseline. For object detection, we train a network on the grayscale frames from KITTI, and evaluate on the grayscale frames from MVSEC and DDD17. For human pose estimation, we train a network on the events in the training set (subjects 1-12) of DHP19.

5 Results

5.1 2D Human Pose Estimation

We evaluate our method on the mean per joint position error (MPJPE) [5], $\mathrm{MPJPE} = \frac{1}{J}\sum_{j=1}^{J} \lVert \hat{p}_j - p_j \rVert_2$, as well as PCKh@50 (percentage of correct keypoints) [1], which measures the percentage of joint predictions with error less than 50% of the head size. We define head size as 0.6 times the distance between the head and the midpoint between the shoulders.
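A minimal sketch of the two metrics as defined above (2D joints as J x 2 tensors):

```python
import torch

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance in pixels
    between predicted and ground truth 2D joints (J x 2 tensors)."""
    return (pred - gt).norm(dim=-1).mean()

def pckh50(pred, gt, head, shoulder_mid):
    """PCKh@50: fraction of joints whose error is below 50% of the head
    size, with head size taken as 0.6 times the distance from the head to
    the midpoint between the shoulders (per the definition above)."""
    head_size = 0.6 * (head - shoulder_mid).norm()
    return ((pred - gt).norm(dim=-1) < 0.5 * head_size).float().mean()
```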

In Table 2, we compare networks trained on simulated events from EventGAN and ESIM against a network trained directly on the DHP19 training set. We also report results from fine-tuning the final linear layer on the DHP19 training set for both EventGAN and ESIM. Qualitative results from both DHP19 and out-of-sample data can be found in Figure 5. From these results, we can see that the data generated by EventGAN trains the network to learn representations that closely match those learned from real data. After only one epoch of fine-tuning, applied only to the final layer, we achieve significantly higher accuracy than training on the real data for the same number of epochs, and come close to the accuracy of a network trained for 140 epochs on real data. However, the gap between ESIM and our method is also relatively small. This is largely due to the low difficulty of the dataset, as even training on real events converges to a relatively good solution after only one epoch. We observed this even when testing with much smaller networks, although they converge to a lower accuracy. The DHP19 data is also much cleaner, and as such is closer to the ESIM outputs.

5.2 Object Detection

We evaluate our method according to the precision-recall statistics defined by the PASCAL VOC challenge [8]. Predictions below a confidence threshold are removed, and non-maximum suppression is applied to boxes whose IoU exceeds a threshold. In total, we report precision, recall on the easy and hard classes, as well as the AP and F-1 scores for each training input in Table 1. We provide qualitative results in Figure 4.

From these results, we observe that our method achieves reasonably strong results, and comes close to matching the performance of the network with frame inputs, which was trained on real data. The difference in performance implies a small sim-to-real gap, but may also simply be due to a stronger signal in the images for certain frames (although the reverse may hold for other frames). On the other hand, the sim-to-real gap is significant when training on ESIM. As the true event distribution differs significantly from the simulated data, the network is only able to perform accurate detections when the input has relatively low noise (e.g. Figure 4, right), resulting in very low recall.

6 Conclusions

In this work, we have proposed a novel method for training supervised neural networks for events using image data by way of image to event simulation. Given events and images from an event camera, our deep learning pipeline is able to accurately simulate events from a pair of grayscale images from existing image datasets. These events can be used to train downstream networks for complex tasks such as object detection and 2D human pose estimation, and generalize to real events.

The largest limitation of this work is the need for a pair of frames (video), thus prohibiting the use of larger image datasets such as ImageNet [33] and COCO [23]. While it is possible to train a GAN to predict events from a single image, this would become a complex future prediction task, as the GAN must hallucinate the motion within the image. Other promising future directions include exploring other event representations, more complicated adversarial architectures, and exploring more complex downstream tasks.

7 Acknowledgements

Thanks to Tobi Delbruck and the team at iniLabs and iniVation for providing and supporting the DAVIS-346b cameras. This work was supported in part by the Semiconductor Research Corporation (SRC) and DARPA. We also gratefully appreciate support through the following grants: NSF-IIP-1439681 (I/UCRC), NSF-IIS-1703319, NSF MRI 1626008, ARL RCTA W911NF-10-2-0016, ONR N00014-17-1-2093, ARL DCIST CRA W911NF-17-2-0181, the Amazon Research Award, the AWS Cloud Credits for Research Program and Honda Research Institute.

Footnotes

  1. The code for this project can be found at: https://github.com/alexzzhu/EventGAN.

References

  1. M. Andriluka, L. Pishchulin, P. Gehler and B. Schiele (2014-06) 2D Human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, Figure 3, §4.1, §5.1.
  2. Y. Bi and Y. Andreopoulos (2017) PIX2NVS: parameterized conversion of pixel-domain video frames to neuromorphic vision streams. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 1990–1994. Cited by: §2.1.
  3. C. Brandli, R. Berner, M. Yang, S. Liu and T. Delbruck (2014) A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits 49 (10), pp. 2333–2341. Cited by: §1, §2.2, §4.
  4. A. Brock, J. Donahue and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: §3.3.
  5. E. Calabrese, G. Taverni, C. Awai Easthope, S. Skriabine, F. Corradi, L. Longinotti, K. Eng and T. Delbruck (2019) DHP19: dynamic vision sensor 3D human pose dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §1, Figure 5, §4.1, §5.1.
  6. K. Chaney, A. Zihao Zhu and K. Daniilidis (2019) Learning event-based height from plane and parallax. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §3.1.
  7. S. Chintala, E. Denton, M. Arjovsky and M. Mathieu (2016) How to train a gan. In NIPS 2016 Workshop on Adversarial Training, Cited by: §3.3.
  8. M. Everingham, L. Van Gool, C. K. Williams, J. Winn and A. Zisserman (2010) The pascal visual object classes (VOC) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: Table 1, §5.2.
  9. A. Geiger, P. Lenz and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, Figure 3, §4.2.
  10. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.2, §3.2.
  11. P. Goyal, D. Mahajan, A. Gupta and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235. Cited by: §4.1.
  12. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §3.3.
  13. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  14. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §4.
  15. M. Iacono, S. Weber, A. Glover and C. Bartolozzi (2018) Towards event-driven object detection with off-the-shelf deep learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–9. Cited by: §2.2.
  16. S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §3.3.
  17. C. Ionescu, D. Papava, V. Olaru and C. Sminchisescu (2014-07) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: §1, §4.1.
  18. P. Isola, J. Zhu, T. Zhou and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.2, §3.2.
  19. S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell and K. Bousmalis (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12627–12637. Cited by: §2.2.
  20. J. Kaiser, J. C. V. Tieck, C. Hubschneider, P. Wolf, M. Weber, M. Hoff, A. Friedrich, K. Wojtasik, A. Roennau and R. Kohlhaas (2016) Towards a framework for end-to-end control of a simulated vehicle with spiking neural networks. In 2016 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), pp. 127–134. Cited by: §2.1.
  21. W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang and S. Leutenegger (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. In 29th British Machine Vision Conference 2018, Cited by: §1, §2.1.
  22. J. H. Lim and J. C. Ye (2017) Geometric gan. arXiv preprint arXiv:1705.02894. Cited by: §3.2.
  23. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §6.
  24. L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao and J. Han (2019) On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §4.
  25. M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.2.
  26. T. Miyato, T. Kataoka, M. Koyama and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §3.3.
  27. E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck and D. Scaramuzza (2017) The event-camera dataset and simulator: event-based data for pose estimation, visual odometry, and slam. The International Journal of Robotics Research 36 (2), pp. 142–149. Cited by: §1, §2.1.
  28. X. B. Peng, M. Andrychowicz, W. Zaremba and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §2.2.
  29. H. Rebecq, D. Gehrig and D. Scaramuzza (2018) ESIM: An open event camera simulator. In Conference on Robot Learning, pp. 969–982. Cited by: §1, §2.1, §4.4.
  30. H. Rebecq, R. Ranftl, V. Koltun and D. Scaramuzza (2019) Events-to-video: bringing modern computer vision to event cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3857–3866. Cited by: §1, Figure 3, §3.1, §3.1, §3.3, §3, §4.4.
  31. J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: Figure 4, §4.2.
  32. O. Ronneberger, P. Fischer and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.
  33. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.1, §6.
  34. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §4.
  35. D. Tran, R. Ranganath and D. Blei (2017) Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, pp. 5523–5533. Cited by: §3.2.
  36. Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell and K. Q. Weinberger (2019) Pseudo-LiDAR from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8445–8453. Cited by: §2.2.
  37. X. Weng and K. Kitani (2019) Monocular 3D object detection with pseudo-LiDAR point cloud. arXiv preprint arXiv:1903.09847. Cited by: §2.2.
  38. B. Xiao, H. Wu and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481. Cited by: §4.1.
  39. C. Ye, A. Mitrokhin, C. Fermüller, J. A. Yorke and Y. Aloimonos (2019) Unsupervised learning of dense optical flow, depth and egomotion from sparse event data. arXiv preprint arXiv:1809.08625. Cited by: §1.
  40. Y. You, Y. Wang, W. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell and K. Q. Weinberger (2019) Pseudo-LiDAR++: accurate depth for 3d object detection in autonomous driving. arXiv preprint arXiv:1906.06310. Cited by: §2.2.
  41. A. Zanardi, A. Aumiller, J. Zilly, A. Censi and E. Frazzoli (2019) Cross-modal learning filters for rgb-neuromorphic wormhole learning. Robotics: Science and System XV, pp. P45. Cited by: §2.2.
  42. A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar and K. Daniilidis (2018) The multivehicle stereo event camera dataset: an event camera dataset for 3D perception. IEEE Robotics and Automation Letters 3 (3), pp. 2032–2039. Cited by: 3rd item, §1, §4.3, §4.
  43. A. Z. Zhu, L. Yuan, K. Chaney and K. Daniilidis (2019) Unsupervised event-based learning of optical flow, depth, and egomotion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 989–997. Cited by: §3.1, §3.
  44. A. Z. Zhu and L. Yuan (2018) EV-FlowNet: self-supervised optical flow estimation for event-based cameras. In Robotics: Science and Systems, Cited by: §1, §3.3.
  45. J. Zhu, T. Park, P. Isola and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.2.