Learning to Detect Objects with a 1 Megapixel Event Camera


Event cameras encode visual information with high temporal precision, low data rate, and high dynamic range. Thanks to these characteristics, event cameras are particularly suited to scenarios with high motion, challenging lighting conditions, and low-latency requirements. However, due to the novelty of the field, the performance of event-based systems on many vision tasks is still lower than that of conventional frame-based solutions. The main reasons for this performance gap are the lower spatial resolution of event sensors compared to frame cameras, the lack of large-scale training datasets, and the absence of well-established deep learning architectures for event-based processing. In this paper, we address all these problems in the context of an event-based object detection task. First, we publicly release the first high-resolution large-scale dataset for object detection. The dataset contains more than 14 hours of recordings from a 1 megapixel event camera in automotive scenarios, together with 25M bounding boxes of cars, pedestrians, and two-wheelers, labeled at high frequency. Second, we introduce a novel recurrent architecture for event-based detection and a temporal consistency loss for better-behaved training. The ability to compactly represent the sequence of events in the internal memory of the model is essential to achieve high accuracy. Our model outperforms feed-forward event-based architectures by a large margin. Moreover, our method does not require any reconstruction of intensity images from events, showing that training directly from raw events is possible, more efficient, and more accurate than passing through an intermediate intensity image. Experiments on the dataset introduced in this work, for which events and gray-level images are available, show performance on par with that of highly tuned and studied frame-based detectors.



1 Introduction

Event cameras [31, 47, 57, 15] promise a paradigm shift in computer vision by representing visual information in a fundamentally different way. Rather than encoding dynamic visual scenes with a sequence of still images, acquired at a fixed frame rate, event cameras generate data in the form of a sparse and asynchronous event stream. Each event is represented by a tuple $(x, y, p, t)$ corresponding to an illuminance change by a fixed relative amount, at pixel location $(x, y)$ and time $t$, with the polarity $p$ indicating whether the illuminance was increasing or decreasing. Fig. 1 shows examples of data from an event camera in a driving scenario.

Since the camera does not rely on a global clock, but each pixel independently emits an event as soon as it detects an illuminance change, the event stream has a very high temporal resolution, typically of the order of microseconds [31]. Moreover, due to a logarithmic pixel response characteristic, event cameras have a large dynamic range (often exceeding 120 dB) [15]. Thanks to these properties, event cameras are well suited for applications in which standard frame cameras are affected by motion blur, pixel saturation, and high latency.

Despite the remarkable properties of event cameras, we are still at the dawn of event-based vision and their adoption in real systems is currently limited. This implies scarce availability of algorithms, datasets, and tools to manipulate and process events. Additionally, most of the available datasets have limited spatial resolution or they are not labeled, reducing the range of possible applications [12, 17].

To overcome these limitations, several works have focused on the reconstruction of gray-level information from an event stream [23, 5, 43, 50]. This approach is appealing since the reconstructed images can be fed to standard computer vision pipelines, leveraging more than 40 years of computer vision research. In particular, it was shown [50] that all information required to reconstruct high-quality images is present in the event data. However, passing through an intermediate intensity image comes at the price of adding considerable computational cost. In this work, we show how to build an accurate event-based vision pipeline without the need of gray-level supervision.

Figure 1: Results of our event-camera detector on examples of the released 1Mpx Automotive Detection Dataset. Our method is able to accurately detect objects for a large variety of appearances, scenarios, and speeds. This makes it the first reliable event-based system on a large-scale vision task. All figures in this work are best seen in electronic form.

We target the problem of object detection in automotive scenarios, which is characterized by significant object dynamics and extreme lighting conditions. We make the following contributions to the field. First, we acquire and release the first large-scale dataset for event-based object detection, recorded with a high-resolution (1280×720) event camera [15]. We also define a fully automated labeling protocol, enabling fast and cheap dataset generation for event cameras. The dataset we release contains more than 14 hours of driving recordings, acquired in a large variety of scenarios. We also provide more than 25 million bounding boxes of cars, pedestrians and two-wheelers, labeled at 60Hz.

Our second contribution is the introduction of a novel architecture for event-based object detection together with a new temporal consistency loss. Recurrent layers are the core building block of our architecture: they introduce a fundamental memory mechanism needed to reach high accuracy with event data. At the same time, the temporal consistency loss helps to obtain more precise localization over time. Fig. 1 shows some detections returned by our method on the released dataset. We show that directly predicting the object locations is more efficient and more accurate than applying a detector on the gray-level images reconstructed with a state-of-the-art method [50]. In particular, since we do not impose any bias coming from intensity image supervision, we let the system learn the relevant features for the given task, which do not necessarily correspond to gray-level values.

Finally, we run extensive experiments on our dataset and on another dataset for which gray-level images are also available, showing comparable accuracy to standard frame-based detectors and improved state-of-the-art results for event-based detection. To the best of our knowledge, this is the first work showing an event-based system with performance on par with a frame-based one on a large vision task.

2 Related Work

Several machine learning architectures have been proposed for event cameras  [62, 54, 39]. Some of these methods, such as Spiking Neural Networks [22, 29, 55, 59], exploit the sparsity of the data and can be applied event by event, to preserve the temporal resolution of the events stream [29, 7, 56, 44]. However, efficiently applying these methods to inputs with large event rate remains difficult. For these reasons, their efficacy has mainly been demonstrated on low-resolution classification tasks.

Alternative approaches map the events stream to a dense representation [2, 36, 68, 50]. Once this representation is computed, it can be used as input to standard architectures. Even if these methods lose some of the event’s temporal resolution, they gain in terms of accuracy and scalability [66, 36].

Recently, the authors of [50] showed how to use a recurrent UNet [53] to reconstruct high-quality gray-level images from event data. The results obtained with this method show the richness of the information contained in the events. However, reconstructing a gray-level image before applying a detection algorithm adds a further computational step, which is less efficient and less accurate than directly using the events, as we will show in our experiments.

Very few other works have focused directly on the task of event-based object detection. In [7], the authors propose a sparse convolutional network inspired by the YOLO network [51]. In [30], temporally pooled binary images from the event camera are fed to a Faster R-CNN [52]. However, these methods have only been tested on simple sequences, with a few moving objects on a static background. As we will see, feed-forward architectures are less accurate in more general scenarios.

The lack of works on event-based object detection is also related to the scarce availability of large benchmarked datasets. Despite the increasing effort of the community [45, 3, 6, 67], very few datasets provide ground truth for object detection. The authors of [40] provide a pedestrian detection dataset. However, it is composed of only 12 sequences of 30 seconds. Simulation [49, 18] is an alternative way to obtain large datasets. Unfortunately, existing simulators use hardware models that are too simplified to accurately reproduce all the characteristics of event cameras. Recently, [12] released an automotive dataset for detection. However, it was acquired with a low-resolution QVGA event camera and it contains low-frequency labels (at most 4 Hz). We believe instead that high spatial resolution and high labeling frequency are crucial to properly evaluate an automotive detection pipeline.

3 Event-based Object Detection

In this section, we first formalize the problem of object detection with an event camera, then we introduce our method and the architecture used in our experiments.

3.1 Problem Formulation

Let $E = \{e_i = (x_i, y_i, p_i, t_i)\}_{i \in \mathbb{N}}$ be an input sequence of events, with $x_i$ and $y_i$ the spatial coordinates of the event, $p_i$ the event's polarity and $t_i$ its timestamp. We characterize objects by a set of bounding boxes $B = \{b_j = (x_j, y_j, w_j, h_j, l_j, t_j)\}_{j \in \mathbb{N}}$, where $(x_j, y_j)$ are the coordinates of the top-left corner of the bounding box, $w_j$ and $h_j$ its width and height, $l_j$ the label of the object class, and $t_j$ the time at which the object is present in the scene.

A general event-based detector is given by a function $\mathcal{D}$, mapping $E$ to $B = \mathcal{D}(E)$. Since we want our system to work in real time, we will assume that the output of a detector at time $t$ will only depend on the past, i.e. on events generated before $t$: $\mathcal{D}(E) = \{D(\{e_i\}_{t_i < t})\}_{t \ge 0}$, where $D(\{e_i\}_{t_i < t})$ outputs bounding boxes at time $t$. In this work, we want to learn $D$.

Applying the detector at every incoming event is too expensive and often not required by the final applications, since the apparent motion of objects in the scene is typically much slower than the pixel response time. For this reason, we only apply the detector at fixed time intervals of size $\Delta t$:

$$\mathcal{D}(E) = \left\{ D\left(\{e_i\}_{t_i < t_k}\right) \right\}_{k \in \mathbb{N}},$$

with $t_k = k \Delta t$ (see footnote 1). However, a function $D$ working on all past events $\{e_i\}_{t_i < t_k}$ for every $k$ would be computationally intractable, since the number of input events would increase indefinitely over time.

A solution would be to consider, at each step $k$, only the events in the interval $[t_{k-1}, t_k)$, as is done for example in [66, 36] for other event-based tasks. However, as we will see in Sec. 5, this approach leads to poor results for object detection. This is mainly due to two reasons. First, it is hard to choose a single $\Delta t$ (or a fixed number of events) that works for objects having very different speeds and sizes, such as cars and pedestrians. Secondly, since events contain only relative change information, an event-based object detector must keep a memory of the past. In fact, when the apparent motion of an object is zero, it no longer generates events. Tracking objects using hard-coded rules is generally not accurate for edge cases such as reflections, moving shadows or object deformations.
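The fixed-interval windowing discussed above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_into_windows`, the integer microsecond timestamps, and the tuple layout `(x, y, p, t)` are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical event layout: (x, y, polarity, timestamp in microseconds).
Event = Tuple[int, int, int, int]


def split_into_windows(events: List[Event], delta_t_us: int) -> List[List[Event]]:
    """Group time-sorted events into consecutive windows of length delta_t_us.

    Window j holds the events with j * delta_t_us <= t < (j + 1) * delta_t_us,
    i.e. the interval between two consecutive detector steps.
    """
    if not events:
        return []
    n_windows = events[-1][3] // delta_t_us + 1
    windows: List[List[Event]] = [[] for _ in range(n_windows)]
    for ev in events:
        windows[ev[3] // delta_t_us].append(ev)
    return windows


events = [(3, 4, 1, 10_000), (3, 5, 0, 40_000), (7, 2, 1, 60_000), (1, 1, 1, 170_000)]
windows = split_into_windows(events, delta_t_us=50_000)
print([len(w) for w in windows])
```

In this toy stream the third window is empty: nothing moved during that interval, so a detector that only sees the current window has no evidence that any object is still present.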

For these reasons, we decide to learn a memory mechanism end-to-end, directly from the input events. To use past event information while keeping computational cost tractable, we choose $D$ such that

$$D\left(\{e_i\}_{t_i < t_k}\right) = F\left(\{e_i\}_{t_{k-1} \le t_i < t_k},\, h_{k-1}\right),$$

where $h_{k-1}$ is an internal state of our model encoding past information at time $t_{k-1}$. For each $k$, we define $h_k$ by a recursive formula $h_k = G(\{e_i\}_{t_{k-1} \le t_i < t_k}, h_{k-1})$, with $h_0 = 0$. In the next sections, we describe the recurrent neural network architecture we propose to learn $F$ and $G$.
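To make the recursion concrete, the sketch below runs a detector whose output at each step depends only on the events of the current window and on an internal state carried over from the previous step. The functions `G` and `F` here are hand-written toy stand-ins (a decaying per-pixel evidence counter and a threshold), chosen only for illustration; in the paper both are learned by a neural network.

```python
def G(window, h_prev, decay=0.9):
    """Toy state update: decay past evidence, then add the new events."""
    h = {pixel: decay * value for pixel, value in h_prev.items()}
    for x, y, _p, _t in window:
        h[(x, y)] = h.get((x, y), 0.0) + 1.0
    return h


def F(window, h_prev, threshold=0.5):
    """Toy output function: report pixels whose accumulated evidence is high."""
    h = G(window, h_prev)
    return sorted(pixel for pixel, value in h.items() if value >= threshold)


def run_detector(windows, use_memory=True):
    h = {}  # empty state plays the role of the zero initial state
    outputs = []
    for window in windows:
        outputs.append(F(window, h))
        # Without memory the state is reset, as in a feed-forward detector.
        h = G(window, h) if use_memory else {}
    return outputs


# An object generates one event, then stops moving for two steps.
windows = [[(3, 4, 1, 10)], [], []]
print(run_detector(windows, use_memory=True))
print(run_detector(windows, use_memory=False))
```

With memory, the pixel stays reported during the two silent steps; without it, the output becomes empty as soon as the events stop, which is the failure mode the recurrent formulation is meant to avoid.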

Figure 2: Overview of the proposed architecture. Input events are used to build a tensor map $H_k$ at every time step $t_k$. Feed-forward convolutional layers extract low-level features from $H_k$. Then, ConvLSTM layers extract high-level spatio-temporal patterns. Finally, multiscale features from the recurrent layers are passed to the output layers, to predict bounding box locations and classes. Thanks to the memory of the ConvLSTM layers, temporal information is accumulated and preserved over time, allowing robust detections even when objects stop generating events in the input.

3.2 Method

In this section, we describe the recurrent architecture we use to learn the detector $D$. In order to apply our model, we first preprocess the events to build a dense representation. More precisely, given input events $\{e_i\}_{t_{k-1} \le t_i < t_k}$, we compute a tensor map $H_k \in \mathbb{R}^{C \times M \times N}$, with $C$ the number of channels and $M \times N$ the spatial size. We denote $H_k = H(\{e_i\}_{t_{k-1} \le t_i < t_k})$ in the following. Our method is not constrained to a particular $H$ (cfr. Sec. 5.1).
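As an illustration of such a dense representation, the sketch below builds the simplest one, a per-pixel histogram with one channel per polarity. The function name `event_histogram`, the `(x, y, p, t)` tuple layout and the polarity encoding in {0, 1} are assumptions for the example; the histograms used in the cited works may additionally normalize or clip the counts.

```python
def event_histogram(events, height, width):
    """Count events per pixel, with one channel per polarity.

    Returns a nested list of shape 2 x height x width
    (channels, rows, columns). Polarity is assumed to be 0 or 1.
    """
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, p, _t in events:
        hist[p][y][x] += 1
    return hist


# Three events in one time window: two positive events on the same pixel
# and one negative event elsewhere.
window = [(1, 0, 1, 5), (1, 0, 1, 9), (0, 1, 0, 12)]
H_k = event_histogram(window, height=2, width=3)
print(H_k)
```

The result has a fixed size regardless of how many events fell in the window, which is what allows it to be fed to standard convolutional layers.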

To extract relevant features from the spatial component of the events, $H_k$ is fed as input to a convolutional neural network [26, 35]. In particular, we use Squeeze-and-Excitation layers [20], as they performed better in our experiments. In addition, we want our architecture to contain a memory state to accumulate meaningful features over time and to remember the presence of objects, even when they stop generating events. For this, we use ConvLSTM layers [64], which have been successfully used to extract spatio-temporal information from data [16, 34].

Our model first uses feed-forward convolutional layers to extract high-level semantic features, which are then fed to the remaining ConvLSTM layers (cfr. Fig. 2). This reduces the computational complexity and memory footprint that recurrent layers would incur when operating on large feature maps and, more importantly, prevents the recurrent layers from modeling the dynamics of low-level features, which is not necessary for the given task. We refer to this first part of the network as the feature extractor.

The output of the feature extractor is fed to a bounding box regression head. In this work, we use Single Shot Detectors (SSD) [35], since they are a good compromise between accuracy and computational time. However, our feature extractor could be used in combination with other detector families, such as two-stage detectors. Since we want to extract objects for a large range of scales, we feed features at different resolutions to the regression head. In practice, we use the feature map from each of the recurrent layers. A schematic representation of our architecture is provided in Fig. 2.

As is typically done for object detection, to train the parameters of our network we optimize a loss function composed of a regression term $L_r$ for the box coordinates and a classification term $L_c$ for the class. We use the smooth $\ell_1$ loss $L_s$ [35] for regression and the softmax focal loss [32] for classification. More precisely, for a set of ground-truth bounding boxes at time $t_k$, we encode their coordinates in a tensor $B^*$ of size $4 \times R$, as done in [35], where $R$ is the number of default boxes of the regression head matching a ground-truth box. Let $(B, p)$ be the output of the regression head, with $B$ the tensor encoding the prediction for the above $R$ default boxes and $p$ the class probability distribution for all default boxes. Then, the regression and classification terms of the loss are:

$$L_r = L_s(B, B^*), \qquad L_c = -(1 - p_l)^{\gamma} \log(p_l),$$

where $p_l$ is the probability of the correct class $l$. We set the constant $\gamma$ to 2 and also adapt the unbalanced biases for softmax logits in the spirit of [32].
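The two loss terms can be written out in a few lines. The sketch below is a plain-Python illustration of the standard smooth L1 loss and of the softmax focal loss with exponent 2, not the training code of the paper. Summing over coordinates (rather than averaging) and the function names are assumptions.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 loss summed over coordinates: quadratic near zero, linear beyond 1."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def focal_loss(logits, label, gamma=2.0):
    """Softmax focal loss: cross-entropy down-weighted by (1 - p_l) ** gamma."""
    p_l = softmax(logits)[label]
    return -((1.0 - p_l) ** gamma) * math.log(p_l)


print(smooth_l1([0.5, 3.0], [0.0, 1.0]))  # 0.125 + 1.5
print(focal_loss([0.0, 0.0], label=0))    # uncertain prediction
print(focal_loss([5.0, 0.0], label=0))    # confident, correct prediction
```

The focal term shrinks the contribution of boxes that are already classified confidently, so training concentrates on the hard examples. With `gamma=0` the function reduces to ordinary cross-entropy.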

Figure 3: (a) Detail of the box regression heads. In order to regularize our network temporally, we introduce a secondary regression head predicting, at time $t_k$, the boxes for time $t_{k+1}$. We impose that predictions corresponding to the same time step be consistent. (b) IoU between ground-truth tracks and predicted boxes over time. The consistency loss helps to obtain more precise boxes.

3.2.1 Dual Regression Head and Temporal Consistency Loss

To have temporally consistent detections, we would like the internal states of the recurrent layers to learn high-level features that are stable over long periods of time. Even if ConvLSTM layers can, to some extent, learn slow-changing representations, we further improve detection consistency by introducing an auxiliary loss and an additional regression head trained to predict bounding boxes one time step into the future. The idea is inspired by unsupervised learning methods such as word2vec [41] and CPC [61], which constrain the latent representation to preserve some domain structure. In our case, given that features are shared by both heads, we argue that this has the additional effect of inducing representations that account for object motion. We currently use this effect only as a regularizer; characterizing it would require further analysis that goes beyond the scope of the current work.

Given the input tensor $H_k$ computed in the time interval $[t_{k-1}, t_k)$, the two regression heads will output bounding boxes $B_k$ and $B'_{k+1}$, trying to match ground truth $B^*_k$ and $B^*_{k+1}$ respectively. This dual regression mechanism is shown in Fig. 3(a).

To train the two regression heads, we add to our loss an auxiliary regression term between $B'_{k+1}$ and $B^*_{k+1}$. This term, when applied at every time step $k$, indirectly constrains the outputs $B'_{k+1}$ of the second head to be close to the predictions $B_{k+1}$ of the first head at the next time step, cfr. Fig. 3(a). However, since the two heads are independent, they could converge to different solutions. Therefore, we further regularize training by adding another loss term explicitly imposing $B'_{k+1}$ to be close to $B_{k+1}$. In summary, with $L_s$ the smooth $\ell_1$ loss, the auxiliary loss is:

$$L_t = L_s(B'_{k+1}, B^*_{k+1}) + L_s(B'_{k+1}, B_{k+1}).$$

Then, the final loss we use during training is given by $L = L_c + L_r + L_t$, with $L_c$ and $L_r$ the classification and regression terms defined above. We minimize it using truncated backpropagation through time [63].
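The auxiliary term can be illustrated on a short sequence of boxes. The sketch below is a simplified stand-in for the training loss: each step carries a single box encoded as four numbers, matching to default boxes is ignored, and the names `primary`, `secondary` and `temporal_consistency_loss` are hypothetical.

```python
def smooth_l1(pred, target):
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def temporal_consistency_loss(primary, secondary, ground_truth):
    """Sum of the auxiliary loss over a sequence.

    primary[k]      : box predicted by the first head for step k
    secondary[k]    : box predicted at step k by the second head for step k + 1
    ground_truth[k] : ground-truth box at step k
    """
    total = 0.0
    for k in range(len(primary) - 1):
        total += smooth_l1(secondary[k], ground_truth[k + 1])  # match future ground truth
        total += smooth_l1(secondary[k], primary[k + 1])       # agree with the first head
    return total


# A box moving right by 0.5 per step; the first head is assumed perfect.
gt = [[0.0, 0.0, 1.0, 1.0], [0.5, 0.0, 1.0, 1.0], [1.0, 0.0, 1.0, 1.0]]
primary = gt
anticipating = [gt[1], gt[2], gt[2]]  # second head predicts the next position
lazy = [gt[0], gt[1], gt[2]]          # second head repeats the current position

print(temporal_consistency_loss(primary, anticipating, gt))
print(temporal_consistency_loss(primary, lazy, gt))
```

The second head is only penalty-free when it anticipates the motion, which is one way to read the claim that the shared features are pushed to account for object motion.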

4 The 1 Megapixel Automotive Detection Dataset

In this section, we describe an automated protocol to generate datasets for event cameras. We apply this protocol to generate the detection dataset used in our experiments. However, our approach can be easily adapted to other computer vision tasks, such as face detection and 3D pose estimation.

Setup and Fully Automated Labeling Protocol. The key component for obtaining automated labels is to record with an event camera and a standard RGB camera side by side. Labels are first extracted from the RGB camera and then transferred to the event camera pixel coordinates by using a geometric transformation. In our work, we used the 1 megapixel event camera of [15] and a GoPro Hero6. The two cameras were fixed on a rigid mount side by side, as close as possible to minimize parallax errors. For both cameras, we used a large field of view: about 110 degrees for the event camera and about 120 degrees for the RGB camera. The video stream of the RGB camera is recorded at 4 megapixels and 60 fps.

Once data are acquired from the setup, we perform the following label transfer:

  1. Synchronize the time of the event and frame cameras.
  2. Extract bounding boxes from the frame camera images.
  3. Map the bounding box coordinates from the frame camera to the event camera.

The bounding boxes from the RGB video stream are obtained using a commercial automotive detector, outperforming freely available ones. The software returns labels corresponding to pedestrians, two-wheelers, and cars. The time synchronization can be done using a physical connection between the cameras. However, since this is not always possible, we also propose in the supplementary material an algorithmic way of synchronizing them. Once the two signals are synchronized temporally, we need to find a geometric transformation mapping pixels from the RGB camera to the event camera. Since the distance between the two cameras is small, the spatial registration can be approximated by a homography.

Both time synchronization and homography estimation can introduce some noise in the labels. Nonetheless, we observed time synchronization errors smaller than the discretization step we use, and found the homography assumption good enough for our case, since objects encountered in automotive scenarios are far away relative to the baseline between the cameras.
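The third step of the label transfer, mapping a box through a homography, can be sketched as below. The matrix values are made up for the example (a pure scale and shift); in practice the homography would be estimated from the actual camera pair. The function names are hypothetical, and taking the axis-aligned hull of the four mapped corners is an assumption, since the paper does not detail this step.

```python
def apply_homography(H, x, y):
    """Map a pixel (x, y) through a 3x3 homography given as nested lists."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def transfer_box(H, box):
    """Map an (x, y, w, h) box and return the axis-aligned box enclosing its corners."""
    x, y, w, h = box
    corners = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
    mapped = [apply_homography(H, cx, cy) for cx, cy in corners]
    xs = [p[0] for p in mapped]
    ys = [p[1] for p in mapped]
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0, max(ys) - y0)


# Illustrative homography: halve the resolution and shift by (10, 20) pixels.
H = [[0.5, 0.0, 10.0],
     [0.0, 0.5, 20.0],
     [0.0, 0.0, 1.0]]

print(transfer_box(H, (100.0, 200.0, 40.0, 60.0)))
```

Because a general homography does not keep rectangles axis-aligned, the enclosing box can be slightly larger than the object, which is one of the sources of label noise mentioned above.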

Recordings and Dataset Statistics. Once the labeling protocol is defined, we can easily collect and label a large amount of data. To this end, we mounted the event and frame cameras behind the windshield of a car. We asked a driver to drive in a variety of scenarios, including city, highway, countryside, small villages, and suburbs. The data collection was conducted over several months, with a large variety of lighting and weather conditions during daytime. At the end of the recording campaign, a total of 14.65 hours was obtained. We split them into 11.19 hours for training, 2.21 hours for validation, and 2.25 hours for testing. The total number of bounding boxes is 25M. More statistics can be found in the supplementary material, together with examples from the dataset. To the best of our knowledge, the presented event-based dataset is the largest in terms of labels and classes. Moreover, it is the only available high-resolution detection dataset for event cameras (see footnote 2).

5 Experiments

In this section, we first evaluate the importance of the main components of our method in an ablation study. Then, we compare it against state-of-the-art detectors. We consider the COCO metrics [33] and we report COCO mAP, as it is widely used for evaluating detection algorithms. Even if this metric is designed for frame-based data, we explain in the supplementary material how we extend it to event data. Since labeling was done with a 4 Mpx camera, but the input events have lower resolution, in all our experiments we filter out boxes with a diagonal smaller than 60 pixels. All networks are trained for 20 epochs using ADAM [24] with a learning rate that decays exponentially every epoch. We then select the best model on the validation set and apply it to the test set to report the final mAP.
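The box filtering step amounts to a single comparison per box. The sketch below assumes boxes stored as `(x, y, w, h)` in pixels and a hypothetical function name; it compares squared lengths so that integer box sizes are handled exactly.

```python
def filter_small_boxes(boxes, min_diagonal=60):
    """Keep only boxes whose diagonal is at least min_diagonal pixels."""
    return [
        box for box in boxes
        if box[2] * box[2] + box[3] * box[3] >= min_diagonal * min_diagonal
    ]


boxes = [
    (0, 0, 30, 40),   # diagonal 50: removed
    (0, 0, 36, 48),   # diagonal 60: kept
    (5, 5, 100, 80),  # diagonal about 128: kept
]
print(filter_small_boxes(boxes))
```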

5.1 Ablation Study

As explained in Sec. 3.2, our network can take different representations as input. Here we compare the commonly used Histograms of events [42, 36], Time Surfaces [27], and Event Volumes [68]. The results are given in Tab. 1. We see that Event Volume performs best. Time Surface is 2 points less accurate than Event Volume, but more accurate than simple Histograms.

In a second set of experiments, we show the importance of the internal memory of our network. To do so, we train while constraining the internal state of the recurrent layers to zero. As we can see in Tab. 1, the mAP drops by 12 points, showing that the memory of the network is fundamental to reach good accuracy.

Finally, we show the advantage of using the loss $L_t$ of Sec. 3.2.1. When training with this term, mAP increases by 2 points and COCO $\mathrm{mAP}_{75}$ increases by 4 points (Tab. 1). This shows the advantage of using $L_t$, especially with regard to box precision. In order to better understand the impact of the loss on the results, we compute the Intersection over Union (IoU) between 1000 ground-truth tracks from the validation set and the predicted boxes. We normalize the duration of the tracks to obtain the average IoU. As shown in Fig. 3(b), with $L_t$, IoU is higher for all track durations.
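The track analysis relies on two simple operations: the IoU between a ground-truth box and a predicted box, and an average over tracks of different lengths after mapping each track to a common normalized time axis. The sketch below shows one possible way to do this; the nearest-sample resampling, the `(x, y, w, h)` box layout and the function names are assumptions, as the exact procedure is not specified in the text.

```python
def iou(a, b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def average_iou_over_normalized_time(tracks, n_bins=10):
    """Average per-step IoU curves of different lengths on a common time axis.

    tracks: list of lists, one IoU value per time step of each track.
    Each track is resampled to n_bins values by picking the nearest earlier sample.
    """
    sums = [0.0] * n_bins
    for ious in tracks:
        for b in range(n_bins):
            idx = min(len(ious) - 1, b * len(ious) // n_bins)
            sums[b] += ious[idx]
    return [s / len(tracks) for s in sums]


print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
print(average_iou_over_normalized_time([[0.5, 1.0], [1.0, 1.0, 1.0, 1.0]], n_bins=2))
```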

Figure 4: Detections on a 1 Mpx Dataset sequence. From top to bottom: Events-RetinaNet, E2Vid-RetinaNet (with the reconstructed input images), and our method RED. Thanks to the memory representation learned by our network, RED can detect objects even when they stop generating events, for example the stopped car on the right, even when it is occluded by the motorbike.

| Input representation | mAP |
| --- | --- |
| Histogram | 0.37 |
| Time Surface | 0.39 |
| Event Volume | 0.41 |

| Configuration | mAP | $\mathrm{mAP}_{75}$ |
| --- | --- | --- |
| w/o memory | 0.29 | 0.26 |
| w/o loss $L_t$ | 0.41 | 0.40 |
| memory + loss | 0.43 | 0.44 |

Table 1: Ablation study on the 1Mpx Dataset. First table: mAP for different input representations (without consistency loss). Second table: mAP and $\mathrm{mAP}_{75}$ without some components of our method, with Event Volume as input. "w/o memory" means forcing the internal state to be zero, for all recurrent layers.

5.2 Comparison with the State-of-the-art

We now compare our method with the state-of-the-art on the 1 Mpx Detection Dataset and the Gen1 Detection Dataset [12], which is another automotive dataset acquired with a QVGA event camera [46].

We denote our approach RED, for Recurrent Event-camera Detector. For these experiments, we consider Event Volumes of 50ms. Since there are not many available algorithms for event-based detection, we use as a baseline a feed-forward architecture applied to the same input representation as ours, thus emulating the approach of [40] and [30]. We considered several architectures, leading to similar results. We report here those of RetinaNet [32] with a ResNet50 [19] backbone and a feature pyramid scheme, since it gave the best results. We refer to this approach as Events-RetinaNet. Then, we consider the method of [50], which is currently the best method to reconstruct gray-level images from events and uses a recurrent UNet. For this, we use the code and network publicly released by the authors. Then, we train the RetinaNet detector on these images. We refer to this approach as E2Vid-RetinaNet. For all methods, before passing the input to the first convolutional layer of the detector, input height and width are downsampled by a factor of 2. For the Gen1 Detection Dataset, we also report results available from the literature [39].
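For readers unfamiliar with the Event Volume input, the sketch below shows the usual construction in the spirit of [68]: each event is spread over its two nearest temporal bins with linear weights, signed by polarity. This is an illustrative variant under stated assumptions, not necessarily the exact one used in the experiments: the number of bins, the normalization by the window bounds, the signed single-channel layout and the function name are all choices made for the example.

```python
def event_volume(events, height, width, n_bins, t_start, delta_t):
    """Build an n_bins x height x width volume from (x, y, p, t) events.

    Events are assumed to satisfy t_start <= t < t_start + delta_t,
    with polarity p in {0, 1} mapped to a sign of -1 or +1.
    """
    vol = [[[0.0] * width for _ in range(height)] for _ in range(n_bins)]
    for x, y, p, t in events:
        # Normalized time in [0, n_bins - 1).
        t_star = (n_bins - 1) * (t - t_start) / delta_t
        sign = 1.0 if p == 1 else -1.0
        lower = int(t_star)
        for b in (lower, lower + 1):
            if 0 <= b < n_bins:
                # Linear (triangular) kernel in time.
                vol[b][y][x] += sign * max(0.0, 1.0 - abs(b - t_star))
    return vol


# A 50 ms window (timestamps in microseconds) split into 3 temporal bins.
window = [(0, 0, 1, 12_500), (1, 0, 0, 25_000)]
vol = event_volume(window, height=1, width=2, n_bins=3, t_start=0, delta_t=50_000)
print(vol)
```

Unlike a plain histogram, this representation keeps coarse timing information inside the window, which is consistent with it ranking above Histograms in the ablation.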

Finally, since the 1 Mpx Dataset was recorded together with an RGB camera, we can train a frame-based detector on these images. Since events do not contain color information, we first convert the RGB images to grayscale. Moreover, to have the same level of noise in the labels due to the automated labeling, we map frame camera pixels to the same resolution and FOV as the event camera. In this way, we can estimate how a grayscale detector would perform on our dataset. Similarly, since the Gen1 Dataset was acquired using an event camera providing gray levels, we asked the authors of [12] to run the RetinaNet detector on them. We refer to this approach as Gray-RetinaNet.

The results we obtain are given in Tab. 2. We also report the number of parameters of the networks and the runtime of each method, including both event preprocessing and detector inference, on an i7 CPU at 2.70GHz and a GTX980 GPU. Qualitative results are provided in Fig. 4 and in the supplementary material. From Fig. 4, we see in particular that our model continues detecting the car even when it does not generate events, while Events-RetinaNet becomes unstable and E2Vid-RetinaNet begins oversmoothing the image and thus loses the detection.

As we can see from Tab. 2, our method outperforms all the other event-based ones by a large margin. On the 1Mpx Dataset, the images reconstructed by [50] are of good quality and therefore E2Vid-RetinaNet is the second best method, even if it is 18 points of mAP behind ours (0.25 against 0.43). On the Gen1 Dataset, instead, the model of [50] does not generalize well and images are of lower quality. Therefore, on this dataset, Events-RetinaNet scores better. Our method reaches the same mAP as Gray-RetinaNet on the 1Mpx Dataset, making it the first event-camera detector with comparable precision to that of commonly used frame-camera detectors. On the Gen1 Dataset our method performs slightly worse. This is due to the higher level of noise of the QVGA sensor and to the lower label frequency, which makes training a recurrent model more difficult. Finally, we observe that our method has fewer parameters than the others, runs in real time and, on the 1Mpx Dataset, is 21x faster than E2Vid-RetinaNet, which reconstructs intensity images (840.66 ms against 39.33 ms).

| Method | 1Mpx mAP | 1Mpx runtime (ms) | params (M) | Gen1 mAP | Gen1 runtime (ms) |
| --- | --- | --- | --- | --- | --- |
| MatrixLSTM [8] | - | - | - | 0.31 | - |
| SparseConv [39] | - | - | - | 0.15 | - |
| Events-RetinaNet | 0.18 | 44.05 | 32.8 | 0.34 | 18.29 |
| E2Vid-RetinaNet | 0.25 | 840.66 | 43.5 | 0.27 | 263.28 |
| RED (ours) | 0.43 | 39.33 | 24.1 | 0.40 | 16.70 |
| Gray-RetinaNet | 0.43 | 41.43 | 32.8 | 0.44 | 17.35 |

Provided by the authors, using a pretrained YOLOv3.
Table 2: Evaluation on the two automotive detection datasets.
Figure 5: Top: Gray-RetinaNet applied to night recordings of an HDR automotive camera. Bottom: Our detector RED applied to recordings of a 1 Mpx event camera of the same scene. The detectors were trained on daylight data. Gray-RetinaNet does not generalize well to night images. In contrast, RED generalizes to night sequences because event data is invariant to absolute illuminance levels.

5.3 Generalization to Night Recordings

We now study the generalization capabilities of our detector on night recordings. Since event cameras are invariant to absolute illuminance levels, an event-based detector should generalize better than a frame-based one. To test this, we apply the RED and Gray-RetinaNet detectors to newly recorded night sequences, captured using the event camera of Sec. 4 and an HDR automotive camera. We stress the fact that these networks have been trained exclusively on daylight data. Since the frame-based labeling software of Sec. 4 is not accurate enough for night data, we report qualitative results in Fig. 5 and in the appendix. It can be observed that the accuracy of Gray-RetinaNet drops considerably. This is due to the very different lighting and the higher level of motion blur inherently present in night sequences. On the contrary, our method also performs well in these conditions.

6 Conclusion

We presented a high-resolution event-based detection dataset and a real-time recurrent neural network architecture which can detect objects from event cameras with the same accuracy as mainstream gray-level detectors. We showed it is possible to consistently detect objects over time without the need for an intermediate gray-level image reconstruction. However, our method still needs to pass through a dense event representation. In the future, we plan to exploit the sparsity of the events to further reduce computational cost and latency. This could be done for example by adapting our method to run on neuromorphic hardware [1, 11].

Broader Impact

The integration of an event-based object detection pipeline in real-world applications could positively impact several aspects of existing systems. First, the camera's high temporal resolution would allow faster reaction times and greater robustness in situations where standard cameras suffer from motion blur or high latency. Secondly, event cameras could also improve performance in HDR or low-light scenes. Both these aspects are essential to increase the safety of driving assistance solutions or autonomous vehicles [9]. Similarly, these characteristics could be useful in applications where there is an interaction between humans and robots (e.g., in a production line or in a warehouse). Finally, the adoption of similar pipelines in other contexts, like the Internet of Things, could reduce the power consumption and the data storage of existing systems [37].

Although, as demonstrated in [50], the possibility to reconstruct intensity images from events stream could create privacy issues, the proposed method allows better privacy management. Encoding events in not human-readable structures and not requiring to have an image-like representation prevents the easy use of the recorded data for purposes different than those defined by the original algorithm, and it limits the possibility to identify people, vehicles or places.

Further advances in event-based processing and neuromorphic architectures might also open the future to a new class of extremely low-power and low-latency artificial intelligence systems [48]. In a world where power-hungry deep learning techniques are becoming a commodity, and at the same time, environmental concerns are increasingly pressuring our way of life, neuromorphic systems could be an essential component of a sustainable society [13].

Concerning possible negative outcomes, since our method relies on training data, it will inherit the biases and the limitations contained in it. Similarly, since it relies on deep learning architectures, it might be deceived by adversarial attacks. To mitigate these consequences, several methods have been recently proposed to de-bias deep learning models and make them more robust to adversarial examples [65, 28, 21, 58]. A failure of the system might cause dangerous incidents and have severe consequences for people and facilities [60]. Similarly, its integration in a fully autonomous vehicle poses the ethical question of replacing human moral judgment in the so-called Trolley Problem [38]. Moreover, autonomous vehicles may impact the careers of millions of people [4].

Finally, we think it is essential to be aware that event-based perception and similar detection systems could be exploited to harm people and threaten human rights, for example by developing modified versions of this algorithm for mass surveillance [14] or military applications [10, 25].


We would like to thank Davide Migliore for his organization and support inside the company throughout the preparation of the paper. We would also like to thank Cécile Pignon for acquiring and reviewing most of the sequences of the 1 Megapixel dataset. Finally, we thank Matteo Matteucci and Marco Cannici from Politecnico di Milano for fruitful discussions and for providing the MatrixLSTM results of Table 2.

This work was funded in part by the EU NEOTERIC H2020-ICT-2019-2 project 871330.


  1. Our method can also be applied to a fixed number of events. For clarity, we only describe the fixed time interval case.
  2. A link to the dataset will be made available in the final version of the paper.


  1. F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta and G. Nam (2015) Truenorth: design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE transactions on computer-aided design of integrated circuits and systems 34 (10), pp. 1537–1557. Cited by: §6.
  2. I. Alonso and A. C. Murillo (2019) Ev-segnet: semantic segmentation for event-based cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §2.
  3. A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau and M. Mendoza (2017) A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7243–7252. Cited by: §2.
  4. A. Balakrishnan (2017) Goldman sachs analysis of autonomous vehicle job loss. CNBC. Cited by: Broader Impact.
  5. P. Bardow, A. J. Davison and S. Leutenegger (2016) Simultaneous optical flow and intensity estimation from an event camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 884–892. Cited by: §1.
  6. J. Binas, D. Neil, S. Liu and T. Delbruck (2017) DDD17: end-to-end davis driving dataset. arXiv preprint arXiv:1711.01458. Cited by: §2.
  7. M. Cannici, M. Ciccone, A. Romanoni and M. Matteucci (2019) Asynchronous convolutional networks for object detection in neuromorphic cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §2, §2.
  8. M. Cannici, M. Ciccone, A. Romanoni and M. Matteucci (2020) Matrix-lstm: a differentiable recurrent surface for asynchronous event-based data. arXiv preprint arXiv:2001.03455. Cited by: Table 2.
  9. T. S. Combs, L. S. Sandt, M. P. Clamann and N. C. McDonald (2019) Automated vehicles and pedestrian safety: exploring the promise and limits of pedestrian detection. American journal of preventive medicine 56 (1), pp. 1–7. Cited by: Broader Impact.
  10. M. Cummings (2017) Artificial intelligence and the future of warfare. Chatham House for the Royal Institute of International Affairs London. Cited by: Broader Impact.
  11. M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam and S. Jain (2018) Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38 (1), pp. 82–99. Cited by: §6.
  12. P. de Tournemire, D. Nitti, E. Perot, D. Migliore and A. Sironi (2020) A large scale event-based detection dataset for automotive. arXiv, pp. arXiv–2001. Cited by: §1, §2, §5.2, §5.2.
  13. C. Eliasmith and T. Stewart (2020) Scaling, low power needed for a neuromorphic future. EE Times. https://www.eetimes.com/scaling-low-power-needed-for-a-neuromorphic-future/. Cited by: Broader Impact.
  14. S. Feldstein (2019) The global expansion of ai surveillance. Carnegie Endowment. https://carnegieendowment.org/2019/09/17/global-expansion-of-ai-surveillance-pub-79847. Cited by: Broader Impact.
  15. T. Finateu, A. Niwa, D. Matolin, K. Tsuchimoto, A. Mascheroni, E. Reynaud, P. Mostafalu, F. Brady, L. Chotard and F. LeGoff (2020) 5.10 a 1280×720 back-illuminated stacked temporal contrast event-based vision sensor with 4.86 μm pixels, 1.066 geps readout, programmable event-rate controller and compressive data-formatting pipeline. In 2020 IEEE International Solid-State Circuits Conference-(ISSCC), pp. 112–114. Cited by: §1, §1, §1, §4.
  16. C. Finn, I. Goodfellow and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pp. 64–72. Cited by: §3.2.
  17. G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt and K. Daniilidis (2019) Event-based vision: a survey. arXiv preprint arXiv:1904.08405. Cited by: §1.
  18. D. Gehrig, M. Gehrig, J. Hidalgo-Carrió and D. Scaramuzza (2019) Video to events: bringing modern computer vision closer to event cameras. arXiv preprint arXiv:1912.03095. Cited by: §2.
  19. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.2.
  20. J. Hu, L. Shen, S. Albanie, G. Sun and E. Wu (2017) Squeeze-and-excitation networks. External Links: 1709.01507 Cited by: §3.2.
  21. S. Jha, S. Raj, S. L. Fernandes, S. K. Jha, S. Jha, G. Verma, B. Jalaian and A. Swami (2019) Attribution-driven causal analysis for detection of adversarial examples. arXiv preprint arXiv:1903.05821. Cited by: Broader Impact.
  22. N. Kasabov, K. Dhoble, N. Nuntalid and G. Indiveri (2013) Dynamic evolving spiking neural networks for on-line spatio-and spectro-temporal pattern recognition. Neural Networks 41, pp. 188–201. Cited by: §2.
  23. H. Kim, A. Handa, R. Benosman, S. Ieng and A. J. Davison (2008) Simultaneous mosaicing and tracking with an event camera. J. Solid State Circ 43, pp. 566–576. Cited by: §1.
  24. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
  25. W. Knight (2019) Military artificial intelligence can be easily and dangerously fooled. MIT Technology Review. https://www.technologyreview.com/2019/10/21/132277/military-artificial-intelligence-can-be-easily-and-dangerously-fooled/. Cited by: Broader Impact.
  26. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §3.2.
  27. X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi and R. B. Benosman (2016) Hots: a hierarchy of event-based time-surfaces for pattern recognition. IEEE transactions on pattern analysis and machine intelligence 39 (7), pp. 1346–1359. Cited by: §5.1.
  28. M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656–672. Cited by: Broader Impact.
  29. J. H. Lee, T. Delbruck and M. Pfeiffer (2016) Training deep spiking neural networks using backpropagation. Frontiers in neuroscience 10, pp. 508. Cited by: §2.
  30. J. Li, F. Shi, W. Liu, D. Zou, Q. Wang, P. K. Park and H. Ryu (2017) Adaptive temporal pooling for object detection using dynamic vision sensor.. In BMVC, Cited by: §2, §5.2.
  31. P. Lichtsteiner, C. Posch and T. Delbruck (2008-02) A 128×128 120 db 15 μs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits 43 (2), pp. 566–576. External Links: Document, ISSN 1558-173X Cited by: §1, §1.
  32. T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.2, §5.2.
  33. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.
  34. M. Liu and M. Zhu (2018) Mobile video object detection with temporally-aware feature maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5686–5695. Cited by: §3.2.
  35. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §3.2, §3.2, §3.2.
  36. A. I. Maqueda, A. Loquercio, G. Gallego, N. García and D. Scaramuzza (2018) Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5419–5427. Cited by: §2, §3.1, §5.1.
  37. M. M. Martín-Lopo, J. Boal and Á. Sánchez-Miralles (2020) A literature review of iot energy platforms aimed at end users. Computer Networks, pp. 107101. Cited by: Broader Impact.
  38. A. Maxmen (2018) Self-driving car dilemmas reveal that moral choices are not universal. Nature 562 (7728), pp. 469–469. Cited by: Broader Impact.
  39. N. Messikommer, D. Gehrig, A. Loquercio and D. Scaramuzza (2020) Event-based asynchronous sparse convolutional networks. arXiv preprint arXiv:2003.09148. Cited by: §2, §5.2, Table 2.
  40. S. Miao, G. Chen, X. Ning, Y. Zi, K. Ren, Z. Bing and A. C. Knoll (2019) Neuromorphic benchmark datasets for pedestrian detection, action recognition, and fall detection. Frontiers in neurorobotics 13, pp. 38. Cited by: §2, §5.2.
  41. T. Mikolov, K. Chen, G. Corrado and J. Dean (2013) Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §3.2.1.
  42. D. P. Moeys, F. Corradi, E. Kerr, P. Vance, G. Das, D. Neil, D. Kerr and T. Delbrück (2016) Steering a predator robot using a mixed frame/event-driven convolutional neural network. In 2016 Second International Conference on Event-based Control, Communication, and Signal Processing (EBCCSP), pp. 1–8. Cited by: §5.1.
  43. G. Munda, C. Reinbacher and T. Pock (2018) Real-time intensity-image reconstruction for event cameras using manifold regularisation. International Journal of Computer Vision 126 (12), pp. 1381–1393. Cited by: §1.
  44. D. Neil, M. Pfeiffer and S. Liu (2016) Phased lstm: accelerating recurrent network training for long or event-based sequences. In Advances in neural information processing systems, pp. 3882–3890. Cited by: §2.
  45. G. Orchard, A. Jayawant, G. K. Cohen and N. Thakor (2015) Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience 9, pp. 437. Cited by: §2.
  46. C. Posch, D. Matolin and R. Wohlgenannt (2010) A qvga 143 db dynamic range frame-free pwm image sensor with lossless pixel-level video compression and time-domain cds. IEEE Journal of Solid-State Circuits 46 (1), pp. 259–275. Cited by: §5.2.
  47. C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco and T. Delbruck (2014) Retinomorphic event-based vision sensors: bioinspired cameras with spiking output. Proceedings of the IEEE 102 (10), pp. 1470–1484. Cited by: §1.
  48. B. Rajendran, A. Sebastian, M. Schmuker, N. Srinivasa and E. Eleftheriou (2019) Low-power neuromorphic hardware for signal processing applications: a review of architectural and system-level design approaches. IEEE Signal Processing Magazine 36 (6), pp. 97–110. Cited by: Broader Impact.
  49. H. Rebecq, D. Gehrig and D. Scaramuzza (2018) ESIM: an open event camera simulator. In Conference on Robot Learning, pp. 969–982. Cited by: §2.
  50. H. Rebecq, R. Ranftl, V. Koltun and D. Scaramuzza (2019) High speed and high dynamic range video with an event camera. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §1, §2, §2, §5.2, §5.2, Broader Impact.
  51. J. Redmon, S. Divvala, R. Girshick and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §2.
  52. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.
  53. O. Ronneberger, P. Fischer and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.
  54. Y. Sekikawa, K. Hara and H. Saito (2019) EventNet: asynchronous recursive event processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3887–3896. Cited by: §2.
  55. S. B. Shrestha and G. Orchard (2018) Slayer: spike layer error reassignment in time. In Advances in Neural Information Processing Systems, pp. 1412–1421. Cited by: §2.
  56. A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce and R. Benosman (2018) HATS: histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1731–1740. Cited by: §2.
  57. B. Son, Y. Suh, S. Kim, H. Jung, J. Kim, C. Shin, K. Park, K. Lee, J. Park and J. Woo (2017) 4.1 a 640×480 dynamic vision sensor with a 9 μm pixel and 300 meps address-event representation. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 66–67. Cited by: §1.
  58. J. Svoboda, J. Masci, F. Monti, M. M. Bronstein and L. Guibas (2018) Peernets: exploiting peer wisdom against adversarial attacks. arXiv preprint arXiv:1806.00088. Cited by: Broader Impact.
  59. A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier and A. Maida (2019) Deep learning in spiking neural networks. Neural Networks 111, pp. 47–63. Cited by: §2.
  60. S. Tsugawa (2006) Trends and issues in safe driver assistance systems: driver acceptance and assistance for elderly drivers. IATSS research 30 (2), pp. 6–18. Cited by: Broader Impact.
  61. A. van den Oord, Y. Li and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748. External Links: Link, 1807.03748 Cited by: §3.2.1.
  62. Q. Wang, Y. Zhang, J. Yuan and Y. Lu (2019) Space-time event clouds for gesture recognition: from rgb cameras to event cameras. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1826–1835. Cited by: §2.
  63. P. J. Werbos (1990) Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78 (10), pp. 1550–1560. Cited by: §3.2.1.
  64. S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong and W. Woo (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810. Cited by: §3.2.
  65. B. H. Zhang, B. Lemoine and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340. Cited by: Broader Impact.
  66. A. Zhu, L. Yuan, K. Chaney and K. Daniilidis (2018-06) EV-flownet: self-supervised optical flow estimation for event-based cameras. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania. External Links: Document Cited by: §2, §3.1.
  67. A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar and K. Daniilidis (2018) The multivehicle stereo event camera dataset: an event camera dataset for 3d perception. IEEE Robotics and Automation Letters 3 (3), pp. 2032–2039. Cited by: §2.
  68. A. Z. Zhu, L. Yuan, K. Chaney and K. Daniilidis (2019) Unsupervised event-based learning of optical flow, depth, and egomotion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 989–997. Cited by: §2, §5.1.