Event-based Convolutional Networks for Object Detection in Neuromorphic Cameras

Abstract

Event-based cameras are bio-inspired sensors able to perceive changes in the scene at high frequency with low power consumption. Since they have become available only very recently, a limited amount of work addresses object detection on these devices. In this paper we propose two neural network architectures for object detection: YOLE, which integrates the events into frames and uses a frame-based model to process them, and eFCN, an event-based fully convolutional network that uses a novel and general formalization of the convolutional and max pooling layers to exploit the sparsity of the camera events. We evaluated the algorithms on different extensions of publicly available datasets and on a novel custom dataset.



Marco Cannici (marco.cannici@mail.polimi.it), Marco Ciccone (marco.ciccone@polimi.it), Andrea Romanoni (andrea.romanoni@polimi.it), Matteo Matteucci (matteo.matteucci@polimi.it), Politecnico di Milano, Milano, Italy

 


© 2018. The copyright of this document resides with its authors.
It may be distributed unchanged freely in print or electronic forms.

1 Introduction

Fundamental techniques underlying Computer Vision are based on the ability to extract meaningful features. To this extent, Convolutional Neural Networks (CNNs) rapidly became the first choice for developing computer vision applications such as image classification [? ? ? ? ], object detection [? ? ? ] and semantic scene labeling [? ? ? ], and they have recently been extended also to non-Euclidean domains such as manifolds and graphs [? ? ].

In most cases the input of these networks consists of RGB images. On the other hand, neuromorphic cameras [? ? ? ] are becoming more and more widespread. These devices are bio-inspired vision sensors that attempt to emulate the functioning of biological retinas. As opposed to conventional cameras, which generate frames at a constant frame rate, these sensors output data only when a brightness change is detected. Whenever this happens, an event is generated indicating the position and the instant at which the change has been detected, together with its polarity, i.e., whether the brightness change is positive or negative. The result is a sensor able to produce a stream of asynchronous events that sparsely encodes what happens inside the scene with microsecond resolution and with minimum requirements in terms of power consumption and bandwidth.

The growth in popularity of this type of sensor, and its advantages in terms of temporal resolution and reduced data redundancy, have led to new algorithms and paradigms that fully exploit the advantages of event-based vision for varied applications, e.g., feature detection [? ], visual odometry [? ? ] and optical flow estimation [? ]. The most popular method to deal with this kind of visual information are Spiking Neural Networks (SNNs) [? ], a processing model that aims to better model the biological counterpart of artificial neural networks. The use of spikes as a means of communication between neurons, however, limits the range of predictable values and prevents SNNs from being used to solve complex computer vision problems such as object detection. While in classification the use of spikes does not constitute a problem [? ? ], as label predictions can be obtained by looking at the output neuron that spikes first or most often, in object detection there is no obvious way of using a spike-based encoding to output bounding box information. Moreover, this type of communication makes SNNs non-differentiable, and thus difficult to train and use in complex scenarios.

An alternative solution to deal with event-based cameras is to make use of frame reconstruction procedures and conventional frame-based neural networks [? ] which can instead rely on optimized training procedures. A recent solution proposed by ? ], instead, makes use of LSTM cells to accumulate events over time and perform classification.

Even though event cameras are becoming increasingly popular, due to their relative novelty only very few datasets for object detection on event-based data streams are available. As a result, a limited number of object detection algorithms have been proposed in the literature [? ? ? ]. In this paper we present a hybrid approach to feature extraction in neuromorphic cameras. Our framework allows the design of object detection networks able to exploit events and sparsely recompute features while still preserving the advantages of conventional neural networks. The key contributions of this paper are:

  1. A frame integration procedure inspired by SNNs which encodes the timing of the events in the intensity of each pixel (Section 2).

  2. A basic model for object detection in neuromorphic cameras that integrates events (Section 2).

  3. The formalization of event-based convolutional and max-pooling layers used to define fully-convolutional neural networks for event-based object detection (Section 3).

  4. A set of novel datasets to test object detection with event cameras (Section 4).

2 YOLE: You Only Look Events

In this section we describe an approach to adapt conventional frame-based neural networks to deal with event-based inputs. Sparse events generated by the neuromorphic camera are integrated into a volatile frame, a spatial structure that maintains event information through time. Once this frame has been reconstructed, a general neural network can be applied to it as with classical images.

Leaky Integrator.

The design of the proposed frame integration mechanism takes inspiration from the functioning of Spiking Neural Networks (SNNs) to maintain memory of past events. Every time an event with coordinates (x, y) and timestamp t is received, the corresponding pixel of the integrated frame is incremented by a fixed amount \Delta_{incr}. At the same time, the whole frame is decremented by a quantity q that depends on the time elapsed between the last received event and the previous one. The described procedure can be formalized by the following relations:

\Phi^{t}_{x,y} = \max\big(0,\ \Phi^{t_{prev}}_{x,y} - q + \Delta_{incr}\big), \qquad \Phi^{t}_{u,v} = \max\big(0,\ \Phi^{t_{prev}}_{u,v} - q\big) \quad \forall (u,v) \neq (x,y)      (1)

where \Phi_{u,v} is the value of the pixel in position (u, v) of the integrated frame and q = \lambda \cdot (t - t_{prev}). \Delta_{incr} and \lambda are correlated: \Delta_{incr} determines how much information is contained in each single event, whereas \lambda defines the decrement of information in time. Given a certain choice of these parameters, similar frames can be obtained by using, for instance, a higher increment \Delta_{incr} together with a higher temporal decay \lambda. For this reason, we fix \Delta_{incr} and we vary only \lambda based on the dataset to be processed. Pixel values are prevented from becoming negative by means of the max operation.
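For concreteness, the following minimal sketch (our own, not the reference implementation) integrates a stream of events into a leaky surface stored as a NumPy array; the parameters incr and lam stand for \Delta_{incr} and \lambda, and their default values are placeholders.

```python
import numpy as np

class LeakySurface:
    """Volatile frame that integrates events with a leaky decay (sketch)."""

    def __init__(self, height, width, incr=0.1, lam=1e-5):
        self.frame = np.zeros((height, width), dtype=np.float32)
        self.incr = incr      # Delta_incr: contribution of a single event
        self.lam = lam        # lambda: temporal decay per unit of time
        self.last_ts = None   # timestamp of the last processed event

    def add_event(self, x, y, ts):
        # Decay the whole frame proportionally to the elapsed time (Eq. 1),
        # clamping at zero, then increment the pixel hit by the event.
        if self.last_ts is not None:
            q = self.lam * (ts - self.last_ts)
            self.frame = np.maximum(0.0, self.frame - q)
        self.last_ts = ts
        self.frame[y, x] += self.incr   # polarity is ignored on purpose
```

The current integrated frame can then be read from the `frame` attribute at any time and fed to a conventional CNN.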

Other frame reconstruction procedures, such as the one in [? ], divide time into constant and predefined intervals. Frames are obtained by setting each pixel to a binary value (depending on the polarity) if at least one event has been received within the reconstruction interval. With this mechanism, however, time resolution is lost and the same importance is given to each event, even if it is noise. The proposed method, instead, does not distinguish the polarity of the events, obtaining frames invariant to the object movement, and performs continuous and incremental integration, characteristics that allowed us to develop the event-based framework presented in Section 3.

Event-based Object Detection.

We identified YOLO [? ] as a good candidate for object detection to be used in our event-based framework: it is fully differentiable and able to produce predictions with small input-output delays. By means of a standard CNN and a single run of the model, it simultaneously predicts not only the class, but also the position and dimension of every object present inside the scene. We used the YOLO training mechanism and the previous frame integration procedure to train our YOLE model. Its architecture is depicted in Figure 1. Note that in this context we use the term YOLO to refer only to the training procedure proposed by ? ] and not to the specific network architecture; indeed, we used a simpler structure for our models, as explained in Section 4. Nevertheless, YOLE does not exploit the sparse nature of events; thus, in the next section, we propose a fully event-based framework for convolutional networks.

Figure 1: The YOLE detection network based on YOLO. The input frames are divided into a grid of regions, each of which predicts a set of bounding boxes.
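For clarity, the snippet below sketches how a YOLO-style output grid of this kind can be decoded into bounding boxes. The grid size S, the number of boxes per region B, the number of classes C and the confidence threshold are placeholders, and the encoding follows the original YOLO convention rather than details specific to YOLE.

```python
import numpy as np

def decode_grid(pred, S, B, C, conf_thresh=0.3):
    """Decode a (S, S, B*5 + C) YOLO-style prediction grid into boxes.

    Each region predicts B boxes as (x, y, w, h, confidence), with (x, y)
    relative to the region and (w, h) relative to the whole frame, plus a
    shared class distribution of C values.
    """
    boxes = []
    for i in range(S):          # grid row
        for j in range(S):      # grid column
            cell = pred[i, j]
            class_probs = cell[B * 5:]
            cls = int(np.argmax(class_probs))
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                score = conf * class_probs[cls]
                if score < conf_thresh:
                    continue
                # Convert region-relative centers to frame-relative coordinates.
                cx, cy = (j + x) / S, (i + y) / S
                boxes.append((cx, cy, w, h, cls, float(score)))
    return boxes
```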

3 Event-based Fully Convolutional Networks (eFCN)

Conventional CNNs for video analysis usually treat every frame independently and recompute all the feature maps entirely, even if consecutive frames differ from each other only in small portions. Besides being a significant waste of power and computation, this approach does not match the nature of event-based cameras, which directly output the locations where changes happen.

To exploit the event nature of neuromorphic vision, we propose a modification of the forward pass of fully convolutional architectures. In the following, the convolution and pooling operations are reformulated to produce the final prediction by recomputing only the features corresponding to the regions affected by events. Feature maps maintain their state over time and are updated only as a consequence of incoming events. At the same time, the leaking mechanism that allows past information to be forgotten acts independently on each layer of the CNN and enables features computed in the past to fade away as their visual information starts to disappear in the volatile frame.

This method can be applied to any convolutional architecture. Indeed, a CNN trained to process frames reconstructed from streams of events can easily be converted into an event-based CNN, without any modification of its layer composition and using the same weights learned while observing frames, while keeping its output unchanged.

3.1 Volatile Frame Reconstruction Layer

The volatile frame reconstruction layer extends the method presented in Section 2 to recover the integrated frame so that sequences of multiple events can be processed at the same time. Every pixel that corresponds to an event coordinate is incremented by \Delta_{incr}, and the leaking mechanism is applied to the whole frame by using the timestamp of the most recent event in the sequence: all the pixels are decremented by the same quantity q.

To allow subsequent layers to locate changes in reconstructed frames, the frame integration layer also performs the following set of operations: (i) it sends the list of incoming events to the next layer so that modified pixels can be updated; (ii) it communicates the decrement q to all the subsequent layers so that feature maps can be updated in correspondence of regions not affected by any event; (iii) it communicates which pixels have been reset to 0 to prevent their value from becoming negative (max operator in Equation (1)).
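A minimal sketch of this forward step is shown below, assuming events are given as (x, y, t) tuples; function and variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def reconstruction_step(frame, events, last_ts, incr, lam):
    """One forward step of the volatile frame reconstruction layer (sketch).

    Returns the updated frame plus the information forwarded to later layers:
    the list of event coordinates, the global decrement q, and the mask of
    pixels clamped to zero by the max operator of Eq. (1).
    """
    ts = events[-1][2]                      # timestamp of the most recent event
    q = lam * (ts - last_ts) if last_ts is not None else 0.0
    leaked = frame - q                      # leak applied to the whole frame
    reset_mask = leaked < 0                 # (iii) pixels reset to zero
    frame = np.maximum(0.0, leaked)
    for x, y, _ in events:                  # (i) pixels touched by events
        frame[y, x] += incr
    return frame, [(x, y) for x, y, _ in events], q, reset_mask, ts
```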

3.2 Event-based Convolutional Layer

The proposed event-based convolutional (e-conv) layer uses events to determine where the input feature map has changed with respect to the previous time step and, therefore, which parts of its internal state, i.e., the feature map computed at the previous time step, must be recomputed and which parts can be reused. As opposed to the frame reconstruction layer, the update mechanism that allows past information to leak away over time does not depend only on the time elapsed since the last time the state was updated. Indeed, the transformations applied by previous layers and the composition of their activation functions may cause the decrement q to act differently in different parts of the feature map. We face this issue by storing an additional set of information and by using a particular class of activation functions for the hidden layers of the network.

Let us consider the first layer of a CNN which processes the frames reconstructed by the frame integration layer and which computes the convolution of a set of filters w with bias b and activation function f. The computation performed on each receptive field R_{i,j} is:

y^{t}_{i,j} = f\Big( \sum_{(n,m) \in R_{i,j}} w_{n,m}\, x^{t}_{n,m} + b \Big)      (2)

where (n, m) selects a pixel x^{t}_{n,m} of the current receptive field R_{i,j} at time t and its corresponding value w_{n,m} in the kernel, while the indices (i, j) indicate the location in the output feature map.

When a new event arrives, the frame reconstruction layer decreases all the pixels by q, i.e., a pixel not directly affected by the event becomes x^{t+1}_{n,m} = x^{t}_{n,m} - q, with q = \lambda \cdot \Delta t.

At time t+1, Equation (2) becomes:

y^{t+1}_{i,j} = f\Big( \sum_{(n,m) \in R_{i,j}} w_{n,m}\, (x^{t}_{n,m} - q) + b \Big)      (3)

If f is a piecewise linear activation function, such as ReLU or Leaky ReLU, and we assume that the updated value will not cause the activation function to change the sign of the output with respect to the input, Equation (3) can be rewritten as follows:

y^{t+1}_{i,j} = y^{t}_{i,j} - m_{i,j}\, q \sum_{(n,m) \in R_{i,j}} w_{n,m}      (4)

where m_{i,j} is the coefficient applied by the piecewise function, which depends on the feature value in position (i, j). When the previous assumption is not satisfied, the feature is recalculated as if its receptive field had been affected by an event.

Consider now a second convolutional layer attached to the first one:

y^{2,t+1}_{i,j} = y^{2,t}_{i,j} - m^{2}_{i,j}\, q \sum_{(n,m) \in R^{2}_{i,j}} w^{2}_{n,m}\, m^{1}_{n,m} \sum_{(n',m') \in R^{1}_{n,m}} w^{1}_{n',m'}      (5)

The equation can easily be extended by induction to a generic layer l as follows:

y^{l,t+1}_{i,j} = y^{l,t}_{i,j} - q\, U^{l}_{i,j}      (6)

where the update matrix U^{l}_{i,j} expresses how visual inputs are transformed by the network in every receptive field R_{i,j}.

The max operator applied by the frame integration layer can be interpreted as a ReLU, and Equation (4) becomes:

y^{t+1}_{i,j} = y^{t}_{i,j} - m_{i,j}\, q \sum_{(n,m) \in R_{i,j}} w_{n,m}\, m^{0}_{n,m}      (7)

where the value m^{0}_{n,m} is 0 if x^{t}_{n,m} - q is negative and 1 if it is positive. Notice that m needs to be updated only when the corresponding feature changes enough to make the activation function use a different coefficient, e.g., from 0 to 1 in the case of ReLU. Events are used to communicate the change to subsequent layers so that their update matrices can also be updated accordingly.

(a)
(b)
Figure 2: The core structure of the e-conv (a) and e-max-pooling (b) layers. The internal states and the update matrices are recomputed locally only in correspondence of the events (green cells), whereas the remaining regions (depicted in yellow) are obtained by reusing the previous state.

The internal state of the e-conv layer, therefore, comprises the feature maps and the update values computed at the previous time step. The initial values of the internal state are computed by making a full-frame inference on a blank frame; this is the only time the network needs to be executed entirely. As a new sequence of events arrives, the following operations are performed (see Figure 2(a); a simplified sketch of these steps is given after the list):

  1. Update the update matrix U locally on the coordinates specified by the list of incoming events (Eq. (6)).

  2. Update the feature map with Eq. (7) in the locations not affected by any event and generate an output event wherever the activation function coefficient changed.

  3. Recompute the features through Equation (2) in correspondence of the incoming events and output which receptive fields have been affected.

  4. Forward the feature map and the events generated in the current step to the next layer.
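The sketch below illustrates steps 2 and 3 for a single-channel e-conv layer with ReLU activation, stride 1 and no padding; it is a simplified illustration under our own naming, not the authors' code, and it omits the bookkeeping of the update matrix U of step 1.

```python
import numpy as np

def econv_step(x_new, state_y, state_m, w, b, q, event_pixels):
    """One incremental step of a single-channel e-conv layer (illustrative sketch).

    x_new:        input map after the leak and the new events, shape (H+k-1, W+k-1)
    state_y:      output feature map computed at the previous step, shape (H, W)
    state_m:      ReLU coefficients of the previous step (1 if active, else 0)
    q:            global decrement communicated by the reconstruction layer
    event_pixels: set of (row, col) input coordinates affected by events
    """
    k = w.shape[0]                       # square kernel, stride 1, no padding
    H, W = state_y.shape
    y, m = state_y.copy(), state_m.copy()
    out_events = set()
    for i in range(H):
        for j in range(W):
            rf = {(i + di, j + dj) for di in range(k) for dj in range(k)}
            patch = x_new[i:i + k, j:j + k]
            if rf & event_pixels:
                # Step 3: recompute the feature from scratch (Eq. 2) and
                # propagate the event to the next layer.
                val = float(np.sum(w * patch) + b)
                out_events.add((i, j))
            else:
                # Step 2: incremental update (Eq. 7); m0 marks input pixels
                # that were not clamped to zero by the reconstruction layer.
                m0 = (patch > 0).astype(w.dtype)
                val = y[i, j] - m[i, j] * q * float(np.sum(w * m0))
                if (val > 0) != (y[i, j] > 0):
                    # Activation coefficient changed: recompute the feature
                    # and notify the next layer with a new event.
                    val = float(np.sum(w * patch) + b)
                    out_events.add((i, j))
            m[i, j] = 1.0 if val > 0 else 0.0
            y[i, j] = max(val, 0.0)
    return y, m, out_events   # step 4: forward the map and the new events
```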

3.3 Event-based Max Pooling Layer

The location of the maximum value in each receptive field of a max-pooling layer is likely to remain the same over time. An event-based pooling layer, hence, can exploit this property to avoid recomputing the position of the maximum values at every step.

The internal state of an event-based max-pooling (e-max-pool) layer can be described by a positional matrix which has the shape of the output feature map produced by the layer and which stores, for every receptive field, the position of its maximum value. Every time a sequence of events arrives, the internal state is updated by recomputing the position of the maximum values in every receptive field affected by an incoming event. The internal state is then used both to build the output feature map and to produce the update matrix, by fetching the values of the previous layer at the locations stored in the positional matrix. For each e-max-pool layer, the indices of the receptive fields where the maximum value changes are communicated to the subsequent layers so that their internal states can be updated accordingly. This mechanism is depicted in Figure 2(b).

Notice that the leaking mechanism acts differently in distinct regions of the input space. Features inside the same receptive field can indeed decrease over time at different speeds, as their update values may differ. Therefore, even if no event has been detected inside its region, the position of the maximum value might change and, in principle, the update procedure would have to check whether the maximum value has changed both in the receptive fields affected by an event and in the remaining regions. However, if an input feature corresponds to the maximum value of a receptive field and also has the minimum update rate among the input features of that receptive field, the output feature will decrease more slowly than all the others and its value will remain the maximum. In this case, we do not need to recompute the maximum until a new event arrives at this receptive field.
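As an illustration, the following sketch updates the positional matrix of an e-max-pool layer with non-overlapping windows only where the previous layer signalled a change; the window size, data layout and names are our assumptions, not the authors' implementation.

```python
import numpy as np

def emaxpool_step(x, argmax_state, changed_windows, pool=2):
    """Update an e-max-pool layer only where its input changed (sketch).

    x:               current input feature map, shape (H, W), divisible by pool
    argmax_state:    integer array (H//pool, W//pool, 2) with the positions of
                     the current maxima, one per receptive field
    changed_windows: set of (i, j) output coordinates whose receptive field
                     received an event from the previous layer
    """
    out_events = set()
    for (i, j) in changed_windows:
        window = x[i * pool:(i + 1) * pool, j * pool:(j + 1) * pool]
        di, dj = np.unravel_index(np.argmax(window), window.shape)
        new_pos = (i * pool + di, j * pool + dj)
        if tuple(argmax_state[i, j]) != new_pos:
            argmax_state[i, j] = new_pos      # the maximum moved:
            out_events.add((i, j))            # notify subsequent layers
    # The output is always gathered through the stored argmax positions.
    out = x[argmax_state[..., 0], argmax_state[..., 1]]
    return out, argmax_state, out_events
```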

3.4 Event-based Fully Convolutional Object Detection

To fully exploit the event-based layers presented so far, YOLE needs to be converted into a fully convolutional object detection network by replacing all its layers with their event-based versions (see Figure 3). In particular, fully-connected layers are replaced with e-conv layers that map the features extracted by the previous layers into the precise set of values defining the bounding box parameters predicted by the network.

This architecture divides the field of view into a grid of regions, each of which predicts a set of bounding boxes and classifies the detected objects into the different classes. Each one of the regions in which the frame is divided is processed independently. The last e-conv layer is used to decrease the dimensionality of the feature vectors and to map them into the right set of parameters, regardless of their position in the field of view.
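As an illustration of this replacement, the regression/classification head can be written as a stack of 1x1 convolutions that maps each region's feature vector to its box parameters and class scores; the snippet below sketches the idea with standard PyTorch layers and placeholder channel sizes, and it does not include the event-based machinery described above.

```python
import torch.nn as nn

def detection_head(in_channels, boxes_per_region, num_classes):
    """Convolutional replacement of the fully-connected layers (sketch).

    Every spatial location of the feature map (one per region) is mapped to
    its own bounding box parameters and class scores, regardless of its
    position in the field of view.
    """
    out_channels = boxes_per_region * 5 + num_classes
    return nn.Sequential(
        nn.Conv2d(in_channels, 256, kernel_size=1),     # reduce dimensionality
        nn.LeakyReLU(0.1),
        nn.Conv2d(256, out_channels, kernel_size=1),    # per-region predictions
    )
```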

Figure 3: Fully-convolutional detection network based on YOLE. The first convolutional layers extract a hierarchy of abstract representations. The last layer is used to map the feature vectors into a set of values which define the parameters of the predicted bounding boxes.

4 Experiments

Datasets.

Only a few event-based object detection datasets are publicly available in the literature: N-MNIST [? ], MNIST-DVS [? ] and N-Caltech101 [? ]. These datasets are obtained from the original MNIST [? ] and Caltech101 [? ] datasets by recording the original images with an event camera while moving either the camera itself or the images. We performed experiments on N-Caltech101 and on two enhanced versions of N-MNIST and MNIST-DVS, i.e., Shifted N-MNIST and Shifted MNIST-DVS, with the purpose of better testing the translation invariance of detection models. Moreover, we present a novel dataset, named Blackboard MNIST, and an extension of POKER-DVS [? ], an event-based dataset originally designed for object tracking. See the supplementary material for further details (Shifted N-MNIST, Shifted MNIST-DVS, POKER-DVS and Blackboard MNIST will be released soon).

Experiments Setup.

Event-based datasets are generally simpler than the RGB ones used in the original YOLO paper. Therefore, we designed the object detection networks taking inspiration from the simpler LeNet [? ] model with conv-pool layers for feature extraction. Both YOLE and eFCN share the same structure up to the last regression/classification layers.

For the N-Caltech101 dataset, we used a slightly different architecture inspired by the structure of the VGG16 model [? ]. The network is composed of only one layer for each group of convolutional layers, as we noticed that a simpler architecture achieved better results. Moreover, the dimensions of the last fully-connected layers have been adjusted so that the frame is divided into a grid of regions, each predicting a set of bounding boxes. As in the original YOLO architecture, for all models we used Leaky ReLU as the activation function of the hidden layers and a linear activation for the last one.

In all the experiments the first four convolutional layers were initialized with kernels from a recognition network pretrained to classify MNIST-DVS digits, while the remaining layers were initialized as proposed by ? ]. All networks were trained by optimizing the multi-objective loss proposed by ? ] with Adam [? ]. The batch size was chosen depending on the dataset (different values for Shifted N-MNIST, for Shifted MNIST-DVS and N-Caltech101, for Blackboard MNIST and for Poker-DVS), with the aim of filling the GPU memory as much as possible. Early stopping was applied to prevent overfitting.

4.1 Results and Discussion

            S-N-MNIST                                S-MNIST-DVS   Blackboard MNIST   Poker-DVS   N-Caltech101
            v1     v2     v2*    v2fr   v2fr+ns
accuracy    94.9   91.7   94.7   88.6   85.5         96.1          90.4               99.01       56.5
mAP         91.3   87.9   90.5   81.5   77.4         92.0          87.4               98.06       30.7

Table 1: Detection performance of YOLE.
S-MNIST-DVS Blackboard MNIST
accuracy 94.0 88.5
mAP 87.4 84.7
Table 2: Detection performance of eFCN.

Detection performance of YOLE.

The YOLE network achieves good detection results in most of the datasets, both in terms of mean average precision (mAP) [? ] and accuracy (which in this case is computed by matching every ground truth bounding box with the predicted box having the highest IoU). The results we obtained are summarized in Table 1.
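For reference, the sketch below shows one way to compute this accuracy (IoU computation plus highest-IoU matching); treating a ground truth box as correct when its best-matching prediction has the right class is our reading of the metric described above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_accuracy(ground_truths, predictions):
    """Fraction of ground truth boxes whose highest-IoU prediction has the right class.

    ground_truths, predictions: lists of (box, class_id) tuples for one frame.
    """
    if not ground_truths:
        return 1.0
    correct = 0
    for gt_box, gt_cls in ground_truths:
        if not predictions:
            continue
        # Match each ground truth box with the prediction of highest IoU.
        best_box, best_cls = max(predictions, key=lambda p: iou(gt_box, p[0]))
        if best_cls == gt_cls:
            correct += 1
    return correct / len(ground_truths)
```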

We also used the Shifted N-MNIST dataset to analyze how detection performance changes when the network is used to process scenes composed of a variable number of objects. We denote with v1 the results obtained using scenes composed of a single digit, and with v2 those obtained with scenes containing two digits in random locations of the field of view. We further tested the robustness of the proposed model by adding challenging noise, i.e., higher than what can usually be experienced with event cameras. We added non-target objects (v2fr) in the form of five fragments, taken from random N-MNIST digits using a procedure similar to the one used to build the Cluttered Translated MNIST dataset [? ], and additional random events per frame (v2fr+ns). In the case of multiple objects the algorithm is still able to detect all of them, while, as expected, performance drops both in terms of accuracy and mean average precision when dealing with noisy data. Nevertheless, we obtained very good detection performance on the Shifted MNIST-DVS, Blackboard MNIST and Poker-DVS datasets, which represent a more realistic scenario in terms of noise.

All of these experiments were performed using the set of hyperparameters suggested by the original work of ? ]. However, a different choice of these parameters worked better for us, increasing both the accuracy and mean average precision scores (v2*).

The dataset on which the proposed model did not achieve remarkable results is N-Caltech101. This is mainly explained by the fact that the number of samples in each class is not evenly balanced. The network, indeed, achieves results comparable to those obtained on the other datasets when the number of training samples is high, as with Airplanes, Motorbikes and Faces_easy (see Table 3 and Table 4 in the supplementary material). As the number of training samples decreases, however, the performance of the model becomes worse, a behavior which explains the poor aggregate scores we report in Table 1.

Detection performance of eFCN.

We tested the performance of the eFCN network on two datasets: Shifted MNIST-DVS and Blackboard MNIST. With this fully convolutional variant of the network we registered a slight decrease in performance w.r.t. the results we obtained using YOLE, as reported in Table 2. This gap in performance is mainly due to the fact that each region in eFCN generates its predictions by looking only at the visual information contained in its portion of the field of view. Indeed, if an object is only partially contained inside a region, the network has to guess the object's dimensions and class by looking at a restricted region of the frame. However, removing the last fully-connected layers allowed us to design a detection network made only of event-based layers, which also uses a significantly lower number of parameters. In the supplementary material we provide a video showing a comparison between the predictions obtained using the two proposed networks, YOLE and eFCN.

Timing performance of the event-based framework.

In order to identify the advantages and weaknesses of our event-based framework, we compared our detection networks on two datasets, Shifted N-MNIST and Blackboard MNIST. On the first one the event-based approach achieved a speedup over a network using conventional layers, whereas on the second one it performed slightly slower. The second benchmark is indeed challenging for our framework, since changes are not localized in restricted regions of the frame due to the presence of noise and of large objects covering multiple regions. In these conditions, where most of the feature maps need to be recalculated, a conventional frame-based approach performs better since it does not have to deal with the overhead of additional event information.

Figure 4: Examples of YOLE predictions on Shifted N-MNIST, Shifted MNIST-DVS, OD-Poker-DVS, N-Caltech101 and Blackboard MNIST.

5 Conclusions

In this paper we proposed two different methods, based on the YOLO architecture, to perform object detection with event-based cameras. The first one, namely YOLE, integrates events into a unique frame to make them usable with YOLO. Conversely, eFCN relies on a more general extension of the convolutional and max pooling layers to deal directly with events and to exploit their sparsity in order to avoid reprocessing the whole network. This novel event-based framework can be applied to any fully convolutional architecture to make it usable with event cameras, even conventional CNNs for classification, although in this paper it has been applied to object detection networks.

We analyzed the timing performance of this formalization, obtaining promising results. We are planning to extend our framework to automatically detect at runtime when the use of event-based layers speeds up computation (i.e., when changes affect few regions of the frame) and when a complete recomputation of the feature maps is more beneficial, in order to exploit the benefits of both approaches. Nevertheless, we believe that a hardware implementation, e.g., with FPGAs, would allow the advantages of the proposed method to be better exploited, enabling a fair timing comparison with SNNs, which are usually implemented in hardware.



Event-based Convolutional Networks for Object Detection in Neuromorphic Cameras
Supplementary material


Marco Cannici (marco.cannici@mail.polimi.it), Marco Ciccone (marco.ciccone@polimi.it), Andrea Romanoni (andrea.romanoni@polimi.it), Matteo Matteucci (matteo.matteucci@polimi.it), Politecnico di Milano, Milano, Italy

 

Figure 1: Examples of samples from the proposed datasets: Shifted N-MNIST, Shifted MNIST-DVS, OD-Poker-DVS and Blackboard MNIST.

Figure 2: Different versions of Shifted N-MNIST: v1, v2, v2fr and v2fr+ns.

Figure 3: Examples of the three different scales of MNIST-DVS digits: two samples at scale4, two at scale8 and two at scale16.

In this document we describe the novel event-based datasets adopted in the paper "Event-based Convolutional Networks for Object Detection in Neuromorphic Cameras" and report the per-category detection performance on the N-Caltech101 dataset.

1 Event-based object detection datasets

Due to the lack of object detection datasets for event cameras, we extended the publicly available N-MNIST, MNIST-DVS and Poker-DVS, and we propose a novel dataset based on MNIST, i.e., Blackboard MNIST. They will be released soon; in the meantime, Figure 1 reports some examples from the four datasets.

1.1 Shifted N-MNIST

The original N-MNIST [? ] extends the well-known MNIST [? ]: it provides an event-based representation of both the full training set and the full testing set to evaluate object classification algorithms. The dataset has been recorded by moving an event camera in front of an LCD screen displaying static MNIST digits. For further details we refer the reader to [? ].

Starting from the N-MNIST dataset, we built a more complex set of recordings that we used to train the object detection network to detect multiple objects in the same scene. We created two versions of the dataset, Shifted N-MNIST v1 and Shifted N-MNIST v2, which contain respectively one or two non-overlapping N-MNIST digits per sample, randomly positioned on a bigger surface. We used different surface dimensions in our tests, varying from double the original size upwards. The dimension and structure of the resulting dataset are the same as those of the original N-MNIST collection.

To extend the dataset for object detection evaluation, ground truth bounding boxes are required. To estimate them we first integrate events into a single frame as described in Section 2 of the main paper. We remove noise by keeping only the non-zero pixels having a minimum number of other non-zero pixels around them within a circle of fixed radius; all the other pixels are considered noise. Then, with a custom version of the DBSCAN [? ] density-based clustering algorithm, we group the pixels belonging to the same digit into a single cluster. A threshold is used to filter out small bounding boxes extracted in correspondence of low event activity. This condition usually happens during the transition from one saccade to the next, as the camera remains still for a small fraction of time and no events are generated. The coordinates of these bounding boxes are then shifted based on the final position the digit has in the bigger field of view.
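A rough sketch of this annotation pipeline is given below; it uses scikit-learn's stock DBSCAN in place of the custom variant, a naive pairwise distance computation, and placeholder values for the neighborhood radius, minimum neighbor count and minimum box area.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def extract_boxes(frame, radius=2.0, min_neighbors=3, eps=3.0, min_area=16):
    """Estimate bounding boxes from an integrated frame (illustrative sketch)."""
    ys, xs = np.nonzero(frame)
    pts = np.stack([xs, ys], axis=1).astype(np.float32)
    if len(pts) == 0:
        return []
    # Noise removal: keep pixels with enough non-zero neighbors within `radius`
    # (the -1 excludes the pixel itself from its own neighborhood count).
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    keep = (dists <= radius).sum(axis=1) - 1 >= min_neighbors
    pts = pts[keep]
    if len(pts) == 0:
        return []
    # Group the remaining pixels into digit clusters.
    labels = DBSCAN(eps=eps, min_samples=min_neighbors).fit_predict(pts)
    boxes = []
    for lbl in set(labels) - {-1}:
        cluster = pts[labels == lbl]
        x1, y1 = cluster.min(axis=0)
        x2, y2 = cluster.max(axis=0)
        if (x2 - x1) * (y2 - y1) >= min_area:   # filter out tiny boxes
            boxes.append((x1, y1, x2, y2))
    return boxes
```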

For each N-MNIST sample, another digit was randomly selected from the same portion of the dataset (training, testing or validation) to form a new example. The final dataset contains the same number of training and testing samples as the original N-MNIST dataset. In Figure 2 we illustrate one example of v1 and the three variants of v2 we adopted (and described) in the paper.

1.2 Shifted MNIST-DVS

The MNIST-DVS dataset [? ] is another collection of event-based recordings that extends MNIST [? ]. It consists of samples recorded by displaying digits on a screen in front of an event camera; differently from N-MNIST, the digits are moved on the screen instead of the sensor, and each digit is displayed at three different scales, i.e., scale4, scale8 and scale16. The resulting dataset is composed of event-based recordings showing each one of the selected MNIST digits at three different dimensions. Examples of these recordings are shown in Figure 3.

We used the MNIST-DVS recordings to build a detection dataset by means of a procedure similar to the one we used to create the Shifted N-MNIST dataset; however, in this case we mix together digits of multiple scales. All the MNIST-DVS samples, regardless of the actual dimensions of the recorded digits, are contained within a fixed field of view. Digits are placed centered inside the scene and occupy a limited portion of the frame, especially those belonging to the smallest and middle scales. In order to place multiple examples in the same scene, we first cropped the three scales of samples into smaller recordings occupying three appropriately sized spatial regions. The bounding box annotations and the final examples were obtained by means of the same procedure we used to construct the Shifted N-MNIST dataset. These recordings were built by mixing digits of different dimensions in the same sample. Based on the original sample dimensions, we decided to use the following four configurations (which specify the number of samples of each category used to build a single Shifted MNIST-DVS example): (i) three scale4 digits, (ii) two scale8 digits, (iii) two scale4 digits mixed with one scale8 digit, (iv) one scale16 digit, placed in random locations of the field of view. The overall dataset is composed of samples containing these four possible configurations.

1.3 OD-Poker-DVS

The original Poker-DVS [? ] has been proposed to test object recognition algorithms; it is a small collection of neuromorphic recordings obtained by quickly browsing custom-made poker card decks in front of a DVS camera for 2-4 seconds. The dataset is composed of samples containing centered pips of the four possible categories (spades, hearts, diamonds or clubs) extracted from three deck recordings. Single pips were extracted by means of an event-based tracking algorithm which was used to follow symbols inside the scene and to extract the examples.

With OD-Poker-DVS we extend its scope to also test object detection. To do so, we used the event-based tracking algorithm provided with the original dataset to follow the movement of the samples in the uncut recordings and extract their bounding boxes.

Even if composed of a limited amount of samples, this dataset represents an interesting real-world application that highlights the potential of event-based vision sensors. The nature of the data acquisition is indeed particularly well suited to neuromorphic cameras due to their very high temporal resolution. Symbols are clearly visible inside the recordings even if they move at very high speed. Each pip, indeed, takes from to ms to cross the screen but it can be easily recognized within the first - ms.

1.4 MNIST Blackboard

The two datasets based on MNIST presented in Sections 1.1 and 1.2 have the drawback of recording digits at predefined sizes. Therefore, with Blackboard MNIST we propose a more challenging scenario that consists of a number of samples showing digits (from the original MNIST dataset) written on a blackboard in random positions and at different scales.

(a)
(b)
Figure 4: (a) The image shows in black the log-intensity of a single pixel. This curve is sampled at a constant rate when frames are generated by Blender, shown in the figure as vertical blue lines. The sampled values thus obtained (blue circles) are used to approximate the pixel intensity by means of a simple piecewise linear time interpolation (red line). Whenever this curve crosses one of the threshold values (horizontal dashed lines), a new event is generated with the corresponding predicted timestamp. (Figure from [? ]) (b) A preprocessed MNIST digit on top of the blackboard's background.

We used the DAVIS simulator released by [? ] to build our own set of synthetic recordings. Given a three-dimensional virtual scene and the trajectory of a moving camera within it, the simulator is able to generate a stream of events describing the visual information captured by the virtual camera. The system uses Blender [? ], an open-source 3D modeling tool, to generate thousands of rendered frames along a predefined camera trajectory, which are then used to reconstruct the corresponding neuromorphic recording. The intensity value of each single pixel inside the sequence of rendered frames, captured at a constant frame rate, is tracked. As Figure 4(a) shows, an event is generated whenever the log-intensity of a pixel crosses an intensity threshold, as in a real event-based camera. A piecewise linear time interpolation mechanism is used to determine brightness changes in the time between frames, in order to simulate the microsecond timestamp resolution of a real sensor. We extended the simulator to output bounding box annotations associated with every visible digit.
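The sketch below illustrates the event-generation principle only (per-pixel log-intensity threshold crossings with linear time interpolation); it is a simplification that emits at most one event per pixel per frame interval, whereas the actual simulator can generate several, and the threshold value is a placeholder.

```python
import numpy as np

def frames_to_events(frames, timestamps, threshold=0.15, eps=1e-3):
    """Convert a sequence of rendered frames into DVS-like events (sketch).

    frames:     list of grayscale images in [0, 1], rendered at a constant rate
    timestamps: corresponding frame times in seconds
    Returns a list of (x, y, t, polarity) tuples sorted by time.
    """
    events = []
    log_prev = np.log(frames[0] + eps)
    ref = log_prev.copy()                      # per-pixel reference level
    for frame, t_prev, t_cur in zip(frames[1:], timestamps[:-1], timestamps[1:]):
        log_cur = np.log(frame + eps)
        delta = log_cur - ref
        ys, xs = np.nonzero(np.abs(delta) >= threshold)
        for y, x in zip(ys, xs):
            pol = 1 if delta[y, x] > 0 else -1
            # Linear interpolation of the crossing time between the two frames.
            slope = log_cur[y, x] - log_prev[y, x]
            frac = (ref[y, x] + pol * threshold - log_prev[y, x]) / slope if slope else 1.0
            t = t_prev + float(np.clip(frac, 0.0, 1.0)) * (t_cur - t_prev)
            events.append((x, y, t, pol))
            ref[y, x] += pol * threshold       # move the reference after firing
        log_prev = log_cur
    events.sort(key=lambda e: e[2])
    return events
```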

We used the Blender APIs to place MNIST digits in random locations of a blackboard and to record their position with respect to the camera point of view. Original MNIST images depict black handwritten digits on a white background. To mimic chalk on a blackboard, we removed the background, turned the digits white, and roughened their contours to make them look as if they were written with chalk. An example is shown in Figure 4(b).

(a)
(b)
Figure 5: (a) The 3D scene used to generate the Blackboard MNIST dataset. The camera moves in front of the blackboard along a straight trajectory while following a focus object that moves on the blackboard’s surface, synchronized with the camera. The camera and its trajectory are depicted in green, the focus object is represented as a red cross and, finally, its trajectory is depicted as a yellow line. (b) The three types of focus trajectories.

The scene contains the image of a blackboard on a vertical plane and a virtual camera that moves horizontally on a predefined trajectory parallel to the blackboard plane (see Figure 5(a)). The camera points at a hidden object that moves on the blackboard surface, synchronized with the camera, following a given trajectory. To introduce variability in the camera movement, and to allow all the digits' outlines to be seen (and possibly detected), we used different trajectories that vary from a straight path to a smooth or triangular undulating path that makes the camera tilt along the transverse axis while moving (Figure 5(b)).

Before starting the simulation, we randomly select a number of preprocessed MNIST digits and place them in a random portion of the blackboard. The camera moves so that all the digits are framed during its movement. The simulation is then started on this modified scene to generate the neuromorphic recordings. Every time a frame is rendered during the simulation, the bounding boxes of all the visible digits inside the frame are also extracted. This operation is performed by computing the camera-space coordinates (or normalized device coordinates) of the top-left and bottom-right vertices of all the images inside the field of view. Since the images are slightly larger than the actual digits they contain, we manually cropped every bounding box to better enclose each digit and also to compensate for the small offset in pixel position introduced by the camera motion and by the linear interpolation mechanism. In addition, bounding boxes corresponding to objects which are only partially visible are filtered out. In order to build the final detection dataset, this generation process is executed multiple times, each time with different digits.

We built three sub-collections of recordings with increasing level of complexity which we merged together to obtain our final dataset: Blackboard MNIST EASY, Blackboard MNIST MEDIUM, Blackboard MNIST HARD. In Blackboard MNIST EASY, we used digits of only one dimension (roughly corresponding to the middle scale of MNIST-DVS samples) and a single type of camera trajectory which moves the camera from right to left with the focus object moving in a straight line. In addition, only three objects were placed on the blackboard using only a fixed portion of its surface. We collected a total of samples ( training, testing, validation).

Blackboard MNIST MEDIUM features more variability in the number and dimensions of the digits and in the types of camera movements. Moreover, the portion of the blackboard on which digits are added varies and may cover any region of the blackboard, even those near its edges. The camera motions were also extended to the set of all possible trajectories that combine either left-to-right or right-to-left movements with variable paths of the focus object. We used three types of trajectories for this object: a straight line, a triangular path or a smooth curved trajectory, all parallel to the camera trajectory and placed around the position of the digits on the blackboard. One of these paths was selected randomly for every generated sample. Triangular and curved trajectories were introduced as we noticed that sudden movements of the camera produce bursts of events that we wanted our detection system to be able to handle. The number and dimensions of the digits were chosen following three possible configurations, similarly to the Shifted MNIST-DVS dataset: either six small digits (with sizes comparable to scale4 MNIST-DVS digits), three intermediate-size digits (comparable to the MNIST-DVS scale8) or two big digits (comparable to the biggest scale of the MNIST-DVS dataset, scale16). A set of recordings was generated using the same splits as the first variant and with an equal amount of samples in each one of the three configurations.

Finally, Blackboard MNIST HARD contains digits recorded by using the second and third configuration of objects we described previously. However, in this case each image was resized to a variable size spanning from the original configuration size down to the previous scale. A total of new samples ( training, testing, validation) were generated, of them containing three digits each and the remaining consisting of two digits with variable size.

The three collections can be used individually or jointly; the whole Blackboard MNIST dataset contains the samples of all three collections, split into training, testing and validation sets. Examples of the different object configurations are shown in Figure 6. Samples were saved using the AEDAT file format for event-based recordings.

Figure 6: Examples of the three types of object configurations used to generate the second collection of the Blackboard MNIST dataset.

2 Results

In the video attachment we illustrate the detection results of YOLE and eFCN.

Table 3 and Table 4 provide a detailed analysis of the model performance on some N-Caltech101 categories.


Category          AP      Samples
airplanes         93.17   480
Motorbikes        92.31   480
Faces_easy        87.15   261
inline_skate      78.03   19
minaret           76.95   46
dollar_bill       76.58   32
grand_piano       76.09   61
watch             71.49   145
laptop            68.42   49
menorah           67.39   53
yin_yang          66.23   36
windsor_chair     64.12   34
soccer_ball       62.69   40
stapler           59.08   27
trilobite         58.95   52
stop_sign         57.95   40
accordion         57.18   33
cellphone         56.70   37
metronome         56.32   20
umbrella          54.90   45
Table 3: Top-20 average precisions on N-Caltech101.

Category          AP     Samples
kangaroo          6.15   52
beaver            5.61   28
lobster           5.36   25
crocodile_head    5.03   31
ant               4.96   66
flamingo          4.35   68
gerenuk           4.32   22
scorpion          4.19   52
elephant          3.89   40
llama             3.64   48
crayfish          3.09   42
ibis              3.02   48
panda             2.84   24
emu               1.43   33
mayfly            1.20   24
bass              0.82   34
crocodile         0.22   61
cannon            0.08   27
binocular         0.00   21
wild_cat          0.00   22
Table 4: Worst-20 average precisions on N-Caltech101.