Unsupervised Learning of Dense Optical Flow and Depth from Sparse Event Data


Chengxi Ye, Anton Mitrokhin, Chethan Parameshwara,
Cornelia Fermüller, James A. Yorke, Yiannis Aloimonos
The authors contribute equally to this work. University of Maryland Institute for Advanced Computer Studies, College Park, MD 20742, USA. E-mails: cxy@umd.edu, amitrokh@umd.edu, cmparam9@terpmail.umd.edu, fer@umiacs.umd.edu, yiannis@cs.umd.edu. Institute for Physical Science and Technology, University of Maryland, College Park, MD 20742, USA. E-mail: yorke@umd.edu
Abstract

In this work we present unsupervised learning of depth and motion from sparse event data generated by a Dynamic Vision Sensor (DVS). To tackle this low-level vision task, we use a novel encoder-decoder neural network architecture that aggregates multi-level features and addresses the problem at multiple resolutions. A feature decorrelation technique is introduced to improve the training of the network. A non-local sparse smoothness constraint is used to alleviate the challenge of data sparsity. Our work is the first to generate dense depth and optical flow information from sparse event data. Our results show significant improvements over previous works that used deep learning for flow estimation from both images and events.

I Introduction

Visual motion is evolutionarily the oldest and most important cue for encoding information about the 3D motion and spatial geometry of a scene. Even the most primitive animals, such as insects and reptiles, use visual motion to interpret the space-time geometry surrounding them. Yet, even the most advanced Computer Vision algorithms are no match for the capabilities of biological systems.

Recently, there has been much progress in imaging sensor technology, offering alternative solutions to scene perception. The dynamic vision sensor (DVS) [19] and other event-based sensors, inspired by the transient pathway of mammalian vision, offer exciting alternatives for visual motion perception. The DVS does not record image frames; instead, it reports the changes of lighting occurring independently at every pixel. Each of these changes is transmitted asynchronously and is called an event. By its design, this sensor accommodates a large dynamic range and provides high temporal resolution and low latency, ideal properties for real-time applications. These properties come at a heavy price: the sensors produce a lot of noise, and the data is very sparse, which can be a great advantage but requires different treatment. The reality created by this novel type of visual sensor thus requires completely different visual processing approaches.

Most works, both in frame-based and event-based vision, fall within the reconstruction framework. A typical reconstruction pipeline computes feature-based correspondences or the optical flow field, which are universal representations for motion analysis.

With recent advancements in deep learning, the traditional feature-based scene reconstruction framework has been replaced by neural networks. Neural network based learning approaches [34, 33] have shown promising results on frame-based data in solving video reconstruction problems. However, the design of neural network architectures for event-based data is still a challenging problem because of the sparse nature of events and the ambiguity in event representation for training a neural network. In this work, we propose a novel Evenly-Cascaded neural Network (ECN) architecture and a frame-based representation to address these design challenges.

Fig. 1: Optical flow and depth inference on sparse event data at night. The left image is the event camera output. The middle column is ground truth and the last column is the network output (top row - flow, bottom row - depth). The event data is overlaid on the ground truth and inference images in blue. Best viewed in color.

Traditionally, low-level tasks such as image segmentation, depth estimation and optical flow were solved by utilizing low-level features at multiple resolutions. Recently introduced deep neural networks with encoder-decoder architectures address these low-level tasks using the high-level features universal in deep networks [28, 34, 9]; they do not use low-level features at multiple resolutions. In this work we introduce a novel encoding-decoding neural network architecture that utilizes both low-level and high-level features and addresses the final task from coarse to fine using multiple resolutions. We also utilize a sparse smoothness constraint, which is tailored for sparse data.

Our pipeline achieves good results on low-light scenes. Fig. 1 shows one example featuring night driving: the network was able to predict both depth and flow even with a low event rate and an abundance of noise. One of the contributing factors is our event-image representation: instead of using the latest event timestamps, we use the average timestamp of the events generated at a given pixel. The averaging helps to reduce the noise without losing the timestamp information. The main contributions of our work can be summarized as:

  • The first learning-based approach to full structure from motion using DVS input.

  • A new network architecture, called the Evenly-Cascaded neural Network (ECN).

  • A data representation (the average time image) that improves robustness in difficult lighting conditions.

  • A pre-processed MVSEC [36] dataset to allow other researchers to work further on SfM with event data.

II Related Work

II-A Event-based Depth Estimation

The majority of event-based depth estimation methods [27, 18, 37, 35] use two or more event cameras. As our proposed approach uses only one event camera, we focus our discussion on monocular depth estimation methods. The first works on event-based monocular depth estimation were presented in [15] and [17]. Rebecq et al. [15] used a space-sweep voting mechanism and maximization strategy to estimate semi-dense depth maps where the trajectory is known. Kim et al. [17] used probabilistic filters to jointly estimate the motion of the event camera, a 3D map of the scene, and the intensity image. More recently, Gallego et al. [11] proposed a unified framework for joint estimation of depth, motion and optical flow. So far there has been no deep learning framework to predict depths from a monocular event camera.

II-B Event-based Optical Flow

Previous approaches to image motion estimation used local information in event-space. The different methods adapt, in smart ways, one of the three principles known from frame-based vision, namely correlation [7, 21], gradient [4], and local frequency estimation [29, 2]. The most popular approaches are gradient based and fit local planes to events [3, 25]. As discussed in [1], local event information is inherently ambiguous. To resolve the ambiguity, Barranco et al. [1] proposed to collect events over longer time intervals and compute the motion from the trace of events that contours create when moving over multiple pixels.

Recently, neural network approaches have shown promising results in various estimation problems without explicit feature engineering. Orchard and Etienne-Cummings [26] used a spiking neural network to estimate flow. Most recently, Zhu et al. [38] released the MVSEC dataset [36] and proposed a self-supervised learning algorithm to estimate dense flow. Unlike [38], which uses grayscale information as a supervision signal, our proposed framework uses only events.

II-C Self-supervised Structure from Motion

The unsupervised learning framework for 3D scene understanding has recently gained popularity in frame-based vision research. Zhou et al. [34] pioneered this line of work. They followed traditional geometric modeling and built two neural networks, one for learning depth from single image frames and one for learning pose from consecutive frames, which were self-supervised by aligning the frames via the flow. Follow-up works [22, 33] have used similar formulations with better loss functions and networks.

III Methods

III-A Ego-motion Model

We assume that the camera is moving with a rigid motion with translational velocity $\mathbf{t} = (t_x, t_y, t_z)^T$ and rotational velocity $\boldsymbol{\omega} = (\omega_x, \omega_y, \omega_z)^T$, and that the camera intrinsic matrix $K$ is provided. We start from the calibrated coordinates, obtained by applying $K^{-1}$ beforehand. Here we give a brief overview of several equations that are used in this paper [5]. Let $\mathbf{P} = (X, Y, Z)^T$ be the world coordinates of a point, which are related to the calibrated pixel coordinates $(x, y)$ as $x = X/Z$, $y = Y/Z$. The velocity of the pixel, $(u, v) = (\dot{x}, \dot{y})$, is obtained as $u = \frac{\dot{X}Z - X\dot{Z}}{Z^2}$, $v = \frac{\dot{Y}Z - Y\dot{Z}}{Z^2}$.

On the other hand, under the assumption of rigid motion, we have $\dot{\mathbf{P}} = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{P}$, where $\times$ represents the cross product. The expanded version of the equation is $\dot{X} = -t_x - \omega_y Z + \omega_z Y$, $\dot{Y} = -t_y - \omega_z X + \omega_x Z$, $\dot{Z} = -t_z - \omega_x Y + \omega_y X$. After basic substitutions and using the camera projection equations, the previous equations can be simplified into the algebraic form:

$$\begin{pmatrix} u \\ v \end{pmatrix} = \frac{1}{Z}\begin{pmatrix} -1 & 0 & x \\ 0 & -1 & y \end{pmatrix}\mathbf{t} + \begin{pmatrix} xy & -(1+x^2) & y \\ 1+y^2 & -xy & -x \end{pmatrix}\boldsymbol{\omega} \qquad (1)$$

To put it in simpler words, for each pixel, given the inverse depth $1/Z$ and the ego-motion velocities $(\mathbf{t}, \boldsymbol{\omega})$, we can calculate the optical flow or pixel velocity $(u, v)$ using a simple matrix multiplication (Eq. 1). Here $\mathbf{p} = (\mathbf{t}/Z, \boldsymbol{\omega})^T$ is the pose, and the coefficients above form a $2 \times 6$ matrix $A(x, y)$, so that $(u, v)^T = A(x, y)\, \mathbf{p}$.
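As a worked example of Eq. 1, the sketch below computes dense optical flow from an inverse depth map and the two velocity vectors. This is an illustration only; the function name, argument layout, and the use of NumPy are our choices, not the paper's implementation.

```python
import numpy as np

def flow_from_depth_and_pose(inv_depth, t, w, xs, ys):
    """Per-pixel optical flow from Eq. 1.

    inv_depth : (H, W) array of inverse depths 1/Z
    t, w      : translational and rotational velocities, each of shape (3,)
    xs, ys    : (H, W) grids of calibrated pixel coordinates
    Returns (u, v), the horizontal and vertical flow components.
    """
    tx, ty, tz = t
    wx, wy, wz = w
    # translational part, scaled by the inverse depth
    u_t = inv_depth * (-tx + xs * tz)
    v_t = inv_depth * (-ty + ys * tz)
    # rotational part, independent of depth
    u_r = xs * ys * wx - (1.0 + xs ** 2) * wy + ys * wz
    v_r = (1.0 + ys ** 2) * wx - xs * ys * wy - xs * wz
    return u_t + u_r, v_t + v_r
```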

III-B The Pipeline

Fig. 2: A depth network (middle) with an encoder-decoder architecture is used to estimate scene depth. A pose network (right) takes consecutive frames to estimate the translational velocity $\mathbf{t}$ and rotational velocity $\boldsymbol{\omega}$ with respect to the middle frame. Given the poses of the neighboring frames and the depth of the middle frame, we calculate the pixel velocity, or optical flow. The neighboring frames are inversely warped to the middle frame so that we can calculate the warping loss.

In this work we use a network with a novel encoding-decoding architecture to estimate the scaled inverse depth from a slice of event signals, rather than from standard RGB images [34]. We use a separate network that takes consecutive slices of signals and predicts the translational velocity $\mathbf{t}$ and rotational velocity $\boldsymbol{\omega}$. Under the rigid ego-motion assumption, the velocity of each pixel can be predicted from the inverse depth and $(\mathbf{t}, \boldsymbol{\omega})$ using a simple matrix multiplication (Eq. 1). The pixel velocity, also known as optical flow, is used to inversely warp the neighboring slices to the middle slice (Fig. 2). We use the photometric warping loss as the supervision signal.
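The warping step can be sketched with PyTorch's bilinear sampler as below. The sign convention of the flow, the padding mode, and the plain L1 loss are our assumptions for illustration; the paper's exact loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def inverse_warp(neighbor, flow):
    """Warp a neighboring event slice to the middle slice using optical flow.

    neighbor : (B, C, H, W) tensor, the slice to be warped
    flow     : (B, 2, H, W) pixel displacements from the middle slice to the neighbor
    """
    b, _, h, w = neighbor.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32, device=neighbor.device),
        torch.arange(w, dtype=torch.float32, device=neighbor.device),
        indexing='ij')
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # sampling locations in the neighbor
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # normalize to [-1, 1], as required by grid_sample
    grid = torch.stack([2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(neighbor, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=True)

def photometric_loss(middle, neighbor, flow):
    """L1 photometric warping loss between the middle slice and the warped neighbor."""
    return (inverse_warp(neighbor, flow) - middle).abs().mean()
```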

III-C Evenly Cascading Network Architecture

We use an encoder network to estimate the motion pose from consecutive frames and a U-Net-like encoder-decoder network [28] to estimate the scaled depth. Here, we introduce important differences to circumvent drawbacks of the standard designs.

Standard downsampling and upscaling techniques for neural networks, such as pooling and transposed convolutions, are limited to integer scaling factors. The networks need to be carefully handcrafted according to the problem size. Upscaling with transposed convolutions is also known to introduce unwanted ‘checkerboard artifacts’.

In our network, we use bilinear interpolation to resize the features, as in classic vision problems. In the encoding layers, our network evenly downscales the previous feature maps by a constant scaling factor to obtain coarser and coarser features until the feature sizes fall below a predefined threshold. In the decoding layers, the feature maps are upscaled back by the reciprocal of that factor. Since bilinear interpolation is locally differentiable, the gradients can be easily calculated for backpropagation training. The network construction is automatic and is controlled by the scaling factor.
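A minimal sketch of this automatic construction is shown below; the scaling factor of 0.75 and the minimum size of 16 are illustrative values, not the paper's hyperparameters.

```python
import torch.nn.functional as F

def encoder_sizes(h, w, scale=0.75, min_size=16):
    """Feature-map sizes for the encoding stage.

    The sizes shrink by a constant factor until either dimension would fall
    below min_size; the decoding stage replays this list in reverse.
    """
    sizes = [(h, w)]
    while min(h * scale, w * scale) >= min_size:
        h, w = int(round(h * scale)), int(round(w * scale))
        sizes.append((h, w))
    return sizes

def resize(features, size):
    """Differentiable bilinear resizing used for both downscaling and upscaling."""
    return F.interpolate(features, size=size, mode='bilinear', align_corners=False)
```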

Fig. 3: The evenly-cascaded encoder structure. Feature maps gradually decrease in size. Each encoding layer has two streams of feature maps. The downsampled features from the previous layer generate modulation signals via convolution. The modulation signal is added to the downsampled features themselves to improve them. The high-level features are concatenated with the existing feature channels (red; the number of times a feature has been modulated is indicated in its superscript). At the end of the encoding stage (the layer in the middle), the network aggregates multiple levels of features. We make pose predictions using all levels of features.
Fig. 4: The encoder-decoder structure. In the encoding stage, feature maps gradually decrease in size, while in the decoding stage the feature maps expand in size. Each encoding layer has two streams of feature maps. The downsampled features from the previous layer generate modulation signals via convolution. The modulation signal is added to the downsampled features themselves to improve them. The high-level features are concatenated with the existing feature channels (red; the number of times a feature has been modulated is indicated in its superscript). At the end of the encoding stage (the layer in the middle), the network aggregates multiple levels of features. In the decoding stage, the highest-level features are merged back into the lower-level features as modulation. The corresponding encoding layer is also used as modulation (blue dashed lines). We make predictions of depth at multiple scales using different features. A coarse, backbone prediction is made using multiple levels of features. The backbone prediction is then upscaled and refined using modulated lower-level features.

Our transform of features is inspired by the celebrated cascade algorithm in wavelet analysis. The encoding stage [31] is analogous to the wavelet packet decomposition [6], which decomposes signals into two streams of low/high frequency coefficients. Each layer of our encoding stage contains two streams of features (Fig. 4). One stream adapts the low-level features from previous layers via residual learning [14]. The other stream generates a set of higher level features from these features. At the end of the encoding stage, the network possesses multiple levels of coarsened feature representations. Our pose prediction is made at this stage with these multi-level features. Our decoding stage is similar to the ‘merging’ operation in wavelet reconstruction. In each decoding layer, the highest level features, together with the corresponding features in the encoding layers are convolved and added back to the lower level features as modulation. At the end of the decoding stage, the network acquires a set of modulated high resolution low-level features for the final task. It is important to point out that all modulation signals are added to the lower-level features as is common in residual learning.

Our evenly-cascaded (EC) structure facilitates training by providing easily-trainable shortcuts in the architecture. The mapping from network input to output is decomposed into a series of progressively deeper, and therefore harder to train, functions. The leftmost pathway in Fig. 4 contains the easiest to train, lowest-level features, and is maintained throughout the whole network. This construction therefore alleviates the vanishing gradient problem in neural network training, and allows the network to selectively enhance and utilize multiple levels of features.
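The following PyTorch sketch shows one encoding layer under our reading of the description above; the channel counts, activation, and scaling factor are illustrative assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECEncodingLayer(nn.Module):
    """One evenly-cascaded encoding layer: downscale, modulate, and grow new features."""

    def __init__(self, in_channels, growth_channels, scale=0.75):
        super().__init__()
        self.scale = scale
        # modulation signal added back to the downscaled existing features
        self.modulate = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # new, higher-level feature stream appended to the existing channels
        self.grow = nn.Conv2d(in_channels, growth_channels, 3, padding=1)

    def forward(self, x):
        h = max(1, int(round(x.shape[2] * self.scale)))
        w = max(1, int(round(x.shape[3] * self.scale)))
        down = F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
        low = down + F.relu(self.modulate(down))   # residual-style improvement of existing features
        high = F.relu(self.grow(down))             # new higher-level feature stream
        return torch.cat([low, high], dim=1)       # concatenate the two streams
```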

Fig. 5: Qualitative results from our evaluation. The table entries from left to right: DVS input, ground truth optical flow, network output for flow, ground truth for depth, network output for depth. The event counts are overlaid in blue for better visualization. Examples were collected from sequences of the MVSEC [36] dataset (top to bottom): outdoor day 1, outdoor day 1, indoor flying 1, indoor flying 2, outdoor night 1, outdoor night 2, outdoor night 3. It can be seen that on the ‘night’ sequences the ground truth is occasionally missing due to Lidar limitations, but the pipeline performs reasonably well. Best viewed in color.

III-D Depth Predictions

In the decoding stage, we make predictions from features at different resolutions and levels (Fig. 4). Initially, both high and low-level coarse features are used to predict a backbone depth map. The depth map is then upsampled with bilinear interpolation for refinement. In the intermediate stages, high-level features, as well as features from the encoding layers, are merged with the low-level features to serve as modulation streams. The lower-level features, enhanced by the modulation streams, are used to estimate the prediction residue, which usually also consists of low-level structures. The residue is added to the backbone estimate to refine it. The final prediction map is therefore obtained through successive upsamplings and refinements.
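The coarse-to-fine refinement can be summarized as in the sketch below; the function and argument names are ours, and the residual heads stand in for whatever prediction layers the decoder actually uses.

```python
import torch.nn.functional as F

def refine_depth(backbone_depth, residual_heads, decoder_features):
    """Successively upsample a coarse depth map and add predicted residues.

    backbone_depth   : (B, 1, h0, w0) coarse prediction from the deepest features
    residual_heads   : list of modules, one per decoding layer, mapping features to a residue
    decoder_features : list of modulated lower-level feature maps, ordered coarse to fine
    """
    depth = backbone_depth
    for head, feats in zip(residual_heads, decoder_features):
        depth = F.interpolate(depth, size=feats.shape[2:],
                              mode='bilinear', align_corners=False)
        depth = depth + head(feats)    # the residue refines the upsampled backbone prediction
    return depth
```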

III-E Feature Decorrelation

Gradient descent training of neural networks can be challenging if the features are not properly conditioned. In neural networks, the feature channels are usually correlated due to the interplay of channels. The signal amplitudes can also differ due to the layers of transforms. Researchers have proposed normalization strategies [16, 30] to partially account for the scale inconsistency problem. We proceed one step further with a decorrelation algorithm to combat the feature collinearity problem. Straightforward decorrelation can be achieved by applying the inverse square root of the covariance matrix to the mean-subtracted features. However, for neural network training, calculating inverse square roots of matrices at each iteration is not only computationally expensive but also introduces instability. Here we propose to apply Denman-Beavers iterations [8] to decorrelate the feature channels in a simple, forward fashion. Given a symmetric positive definite covariance matrix $\Sigma$, Denman-Beavers iterations start with the initial values $Y_0 = \Sigma$, $Z_0 = I$. Then we iteratively compute $Y_{k+1} = \frac{1}{2}(Y_k + Z_k^{-1})$, $Z_{k+1} = \frac{1}{2}(Z_k + Y_k^{-1})$. We then have $Z_k \rightarrow \Sigma^{-\frac{1}{2}}$ [20]. In our implementation, we evenly divide the features into 16 groups, as proposed in group normalization [30], and reduce the correlation between the groups by performing a few (1-10) Denman-Beavers iterations. We notice that a few iterations lead to significantly faster convergence and better results.
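A minimal NumPy sketch of this decorrelation follows, assuming the 16 channel groups have already been pooled into a (samples × groups) matrix; the trace normalization and the small regularizer are our additions for numerical stability.

```python
import numpy as np

def denman_beavers_inv_sqrt(cov, num_iters=5, eps=1e-5):
    """Approximate the inverse square root of a covariance matrix.

    Denman-Beavers iterations: with Y_0 = A and Z_0 = I,
        Y_{k+1} = 0.5 * (Y_k + inv(Z_k)),  Z_{k+1} = 0.5 * (Z_k + inv(Y_k)),
    Y_k converges to A^(1/2) and Z_k to A^(-1/2).
    """
    n = cov.shape[0]
    a = cov + eps * np.eye(n)          # regularize for numerical stability
    tr = np.trace(a)
    y, z = a / tr, np.eye(n)           # normalize so the eigenvalues are close to 1
    for _ in range(num_iters):
        y, z = 0.5 * (y + np.linalg.inv(z)), 0.5 * (z + np.linalg.inv(y))
    return z / np.sqrt(tr)             # undo the normalization: (A/tr)^(-1/2) / sqrt(tr) = A^(-1/2)

def decorrelate_groups(feats, num_iters=5):
    """Whiten a (num_samples, num_groups) feature matrix so the groups become decorrelated."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(centered.shape[0] - 1, 1)
    return centered @ denman_beavers_inv_sqrt(cov, num_iters)
```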

III-F Non-local Smoothness Penalty

To combat the sparsity of the data, we utilize a sparse smoothness constraint that promotes non-local information propagation. The loss is applied to the first-order derivatives of the depth estimation over a non-local neighborhood, and we use a sparse penalty $|\cdot|^p$ with $p \leq 1$. The complexity of the loss is quadratic in the neighborhood size; acceleration techniques have been applied to reduce this cost [32].
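The sketch below only illustrates the idea of a sparse penalty on depth differences over a non-local neighborhood; the window size, the exponent p = 0.5, and the epsilon are assumed hyperparameters, and this is not the exact loss used in the paper.

```python
import torch

def sparse_nonlocal_smoothness(depth, window=2, p=0.5, eps=1e-6):
    """Sparse smoothness penalty over a (2*window+1)^2 neighborhood.

    depth : (B, 1, H, W) predicted (inverse) depth map
    Penalizes |D(x) - D(x + offset)|^p for every offset in the window;
    p < 1 makes the penalty sparsity-promoting. Boundary wrap-around
    from torch.roll is ignored for simplicity.
    """
    loss = 0.0
    count = 0
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            if dx == 0 and dy == 0:
                continue
            shifted = torch.roll(depth, shifts=(dy, dx), dims=(2, 3))
            loss = loss + ((depth - shifted).abs() + eps).pow(p).mean()
            count += 1
    return loss / count
```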

                        outdoor driving day    outdoor driving night    indoor flying 1    indoor flying 2    indoor flying 3
                        AEE    % Outlier       AEE    % Outlier         AEE    % Outlier   AEE    % Outlier   AEE    % Outlier
Ours (dense)            0.40   0.24            0.48   0.81              0.20   0.01        0.25   0.01        0.22   0.01
Ours (event-masked)     0.36   0.20            0.52   1.1               0.20   0.01        0.24   0.01        0.21   0.01
Ours (high event rate)  0.31   0.15            0.46   0.67              0.20   0.01        0.24   0.01        0.21   0.01
EV-FlowNet [38]         0.49   0.20            -      -                 1.03   2.20        1.72   15.10       1.53   11.90
SfMlearner [34]         0.53   0.89            0.59   1.12              0.39   0.13        0.50   0.39        0.45   0.29

TABLE I: Evaluation of the optical flow pipeline

III-G Data Representation

Fig. 6: A three-channel DVS data representation. The first channel represents the time image described in [24]. The second and third channels represent the per-pixel positive and negative event counts. Best viewed in color.

Event data consists of three dimensions: the pixel coordinates $(x, y)$ and the event timestamp. In addition, the DAVIS camera provides the event polarity, a binary value which distinguishes events generated by rising light intensity (positive polarity) from events generated by falling light intensity (negative polarity).

The 3D event data is projected onto the image plane and converted to a 3-channel image. An example of such an image can be seen in Fig. 6. Two channels of the image are the per-pixel counts of positive and negative events. The third channel is the time image used in [24]: each pixel contains the average timestamp of the events generated at that pixel. We argue that averaging the timestamps tolerates noise better, allowing our pipeline to work in low-light conditions and to handle cases of fast motion, where more recent events would otherwise overwrite the previous ones.
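A NumPy sketch of this representation follows; the channel ordering, the timestamp normalization to the start of the slice, and the function name are our choices.

```python
import numpy as np

def events_to_image(xs, ys, ts, ps, height, width):
    """Build a 3-channel event image: positive counts, negative counts, average timestamp.

    xs, ys : integer pixel coordinates of the events
    ts     : event timestamps
    ps     : event polarities, +1 or -1
    """
    img = np.zeros((3, height, width), dtype=np.float32)
    counts = np.zeros((height, width), dtype=np.float32)
    np.add.at(img[0], (ys, xs), (ps > 0).astype(np.float32))   # positive event counts
    np.add.at(img[1], (ys, xs), (ps < 0).astype(np.float32))   # negative event counts
    np.add.at(img[2], (ys, xs), ts - ts.min())                 # accumulate relative timestamps
    np.add.at(counts, (ys, xs), 1.0)
    img[2] = np.where(counts > 0, img[2] / np.maximum(counts, 1.0), 0.0)  # average timestamp per pixel
    return img
```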

IV Experimental Evaluation

The main contribution of our self-supervised learning framework lies in its ability to infer both dense optical flow and depth given only the sparse event data. We evaluate our work on the MVSEC [36] event camera dataset which, given a ground truth frequency of 20 Hz, contains over 40000 training images.

The MVSEC dataset is inspired by KITTI [13, 23], and it features 5 sequences recorded from a car on the street (2 during the day and 3 during the night), as well as 4 indoor sequences shot from a flying quadrotor. MVSEC was shot in a variety of lighting conditions and features low-light and high dynamic range frames which are often challenging for classical cameras. We find that the event-based camera allows our pipeline to perform equally well during both day and night.

For all experiments and tables, the network was trained on all frames except the outdoor day 1 sequence. We did not see significant signs of overfitting during training; nevertheless, all table entries labeled outdoor day correspond to the unseen outdoor day 1 sequence. All other entries aggregate the corresponding sequences.

Another important note is that the outdoor night sequences have occasional errors in the ground truth (see for example Fig. 5 last three rows, or Fig. 8). All incorrect frames were manually removed for the evaluation.

IV-A Qualitative Results

In addition to the quantitative evaluation, we present a number of samples for qualitative analysis in Fig. 5. The last three rows of the table show the night sequences, where the pipeline performs well even when only a few events are available. The third and fourth rows were captured indoors. There were relatively few indoor sequences, and it is possible that the quality of the output would increase given a larger dataset.

IV-B Quantitative Evaluation

IV-B1 Optical Flow

We evaluate our optical flow results in terms of the Average Endpoint Error ($AEE = \frac{1}{n}\sum_{i=1}^{n}\|\mathbf{f}^{est}_i - \mathbf{f}^{gt}_i\|_2$, with $\mathbf{f}^{est}$ and $\mathbf{f}^{gt}$ the estimated and ground truth flow values, and $n$ the number of events) and compare our results against the state-of-the-art optical flow method for event-based cameras: EV-FlowNet [38].

Because our network produces flow and depth values for every image pixel, our evaluation is not constrained to pixels which triggered a DVS event. Still, for consistency, we report both numbers for each of our experiments (the dense and event-masked rows in Table I, where the latter has errors computed only on the pixels with at least one event). In all cases, the portion of the frame with no ground truth values was masked off. Similar to KITTI and EV-FlowNet, we report the percentage of outliers: values with an endpoint error of more than 3 pixels and more than 5% of the flow vector magnitude.
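For reference, the two metrics can be computed as in this sketch; the construction of the validity mask (ground truth present, and optionally at least one event) is assumed to be done by the caller.

```python
import numpy as np

def flow_metrics(flow_est, flow_gt, valid_mask):
    """Average endpoint error and outlier percentage over valid pixels.

    flow_est, flow_gt : (H, W, 2) arrays
    valid_mask        : (H, W) boolean mask
    """
    err = np.linalg.norm(flow_est - flow_gt, axis=-1)[valid_mask]
    mag = np.linalg.norm(flow_gt, axis=-1)[valid_mask]
    aee = err.mean()
    # KITTI-style outliers: endpoint error above 3 px and above 5% of the flow magnitude
    outliers = 100.0 * np.mean((err > 3.0) & (err > 0.05 * mag))
    return aee, outliers
```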

To compare against EV-FlowNet, we account for the difference in the frame rates (EV-FlowNet uses the frame rate of the DAVIS classical frames) by downscaling our optical flow. Our results are presented in the Table I.

Our results show that the predicted flow is always very close to the ground truth. The results are typically better for the experiments with event masks, except for the outdoor night sequences. One possible explanation is that these sequences are much noisier, with events generated not only on the edges, which leads to suboptimal masking.

We also provide baseline results by training and evaluating the state-of-the-art SfMlearner [34] on our data (the last row of Table I).

IV-B2 Performance Versus Event Rate

Fig. 7: The Average Endpoint Error (blue) and the number of pixels with at least one event (red) for the first 1500 frames of ‘outdoor_day1’ sequence of the MVSEC [36] dataset. Both plots are normalized so that the mean value is 0.5 for easier comparison.

One of our experiments was to investigate how well the neural network performs on the sparse event data in relation to the number of events in the scene.

Fig. 7 shows that the event rate (the number of pixels with at least one event) is inversely correlated with the endpoint error. That is, the more events are available to the network, the better the quality of the predicted flow. This is an important observation, since the event rate is known prior to inference, and this property could be used for late sensor fusion, when several systems provide optical flow.

Motivated by this, we provide an additional row in Table I (Ours (high event rate)) and report our error metrics once again, this time only for the frames with a higher than average number of event pixels across the dataset.

IV-B3 Depth Evaluation

Since there are currently no unsupervised learning-based methods for event-based depth estimation, we provide the classical scale-invariant depth metrics used in many works such as [10], [34], [12]: Accuracy: percentage of $\hat{d}_i$ such that $\max(\frac{\hat{d}_i}{d_i}, \frac{d_i}{\hat{d}_i}) = \delta < thr$ for $thr \in \{1.25, 1.25^2, 1.25^3\}$; SILog: $\frac{1}{n}\sum_i e_i^2 - \frac{1}{n^2}(\sum_i e_i)^2$, with $e_i = \log\hat{d}_i - \log d_i$; Absolute Relative Difference: $\frac{1}{n}\sum_i \frac{|\hat{d}_i - d_i|}{d_i}$; Logarithmic RMSE: $\sqrt{\frac{1}{n}\sum_i (\log\hat{d}_i - \log d_i)^2}$.
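These metrics can be computed as in the sketch below (over 1-D arrays of valid, positive depths); the SILog variant with λ = 1 is assumed.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Classical scale-invariant depth metrics over valid pixels."""
    ratio = np.maximum(pred / gt, gt / pred)
    accuracy = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]   # δ < 1.25, 1.25², 1.25³
    abs_rel = np.mean(np.abs(pred - gt) / gt)                    # absolute relative difference
    log_diff = np.log(pred) - np.log(gt)
    rmse_log = np.sqrt(np.mean(log_diff ** 2))                   # logarithmic RMSE
    silog = np.mean(log_diff ** 2) - np.mean(log_diff) ** 2      # scale-invariant log error (λ = 1)
    return accuracy, abs_rel, rmse_log, silog
```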

Our results are presented in Table II for both event count-masked depth values and full, dense depth.

Applying an event mask during the evaluation increases accuracy for all scenes; this is expected, as the inference is indeed more accurate on the pixels with event data. In contrast, the error metrics increase on the outdoor scenes and decrease on the indoor scenes. This is probably due to the higher variation of the outdoor scenes and the faster motion of the cars.


                        mask   Abs Rel   RMSE log   SILog   δ<1.25   δ<1.25²   δ<1.25³

outdoor driving day      -     0.29      0.34       0.12    0.80     0.91      0.96
outdoor driving night    -     0.34      0.38       0.15    0.67     0.85      0.93
indoor flying            -     0.28      0.29       0.11    0.75     0.91      0.96

outdoor driving day      ✓     0.33      0.36       0.14    0.97     0.98      0.99
outdoor driving night    ✓     0.39      0.42       0.18    0.95     0.98      0.99
indoor flying            ✓     0.22      0.25       0.11    0.98     0.99      1.0

TABLE II: Evaluation of the depth estimation pipeline. Error metrics: Abs Rel, RMSE log, SILog; accuracy metrics: δ < 1.25, δ < 1.25², δ < 1.25³.

IV-B4 Failure Cases

One cause of failure is the lack of motion or relative motion, which happens often on the road. The resulting lack of events leads to the failure displayed in Fig. 8, where the static car is completely undetected. This problem can be alleviated by masking off the regions which contain few events.

Fig. 8: A common failure case: a non-moving car (visible in the middle ground truth inverse depth image) is not visible to the DAVIS camera (left image), which prevents the network from inferring its optical flow or depth correctly (the right image is the inferred inverse depth). In contrast, the moving car on the left side of the road is clearly visible in the event space and its depth inference is correct, but due to the Lidar limitations the depth ground truth is completely missing. This frame is taken from the ‘outdoor_night 1’ MVSEC sequence.

Another failure case is related to the inner workings of the pipeline itself. The internal smoothing results in blurry object boundaries, an example of which is shown in Fig. 9.

Fig. 9: Another common failure case: the smoothing constraint causes the flow at the edges of an object to be incorrect. Left image - ground truth flow, middle image - predicted flow, right image - the per-pixel endpoint error (darker is better). The blue lines on the flow images are overlaid event counts. This frame is taken from the ‘indoor_flight 2’ MVSEC sequence.

In contrast, the pipeline behaves very stably in the presence of high-speed motion. In Fig. 10 the fast motion results in event counts and timestamps being overwritten by more recent events, but the result is still close to the ground truth.

Fig. 10: A case of fast motion: even though the recent events overwrite older pixels, the network is still capable of predicting accurate flow. Ground truth is in the middle, the inference image is on the right. This frame is taken from the ‘indoor_flight 2’ MVSEC sequence.

V Conclusion

We have presented a novel pipeline for generating dense optical flow and depth from sparse event camera data. We also have shown experimentally that our new neural network architecture using multi-level features improves upon existing work. Future work will investigate the estimation of moving objects as part of the pipeline, using event cloud representations instead of accumulated events, and the use of space-time frequency representations in the learning.

References

  • [1] Francisco Barranco, Cornelia Fermüller, and Yiannis Aloimonos. Contour motion estimation for asynchronous event-driven cameras. Proceedings of the IEEE, 102(10):1537–1556, 2014.
  • [2] Francisco Barranco, Cornelia Fermuller, and Yiannis Aloimonos. Bio-inspired motion estimation with event-driven sensors. In International Work-Conference on Artificial Neural Networks, pages 309–321. Springer, 2015.
  • [3] R. Benosman, C. Clercq, X. Lagorce, Sio-Hoi Ieng, and C. Bartolozzi. Event-based visual flow. Neural Networks and Learning Systems, IEEE Transactions on, 25(2):407–417, 2014.
  • [4] Ryad Benosman, Sio-Hoi Ieng, Charles Clercq, Chiara Bartolozzi, and Mandyam Srinivasan. Asynchronous frameless event-based optical flow. Neural Netw., 27:32 – 37, March 2012.
  • [5] François Chaumette and S. Hutchinson. Visual servo control, Part I: Basic approaches. IEEE Robotics and Automation Magazine, 13(4):82–90, 2006.
  • [6] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2):713–718, March 1992.
  • [7] T. Delbruck. Frame-free dynamic digital vision. In Proceedings of Intl. Symposium on Secure-Life Electronics, Advanced Electronics for Quality Life and Society, Tokyo, Japan, pages 21–26, March 2008.
  • [8] Eugene D. Denman and Alex N. Beavers, Jr. The matrix sign function and computations in systems. Appl. Math. Comput., 2(1):63–94, January 1976.
  • [9] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [10] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
  • [11] Guillermo Gallego, Henri Rebecq, and Davide Scaramuzza. A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [12] Ravi Garg, Vijay Kumar B. G, and Ian D. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. CoRR, abs/1603.04992, 2016.
  • [13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, June 2012.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • [15] Henri Rebecq, Guillermo Gallego, and Davide Scaramuzza. EMVS: Event-based multi-view stereo. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 63.1–63.11. BMVA Press, September 2016.
  • [16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
  • [17] Hanme Kim, Stefan Leutenegger, and Andrew J. Davison. Real-time 3d reconstruction and 6-dof tracking with an event camera. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 349–364, Cham, 2016. Springer International Publishing.
  • [18] Jurgen Kogler, Martin Humenberger, and Christoph Sulzbachner. Event-based stereo matching approaches for frameless address event stereo data. In George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Song Wang, Kim Kyungnam, Bedrich Benes, Kenneth Moreland, Christoph Borst, Stephen DiVerdi, Chiang Yi-Jen, and Jiang Ming, editors, Advances in Visual Computing, pages 674–685, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
  • [19] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. Solid-State Circuits, IEEE Journal of, 43(2):566–576, 2008.
  • [20] Tsung-Yu Lin and Subhransu Maji. Improved bilinear pooling with cnns. CoRR, abs/1707.06772, 2017.
  • [21] Min Liu and Tobi Delbruck. Block-matching optical flow for dynamic vision sensors: Algorithm and fpga implementation. In Circuits and Systems (ISCAS), 2017 IEEE International Symposium on, pages 1–4. IEEE, 2017.
  • [22] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. CoRR, abs/1802.05522, 2018.
  • [23] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3061–3070, June 2015.
  • [24] Anton Mitrokhin, Cornelia Fermuller, Chethan Parameshwara, and Yiannis Aloimonos. Event-based moving object detection and tracking. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), 2018.
  • [25] Elias Mueggler, Christian Forster, Nathan Baumli, Guillermo Gallego, and Davide Scaramuzza. Lifetime estimation of events from dynamic vision sensors. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 4874–4881. IEEE, 2015.
  • [26] G. Orchard and R. Etienne-Cummings. Bioinspired visual motion estimation. Proceedings of the IEEE, 102(10):1520–1536, Oct 2014.
  • [27] P. Rogister, R. Benosman, S. Ieng, P. Lichtsteiner, and T. Delbruck. Asynchronous event-based binocular stereo matching. IEEE Transactions on Neural Networks and Learning Systems, 23(2):347–353, Feb 2012.
  • [28] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015.
  • [29] Stephan Tschechne, Tobias Brosch, Roman Sailer, Nora von Egloffstein, Luma Issa Abdul-Kreem, and Heiko Neumann. On event-based motion detection and integration. In Proceedings of the 8th International Conference on Bioinspired Information and Communications Technologies, pages 298–305. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2014.
  • [30] Yuxin Wu and Kaiming He. Group normalization. CoRR, abs/1803.08494, 2018.
  • [31] Chengxi Ye, Chinmaya Devaraj, Michael Maynord, Cornelia Fermüller, and Yiannis Aloimonos. Evenly cascaded convolutional networks. CoRR, abs/1807.00456, 2018.
  • [32] Chengxi Ye, Dacheng Tao, Mingli Song, David W. Jacobs, and Min Wu. Sparse norm filtering. CoRR, abs/1305.3971, 2013.
  • [33] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. CoRR, abs/1803.02276, 2018.
  • [34] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
  • [35] Yi Zhou, Guillermo Gallego, Henri Rebecq, Laurent Kneip, Hongdong Li, and Davide Scaramuzza. Semi-dense 3d reconstruction with a stereo event camera. European Conference on Computer Vision(ECCV), 2018.
  • [36] A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar, and K. Daniilidis. The multivehicle stereo event camera dataset: An event camera dataset for 3d perception. IEEE Robotics and Automation Letters, 3(3):2032–2039, July 2018.
  • [37] Alex Zihao Zhu, Yibo Chen, and Kostas Daniilidis. Realtime time synchronized event-based stereo. European Conference on Computer Vision(ECCV), 2018.
  • [38] Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Ev-flownet: Self-supervised optical flow estimation for event-based cameras. Robotics: Science and Systems, 2018.