Memory Warps for Learning Long-Term Online Video Representations

Tuan-Hung Vu, Wongun Choi, Samuel Schulter, Manmohan Chandraker
INRIA/ENS WILLOW, ISEE, NEC-Labs, UC San Diego

This paper proposes a novel memory-based online video representation that is efficient, accurate and predictive. This is in contrast to prior works that often rely on computationally heavy 3D convolutions, ignore actual motion when aligning features over time, or operate in an off-line mode to utilize future frames. In particular, our memory (i) holds the feature representation, (ii) is spatially warped over time to compensate for observer and scene motions, (iii) can carry long-term information, and (iv) enables predicting feature representations in future frames. By exploring a variant that operates at multiple temporal scales, we efficiently learn across even longer time horizons. We apply our online framework to object detection in videos, obtaining a large times speed-up and losing only mAP on ImageNet-VID dataset, compared to prior works that even use future frames. Finally, we demonstrate the predictive property of our representation in two novel detection setups, where features are propagated over time to (i) significantly enhance a real-time detector by more than mAP in a multi-threaded online setup and to (ii) anticipate objects in future frames.



The work was conducted as part of Tuan-Hung Vu’s internship at NEC Labs America.

1 Introduction

Motion is a crucial intermediary for human visual perception to learn about its environment and relate to it [1, 2]. By encapsulating motion cues, video represents a rich medium for computer vision to understand and analyze the visual world. While the advent of convolutional neural networks (CNNs) has led to rapid improvements in learning spatial features, a persistent challenge remains to learn efficient representations that derive significant benefits from long-term temporal information in videos.

In this paper, we learn online video representations that incorporate multi-scale information on longer time horizons and design practical frameworks that achieve accuracy, efficiency and predictive power. Analogous to the Gestalt principle of common fate [3, 4], we hypothesize that temporal coherence by accounting for motion across frames allows learning powerful representations while achieving greater invariance to blur, lighting, pose and occlusions. While our frameworks are applicable to diverse problems, we demonstrate them with the specific example of object detection in videos.

Figure 1: A schematic comparison between a typical per-frame method (left) and the proposed video representation learning approach (right) for online object detection in videos. We propose a multi-scale memory that efficiently aggregates image evidence over longer time horizons and also accounts for camera and object motion by feature warping, which enables learning better representations that lead to higher accuracy.

In recent years, object detection in videos has attracted significant interest with benchmarks such as ImageNet VID [5] or Youtube-8M [6]. A popular approach has been to use detected bounding boxes computed independently for each frame using a strong CNN-based model [7, 8, 9, 10] and to do temporal reasoning through tracking [11], re-scoring detections [12] and performing sequential non-maximum suppression [13]. While such methods improve over per-frame baselines, we explore the benefits of leveraging temporal information to learn better underlying feature representations. A few recent works temporally aggregate features to improve representation power [12, 14], but use a fixed set of nearby frames and do not maintain causality or efficiency. In contrast, we propose a video representation that composes information across time in an online fashion (see Figure 1), which is not only faster, but also enables predictive applications.

In particular, Section 3.1 proposes a novel network structure, termed MemNet, that holds a memory of the feature representation, which is updated at every frame based on image observations solely from the past and warped from one frame to the next to account for observer and scene motions. We use a displacement field for warping similar to [14], but encoding memory allows retaining information from further in the past, while requiring only a single warp computation per frame. This is times faster than [14], which requires as many warps as the number of temporally aggregated frames (footnote 1: to put this speed-up in context for object detection, the improvement from Faster R-CNN [10] to R-FCN [15] is about times). Further, in Section 3.2, we propose a hierarchical network structure, termed ClockNet due to its similarity to [16, 17], which extends MemNet by operating at multiple temporal scales. This allows efficiently leveraging information from even longer temporal horizons, which improves the representation power, as demonstrated by our experiments.

We apply our video representation to object detection in Section 4.1 and evaluate it on the ImageNet VID [5] data set in Section 5.1. Our proposed architectures improve over per-frame baselines by up to 2.2% in mean average precision (mAP) (footnote 2: to put this accuracy gain in context for object detection, the gains from hard example mining [18] or hard positive generation [19] are around mAP). We achieve state-of-the-art results close to flow-guided feature aggregation (FGFA) [14] without having access to future frames and with considerably lower runtime, while outperforming a causal variant of FGFA.

A key benefit of the online nature of our video representation is that it imparts predictive abilities, which enables novel applications. First, in Section 4.2, we enhance the accuracy of an online real-time detector, by leveraging a stronger but less efficient detector in another thread. While the strong detector lags due to higher latency, our memory warping enables propagating and aligning its representation with the real-time detector, boosting the accuracy of the latter by more than mAP, with no impact on speed or online operation (see Section 5.3). This is non-trivial, since parallelizing standard detectors in an online setup is not straightforward. Next, our predictive warping of video representations enables anticipating features in future frames, which allows solving visual tasks without actually observing future images. Sections 4.3 and 5.2 demonstrate this for the novel application of anticipating objects in future frames.

Finally, we note that our contributions are architecture-independent. The speed, accuracy and predictive benefits of our online representation are available for any detection method on video inputs.

2 Related Work

Learning representations for videos has been a long-standing goal in computer vision and many directions have been explored. Donahue et al. [20] or Srivastava et al. [21] rely on recurrent neural networks (RNNs) like LSTMs [22] to propagate feature representation from still frames over time. However, unlike our approach, features are propagated without explicit knowledge of motion in the scene, as we elaborate on in Section 3.3. 3D CNNs provide more freedom to learn motion-specific kernels (although motion is also not explicitly used) and were successfully used in [23, 24] for tasks like video captioning or action recognition, but come with considerably more parameters to learn and typically more computational costs. Recent works have also considered unsupervised learning with pretext tasks [25], however, we focus on efficient learning of supervised representations.

Two-stream architectures like [26, 27, 28] combine features extracted from both images and motion (optical flow) to boost the representational power. While 3D CNNs learn motion-specific features implicitly, two-stream architectures explicitly take optical flow as input. In both cases, however, this information is not used to transform features over time to compensate for observer and scene motions.

The recently proposed flow-guided feature aggregation framework (FGFA) [29, 14], on the other hand, explicitly warps convolutional feature maps between frames for better alignment when aggregating them. The warping function is triggered by a learned displacement field initialized with FlowNet [30]. However, FGFA [14] uses a fixed temporal window of nearby frames, requires as many warp computations as the length of the window, and compromises causality, i.e., integrates features from future frames. In contrast, we introduce a feature memory that is warped from one frame to another, saving computation time and allowing for a longer temporal horizon without looking into future frames.

Our temporal multi-scale variant, ClockNet, is inspired by [16, 17]. While sharing the idea of efficiently extending the temporal horizon, we also motivate and experimentally demonstrate the multi-scale aspect of this architecture. LSTMs [22] are also designed with a similar goal of a long-term representation for temporal data. The feature warping and propagation proposed in this work complements [16, 17, 22] with an effective way of dealing with spatial memory maps that describe real video data where observer and the scene itself can be in motion.

We showcase our ideas for object detection in videos, which has recently attracted lots of interest, partly due to the ImageNet VID challenge [5]. Besides [12, 14] that work on the feature level, most recent approaches for detection in video leverage the temporal data on a higher level, e.g., by tracking objects or specifically designed post-processing mechanisms [31, 11], which are orthogonal to our contributions. Besides object detection, the proposed feature can also be used for other tasks operating on videos like semantic segmentation [32, 17] or action recognition [27, 28, 26]. Visual anticipation has also been shown for segmentation [33] and actions [34], but we differ in demonstrating for supervised object detection and in explicitly accounting for motion in the learned representation.

Figure 2: An illustration of the proposed MemNet running for three frames. At every time step, features from the current frame (blue) are aggregated with the memory of the previous frame (purple), either by simple averaging or a learned adaptive weighting. Then, the memory is warped via bilinear sampling based on a learned displacement field. The detection output in every frame is computed from the current memory.

3 Feature propagation via memory networks

The goal of this paper is to improve the feature representation for objects in videos, by leveraging temporal information and motion. Exploiting past frames can also help predictions in the current frame when occlusions or motion blur distorts image evidence, cf. [14]. We propose to continuously aggregate and update features over time to provide a stable and powerful representation of the scene captured by the video, which is illustrated in Figure 2.

3.1 Aggregating features over time

Given a single image $I \in \mathbb{R}^{h_I \times w_I \times 3}$, a convolutional neural network (CNN) with parameters $\Theta_F$ first extracts a feature map $F \in \mathbb{R}^{h_F \times w_F \times d_F}$, where $d_F$ is the number of feature maps and the spatial resolution of the features is typically lower than that of the image, i.e., $h_F < h_I$ and $w_F < w_I$. In the following, we show how these single image feature representations are effectively aggregated over time. While we use a single feature map per image for ease of presentation, note that we can easily handle multiple feature maps at different resolutions to handle scale variations, which was shown to be useful in [7, 35].

Tracking features over time:

In every frame $t$, we hold a feature map $M_t$, with the same dimensions as the image feature map, that acts as a memory of the feature representation of the video. Since the scene is dynamic and the camera is moving, the same objects will appear at different locations of the image plane in frames $t-1$ and $t$. In order for the memory of the past frame $M_{t-1}$ to benefit detection in the current frame $t$, $M_{t-1}$ needs to be transformed according to the scene dynamics. Similar to [14], we use bilinear sampling to implement this transformation,

$$\hat{M}_t = \phi\left(M_{t-1}, D_{(t,t-1)}\right), \tag{1}$$

where $\phi(\cdot)$ is the bilinear sampling function and $D_{(t,t-1)}$ is a displacement (or flow) field between frames $t$ and $t-1$, which is estimated by a CNN with parameters $\Theta_D$. This CNN is a pre-trained FlowNet [30], which takes images $I_{t-1}$ and $I_t$ as input and predicts the displacement, but we fine-tune the parameters $\Theta_D$ for the task at hand. Note that for fast computation of the displacement field, we feed FlowNet with half-resolution images and up-scale the displacement field. Also note that in the absence of ground truth data for the displacement field, this CNN predicts displacements suitable for the task at hand, which is demonstrated in Section 5.1.
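As a rough illustration of the warping step described above, the following NumPy sketch samples a memory map bilinearly at displaced positions. The function name `bilinear_warp`, the (dx, dy) channel order and the clamping at the image border are assumptions made for this example and are not taken from the paper or from [14]; in the actual model the displacement field is predicted by the fine-tuned FlowNet.

```python
import numpy as np

def bilinear_warp(memory, flow):
    """Warp a feature map with a per-pixel displacement field.

    memory: (H, W, C) feature map of the previous frame.
    flow:   (H, W, 2) displacements (dx, dy); output pixel (y, x) is
            sampled from the previous memory at (y + dy, x + dx).
    """
    h, w, _ = memory.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling positions, clamped to the valid area (assumed border rule).
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets, broadcast over the channel axis.
    wx = (sx - x0)[..., None]
    wy = (sy - y0)[..., None]
    top = (1 - wx) * memory[y0, x0] + wx * memory[y0, x1]
    bottom = (1 - wx) * memory[y1, x0] + wx * memory[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Because the output is a weighted sum of four neighbouring memory entries, it is differentiable with respect to both the memory and the displacements, which is what allows end-to-end training through the warp.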

Updating with image evidence:

After having transformed the memory to the current frame $t$, i.e., $\hat{M}_t$, we need to aggregate the newly available image evidence $F_t$ extracted by the feature CNN into the memory,

$$M_t = \psi\left(\hat{M}_t, F_t\right), \tag{2}$$

which defines one step of the proposed MemNet. We implement (and experimentally evaluate) two variants of the aggregation function $\psi(\cdot)$. The first is a parameter-free combination that leads to exponential decay of memory over time,

$$\psi\left(\hat{M}_t, F_t\right) = \tfrac{1}{2}\left(\hat{M}_t + F_t\right), \tag{3}$$

and the second is a weighted combination of memory and image features,

$$\psi\left(\hat{M}_t, F_t\right) = \alpha^M_t \odot \hat{M}_t + \alpha^F_t \odot F_t, \tag{4}$$

with per-pixel weights $\alpha^M_t, \alpha^F_t \in [0, 1]$ and $\alpha^M_t + \alpha^F_t = 1$. The weights are computed by small CNNs with parameters $\Theta_{\alpha^M}$ and $\Theta_{\alpha^F}$ operating on $\hat{M}_t$ and $F_t$, respectively, and the constraint $\alpha^M_t + \alpha^F_t = 1$ is always satisfied by sending the concatenated output of the CNNs through a per-pixel softmax function. The parameters of the weight-CNNs are automatically learned together with the rest of the network without any additional supervision. In the first frame $t = 1$, we simply assign the memory to be the feature representation of the image, $M_1 = F_1$.
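The two aggregation variants can be sketched in a few lines of NumPy. This is only an illustration: the function names are hypothetical, and the `logit_m` and `logit_f` arguments stand in for the outputs of the small weight CNNs, which are not modelled here. The loop at the end shows why plain averaging implies exponential decay: with no motion, each step halves the contribution of everything already in memory.

```python
import numpy as np

def aggregate_average(warped_memory, features):
    """Parameter-free variant: average warped memory and new features."""
    return 0.5 * (warped_memory + features)

def aggregate_weighted(warped_memory, features, logit_m, logit_f):
    """Weighted variant with per-pixel weights that sum to one.

    logit_m, logit_f: (H, W, 1) arrays, placeholders for the outputs
    of the weight CNNs (an assumption of this sketch).
    """
    logits = np.concatenate([logit_m, logit_f], axis=-1)   # (H, W, 2)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)         # per-pixel softmax
    alpha_m, alpha_f = weights[..., :1], weights[..., 1:]
    return alpha_m * warped_memory + alpha_f * features

# Exponential decay under averaging: frame i contributes the one-hot
# vector e_i, so the final memory lists the weight of every frame.
frames = np.eye(4)
memory = frames[0]                 # first frame: memory equals the features
for feat in frames[1:]:
    memory = aggregate_average(memory, feat)
print(memory)                      # weights 1/8, 1/8, 1/4, 1/2 for frames 1..4
```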

Training MemNet:

Training the proposed video representation requires a supervisory signal from a task module that is put on top of the memory features . In general, the task module can be anything, even an unsupervised task like predicting future frames [36]. In this work we explore object detection in videos, where the supervisory signal comes from a combination of object localization and classification loss functions, see Section 4.1.

All parts of our representation can be trained end-to-end. Since bilinear sampling and the grid generation of the warping module are both differentiable [37], we can back-propagate gradients over time to previous frames, to the image feature extractor, as well as to the FlowNet generating the displacement fields.

While the network architecture in theory allows gradients to flow over the memory warping module to learn a good feature propagation, it also opens a shortcut for minimizing the loss because image evidence is available at every frame. While for some tasks past information is truly essential for prediction in the present, for several tasks the image of the current frame already provides most evidence for a good prediction (or at least a signal to minimize the loss). To encourage the network to learn a good feature propagation module, we randomly drop image evidence with probability at frame , which we found to improve results by a few percentage points.
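A minimal sketch of this training-time regularization is given below. It rests on two assumptions that the text does not spell out: "dropping image evidence" is modelled as skipping the aggregation so that the memory consists of the warped memory alone, and the drop probability is exposed as a hypothetical parameter `drop_prob` rather than a specific value.

```python
import numpy as np

def update_memory(warped_memory, features, drop_prob, rng, training=True):
    """One memory update with random dropping of the image evidence.

    drop_prob is a hypothetical parameter; when the evidence is dropped,
    the prediction has to rely on the propagated memory alone.
    """
    if training and rng.random() < drop_prob:
        return warped_memory                      # image evidence dropped
    return 0.5 * (warped_memory + features)       # averaging aggregation

rng = np.random.default_rng(0)
warped = np.ones((2, 2))
feats = np.zeros((2, 2))
always_dropped = update_memory(warped, feats, drop_prob=1.0, rng=rng)
never_dropped = update_memory(warped, feats, drop_prob=0.0, rng=rng)
```

At test time (`training=False`) the image evidence is always used, so the regularization only shapes what the propagation module learns.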



Figure 3: Our ClockNet extends the MemNet by adding multiple time axes with increasing time scales to aggregate more information from further back in time. Each additional time axis skips frames. We only illustrate two time scales to avoid clutter.

3.2 Extending the temporal scale

The basic MemNet operates on just a single temporal scale, which has limited capability to leverage information at a larger temporal horizon. While, in theory, information from the whole video sequence is contained in the feature representation of the current frame , this portion can be vanishingly small, particularly for the aggregation function relying on the exponential decay.

We thus propose to use a clock-work structure similar to [16, 17] that operates on multiple temporal scales, which we denote ClockNet and illustrate in Figure 3. Formally, instead of having a single memory feature map, we have memories at frame with , each of them operating at different rates. In our implementation, we update memory every frames with new image evidence, although other schedules are also possible. Note that when , the basic MemNet is obtained.

In order to exchange information across the different time scales , we aggregate all memory maps at a single frame by simply averaging them, i.e., . As with the feature map aggregation in the basic MemNet, different strategies for combining feature maps are possible. We chose the simpler parameter-free averaging, as a more complex learning-based weighting scheme did not show any performance gains. The aggregated memory can then be used as input to any task-specific modules.
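The multi-rate update can be illustrated with a small sketch. To keep it short, warping is omitted (equivalent to assuming a static scene), the per-memory update uses the averaging aggregation, and the update intervals are passed in as a hypothetical `strides` argument rather than fixed to a particular schedule.

```python
import numpy as np

def clocknet_step(memories, features, t, strides):
    """Update each memory at its own rate, then average all memories.

    memories: list with one entry per time scale (None before frame 0).
    strides:  update interval of each memory (illustrative values).
    Warping between frames is omitted in this sketch.
    """
    for k, stride in enumerate(strides):
        if memories[k] is None:
            memories[k] = features.copy()              # initialise with first frame
        elif t % stride == 0:
            memories[k] = 0.5 * (memories[k] + features)
    return np.mean(memories, axis=0)                   # aggregated memory

memories = [None, None]
outputs = []
for t, value in enumerate([8.0, 0.0, 0.0, 0.0, 0.0]):
    out = clocknet_step(memories, np.array([value]), t, strides=(1, 4))
    outputs.append(float(out[0]))
# outputs == [8.0, 6.0, 5.0, 4.5, 2.25]
```

In this toy run the fast memory has decayed to 0.5 by frame 4 while the slow memory still holds 4.0, which is the intended effect: slower time axes retain older evidence for longer.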



Figure 4: Relation between standard convolutional recurrent neural networks (left) and the proposed video representation (right).

3.3 Discussion

Our proposed video representations have a simple and intuitive structure, can be trained end-to-end and fulfill the basic requirements for a fast and causal system that can be applied to videos in any real-world application. In contrast to FGFA [14], the proposed model does not look at future frames and is not limited to a specific temporal horizon in the past; instead, it can carry information from the whole (past) sequence in its memory. An even longer temporal horizon is utilized by the ClockNet architecture.

There also exists a relation to convolutional recurrent neural networks (cRNN) [38, 39], however, with one crucial difference. While cRNNs keep their hidden memory fixed across spatial dimensions (), our model enables the memory to be spatially aligned with observer and scene motion in the actual video content (), see Figure 4. While our aggregation function for new input and previous hidden states is simple, we did not observe any improvements for our particular applications with more complex architectures like LSTM [22] or GRU [40].

4 Detection, Propagation and Anticipation in Videos

To demonstrate the benefits of propagating features over time with the proposed MemNet and ClockNet, we show its impact for three practical applications.

4.1 Object detection in videos

While we can use the proposed memory features from Section 3 for any downstream task, in this paper, we focus on object detection in videos. Modern object detectors like Faster-RCNN [10], R-FCN [15], SSD [9] or RetinaNet [8] all have a similar high-level structure in the sense that they all rely on a convolutional neural network to extract features from a single image. The detection-specific modules applied on top of define the differences between the detectors, e.g., proposal-based [10, 15] or proposal-free [9, 8], making an interface between one generic module and detection-specific modules. Our proposed MemNet and ClockNet operate on and compute a novel feature representation , making our video representation applicable to all of these detectors. In this paper, we pick R-FCN [15] because it has shown a good trade-off between accuracy and speed, with a publicly available code base.

Given a representation of a video sequence at frame , the object detector first computes object proposals with a region proposal network (RPN) as proposed in [10]. Object proposals define potential locations of objects of interest (independent of the actual category) and reduce the search space for the final classification stage. Each proposal is then classified into one of categories and the corresponding proposal location is further refined. In contrast to Faster-RCNN [10], the per-proposal computation costs in R-FCN are minimal by using position-sensitive ROI pooling. This special type of ROI pooling is applied on the output of the region classification network (RCN).

Figure 5: In a multi-GPU setup, a fast but weak object detector (R-FCN-18, green blocks) leverages the features of a strong but slow object detector (ClockNet-101, red blocks), see Section 4.2. The width of blocks represent computation time. At a frame , strong features from ClockNet-101 are only available from , but efficiently warped into frame with our propagation module. The warped features boost the representational power of R-FCN-18 significantly, without increasing latency of the real-time system.

4.2 Real-time detection by propagating strong features

Assuming an input stream capturing images at 20 frames-per-second (FPS), we ideally want an object detector that can process one image in less than 50 ms to avoid latency in the output. One easy option to speed-up a modern object detector is to use a more light-weight feature extraction CNN, e.g., by using ResNet-18 instead of ResNet-101. Note that this is a viable option for any detection framework, e.g., Faster-RCNN [10], R-FCN [15], YOLO [41, 42] or SSD [9]. However, accuracy will decrease. Here, we explore another option to speed-up a modern object detector. Instead of using a single model, we demonstrate how to exploit two models with complementary properties running simultaneously (but asynchronously) on two threads (two GPUs) to achieve both speed and accuracy, using our feature propagation.

We run a fast detector, R-FCN-18 (i.e., R-FCN with ResNet-18 [43]) in one thread and a slower but also stronger detector, ClockNet-101, in the other thread. R-FCN-18 runs at the required frame rate and can provide output for every frame, however at a lower quality than ClockNet-101 could do if no time requirements existed. The main problem with the strong object detector is that it will always have some delay (or latency) to produce an output. If is too large for a practical system, the strong detector is not usable. It is important to note that achieving a speed-up with two GPUs is not trivial in a real-time setting. For the offline case it is easy to distribute computation of different images on multiple GPUs. However, this is not an option for streaming data. In Section 5.3, we still empirically compare with two alternative baselines that also leverage two GPUs.

With our design, on the other hand, we can still leverage the strong features by making up for the delay via feature propagation. We compute the displacement field between frame and and warp the strong features into the current frame , where the fast object detector has already computed features , see Figure 5. We boost the representational power of R-FCN-18 by combining the feature maps. Again, we take the average of both features (the dimensionality is the same), but more advanced aggregation schemes are possible. We experimentally evaluate this application in Section 5.3.

4.3 Anticipating features

Another application of the proposed feature propagation is future prediction or anticipation. Features from the current frame are propagated to a future frame , where the task network is applied to make predictions. In the previous application of Section 4.2, we utilize feature propagation over several frames, but the displacement fields are still computed from image evidence, similar to [29]. For a true visual anticipation, however, future images are not available.

We propose to extrapolate the displacement fields into future frames and use them to propagate the feature (or memory) maps. For demonstration, we use a simple extrapolation technique. Given the two previous displacement fields and , we compute the difference of aligned displacement vectors (with bilinear sampling), which gives us the acceleration of pixels. We then employ a simple constant acceleration motion model to each displacement vector and extrapolate for one or multiple frames. Obviously, this extrapolation technique has limitations but it is sufficient for our demonstrations of feature anticipation. We analyze the quality of the anticipated features in Section 5.2 by measuring the object detection quality in future frames.
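The constant-acceleration extrapolation can be sketched as follows. The sketch makes one simplifying assumption: the two previous displacement fields are taken to be already aligned pixel to pixel, whereas the text aligns them with bilinear sampling before taking the difference. The function name is hypothetical.

```python
import numpy as np

def extrapolate_flow(flow_prev, flow_curr, steps):
    """Predict future displacement fields under constant acceleration.

    flow_prev, flow_curr: (H, W, 2) displacement fields of the two most
    recent frame pairs, assumed to be aligned with each other.
    Returns a list with one predicted field per future step.
    """
    accel = flow_curr - flow_prev          # per-pixel change of displacement
    return [flow_curr + (k + 1) * accel for k in range(steps)]

prev = np.full((2, 2, 2), 1.0)
curr = np.full((2, 2, 2), 1.5)
future = extrapolate_flow(prev, curr, steps=2)
# future[0] is 2.0 everywhere, future[1] is 2.5 everywhere
```

The predicted fields can then be used to warp the memory forward one frame at a time, exactly as with displacement fields computed from images.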

5 Experiments

Our experimental evaluation focuses on the performance of our feature propagation and aggregation methods for object detection in videos. In Sections 5.1, 5.2 and 5.3, we evaluate the performance of the proposed MemNet and ClockNet on the three applications introduced in Section 4, respectively.


All our experiments are conducted on the ImageNet VID data set [5], which is most suitable for object detection in videos and has been the testbed for most recent approaches for this task [12, 11, 29, 14]. ImageNet VID is a large-scale data set consisting of 5354 video clips, captured at frame rates between 25 and 30 FPS and divided into training, validation and testing sets with 3862, 555 and 937 clips, respectively. Each clip is fully annotated with bounding box tracks of 30 different object classes.

Implementation details:

We use the ResNet-101 architecture [43] as the basic feature extractor in all experiments with the exception of some ablation studies. In particular, we use à trous convolutions as in [44, 14] to increase the feature resolution. Similar to [14] the extracted features are passed through a convolutional layer and a non-linear activation (ReLU [45]) before we provide them as inputs to MemNet and ClockNet. To estimate displacement fields we use FlowNet [30]. All the parameters are jointly fine-tuned end-to-end. In general, we closely follow the experimental setup of [14] using their publicly available MXNet [46] implementation. All models are trained for epochs with an initial learning rate of , which is decreased by a factor of after epochs. We train our models on the same mix of ImageNet DET and ImageNet VID training sets as in [14]. All experiments, including runtime measurement, are done on an NVIDIA TITAN Xp. For training, we use 4 GPUs.

5.1 Object detection in videos

Object detection in videos aims at localizing objects in every video frame, i.e., estimating bounding boxes around objects associated with a confidence score.

Evaluation metrics:

We measure detection performance as mean average precision (mAP) over all object classes, where we additionally differentiate between fast, medium and slowly moving objects using the subsets of videos introduced in [14]. We also measure the average runtime per frame in milliseconds (ms) for each model (using the same framework and GPU setup).


We compare the proposed MemNet and ClockNet models with several baselines. The first one is the per-frame object detector itself (R-FCN [15, 14]) that does not exploit temporal information. The second baseline is FGFA [14], which, for every frame, aggregates features from nearby frames both in the past and the future. Obviously, this makes FGFA a non-causal system not applicable to real-time tasks. Note that these two baselines represent two extremes of using temporal information, with R-FCN not exploiting the video at all and FGFA looking not only into the past but also into future frames. For a more fair comparison, we thus created a causal variant of FGFA (Cau. FGFA) that aggregates information only from nearby features in the past but not from future frames. While FGFA can only operate in the off-line setting where future frames are accessible, causal FGFA is an on-line detector, making it the most comparable baseline to our proposed models.



Figure 6: Accuracy and runtime trade-off of various methods.

| Method | mAP | mAP (fast) | mAP (med) | mAP (slow) | Runtime (ms) |
|---|---|---|---|---|---|
| R-FCN [15] | 73.4 | 51.4 | 71.6 | 82.4 | 108 |
| FGFA [14] | 76.2 | 56.0 | 75.2 | 83.8 | 286 |
| FGFA (half) | 66.7 | 42.2 | 66.1 | 77.5 | 152 |
| Cau. FGFA | 75.2 | 53.9 | 74.1 | 83.8 | 204 |
| Cau. FGFA* | 66.0 | 40.0 | 65.3 | 79.3 | 181 |
| MemNet-3 | 74.3 | 51.9 | 72.5 | 83.6 | 122 |
| MemNet-6 | 75.1 | 53.8 | 73.5 | 83.3 | 122 |
| MemNet-6-wgts | 75.3 | 51.8 | 73.6 | 83.8 | 124 |
| MemNet-6-strd-4 | 74.4 | 51.7 | 72.4 | 84.2 | 122 |
| MemNet-6-strd-8 | 74.2 | 51.6 | 72.6 | 82.7 | 122 |
| ClockNet | 75.6 | 55.4 | 73.7 | 83.4 | 169 |

Figure 7: Detection performance and runtime on ImageNet-VID validation of different methods.

Main results:

Table 7 summarizes our quantitative results and Figures 7(a)-7(b) give qualitative examples. We can first see that all models leveraging temporal data improve over the per-frame baseline R-FCN [15]. Looking into both past and future frames, as FGFA [14] does, gives the best overall results, but comes at a considerable runtime cost and, more importantly, is a non-causal system. Among all causal systems leveraging data only from the past (eight bottom models in Table 7), the proposed ClockNet gives the best results overall and is particularly strong for fast moving objects. Its mAP value is only 0.6 percentage points behind FGFA without having access to future frames.

Looking at the running times, we see that the proposed MemNet is clearly the fastest, except for the still-frame baseline, and ClockNet already ranks second. Both proposed models are faster than causal FGFA and consequently, FGFA [14]. Built upon the detection framework of R-FCN, all those models have the same computational complexity for feature extraction, proposals generation and classification. Therefore, the causes of speed difference are the numbers of flow computations and feature warps . For FGFA and Causal FGFA, and equal the number of frames within aggregation range, i.e. 20 and 10, respectively. The ClockNet reported in Table 7 requires and , corresponding to its temporal scales, and MemNet has a speed advantage because it only needs and to process an incoming frame.

We also note that the computation time of Causal FGFA can be reduced by aggregating displacement fields in an online manner, see Cau. FGFA* in Table 7, thus reducing the number of flow computations per frame to 1 as in MemNet. While the runtime is reduced (181 ms vs. 204 ms), the error accumulation in online aggregation of displacement fields leads to a significant accuracy drop (66.0 vs. 75.2 mAP). To highlight the reduction in runtime from MemNet and ClockNet, we show another trivial way to speed up FGFA by halving the image resolution, FGFA (half). However, this variant also leads to a large performance drop (66.7 vs. 76.2 mAP).

Overall, accuracy and runtime demonstrate the advantages of our memory propagation mechanism for object detection in videos, see Figure 7. Moreover, our memory-based architectures have the additional benefit (over FGFA) of being able to propagate features into future frames, which we analyze in Section 5.2 and demonstrate in a practical application in Section 5.3.

Ablation studies:

First, we investigate the importance of the length of the sequences used for training MemNet. In the testing phase there is no limitation on the sequence length, but GPU memory is a hard constraint during training as gradients need to flow back through the whole sequence. We can observe in Table 7 that a longer temporal window for training MemNet is indeed beneficial (MemNet-6 with 75.1 mAP vs. MemNet-3 with 74.3 mAP). Second, by comparing different aggregation schemes (Equations (3) and (4)), we can observe that learning an additional weighting gives a small improvement in overall mAP (75.3 vs. 75.1) with only a tiny increase in computational cost (124 ms vs. 122 ms). We also tried LSTM [22] as aggregation scheme, but did not observe any improvements. Finally, we analyze the importance of multiple temporal scales in ClockNet, which operates on three time axes, corresponding to temporal strides of 1, 4 and 8, respectively. To emphasize the benefit of feature aggregation across multiple temporal scales, we compare ClockNet to MemNet trained with longer strides (MemNet-6-strd-4 and MemNet-6-strd-8), giving it access to a larger temporal horizon during training. We can see from Table 7 that both baselines perform worse than MemNet-6 (with a temporal stride of 1), which illustrates the importance of the information provided by actual neighboring frames. Therefore, the success of ClockNet provides a signal for the benefits of a temporal multi-scale architecture, where different temporal scales are complementary to each other. More experiments on various details of our methods can be found in the supplemental material.

Fine-tuning FlowNet:

Finally, we want to understand the effect of fine-tuning FlowNet during training the video representation for the detection task. Similar to [14], we observe a performance drop in mAP if FlowNet is not fine-tuned. Given that displacement fields after fine-tuning are apparently better for the detection task, we visualize the difference for one example in Figure 7(c). One difference that we can observe is that the object tends to move as a whole and ignores the motion of individual parts.

5.2 Propagating and anticipating features

In the previous experiment, the appearance feature of the current frame is always available and the main purpose was to analyze the influence of additional information from the past. In this experiment, we want to give more insights into the quality of the feature propagation and anticipation performance, which has several applications as discussed in Section 4.


In this experiment, we provide image information only up to a given frame and then propagate the features over the following frames. Note that displacement fields are still available, but only for warping the memory; no image evidence in the form of appearance features is available to the detector after the last observed frame. The propagated features are then used to compute the detections. We compare our model with a baseline that takes the bounding boxes detected at the last observed frame and propagates them to the target frame with the computed displacement fields. The mean displacement inside a bounding box serves as the translation vector. Table 2 shows the impact of skipping the image features for different numbers of frames for MemNet, ClockNet and the box-propagating baselines. In general, the accuracy of all models degrades gracefully with larger propagation lengths, while the running time decreases. Feature propagation clearly outperforms propagation at the bounding-box level, particularly for ClockNet. From Figure 7(d), we can see that the performance gap between MemNet and ClockNet increases as the propagation length grows, demonstrating the impact of the multiple temporal scales and the extended time horizon of ClockNet.
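The box-propagating baseline described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the names `propagate_box` and `propagate_box_over_frames`, the `(H, W, 2)` layout of the displacement field, and the convention that the field points from the current frame to the next one are all hypothetical choices made for the example.

```python
import numpy as np


def propagate_box(box, flow):
    """Shift a box by the mean displacement found inside it.

    box:  (x1, y1, x2, y2) in pixel coordinates of the source frame.
    flow: array of shape (H, W, 2) with per-pixel (dx, dy) displacements
          from the source frame to the next frame (assumed convention).
    """
    h, w = flow.shape[:2]
    # Clip the box to the image and convert it to integer pixel indices.
    x1 = int(np.clip(np.floor(box[0]), 0, w - 1))
    y1 = int(np.clip(np.floor(box[1]), 0, h - 1))
    x2 = int(np.clip(np.ceil(box[2]), x1 + 1, w))
    y2 = int(np.clip(np.ceil(box[3]), y1 + 1, h))
    # The mean displacement inside the box is the translation vector.
    dx, dy = flow[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def propagate_box_over_frames(box, flows):
    """Apply propagate_box once per frame-to-frame displacement field."""
    for flow in flows:
        box = propagate_box(box, flow)
    return box
```

Because the whole box receives a single translation, this baseline cannot account for changes in object scale or shape, which is one plausible reason for the gap to feature-level propagation.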


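| Type | Method | mAP (0) | mAP (4) | mAP (8) | FPS (0) | FPS (4) | FPS (8) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Box-Propagation | MemNet | 75.1 | 64.6 | 55.4 | 8.2 | 14.7 | 16.9 |
| Box-Propagation | ClockNet | 75.6 | 64.9 | 56.2 | 5.9 | 12.5 | 15.3 |
| Feature-Propagation | MemNet | 75.1 | 68.9 | 56.1 | 8.2 | 14 | 16 |
| Feature-Propagation | ClockNet | 75.6 | 70.9 | 62.3 | 5.9 | 6.6 | 8.3 |
| Feature-Anticipation | MemNet | - | 68.8 | 55.9 | - | 5.1 | 8.7 |
| Feature-Anticipation | ClockNet | - | 67.0 | 57.3 | - | 2.5 | 3.4 |

Table 1: Runtime in FPS and accuracy in mAP of MemNet and ClockNet for box propagation, feature propagation and feature anticipation. The number in parentheses (0, 4 or 8) is the propagation length in frames.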

| Method | #G | mAP | FPS |
| --- | --- | --- | --- |
| R-FCN-18 | 1 | 60.2 | 14.3 |
| ClockNet-101 | 1 | 75.6 | 5.9 |
| ClockNet-101-FeatProp | 1 | 70.9 | 6.6 |
| R-FCN-18 (split) | 2 | 21.7 | 28.6 |
| ClockNet-101 (split) | 2 | 28.9 | 11.8 |
| ClockNet-101-FeatProp (split) | 2 | 23.3 | 13.2 |
| R-FCN-18+ClockNet-101-BoxProp | 2 | 66.9 | 14.3 |
| R-FCN-18+ClockNet-101-FeatProp | 2 | 71.9 | 14.3 |

Table 2: Accuracy and runtime of our two-threaded detection setup. The number of GPUs utilized is denoted as #G.


We next evaluate feature anticipation as discussed in Section 4.3, which differs from the previous experiment in that no image information at all is available for frames after the last observed frame. This requires us to extrapolate the displacement field as described in Section 4.3. Comparing the anticipation results with the propagation model in Table 2, we can see that our flow extrapolation strategy works well for MemNet, which has a short temporal stride, although the runtime speed drops due to our current non-optimized implementation of flow extrapolation. Conversely, on ClockNet memories, feature anticipation performs worse than feature propagation. One explanation is that the quality of the extrapolated flows degrades heavily with the long temporal strides of ClockNet.

5.3 Fast detection by propagating strong features

In this section we analyze the application described in Section 4.2 and Figure 5. A fast but weak object detector runs in the main thread and provides detection output at every frame, while leveraging features from a strong but slow feature extractor. In order to use the strong features, they have to be propagated to the current frame, i.e., aligned over time, to compensate for the delay.

We use R-FCN based on the ResNet-18 architecture as the fast main-thread detector, which runs at 14.3 FPS (Table 2). The helper thread runs ClockNet based on ResNet-101 (same as in Table 7), which is able to propagate features over time as demonstrated in Section 5.2. In practice, ClockNet-101 runs about 2.4 times slower than R-FCN-18 (5.9 vs. 14.3 FPS in Table 2). To compensate for additional overhead (e.g., feature propagation), we update ClockNet-101 with image evidence once every 4 frames (“ClockNet-101-FeatProp”). For training the fast detector, i.e., R-FCN-18, we only leverage fixed features coming from the second thread and do not fine-tune ClockNet-101.
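To make "aligned over time" concrete, the following NumPy sketch warps a feature map with one displacement field per frame until it reaches the current frame. It is a generic bilinear warping routine written for illustration, not the code used in the paper: the names `warp_features` and `propagate_memory`, the `(H, W, C)` and `(H, W, 2)` array layouts, the backward-sampling convention, and the border clamping are all assumptions.

```python
import numpy as np


def warp_features(memory, flow):
    """Bilinearly warp a feature map with a displacement field.

    memory: (H, W, C) features aligned with the previous frame.
    flow:   (H, W, 2) displacements (dx, dy). The output at (x, y) is
            sampled from memory at (x + dx, y + dy); this backward-warping
            convention is an assumption of the sketch.
    """
    h, w, _ = memory.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling coordinates, clamped to the valid range (border padding).
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets act as bilinear interpolation weights.
    wx = (sx - x0)[..., None]
    wy = (sy - y0)[..., None]
    top = (1 - wx) * memory[y0, x0] + wx * memory[y0, x1]
    bottom = (1 - wx) * memory[y1, x0] + wx * memory[y1, x1]
    return (1 - wy) * top + wy * bottom


def propagate_memory(memory, flows):
    """Warp the memory once per frame to bridge a delay of len(flows) frames."""
    for flow in flows:
        memory = warp_features(memory, flow)
    return memory
```

Each warp is a cheap gather-and-blend operation, which is consistent with the idea that propagating strong features costs far less than recomputing them with the slow network.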

Quantitative results:

Table 2 shows the results of this experiment. As expected, R-FCN-18 is the fastest model but also gives the worst accuracy. ClockNet-101 is the model shown in Section 5.1 and can be considered an upper bound in terms of accuracy, but it is very slow compared to R-FCN-18. ClockNet-101-FeatProp is the model running in the helper thread, which receives image evidence every 4 frames and uses propagation to make predictions in the other frames. Its performance drop compared to the upper bound is small, as also seen in Section 5.2, but the runtime is still high. However, when following the design proposed in Section 4.2, i.e., feeding R-FCN-18 with propagated strong features from ClockNet-101-FeatProp, we observe a significant boost of more than 10% mAP over R-FCN-18 (from 60.2 to 71.9), which is close to the ClockNet-101 upper bound (75.6), while maintaining the low runtime of R-FCN-18.

Comparison to two-thread baselines:

Recall that achieving a speed-up with two GPUs in a real-time (data-streaming) setting is non-trivial, as parallelizing frames over multiple GPUs is not possible. We still evaluate alternative baselines that leverage two GPUs. First, a trivial two-times speed-up can be achieved by splitting the image into two halves, running the detector on each half individually and merging the detections before NMS, denoted “(split)” in Table 2. While this gives the expected speed-up, the performance drop is significant, which can be explained by the fact that objects in ImageNet-VID are mostly centered, so that splitting the image effectively truncates them. The second baseline is more involved and similar to our proposed design. However, instead of propagating features, (delayed) detections from the strong model are propagated to align with the faster detector (“BoxProp”), as in Section 5.2. As in the previous experiment, we observe that feature-level propagation is superior to propagating bounding boxes.
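The merging step of the “(split)” baseline can be illustrated with a small pure-Python sketch. It assumes a vertical split into a left and a right half, detections given as `(x1, y1, x2, y2, score)` tuples in the coordinates of their own half, and a greedy class-agnostic NMS; the helper names `iou`, `nms` and `merge_split_detections` and the threshold of 0.5 are hypothetical and not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(dets, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    keep = []
    for det in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(det, kept) <= thresh for kept in keep):
            keep.append(det)
    return keep


def merge_split_detections(left, right, x_offset, thresh=0.5):
    """Map right-half boxes back to full-image coordinates, then run NMS."""
    shifted = [(x1 + x_offset, y1, x2 + x_offset, y2, s)
               for (x1, y1, x2, y2, s) in right]
    return nms(list(left) + shifted, thresh)
```

In this sketch, an object that straddles the split line yields one truncated box per half. After shifting, the two boxes merely touch, so NMS keeps both and neither covers the full object, which matches the explanation given above for the large accuracy drop.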

Figure 8: (a-b) Qualitative examples of our memory models and the R-FCN baseline. (c) The flow fields before and after fine-tuning demonstrate the impact of jointly fine-tuning the FlowNet with the object detector. (d) Mean AP of MemNet and ClockNet with respect to different propagation lengths.

6 Conclusions

Our work proposes a long-term online video representation that effectively leverages past information. We rely on a memory that is updated regularly with image evidence and is warped over time for a proper spatio-temporal alignment of features. We also extend this representation to multiple temporal scales, thus aggregating information from even farther in the past. Our experiments illustrate the benefits of each component and demonstrate two practical applications: feature anticipation and a real-time multi-threaded object detector.

In future work, we plan to investigate the effectiveness of our learned video representation for action recognition and segmentation in videos. We also plan to further explore the utility of feature anticipation and postulate that causally learned representations like ours have inherent advantages for visual anticipation.


  • [1] Cutting, J.E.: Perception with an Eye for Motion. MIT Press (1986)
  • [2] Gibson, J.: The Ecological Approach to Visual Perception. Houghton Mifflin (1979)
  • [3] Ellis, W.: A Source Book of Gestalt Psychology. Routledge (1938)
  • [4] Wertheimer, M.: Laws of organization in perceptual forms. Psychologische Forschung 3 (1938)
  • [5] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV 115(3) (2015) 211–252
  • [6] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv:1609.08675 (2016)
  • [7] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature Pyramid Networks for Object Detection. In: CVPR. (2017)
  • [8] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal Loss for Dense Object Detection. In: ICCV. (2017)
  • [9] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single Shot MultiBox Detector. In: ECCV. (2016)
  • [10] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In: NIPS. (2015)
  • [11] Kang, K., Ouyang, W., Li, H., Wang, X.: Object Detection from Video Tubelets with Convolutional Neural Networks. In: CVPR. (2016)
  • [12] Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to Track and Track to Detect. In: ICCV. (2017)
  • [13] Han, W., Khorrami, P., Paine, T.L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., Huang, T.S.: Seq-NMS for Video Object Detection. CoRR abs/1602.08465 (2016)
  • [14] Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Flow-Guided Feature Aggregation for Video Object Detection. In: ICCV. (2017)
  • [15] Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. In: NIPS. (2016)
  • [16] Koutník, J., Greff, K., Gomez, F., Schmidhuber, J.: A Clockwork RNN. In: ICML. (2014)
  • [17] Shelhamer, E., Rakelly, K., Hoffman, J., Darrell, T.: Clockwork Convnets for Video Semantic Segmentation. In: Video Semantic Segmentation Workshop at ECCV. (2016)
  • [18] Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: CVPR. (2017) 3039–3048
  • [19] Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: CVPR. (2016) 761–769
  • [20] Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In: CVPR. (2015)
  • [21] Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised Learning of Video Representations using LSTMs. In: ICML. (2015)
  • [22] Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9 (1997) 1735–1780
  • [23] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning Spatiotemporal Features with 3D Convolutional Networks. In: ICCV. (2015)
  • [24] Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A.: Describing Videos by Exploiting Temporal Structure. In: ICCV. (2015)
  • [25] Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: ICCV. (2015) 2794–2802
  • [26] Simonyan, K., Zisserman, A.: Two-Stream Convolutional Networks for Action Recognition in Videos. In: NIPS. (2014)
  • [27] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional Two-Stream Network Fusion for Video Action Recognition. In: CVPR. (2016)
  • [28] Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal Multiplier Networks for Video Action Recognition. In: CVPR. (2017)
  • [29] Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep Feature Flow for Video Recognition. In: CVPR. (2017)
  • [30] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning Optical Flow With Convolutional Networks. In: ICCV. (2015)
  • [31] Han, W., Khorrami, P., Paine, T.L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., Huang, T.S.: Seq-NMS for Video Object Detection. CoRR abs/1602.08465 (2016)
  • [32] Kundu, A., Vineet, V., Koltun, V.: Feature Space Optimization for Semantic Video Segmentation. In: CVPR. (2016)
  • [33] Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting Deeper into the Future of Semantic Segmentation. In: ICCV. (2017)
  • [34] Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating Visual Representations From Unlabeled Video. In: CVPR. (2016)
  • [35] Yang, F., Choi, W., Lin, Y.: Exploit All the Layers: Fast and Accurate CNN Object Detector With Scale Dependent Pooling and Cascaded Rejection Classifiers. In: CVPR. (2016)
  • [36] Mathieu, M., Couprie, C., Lecun, Y.: Deep Multi Scale Video Prediction Beyond Mean Square Error. (2016)
  • [37] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial Transformer Networks. In: NIPS. (2015)
  • [38] Pinheiro, P.O., Collobert, R.: Recurrent Convolutional Neural Networks for Scene Labeling (2014)
  • [39] Liang, M., Hu, X.: Recurrent Convolutional Neural Network for Object Recognition. In: CVPR. (2015)
  • [40] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In: EMNLP. (2014)
  • [41] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection. In: CVPR. (2016)
  • [42] Redmon, J., Farhadi, A.: YOLO9000: Better, Faster, Stronger. In: CVPR. (2017)
  • [43] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR. (2016)
  • [44] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
  • [45] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: ICML. (2010)
  • [46] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR (2015)
  • [47] Kar, A., Rai, N., Sikka, K., Sharma, G.: Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. arXiv preprint arXiv:1611.08240 (2016)

