Single Level Feature-to-Feature Forecasting with Deformable Convolutions


Josip Šarić, Marin Oršić, Tonći Antunović, Sacha Vražić, Siniša Šegvić
University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia
Rimac Automobili, Sveta Nedelja, Croatia

Future anticipation is of vital importance in autonomous driving and other decision-making systems. We present a method to anticipate semantic segmentation of future frames in driving scenarios based on feature-to-feature forecasting. Our method is based on a semantic segmentation model without lateral connections within the upsampling path. Such a design ensures that the forecasting addresses only the most abstract features at a very coarse resolution. We further propose to express feature-to-feature forecasting with deformable convolutions. This increases the modelling power, since deformable convolutions can represent different motion patterns within a single feature map. Experiments show that our models with deformable convolutions outperform their regular and dilated counterparts while minimally increasing the number of parameters. Our method achieves state-of-the-art performance on the Cityscapes validation set when forecasting nine timesteps into the future.

1 Introduction

The ability to anticipate the future is an important attribute of intelligent behavior, especially in decision-making systems such as robot navigation and autonomous driving. It allows actions to be planned not only by looking at the past, but also by considering the future. Accurate anticipation is critical for reliable decision-making of autonomous vehicles. The farther the forecast, the longer the time to avoid undesired outcomes of motion. We believe that semantic forecasting will be one of the critical concepts for avoiding accidents in future autonomous driving systems.

There are three meaningful levels at which forecasting could be made: raw images, feature tensors, and semantic predictions. Forecasting raw images [26, 20] is known to be a hard problem. Better results have been obtained with direct forecasting of semantic segmentation predictions [18]. The third approach is to forecast feature tensors instead of predictions [25]. Recent work [17] proposes a bank of feature-to-feature (F2F) models which target different resolutions along the upsampling path of a feature pyramid network [16]. Each F2F model receives corresponding features from the four previous frames (t, t-3, t-6, t-9) and forecasts the future features (t+3 or t+9). The forecasted features are used to predict instance-level segmentations [8] at the corresponding resolution level.

This paper addresses forecasting of future semantic segmentation maps in road driving scenarios. We propose three improvements with respect to the original F2F approach [17]. Firstly, we base our work on a single-frame model without lateral connections. This requires only one F2F model which targets the final features of the convolutional backbone. These features are very well suited for the forecasting task due to high semantic content and coarse resolution. Secondly, we express our F2F model with deformable convolutions [31]. This greatly increases the modelling power due to the capability to account for different kinds of motion patterns within a single feature map. Thirdly, we provide an opportunity for the two independently trained submodels (F2F, upsampling path) to adapt to each other by joint fine-tuning. This would be very difficult to achieve with multiple F2F models [17] since the required set of cached activations would not fit into GPU memory. Thorough forecasting experiments on Cityscapes val [4] demonstrate state-of-the-art mid-term (t+9) performance and runner-up short-term (t+3) performance, where we come second only to [24], which requires a large computational effort to extract optical flow prior to the forecast. Two experiments on Cityscapes test suggest that our performance estimates on the validation subset contain very little bias (if any).

2 Related Work

Semantic segmentation.

State of the art methods for semantic segmentation [30, 3, 28, 14] have overcome the 80% mIoU barrier on Cityscapes test. However, these methods are not well suited for F2F forecasting due to huge computational cost and large GPU memory footprint. We therefore base our research on a recent semantic segmentation model [22] which achieves a great ratio between accuracy (75.5 mIoU Cityscapes test) and speed (39 Hz on GTX1080Ti with 2MP input). This model is a great candidate for F2F [17] forecasting due to a backbone with low-dimensional features (ResNet-18, 512D) and a lean upsampling path similar to FPN [16]. In particular, we rely on a slightly impaired version of that model (72.5 mIoU Cityscapes val) with no lateral connections in the upsampling path.

Raw image forecasting.

Predicting future images is interesting because it opens opportunities for unsupervised representation learning on practically unlimited data. It has been studied in many directions: exploiting adversarial training [20], anticipating arbitrary future frames [26], or leveraging past forecasts to autoregressively anticipate further into the future [11].

Feature forecasting.

Feature-level forecasting was first used to anticipate appearance and actions in video [25]. The approach uses past features to forecast the last AlexNet layer of a future frame. Later work [17] forecasts convolutional features and interprets them with the Mask-RCNN [8] head of the single-frame model. F2F approaches are applicable to dense prediction tasks such as panoptic segmentation [13], semantic segmentation [30], optical flow [23] etc.

Semantic segmentation forecasting.

Luc et al. [18] set a baseline for direct semantic segmentation forecasting by processing softmax preactivations from past frames. Nabavi et al. [21] train an end-to-end model which forecasts intermediate features by convolutional LSTM [27]. Bhattacharyya et al. [1] use Bayesian learning to model the multi-modal nature of the future and directly predict future semantic segmentation of road driving scenes. None of the previously mentioned approaches utilize optical flow despite its usefulness for video recognition [7]. Jin et al. [10] jointly forecast semantic segmentation predictions and optical flow. They use features from the optical flow subnet to provide better future semantic maps. Terwilliger et al. [24] predict future optical flow and obtain future prediction by warping the semantic segmentation map from the current frame.

Convolutions with a wide field of view.

Convolutional models [15] proved helpful in most visual recognition tasks. However, stacking vanilla convolutional layers often results in an undersized receptive field. Consequently, the receptive field has been enlarged with dilated convolutions [29] and spatial pyramid pooling [30]. However, these techniques are unable to efficiently model the geometric warps required by F2F forecasting. Early work on warping convolutional representations involved a global affine transformation at the tensor level [9]. Deformable convolutions [5] extend this idea by introducing per-activation convolutional warps, which makes them especially well-suited for F2F forecasting.

3 Single-Level F2F model with Deformable Convolutions

We propose a method for semantic segmentation forecasting composed of i) feature extractor (ResNet-18), ii) F2F forecasting model, and iii) upsampling path, as illustrated in Fig. 1(b). Yellow trapezoids represent ResNet processing blocks RB1 - RB4 which form the feature extractor. The red rectangle represents the F2F model. The green rhombus designates spatial pyramid pooling (SPP) while the blue trapezoids designate modules which form the upsampling path.

Fig. 1(a) shows the single-frame model which we use to train the feature extractor and the upsampling path. We also use this model as an oracle which predicts future segmentation by observing a future frame. Experiments with the oracle estimate an upper bound on the performance of semantic segmentation forecasting.

Figure 1: Structural diagram of the employed single-frame model (a) and the proposed compound model for forecasting semantic segmentation (b). The two models share the ResNet-18 feature extractor (yellow) and the upsampling path (green, blue).

3.1 Training Procedure

The training starts from a public parameterization of the feature extractor pre-trained on ImageNet [6]. We jointly train the feature extractor and the upsampling path for single-frame semantic segmentation [22]. We use that model to extract features at times t-9, t-6, t-3, and t (sources), as well as at time t+dt (target). We then train the F2F model with L2 loss in an unsupervised manner. However, the forecasting induces a covariate shift due to imperfect F2F prediction. Therefore, we adapt the upsampling path to noisy forecasted features by fine-tuning the F2F model and the upsampling path using cross-entropy loss with respect to ground truth labels. We update the F2F parameters by averaging gradients from F2F L2 loss and the backpropagated cross-entropy loss.
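The fine-tuning update of the F2F parameters can be written compactly. The following is only a sketch in illustrative notation that does not appear in the source: $\theta$ stands for the F2F parameters, $\hat{X}_{t+\Delta t}$ for the forecasted features, $X_{t+\Delta t}$ for the features extracted from the future frame, and $\mathcal{L}_{\mathrm{CE}}$ for the cross-entropy loss of the upsampled forecast. The factor 1/2 is the plain average of the two gradients.

```latex
\mathcal{L}_{\mathrm{F2F}} = \lVert \hat{X}_{t+\Delta t} - X_{t+\Delta t} \rVert_2^2 ,
\qquad
g_{\theta} = \tfrac{1}{2} \left( \nabla_{\theta} \mathcal{L}_{\mathrm{F2F}} + \nabla_{\theta} \mathcal{L}_{\mathrm{CE}} \right)
```

In this reading the upsampling path receives only the cross-entropy gradient, while the F2F model is updated with $g_{\theta}$.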

3.2 Proposed Feature-to-Feature Model

We propose a single-level F2F model operating on features from the last convolutional layer of ResNet-18. We formulate our model as a sequence of N deformable convolutions and denote it as DeformF2F-N. The first convolution of the sequence has the largest number of input feature maps since it blends features from all previous frames. Therefore we set its kernel size to 1×1. All other convolutions have 3×3 kernels and 128 feature maps, except the last one which recovers the number of feature maps to match the backbone output.
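The layer layout described above can be checked with a rough parameter count. The sketch below is not the authors' code: it assumes a 1×1 kernel in the first layer, 3×3 kernels elsewhere, four concatenated 512-channel inputs and no bias terms, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, kernel):
    """Number of weights in a kernel x kernel convolution without bias."""
    return c_in * c_out * kernel * kernel


def f2f_layer_specs(n_layers, backbone_dim=512, n_frames=4, width=128):
    """(in_channels, out_channels, kernel) for each layer of an N-layer F2F stack."""
    specs = [(backbone_dim * n_frames, width, 1)]   # first layer blends all past frames
    specs += [(width, width, 3)] * (n_layers - 2)   # intermediate layers
    specs.append((width, backbone_dim, 3))          # last layer restores the backbone width
    return specs


specs = f2f_layer_specs(5)
total = sum(conv_params(*s) for s in specs)
print(specs)
print(total)  # 1294336
```

Under these assumptions a five-layer stack has about 1.29M weights, which is close to the 1.30M listed for ConvF2F-5 in Table 4. A deformable variant additionally needs parameters for predicting the sampling offsets, which this sketch does not count.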

The proposed formulation differs from the original F2F architecture [17] in three important details. Firstly, we forecast backbone features instead of features from the upsampling path. Backbone features have a larger dimensionality, and are closer to ImageNet pre-trained parameters due to the reduced learning rate during joint training. Hence, these features are more distinctive than features trained for recognition of only 19 classes. Forecasting SPP features decreased the validation performance by 1 percentage point (pp) mIoU in early experiments.

Secondly, we use a single-level F2F model which performs the forecasting at a very coarse resolution (1/32 of the original image). This is beneficial since small feature displacements simplify motion prediction (as in optical flow). Early multi-level forecasting experiments decreased performance by 2 pp mIoU.

Thirdly, we use thin deformable convolutions [5] instead of thick dilated ones. This decreases the number of parameters and improves the performance, as presented in the ablation experiments. Feature-to-feature forecasting is geometrically rather than semantically demanding, since the inputs and the outputs are at the same semantic level. Regular convolutions lack the potential to learn geometrical transformations due to fixed grid sampling locations. In deformable convolutions, the grid sampling locations are displaced with learned per-pixel offsets which are inferred from the preceding feature maps. We believe that learnable displacements are a good match for F2F transformation since they are able to model semantically aware per-object dynamics across observed frames.
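The sampling mechanism can be illustrated on a single-channel toy grid. This is a minimal sketch of the idea, not the implementation used in the experiments: the offsets are passed in by hand, whereas a deformable layer predicts them from the input features, and the function names are hypothetical.

```python
import math


def bilinear(feat, y, x):
    """Sample a 2D grid (list of rows) at a fractional location; zero outside the grid."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value


def deform_conv_at(feat, kernel, cy, cx, offsets):
    """3x3 deformable convolution response at output location (cy, cx).

    offsets[k] = (dy, dx) displaces the k-th tap of the regular 3x3 grid.
    """
    out = 0.0
    k = 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            dy, dx = offsets[k]
            out += kernel[i + 1][j + 1] * bilinear(feat, cy + i + dy, cx + j + dx)
            k += 1
    return out


feat = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
ones = [[1.0] * 3 for _ in range(3)]
zero = [(0.0, 0.0)] * 9    # no displacement: regular convolution
shift = [(0.0, 1.0)] * 9   # every tap moved one pixel to the right

print(deform_conv_at(feat, ones, 2, 2, zero))   # 126.0
print(deform_conv_at(feat, ones, 2, 2, shift))  # 135.0
```

With zero offsets the response equals a regular convolution, while a constant offset reads the neighbourhood of a displaced location. Because offsets are defined per output location, different regions of the same feature map can follow different displacements.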

3.3 Inference

The proposed method requires features from four past frames. These features are concatenated and fed to the F2F module which forecasts the future features. The future features are fed to the upsampling path which predicts the future semantic segmentation. A perfect F2F forecast would attain performance of the single-frame model applied to the future frame, which we refer to as oracle.

The proposed method is suitable for real-time semantic forecasting since the feature extractor needs to be applied only once per frame. Consider the computational complexity of the single-frame model as the baseline. Then the only overhead for a single forecast corresponds to caching four feature tensors and evaluating the F2F model. If we require both the current prediction and a single forecast, then the overhead would additionally include one evaluation of the upsampling path.
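The caching scheme can be sketched as a small rolling buffer. This is an illustrative sketch only: the class name `FeatureCache` and its methods are hypothetical, and strings stand in for backbone feature tensors.

```python
from collections import deque


class FeatureCache:
    """Keeps the most recent backbone features so that each frame is encoded only once."""

    def __init__(self, stride=3, num_sources=4):
        self.stride = stride
        # Frames t-9 ... t span stride * (num_sources - 1) + 1 = 10 time steps.
        self.buffer = deque(maxlen=stride * (num_sources - 1) + 1)

    def push(self, features):
        """Store the features of the newest frame; the oldest entry is dropped when full."""
        self.buffer.append(features)

    def ready(self):
        return len(self.buffer) == self.buffer.maxlen

    def sources(self):
        """Features at t-9, t-6, t-3 and t, ordered from oldest to newest."""
        if not self.ready():
            raise ValueError("not enough frames observed yet")
        return [self.buffer[i] for i in range(0, len(self.buffer), self.stride)]


cache = FeatureCache()
for frame_idx in range(13):
    cache.push(f"features[{frame_idx}]")  # stand-in for the backbone output
print(cache.sources())
# ['features[3]', 'features[6]', 'features[9]', 'features[12]']
```

The four returned tensors would then be concatenated and passed to the F2F model, so the per-frame cost beyond the single-frame model is the cache update plus one F2F evaluation.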

4 Experiments

We perform experiments on the Cityscapes dataset [4] which contains 2975 training, 500 validation and 1525 test images with dense labels from 19 classes. The dataset includes 19 preceding and 10 succeeding unlabeled frames for each image. Each such mini-clip is 1.8 seconds long. Let $X_t$ denote features from the last convolutional layer of ResNet-18 at time $t$. The shape of these features is 512 × H/32 × W/32, where 512 is the number of feature maps, while H and W are image dimensions. Then, the model input is a tuple of features $(X_{t-9}, X_{t-6}, X_{t-3}, X_t)$. The model outputs future features $X_{t+3}$ (short-term prediction, 0.18 s) or $X_{t+9}$ (mid-term prediction, 0.54 s) [17], which in most experiments correspond to the labeled frame in a mini-clip.
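The tensor shapes and time offsets follow from simple arithmetic. The sketch below assumes full-size 1024×2048 Cityscapes images, channel-wise concatenation of the four source tensors, and a frame period derived from the 1.8 s duration of a 30-frame mini-clip; the function names are hypothetical.

```python
def backbone_feature_shape(height, width, channels=512, stride=32):
    """Shape (C, H/32, W/32) of the last ResNet-18 feature tensor."""
    return (channels, height // stride, width // stride)


def f2f_input_shape(height, width, n_frames=4):
    """Shape after concatenating the features of n_frames past frames along channels."""
    c, h, w = backbone_feature_shape(height, width)
    return (c * n_frames, h, w)


FRAME_PERIOD_S = 1.8 / 30  # 30 frames per mini-clip (19 preceding + labeled + 10 succeeding)

print(backbone_feature_shape(1024, 2048))  # (512, 32, 64)
print(f2f_input_shape(1024, 2048))         # (2048, 32, 64)
print(round(3 * FRAME_PERIOD_S, 2), round(9 * FRAME_PERIOD_S, 2))  # 0.18 0.54
```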

4.1 Implementation Details

We use the deformable convolution implementation from [2]. The features are pre-computed from full-size Cityscapes images and stored on an SSD. We optimize the L2 regression loss with Adam [12]. We set the learning rate to 5e-4 and train our F2F models for 160 epochs with batch size 12 in all experiments. We fine-tune our model with SGD with the learning rate set to 1e-4 and batch size 8 for 5 epochs. The training takes around 6 hours on a single GTX1080Ti.

We measure semantic segmentation performance on the Cityscapes val dataset. We report the standard mean intersection over union metric over all 19 classes. We also measure mIoU for 8 classes representing moving objects (person, rider, car, truck, bus, train, motorcycle, and bicycle).
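Both metrics can be computed from a class confusion matrix. The following is a minimal sketch: it assumes rows index ground truth and columns index predictions, and that classes follow the usual Cityscapes training order, in which the eight moving-object classes occupy indices 11 to 18. The toy matrix has only three classes.

```python
import math


def per_class_iou(conf):
    """IoU = TP / (TP + FP + FN) for each class of a square confusion matrix."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(conf, classes=None):
    """Mean IoU over all classes, or over a subset such as the moving objects."""
    ious = per_class_iou(conf)
    idx = range(len(conf)) if classes is None else classes
    vals = [ious[c] for c in idx if not math.isnan(ious[c])]
    return sum(vals) / len(vals)


MOVING_OBJECTS = list(range(11, 19))  # person ... bicycle, assuming the usual class order

toy = [[3, 1, 0],
       [1, 2, 0],
       [0, 0, 4]]
print([round(v, 2) for v in per_class_iou(toy)])  # [0.6, 0.5, 1.0]
print(round(mean_iou(toy), 2))                    # 0.7
print(mean_iou(toy, classes=[1, 2]))              # 0.75
```

For a 19-class matrix, `mean_iou(conf, MOVING_OBJECTS)` would correspond to the mIoU-MO column reported in the tables.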

4.2 Comparison with the State of the Art on Cityscapes Val

Table 1 evaluates several models for semantic segmentation forecasting. The first section shows the performance of the oracle, and the copy-last-segmentation baseline which applies the single-frame model to the last observed frame. The second section shows results from the literature. The third section shows our results. The last section shows our result when the F2F model is trained on two feature tuples per mini-clip. The row Luc F2F applies the model proposed in [17] as a component of our method. The methods DeformF2F-5 and DeformF2F-8 correspond to our models with 5 and 8 deformable convolutions respectively. The suffix FT denotes that our F2F model is fine-tuned with cross-entropy loss.

Method | Short-term mIoU | Short-term mIoU-MO | Mid-term mIoU | Mid-term mIoU-MO
Oracle | 72.5 | 71.5 | 72.5 | 71.5
Copy last segmentation | 52.2 | 48.3 | 38.6 | 29.6
Luc Dil10-S2S [18] | 59.4 | 55.3 | 47.8 | 40.8
Luc Mask-S2S [17] | / | 55.3 | / | 42.4
Luc Mask-F2F [17] | / | 61.2 | / | 41.2
Nabavi [21] | 60.0 | / | / | /
Terwilliger [24] | 67.1 | 65.1 | 51.5 | 46.3
Bhattacharyya [1] | 65.1 | / | 51.2 | /
Luc F2F (our implementation) | 59.8 | 56.7 | 45.6 | 39.0
DeformF2F-5 | 63.4 | 61.5 | 50.9 | 46.4
DeformF2F-8 | 64.4 | 62.2 | 52.0 | 48.0
DeformF2F-8-FT | 64.8 | 62.5 | 52.4 | 48.3
DeformF2F-8-FT (2 samples per seq.) | 65.5 | 63.8 | 53.6 | 49.9
Table 1: Semantic forecasting on the Cityscapes validation set.

Poor results of copy-last-segmentation reflect the difficulty of the forecasting task. Our method DeformF2F-8 outperforms Luc F2F by 4.6 pp mIoU in short-term and by 6.4 pp mIoU in mid-term forecasting. In comparison with the state of the art, we achieve the best mid-term performance, while coming close to [24] in short-term, despite a weaker oracle (72.5 vs 74.3 mIoU) and not using optical flow. Cross-entropy fine-tuning improves results by 0.4 pp mIoU both for the short-term and the mid-term model. We applied DeformF2F-8-FT to Cityscapes test and achieved results similar to those on the validation set: 64.3 mIoU (short-term) and 52.6 mIoU (mid-term).

The last result in the table shows benefits of training on more data. Here we train our F2F model on two farthest tuples (instead of one) in each mini-clip. Cross entropy fine-tuning is done in the regular way, since groundtruth is available only in the 19th frame in each mini-clip. We notice significant improvement of 0.7 and 1.2 pp mIoU for short-term and mid-term forecast respectively.

4.3 Single-Step vs. Autoregressive Mid-term Forecast

There are two possible options for predicting further than one step into the future: i) train a separate single-step model for each desired forecast interval, ii) train only one model and apply it autoregressively. Autoregressive forecast applies the same model in a recurrent manner, by using the current prediction as input to each new iteration. Once the model is trained, the autoregression can be used to forecast an arbitrary number of periods into the future. Unfortunately, autoregression accumulates prediction errors from intermediate forecasts. Hence, the compound forecast tends to be worse than in the single-step case.
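The autoregressive option can be sketched as a sliding window over past and forecasted features. This is an illustrative sketch: `toy_model` is a hypothetical stand-in that extrapolates scalars linearly, whereas the real forecaster operates on feature tensors.

```python
def rollout(model, sources, n_steps):
    """Apply a single-step forecaster repeatedly, feeding each forecast back as input."""
    window = list(sources)  # ordered from oldest to newest
    forecasts = []
    for _ in range(n_steps):
        nxt = model(window)
        forecasts.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest entry, append the new forecast
    return forecasts


def toy_model(window):
    """Hypothetical stand-in for F2F: linear extrapolation from the two newest inputs."""
    return 2 * window[-1] - window[-2]


# Scalars observed at t-9, t-6, t-3 and t; three steps reach t+3, t+6 and t+9.
print(rollout(toy_model, [0.0, 3.0, 6.0, 9.0], 3))  # [12.0, 15.0, 18.0]
```

The toy model is exact on this input, so no error accumulates here. With an imperfect forecaster every step consumes earlier forecasts, which is why the compound forecast tends to degrade with the number of steps.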

DeformF2F-8 variant | mIoU | mIoU-MO
single-step | 52.4 | 48.3
autoregressive 3× | 48.7 | 43.5
autoregressive 3× fine-tuned | 51.2 | 46.5
Table 2: Validation of autoregressive mid-term forecast on Cityscapes val.

Table 2 validates autoregressive models. The first row shows our single-step model (cf. Table 1) for mid-term forecast. The middle row shows the baseline autoregressive forecast with our corresponding short-term model. The last row shows improvement due to recurrent fine-tuning for mid-term prediction, while initializing with the same short-term model as in the middle row. Fine-tuning brings 2.5 pp mIoU improvement with respect to the autoregressive baseline. Nevertheless, the single-step model outperforms the best autoregressive model.

Model | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mean
Oracle | 97.5 | 81.6 | 90.7 | 50.1 | 53.4 | 56.1 | 60.3 | 70.8 | 90.9 | 60.9 | 92.9 | 75.9 | 53.0 | 93.2 | 67.4 | 84.4 | 72.0 | 54.5 | 71.7 | 72.5
Short-term | 96.1 | 73.9 | 87.0 | 47.9 | 50.8 | 35.8 | 51.4 | 57.2 | 86.7 | 56.0 | 88.7 | 58.8 | 41.4 | 86.3 | 64.8 | 75.2 | 63.7 | 48.5 | 60.6 | 64.8
Mid-term | 93.2 | 61.2 | 79.6 | 41.6 | 45.1 | 15.1 | 31.9 | 33.2 | 78.3 | 49.1 | 80.1 | 39.1 | 24.6 | 72.9 | 60.0 | 63.5 | 46.5 | 37.5 | 41.9 | 52.4
AR-3 | 95.8 | 71.1 | 84.9 | 42.0 | 52.2 | 35.0 | 46.2 | 53.5 | 85.0 | 50.0 | 88.0 | 59.0 | 36.6 | 86.2 | 68.5 | 71.7 | 60.6 | 51.8 | 58.0 | 63.0
AR-6 | 94.3 | 64.2 | 80.9 | 37.6 | 48.6 | 23.5 | 35.4 | 40.6 | 80.1 | 46.8 | 82.8 | 48.4 | 26.3 | 78.8 | 64.9 | 66.0 | 50.0 | 44.5 | 49.4 | 56.0
AR-9 | 93.4 | 61.1 | 78.0 | 37.7 | 46.2 | 17.5 | 28.4 | 30.9 | 77.0 | 44.5 | 79.3 | 41.8 | 23.2 | 74.4 | 63.7 | 60.7 | 34.0 | 42.1 | 43.5 | 51.5
AR-12 | 92.6 | 57.7 | 75.3 | 36.5 | 44.1 | 13.5 | 21.5 | 25.4 | 74.2 | 42.2 | 75.7 | 35.5 | 18.3 | 69.8 | 57.1 | 53.8 | 29.6 | 37.7 | 37.3 | 47.3
AR-15 | 91.6 | 53.8 | 72.9 | 35.7 | 42.0 | 10.8 | 17.9 | 20.1 | 71.1 | 36.4 | 71.6 | 31.6 | 13.2 | 64.5 | 40.6 | 48.0 | 34.7 | 24.4 | 32.9 | 42.9
AR-18 | 90.7 | 51.4 | 71.0 | 33.9 | 40.9 | 9.1 | 14.7 | 15.6 | 68.9 | 34.5 | 69.0 | 29.2 | 12.4 | 60.4 | 38.2 | 46.6 | 16.8 | 25.1 | 28.2 | 39.9
Table 3: Single-step and autoregressive per-class results on Cityscapes val. The autoregressive rows (AR-3 to AR-18) are evaluated only on Frankfurt sequences, where long clips are available.

Table 3 shows per-class auto-regressive performance for different forecasting offsets. The three sections correspond to the oracle, two single-step models, and autoregressive application of the last model from Table 2. Autoregressive experiments have been performed on 267 sequences from the Frankfurt subset of Cityscapes val. Long clips are not available for other cities.

Among all moving-object classes, the performance drop due to forecasting is largest for the class person. We believe that this is because persons are articulated: it is not enough for the model to determine the new position of the object center, the model also needs to determine positions and poses of the parts (legs and arms). Poles seem to be the hardest static class because of their thin shape. Qualitative results (e.g. the last two rows of Figure 4) show that poles often get dominated by large surrounding classes (building, sidewalk, road etc.).

Figure 2 plots mIoU results from the third section of Table 3 for various temporal offsets of the future frame, and explores contribution of autoregressive fine-tuning. We show mIoU and mIoU-MO (solid and dashed lines resp.) for a straight autoregressive model (red), and a model that was autoregressively fine-tuned for mid-term forecast (blue).

Figure 2: Autoregressive mIoU performance at different forecasting offsets for the straight short-term model (red) and the model fine-tuned for mid-term prediction (blue).

4.4 Validation of Deformable Convolutions

Table 4 compares the mIoU performance and the number of parameters for various design choices. Our DeformF2F-5 model achieves a 4-fold decrease in the number of parameters with respect to Luc F2F. Dilated and deformable convolutions achieve the largest impact in mid-term forecasting where the feature displacements are comparatively large. Dilation achieves a slight improvement on mid-term prediction. Deformable convolutions improve both the short-term and mid-term results while significantly outperforming the dilation models. This clearly validates the choice of deformable convolutions for F2F forecasting.

Model | Short-term mIoU | Short-term mIoU-MO | Mid-term mIoU | Mid-term mIoU-MO | #params
Luc F2F | 59.8 | 56.7 | 45.6 | 39.0 | 5.50M
ConvF2F-5 | 60.4 | 56.6 | 43.8 | 36.3 | 1.30M
DilatedF2F-5 | 60.0 | 56.9 | 45.6 | 38.8 | 1.30M
DeformF2F-5 | 63.4 | 61.5 | 50.9 | 46.4 | 1.43M

Table 4: Validation of plain, 2× dilated and deformable convolutions on Cityscapes.

4.5 Ablation of the Number of Input Frames

Table 5 investigates the impact of the number of input frames on short-term and mid-term performance. We always sample frames three steps apart. For instance, the second row in the table observes frames at t-6, t-3, and t. The model operating on a single frame performs significantly worse than the models which observe multiple frames. Such a model can only predict the movement direction from object posture and/or orientation, while it is often very hard to forecast the magnitude of motion without looking at at least one frame from the past. Models operating on two and three frames produce short-term forecasts comparable to the four-frame model. Adding more frames from the past always improves the accuracy of mid-term forecasts. This suggests that the models benefit from past occurrences of the parts of the scene which are disoccluded in the forecasted frame. This effect is visible only in mid-term prediction, since such occlusion-disocclusion patterns are unlikely to occur across short time intervals.

Model | #frames | Short-term mIoU | Short-term mIoU-MO | Mid-term mIoU | Mid-term mIoU-MO
DeformF2F-8 | 4 | 64.4 | 62.2 | 52.0 | 48.0
DeformF2F-8 | 3 | 64.4 | 62.5 | 50.9 | 46.2
DeformF2F-8 | 2 | 64.5 | 62.6 | 50.7 | 46.2
DeformF2F-8 | 1 | 57.7 | 54.3 | 44.2 | 37.8
Table 5: Ablation of the number of input frames. Two input frames are enough for short-term forecasting. More input frames improve performance of mid-term forecasts.

4.6 Could a Forecast Improve the Prediction in the Current Frame?

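We consider an ensemble of a single-frame model which observes the current frame and a forecasting model which observes past frames. The predictions of the ensemble are a weighted average of the softmax activations of the two models, where $P_{\mathrm{sf}}$ and $P_{\mathrm{f}}$ denote the outputs of the single-frame and the forecasting model and $\lambda \in [0, 1]$ is the mixing weight:

$$P = \lambda \, P_{\mathrm{sf}} + (1 - \lambda) \, P_{\mathrm{f}} \qquad (1)$$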

Similar results are achieved for . Table 6 presents experiments on Cityscapes val. The first two rows show the oracle and our best short-term model. The third row ensembles the previous two models according to (1). We observe 0.3pp improvement over the single-frame model. This may be interesting in autonomous driving applications which would need semantic segmentation for the current and the future frame in each time instant. In that case, the proposed ensemble would require no additional cost, since the forecast from the previous time instant can be cached. On the other hand, evaluating an ensemble of two single-frame models would imply double computational complexity.
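The ensembling step amounts to a per-pixel convex combination of two class distributions. The sketch below is illustrative: the weight 0.8 is a placeholder rather than a value from the experiments, and the three-class distributions are made up.

```python
def ensemble(p_current, p_forecast, weight):
    """Weighted average of two softmax distributions for one pixel."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_current, p_forecast)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


p_current = [0.48, 0.52, 0.00]   # single-frame model applied to the current frame
p_forecast = [0.80, 0.10, 0.10]  # forecast of the current frame, cached earlier
mixed = ensemble(p_current, p_forecast, weight=0.8)  # placeholder weight

print(argmax(p_current), argmax(mixed))  # 1 0
```

In this made-up example the single-frame model is nearly undecided between two classes and a confident forecast tips the decision. Since the result is a convex combination of two distributions, it still sums to one.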

Model | mIoU | mIoU-MO
Single frame model | 72.5 | 71.5
DeformF2F-8-FT | 64.8 | 62.5
Ensemble | 72.8 | 71.8
Table 6: Performance of the ensemble of a single-frame model which observes the current frame with a forecasting model which observes only the four past frames.

4.7 Qualitative Results

Figures 3 and 4 show forecasted semantic segmentation on Cityscapes val for short-term and mid-term predictions respectively.

Figure 3: Short-term semantic segmentation forecasts (0.18 s into the future) for 3 sequences. The columns show i) the last observed frame, ii) the future frame, iii) the groundtruth segmentation, iv) our oracle, and v) our semantic segmentation forecast.

We observe a loss of spatial detail when forecasting sequences with greater dynamics and when predicting further into the future (cf. the first row in Figures 3 and 4). Row 4 in Figure 4 shows a red car turning left.

Figure 4: Mid-term semantic segmentation predictions (0.54 s into the future) for 5 sequences. The columns show i) the last observed frame, ii) the future frame, iii) the ground truth segmentation, iv) our oracle, and v) our semantic segmentation forecast.

Our model inferred the future spatial location of the car quite accurately. The last row shows a car which disoccludes the road opposite the camera. Our model correctly inferred the car motion and in-painted the disoccluded scenery in a feasible although not completely correct manner.

Effective receptive field.

We express the effective receptive field by measuring partial derivatives of log-max-softmax [19] with respect to the four input images. The absolute magnitude of these gradients quantifies the importance of particular pixels for the given prediction. Figure 5 visualizes the results for our DeformF2F-8-FT mid-term model.

Figure 5: Effective receptive field of mid-term forecast in 4 sequences. Columns show the four input frames, the future frame t+9 and the corresponding semantic segmentation forecast. We show pixels with the strongest gradient of log-max-softmax (red dots) in a hand-picked pixel (green dot) w.r.t. each of the input frames.

The four leftmost columns show input images, while the two rightmost columns show the future image (unavailable to the model) and the semantic forecast. The green dot in the two rightmost columns designates the examined prediction. The red dots designate pixels in which the absolute magnitude of the gradient of the examined prediction is larger than a threshold. The threshold is dynamically set to the value of the k-th (k = 3000, top 0.15 percent) largest gradient within the last observed frame (t). In other words, we show pixels with the top k gradients in the last observed frame, as well as a selection of pixels from the other frames according to the same threshold. We notice that the most important pixels come from the last observed frame. Row 1 considers a static pixel which does not generate strong gradients in frames t-3, t-6, and t-9. Other rows consider dynamic pixels. We observe that the most important pixels for a given prediction usually correspond to the object location in past frames. Distances between object locations in the last observed and the forecasted frame are often larger than 300 pixels. This emphasizes the role of deformable convolutions since the F2F model with plain convolutions is unable to compensate for such large offsets. The figure also illustrates the difficulty of forecasting in road-driving videos, and the difference between this task and single-frame semantic segmentation. These visualizations allow us to explain and interpret successes and failures of our model and to gauge the range of its predictions. In particular, we notice that most mid-term decisions rely only on pixels from the last two frames. This is in accordance with the mid-term experiments from Table 5, which show that frames t-6 and t-9 contribute only 1.3 pp mIoU.
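The thresholding rule can be sketched on flattened gradient maps. This is a toy illustration with four pixels per frame and k = 2; the function names are hypothetical, and the comparison is inclusive so that exactly the top k pixels of the last frame are selected.

```python
def topk_threshold(grad_last_frame, k):
    """Magnitude of the k-th largest gradient in the last observed frame."""
    mags = sorted((abs(g) for g in grad_last_frame), reverse=True)
    return mags[k - 1]


def select_pixels(grads_per_frame, k):
    """Per frame, indices of pixels whose |gradient| reaches the last-frame threshold."""
    thr = topk_threshold(grads_per_frame[-1], k)
    return [[i for i, g in enumerate(frame) if abs(g) >= thr]
            for frame in grads_per_frame]


grads = [
    [0.0, 0.1, -0.05, 0.0],  # t-9
    [0.0, 0.2, 0.0, -0.6],   # t-6
    [0.3, 0.0, -0.7, 0.1],   # t-3
    [0.9, -0.5, 0.2, 0.05],  # t (last observed frame sets the threshold)
]
print(select_pixels(grads, k=2))  # [[], [3], [2], [0, 1]]
```

Because the threshold is fixed by the last observed frame, earlier frames contribute pixels only where their gradients are comparably strong, which is how a static pixel ends up with no highlighted pixels in older frames.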

5 Conclusion and Future Work

We have presented a novel method for anticipating semantic segmentation of future frames in driving scenarios based on feature-to-feature (F2F) forecasting. Unlike previous methods, we forecast the most abstract backbone features with a single F2F model. This greatly improves the inference speed and favors the forecasting performance due to the coarse resolution and high semantic content of the involved features. The proposed F2F model is based on deformable convolutions in order to account for the geometric nature of F2F forecasting. We use a lightweight single-frame model without lateral connections, which allows us to adapt the upsampling path to F2F noise by fine-tuning with respect to ground truth labels. We perform experiments on the Cityscapes dataset. To the best of our knowledge, our mid-term semantic segmentation forecasts outperform all previous approaches. Our short-term model comes second only to a method which uses a stronger single-frame model and relies on optical flow. Evaluation on Cityscapes test suggests that our validation performance contains very little bias (if any). Suitable directions for future work include adversarial training of the upsampling path, complementing image frames with optical flow, investigating end-to-end learning, as well as evaluating performance on the instance segmentation task.


This work has been funded by Rimac Automobili. This work has been partially supported by European Regional Development Fund (DATACROSS) under grant KK. We thank Pauline Luc and Jakob Verbeek for useful discussions during early stages of this work.


  • [1] Bhattacharyya, A., Fritz, M., Schiele, B.: Bayesian prediction of future street scenes using synthetic likelihoods. arXiv preprint arXiv:1810.00746 (2018)
  • [2] Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: mmdetection. (2018)
  • [3] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2018)
  • [4] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)
  • [5] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. pp. 764–773 (2017)
  • [6] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
  • [7] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 1933–1941 (2016)
  • [8] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
  • [9] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS. pp. 2017–2025 (2015)
  • [10] Jin, X., Xiao, H., Shen, X., Yang, J., Lin, Z., Chen, Y., Jie, Z., Feng, J., Yan, S.: Predicting scene parsing and motion dynamics in the future. In: Advances in Neural Information Processing Systems. pp. 6915–6924 (2017)
  • [11] Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K.: Video pixel networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1771–1779. JMLR. org (2017)
  • [12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [13] Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation. arXiv preprint arXiv:1801.00868 (2018)
  • [14] Krešo, I., Krapac, J., Šegvić, S.: Efficient ladder-style densenets for semantic segmentation of large images. arXiv preprint arXiv:1905.05661 (2019)
  • [15] LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10),  1995 (1995)
  • [16] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)
  • [17] Luc, P., Couprie, C., Lecun, Y., Verbeek, J.: Predicting future instance segmentation by forecasting convolutional features. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 584–599 (2018)
  • [18] Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 648–657 (2017)
  • [19] Luo, W., Li, Y., Urtasun, R., Zemel, R.: Understanding the effective receptive field in deep convolutional neural networks. In: Advances in neural information processing systems. pp. 4898–4906 (2016)
  • [20] Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015)
  • [21] Nabavi, S.S., Rochan, M., Wang, Y.: Future semantic segmentation with convolutional lstm. BMVC (2018)
  • [22] Oršić, M., Krešo, I., Bevandić, P., Šegvić, S.: In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. arXiv preprint arXiv:1903.08469 (2019)
  • [23] Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8934–8943 (2018)
  • [24] Terwilliger, A.M., Brazil, G., Liu, X.: Recurrent flow-guided semantic forecasting. arXiv preprint arXiv:1809.08318 (2018)
  • [25] Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023 2 (2015)
  • [26] Vukotić, V., Pintea, S.L., Raymond, C., Gravier, G., van Gemert, J.C.: One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network. In: International Conference on Image Analysis and Processing. pp. 140–151. Springer (2017)
  • [27] Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c.: Convolutional lstm network: A machine learning approach for precipitation nowcasting. In: Advances in neural information processing systems. pp. 802–810 (2015)
  • [28] Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: Denseaspp for semantic segmentation in street scenes. In: CVPR. pp. 3684–3692 (2018)
  • [29] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
  • [30] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [31] Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable convnets v2: More deformable, better results. arXiv preprint arXiv:1811.11168 (2018)