# ScopeFlow: Dynamic Scene Scoping for Optical Flow

## Abstract

We propose to modify the common training protocols of optical flow, leading to sizable accuracy improvements without adding to the computational complexity of the training process. The improvement is based on observing the bias in sampling challenging data that exists in the current training protocol, and improving the sampling process. In addition, we find that both regularization and augmentation should decrease during the training protocol.

Using a low-parameter, off-the-shelf model, the method ranks first on the MPI Sintel benchmark among all methods, improving the accuracy of the best two-frame method by more than 10%. The method also surpasses all similar architecture variants by more than 12% and 19.7% on the KITTI benchmarks, achieving the lowest Average End-Point Error on KITTI2012 among two-frame methods, without using extra datasets.

## 1 Introduction

The field of optical flow estimation has benefited from the availability of acceptable benchmarks. In the last few years, the architectures of choice have stabilized, and a greater emphasis has been placed on the training protocol.

A conventional training protocol now consists of two stages: (i) pretraining on larger and simpler data and (ii) finetuning on more complex datasets. In both stages, a training step includes the following: (i) sampling batch frames and flow maps, (ii) applying photometric augmentations to the frames, (iii) applying affine (global and relative) transformations to the frames and flow maps, (iv) cropping a fixed size random crop from both input and flow maps, (v) feeding the cropped frames into a CNN [15] architecture, and (vi) backpropagating the loss of the flow estimation.

While photometric augmentations include variations of the input image values, affine transformations are used to augment the variety of input flow fields. Due to the limited motion patterns represented by today's optical flow datasets, these regularization techniques are required for data-driven training. We use the term scoping to denote the process of affine transformation followed by cropping, as this process sets the scope of the input frames.

To improve optical flow training, we ask the following questions: Q1. How do fixed size crops affect this training? Q2. What defines a good set of scopes for optical flow? Q3. Should regularization be relaxed after pretraining?

Our experiments employ the smallest PWC-Net [27] variant of Hur & Roth [11], with only 6.3M trainable parameters, in order to support low-memory, real-time processing. We demonstrate that by answering these questions and contributing to the training procedure, it is possible to train a two-frame, monocular, small model to outperform all other models on the MPI Sintel benchmark. The trained model improves the accuracy of the baseline model, which uses the same architecture, by 12%. See Fig. 1 for a comparison to other networks.

Moreover, despite using the smallest PWC-Net variant, our model outperformed all other PWC-Net variants on both KITTI 2012 and KITTI 2015 benchmarks, improving the baseline model results by 12.2% and 19.7% on the public test set, and demonstrating once more the power of using the improved training protocol.

Lastly, although no public benchmark is available for occlusion estimation, we compared our occlusion results to other published results on the full Sintel dataset, demonstrating an improvement of more than 5% over the best published F1 score.

Our main contributions are: (i) showing, for the first time, as far as we can ascertain, that CNN training for optical flow and occlusion estimation can benefit from cropping randomly sized scene scopes, (ii) exposing the powerful effect of regularization and data augmentation on CNN training for optical flow and (iii) presenting an updated generally applicable training scheme and testing it across benchmarks, on the widely used PWC-Net network architecture.

Our code is attached as supplementary material and our models will be openly shared, in order to encourage follow-up work, support reproducibility, and provide improved performance to off-the-shelf real-time models.

## 2 Related work

The deep learning revolution in optical flow started with deep descriptors [29, 6, 2] and densification methods [34]. Dosovitskiy et al. [4] presented FlowNet, the first deep end-to-end network for dense optical flow matching, later improved by Ilg et al. [12], who incorporated classic approaches such as residual image warping. Ranjan & Black [24] showed that the deep model size can be much smaller with a coarse-to-fine pyramidal structure. Hui et al. [9, 10] suggested a lighter version of FlowNet, adding feature matching, pyramidal processing and feature-driven local convolution. Xu et al. [31] adapted semi-global matching [8] to directly process a reshaped 4D cost volume of features learned by a CNN, inspired by common practices in stereo matching. Yang & Ramanan [32] suggested a method for directly learning to process the 4D cost volume, with a separable 4D convolution. Sun et al. [27] proposed PWC-Net, which includes pyramidal processing of warped features and direct processing of a partial layer-wise cost volume, demonstrating strong performance on optical flow benchmarks. Many extensions were suggested to the PWC-Net architecture, among them multi-frame processing, occlusion estimation, iterative warping and weight sharing [25, 23, 17, 11].

Pretraining optical flow models  Today's leading optical flow learning protocols include pretraining on large-scale data. The common practice is to pretrain on FlyingChairs [4] and then on FlyingThings3D [20] (FChairs and FThings). As shown by recent works [19, 12], the ordering of the multistage pretraining is critical. The FChairs dataset includes 21,818 pairs of frames, generated by CAD models [1], with Flickr images as background. FThings is a natural extension of the FChairs dataset, having 22,872 larger 3D scenes with more complex motion patterns. Hur & Roth [11] created a version of FChairs with ground truth occlusions, called FlyingChairsOcc (denoted FChairsOcc), to allow supervised pretraining on occlusion labels.

Datasets and benchmarks  The establishment of large, complex benchmarks, such as MPI Sintel [3] and KITTI [7, 22], boosted the evolution of optical flow models. The MPI Sintel dataset was created from the Sintel movie, and is composed of 25 relatively long annotated scenes, with 1064 training frames in total. The final pass category of Sintel is a challenging one, having many realistic effects to mimic natural scenes. The KITTI2012 dataset comprises 194 training pairs with annotated flow maps, while KITTI2015 has 200 dynamic color training pairs. Furthermore, some methods use additional datasets during finetuning, such as HD1K [14], Driving and Monkaa [20].

Motion categories  MPI Sintel provides a stratified view of the error magnitude of challenging motion patterns. The best mean error for the large motion category (faster than 40 pixels per frame) is approximately 44 times that of the small motion category (slower than 10 pixels per frame). In Sec. 3, we present one possible theoretical explanation for the poor performance of state-of-the-art methods on large motions, and suggest an approach to improve the accuracy for this pixel category.

Another example is the category of unmatched pixels. This category includes pixels belonging to regions that are visible only in one of two adjacent frames (occluded pixels). As expected, these pixels have a much higher end-point error than matchable pixels: the best EPE for unmatched pixels is approximately 9.5 times the best EPE for matchable pixels.

Different deep learning approaches were suggested to tackle the problems of fast objects and occlusion estimation. Among the different solutions suggested were: occlusion based loss [28] and model [16, 17] separation, and multi-frame support for long-range, potentially occluded, spatio-temporal matches [25, 23]. We suggest a new approach for applying multiple strategies online. Our findings imply that the training can be improved by applying scene scope variations, while taking into account the probability of sampling valid examples from different flow categories.

Training procedure and data augmentation  Fleet & Weiss [5] showed the importance of photometric variations, boundary detection and scale invariance to the success of optical flow methods. In recent years, as the evolution of optical flow models started to saturate, training variation studies became more popular [17]. Sun et al. [26] used training updates to improve the accuracy of the initial PWC-Net model by more than 10%, and showed that their updated training protocol could improve the reported accuracy of FlowNetC (a sub-network of FlowNet) by more than 50%, surpassing FlowNet2 [10] performance. Mayer et al. [19] suggest that no single best general-purpose training protocol exists for optical flow, and that different datasets require different care. These conclusions are in line with our findings on the importance of proper training.

### 2.1 PWC-Net architectures

PWC-Net [27] is the most popular architecture for optical flow estimation to date, and many variants of this architecture have been suggested [25, 21, 11, 17, 23]. The PWC-Net architecture builds on traditional design patterns for estimating optical flow from two temporally consecutive frames, such as pyramidal coarse-to-fine estimation, cost volume processing, iterative layerwise feature warping and others.

Features warping  In PWC-Net, a CNN encoder creates feature maps for the different network layers (scales). The features of the second image are backward warped, using the upsampled flow of the previous layer processing, for every layer $l$ except the last layer, by:

$$c_w^l(x) = c_2^l\left(x + \mathrm{up}_{\times 2}\left(f^{l+1}(x)\right)\right) \tag{1}$$

where $x$ is the pixel location, $c_w^l$ is the backward warped feature map of the second image, $f^{l+1}$ is the output flow of the coarser layer, and $\mathrm{up}_{\times 2}$ is the up-sampling module, followed by a bi-linear interpolation.
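The warping step of Eq. 1 can be illustrated with a minimal NumPy sketch. It assumes the coarser flow has already been upsampled to the resolution of the current layer, stores the flow as `(dx, dy)` per pixel, and clamps sampling positions at the image border; the function name `backward_warp` and these conventions are illustrative assumptions, not details taken from PWC-Net or IRR-PWC.

```python
import numpy as np


def backward_warp(feat2, flow):
    """Sample feat2 (H, W, C) at x + flow(x) with bilinear interpolation.

    flow is (H, W, 2) holding (dx, dy) per pixel and is assumed to be
    already upsampled to the resolution of feat2.
    """
    h, w, _ = feat2.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling positions, clamped to the image (an assumption of this sketch).
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (sx - x0)[..., None]  # horizontal interpolation weight
    wy = (sy - y0)[..., None]  # vertical interpolation weight
    top = (1 - wx) * feat2[y0, x0] + wx * feat2[y0, x1]
    bottom = (1 - wx) * feat2[y1, x0] + wx * feat2[y1, x1]
    return (1 - wy) * top + wy * bottom
```

With a zero flow the function returns the input features unchanged, and a constant flow of one pixel to the right shifts the feature map by one column.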

Cost volume decoding  A correlation operation is applied on the first and backward warped second image features, in order to construct a cost volume:

$$\mathrm{cost}^l(x_1, x_2) = \frac{1}{N}\left(c_1^l(x_1)\right)^{T} c_w^l(x_2) \tag{2}$$

where $c_n^l(x)$ is a feature vector of image $n$.

The cost volume is then processed by a CNN decoder, in order to estimate the optical flow directly. In some variants of PWC-Net [23, 11] there is an extra decoder with similar architecture for occlusion estimation.
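As a rough illustration of Eq. 2, the sketch below computes a partial cost volume over a square search window. Two things are assumptions of this sketch rather than statements from the text: $N$ is taken to be the feature dimension, and the window radius `max_disp` is a hypothetical parameter.

```python
import numpy as np


def cost_volume(feat1, feat2_warped, max_disp=4):
    """Correlate feat1 (H, W, C) with shifted copies of feat2_warped.

    Returns (H, W, (2 * max_disp + 1) ** 2); channel k holds the normalized
    dot product for displacement (dy - max_disp, dx - max_disp).
    """
    h, w, c = feat1.shape
    d = max_disp
    # Zero-pad so that every displacement in the window is defined.
    padded = np.pad(feat2_warped, ((d, d), (d, d), (0, 0)))
    side = 2 * d + 1
    cost = np.empty((h, w, side * side), dtype=float)
    k = 0
    for dy in range(side):
        for dx in range(side):
            shifted = padded[dy:dy + h, dx:dx + w]
            # Eq. 2 with N assumed to be the feature length c.
            cost[..., k] = (feat1 * shifted).sum(axis=-1) / c
            k += 1
    return cost
```

The central channel (zero displacement) of `cost_volume(f, f)` equals the per-pixel mean of squared features, which is a convenient sanity check.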

Iterative residual processing  Our experiments employ the Iterative Residual Refinement (IRR) proposed by Hur & Roth [11]. The reasons we chose to test our changes for the PWC-Net architecture on the IRR variant are: (i) IRR has the lowest number of trainable parameters among all PWC-Net variants, making a state-of-the-art result obtained with proper training more significant, (ii) it uses shared weights that could be challenged with scope and scale changes, and if successful, it would demonstrate the power of rich, scope-invariant feature representations, (iii) this variant uses only two frames, demonstrating the power of dynamic scoping without long temporal connections, (iv) the occlusion map allows the direct evaluation of our training procedure on occlusion estimation, and (v) any success with this variant directly translates to real-time, relatively low-complexity optical flow estimation.

## 3 Scene scoping analysis

Due to the limited number of labeled examples available for supervised optical flow learning, most of the leading methods, in both supervised and unsupervised learning, crop fixed-size, randomly located patches. We are interested in understanding the chances of a pixel being sampled within a randomly located fixed-size crop, as a function of its location in the image.

1D image random cropping statistics  Consider a 1D image with a width $W$, a crop size $w$ and a pixel location $x$. Let $\Delta x$ denote the distance of the pixel from the closest border, and $\Delta w = W - w$ denote the difference between the image width and the crop size. We distinguish crop sizes larger than half of the width, $w > W/2$, from the complementary crop sizes smaller than or equal to half of the width, $w \le W/2$. Two instances of this setup are depicted in Fig. 2.

Using the notations above, the pixels are separated into three different categories, described in the following lemma.

###### Lemma 1.

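For an image size $W$ and a randomly chosen crop of size $w$, the probability of a pixel, with coordinate $x$ and distance $\Delta x$ to the closest border, to be sampled, is as follows:

$$P(x \mid W, w) = \begin{cases} 1 & \text{if } \Delta w < \Delta x \\ \dfrac{w}{\Delta w + 1} & \text{if } w \le \Delta x \\ \dfrac{\Delta x}{\Delta w + 1} & \text{otherwise} \end{cases} \tag{3}$$

where $\Delta w + 1$ is the number of valid crops.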

###### Proof.

For illustration purposes, the three cases are color coded, respectively, as green, orange, and red, in Fig. 2. We handle each case separately. (i) Green: Every valid placement must leave out up to $\Delta w$ pixels from the left or the right. Since $\Delta x$ is larger than $\Delta w$, the pixel must be covered. (ii) Orange: In this case, there are $w$ possible locations for pixel $x$ within the patch, all of which are achievable, since $\Delta x$ is large enough. Therefore, $w$ patch locations out of all $\Delta w + 1$ possible patches contain pixel $x$. (iii) Red: In this case, the patch can start at any displacement from the edge that is nearer to $x$, that is not larger than $\Delta x$, and still cover pixel $x$. Therefore, there are exactly $\Delta x$ locations out of the possible $\Delta w + 1$ locations. ∎
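Lemma 1 can be checked numerically by enumerating every valid crop position. The sketch below assumes a 1-based pixel coordinate, so that a border pixel has $\Delta x = 1$; this indexing convention and both function names are assumptions made for illustration.

```python
import math


def lemma_probability(x, W, w):
    """Eq. 3 for a 1-based pixel coordinate x in a 1D image of width W."""
    dx = min(x, W + 1 - x)  # distance to the closest border (border pixel -> 1)
    dw = W - w              # difference between image width and crop size
    if dw < dx:             # green case: always covered
        return 1.0
    if w <= dx:             # orange case: w of the dw + 1 crops cover x
        return w / (dw + 1)
    return dx / (dw + 1)    # red case: dx of the dw + 1 crops cover x


def brute_force_probability(x, W, w):
    """Fraction of valid crop positions that contain pixel x."""
    starts = range(1, W - w + 2)  # 1-based left edge of every valid crop
    hits = sum(1 for s in starts if s <= x <= s + w - 1)
    return hits / len(starts)


def check_lemma(max_width=40):
    """Compare Eq. 3 with enumeration for all widths up to max_width."""
    for W in range(1, max_width + 1):
        for w in range(1, W + 1):
            for x in range(1, W + 1):
                a = lemma_probability(x, W, w)
                b = brute_force_probability(x, W, w)
                if not math.isclose(a, b):
                    return False
    return True
```

Calling `check_lemma()` compares the closed form with the enumeration for every image width, crop size and pixel position up to the chosen limit.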

2D image random cropping statistics  Since the common practice in current state-of-the-art optical flow training protocols is to use a fixed-size crop, we focus in the remainder of this section on the green and red categories, which are the relevant categories for crop sizes with each dimension $[h, w]$ larger than half of the corresponding image dimension (i.e., $h > H/2$ and $w > W/2$), and represent a cropping of more than a quarter of the image.

From the symmetry of our 1D random cropping setup, in both the $x$ and $y$ axes, we can use Eq. 3 to calculate the probability of sampling pixels in a 2D image of size $H \times W$, with a randomly located crop of a fixed size $h \times w$. The probability of sampling a central (green) pixel remains 1, while the probability of sampling a marginal (red) pixel in 2D is given by:

$$P_{red}(x, y \mid H, h, W, w) = \frac{\min(\Delta x, \Delta w)\,\min(\Delta y, \Delta h)}{(\Delta w + 1)(\Delta h + 1)} \tag{4}$$

where $\Delta w$ and $\Delta h$ are the differences between the image and the crop width and height, and $\Delta x$ and $\Delta y$ represent the distances from the closest border, as before. Eq. 4 represents the ratio between the number of crop locations where a (marginal) pixel is sampled and the number of all unique valid crop locations. An illustration of this sampling probability is given in Fig. 3 for varying ratios of crop size axes and image axes.

Fixed crop size sampling bias  As in the 1D cropping setup, given an image of size $H \times W$ and a crop size $h \times w$, we can define a central area (equivalent to the green pixels in 1D), which will always be sampled. Respectively, we can define a marginal area (equivalent to the red pixels in 1D), where Eq. 4 holds.

Eq. 3 links this marginal area to the values $\Delta w$ and $\Delta h$. Analyzing Eq. 4, we can infer the following: (i) in the marginal area, for a fixed crop size $h \times w$, the probability of being sampled decreases quadratically along the image diagonal, when $\Delta x$ and $\Delta y$ both decrease together, and (ii) in the marginal area, for a fixed pixel, the probability of being sampled decreases quadratically when the crop size decreases (when $h$ and $w$ both decrease together).

Therefore, when using a fixed sized crop to augment a dataset with a random localization approach, there will be a dramatic sampling bias towards pixels in the center of the image, preserved by the symmetric range of random affine parameters. For example, with the current common cropping approach for the MPI Sintel data-set, the probability of the upper left corner pixel to be sampled in a crop equals , while the pixels in the central crop will be sampled in any randomized crop location.
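To make the size of this bias concrete, the following sketch evaluates the 2D sampling probability as the product of Eq. 3 along each axis, which coincides with Eq. 4 for pixels that are marginal along both axes. The 1024×436 frame size is the MPI Sintel resolution; the 768×384 crop is a hypothetical example chosen for illustration and is not a value stated in this text. Distances are 1-based, so a corner pixel has $\Delta x = \Delta y = 1$.

```python
def axis_probability(d, size, crop):
    """Eq. 3 along one axis; d is the 1-based distance to the closest border."""
    slack = size - crop  # difference between image size and crop size
    if slack < d:
        return 1.0       # central pixel: covered by every valid crop
    return min(d, crop) / (slack + 1)


def pixel_probability(dx, dy, W, w, H, h):
    """Probability that a pixel falls inside a random h x w crop of an H x W image."""
    return axis_probability(dx, W, w) * axis_probability(dy, H, h)


# Hypothetical setting: Sintel-sized frame with an assumed 768 x 384 crop.
W, H = 1024, 436
w, h = 768, 384
corner = pixel_probability(1, 1, W, w, H, h)
center = pixel_probability(W // 2, H // 2, W, w, H, h)
print(f"corner: {corner:.2e}, center: {center:.1f}")
```

Under these assumed sizes the corner pixel is covered by exactly one of the $(\Delta w + 1)(\Delta h + 1)$ valid crop positions, while the central pixel is covered by all of them.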

This sampling bias could have a sizable influence on the training procedure. Fig. 3 illustrates the distribution of fast pixels in both MPI Sintel and KITTI datasets. Noticeably, pixels of fast moving objects (with speed larger than 40 pixels per frame) are often located at the marginal area, while slower pixels are more centered in the scene. This should not be a surprise, since (i) lower pixels belong to nearer scene objects and thus have a larger scale and velocity, and (ii) fast moving objects usually cross the image borders.

Moreover, many occluded pixels are also located close to the image borders, so increasing the crop size could also help to observe a more representative set of occlusions during training. We therefore hypothesized that larger crops can also improve the ability to infer the motion of occluded pixels from the context.

### 3.1 Scene scoping approaches

Fig. 3 shows the effect of the crop size on the probability of sampling different motion categories. Clearly, the category of fast pixels suffers the most from a reduction of the crop size. We tested four different strategies for cropping the scene dynamically (per mini batch) during training: (S1) fixed partial crop size (the common approach), (S2) cropping the largest valid crop size, (S3) randomizing crop size ratios from a pre-defined set with:

$$R_{fixed} = \{(0.73, 0.69), (0.84, 0.86), (1, 1)\} \tag{5a}$$

$$(r_h, r_w) = \mathrm{randchoice}(R_{fixed}), \tag{5b}$$

where $(r_h, r_w)$ is one of the three crop ratio pairs, and strategy (S4) is a range-based crop size randomization:

$$s = \mathrm{randint}\left(\mathrm{round}(r_{min} \cdot S), \mathrm{round}(r_{max} \cdot S)\right), \tag{6}$$

where $s$ is the crop axis size ($h$ or $w$), $S$ is the full image axis size ($H$ or $W$), and $[r_{min}, r_{max}]$ is the range of crop size ratios for sampling.
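Strategies S3 and S4 can be sketched in a few lines of Python. The function names, the inclusive upper bound of the integer sampling, and the rounding applied in S3 are assumptions of this sketch; the text specifies only Eqs. 5 and 6.

```python
import random

R_FIXED = [(0.73, 0.69), (0.84, 0.86), (1.0, 1.0)]  # Eq. 5a


def crop_size_s3(H, W, rng=random):
    """S3: pick one of the pre-defined (r_h, r_w) ratio pairs (Eq. 5b)."""
    r_h, r_w = rng.choice(R_FIXED)
    return round(r_h * H), round(r_w * W)


def crop_size_s4(H, W, r_min, r_max=1.0, rng=random):
    """S4: sample each crop axis independently from a ratio range (Eq. 6)."""
    h = rng.randint(round(r_min * H), round(r_max * H))  # bounds are inclusive
    w = rng.randint(round(r_min * W), round(r_max * W))
    return h, w


def random_crop_origin(H, W, h, w, rng=random):
    """Uniformly choose the top-left corner of a valid h x w crop."""
    top = rng.randint(0, H - h)
    left = rng.randint(0, W - w)
    return top, left
```

A typical use would draw one crop size per mini batch, for example `h, w = crop_size_s4(H, W, r_min=0.95)`, then draw a crop origin per sample and apply the same window to both frames and to the flow map.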

We also employ different affine transformations, and dynamically change the zooming range along the training, to enlarge the set of possible scene scopes, and improve the robustness of features to different scales. In Sec. 5.2 we describe the experiments done in order to find an appropriate approach for feeding the network with a diversity of scene scopes and reducing the inherent sampling bias explained in this section, caused by the current cropping mechanisms.

In addition to testing the scope modifications based on our analysis, we were also interested in testing different parameters of the training.

## 4 Training, regularization and augmentation

Learning rate and training schedules  The common LR schedules, proposed by Ilg et al. [12], used to train deep optical flow networks are or for the pretraining stage, and for the finetune phases. We used the shorter schedule, suggested by [11], of using for pretrain, half of for FThings finetuning, and for Sintel and KITTI datasets. We also examine the effect of retraining and over-training specific stages of the multi-phase training.

Data augmentation  The common practice in the current training protocol employs two types of data augmentation techniques: photo-metric and geometric augmentations. The details of these augmentations did not change much since FlowNet [4]. The photo-metric transformations include input image perturbation, such as color and gamma corrections. The geometric augmentations include a global or relative affine transformation, followed by random horizontal and vertical flipping. Due to the spatial symmetric nature of the translation, rotation and flipping parameters, we decided to focus on the effect of zooming changes, followed by our new cropping approaches.

Regularization  The common protocol also includes weight decay and adding random Gaussian noise to the augmented image. In our experiments, we tested the effect of eliminating these sources of regularization at different stages of training.
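One way to express per-stage regularization is a small configuration object that can be relaxed for later stages. Everything in the sketch below is hypothetical: the class and function names, and the numeric values, are illustrative placeholders and are not taken from the paper or its code.

```python
from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True)
class StageRegularization:
    name: str
    weight_decay: float  # passed to the optimizer of this stage
    noise_std: float     # std of Gaussian noise added to the augmented image


def relax(stage):
    """Return a copy of the stage with both regularizers switched off."""
    return replace(stage, weight_decay=0.0, noise_std=0.0)


def add_gaussian_noise(image, noise_std, rng):
    """Add Gaussian noise to a float image in [0, 1]; no-op when noise_std <= 0."""
    if noise_std <= 0.0:
        return image
    noisy = image + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


# Placeholder values, chosen only to make the example concrete.
pretrain = StageRegularization("pretrain", weight_decay=4e-4, noise_std=0.04)
finetune = relax(replace(pretrain, name="finetune"))
```

Keeping the regularizers in one object per stage makes it straightforward to compare a stage trained with and without them.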

## 5 Experiments

In this section, we describe the experiments and results for our research questions. Specifically, we tested (i) how can we change the current training pipeline in order to improve the final accuracy, and (ii) the effect of feeding the network with different scopes of the input during training, using different cropping approaches and changes to the zooming parameters.

All our experiments on KITTI used both KITTI2012 and KITTI2015, and for Sintel both the clean and final pass, for training and validation sets. We denote the combined datasets of Sintel and KITTI as Sintel combined and KITTI combined. We also tested the approach, suggested by Sun et al. [27], to first train on a combined Sintel dataset, followed by another finetune on the final pass.

All of our experiments employ the common End Point Error metric for flow evaluation, and F1 for occlusion evaluation. The KITTI experiments also present the outlier percentage ( pixels) metric.
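For reference, the two main metrics can be written compactly in NumPy. This is a generic sketch of the standard definitions (mean Euclidean distance between predicted and ground-truth flow vectors, and the F1 score of a binary occlusion mask); the function names and the optional `valid` mask are assumptions, not the evaluation code used in the experiments.

```python
import numpy as np


def end_point_error(flow_pred, flow_gt, valid=None):
    """Mean Euclidean distance between (H, W, 2) predicted and true flow."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    if valid is not None:
        err = err[valid]  # restrict to pixels with ground truth
    return float(err.mean())


def f1_score(occ_pred, occ_gt):
    """F1 score of a binary occlusion mask against the ground truth mask."""
    occ_pred = np.asarray(occ_pred).astype(bool)
    occ_gt = np.asarray(occ_gt).astype(bool)
    tp = np.logical_and(occ_pred, occ_gt).sum()
    fp = np.logical_and(occ_pred, ~occ_gt).sum()
    fn = np.logical_and(~occ_pred, occ_gt).sum()
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0
```

The `valid` mask is useful for datasets with sparse ground truth, where only annotated pixels should contribute to the average.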

### 5.1 Finetuning a pretrained model

Since the cost of pretraining is higher than that of the final finetune, we first present experiments done on the finetuning phase, in which we employ models pretrained on FChairs and FThings, published by the authors of IRR-PWC. These experiments are conducted on the Sintel dataset, since it has similar statistics of displacements to the FChairs dataset [19] used for pretraining. We tested different training protocol changes, and found that substantial gains could be achieved using the following changes:

1. Cropping strategies. During the initial finetune, we tested the cropping approaches specified in Sec. 3.1 on Sintel. The results specified in Tab. 5 show that the range-based crop size randomization approach (Eq. 6) was comparable to taking the maximal valid crop (although much more efficient computationally), and both improved the Sintel validation error relative to models trained with smaller fixed crop sizes.

2. Zooming strategies. We found that applying a new random zooming range of alone, which increases the zoom out, and gradually reducing the zoom in to , achieved considerable gains for Sintel in all evaluation parameters, with and without cropping strategy changes. Interestingly, increasing the zoom out range without any change to the crop size provided 50% of this gain. We suggest that this is additional evidence for the existing bias in small crop sizes, as explained in Sec. 3.

3. Removing artificial regularization. Removing the random noise and weight decay helped us achieve an extra 2%-3% improvement during Sintel finetune, demonstrating the benefit of reducing augmentation in advanced stages.

### 5.2 Applying changes to the full training procedure

We then tested the changes from Sec. 5.1, along with all four cropping approaches described in Sec. 3.1, on the different stages of the common curriculum learning pipeline. Since we wanted to test our training changes and compare our results to other variants of the baseline architecture, we decided not to use any dataset other than the common pretraining or benchmarking datasets. For FChairs and FThings, all trained models were evaluated on Sintel training images, as an external test set.

FChairs pretraining  For pretraining, we downloaded the newly released version of FChairs [11], which includes occlusions ground truth annotations. We trained two versions of the IRR-PWC model on FChairsOCC, for 108 epochs on 4 GPUs: (i) C1: removing weight decay and random noise (ii) C2: same as (i) with reduced zoom in. We then evaluated both models and the original model, trained by the authors with weight decay and original zoom in of , denoted by C*. Results are depicted in Tab. 5, showing that regularization is important in early stages, since removing either weight decay and random noise, or reducing the zoom-in hurt the performance.

FThings finetune  We then trained three versions of IRR-PWC on FThings, for 50 epochs: (i) T2: resuming C* training with batch size of 2, with the original crop size of , (ii) T3: resuming C1 with the maximal crop size, and (iii) T4: resuming T3 without weight decay and random noise. We can infer from the results in Tab. 5: (i) increasing the scope during FChairs training leads to better accuracy on the Sintel test set, and (ii) over-training without weight decay and random noise did not improve the results on the external test set (but did on the validation).

KITTI finetune  We trained two different versions, both with the same protocol, for 550 epochs on KITTI combined: (i) resuming T2 and (ii) resuming T3. Although T3 got better performance in the evaluation, after finetuning, both results were similar on KITTI validation, as shown in Tab. 5.

Sintel finetune  Two different versions were trained with the same protocol, for 290 epochs on Sintel combined, both from T3: (i) with weight decay and random noise and (ii) without weight decay and random noise. The results, presented in Tab. 5, show that reducing regularization in Sintel finetune produced an extra gain.

Dynamic scene scoping  Since the scoping approaches were already tested on Sintel during the initial finetune, we further tested the four different approaches for dynamic scene scoping, detailed in Sec. 3.1, on the combined KITTI dataset. The results are depicted in Tab. 7. For KITTI, cropping the maximal valid crop per batch shows a noticeable improvement over using a fixed-size crop. However, for the KITTI datasets, approach S4 (Eq. 6) shows much better performance than using the maximal valid crop size. In order to find the optimal range of crop size ratios (Eq. 6), we trained different models with different ranges of crop size to image ratios $[r_{min}, r_{max}]$. All models used an upper crop size ratio limit $r_{max}$ equal to 1 (i.e., the maximal valid crop for the batch), and a different lower limit $r_{min}$, representing random crop sizes with different aspect ratios, which are larger than a quarter of the image.

Fig. 4 shows the training and validation accuracy as a function of the lower ratio of the range of randomized crop sizes. Specifically, the best results were obtained with $r_{min}$ equal to 0.95, as also demonstrated in Fig. 4. The validation accuracy improves consistently when increasing $r_{min}$ from 0.5 to 0.95, and then starts to deteriorate until $r_{min}$ reaches the maximal valid crop size. As can be seen, when enlarging the expected crop size, we also reduce the regularization provided by the larger number of scopes (as analyzed in Eq. 4). This observation can be considered as additional evidence of a regularization-accuracy trade-off in the training process. It also emphasizes the power of Eq. 6 in improving the training outcome, while keeping the regularization provided by partial cropping.

Re-finetune with dynamic scene scoping  In order to further understand the effect of this regularization-accuracy trade-off, we re-trained three models with the best random approach on the KITTI combined set, using the same finetuning protocol. We took different models finetuned with as the checkpoint for the second finetune.

As described in the lower part of Tab. 7, finetuning again on the KITTI dataset improved the validation accuracy for all starting points (compared to their accuracy on the first finetune). Surprisingly, in the second finetune, repeating the best approach (randomizing using Eq. 6 with $r_{min} = 0.95$) did not provide the best result. The best approach was to finetune for the second time from a model with a larger range, and thus stronger regularization, but lower EPE in the first finetune. We propose to consider this as additional evidence for the notion that gradually reducing regularization in optical flow training helps to reach a better final local minimum.

Full training insights  Concluding the experiments of Sec. 5.2, we suggest the following: (i) larger scopes can improve optical flow training as long as the regularization provided by small crops is not needed, (ii) range-based crop size randomization (Eq. 6) is a good strategy when regularization is needed, (iii) the training protocol requires strong regularization in early stages, which should be relaxed when possible, and (iv) gains in early stages do not always improve the finetuning accuracy.

### 5.3 Occlusion estimation

We evaluated the occlusion estimation of our trained models, using the F1 score, during all stages of the full training. As demonstrated in Tab. 5, it appears that gains in optical flow estimation are highly correlated with improvements in occlusion estimation. This might be due to the need for a network to identify non-matchable pixels and to infer the flow from the context. Tab. 7 shows a comparison of our F1 score to other reported results, on the full Sintel dataset. Our updated training protocol improves the best reported occlusion result by more than 5%.

### 5.4 Official Results

Evaluating our best models on the MPI Sintel and KITTI benchmarks shows a clear improvement over the IRR-PWC baseline, and an advantage over all other PWC-Net variants.

MPI Sintel  We uploaded results for three different versions: (i) with reduced regularization, (ii) with improved zooming schedule and (iii) with the best dynamic scoping approach. As Tab. 8 shows, there is a consistent improvement on the test set. This is congruent with the results in Tab. 5, obtained on the validation set.

At the time of submission, our method ranks first on the MPI Sintel benchmark, improving two-frame methods by more than 10%, and surpassing other competitive methods trained on multiple datasets or with multiple frames, as well as all other PWC-Net variants, which use an equal or larger number of trainable parameters. On the clean pass, we improve the IRR-PWC result by 20 ranks and 7%. Interestingly, analyzing the Sintel categories in Tab. 8, our model is leading in the categories of fast pixels (faster than 40 pixels per frame) and non-occluded pixels, while also producing the best estimation for occluded pixels among two-frame methods. This is consistent with our insights on these challenging categories from Sec. 3. Fig. 5 shows a comparison of our method in the category of fast pixels with the other two leading methods on Sintel.

KITTI  On KITTI 2012 and KITTI 2015, we saw a consistent improvement of more than 19.7% and 12%, respectively, over the baseline model results, surpassing all other published methods based on the popular PWC-Net architecture, and achieving state-of-the-art EPE results among two-frame methods. Since our training protocol can be readily applied to other methods, we plan, as future work, to test it on other leading architectures.

## 6 Conclusions

While a lot of effort is dedicated to finding effective network architectures, much less attention is paid to the training protocol. However, when performing complex multi-phase training, there are many choices to be made, and it is important to understand, for example, the proper way to schedule the multiple forms of network regularization. In addition, the method used for sampling as part of the augmentation process can bias the training protocol toward specific types of samples.

In this work, we show that the conventional scope sampling method leads to the neglect of many challenging samples, which hurts performance. We advocate the use of larger scopes (crops and zoom-out) when possible, and careful sampling of the crop position when needed. We further show how regularization and augmentation should be relaxed as training progresses. The new protocol has a dramatic effect on the performance of our trained models and leads to state-of-the-art results in a very competitive domain.

## Acknowledgement

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974).

## Appendix A Introduction

With this appendix, we would like to provide more details on our training pipeline and framework, as well as more visualizations of the improved flow and occlusion estimation.

The ScopeFlow approach provides an improved training pipeline for optical flow models, which reduces the bias in sampling different regions of the input images while keeping the power of the regularization provided by fixed-size partial crops. Due to the sizable impact on performance that can be achieved by the improved training pipeline, we created a generic, easy to configure, training package, in order to encourage others to train state of the art models with our improved pipeline, as described in Sec. C.

## Appendix B Dynamic scoping

The common pipeline of batch sampling and augmentation in optical flow training includes four stages: (i) sampling images, (ii) applying random photometric changes, (iii) applying a random affine transformation, and (iv) cropping a fixed-size, randomly located patch. We propose changes to stages (iii) and (iv): we choose the zooming parameters more carefully as training progresses, and we incorporate a new randomized cropping scheme, which is presented and extensively tested in our paper.
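As a rough illustration of the four stages listed above, the following NumPy sketch applies a photometric jitter, a global zoom (a simple special case of an affine transformation), and a uniformly located fixed-size crop to an already sampled frame pair. It shows only the conventional pipeline, not the ScopeFlow cropping scheme. All function names, parameter names, and default values are assumptions made for this example.

```python
import numpy as np


def photometric_jitter(frames, rng, max_gain=0.2, max_bias=0.1):
    """Stage (ii): apply one random contrast gain and brightness bias to all frames."""
    gain = 1.0 + rng.uniform(-max_gain, max_gain)
    bias = rng.uniform(-max_bias, max_bias)
    return [np.clip(f * gain + bias, 0.0, 1.0) for f in frames]


def zoom_nearest(array, scale):
    """Resize an (H, W, C) array by `scale` with nearest-neighbour sampling."""
    h, w = array.shape[:2]
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return array[rows][:, cols]


def scope_sample(frame1, frame2, flow, crop_hw, zoom_range, rng):
    """Stages (ii)-(iv) for one sampled pair; stage (i) is left to the caller.

    frame1, frame2: (H, W, 3) float arrays in [0, 1]; flow: (H, W, 2) in pixels.
    """
    frame1, frame2 = photometric_jitter([frame1, frame2], rng)

    # Stage (iii), simplified to a global zoom. Zooming both frames by the
    # same factor multiplies every displacement by that factor as well.
    scale = rng.uniform(*zoom_range)
    frame1, frame2 = zoom_nearest(frame1, scale), zoom_nearest(frame2, scale)
    flow = zoom_nearest(flow, scale) * scale

    # Stage (iv): fixed-size crop at a uniformly random location.
    crop_h, crop_w = crop_hw
    h, w = flow.shape[:2]
    if h < crop_h or w < crop_w:
        raise ValueError("zoomed frame is smaller than the requested crop")
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    window = (slice(top, top + crop_h), slice(left, left + crop_w))
    return frame1[window], frame2[window], flow[window]
```

Narrowing `zoom_range` from one training stage to the next is one simple way to express the idea of choosing zoom parameters differently over the course of training.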

Fig. 6 provides a demonstration of the ScopeFlow pipeline, which enlarges the variety of scopes presented during the data-driven process while reducing the bias towards specific categories.

## Appendix C ScopeFlow software package

To make our approach easier to apply, we created a small, easy-to-use package that supports YAML configuration of multi-stage optical flow model training and evaluation. We found this approach very helpful when running experiments to find the best scoping augmentation approach.

Our code and our models will be made public. Furthermore, we provide easy visualization of our online augmentation pipeline, as described in the README of our package.
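To give a sense of what a multi-stage configuration might express, the sketch below builds a list of stages in plain Python, with the zoom range and weight decay relaxed linearly from the first stage to the last. This is not the schema of the ScopeFlow package. The `Stage` fields, the `relaxed_schedule` helper, the choice of weight decay as the regularizer, and every numeric value are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical fields, not the package's schema)."""
    name: str
    epochs: int
    zoom_range: tuple      # (min_zoom, max_zoom) used by the affine augmentation
    weight_decay: float    # stand-in for any regularization strength


def relaxed_schedule(names, epochs, first_zoom, last_zoom, first_decay, last_decay):
    """Interpolate augmentation and regularization linearly across stages."""
    n = len(names)
    stages = []
    for i, name in enumerate(names):
        t = i / (n - 1) if n > 1 else 0.0
        zoom = tuple(a + t * (b - a) for a, b in zip(first_zoom, last_zoom))
        decay = first_decay + t * (last_decay - first_decay)
        stages.append(Stage(name, epochs, zoom, decay))
    return stages


# Placeholder values chosen only to show the shape of a schedule.
stages = relaxed_schedule(
    names=["stage_1", "stage_2", "stage_3"],
    epochs=10,
    first_zoom=(0.5, 1.5),
    last_zoom=(0.9, 1.1),
    first_decay=4e-4,
    last_decay=0.0,
)
```

Each `Stage` could then be serialized to a per-stage YAML file or passed directly to a training loop.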

## Appendix D Comparison to the IRR baseline

In our experiments, we use the IRR [11] variant of the popular PWC-Net architecture to evaluate our method. This variant has been shown to provide excellent results while keeping a low number of parameters. To emphasize the improvements, we give here a thorough comparison of all the public results obtained on the three main benchmarks for our method and the IRR baseline.

### D.1 MPI Sintel

Other than leading the MPI Sintel [3] table, as can be seen in Tab. 11 and Tab. 12 in Sec. G, we improve on the baseline IRR models by a large margin in all metrics, and in particular in the challenging metrics of occlusions (14.7%) and fast pixels (18.4%). The only metric that did not improve is that of low-speed pixels (), which should not be a surprise, since our method reduces the bias between the fast and slow pixels, as shown in our paper.

### D.2 KITTI 2012

We uploaded our results to the KITTI 2012 [7] benchmark. As can be seen in Tab. 9 and Tab. 10, training the IRR model with the ScopeFlow pipeline improves the mean EPE by more than 20%. Moreover, the improvement is achieved for all thresholds of outliers and for all metrics.

IRR on KITTI 2012:

ScopeFlow on KITTI 2012:

In addition, Fig. 7 provides a qualitative comparison to the baseline IRR model on the KITTI 2012 benchmark. As can be seen, most of the improvement provided by the ScopeFlow pipeline is in the challenging occluded and marginal pixels.
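For readers less familiar with the metrics compared in this subsection, the sketch below computes the mean end-point error (EPE), the percentage of outliers above a pixel threshold, and the relative improvement of one result over another. The function names and the optional `valid` mask are assumptions made for this example. It is not the benchmark's evaluation code, and the inputs used in the test are toy values rather than benchmark results.

```python
import numpy as np


def endpoint_error(pred, gt):
    """Per-pixel Euclidean distance between predicted and true flow, shape (H, W)."""
    return np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)


def mean_epe(pred, gt, valid=None):
    """Average end-point error over the pixels selected by the `valid` mask."""
    err = endpoint_error(pred, gt)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    return float(err[valid].mean())


def outlier_percentage(pred, gt, threshold_px, valid=None):
    """Percentage of valid pixels whose end-point error exceeds `threshold_px`."""
    err = endpoint_error(pred, gt)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    return 100.0 * float((err[valid] > threshold_px).mean())


def relative_improvement(baseline, ours):
    """Relative reduction of an error metric with respect to a baseline, in percent."""
    return 100.0 * (baseline - ours) / baseline
```

Evaluating `outlier_percentage` at several values of `threshold_px` gives one number per outlier threshold, which is how a statement such as "for all thresholds of outliers" can be checked.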

### D.3 KITTI 2015

We uploaded our results to the KITTI 2015 [22] benchmark. As can be seen in Tab. 11 and Tab. 12, training the IRR model with the ScopeFlow pipeline improves the mean EPE by more than 12%, in the default category of 3 pixels. Moreover, the improvement is achieved for all thresholds of outliers and for all metrics.

IRR on KITTI 2015:

ScopeFlow on KITTI 2015:

In addition, Fig. 8 provides a qualitative comparison to the leading VCN model on the KITTI 2015 benchmark, showing a clear improvement for handling non-background challenging objects. Our results are leading the category of non-background pixels, which belong to faster foreground objects.

## Appendix E Ablation visualization

Fig. 9 provides a demonstration of the contribution of different training changes, composing the ScopeFlow pipeline presented in our paper, to the improvement of the final flow. As expected, most of the improvements are in the marginal image area. Our method improves, in particular, the moving objects, which have many occluded and fast-moving pixels.

## Appendix F Occlusions comparison

To provide a qualitative demonstration of our improved occlusion estimation, we compared our results to the methods with the highest reported occlusion estimation. We provide a layered view of false positive, false negative, and true positive predictions. All occlusion estimations were created using the pre-trained models published by the authors, and were sampled from the Sintel final pass dataset. Fig. 10 shows that the model trained with the ScopeFlow pipeline improves occlusion estimation in the marginal image area and mainly for foreground objects. We used the F1 metric with 'micro' averaging (the same trend is shown by all averaging approaches).
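The layered view and the F1 score described above can be sketched as follows. The example treats "occluded" as the positive class and pools true positive, false positive, and false negative counts over all images before computing F1. Whether this matches the exact averaging used for the reported numbers is an assumption, and the function names are illustrative.

```python
import numpy as np


def confusion_layers(pred, gt):
    """Boolean masks of true positives, false positives, and false negatives."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    return {"tp": pred & gt, "fp": pred & ~gt, "fn": ~pred & gt}


def micro_f1(preds, gts):
    """F1 for the occluded class with counts pooled over all images."""
    tp = fp = fn = 0
    for pred, gt in zip(preds, gts):
        layers = confusion_layers(pred, gt)
        tp += int(layers["tp"].sum())
        fp += int(layers["fp"].sum())
        fn += int(layers["fn"].sum())
    denom = 2 * tp + fp + fn
    # With no positives in either prediction or ground truth, F1 is undefined;
    # this sketch returns 0.0 in that case.
    return 2 * tp / denom if denom else 0.0
```

The three masks returned by `confusion_layers` can be drawn in different colors over the input frame to obtain a layered visualization.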

## Appendix G Public tables

We uploaded our results to the two main optical flow benchmarks: MPI Sintel and KITTI (2012 & 2015). In the subsections below, we provide screenshots that capture the sizable improvements achieved by using our pipeline to train an off-the-shelf, low-parameter optical flow model. Since our method can support almost any architecture, we plan, as future work, to apply it to other architectures as well.

### G.1 MPI Sintel

We add here two screenshots of the public table: (i) the table on the day of upload (14.10.19), and (ii) the table after the official submission deadline for CVPR 2020. As shown in Fig. 11, our method has ranked first on MPI Sintel since 14.10.19, surpassing all other methods and leading the categories of: (i) matchable pixels, (ii) pixels far more than from the nearest occlusion boundary, and (iii) fast-moving pixels ( pixels per frame). We also provide a screenshot taken after the official CVPR paper submission deadline, in Fig. 12, showing our method still leading the Sintel benchmark. We changed our method's name after the initial upload (on 14.10.19) from OFBoost to ScopeFlow.

### G.2 KITTI 2012

Fig. 13 shows a screenshot of the KITTI 2012 flow table, with the lowest outlier threshold (of 2 pixels), taken on the CVPR paper submission deadline. Our method provides the lowest average EPE among all published two-frame methods, 23% lower than the IRR-PWC baseline results.

### G.3 KITTI 2015

Fig. 14 shows a screenshot of the KITTI 2015 flow table, taken on the CVPR paper submission deadline. Our method provides the lowest percentage of outliers, averaged over foreground regions, among all published two-frame methods. Moreover, the percentage of outliers, averaged over all ground truth pixels, is more than 12% lower than the IRR-PWC baseline results.

### References

1. M. Aubry, D. Maturana, A. A. Efros, B. C. Russell and J. Sivic (2014-06) Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
2. C. Bailer, K. Varanasi and D. Stricker (2017-07) CNN-based patch matching for optical flow with thresholded hinge embedding loss. In CVPR, pp. 2710–2719. External Links: Document Cited by: §2.
3. D. J. Butler, J. Wulff, G. B. Stanley and M. J. Black (2012-10) A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), A. Fitzgibbon et al. (Eds.), Part IV, LNCS 7577, pp. 611–625. Cited by: §D.1, §2.
4. A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers and T. Brox (2015-12) FlowNet: learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766. External Links: Document Cited by: §2, §2, §4.
5. D. J. Fleet and Y. Weiss (2006) Optical flow estimation. In Handbook of Mathematical Models in Computer Vision, Cited by: §2.
6. D. Gadot and L. Wolf (2016-06) PatchBatch: a batch augmented loss for optical flow. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
7. A. Geiger, P. Lenz and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §D.2, §2.
8. H. Hirschmuller (2008-02) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 328–341. External Links: Document, ISSN Cited by: §2.
9. T. Hui, X. Tang and C. C. Loy (2018) LiteFlowNet: a lightweight convolutional neural network for optical flow estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §2.
10. T. Hui, X. Tang and C. C. Loy (2019) A lightweight optical flow cnn - revisiting data fidelity and regularization. arXiv preprint arXiv:1903.07414. External Links: Link Cited by: §2, §2, Table 8.
11. J. Hur and S. Roth (2019) Iterative residual refinement for joint optical flow and occlusion estimation. In CVPR, Cited by: Appendix D, Figure 10, §1, §2.1, §2.1, §2.1, §2, §2, §4, §5.2, Table 7, Table 8.
12. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy and T. Brox (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §2, §2, §4, Table 7, Table 8.
13. E. Ilg, T. Saikia, M. Keuper and T. Brox (2018) Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In European Conference on Computer Vision (ECCV), External Links: Link Cited by: Figure 10, Table 7.
14. D. Kondermann, R. Nair, K. Honauer, K. Krispin, J. Andrulis, A. Brock, B. Gussefeld, M. Rahimimoghaddam, S. Hofmann and C. Brenner (2016) The hci benchmark suite: stereo and flow ground truth with uncertainties for urban autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 19–28. Cited by: §2.
15. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel (1989-12) Backpropagation applied to handwritten zip code recognition. Neural Comput. 1 (4), pp. 541–551. External Links: ISSN 0899-7667, Link, Document Cited by: §1.
16. P. Liu, I. King, M. R. Lyu and J. Xu (2019) DDFlow: Learning Optical Flow with Unlabeled Data Distillation. In AAAI, Cited by: §2.
17. P. Liu, M. R. Lyu, I. King and J. Xu (2019) SelFlow: self-supervised learning of optical flow. In CVPR, Cited by: §2.1, §2, §2, §2, Table 7, Table 8.
18. D. Maurer and A. Bruhn (2018) ProFlow: learning to predict optical flow. In BMVC, Cited by: Table 8.
19. N. Mayer, E. Ilg, P. Fischer, C. Hazirbas, D. Cremers, A. Dosovitskiy and T. Brox (2018-04) What makes good synthetic training data for learning disparity and optical flow estimation?. International Journal of Computer Vision 126 (9), pp. 942–960. External Links: ISSN 1573-1405, Link, Document Cited by: §2, §2, §5.1.
20. N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy and T. Brox (2016-06) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pp. 4040–4048. External Links: Document Cited by: §2, §2.
21. I. Melekhov, A. Tiulpin, M. Pollefeys, E. Rahtu and J. Kannala (2019) DGC-Net: dense geometric correspondence network. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.1.
22. M. Menze and A. Geiger (2015) Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §D.3, §2.
23. M. Neoral, J. Šochman and J. Matas (2018-12) Continual occlusions and optical flow estimation. In 14th Asian Conference on Computer Vision (ACCV), External Links: Link Cited by: §2.1, §2.1, §2, §2, Table 7, Table 8.
24. A. Ranjan and M. J. Black (2017) Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
25. Z. Ren, O. Gallo, D. Sun, M. Yang, E. B. Sudderth and J. Kautz (2019) A fusion approach for multi-frame optical flow estimation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.1, §2, §2, Table 8.
26. D. Sun, X. Yang, M. Liu and J. Kautz (2018) Models matter, so does training: an empirical study of cnns for optical flow estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: §2, Table 8.
27. D. Sun, X. Yang, M. Liu and J. Kautz (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, Cited by: §1, §2.1, §2, Table 8, §5.
28. Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang and W. Xu (2018) Occlusion aware unsupervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4884–4893. Cited by: §2.
29. P. Weinzaepfel, J. Revaud, Z. Harchaoui and C. Schmid (2013-12) DeepFlow: large displacement optical flow with deep matching. In 2013 IEEE International Conference on Computer Vision, pp. 1385–1392. External Links: Document Cited by: §2.
30. J. Wulff, L. Sevilla-Lara and M. J. Black (2017-07) Optical flow in mostly rigid scenes. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 8.
31. J. Xu, R. Ranftl and V. Koltun (2017) Accurate optical flow via direct cost volume processing. In CVPR, Cited by: §2, Table 8.
32. G. Yang and D. Ramanan (2019) Volumetric correspondence networks for optical flow. In NeurIPS, Cited by: Figure 8, §2, Table 8.
33. Z. Yin, T. Darrell and F. Yu (2019) Hierarchical discrete distribution decomposition for match density estimation. In CVPR, Cited by: Table 8.
34. S. Zweig and L. Wolf (2017-07) InterpoNet, a brain inspired neural network for optical flow dense interpolation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.