Location Dependency in Video Prediction
Deep convolutional neural networks are used to address many computer vision problems, including video prediction. The task of video prediction requires analyzing the video frames, temporally and spatially, and constructing a model of how the environment evolves. Convolutional neural networks are spatially invariant, though, which prevents them from modeling location-dependent patterns. In this work, we propose location-biased convolutional layers to overcome this limitation. The effectiveness of location bias is evaluated on two architectures: Video Ladder Network (VLN) and Convolutional Predictive Gating Pyramid (Conv-PGP). The results indicate that encoding location-dependent features is crucial for the task of video prediction. Our proposed methods significantly outperform spatially invariant models.
Keywords: Video prediction, Deep learning, Location-dependent bias
Niloofar Azizi, Hafez Farazi, and Sven Behnke
1 Introduction
The task of video prediction consists of predicting a set of successor frames, given a sequence of video frames. It is challenging, because the predictor needs to understand both the contents and the motion of the scene in order to make good predictions. In recent years, deep learning approaches have become popular for video prediction. They analyze the video both spatially and temporally and learn hierarchical representations, which model the image evolution in terms of its content and dynamics ([mathieu2015deep], [wagnerlearning]). The learned representations can be used for a variety of applications, including action recognition and anticipating future actions, which can be utilized, for instance, in human-robot interaction scenarios.
Convolutional deep learning architectures cannot recognize location-dependent features, however, due to the location-invariant nature of convolutions. In the task of video prediction, for instance, learning the location of static obstacles in the environment leads to better frame forecasting. In this work, we propose three different methods to overcome this limitation:
encoding location features in separate channels of the input,
convolutional layers with learnable location-dependent biases, and
convolutional layers with learnable location-dependent biases and predefined location encodings.
These methods are illustrated in Fig. 1 for one-dimensional and two-dimensional convolutions.
We demonstrate the utility of our approach using two datasets that contain location dependencies. The code and datasets of this paper are publicly available at https://github.com/AIS-Bonn/LocDepVideoPrediction.
2 Related Work
Convolutional deep learning architectures are spatially invariant and therefore cannot model location-dependent patterns.
To address this issue in various computer vision tasks, different approaches have been explored. Fully connected layers can learn location-dependent features, but they require many parameters and do not share weights spatially. In the PixelCNN architecture for conditional image generation, Oord et al. [van2016conditional] applied 1×1 convolutions to map a hidden representation into a spatial representation. The disadvantage of this approach is that a very large number of parameters is needed to extract the spatial features. In saliency prediction, Kruthiventi et al. [DeepFix] proposed adding another set of convolutional weights of the same size as the original filters. They convolved these additional weights with predefined fixed channels that encode the image center using Gaussian blobs with different horizontal and vertical extents. Ghafoorian et al. [ghafoorian2017location] applied specific location features to train the model and utilized location dependency for the task of brain MRI image segmentation. They showed that the results improve in comparison to CNNs that do not use location information. All of the above approaches depend on predefined location feature structures.
For the task of video prediction, different approaches have been explored. The most successful ones utilize deep learning methods. Cricri et al. [VLN] proposed Video Ladder Networks (VLN) by adding recurrent connections to the ladder network [rasmus2015semi]. Similar to ladder networks, VLN employs shortcut connections from the encoder to the respective decoder part, which relieves the deeper layers from modeling details. The VLN architecture achieves a result competitive with VPN [VPN], which is the state of the art on the synthetic Moving MNIST dataset. However, due to its convolutional layers, the VLN architecture cannot deal with location-dependent features. Another recurrent network for the task of video prediction was proposed by Michalski et al. [michalski2014modeling]. Their PGP network is based on a gated autoencoder and a bilinear transformation model to learn transformations between pairs of consecutive images ([memisevic2013learning], [memisevic2010learning]). PGP is fully connected, which results in a large number of parameters. Its convolutional variant Conv-PGP reduces the number of parameters significantly [demodeling], but loses the ability to learn location-dependent features. For the evaluation of Conv-PGP, the authors added one-pixel padding to the input to learn a bouncing ball motion in their synthetic dataset.
While VLN and Conv-PGP have shown impressive performance in the task of video prediction, the above analysis shows that the effect of location-dependent features on these two architectures requires further investigation.
3 Location Dependency in VLN Model
The VLN model [VLN] is a neural network architecture that predicts future frames by encoding the temporal and spatial features of a video. Although it achieves results competitive with the state of the art on Moving MNIST, it cannot learn location-dependent features present in the dataset, because the convolution operation is location-invariant. The network would become unreasonably large if we used a fully connected layer to learn location-dependent features. Using a fully connected layer would also violate the assumption of weight sharing in the VLN architecture. Same-padding at the border, which is not analyzed in the original paper, is what allows the network to learn where to mirror digit velocity despite using only convolutional operations. Such behavior is accidental, though, and should not be treated as a feature.
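The effect of same-padding can be illustrated with a small sketch. It is not taken from the VLN code: the naive `conv2d` helper, the 6×6 constant image, and the 3×3 all-ones filter are hypothetical choices made only to show that zero-padded borders produce different responses from the interior, even when the input itself carries no location cue.

```python
import numpy as np

def conv2d(image, kernel, padding):
    """Naive 2D cross-correlation with 'valid' or zero-padded 'same' borders."""
    kh, kw = kernel.shape
    if padding == "same":
        # Zero padding (assumes odd kernel sizes).
        image = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.ones((6, 6))    # constant input: the data contains no location cue
kernel = np.ones((3, 3))   # hypothetical filter

same_out = conv2d(image, kernel, "same")
valid_out = conv2d(image, kernel, "valid")

# With zero padding, corner, edge, and interior responses differ,
# so later layers can tell where the border is.
print(same_out[0, 0], same_out[0, 3], same_out[3, 3])
# Without padding, every response is identical.
print(np.unique(valid_out))
```

With valid convolutions this implicit border signal disappears, which is consistent with the need for an explicit location encoding.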
To demonstrate this limitation of the VLN architecture, we modified the Moving MNIST dataset to Occluded Moving MNIST, similar to what is used by Prémont-Schwarz et al. [RLN]. As demonstrated in the experiment section, we tested the original one-layer VLN with this dataset and it did not achieve an acceptable result.
To solve this issue, we propose three methods for providing location information to the network. In the first method, illustrated in Fig. 1(a), we provide three additional input channels to the network: two gradient channels in the $x$ and $y$ directions, each running from a fixed start value to a fixed end value, as well as one channel containing the occlusion grid pattern. The occlusion channel is 1 in the occlusion areas and 0 elsewhere. These additional input channels allow the network to infer the location-dependent feature of the border and to utilize the occlusion pattern. In contrast to encoding location features in the original input channel, having additional channels does not alter the original input. Encoding occlusions in a separate channel can be useful, for example, when they are inferred from modalities other than a camera, like a laser scanner.
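A minimal sketch of the additional input channels might look as follows. The function name `add_location_channels`, the gradient endpoints `start=0.0` and `stop=1.0`, and the one-pixel grid used for the mask are assumptions for illustration and are not taken from the published code.

```python
import numpy as np

def add_location_channels(frame, occlusion_mask, start=0.0, stop=1.0):
    """Stack a grayscale frame with x/y gradient channels and an occlusion channel.

    `start` and `stop` are assumed endpoint values for the gradients.
    """
    height, width = frame.shape
    x_ramp = np.linspace(start, stop, width)
    y_ramp = np.linspace(start, stop, height)
    grad_x = np.tile(x_ramp, (height, 1))            # varies along columns
    grad_y = np.tile(y_ramp[:, None], (1, width))    # varies along rows
    occlusion = occlusion_mask.astype(frame.dtype)   # 1 in occluded areas, 0 elsewhere
    # The original frame stays untouched in channel 0.
    return np.stack([frame, grad_x, grad_y, occlusion], axis=0)

frame = np.zeros((64, 64), dtype=np.float32)
mask = np.zeros((64, 64), dtype=bool)
mask[::8, :] = True   # hypothetical grid pattern
mask[:, ::8] = True

network_input = add_location_channels(frame, mask)
print(network_input.shape)
```

Because the location information lives in separate channels, the first convolution can combine it with image content without changing the image itself.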
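In the second method, illustrated in Fig. 1(b), we add learnable location-dependent biases to the convolutional layer:
$$Y_{i,j} = \sigma\left((W * X_{i,j}) + b + B^{x}_{i} + B^{y}_{j}\right),$$
where $\sigma$ is the activation function, and $W$ and $b$ are the weight and bias of the specified layer, respectively. Note that $b$ can be omitted, but we kept it to make the proposed layer easy to implement on top of an existing convolution layer. $X_{i,j}$ is the input vector at the Cartesian position $(i,j)$, and $*$ represents the convolution operator. Note that $B^{x}$ and $B^{y}$ are location-dependent weights that are learned through the training procedure. $B^{x}$ and $B^{y}$ are shared across all convolutional filters, which is done by broadcasting over the channel dimension.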
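As a rough illustration of a convolution with location-dependent biases, the following forward-pass sketch adds one bias per output row and one per output column and broadcasts them over the channel dimension. The per-row and per-column shapes, the names `bias_rows` and `bias_cols`, the valid padding, and the ReLU activation are all assumptions chosen for the example; it is not the authors' implementation.

```python
import numpy as np

def location_biased_conv(x, weight, bias, bias_rows, bias_cols):
    """Valid 2D convolution (cross-correlation) plus location-dependent biases.

    x:         (C_in, H, W) input
    weight:    (C_out, C_in, kH, kW) filters
    bias:      (C_out,) ordinary per-filter bias
    bias_rows: (H_out,) bias per output row (assumed shape)
    bias_cols: (W_out,) bias per output column (assumed shape)
    """
    c_out, c_in, kh, kw = weight.shape
    h_out = x.shape[1] - kh + 1
    w_out = x.shape[2] - kw + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + kh, j:j + kw]
            # Contract (C_in, kH, kW) of every filter with the patch.
            out[:, i, j] = np.tensordot(weight, patch, axes=3)
    out += bias[:, None, None]
    # Broadcasting shares the same location biases across all filters.
    out += bias_rows[None, :, None] + bias_cols[None, None, :]
    return np.maximum(out, 0.0)  # ReLU, chosen only for illustration

rng = np.random.default_rng(0)
x = np.zeros((1, 8, 8))                      # constant input
weight = rng.normal(size=(2, 1, 3, 3))
bias = np.zeros(2)
bias_rows = np.linspace(0.0, 1.0, 6)         # stands in for learned values
bias_cols = np.zeros(6)

y = location_biased_conv(x, weight, bias, bias_rows, bias_cols)
print(y[0, :, 0])  # the response now depends on the row index
```

Even for a constant input, the output varies with position, which a plain convolution with a scalar bias cannot achieve.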
In the third method, illustrated in Fig. 1(c), we add predefined location-dependent gradients $G^{x}$ and $G^{y}$ to $B^{x}$ and $B^{y}$. Similar to the additional input channels, $G^{x}$ and $G^{y}$ encode location by gradients in the $x$ and $y$ directions, respectively.
Providing these predefined encodings facilitates the learning of more complex location-dependent biases.
4 Location Dependency in Conv-PGP Model
PGP [michalski2014modeling] is designed based on the assumption that two temporally consecutive frames can be described as a linear transformation of each other. In the PGP architecture, a Gated AutoEncoder (GAE) is used as a bilinear model, and the hidden layer of mapping units encodes the transformation.
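The role of the mapping units can be sketched with a factored gated autoencoder in its commonly described form. The matrix names `U`, `V`, and `W`, the layer sizes, and the random weights are hypothetical; the sketch shows only the bilinear structure, not the trained PGP model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_transformation(x_prev, x_next, U, V, W):
    """Mapping units from the element-wise product of two factor projections."""
    factors = (U @ x_prev) * (V @ x_next)
    return sigmoid(W @ factors)

def predict_next(x_prev, mapping, U, V, W):
    """Apply the encoded transformation to x_prev to reconstruct the next frame."""
    factors = (U @ x_prev) * (W.T @ mapping)
    return V.T @ factors

rng = np.random.default_rng(0)
n_pixels, n_factors, n_mappings = 16, 8, 4    # hypothetical sizes
U = rng.normal(scale=0.1, size=(n_factors, n_pixels))
V = rng.normal(scale=0.1, size=(n_factors, n_pixels))
W = rng.normal(scale=0.1, size=(n_mappings, n_factors))

x_prev = rng.random(n_pixels)
x_next = rng.random(n_pixels)

mapping = encode_transformation(x_prev, x_next, U, V, W)
x_pred = predict_next(x_next, mapping, U, V, W)   # reuse the transformation one step ahead
print(mapping.shape, x_pred.shape)
```

The multiplicative interaction between the two frames is what makes the model bilinear: the mapping units describe how one frame relates to the other rather than what either frame contains.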
The fully connected PGP architecture contains a significant number of parameters. To deal with this issue, we utilized its convolutional variant (Conv-PGP), similar to [demodeling], where fully connected layers are replaced by convolutions.
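The difference in parameter count can be made concrete with a short calculation. The 64×64 single-channel frame, the 32 filters, and the 3×3 kernel are illustrative assumptions and do not describe the actual PGP or Conv-PGP configurations.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def conv_params(c_in, c_out, k):
    """Weights plus biases of one k x k convolutional layer."""
    return c_in * c_out * k * k + c_out

pixels = 64 * 64                      # hypothetical single-channel frame
fc_count = dense_params(pixels, pixels)
conv_count = conv_params(1, 32, 3)

print(fc_count)     # 16781312
print(conv_count)   # 320
```

The convolutional layer is several orders of magnitude smaller because its weights are shared across all image locations, which is also why it cannot distinguish those locations.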
While Conv-PGP reduces the number of parameters significantly, it can no longer learn location-dependent features such as the image border. Using valid convolutions prevents, e.g., learning the mirroring motion in the Bouncing Ball dataset. As shown in the experiment section, in the Conv-PGP model, the balls disappear instead of being reflected at the border, which indicates that the model is incapable of predicting location-dependent motions.
To demonstrate this limitation more clearly, we modified the Bouncing Ball dataset. In the Occluded Bouncing Ball dataset, we added fixed strides of three pixels that occlude the moving balls, as well as invisible lines that mirror the velocity. As shown in the following section, we trained the Conv-PGP with this dataset, and it did not achieve a satisfactory result. To resolve this issue, we applied the three proposed methods for modeling location dependency to Conv-PGP.
5 Experiments
We tested our modified VLN architectures on the Occluded Moving MNIST dataset. Each video in the Occluded Moving MNIST dataset consists of a sequence of frames with one MNIST digit moving inside a 64×64 patch. Digits are chosen randomly from the training set and placed initially at random locations inside the patch with a random velocity. The frames are filled with occluding vertical and horizontal bars; the distance between them is eight pixels. In addition to that, we added invisible lines to mirror the velocity at a distance of ten pixels from the border.
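The two location-dependent ingredients of this dataset, the occlusion grid and the invisible mirror lines, can be sketched as follows. The one-pixel bar width, the bar placement starting at index 0, and the treatment of the digit as a single point are assumptions; the published dataset code may differ.

```python
import numpy as np

SIZE = 64        # patch size from the text
BAR_SPACING = 8  # distance between occluding bars from the text
MARGIN = 10      # invisible mirror lines, ten pixels from the border

def occlusion_grid(size=SIZE, spacing=BAR_SPACING):
    """Boolean mask with one-pixel-wide bars (bar width is an assumption)."""
    mask = np.zeros((size, size), dtype=bool)
    mask[::spacing, :] = True   # horizontal bars
    mask[:, ::spacing] = True   # vertical bars
    return mask

def bounce(position, velocity, low=MARGIN, high=SIZE - MARGIN):
    """Advance one step and mirror the velocity at the invisible lines."""
    position = position + velocity
    velocity = velocity.copy()
    for axis in range(2):
        if position[axis] < low:
            position[axis] = 2 * low - position[axis]
            velocity[axis] = -velocity[axis]
        elif position[axis] > high:
            position[axis] = 2 * high - position[axis]
            velocity[axis] = -velocity[axis]
    return position, velocity

mask = occlusion_grid()
position = np.array([50.0, 30.0])   # hypothetical start
velocity = np.array([3.0, -2.0])
for _ in range(2):
    position, velocity = bounce(position, velocity)
print(position, velocity)
```

Neither the bars nor the mirror lines can be inferred from local appearance alone, so a predictor needs some form of location information to model them.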
In our first experiment, we compare the one-layer original VLN architecture on Occluded Moving MNIST with our three proposed solutions:
VLN-AI: Two location gradient channels and one occlusion channel as additional location encoding inputs (Fig. 1(a)),
VLN-LDC: Location-dependent bias in the encoder block (Fig. 1(b)), and
VLN-LDCAI: Location-dependent bias in the encoder block and location gradient channels (Fig. 1(c)).
In our experiment, the first eight frames are predicted using the given frame from the dataset. The last two frames are predicted using the previous network output. Sample results of one-layer original VLN and VLN-LDCAI are depicted in Fig. 3. Sample activations of the Conv-LSTM and the encoder block for both the original VLN and the VLN-LDCAI are shown in Fig. 4. These activations demonstrate that the original VLN cannot infer the location-dependent features while the VLN-LDCAI can learn location-dependent features including the border and the occlusion grid.
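The evaluation protocol can be sketched as a two-phase rollout. The callable `predict_next` is a hypothetical stand-in for one forward step of a trained network (a recurrent model would additionally carry hidden state inside it), and the default frame counts follow the description above.

```python
import numpy as np

def rollout(predict_next, frames, n_teacher_forced=8, n_free_running=2):
    """Predict with ground-truth inputs first, then feed predictions back in."""
    predictions = []
    for t in range(n_teacher_forced):
        # Phase 1: each prediction is conditioned on the given dataset frame.
        predictions.append(predict_next(frames[t]))
    current = predictions[-1]
    for _ in range(n_free_running):
        # Phase 2: the previous network output becomes the next input.
        current = predict_next(current)
        predictions.append(current)
    return predictions

# Toy stand-in model: "predicts" by adding one to every pixel.
toy_model = lambda frame: frame + 1
frames = [np.full((2, 2), t) for t in range(8)]

predictions = rollout(toy_model, frames)
print(len(predictions), predictions[-1][0, 0])
```

Errors can accumulate in the second phase, since the network no longer sees ground-truth frames.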
Table 1 reports the prediction loss and the number of parameters for the evaluated model variants. It can be observed that all methods to model location dependencies improve performance.
Table 1:
| Model | Prediction test loss (BCE) | Number of parameters |
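For reference, the binary cross-entropy (BCE) between a predicted frame and a target frame can be computed as in the sketch below. Averaging over pixels and clipping with `eps` are assumptions; the reported losses may use a different reduction.

```python
import numpy as np

def binary_cross_entropy(prediction, target, eps=1e-7):
    """Mean BCE over all pixels; predictions are clipped to avoid log(0)."""
    p = np.clip(prediction, eps, 1.0 - eps)
    loss = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(np.mean(loss))

target = np.array([1.0, 0.0])
uncertain = np.array([0.5, 0.5])
confident = np.array([1.0, 0.0])

bce_uncertain = binary_cross_entropy(uncertain, target)   # equals ln(2)
bce_confident = binary_cross_entropy(confident, target)   # close to zero
print(bce_uncertain, bce_confident)
```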
In a second experiment, we compared a one-layer Conv-PGP network with and without the border on the Occluded Bouncing Ball dataset, which is constructed similarly to Occluded Moving MNIST. In our experiment, the first three frames are predicted using the given frame from the dataset. The last seven frames are predicted using the previous network output. As illustrated in Fig. 5, learning the location-dependent features is crucial for the prediction task. The prediction losses reported in Table 2 show that our proposed one-layer location-dependent Conv-PGP can solve the Occluded Bouncing Ball dataset and yields a much better result than one-layer Conv-PGP.
Table 2:
| Model | Prediction test loss (BCE) | Number of parameters |
6 Conclusion
Our experiments indicate that location information is a necessity in convolutional architectures for video prediction tasks, as, for example, dealing with occlusions in the environment is challenging. To test the three proposed variants of learning location-dependent features, we utilized the Occluded Moving MNIST and Occluded Bouncing Ball datasets, which mimic occlusions in the real world. The proposed location-dependent inputs and biases allow the VLN and Conv-PGP models to learn more complex location-dependent features than just mirroring velocity at the borders. In contrast to previous approaches, our proposed learnable location-dependent biases do not assume any predefined underlying feature structure. Our proposed location-dependent convolution layers significantly improve on the results of both one-layer VLN and one-layer Conv-PGP architectures.
In future work, we will explore the proposed methods for general deep convolutional neural network architectures, and test the performance on real-world datasets.
This work was funded by grant BE 2556/16-1 (Research Unit FOR 2535 Anticipating Human Behavior) of the German Research Foundation (DFG).