PredNet and Predictive Coding: A Critical Review

Abstract.

The PredNet architecture by Lotter et al. (Lotter et al., 2016) combines the biologically plausible framework of Predictive Coding with self-supervised video prediction in order to learn the complex structure of the visual world. While the architecture has drawn a lot of attention and various extensions of the model exist, a critical analysis of it is still missing. We fill this gap by evaluating PredNet, both as an implementation of Predictive Coding theory and as a self-supervised video prediction model, on a challenging video action classification dataset. We also design an extended architecture to test whether conditioning future frame predictions on the action class of the video, and vice versa, improves model performance. With substantial evidence, we show that PredNet does not completely follow the principles of Predictive Coding. Our comprehensive analysis and results aim to guide future research based on PredNet or on similar architectures inspired by the Predictive Coding theory.

deep learning, convolutional neural networks, video classification, video prediction, semi-supervised, Predictive Coding

1. Introduction

Learning a model of the visual world is a crucial prerequisite to reliably performing computer vision tasks like object detection and semantic segmentation. As illustrated by (Jing and Tian, 2019), self-supervised learning allows us to extract this complex structure of the real world without the need for expensive labeled data. Videos contain information about how scenes evolve in time, and predicting the future frames of a video is therefore one popular method (Finn et al., 2016)(Mathieu et al., 2015)(Vondrick et al., 2015)(Wang et al., 2016)(Wang et al., 2018) of extracting this structure in a self-supervised manner. Many researchers (Lotter et al., 2016)(Mathieu et al., 2015)(Srivastava et al., 2015) have hypothesized that to accurately predict how the visual world changes, a model must learn about object structure and the possible transformations objects can undergo. Among the various video prediction models, PredNet by Lotter et al. (Lotter et al., 2016) achieves high video prediction accuracy with the additional benefit of using a biologically plausible architecture.

The PredNet architecture is inspired by the Predictive Coding theory from the neuroscience literature (Friston, 2009)(Rao and H. Ballard, 1999)(Spratling, 2017) and attempts to implement it within the deep learning framework. Predictive Coding theory is promising as a self-supervised learning technique as it is shown to imitate some of the neuronal behavior seen in the mammalian visual cortex. It posits that the brain is continually making predictions of incoming sensory stimuli and uses the deviations from these predictions as a learning signal. It is a hierarchical network consisting of top-down connections that carry predictions from higher to lower states and bottom-up connections that carry the sensory data from lower to higher states at each layer. The error in prediction is propagated upwards, eventually leading to better predictions in the future.

In this paper, we first re-evaluate PredNet’s performance as a video prediction architecture by testing it on a different dataset. Then we examine its capability to learn useful latent features for performing well on a downstream task. Specifically, our contributions in this paper are two-fold:

  1. Using visualization techniques and experiments, we critically review PredNet as an emulation of the Predictive Coding framework and as a video prediction model.

  2. We test the features extracted by PredNet by training it in a semi-supervised setup to perform video action classification. We also evaluate if conditioning the top-down predictions on action classes of the video and vice versa improves the model’s accuracies.

The paper is organized as follows. Section 2 reviews Predictive Coding and its implementations. Section 3 describes our experiment setup, namely the data, the architecture, and the evaluation metrics used. Section 4 is dedicated to the first phase of our work and lists our observations from probing PredNet. Section 5 details the second phase of our work, the implementation and evaluation of our proposed extension of the architecture, PredNet+. Section 6 concludes the paper and lists possible directions for future research.

2. Related work

Rao and Ballard (Rao and H. Ballard, 1999) developed the Predictive Coding model in 1999 to demonstrate that the ’extra-classical’ receptive field effects observed in the early stages of cat and monkey visual cortex are a result of the brain trying to efficiently encode sensory data using prediction. This was followed by a rich literature of work in neuroscience (Spratling, 2017)(Summerfield and Koechlin, 2008)(Friston, 2009)(Clark, 2013)(Emberson et al., 2015) and computational modelling (Chalasani and Principe, 2013)(Lotter et al., 2016)(Song et al., 2019) that explored different interpretations and implementations of the basic idea, not only in the visual domain but also in sensory-motor domains.

Following the success of deep learning in the last decade, many researchers have attempted to implement the Predictive Coding model using deep learning (Han et al., 2018)(Lotter et al., 2016)(van den Oord et al., 2018)(Wen et al., 2018). Wen et al. (Wen et al., 2018) use Predictive Coding on static images to learn optimal feature vectors at each layer for object recognition. Han et al. (Han et al., 2018) build on this to develop a bidirectional and dynamic neural network with local recurrent processing. Van den Oord et al. (van den Oord et al., 2018) perform Predictive Coding in latent space and use a probabilistic contrastive loss to learn useful representations. Lotter et al. (Lotter et al., 2016) designed a video prediction network using the principles of Predictive Coding. Among these, Lotter et al.'s PredNet architecture is the closest to the original Predictive Coding model due to its hierarchical structure, bottom-up error propagation, and top-down predictions. It also achieves accuracy on par with state-of-the-art video prediction models. For these reasons, and because of the availability of open-source code, a lot of follow-up work on PredNet has ensued.

Zhong et al. (Zhong et al., 2018a) extend PredNet into AFA-PredNet for the robotics domain. They integrate the motor actions of a robot as an additional signal to condition the top-down generative process. Following this, they design MTA-PredNet (Zhong et al., 2018b), which uses different temporal scales at different layers in the hierarchy. MTA-PredNet was developed to compensate for PredNet's inability to perform reliable long-term predictions, which is a necessity for planning in robotics. Furthermore, researchers have tried to improve PredNet by adding skip-connections alongside error propagation (Sato et al., 2018), by reducing the number of parameters through fewer gates in the top-down ConvLSTM units (Elsayed et al., 2019), and by using inception-type units within each PredNet layer (Hosseini et al., 2019). Sato et al. (Sato et al., 2018) evaluate PredNet on a weather precipitation dataset, and Watanabe et al. (Watanabe et al., 2018) test PredNet's response to visual illusions to examine whether Predictive Coding models respond to them just as humans do. However, none of this work critically evaluates PredNet as an implementation of Predictive Coding and as a reliable self-supervised pretraining method. Our work aims to provide such a critical review of PredNet for future works that intend to use the architecture or design architectures inspired by it.

3. Experiment setup

3.1. Dataset

Most existing large-scale video classification datasets have coarse-grained labels (Kuehne et al., 2011)(Heilbron et al., 2015)(Karpathy et al., 2014). This means that the models are trained on a relatively easy task and the label can often be detected even from isolated frames, e.g. the 'soccer' label can be inferred from a green field. To overcome this issue and force models to learn better representations, the Something-something dataset (Goyal et al., 2017) was collected; it contains 220,000 videos with 174 fine-grained action labels. For instance, 'putting something on a table', 'pretending to put something on a table', and 'putting something on a slanted surface so it slides down' are three different label classes in the dataset. Mahdisoltani et al. (Mahdisoltani et al., 2018) provide evidence for the hypothesis that task granularity is strongly correlated with the quality and generalizability of the learned features. As for the nature of the data, being crowd-sourced, it includes noise much resembling the real world: thousands of different objects, and variations in lighting conditions, background patterns, and camera motion.

3.2. PredNet architecture

The PredNet architecture is shown in Figure 1 (Lotter et al., 2016). The network is composed of stacked hierarchical layers, each of which attempts to make local predictions of its input. The difference between the actual input and this prediction is then passed up the hierarchy to the next layer. Information flows in three ways through the network: (1) the error signal flows in the bottom-up direction, as marked by the red arrows on the right, (2) the prediction signal flows in the top-down direction, as shown by the green arrow on the left, and (3) the local error signal and the prediction signal flow within each layer. Every layer l consists of four units: an input convolution unit (A_l), a recurrent representation unit (R_l) followed by a prediction unit (Â_l), and an error calculation unit (E_l), as labelled in Figure 1. The representation unit is made of a ConvLSTM (Shi et al., 2015) layer that estimates what the input will be on the next time step. This estimate is fed into the prediction unit, which generates the prediction Â_l. The error unit calculates the difference between the prediction Â_l and the input A_l, which is fed as input to the next layer. The representation unit receives a copy of the error signal (red arrow) along with the up-sampled output of the representation unit of the higher level (green arrow), which it uses together with its recurrent memory to perform the future predictions.
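To make this information flow concrete, the following is a minimal sketch of one layer's update at a single time step, using the A_l, R_l, Â_l, E_l notation of Lotter et al. The ConvLSTM cell, the tensor shapes, and the ordering of the full top-down and bottom-up sweeps are simplified for illustration; this is not the reference implementation.

```python
# Minimal sketch of one PredNet layer's update at a single time step,
# following the notation of Lotter et al. (2016); simplified, not the
# reference implementation.
import torch
import torch.nn.functional as F

def prednet_layer_step(A_t, R_prev, E_prev, R_top, conv_lstm_cell, conv_ahat):
    """A_t: bottom-up input (the error from the layer below, or the frame at layer 0).
    R_prev, E_prev: this layer's representation and error from the previous time step.
    R_top: representation of the layer above (None at the top layer).
    conv_lstm_cell: assumed callable, conv_lstm_cell(inputs, state) -> new state.
    conv_ahat: convolution producing the prediction Ahat from R."""
    # Top-down update of the representation R_l from the previous error,
    # the previous state, and the up-sampled higher-level representation.
    lstm_in = [E_prev]
    if R_top is not None:
        lstm_in.append(F.interpolate(R_top, scale_factor=2))
    R_t = conv_lstm_cell(torch.cat(lstm_in, dim=1), R_prev)

    # Local prediction Ahat_l of the incoming input.
    Ahat_t = torch.relu(conv_ahat(R_t))

    # Error unit: rectified positive and negative prediction errors,
    # concatenated and passed up as the input to the next layer.
    E_t = torch.cat([torch.relu(A_t - Ahat_t), torch.relu(Ahat_t - A_t)], dim=1)
    return R_t, Ahat_t, E_t
```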

Figure 1. PredNet (Lotter et al., 2016) architecture.

3.3. Evaluation metrics

Defining a good evaluation metric for the quality of image predictions is a challenging subtopic in itself (De and Masilamani, 2013)(Mathieu et al., 2015). There is no universally accepted measurement of image quality and, consequently, none for image similarity. For the video prediction task, we employ the two metrics commonly used in the literature: the Peak Signal-to-Noise Ratio (PSNR) (Mathieu et al., 2015) and the Structural Similarity Index Measure (SSIM) (Wang et al., 2004). Like Mathieu et al. (Mathieu et al., 2015), we calculate PSNR and SSIM only for the frames which show movement with respect to the previous frame and call these 'PSNR movement' and 'SSIM movement', respectively. In our case this is crucial, as action videos often contain very few frames with movement and a metric should not reward a model for simply reconstructing a still frame. We also use a third metric called 'conditioned SSIM', calculated as given in Equation 1. This metric quantifies how different the predictions are from the previous frame and therefore measures how 'risky' the predictions of our model are in comparison to simply performing a 'last-frame copy'.

(1)
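A hedged sketch of how the 'movement' variants of these metrics can be computed is given below. The movement threshold and the use of scikit-image are our own illustrative choices, and the conditioned SSIM of Equation 1 is not included.

```python
# Sketch of the 'movement' variants of PSNR and SSIM: frames that do not
# differ from their predecessor by more than a small threshold are skipped,
# so a model is not rewarded for copying still frames. The threshold value
# and the scikit-image calls are illustrative choices.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def movement_metrics(actual, predicted, movement_thresh=1e-3):
    """actual, predicted: arrays of shape (T, H, W, C) with values in [0, 1]."""
    psnr_vals, ssim_vals = [], []
    for t in range(1, actual.shape[0]):
        # Skip frames without movement with respect to the previous frame.
        if np.mean(np.abs(actual[t] - actual[t - 1])) < movement_thresh:
            continue
        psnr_vals.append(peak_signal_noise_ratio(actual[t], predicted[t], data_range=1.0))
        ssim_vals.append(structural_similarity(actual[t], predicted[t],
                                               channel_axis=-1, data_range=1.0))
    return float(np.mean(psnr_vals)), float(np.mean(ssim_vals))
```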

4. Probing PredNet

In the first phase of our project, we review PredNet by evaluating its performance on the Something-something dataset and by visualizing the different states of the architecture. For our experiments we use 10 different hyper-parameter settings, varying the number of layers, the channels per layer, the input image size, and the frames-per-second (FPS) rate of the videos; these are listed in Table 1. Along with the predicted frame, we visualize all the different states of PredNet at each layer by averaging the activations of all channels in a layer, similar to Han et al. (Han et al., 2018). We also plot the mean of the error signals and representations of every layer to help visualize how they evolve over the span of the video. A sample video with all its visualizations is shown in Figure 5. In the following section, we dedicate one paragraph to each of our findings; Figure 5 and Table 1 will aid in discussing them.
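The visualization itself is straightforward; a minimal sketch of the channel averaging and the per-layer mean activations we plot is given below (the (channels, H, W) layout is an assumption).

```python
# Sketch of the state visualization: a layer's state tensor (e.g. R_l or E_l)
# is averaged over its channel axis to obtain a single heat map per frame,
# and the scalar mean is tracked over the video for the per-layer
# 'mean E/R activation' curves. The (channels, H, W) layout is an assumption.
import numpy as np

def channel_mean_map(state):
    """(channels, H, W) -> (H, W) heat map, as in Han et al. (2018)."""
    return state.mean(axis=0)

def mean_activation(state):
    """Scalar mean activation of a layer's state at one time step."""
    return float(np.mean(state))
```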

Figure 2. Example of a low FPS video and the predictions made by PredNet
Figure 3. Comparison of performance of models trained with videos of 3, 6 and 12 FPS rates.
Figure 4. Example of a video with multiple possible future states.
Figure 5. Model 7 full visualization.
Model   FPS   Layers   Image (h x w)   Params (millions)
0       3     4        48 x 56         6.9
1       6     4        48 x 56         6.9
2       12    4        48 x 56         6.9
3       12    4        32 x 48         6.9
4       12    5        48 x 80         5.3
5       12    6        64 x 96         5.8
6       12    7        128 x 192       6.2
7       12    6        96 x 160        7.2
8       12    5        48 x 80         5.3
9       12    6        64 x 96         5.8
Table 1. Experiments with different frames-per-second (FPS) rates, numbers of layers, image sizes, and numbers of model parameters (in millions). The models also differ in whether they were trained with the L_0 or the L_all loss; comparable models are grouped together and vary in only one of these settings at a time.

4.1. Observations

Comparing the frame predictions with the input frames in Figure 2, Figure 4 and Figure 5, we can summarize the working dynamics of PredNet on the action classification dataset as follows. The model simply copies the previous frame if there are no cues for motion in the previous two frames. If there is a cue for motion and the motion is smooth and continuous, it interpolates the object in the direction of the motion. Otherwise, it blurs the region containing the moving object to keep the L2 loss minimal, by virtue of regression to the mean. The blurring is a result of PredNet's inability to learn multi-modal predictions: it learns to perform the single 'best' prediction. However, in real-world scenes there are multiple equally probable future states. For instance, in Figure 4, the thumb can move up, move down, or not move at all in the next frame. The blurring strategy of PredNet is further supported by an experiment we conducted with different sharpness measures: the predictions of the model are always less sharp than the actual videos.

The PredNet model learns relevant features only when trained on videos with continuous motion. The authors (Lotter et al., 2016) designed and tested PredNet on videos with continuous motion, such as the KITTI dataset (Geiger et al., 2013) and their synthetic 'Rotating Faces' dataset. This is in stark contrast with our action dataset, which can contain many still frames, see e.g. Figure 2. In this scenario, PredNet resorts to mere last-frame copying, as it is statistically beneficial to do just that. If the model is not pushed to learn the dynamics of how objects move and scenes evolve, then the features it learns will not be useful for downstream tasks, as hypothesized by Lotter et al. (Lotter et al., 2016).

PredNet's learning ability is sensitive to the frames-per-second (FPS) rate. When we compare the performance of models trained on videos with FPS rates of 3, 6 and 12 in Figure 3, we can see that the performance varies greatly. Manual inspection of the predictions further confirms the large difference in prediction quality. At very high FPS rates there is minimal motion between two consecutive frames, while at low FPS rates there is abrupt movement between frames, which is challenging to predict. In both of these scenarios, the model completely resorts to last-frame copying. Therefore, the FPS rate of the video is one of the most important hyper-parameters of PredNet.

We have two pieces of evidence to support that PredNet is not a comprehensive emulation of hierarchical Predictive Coding. Firstly, from the 'mean E activation' plot in Figure 5, it is evident that the mean bottom-up error increases as we go up to higher layers. This behavior can be observed in all sample videos we visualized. This is contrary to the expectations of Predictive Coding, which posits that the error decreases as we go up the hierarchy as parts of the incoming signal are iteratively 'explained away'. Secondly, Lotter et al. (Lotter et al., 2016) demonstrate that models trained with the L_0 loss perform better than models trained with the L_all loss on the KITTI data. We cross-check this on our dataset and get similar results: as shown in Figure 7, the model trained with the L_0 loss performs better on all metrics. Training with the L_0 loss implies that we only minimize the error at the lowest layer (see Figure 1), while with the L_all loss the model is trained to minimize the prediction errors at all layers. The Predictive Coding theory states that each layer in the hierarchy minimizes its error signal iteratively. Therefore, an ideal implementation of Predictive Coding is one whose results improve when trained with the L_all loss. Furthermore, the visualization of the mean activations of PredNet's states at different layers in Figure 5 shows that Layer 0's states look quite different from those of all higher layers. The model seems to operate as two sub-modules when trained with the L_0 loss: the lowest layer aims to generate realistic predictions, while the rest of the layers operate as one deep network that regresses to generate the top-down context for the lowest layer. This is also evident from the fact that the 'mean R activations' for Layer 0 are higher and follow a different trajectory than those of the rest of the layers in Figure 5.
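For reference, the training loss of Lotter et al. weights the mean error activation of each layer and time step; the sketch below illustrates how the choice of layer weights yields the L_0 and L_all variants. The weight values follow those reported by Lotter et al.; the helper function itself is ours.

```python
# Sketch of the layer-weighted training loss of Lotter et al. (2016): the mean
# absolute error activation of every layer at every time step is combined with
# layer weights and time weights. Layer weights [1, 0, 0, ...] give the L_0
# loss (only the lowest layer's error is minimized); small non-zero weights on
# the upper layers (e.g. [1, 0.1, 0.1, ...]) give the L_all loss.
def prednet_loss(errors, layer_weights, time_weights):
    """errors[t][l]: mean absolute error activation of layer l at time step t."""
    loss = 0.0
    for t, errors_t in enumerate(errors):
        layer_term = sum(w * e for w, e in zip(layer_weights, errors_t))
        loss += time_weights[t] * layer_term
    return loss

# Example weighting for a 4-layer model unrolled over T time steps:
T = 10
time_weights = [0.0] + [1.0 / (T - 1)] * (T - 1)   # the first frame is not scored
l0_weights   = [1.0, 0.0, 0.0, 0.0]                # L_0 loss
lall_weights = [1.0, 0.1, 0.1, 0.1]                # L_all loss
```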

We also evaluate PredNet by examining its ability to extrapolate and predict longer time steps into the future. As explained in Lotter et al. (Lotter et al., 2016), PredNet can be used to generate long-term predictions by simply feeding its prediction at time t back in as input at the next time step t+1. This can be done iteratively for n time steps to get a prediction n steps into the future. We test the extrapolation capability of PredNet models that are trained only to perform next-frame (t+1) predictions as well as PredNet models that are trained explicitly to perform extrapolated predictions. As expected, the results are marginally better in the latter case, as also demonstrated in Lotter et al. (Lotter et al., 2016). The extrapolated predictions of our best-performing model are given in Figure 8. The extrapolation is started at different time points in the video, as shown by the red marker in the figure. Three observations can be made from these experiments. (1) After two time steps, the model resorts to last-frame copying. As already discussed, PredNet performs predictions by using the movement between consecutive input frames as an active cue. Therefore, while extrapolating, when we feed the predictions back as input, the model gets a cue that the action has stopped and reverts to last-frame copying. (2) The predictions get blurrier over time. This is because the minor blur added by the down-sampling units in the bottom-up pass and by the up-sampling units in the top-down pass accumulates exponentially over time. (3) From the metrics in Figure 9 we can infer that the models perform better if the extrapolation is started in the later stages of the video. This can be explained by the fact that in our dataset, motion generally starts in the middle or towards the end of the video. In conclusion, the extrapolation experiments suggest that the network design compels it to learn short-term interpolations instead of building long-term hypotheses.
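The extrapolation procedure itself is simple; the sketch below illustrates it, assuming a hypothetical single-step interface model.step(frame) that returns the predicted next frame (the actual PredNet code exposes this differently).

```python
# Sketch of the extrapolation procedure: after warming up on real context
# frames, the model's own prediction is fed back in as the next input.
# `model.step(frame)` is a hypothetical single-step interface returning the
# predicted next frame.
def extrapolate(model, frames, t_context, n_extrap):
    assert t_context >= 1
    # Warm up the recurrent states on real context frames.
    for t in range(t_context):
        next_frame = model.step(frames[t])
    # Extrapolate: the prediction at time t becomes the input at time t+1.
    predictions = []
    current = next_frame
    for _ in range(n_extrap):
        predictions.append(current)
        current = model.step(current)
    return predictions
```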

Finally, we discovered that the model makes 'interesting' predictions (i.e. it predicts object movements instead of just blurring regions of motion) only when the top-most layers have a full receptive field. The receptive field can be increased by using more layers, by increasing the kernel size of the convolutions, or by reducing the image size. We experimented with each of these and found that the prediction scores improve with an increased receptive field. As evidence, we show the results of experimenting with different numbers of layers in Figure 6; the prediction quality clearly improves with increasing depth.

Figure 6. Results of Models with 4, 5, 6 and 7 layers.
Figure 7. Results of models trained with the L_0 and the L_all loss.
Figure 8. Extrapolation results for Model 7 extrapolated at different time steps. The red mark shows the start of the extrapolation.
Figure 9. Comparison of the metrics when starting extrapolation at different stages in the video; T denotes the total number of frames in the video.

5. Label classification with PredNet+

In this section, we describe the second phase of our work, where we further test the architecture by modifying it to perform supervised label classification simultaneously with video prediction. For a comparison of the architectures of PredNet+ and the vanilla PredNet, see Figures 10 and 1, respectively. The model design, the rationale behind it, and the results are discussed next.

5.1. PredNet+ Design

We modify the PredNet architecture such that it can perform video label classification along with next-frame prediction, and informally call this architecture PredNet+. The architecture is shown in Figure 10. As seen in that figure, along with the vanilla PredNet units it contains an additional 'label classification unit' attached to the top-most representation layer. The unit consists of an encoder section and a decoder section. In Figure 10, the two ConvLSTM layers form the encoder (in black), which transforms the output of the top-most representation unit into label class probabilities. The two transposed-convolution layers make up the decoder (also in black), which up-samples the label classes and transforms them back into the image modality; this output is fed back into the top-down pathway, as shown by the black arrow into the top-most representation unit.

Figure 10. PredNet+ architecture
Figure 11. Weighing of label predictions over time-steps.

The 'label classification unit' makes a prediction at each incoming frame; the weighted sum of these predictions is passed through a softmax function to get the final class probabilities for the video. As the model does not have enough context to make meaningful predictions at the beginning of the video, the weighting over time is done using the exponential function shown in Figure 11.
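A minimal sketch of this weighting over time is shown below; the growth rate of the exponential is an assumption, with Figure 11 showing the weighting actually used.

```python
# Sketch of the weighting-over-time in the label classification unit: the
# per-frame class logits are combined with exponentially increasing weights
# (early frames carry little context) and passed through a softmax. The
# growth rate of the exponential is an assumption.
import numpy as np

def video_class_probabilities(per_frame_logits, growth=5.0):
    """per_frame_logits: array of shape (T, num_classes)."""
    T = per_frame_logits.shape[0]
    w = np.exp(growth * np.linspace(0.0, 1.0, T))          # exponential weights over time
    w /= w.sum()
    logits = (w[:, None] * per_frame_logits).sum(axis=0)   # weighted sum over frames
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                                  # softmax
```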

PredNet+ is designed such that the latent features at the top-most representation layer are shared between its two tasks. The future frame predictions are conditioned on the label predictions made by the 'label classification unit' (shown in Figure 10 by the black arrow going into the top-most representation unit). We hypothesized that this would improve the results on both sub-tasks, as is evident in many multi-task training scenarios (Collobert and Weston, 2008)(Girshick, 2015). Even though we attach the 'label classification unit' to the top-most layer, this is neither the only approach nor necessarily the best one as per Predictive Coding. This setup is chosen because the top-most layer in our models has a full receptive field over the image frames and also contains recurrent memory, which is ideal for video label classification.

In summary, the ‘label classification unit’ and the PredNet units in PredNet+, are expected to work in tandem in a multitask learning set-up and form a synergy. However, this is not what we observe in our results.

5.2. Results

Table 2 shows our best classification accuracy in comparison to the baseline model scores of Goyal et al. (Goyal et al., 2017) and the current state-of-the-art results by Mahdisoltani et al. (Mahdisoltani et al., 2018) on the Something-something dataset. We test the PredNet+ architecture on our best 4-layer, 5-layer and 6-layer models from Table 1. Furthermore, we test the following minor variations of PredNet+ to further evaluate the model architecture. First, we remove the recurrent memory in the label classification unit by replacing the ConvLSTM layers with convolution layers. Next, we extend the label classification loss function such that the model is rewarded for predicting at least the correct verb in the label. For example, if the correct label is 'Pretending to put something behind something', the model is penalized twice as much if it predicts 'Showing something to the camera' than if it predicts 'Putting something behind something', which shares the verb with the correct label. Surprisingly, the classification results do not change at all for any of these model variations. This suggests that the features from the top-most representation unit do not carry any additional information that these variations could exploit.
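The sketch below illustrates one way such a verb-aware penalty could be applied on top of a cross-entropy term; the factor of two and the verb-extraction helper verb_of are assumptions made for illustration.

```python
# Sketch of a verb-aware penalty on top of a cross-entropy term: a wrong
# prediction that at least shares the verb with the ground-truth label is
# penalized half as much as a completely wrong one. The factor of two and
# the verb-extraction helper `verb_of` are assumptions.
import numpy as np

def verb_aware_loss(probs, target_idx, labels, verb_of):
    """probs: predicted class probabilities; labels: list of label strings;
    verb_of: hypothetical helper mapping a label string to its action verb."""
    ce = -np.log(probs[target_idx] + 1e-12)          # standard cross-entropy term
    pred_idx = int(np.argmax(probs))
    if pred_idx == target_idx:
        return ce
    same_verb = verb_of(labels[pred_idx]) == verb_of(labels[target_idx])
    return ce if same_verb else 2.0 * ce             # double penalty if even the verb is wrong
```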

Model Top-1 Top-5
Baseline (Goyal et al., 2017) 11.5 30.0
Ours 28.2 57.0
Mahdisoltani et al. (Mahdisoltani et al., 2018) 51.38 -
Table 2. Classification accuracy (in percent) on the Something-something dataset (Goyal et al., 2017) with 174 label categories.

Our label classification score suggests that PredNet+ is a long way from the state-of-the-art architectures. Furthermore, the future frame predictions of PredNet+ also degrade in comparison to its equivalent vanilla PredNet models, Model 5 and Model 8 from Table 1. The metrics in Figure 12 and the visualizations of the predictions confirm this. To analyze this further, we experiment with different loss weights for the two tasks, which allows us to control the relative importance of each task during training. We find that the model's future prediction quality degrades when the label classification task is given increased importance, suggesting that the multi-task constraint leads to worse future frame predictions.

Figure 12. Comparison of the prediction quality of PredNet+ with equivalent PredNet models.

6. Conclusion and future work

We have evaluated PredNet (Lotter et al., 2016) on a challenging action classification dataset in two phases.

In the first phase of our work, we investigate PredNet and derive the following insights: (1) PredNet does not completely follow the principles of the Predictive Coding framework. (2) It can perform only short-term next frame interpolations, rather than long-term video predictions. This has been further confirmed by the extrapolation experiments. (3) The representation units are unable to learn multi-modal distributions and produce blurry predictions. (4) The models’ learning ability is sensitive to the continuity of motion and the FPS rate of the videos.

In the second phase, we briefly test PredNet’s ability to learn useful latent features to perform label classification. We use the features from the highest representation layer and find that this is not adequate for the task at hand. We achieve a classification accuracy of 28.2% in comparison to current state-of-the-art of 51.38% (Mahdisoltani et al., 2018).

The above discourse opens up a lot of scope for future research. A successor to PredNet could be designed that does not have the aforementioned limitations and is a more accurate implementation of the Predictive Coding theory. Firstly, the network should be trainable with the L_all loss. This can be done by designing error estimators that are local to each layer. Secondly, the network should be redesigned such that it is encouraged to perform long-term predictions rather than just frame-to-frame interpolation. One way to do this is to have explicit layers, higher in the hierarchy, that make predictions at different temporal scales. Lastly, the estimator units or representation units should learn multi-modal probability distributions from which predictions can be sampled. Additionally, PredNet's performance metrics show high variance, while PredNet+ is easily susceptible to over-fitting. These points signal the need for including regularization techniques and model averaging methods like dropout within the architecture. Future work on PredNet+ could also connect the 'label classification unit' to the representation units of all layers rather than just the top-most layer; within the Predictive Coding framework, this would be deemed most beneficial.


References

  1. Chalasani and Principe (2013). Deep predictive coding networks. arXiv preprint arXiv:1301.3541.
  2. Clark (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences 36(3), pp. 181–204.
  3. Collobert and Weston (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 160–167.
  4. De and Masilamani (2013). Image sharpness measure for blurred images in frequency domain. Procedia Engineering 64.
  5. Elsayed et al. (2019). Reduced-gate convolutional LSTM architecture for next-frame video prediction using predictive coding. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–9.
  6. Emberson et al. (2015). Top-down modulation in the infant brain: learning-induced expectations rapidly affect the sensory cortex at 6 months. Proceedings of the National Academy of Sciences 112(31), pp. 9585–9590.
  7. Finn et al. (2016). Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems 29, pp. 64–72.
  8. Friston (2009). The free-energy principle: a rough guide to the brain? Trends in Cognitive Sciences 13(7), pp. 293–301.
  9. Geiger et al. (2013). Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
  10. Girshick (2015). Fast R-CNN. CoRR abs/1504.08083.
  11. Goyal et al. (2017). The "something something" video database for learning and evaluating visual common sense. CoRR abs/1706.04261.
  12. Han et al. (2018). Deep predictive coding network with local recurrent processing for object recognition. CoRR abs/1805.07526.
  13. Heilbron et al. (2015). ActivityNet: a large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–970.
  14. Hosseini et al. (2019). Inception-inspired LSTM for next-frame video prediction. arXiv preprint arXiv:1909.05622.
  15. Jing and Tian (2019). Self-supervised visual feature learning with deep neural networks: a survey. CoRR abs/1902.06162.
  16. Karpathy et al. (2014). Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732.
  17. Kuehne et al. (2011). HMDB: a large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision (ICCV '11), pp. 2556–2563.
  18. Lotter et al. (2016). Deep predictive coding networks for video prediction and unsupervised learning. CoRR abs/1605.08104.
  19. Mahdisoltani et al. (2018). Fine-grained video classification and captioning. CoRR abs/1804.09235.
  20. Mathieu et al. (2015). Deep multi-scale video prediction beyond mean square error. CoRR abs/1511.05440.
  21. Rao and Ballard (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience 2, pp. 79–87.
  22. Sato et al. (2018). Short-term precipitation prediction with skip-connected PredNet. In Artificial Neural Networks and Machine Learning – ICANN 2018, pp. 373–382.
  23. Shi et al. (2015). Convolutional LSTM network: a machine learning approach for precipitation nowcasting. CoRR abs/1506.04214.
  24. Song et al. (2019). Fast inference predictive coding: a novel model for constructing deep neural networks. IEEE Transactions on Neural Networks and Learning Systems 30(4), pp. 1150–1165.
  25. Spratling (2017). A review of predictive coding algorithms. Brain and Cognition 112, pp. 92–97.
  26. Srivastava et al. (2015). Unsupervised learning of video representations using LSTMs. CoRR abs/1502.04681.
  27. Summerfield and Koechlin (2008). A neural representation of prior information during perceptual inference. Neuron 59(2), pp. 336–347.
  28. van den Oord et al. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  29. Vondrick et al. (2015). Anticipating the future by watching unlabeled video. CoRR abs/1504.08023.
  30. Wang et al. (2016). Temporal segment networks: towards good practices for deep action recognition. CoRR abs/1608.00859.
  31. Wang et al. (2018). PredRNN++: towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. CoRR abs/1804.06300.
  32. Wang et al. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), pp. 600–612.
  33. Watanabe et al. (2018). Illusory motion reproduced by deep neural networks trained for prediction. Frontiers in Psychology 9, pp. 345.
  34. Wen et al. (2018). Deep predictive coding network for object recognition. CoRR abs/1802.04762.
  35. Zhong et al. (2018a). AFA-PredNet: the action modulation within predictive coding. CoRR abs/1804.03826.
  36. Zhong et al. (2018b). Encoding longer-term contextual multi-modal information in a predictive coding model. CoRR abs/1804.06774.