Abstract
Bilinear feature learning models, like the gated autoencoder, were proposed as a way to model relationships between frames in a video. By minimizing reconstruction error of one frame, given the previous frame, these models learn “mapping units” that encode the transformations inherent in a sequence, and thereby learn to encode motion. In this work we extend bilinear models by introducing “higher-order mapping units” that allow us to encode transformations between frames and transformations between transformations. We show that this makes it possible to encode temporal structure that is more complex and longer-range than the structure captured within standard bilinear models. We also show that a natural way to train the model is by replacing the commonly used reconstruction objective with a prediction objective which forces the model to correctly predict the evolution of the input multiple steps into the future. Learning can be achieved by backpropagating the multi-step prediction through time. We test the model on various temporal prediction tasks, and show that higher-order mappings and predictive training both yield a significant improvement over bilinear models in terms of prediction accuracy.
Modeling sequential data using higher-order relational features and predictive training
Vincent Michalski vmichals@rz.uni-frankfurt.de
Goethe-University Frankfurt, Frankfurt, Germany
Roland Memisevic roland.memisevic@umontreal.ca
University of Montreal, Montreal, Canada
Kishore Konda konda@informatik.uni-frankfurt.de
Goethe-University Frankfurt, Frankfurt, Germany
We explore the application of relational feature learning models (e.g. Memisevic & Hinton, 2007; Taylor & Hinton, 2009) in sequence modeling. To this end, we propose a bilinear model to describe frame-to-frame transitions in image sequences. In contrast to existing work on modeling relations, we propose a new training scheme, which we call predictive training: after a transformation is extracted from two frames in the video, the model tries to predict the next frame by assuming constancy of the transformations through time.
We then introduce a deep bilinear model as a natural application of predictive training, and as a way to relax the assumption of constancy of the transformation.
The model learns relational features, as well as “higher-order relational features”, representing relations between the transformations themselves. To this end, the bottom-layer bilinear model infers a representation of motion from two seed frames as well as a representation of motion from two later frames. The top layer is itself a bilinear model that learns to represent the relation between the inferred lower-level transformations. It can be thought of as learning a second-order “derivative” of the temporal evolution of the high-dimensional input time series. We show that an effective way to train these models is to first pretrain the layers individually using pairs of frames for the bottom layer and pairs of inferred transformations for the next layer, and to subsequently fine-tune parameters using complete sequences, by backpropagating a multi-step lookahead cost through time.
The model as a whole may be thought of as a way to model a dynamical system as a second-order partial difference equation. While in principle the model could be stacked to take into account differences of arbitrary order, we demonstrate that the two-layer model is surprisingly effective at modeling a variety of complicated image sequences.
Both layers of our model make use of multiplicative interactions between filter responses in order to model relations (Memisevic, 2013). Multiplicative interactions were recently shown to be useful in recurrent neural networks by Sutskever et al. (2011). In contrast to our work, Sutskever et al. (2011) use multiplicative interactions to gate the connections between hidden states, so that each observation can be thought of as blending in a separate hidden state transition. A natural application of this is sequences of discrete symbols, and the model is consequently demonstrated on text. In our work, the role of multiplicative interactions is explicitly to yield encodings of transformations, such as those between frames in a video, and we apply the model primarily to video data.
Our model also bears some similarity to that of Taylor & Hinton (2009), who model MOCAP data using a generatively trained three-way Restricted Boltzmann Machine, where a second layer of hidden units can be used to model more “abstract” features of the time series. In contrast to that work, our higher-order units, which are three-way units too, are used to expressly model higher-order transformations (transformations between the transformations learned in the first layer). Furthermore, we show that predictive fine-tuning using backprop through time allows us to train the model discriminatively and yields much better performance than generative training by itself.
In order to learn features, $m$, that represent the relationship between two frames $x^{(1)}$ and $x^{(2)}$ as shown in Figure 1, it is necessary to learn a basis that can represent the correlation structure across the frames.
In a video, given one frame $x^{(1)}$ there can be a multitude of potential next frames $x^{(2)}$. It is therefore common to use bilinear models, like the Gated Boltzmann Machine (GBM) (Taylor & Hinton, 2009), the Gated Autoencoder (GAE) (Memisevic, 2011), and similar models (see Memisevic, 2013, for an overview), whose hidden variables can represent which transformation, out of the pool of many possible transformations, can take $x^{(1)}$ to $x^{(2)}$.
More formally, bilinear models learn to represent the linear transformation, $L$, between two images $x^{(1)}$ and $x^{(2)}$, where
$$x^{(2)} = L\,x^{(1)} \tag{1}$$
It can be shown that a weighted sum of products of filter responses is able to identify the transformation. The reason is that the weighted sum is large if the angle between filters is similar to the angle (in “pixel-space”) between the two frames. That way, hidden units represent the observed transformation in the form of a set of phase-differences in the invariant subspaces of the transformation class (Memisevic, 2013). As hidden units encode the transformation between images, rather than the content of the images, they are commonly referred to as mapping units. We shall focus on the autoencoder variant of these models for the purposes of this paper, but one can use other models such as the GBM.
Formally, the response of a mapping unit layer can be written
$$m = \sigma\big(W\,(U x^{(1)} \ast V x^{(2)})\big) \tag{2}$$
where $U$, $V$ and $W$ are parameter matrices, and where $\ast$ denotes element-wise multiplication. Further, $\sigma$ is an element-wise nonlinearity, such as the logistic sigmoid.
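The mapping-unit response described above (a nonlinearity applied to a weighted sum of products of filter responses) can be sketched in a few lines of NumPy. This is only an illustration: the matrix names `U`, `V`, `W`, the layer sizes and the random initialization are assumptions made for the example, not values from our experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def infer_mappings(x1, x2, U, V, W):
    """Mapping units: sigmoid of a weighted sum of filter-response products."""
    factors = (U @ x1) * (V @ x2)   # element-wise product of filter responses
    return sigmoid(W @ factors)     # pooling over products, then nonlinearity

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_in, n_factors, n_maps = 16, 32, 8
U = 0.1 * rng.standard_normal((n_factors, n_in))
V = 0.1 * rng.standard_normal((n_factors, n_in))
W = 0.1 * rng.standard_normal((n_maps, n_factors))
x1 = rng.standard_normal(n_in)
x2 = rng.standard_normal(n_in)

m = infer_mappings(x1, x2, U, V, W)
```

With a logistic sigmoid, every component of `m` lies strictly between 0 and 1.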
Given mapping unit activations, $m$, as well as the first image, the second image can be reconstructed by applying the transformation encoded in $m$ as follows (Memisevic, 2011):
$$\tilde{x}^{(2)} = V^{\mathrm{T}}\big(U x^{(1)} \ast W^{\mathrm{T}} m\big) \tag{3}$$
As the model is symmetric, we can likewise define the reconstruction of the first image given the second as:
$$\tilde{x}^{(1)} = U^{\mathrm{T}}\big(V x^{(2)} \ast W^{\mathrm{T}} m\big) \tag{4}$$
from which one obtains the reconstruction error
$$\mathcal{L} = \|x^{(1)} - \tilde{x}^{(1)}\|^2 + \|x^{(2)} - \tilde{x}^{(2)}\|^2 \tag{5}$$
for training. It can be shown that minimizing reconstruction error on image pairs will turn each row in $U$ and the corresponding row in $V$ into a pair of phase-shifted filters. Together the filters span the invariant subspaces of the transformation class inherent in the training pairs with which the model was trained. As a result, each component of $m$ is tuned to a phase-delta after learning, and it is independent of the absolute phase of each image (Memisevic, 2013).
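As a rough sketch of the symmetric reconstruction objective, the following NumPy snippet infers mappings from an image pair, reconstructs each image from the other one and the mappings, and sums the two squared errors. The parameter names, sizes and random values are illustrative assumptions; no training is performed here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gae_reconstruction_error(x1, x2, U, V, W):
    """Symmetric squared reconstruction error of a gated autoencoder."""
    f1, f2 = U @ x1, V @ x2
    m = sigmoid(W @ (f1 * f2))      # mapping units for the pair
    fm = W.T @ m                    # mappings projected back onto the factors
    x2_rec = V.T @ (f1 * fm)        # reconstruct second image from first + mappings
    x1_rec = U.T @ (f2 * fm)        # reconstruct first image from second + mappings
    return float(np.sum((x1 - x1_rec) ** 2) + np.sum((x2 - x2_rec) ** 2))

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((32, 16))
V = 0.1 * rng.standard_normal((32, 16))
W = 0.1 * rng.standard_normal((8, 32))
x1 = rng.standard_normal(16)
x2 = rng.standard_normal(16)

err = gae_reconstruction_error(x1, x2, U, V, W)
```

In practice the gradient of this error with respect to the three matrices would be obtained by backpropagation; that step is omitted from the sketch.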
A quite natural extension of the concept of relational features can be motivated by looking at relational models as performing a kind of first-order Taylor approximation of the input sequence, where the hidden representation models the partial first-order derivatives of the inputs with respect to time. Based on this view, we propose an approach that exploits correlations between subsequent sequence elements to model a dynamical system which approximates the sequence. This is a very different way to address long-range correlations than assuming memory units that explicitly keep state (Hochreiter & Schmidhuber, 1997). Instead, here we assume that there is structure in the temporal evolution of the input stream and we focus on capturing this structure.
As an intuitive example, consider a video that is known to be a sinusoidal signal, but with unknown frequency, phase and motion direction. The complete video can be specified exactly by the first three seed images. Therefore, given these three images, we would in principle be able to predict the rest of the video ad infinitum.
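The sinusoid example can be checked numerically. A sampled sinusoid $x_t = A\sin(\omega t + \varphi)$ satisfies the second-order recurrence $x_{t+1} = 2\cos(\omega)\,x_t - x_{t-1}$, so three seed frames suffice to estimate $\cos(\omega)$ and to extrapolate. The sketch below is a hand-written illustration of this fact under assumed values (frequency, number of pixels, per-pixel phases); it is not the learned model.

```python
import numpy as np

def extrapolate_sinusoid(x0, x1, x2, n_steps):
    """Continue a sinusoidal 'video' from three seed frames.

    Uses x[t+1] = 2*cos(omega)*x[t] - x[t-1]; cos(omega) is estimated
    from the seeds by least squares over all pixels.
    """
    c = np.dot(x1, x0 + x2) / (2.0 * np.dot(x1, x1))   # estimate of cos(omega)
    frames = [x0, x1, x2]
    for _ in range(n_steps):
        frames.append(2.0 * c * frames[-1] - frames[-2])
    return np.stack(frames)

# Hypothetical travelling wave: each of 10 pixels has its own phase.
omega = 0.3
phases = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
t = np.arange(20)[:, None]
video = np.sin(omega * t + phases[None, :])

pred = extrapolate_sinusoid(video[0], video[1], video[2], n_steps=17)
```

The extrapolated frames in `pred` agree with `video` up to floating-point error, which mirrors the claim that the seed frames determine the rest of the sequence.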
The first-order partial derivative of a multidimensional discrete-time dynamical system describes the correspondences between state vectors at subsequent time steps. Relational feature learning applied to subsequent elements of a sequence can be viewed as a way to learn these derivatives, suggesting that we may model higher-order partial derivatives with higher-order relational features.
We model second-order derivatives by cascading relational features in a pyramid as depicted in Figure 3 (images taken from the NORB data set described in LeCun et al., 2004). Given a sequence of inputs $x^{(t-1)}, x^{(t)}, x^{(t+1)}$, first-order relational features $m_1^{(t-1:t)}$ describe the transformations between two subsequent inputs $x^{(t-1)}$ and $x^{(t)}$. Second-order relational features $m_2^{(t-1:t+1)}$ describe correspondences between two first-order relational features $m_1^{(t-1:t)}$ and $m_1^{(t:t+1)}$, modeling the analog of the partial second-order derivatives of the inputs with respect to time. The experiments with two layers of relational features reported below support this view.
We implement a higher-order gated autoencoder (HGAE) using the following modular approach. The second-order HGAE is constructed using two GAE modules, one that relates inputs and another that relates mappings of the first GAE.
The first-layer GAE instance models correspondences between input pairs using filter matrices $U_1$, $V_1$ and $W_1$ (the subscript index refers to the layer). Using the first-layer GAE, mappings $m_1^{(1:2)}$ and $m_1^{(2:3)}$ for overlapping input pairs $(x^{(1)}, x^{(2)})$ and $(x^{(2)}, x^{(3)})$ are inferred and this pair of first-layer mappings is used as input for a second GAE instance. This second GAE models correspondences between the mappings of the first layer using filter matrices $U_2$, $V_2$ and $W_2$.
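For the two-layer model, inference amounts to computing first- and second-order mappings according to
$$m_1^{(1:2)} = \sigma\big(W_1\,(U_1 x^{(1)} \ast V_1 x^{(2)})\big) \tag{6}$$
$$m_1^{(2:3)} = \sigma\big(W_1\,(U_1 x^{(2)} \ast V_1 x^{(3)})\big) \tag{7}$$
$$m_2^{(1:3)} = \sigma\big(W_2\,(U_2 m_1^{(1:2)} \ast V_2 m_1^{(2:3)})\big) \tag{8}$$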
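Two-layer inference can be sketched by reusing one GAE inference routine at both levels: once per overlapping frame pair in the first layer, and once on the resulting pair of mappings in the second layer. Layer sizes, the helper `init_layer` and the random parameters are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def infer(a, b, U, V, W):
    """GAE mapping units relating input a to input b."""
    return sigmoid(W @ ((U @ a) * (V @ b)))

def hgae_infer(x1, x2, x3, layer1, layer2):
    """First-order mappings for (x1, x2) and (x2, x3), plus their second-order mapping."""
    m1_12 = infer(x1, x2, *layer1)
    m1_23 = infer(x2, x3, *layer1)
    m2_13 = infer(m1_12, m1_23, *layer2)   # relates the two first-layer mappings
    return m1_12, m1_23, m2_13

def init_layer(rng, n_in, n_factors, n_maps):
    """Hypothetical random initialization of one GAE module (U, V, W)."""
    return (0.1 * rng.standard_normal((n_factors, n_in)),
            0.1 * rng.standard_normal((n_factors, n_in)),
            0.1 * rng.standard_normal((n_maps, n_factors)))

rng = np.random.default_rng(0)
layer1 = init_layer(rng, n_in=16, n_factors=32, n_maps=8)
layer2 = init_layer(rng, n_in=8, n_factors=12, n_maps=4)   # input = first-layer mappings
x1, x2, x3 = rng.standard_normal((3, 16))

m1_12, m1_23, m2_13 = hgae_infer(x1, x2, x3, layer1, layer2)
```

Note that the input dimensionality of the second module equals the number of first-layer mapping units.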
Cascading GAE modules in this way can also be motivated from the perspective of orthogonal transformations as subspace rotations. As stated in Memisevic (2013), summing over filter-response products can yield transformation detectors which are invariant to the initial phase of the transformation and also partially invariant to the content of the images. The relative rotation angle (phase delta) between two projections is itself an angle, and the relation between two such angles can be viewed as an “angular acceleration”.
In contrast to the standard two-frame model, in this model reconstruction error is not directly applicable (although a naive way to train the model is to minimize reconstruction error for each pair of adjacent nodes in each layer). However, there is a more natural way to train the model if training data forms a sequence, as we discuss next.
Given the first two frames of a sequence, $x^{(1)}$ and $x^{(2)}$, one can use the GAE to compute a prediction of the third frame $x^{(3)}$ as follows. First, mappings $m^{(1:2)}$ are inferred from $x^{(1)}$ and $x^{(2)}$ (see Equation 2) and then used to compute a prediction $\tilde{x}^{(3)}$ by applying the inferred transformation $m^{(1:2)}$ to frame $x^{(2)}$. Applying the transformation amounts to computing:
$$\tilde{x}^{(3)} = V^{\mathrm{T}}\big(U x^{(2)} \ast W^{\mathrm{T}} m^{(1:2)}\big) \tag{9}$$
This prediction of $x^{(3)}$ will be a good prediction under the assumption that the frame-to-frame transformations from $x^{(1)}$ to $x^{(2)}$ and from $x^{(2)}$ to $x^{(3)}$ are approximately the same, in other words if transformations themselves are assumed to be approximately constant in time.
In this case, one can train the GAE to minimize the prediction error
$$\mathcal{L} = \|\tilde{x}^{(3)} - x^{(3)}\|^2 \tag{10}$$
instead of minimizing the reconstruction error in Equation 5. This type of supervised training objective, in contrast to the standard GAE objective, can also guide the mapping representation to be invariant to image content, because encoding the content of $x^{(2)}$ will not in general help predicting $x^{(3)}$.
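A minimal sketch of single-step prediction: the mapping inferred from the first two frames is applied to the second frame, and the squared distance to the true third frame is the training signal. Names, sizes and random values are assumptions for illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_next(x1, x2, U, V, W):
    """Infer the transformation x1 -> x2 and apply it to x2."""
    m = sigmoid(W @ ((U @ x1) * (V @ x2)))
    return V.T @ ((U @ x2) * (W.T @ m))

def prediction_error(x1, x2, x3, U, V, W):
    """Squared error between the predicted and the observed third frame."""
    return float(np.sum((predict_next(x1, x2, U, V, W) - x3) ** 2))

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((32, 16))
V = 0.1 * rng.standard_normal((32, 16))
W = 0.1 * rng.standard_normal((8, 32))
x1, x2, x3 = rng.standard_normal((3, 16))

loss = prediction_error(x1, x2, x3, U, V, W)
```

Unlike the reconstruction objective, the target frame `x3` never enters the inference step, so the mapping cannot help the prediction by copying content from the target.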
When the assumption of constancy of the transformations is violated, we can use a higher layer to model how transformations themselves change over time. This will require a farther lookahead for predictive training which we discuss in the following.
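One can iterate the inference-prediction process to look more than one frame ahead in time. To compute a prediction $\tilde{x}^{(4)}$ one infers mappings from $x^{(2)}$ and $\tilde{x}^{(3)}$:
$$m^{(2:3)} = \sigma\big(W\,(U x^{(2)} \ast V \tilde{x}^{(3)})\big) \tag{11}$$
and computes the prediction
$$\tilde{x}^{(4)} = V^{\mathrm{T}}\big(U \tilde{x}^{(3)} \ast W^{\mathrm{T}} m^{(2:3)}\big) \tag{12}$$
Then mappings can be inferred again from $\tilde{x}^{(3)}$ and $\tilde{x}^{(4)}$ to compute a prediction of $x^{(5)}$, and so on.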
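The iterated inference-prediction loop can be sketched as a simple rollout: each new prediction is appended to the sequence and the last two frames (observed or predicted) are used for the next inference step. The function and parameter names, sizes and random values are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def infer(a, b, U, V, W):
    """Mapping units relating frame a to frame b."""
    return sigmoid(W @ ((U @ a) * (V @ b)))

def apply_mapping(a, m, U, V, W):
    """Apply the transformation encoded in m to frame a."""
    return V.T @ ((U @ a) * (W.T @ m))

def gae_rollout(x1, x2, U, V, W, n_steps):
    """Predict n_steps frames following the two seed frames x1, x2."""
    frames = [x1, x2]
    for _ in range(n_steps):
        prev, last = frames[-2], frames[-1]
        m = infer(prev, last, U, V, W)                   # transformation prev -> last
        frames.append(apply_mapping(last, m, U, V, W))   # assume it stays constant
    return np.stack(frames[2:])

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((32, 16))
V = 0.1 * rng.standard_normal((32, 16))
W = 0.1 * rng.standard_normal((8, 32))
x1, x2 = rng.standard_normal((2, 16))

preds = gae_rollout(x1, x2, U, V, W, n_steps=5)
```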
For the two-layer HGAE this amounts to the assumption that the second-order relational structure in the sequence changes slowly over time, and under this assumption we compute a prediction $\tilde{x}^{(t+2)}$ in two steps: First a prediction is made of the first-order relational features describing the correspondence between $x^{(t+1)}$ and $x^{(t+2)}$:
$$\tilde{m}_1^{(t+1:t+2)} = V_2^{\mathrm{T}}\big(U_2 m_1^{(t:t+1)} \ast W_2^{\mathrm{T}} m_2^{(t-1:t+1)}\big) \tag{13}$$
Using this prediction of the transformation between $x^{(t+1)}$ and $x^{(t+2)}$ the prediction $\tilde{x}^{(t+2)}$ is made as follows:
$$\tilde{x}^{(t+2)} = V_1^{\mathrm{T}}\big(U_1 x^{(t+1)} \ast W_1^{\mathrm{T}} \tilde{m}_1^{(t+1:t+2)}\big) \tag{14}$$
As with the GAE, one can predict multiple steps ahead in time using the HGAE by repeating the inference-prediction process on $x^{(t)}$, $x^{(t+1)}$ and $\tilde{x}^{(t+2)}$, i.e. by appending the prediction to the sequence and increasing $t$ by one.
The prediction process simply consists of iteratively computing predictions of the next lower level’s activations beginning from the top. To compute the top-level activations themselves, one needs a number of seed frames corresponding to the depth of the model. While two frames are sufficient to infer the transformations in the case of the GAE, three frames are required in the case of the two-layer model.
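The top-down prediction process for the two-layer model can be sketched as follows: from the last three frames, infer two first-order mappings and their second-order mapping, predict the next first-order mapping from the top, and then use it to predict the next frame. This is one possible reading of the procedure under assumed names, sizes and random parameters, not a reference implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def infer(a, b, U, V, W):
    """GAE mapping units relating a to b (used in both layers)."""
    return sigmoid(W @ ((U @ a) * (V @ b)))

def apply_mapping(a, m, U, V, W):
    """Apply the transformation encoded in m to a (used in both layers)."""
    return V.T @ ((U @ a) * (W.T @ m))

def hgae_rollout(seed, layer1, layer2, n_steps):
    """Predict n_steps frames following three seed frames."""
    frames = list(seed)
    for _ in range(n_steps):
        xa, xb, xc = frames[-3:]
        m1_ab = infer(xa, xb, *layer1)
        m1_bc = infer(xb, xc, *layer1)
        m2 = infer(m1_ab, m1_bc, *layer2)                   # second-order mapping
        m1_next = apply_mapping(m1_bc, m2, *layer2)         # predicted first-order mapping
        frames.append(apply_mapping(xc, m1_next, *layer1))  # predicted next frame
    return np.stack(frames[3:])

def init_layer(rng, n_in, n_factors, n_maps):
    """Hypothetical random initialization of one GAE module (U, V, W)."""
    return (0.1 * rng.standard_normal((n_factors, n_in)),
            0.1 * rng.standard_normal((n_factors, n_in)),
            0.1 * rng.standard_normal((n_maps, n_factors)))

rng = np.random.default_rng(0)
layer1 = init_layer(rng, n_in=16, n_factors=32, n_maps=8)
layer2 = init_layer(rng, n_in=8, n_factors=12, n_maps=4)
seed = rng.standard_normal((3, 16))

preds = hgae_rollout(seed, layer1, layer2, n_steps=4)
```

The predicted first-order mapping is used as is, without passing it through the nonlinearity again; whether to do so is a design choice this sketch leaves open.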
The models can be trained using backprop through time (Werbos, 1988) to compute the gradients of the $k$-step-ahead prediction error w.r.t. the parameters:
$$\mathcal{L} = \sum_{i=1}^{k} \|\tilde{x}^{(n+i)} - x^{(n+i)}\|^2 \tag{15}$$
where $n$ denotes the number of seed frames (two for the GAE, three for the HGAE).
In our experiments, we observed that starting with single-step prediction and iteratively increasing the number of prediction steps during training considerably stabilizes the dynamics of the model and helps to prevent explosions in the magnitude of the predictions.
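The multi-step objective and the gradual increase of the prediction horizon can be sketched for the single-layer case as below. The helper `steps_for_epoch` and its arguments are hypothetical: the schedule actually used (how many epochs per stage, the maximum horizon) is not implied by this snippet. Gradients would be obtained by backpropagating through the unrolled loop, for instance with an automatic differentiation library, which is omitted here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def k_step_loss(seq, U, V, W, k):
    """Summed squared error of k predictions after the seed frames seq[0], seq[1]."""
    frames = [seq[0], seq[1]]
    loss = 0.0
    for i in range(k):
        m = sigmoid(W @ ((U @ frames[-2]) * (V @ frames[-1])))
        pred = V.T @ ((U @ frames[-1]) * (W.T @ m))
        loss += float(np.sum((pred - seq[2 + i]) ** 2))
        frames.append(pred)   # later steps build on earlier predictions
    return loss

def steps_for_epoch(epoch, epochs_per_stage, max_steps):
    """Hypothetical curriculum: start at one step, add a step every stage."""
    return min(1 + epoch // epochs_per_stage, max_steps)

# Hypothetical sizes, parameters and data.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((32, 16))
V = 0.1 * rng.standard_normal((32, 16))
W = 0.1 * rng.standard_normal((8, 32))
seq = rng.standard_normal((6, 16))

loss_1 = k_step_loss(seq, U, V, W, k=1)
loss_3 = k_step_loss(seq, U, V, W, k=3)
```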
We tested and compared the models on videos with varying degrees of complexity, from synthetic constant to synthetic accelerated transformations to more complex real-world transformations.
For all data sets PCA whitening was used for dimensionality reduction, retaining around of the variance.
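PCA whitening with dimensionality reduction can be sketched as follows. The retained variance fraction is passed as a parameter; the value `0.9` and the toy data used below are placeholders for illustration and are not the settings of our experiments.

```python
import numpy as np

def pca_whiten(X, var_fraction):
    """Project rows of X onto the leading principal components and whiten them.

    Keeps the smallest number of components whose cumulative variance
    reaches var_fraction. Returns the whitened data, the mean and the projection.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / X.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigval)[::-1]              # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    cum = np.cumsum(eigval) / eigval.sum()
    k = min(int(np.searchsorted(cum, var_fraction)) + 1, len(eigval))
    P = eigvec[:, :k] / np.sqrt(eigval[:k])       # scale to unit variance
    return Xc @ P, mean, P

# Placeholder data with unequal variances per dimension.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) * np.linspace(3.0, 0.5, 10)

Z, mean, P = pca_whiten(X, var_fraction=0.9)
```

After whitening, the retained components have an identity covariance matrix on the data used to fit the transform.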
Predictive training of the HGAE only worked after layer-wise pretraining. We used gradient descent with a learning rate of and momentum . Without pretraining, the parameters did not converge to a useful configuration. The first-layer GAE was trained to reconstruct pairs of subsequent sequence elements (using the reconstruction error in Equation 5). Then pairs of mappings were computed on three subsequent inputs using the pretrained first-layer GAE. These mapping pairs were then used for reconstructive pretraining of the second-layer GAE.
To evaluate whether predictive training of the GAE yields better representations of transformations than training with the reconstruction objective, a classification experiment on videos showing artificially transformed natural images was performed. The patches were cropped from the Berkeley Segmentation data set (Martin et al., 2001). Two data sets with videos featuring constant velocity shifts (ConstShift) and rotations (ConstRot) were generated. The elements of the shift vectors for the ConstShift data set were sampled uniformly from the interval (in pixels). The rotation angles were sampled uniformly from the interval . Labels for the ConstShift data set were generated by dividing the shift vectors as shown in Figure 7. For ConstRot the angles were divided into equally sized bins. Both data sets were partitioned into a training set containing , a validation set containing and a test set containing sequences.
The numbers of filters and mapping units were chosen using a grid search. The setting with best performance on the validation set was filters and mapping units each for both training objectives and both data sets. The models were each trained for epochs using stochastic gradient descent (SGD) with a learning rate of and momentum . For the experiment the mappings of the first two inputs were used as input to a logistic regression classifier. The experiment was performed multiple times on both data sets and the mean classification accuracies are reported in Table 1. In all trials the GAE trained with step predictive training achieved a higher accuracy than the GAE trained on the reconstruction objective. This suggests that predictive training is able to generate a more explicit representation of transformations that is less affected by image content, in line with the argument given for the prediction objective above.
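Table 1: Mean classification accuracies (in %) of the GAE trained with the reconstruction objective (rec. training) and with the prediction objective (pred. training).

| Model | ConstRot | ConstShift |
|---|---|---|
| rec. training | 97.6 | 76.4 |
| pred. training | 98.2 | 79.4 |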
To test the hypothesis that the HGAE learns to model second-order correspondences in sequences, image sequences with accelerated shifts (AccShift) and rotations (AccRot) of natural image patches were generated. The patches were again cropped from the Berkeley Segmentation data set and artificially transformed with initial (angular) velocity and constant (angular) acceleration. The scalar angular accelerations were sampled uniformly from the interval degrees. The initial angular velocities were sampled from the same interval. To get labels for classification, the angular accelerations were divided into equally sized bins. For the accelerated shifts data set, elements of the velocity and acceleration vectors were sampled in the interval (in pixels). The discretization of acceleration vectors is the same as for the shift vectors in ConstShift (see Figure 7). The partition sizes are the same as for ConstRot and ConstShift.
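The construction of accelerated rotations and their class labels can be sketched as below: the per-frame rotation angle follows from an initial angular velocity and a constant angular acceleration, and labels are obtained by binning the acceleration. The sampling interval, the number of bins and the number of frames used here are placeholders, and the image rotation itself is omitted.

```python
import numpy as np

def accelerated_angles(v0, a, n_frames):
    """Cumulative rotation angle per frame.

    The angular velocity grows by a each frame, i.e.
    theta[t+1] = theta[t] + (v0 + a * t), with theta[0] = 0.
    """
    t = np.arange(n_frames)
    return v0 * t + 0.5 * a * t * (t - 1)

def acceleration_label(a, low, high, n_bins):
    """Index of the equally sized bin of [low, high] that contains a."""
    edges = np.linspace(low, high, n_bins + 1)[1:-1]   # interior bin edges
    return int(np.digitize(a, edges))

# Placeholder values, not the settings of our experiments.
rng = np.random.default_rng(0)
low, high, n_bins = -1.0, 1.0, 4
v0, a = rng.uniform(low, high, size=2)

angles = accelerated_angles(v0, a, n_frames=6)
label = acceleration_label(a, low, high, n_bins)
```

The second differences of `angles` are constant and equal to the acceleration, which is exactly the quantity the second-layer mappings are meant to capture.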
The number of filters and mapping units was set to and , respectively (after performing a grid search). After pretraining the HGAE was trained with gradient descent using a learning rate of and momentum of , first for epochs on single-step prediction and then epochs on two-step prediction.
After training, first- and second-layer mappings were inferred from the first three frames of the test sequences. The classification accuracies using logistic regression with second-layer mappings of the HGAE ($m_2^{(1:3)}$) as descriptor, using the individual first-layer mappings ($m_1^{(1:2)}$ and $m_1^{(2:3)}$), and using the concatenation of both first-layer mappings are reported in Table 2 for both data sets (before and after predictive fine-tuning).
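Table 2: Classification accuracies (in %) using first-layer mappings, their concatenation, and second-layer mappings as descriptors, before (pretrained) and after (finetuned) predictive training.

| Training stage | Descriptor | AccRot | AccShift |
|---|---|---|---|
| pretrained | $m_1^{(1:2)}$ | 19.4 | 20.6 |
| pretrained | $m_1^{(2:3)}$ | 30.9 | 33.3 |
| pretrained | $(m_1^{(1:2)}, m_1^{(2:3)})$ | 64.9 | 38.4 |
| pretrained | $m_2^{(1:3)}$ | 53.7 | 63.4 |
| finetuned | $m_1^{(1:2)}$ | 18.1 | 20.9 |
| finetuned | $m_1^{(2:3)}$ | 29.3 | 34.4 |
| finetuned | $(m_1^{(1:2)}, m_1^{(2:3)})$ | 74.0 | 42.7 |
| finetuned | $m_2^{(1:3)}$ | 74.4 | 80.6 |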
The second-layer mappings achieved a significantly higher accuracy for both data sets after predictive training. For the AccRot data set, the concatenation of first-layer mappings performed better than the second-layer mappings before fine-tuning, which may be because the angular acceleration data is based on a one-parameter transformation and is thus simpler than the shift acceleration data, which is based on a two-parameter transformation. Predictive fine-tuning also helped improve the intermediate representation, as can be observed by the increase in accuracy for the concatenation of the first-layer mappings.
These results show that the second layer of the HGAE can build a much better representation of the second-order relational structure in the data than the single-layer GAE model. They further show that predictive training improves the capability of both models and is crucial for the two-layer model to work well.
In this experiment we test the capability of the models to predict previously unseen sequences multiple steps into the future. This allows us to assess to what degree modeling second-order “derivatives” makes it possible to capture the temporal evolution without resorting to an explicit representation of a hidden state. After training, test sequences were generated by seeding the models with two (GAE) or three (HGAE) seed frames. Figure 6 shows some of the filter pairs learned by the HGAE on different data sets after predictive training.
Figures 8 and 9 show predictions with the HGAE model on the accelerated-transformation data sets (AccRot and AccShift) after different stages of training. As can be seen in the figures, the accuracy of the predictions increases significantly with multi-step training.
The NORBvideos data set introduced in (Memisevic & Exarchakis, 2013) contains videos of objects from the NORB dataset (LeCun et al., 2004). These are objects divided into classes (four-legged animals, human figures, airplanes, trucks and cars), each with instances. The frames of each video from the NORBvideos data show incrementally changed viewpoints of one of the objects. We trained our sequence learning models on this data, using the authors’ original split: all videos of objects from instances are in the training set and instance objects are in the test set. This yields training examples and test examples. The frame size is and the videos are frames long. The GAE and the HGAE model were trained on the multi-step prediction task with a learning rate of and momentum . Both models used features and mapping units (per layer). The test performance of the GAE model seemed to stop improving at features, while the HGAE was able to make use of the additional parameters.
Figure 10 shows predictions made by both models. The HGAE manages to generate predictions that correctly reflect the 3D structure in the data. In contrast to the GAE model it is much better at extrapolating the observed transformations. Note that seed frames are from test data.
Due to the large input dimensionality and the low number of training samples, a few of the filters shown in Figure 6(d) seem to be overfitting on the training data, while many others are localized Gabor-like features.
We also trained the HGAE on the bouncing balls data set (the training and test sequences were generated using the script released with Sutskever et al., 2008) to see whether the HGAE captures the highly nonlinear dynamics of this data set. The number of features was set to and the number of mappings to . Figure 11 shows two predictions on test data. The predictions show that the second-order model is able to correctly capture reflections on the boundaries and the other balls, and makes consistent predictions over in some cases up to around frames.
A major long-standing problem in sequence modeling is how to deal with long-range correlations. It has been proposed that deep learning may help address this problem by finding representations that better capture the abstract, semantic content of the inputs (Bengio, 2009). In this work we propose learning representations with the explicit goal to enable the prediction of the temporal evolution of the input stream multiple time steps ahead. Thus we seek a hidden representation that captures exactly those aspects of the input data which allow us to make predictions about the future.
It is interesting to note that predictive training can also be viewed as an analogy-making task (Memisevic & Hinton, 2010). It amounts to inferring the transformation that takes frame $t$ to frame $t+1$ and applying it to a new observation at time $t+1$ or later. The difference is that in a genuine analogy-making task, the target image may be unrelated to the source image pair, whereas here target and source are related. It would be interesting to apply the model to word representations, or language in general, as this is a domain where both sequentially structured data and analogical relationships between data points play a crucial role (e.g. Mikolov et al., 2013).
Acknowledgments This work was supported by the German Federal Ministry of Education and Research (BMBF) in project 01GQ0841 (BFNT Frankfurt), by an NSERC Discovery grant and by a Google faculty research award.
References
 Bengio (2009) Bengio, Yoshua. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book. Now Publishers, 2009.
 Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 LeCun et al. (2004) LeCun, Yann, Huang, Fu Jie, and Bottou, Leon. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II–97. IEEE, 2004.
 Martin et al. (2001) Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pp. 416–423, July 2001.
 Memisevic & Hinton (2010) Memisevic, R. and Hinton, G.E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, 2010.
 Memisevic (2011) Memisevic, Roland. Gradient-based learning of higher-order image features. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 1591–1598. IEEE, 2011.
 Memisevic (2013) Memisevic, Roland. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.
 Memisevic & Exarchakis (2013) Memisevic, Roland and Exarchakis, Georgios. Learning invariant features by harnessing the aperture problem. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), 2013.
 Memisevic & Hinton (2007) Memisevic, Roland and Hinton, Geoffrey. Unsupervised learning of image transformations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
 Mikolov et al. (2013) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
 Sutskever et al. (2008) Sutskever, Ilya, Hinton, Geoffrey E, and Taylor, Graham W. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pp. 1601–1608, 2008.
 Sutskever et al. (2011) Sutskever, Ilya, Martens, James, and Hinton, Geoffrey. Generating text with recurrent neural networks. In Proceedings of the 2011 International Conference on Machine Learning (ICML-2011), 2011.
 Taylor & Hinton (2009) Taylor, Graham and Hinton, Geoffrey. Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the 26th International Conference on Machine Learning, Montreal, June 2009. Omnipress.
 Werbos (1988) Werbos, Paul J. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.