# Linearizing Visual Processes with Convolutional Variational Autoencoders

## Abstract

This work studies the problem of modeling non-linear visual processes by learning linear generative models from observed sequences. We propose a joint learning framework, combining a Linear Dynamic System and a Variational Autoencoder with convolutional layers. After discussing several conditions for linearizing neural networks, we propose an architecture that allows Variational Autoencoders to simultaneously learn the non-linear observation as well as the linear state-transition from a sequence of observed frames. The proposed framework is demonstrated experimentally in three series of synthesis experiments.

Technical Report

## 1 Introduction

While classification of images and videos with Convolutional Neural Networks (CNNs) is becoming an established practice, unsupervised learning and generative modeling remain challenging problems in deep learning. A successful generative model of a visual process makes it possible to generate sequences of video frames whose appearance and dynamics approximately resemble the original training process without copying it. This procedure is typically referred to as video generation [1, 2] or video synthesis [3]. More technically, this means that in addition to a suitable probability model for the individual frames, a probabilistic description of the frame-to-frame transition is also necessary. Analysis and reproduction of visual processes simplify considerably if this transition can be assumed to be a linear function. For instance, linear transformations are easily invertible, and spectral analysis reveals how successive applications of the same transformation behave in the long term.

Unfortunately, most frame transitions in real-world visual processes are unlikely to be linear functions. Nevertheless, unsupervised learning has come up with many approaches to fit linear transition models to real-world processes, for instance by using linear low-rank [4] or sparse approximations of the frames [5], or by applying the kernel trick to them [6].

The success of Generative Adversarial Networks (GANs) [7] and Variational Autoencoders (VAEs) [8] has led to an increased interest in deep generative learning, and it seems natural to apply such techniques to sequential processes. We approach this idea from the perspective of *linearization* in order to keep the model as simple as possible. Analogous to how physicists transform non-linear differential equations into linear ones by means of an appropriate change of variables, our approach is to *learn* a latent representation of visual processes such that the latent state-to-state transition can be described by a linear model. To this end, we *jointly* learn a non-linear observation function and a linear state transition function by means of a modified VAE.

## 2 Related Work

Similarly to our work, the authors of [9] combine Linear Dynamic Systems (LDSs) with VAEs.
However, the focus of their work is on control rather than on synthesis. Furthermore, their model is *locally* linear and the transition distribution is modeled in the variational bound, whereas we model it as a separate layer. This is also the main difference from the work in [10], where VAEs are combined with linear dynamic models for forecasting images in video sequences, and from [11], in which VAEs are used as Kalman filters.

The work [12] deals with linearizing transformations under uncertainty via neural networks. It resembles this work in that we also focus on *representation learning* rather than on a particular application. However, unlike our work, it does not employ VAEs. Theoretical groundwork regarding learned visual transformations has been done in [13, 14, 15] and [16]. More generally, the synthesis of video dynamics by means of neural networks has been discussed, among others, in [17] and [18].

Finally, the core contribution of this work is a combination of neural networks with Markov processes. This has been the subject of many works in the recent past. For a broad overview of results in this field, the reader is referred to Chapter 20 of [19].

## 3 Visual Processes and Linearization

### 3.1 Dynamic Systems

Dynamic textures [4] have popularized LDSs in the modeling of visual processes. Typically, an LDS is of the following form

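$$
h_{t+1} = A h_t + v_t, \qquad y_t = \bar{y} + C h_t + w_t, \tag{1}
$$

where $h_t \in \mathbb{R}^n$ is the low-dimensional *state space variable* at time $t$, $A \in \mathbb{R}^{n \times n}$ the *state transition matrix*, $y_t \in \mathbb{R}^d$ the observation at time $t$, and $C \in \mathbb{R}^{d \times n}$ the *observation matrix*. The vector $\bar{y} \in \mathbb{R}^d$ represents a constant offset in the observation space. The input terms $v_t$ and $w_t$ are modeled as zero-mean i.i.d. Gaussian noise and are independent of $h_t$.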
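To make the roles of the terms in (1) concrete, the following is a minimal NumPy sketch of a roll-out of such a system. The names `A`, `C`, `y_bar`, the damped rotation used as transition matrix and the noise levels are illustrative assumptions, not values from this report.

```python
import numpy as np

def simulate_lds(A, C, y_bar, h0, steps, state_noise_std, obs_noise_std, rng):
    """Roll out h_{t+1} = A h_t + v_t and y_t = y_bar + C h_t + w_t."""
    n, d = A.shape[0], C.shape[0]
    h = h0.copy()
    observations = np.empty((steps, d))
    for t in range(steps):
        w = obs_noise_std * rng.standard_normal(d)    # observation noise w_t
        observations[t] = y_bar + C @ h + w
        v = state_noise_std * rng.standard_normal(n)  # process noise v_t
        h = A @ h + v
    return observations

rng = np.random.default_rng(0)
angle = 0.1
# Hypothetical transition matrix: a rotation damped by the factor 0.95.
A = 0.95 * np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
C = rng.standard_normal((5, 2))   # hypothetical observation matrix
y_bar = np.ones(5)                # hypothetical constant offset
frames = simulate_lds(A, C, y_bar, np.array([1.0, 0.0]), 50, 0.1, 0.01, rng)

# The eigenvalue magnitudes of A govern the long-term behaviour of the states.
spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
```

Because the transition is linear, the long-term behaviour can be read off the spectrum of `A`: here all eigenvalues have magnitude 0.95, so the noise-free state decays slowly while rotating.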

The simplicity of state transition in the model (1) enables straightforward prediction, generation, and analysis of observations or synthesis. For real-world visual processes that are often highly non-linear, it is therefore of great interest to find a model that linearizes the underlying process, so that in some latent state space representation, the state transition admits both the linearity and Gaussianity as depicted in Eq. (1). Specifically, this work focuses on the following non-linear dynamic system model, i.e., a linear state transition and a non-linear observation mapping

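$$
h_{t+1} = A h_t + v_t, \qquad y_t = C(h_t) + w_t, \tag{2}
$$

where $C \colon \mathbb{R}^n \to \mathbb{R}^d$ is assumed to be nonlinear in the rest of the paper. For algorithmic reasons, we assume that $w_t$ is drawn from an isotropic Gaussian distribution, i.e.,

$$
w_t \sim \mathcal{N}(0, \sigma^2 I). \tag{3}
$$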

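Note that the model in Eq. (2) is not unique with respect to changes of basis in the state space [20]. Let $T \in \mathbb{R}^{n \times n}$ be a full-rank matrix. We define the following substitution

$$
\tilde{h}_t = T h_t. \tag{4}
$$

Then the following system is equivalent to (2)

$$
\tilde{h}_{t+1} = T A T^{-1} \tilde{h}_t + T v_t, \qquad y_t = C(T^{-1} \tilde{h}_t) + w_t. \tag{5}
$$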

Specifically, given one visual process described by (2) with , one can define an equivalent system via the transformations

$$
A \mapsto T A T^{-1}, \qquad C \mapsto C \circ T^{-1}, \qquad v_t \mapsto T v_t. \tag{6}
$$

If $C$ is implemented via a neural network, we can ensure that it accounts for a possible change of basis. Therefore, without loss of generality, we propose the following assumption on the latent samples $h_t$.
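The equivalence under a change of basis can be checked numerically. The sketch below is an illustration under our own assumptions: the observation map is a hypothetical `tanh` layer, the basis change `T` is an arbitrary unit upper-triangular matrix (hence invertible), and the same process noise is fed to both systems.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 4
A = 0.5 * rng.standard_normal((n, n))   # hypothetical state transition matrix
W = rng.standard_normal((d, n))

def observe(h):
    """Hypothetical non-linear observation mapping."""
    return np.tanh(W @ h)

# Unit upper-triangular matrices have determinant 1 and are always invertible.
T = np.eye(n) + np.triu(rng.standard_normal((n, n)), k=1)
T_inv = np.linalg.inv(T)

A_tilde = T @ A @ T_inv                 # transition matrix in the new basis

def observe_tilde(h_tilde):
    """Observation mapping composed with the inverse change of basis."""
    return observe(T_inv @ h_tilde)

h = rng.standard_normal(n)
h_tilde = T @ h
max_gap = 0.0
for _ in range(5):
    v = rng.standard_normal(n)
    h = A @ h + v
    h_tilde = A_tilde @ h_tilde + T @ v  # the noise is transformed as well
    gap = np.max(np.abs(observe(h) - observe_tilde(h_tilde)))
    max_gap = max(max_gap, gap)
```

Both systems emit the same observations up to floating-point error, so the data alone cannot distinguish the two parametrizations.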

###### Assumption 1.

The latent samples $h_t$ follow an i.i.d. standard normal distribution, i.e.,

$$
h_t \sim \mathcal{N}(0, I). \tag{7}
$$

###### Remark 1.

If the state transition matrix $A$ is given, and the latent samples $h_t$ are assumed to be stationary, then Assumption 1 essentially identifies the process noise model. Namely, we have

$$
I = A A^\top + \operatorname{Cov}(v_t), \tag{8}
$$

and in order to make sure that the latent states remain Gaussian in sequential synthesis scenarios, i.e., $h_{t+1} \sim \mathcal{N}(0, I)$, we just need to ensure that the process noise $v_t$ is zero-mean and has the covariance matrix $I - A A^\top$.
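The remark can be verified by propagating the state covariance through the transition. The sketch below assumes a hypothetical transition matrix (a scaled orthogonal matrix, so that $I - AA^\top$ is positive definite) and is only meant to illustrate the covariance bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
# Hypothetical transition matrix: an orthogonal matrix scaled by 0.8,
# so all singular values are below one and I - A A^T is positive definite.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = 0.8 * Q

noise_cov = np.eye(n) - A @ A.T      # process noise covariance from the remark

# Propagate Cov(h_{t+1}) = A Cov(h_t) A^T + Cov(v_t), starting from Cov(h_0) = I.
state_cov = np.eye(n)
for _ in range(10):
    state_cov = A @ state_cov @ A.T + noise_cov
```

The state covariance stays at the identity in every step, which is the stationarity that Assumption 1 requires. If a singular value of `A` exceeded one, `noise_cov` would no longer be a valid covariance matrix.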

### 3.2 Linearizability of Non-linear Visual Transformations

The purpose of this subsection is to justify our aim to learn a linear state-transition model (2) from a conceptual point of view and provide cues on how to choose a neural network architecture for linearizing visual transformations.

In the model description (2), the observation mapping is modeled by a non-linear function . In what follows, we aim to show the feasibility of the linear model on the state transitions. Let us consider a visual transformation of observations in by . Such transformations are in general very difficult to model, and hence it is very unlikely to find a global observation mapping such that the respective transformation in the latent space can be exactly modeled by a linear transition. Here, we firstly form the notion of local linearization of a nonlinear self map.

###### Definition 1.

Let , be a continuous self map, and be a local diffeomorphism at all . The map is said to be a local linearizer of at , if there exists a matrix such that the following equality holds true with

(9) |

Here, the map behaves as a chart of the data manifold . If and is a fixed point of , then it is obvious that the map is linearized by at . In general, the map cannot be guaranteed to have a fixed point. Nevertheless, by the Brouwer’s fixed-point theorem, we propose the following assumption to ensure existence of local linearizability of .

###### Assumption 2.

The set is compact and convex, and is a continuous self map, i.e., the map has at least a fixed point .

This assumption can be easily justified by applications in image/video processing, where images lie in some hyper-cube, e.g. . It has some resemblance to control theory, where linearization of non-linear dynamical systems can be carried out around equilibrium points [21]. The following proposition thus makes the assumption of existence of to characterize neural networks that locally linearize transformations.

###### Proposition 1.

Let , be a continuous self map, and be a local diffeomorphism at all . If is a fixed point of , then the following map

(10) |

locally linearizes .

###### Proof.

Since is a diffeomorphism, is differentiable. We denote by the Jacobian matrix of at and Taylor’s theorem yields

(11) |

Knowing that is a fixed point of , we can rewrite the expression by substituting as

(12) |

We define and substitute

(13) |

into (12). This yields

(14) |

which finalizes the proof. ∎

The error term in (11) is driven by the curvature of around . Incidentally, the authors of [12] also achieve linearization by penalizing curvature.

Essentially, Proposition 1 suggests including a bias in
the first layer of a linearizing neural network that tries to implement .
However, in general this is not enough to achieve low linearization error *globally*. In fact, a neural network consisting of one single affine layer suffices to locally linearize an appropriate transformation .
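A one-dimensional toy case may help to see why the bias matters. It is our own illustration and rests on two assumptions about the statements above: that local linearization at $y^*$ means $\psi(\varphi(y)) = A\,\psi(y)$ up to a residual that vanishes faster than $|y - y^*|$, and that the map in (10) subtracts the value of the chart at the fixed point. Take the continuous self map $\varphi(y) = y^2$ on the compact, convex set $[0, 1]$, the identity chart $\psi(y) = y$, and the fixed point $y^* = 1$. Then

$$
\tilde{\psi}(y) = \psi(y) - \psi(y^*) = y - 1, \qquad \tilde{\psi}(\varphi(y)) = y^2 - 1 = 2\,(y - 1) + (y - 1)^2 = 2\,\tilde{\psi}(y) + O(|y - y^*|^2),
$$

so the shifted chart linearizes $\varphi$ at $y^*$ with $A = 2$. Without the shift, the residual $y^2 - A y$ only vanishes at $y = 1$ for $A = 1$, and it then equals $y\,(y - 1)$, which is of first order in $|y - y^*|$. In this toy case, a single affine layer $y \mapsto y - 1$ is therefore sufficient, whereas a purely linear one is not.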

### 3.3 Linearization via CNNs

In this subsection, we discuss several additional heuristics for linearization, and argue that employing convolutional layers is a suitable choice.

We start by observing that CNNs are capable of representing data in a way such that it is almost invariant to certain classes of transformations in the data [22, 23]. In other words, a transformation applied to a data sample does not greatly displace its representation.
For illustration purposes, let denote an image depicting an object,
and an image depicting the same object, deformed by applying certain forces to it. Due to the curse of dimensionality, the application of can lead to a significant displacement of the pixel representation in the Euclidean space, but analysis of simplified CNNs with fixed filter weights and absolute value or ReLU activation functions, so-called *Scattering transforms*, has shown that it is possible to find a representation that is contracting with respect to spatial deformations of images . More specifically, in [22] a deformation is described as a warping of spatial coordinates. For such deformations, a bound was derived such that

(15) |

holds, if is implemented by a Scattering transform. Even though the discussion in [22] is limited to deformations, it is generally assumed in [24] that approximately invariant representations with respect to much broader classes of transformations can be learned by CNNs. The smaller the contraction constant , the more regularity is introduced to the data, with respect to the linearizability of . To see this, we introduce the minimal expected linearization error , which measures how well a transformation can be modeled by a multiplication with a matrix as follows

(16) |

The following inequality easily follows

(17) |

Note however, that the measure does not account for how much expands or shrinks its input. The contraction constant , for instance, was derived for approximately norm-preserving functions [22].

To summarize, we can hope to linearize broad classes of transformations, given the right neural network architecture. In particular, the preceding discussion suggests auto-encoders with (almost everywhere) differentiable activation functions to account for the diffeomorphism property in Proposition 1, and input layers with bias to account for (10). Due to the contraction properties expected from CNNs, it seems natural to employ convolutional layers and ReLU activations for both the encoder and the decoder.

Until now, we discussed heuristics for the choice of architecture, but the advantage of neural networks is that we can make design goals like linearizability explicit by formulating an appropriate loss function. We tackle this problem from a stochastic perspective by constraining the joint probability distribution of succeeding samples in the latent space.

## 4 Variational Autoencoders for Sequences

### 4.1 Variational Autoencoders: Review

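According to Assumption 1, the observation mapping $C$ transforms a standard normal distribution to the observation distribution $p(y)$, where for a latent sample $h$, the expected observation is $C(h)$ and the corresponding conditional probability distribution $p(y \mid h)$ is given by the noise model (3).

Conveniently, VAEs provide a framework to do just that. Let $h$ be a standard normal distributed random variable. Given a set of realizations $y_1, \dots, y_N$ of a random variable $y$ with the distribution $p(y)$, the objective of the VAE is to maximize the log-likelihood function

$$
\sum_{i=1}^{N} \log p_\theta(y_i) \tag{18}
$$

by learning a parametrized function $C_\theta$ that approximately transforms $h$ to $y$. In accordance with (3), we fix the following assumption,

$$
p_\theta(y \mid h) = \mathcal{N}\big(y \mid C_\theta(h),\, \sigma^2 I\big). \tag{19}
$$

Then, applying the expectation yields

$$
p_\theta(y) = \mathbb{E}_{h \sim \mathcal{N}(0, I)}\big[p_\theta(y \mid h)\big]. \tag{20}
$$

The parameter $\theta$ should thus maximize the term $\mathbb{E}_{h \sim \mathcal{N}(0, I)}[p_\theta(y \mid h)]$.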

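However, directly maximizing the expected value of (20) by standard Monte Carlo methods is infeasible for computational reasons [25]. Luckily, variational inference provides a lower bound for the likelihood function that can be optimized by stochastic gradient descent. Let $g_\vartheta$ be a parametrized, measurable function which maps from the observation space and a noise space to the codomain of $h$. Let the random variable

$$
\hat{h} = g_\vartheta(y, \epsilon), \qquad \epsilon \sim \mathcal{N}(0, I), \tag{21}
$$

have the probability density function $q_\vartheta(h \mid y)$. Let us consider the expression

$$
\log p_\theta(y) - D_{\mathrm{KL}}\big(q_\vartheta(h \mid y) \,\|\, p_\theta(h \mid y)\big), \tag{22}
$$

with $D_{\mathrm{KL}}$ denoting the Kullback-Leibler Divergence (KLD). Since the KLD is always nonnegative, the following inequality holds true

$$
\log p_\theta(y) \geq \log p_\theta(y) - D_{\mathrm{KL}}\big(q_\vartheta(h \mid y) \,\|\, p_\theta(h \mid y)\big). \tag{23}
$$

We can rewrite the KLD as an expected value. Since $\log p_\theta(y)$ is not a random variable, it is not affected by the expected value, and we reformulate (22) as

$$
\mathbb{E}_{h \sim q_\vartheta(h \mid y)}\big[\log p_\theta(y \mid h)\big] - D_{\mathrm{KL}}\big(q_\vartheta(h \mid y) \,\|\, p(h)\big). \tag{24}
$$

As a consequence, the lower bound of $\log p_\theta(y)$ can be maximized by minimizing

$$
L(\theta, \vartheta) = \frac{1}{2\sigma^2}\, \mathbb{E}_{h \sim q_\vartheta(h \mid y)}\big[\|y - C_\theta(h)\|^2\big] + D_{\mathrm{KL}}\big(q_\vartheta(h \mid y) \,\|\, \mathcal{N}(0, I)\big), \tag{25}
$$

where $\sigma^2$ is the observation noise variance from (3).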

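It is then straightforward to compute the gradient of the squared norm in (25). For an estimation of the expected value, we draw one sample $y$ from $p(y)$ and several samples from $q_\vartheta(h \mid y)$ by applying $g_\vartheta$ to samples of standard normal noise. This is known as the *reparametrization trick*. Slightly more elaborate is the KLD term in (25). Since the distribution $q_\vartheta(h \mid y)$ depends on $g_\vartheta$, in order to make the task computationally tractable, $g_\vartheta$ is modeled as an affine function of the noise of the form

$$
g_\vartheta(y, \epsilon) = \mu_\vartheta(y) + \operatorname{diag}\big(s_\vartheta(y)\big)\, \epsilon. \tag{26}
$$

In technical terms, this means the encoder part of Fig. 1 is a subnetwork that maps from a training sample $y$ to the two vectors $\mu_\vartheta(y)$ and $s_\vartheta(y)$. The random variable $\hat{h}$ is thus described by the distribution

$$
q_\vartheta(h \mid y) = \mathcal{N}\big(h \mid \mu_\vartheta(y),\, \operatorname{diag}(s_\vartheta(y))^2\big), \tag{27}
$$

and the KLD in (25) can be written as

$$
D_{\mathrm{KL}}\big(q_\vartheta(h \mid y) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{i=1}^{n} \Big(\mu_{\vartheta,i}(y)^2 + s_{\vartheta,i}(y)^2 - \log s_{\vartheta,i}(y)^2 - 1\Big). \tag{28}
$$

In this way, stochastic gradient descent on (25) can be applied via backpropagation.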
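The two ingredients of this loss, the reparametrized sampling and the closed-form KLD of a diagonal Gaussian to the standard normal, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the experiments: the function names, the `decode` callable and the log-variance parametrization of the encoder output (a common implementation choice) are assumptions.

```python
import numpy as np

def reparametrize(mu, log_var, rng):
    """Draw a latent sample as an affine function of standard normal noise."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kld_to_standard_normal(mu, log_var):
    """Closed-form KLD between N(mu, diag(exp(log_var))) and N(0, I)."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

def vae_loss(y, decode, mu, log_var, sigma, rng, num_samples=8):
    """Monte Carlo estimate of the scaled squared error plus the KLD term."""
    squared_error = 0.0
    for _ in range(num_samples):
        h = reparametrize(mu, log_var, rng)
        squared_error += np.sum((y - decode(h)) ** 2)
    reconstruction = squared_error / (2.0 * sigma ** 2 * num_samples)
    return reconstruction + kld_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
# Toy check with an identity "decoder" and a nearly deterministic encoder output.
example_loss = vae_loss(np.ones(2), lambda h: h, np.ones(2), np.full(2, -40.0), 1.0, rng)
```

In a deep learning framework, `mu` and `log_var` would be encoder outputs and the gradient would flow through `reparametrize`, since the noise `eps` does not depend on the parameters.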

### 4.2 Markov Assumptions

We want to model a sequential, stochastic visual process (2) such that $C$ is performed by the decoder part of a VAE. Let us assume that we are given a sequence $y_1, \dots, y_N$ of vectorized video frames. Hereby, $C = C_\theta$ is carried out by a neural network described by the trainable parameter tuple $\theta$. If we neglect the temporal order of the frames, we can theoretically train a VAE to generate frames similar to $y_t$, because the latent variables $h_t$ are from the standard normal distribution. However, crucial to *synthesizing* a visual process is not only the capability to create still-image frames, but also to create them according to a temporal model. First and foremost, this implies a possibility to infer $A$ in addition to $C_\theta$. The easiest way to approach this is by first learning $C_\theta$ by training the VAE and then inferring $A$ via squared error minimization as

$$
\hat{A} = \arg\min_{A} \sum_{t=1}^{N-1} \|h_{t+1} - A h_t\|^2, \tag{29}
$$

where $h_t$ denotes the latent representation inferred for the frame $y_t$.

Such an approach clearly has its advantages in terms of simplicity, but given the high capacity of trainable neural networks, it is more elegant to learn $C_\theta$ and $A$ simultaneously. By doing so, we force the latent variables to fit a linear transition model already during training, instead of fitting a linear state transition model to a sequence of already learned latent variables.
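For reference, the two-stage baseline reduces to an ordinary least-squares problem once the latent states are available. The following sketch assumes that the latent states are stacked row-wise in temporal order; the function name and the noise-free toy trajectory are hypothetical.

```python
import numpy as np

def fit_transition(latents):
    """Least-squares estimate of A from latent states stacked as rows in time order."""
    past, future = latents[:-1], latents[1:]
    # Row-wise, future_t = past_t @ A.T, so lstsq returns the transpose of A.
    solution, *_ = np.linalg.lstsq(past, future, rcond=None)
    return solution.T

# Noise-free toy trajectory generated by a known, hypothetical transition matrix.
A_true = np.array([[0.9, -0.2],
                   [0.2,  0.9]])
latents = [np.array([1.0, 0.0])]
for _ in range(19):
    latents.append(A_true @ latents[-1])
A_fit = fit_transition(np.stack(latents))
```

In this baseline, the latent states are fixed before `A` is estimated, so nothing encourages them to follow a linear transition in the first place. This is the limitation that the joint training addresses.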

The temporal model at hand is a first-order Markov process. Initially, the data needs to be adapted to the problem. We thus formulate our problem setting as a generative model for $x_t$, where each observation

$$
x_t = \begin{bmatrix} y_t \\ y_{t+1} \end{bmatrix} \tag{30}
$$

contains two succeeding frames.
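As a small illustration of this data preparation step, stacking each vectorized frame on top of its successor can be written as follows. The helper name and the row-wise frame layout are assumptions made for the sketch.

```python
import numpy as np

def make_frame_pairs(frames):
    """Stack each vectorized frame with its successor.

    frames has shape (T, d); the result has shape (T - 1, 2 * d).
    """
    return np.concatenate([frames[:-1], frames[1:]], axis=1)

# Four toy "frames" with three pixels each.
frames = np.arange(12.0).reshape(4, 3)
pairs = make_frame_pairs(frames)
```

A sequence of $T$ frames yields $T - 1$ overlapping pairs, and every frame except the first and the last appears once in the upper and once in the lower half.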

A sample $x_t$ is a realization of the random variable $x$ and is composed of two subvectors with the same statistical properties. This means that the distributions of the upper and the lower half subvector of $x$, i.e., of the current and the predicted frame, must be identical. Specifically, following the discussion of Section 4.1, we assume that $x$ is driven by a latent variable $h = [h^{(1)}; h^{(2)}]$ and the conditional distribution has the form

$$
p(x \mid h) = \mathcal{N}\left(x \,\middle|\, \begin{bmatrix} C_\theta(h^{(1)}) \\ C_\theta(h^{(2)}) \end{bmatrix},\ \sigma^2 I\right), \tag{31}
$$

where $\sigma^2$ denotes the variance of the observation noise in (2). The subvectors $h^{(1)}, h^{(2)}$ stand for the latent variables, i.e., the state space vectors, belonging to the upper and lower half of $x$. As agreed on before, their marginal distributions are standard normal. However, their joint distribution is not, since the choice of $h^{(2)}$ depends on $h^{(1)}$. In fact, from the previous section, we can deduce the joint probability distribution as

$$
\begin{bmatrix} h^{(1)} \\ h^{(2)} \end{bmatrix} \sim \mathcal{N}\left(0,\ \begin{bmatrix} I & A^\top \\ A & I \end{bmatrix}\right). \tag{32}
$$

This contradicts the premise of the VAE, which models latent variables by standard normal distributions. However, if we assume that it is possible to adapt the model of the classical VAE such that the decoder part in Fig. 1 can be fed with samples drawn from distribution (32) *and* make the parameter $A$ trainable, we can simultaneously learn the observation and dynamic state transition of a visual process.

### 4.3 A Dynamic VAE

In this section, we propose a neural network architecture that produces samples similar to $x_t$ from realizations of the distribution (32). We achieve this by modeling the linear dynamics with an additional layer between the latent space layer and the decoder. Let us refer to such a layer as the *dynamic* layer and to the architecture in its entirety as a *Dynamic* VAE. The purpose of the dynamic layer is to map the random variable $z$, which has standard normal distribution, to a random variable $h$ which has the distribution indicated in (32). Let us denote by $z^{(1)}, z^{(2)}$ the upper and lower half of $z$. Then such a mapping can be achieved by a function of the form

$$
z \mapsto \begin{bmatrix} z^{(1)} \\ A z^{(1)} + B z^{(2)} \end{bmatrix}, \tag{33}
$$

where $B$ is a matrix such that $B B^\top = I - A A^\top$. Fig. 2 depicts the resulting architecture.
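The effect of the dynamic layer on the latent statistics can be checked numerically. The sketch below is an illustration under our own assumptions: the transition matrix is a hypothetical example with $AA^\top = 0.45\,I$, and the second matrix (called `B` here) is obtained from a Cholesky factorization, which is one of many valid square roots of $I - AA^\top$.

```python
import numpy as np

def dynamic_layer(z, A, B):
    """Map standard normal z = [z1, z2] to [z1, A z1 + B z2]; rows are samples."""
    n = A.shape[0]
    z1, z2 = z[:, :n], z[:, n:]
    return np.concatenate([z1, z1 @ A.T + z2 @ B.T], axis=1)

n = 2
A = np.array([[0.6, -0.3],
              [0.3,  0.6]])                      # hypothetical transition, A A^T = 0.45 I
B = np.linalg.cholesky(np.eye(n) - A @ A.T)      # one valid choice with B B^T = I - A A^T

# The layer is linear, so its output covariance is J J^T with J = [[I, 0], [A, B]].
J = np.block([[np.eye(n), np.zeros((n, n))], [A, B]])
target_cov = np.block([[np.eye(n), A.T], [A, np.eye(n)]])

rng = np.random.default_rng(0)
h = dynamic_layer(rng.standard_normal((100_000, 2 * n)), A, B)
empirical_cov = h.T @ h / len(h)
```

Both halves of the output keep an identity covariance, while the cross-covariance block carries the transition matrix. The encoder can therefore still be regularized towards a standard normal `z`, and the temporal coupling enters only through this layer.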

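In order to guarantee stationarity, we need to ensure that condition (8) is satisfied. This can be done by including a regularizer. The loss function of the Dynamic VAE parameters $\theta$, $\vartheta$, $A$ and $B$ is thus defined as

$$
L(\theta, \vartheta, A, B) = \frac{1}{2\sigma^2}\, \mathbb{E}\left[\left\| x - \begin{bmatrix} C_\theta(h^{(1)}) \\ C_\theta(h^{(2)}) \end{bmatrix} \right\|^2\right] + D_{\mathrm{KL}}\big(q_\vartheta(z \mid x) \,\|\, \mathcal{N}(0, I)\big) + \lambda\, \big\| A A^\top + B B^\top - I \big\|_F^2, \tag{34}
$$

where $[h^{(1)}; h^{(2)}]$ is the output of the dynamic layer (33) for $z \sim q_\vartheta(z \mid x)$, and $\lambda$ should be chosen high enough to keep the regularizer close to $0$. The KLD term depends on $\vartheta$ via (28). Note, however, that $y$ is to be replaced by $x$ in this context.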
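A stationarity regularizer of this kind can be sketched as a penalty on the violation of condition (8). The squared Frobenius norm and the names `A` and `B` for the two matrices of the dynamic layer are assumptions made for illustration; the exact form used in the experiments may differ.

```python
import numpy as np

def stationarity_penalty(A, B):
    """Squared Frobenius norm of A A^T + B B^T - I; zero iff the condition holds."""
    n = A.shape[0]
    residual = A @ A.T + B @ B.T - np.eye(n)
    return float(np.sum(residual ** 2))

A = np.array([[0.6, -0.3],
              [0.3,  0.6]])                      # hypothetical transition, A A^T = 0.45 I
B_valid = np.sqrt(0.55) * np.eye(2)              # satisfies B B^T = I - A A^T
penalty_valid = stationarity_penalty(A, B_valid)
penalty_no_noise = stationarity_penalty(A, np.zeros((2, 2)))
```

With the noise matrix set to zero, the penalty equals $\|0.45\,I - I\|_F^2 = 0.605$, which a large regularization weight would push back towards zero during training.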

## 5 Experiments

### 5.1 Overview

The experiments treated three different kinds of visual processes. In each experiment, the Dynamic VAE was trained with a sequence of frames. Afterwards, each sequence was generated from the trained model. Latent states of dimension were synthesized according to the rule

$$
h_{t+1} = A h_t + B v_t, \tag{35}
$$

where the $v_t$ were drawn from a standard normal distribution and the initial state $h_0$ was inferred from the expected value of the conditional latent distribution of a test frame pair $x$. The frame pair was excluded from the training set. This was done in order to improve the significance of the experimental outcome with respect to how well the model generalizes.
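The synthesis rule can be sketched as a short roll-out followed by decoding. The names `A` and `B` for the transition and noise matrices, the `tanh` stand-in for the trained decoder and the toy values are all hypothetical.

```python
import numpy as np

def synthesize_states(A, B, h0, steps, rng):
    """Roll out latent states by a linear transition driven by standard normal noise."""
    states = [h0]
    for _ in range(steps - 1):
        v = rng.standard_normal(h0.shape[0])
        states.append(A @ states[-1] + B @ v)
    return np.stack(states)

rng = np.random.default_rng(0)
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])          # hypothetical transition: rotation by 90 degrees
B = np.zeros((2, 2))                 # A is orthogonal, so no process noise remains
states = synthesize_states(A, B, np.array([1.0, 0.0]), 5, rng)

W = np.ones((3, 2))                  # stand-in for a trained decoder network
frames = np.tanh(states @ W.T)
```

In this noise-free toy case, the states cycle with period four, which mirrors the kind of repeating sequence considered in the first experiment.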

The observer neural network was implemented via a fully-connected layer followed by three convolutional layers with ReLU activations and nearest neighbor upsampling. The number of channels was decreased with each layer by the factor four, such that the number of pixels in each hidden layer remains roughly unchanged. The encoder mirrored the structure with the same number of convolutional layers, ReLU activations, increasing number of channels and Max-Pooling layers. For each convolutional layer, the same filter size was used. All experiments were implemented in Python 3.6 with PyTorch 0.1.12 on CUDA 8.0. The choice of parameters for each experiment is described in Table 1. The code is publicly available [26].
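The bookkeeping behind the channel reduction can be illustrated with a few lines of plain Python. The sketch assumes that each upsampling step doubles the height and width and that the convolutions preserve the spatial size. Neither detail is stated above, and the initial shape of 64 channels at 4×4 pixels is made up.

```python
def decoder_shapes(channels, height, width, num_conv_layers=3,
                   channel_factor=4, upsample=2):
    """List the (channels, height, width) shape after each decoder stage."""
    shapes = [(channels, height, width)]
    for _ in range(num_conv_layers):
        channels //= channel_factor   # fewer feature maps per layer
        height *= upsample            # nearest neighbor upsampling
        width *= upsample
        shapes.append((channels, height, width))
    return shapes

shapes = decoder_shapes(channels=64, height=4, width=4)
sizes = [c * h * w for c, h, w in shapes]
```

Under these assumptions, dividing the channels by four exactly compensates for the fourfold increase in spatial positions, so every hidden layer holds the same number of values.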

Experiment | Filter size | ||
---|---|---|---|

MNIST | 1.5 | 100.0 | |

UCLA-50 | 0.31 | 100.0 | |

NORB | 5.0 | 100.0 |

Evaluating generative models is particularly challenging. This is due to the very nature of the problem that demands measuring the similarity of the probability distribution underlying the training data to the probability distribution that generated the test data. Neither of the two is available in closed form but can be only estimated from a limited number of samples in a very high-dimensional space. It is thus an established practice to evaluate generative models by visual inspection of the generated samples [19]. However, it is important to consider overfitting that can lead to supposedly very realistic samples. The following experiments have the purpose of demonstrating the principal capability of the proposed methods to infer a linear model from a highly non-linear process by generating sequences from Gaussian noise. Therefore, we acknowledge the fact that our choice of hyperparameters is possibly suboptimal and architectures that are optimized for a specific task could lead to visually more appealing results. Due to space constraints, only a few experimental outcomes are shown in each subsection. The supplementary material to this paper contains synthesis results for each performed experiment.

### 5.2 Learning to Count

In the first series of experiments, we trained our architecture with sequences of images from the MNIST data set. One sequence was used for each experiment. The aim was to learn a generative, sequential model that can produce repeating sequences of numbers. For instance, in the first experiment, the frame transition to be linearized was a mapping of a 1 to a 2, a 2 to a 3 and a 3 to a 4 and a 4 to a 1. Each training sequence contained 7999 MNIST image pairs.

Fig. 3 visualizes the synthesis of the sequences 12341234… and 67896789… in comparison to the result of a purely linear model as described in [4]. The Dynamic VAE did well in synthesizing number sequences of length 4 or smaller. Longer sequences were more challenging, as Fig. 4 shows. While some sequences, like and , could be sufficiently well trained, other sequences, like appeared to yield non-stationary systems, or like were too unpredictable for the Dynamic VAE.

### 5.3 Dynamic Textures

The second series of experiments focused on the synthesis of dynamic textures. In each experiment, the Dynamic VAE was trained with one class of dynamic texture from the cropped UCLA-50 database [27, 28].

Fig. 5 depicts the synthesis results for the dynamic texture wfalls-c. In general, we observed that the synthesis of predictable sequences, e.g. oscillations or cyclic phenomena, produced realistic results. Chaotic textures yielded some frames that looked artificial. This could be observed, for instance, in the synthesis of the candle dynamic texture.

### 5.4 Rotating Objects

The *Small NORB* [29] dataset consists of pictures taken of different miniature objects under varying lighting conditions, elevation and azimuthal angles. One object at a time was used for training. We trained our model to linearize a counterclockwise azimuthal rotation by . Since the Small NORB dataset contains little variability apart from the intentional one, we decided to exclude one configuration of lighting conditions and elevation angle from the training data and use the contained sequence of azimuthal positions as ground truth for our experiment. Generally, the rotation could be well reproduced by the linear state transition model, except for category 1, which contains human figures.
Fig. 6 depicts the Dynamic VAE synthesis of a rotating horse compared to a linear synthesis [4].

The angle is slightly higher than , since the columns are not aligned. The model seems to be confused by diametrical angles. For instance, at certain positions of Fig. 6, it becomes indeterminable whether the horse faces towards or away from the observer, leading to the skipping of the following .

## 6 Conclusion

This work presented an approach to infer linear models of visual processes by means of Variational Autoencoders. To this end, the classical VAE model was modified to include an additional layer that models the latent dynamics of the visual process. The capability of the proposed model was demonstrated in three series of synthesis experiments. Additionally, this work aimed to develop a notion of linearizability and to examine its implications for the choice of neural network architectures. While these are first conceptual results, we understand that the theoretical analysis on this matter has room for improvement. Therefore, in future work, we plan to gain further insights into the theoretical concept of linearizability and to improve the architecture to handle more complex data.

### References

- Vondrick, C., Torralba, A.: Generating the future with adversarial transformers. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
- De Souza, C., Gaidon, A., Cabon, Y., Lopez Pena, A.: Procedural generation of videos to train deep action recognition networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
- Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: International Conference on Computer Vision (ICCV). Volume 2. (2017)
- Doretto, G., Chiuso, A., Wu, Y.N., Soatto, S.: Dynamic textures. International Journal of Computer Vision 51(2) (2003) 91–109
- Wei, X., Li, Y., Shen, H., Chen, F., Kleinsteuber, M., Wang, Z.: Dynamical textures modeling via joint video dictionary learning. IEEE Transactions on Image Processing 26(6) (2017) 2929–2943
- Chan, A.B., Vasconcelos, N.: Classifying video with kernel dynamic textures. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2007) 1–6
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
- Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
- Watter, M., Springenberg, J., Boedecker, J., Riedmiller, M.: Embed to control: A locally linear latent dynamics model for control from raw images. In: Advances in neural information processing systems. (2015) 2746–2754
- Johnson, M., Duvenaud, D.K., Wiltschko, A., Adams, R.P., Datta, S.R.: Composing graphical models with neural networks for structured representations and fast inference. In: Advances in neural information processing systems. (2016) 2946–2954
- Krishnan, R.G., Shalit, U., Sontag, D.: Deep kalman filters. arXiv preprint arXiv:1511.05121 (2015)
- Goroshin, R., Mathieu, M.F., LeCun, Y.: Learning to linearize under uncertainty. In: Advances in Neural Information Processing Systems. (2015) 1234–1242
- Cohen, T.S., Welling, M.: Transformation properties of learned visual representations. arXiv preprint arXiv:1412.7659 (2014)
- Cohen, T., Welling, M.: Group equivariant convolutional networks. In: International Conference on Machine Learning. (2016) 2990–2999
- Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: International Conference on Artificial Neural Networks, Springer (2011) 44–51
- Memisevic, R.: Learning to relate images. IEEE transactions on pattern analysis and machine intelligence 35(8) (2013) 1829–1846
- Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Advances In Neural Information Processing Systems. (2016) 613–621
- Xue, T., Wu, J., Bouman, K., Freeman, B.: Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In: Advances in Neural Information Processing Systems. (2016) 91–99
- Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. Volume 1. MIT press Cambridge (2016)
- Afsari, B., Vidal, R.: The alignment distance on spaces of linear dynamical systems. In: Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, IEEE (2013) 1162–1167
- Perko, L.: Differential equations and dynamical systems. Volume 7. Springer Science & Business Media (2013)
- Mallat, S.: Group invariant scattering. Communications on Pure and Applied Mathematics 65(10) (2012) 1331–1398
- Wiatowski, T., Bölcskei, H.: A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory (2017)
- Mallat, S.: Understanding deep convolutional networks. Phil. Trans. R. Soc. A 374(2065) (2016) 20150203
- Doersch, C.: Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016)
- Sagel, A.: Gitlab repository: https://gitlab.lrz.de/ga68biq/dynamicvae
- Saisan, P., Doretto, G., Wu, Y.N., Soatto, S.: Dynamic texture recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 2., IEEE (2001) II–II
- Chan, A.B., Vasconcelos, N.: Probabilistic kernels for the classification of auto-regressive visual processes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 1. (June 2005) 846–851 vol. 1
- LeCun, Y., Huang, F.J., Bottou, L.: Learning methods for generic object recognition with invariance to pose and lighting. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 2., IEEE (2004) II–104