Learning Stable Group Invariant Representations with Convolutional Networks
Joan Bruna, Arthur Szlam and Yann LeCun
Courant Institute, New York University
New York, NY 10013
{bruna,lecun}@cims.nyu.edu
1 Introduction
Many signal categories in vision and auditory problems are invariant to the action of transformation groups, such as translations, rotations or frequency transpositions. This property motivates the study of signal representations which are also invariant to the action of these transformation groups. For instance, translation invariance can be achieved with a registration procedure or with autocorrelation measures.
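As a concrete illustration (not part of the original text), translation invariance of the circular autocorrelation can be checked numerically; the signal length and shift below are arbitrary choices:

```python
import numpy as np

# Circular autocorrelation via the Wiener-Khinchin identity:
# autocorr(x) = ifft(|fft(x)|^2), which is exactly invariant to cyclic shifts.
rng = np.random.default_rng(0)
x = rng.standard_normal(128)

def autocorr(s):
    return np.fft.ifft(np.abs(np.fft.fft(s)) ** 2).real

print(np.allclose(autocorr(x), autocorr(np.roll(x, 17))))  # True
```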
Transformation groups are in fact low-dimensional manifolds, and therefore mere group invariance is in general not enough to efficiently describe signal classes. Indeed, signals may be perturbed with additive noise and also with geometrical deformations, so one can then ask for invariant representations which are stable to these perturbations. Scattering convolutional networks [1] construct locally translation-invariant signal representations, with additive and geometrical stability, by cascading complex wavelet modulus operators with a low-pass smoothing kernel. By defining wavelet decompositions on any locally compact Lie group, scattering operators can be generalized and cascaded to provide local invariance with respect to more general transformation groups [2, 3]. Although such transformation groups are present across many recognition problems, they require prior information which sometimes cannot be assumed.
Convolutional networks [4] cascade filter banks with pointwise nonlinearities and local pooling operators. By connecting the output of each layer to the input of the following one with shared weights, the trainable filters implement convolution operators. We show that the invariance properties built by deep convolutional networks can be cast as a form of stable group invariance. The network wiring architecture determines the invariance group, while the trainable filter coefficients characterize the group action.
Deep convolutional architectures cascade several layers of convolutions, nonlinearities and pooling. These architectures have the capacity to generate local invariance to the action of more general groups. Under appropriate conditions, these groups can be factorized as products of smaller groups. Each of these factors can then be associated with a subset of consecutive layers of the convolutional network. Under these conditions, the invariance properties of the final representation can be studied from the group structure generated by each layer.
2 Problem statement
2.1 Stable Group Invariance
A transformation group $G$ acts on the input space $\mathcal{H}$ (assumed to be a Hilbert space) with a linear group action $(g, x) \mapsto g.x$, which is compatible with the group operation: $g_1.(g_2.x) = (g_1 g_2).x$.
A signal representation $\Phi$ is invariant to the action of $G$ if $\Phi(g.x) = \Phi(x)$ for all $g \in G$ and all $x \in \mathcal{H}$. However, mere group invariance is in general too weak, due to the presence of a much larger, high-dimensional variability which does not belong to the low-dimensional group. It is then necessary to incorporate the notion of outer “deformations” with another group action of $\tilde{G}$, where $\tilde{G}$ is a larger group containing $G$. The geometric stability can be stated with a Lipschitz continuity property

$\|\Phi(\tilde{g}.x) - \Phi(x)\| \leq C \|x\|\, d(\tilde{g}, G), \qquad (1)$

where $d(\tilde{g}, G)$ measures the “distance” from $\tilde{g}$ to the invariance group $G$. For instance, when $G$ is the translation group of $\mathbb{R}^d$ and $\tilde{G}$ is the group of diffeomorphisms of $\mathbb{R}^d$, then $\tilde{g}.x(u) = x(u - \tau(u))$ and one can select as distance the elastic deformation metric $d(\tilde{g}, G) = \|\nabla \tau\|_\infty$, where $\|\nabla \tau\|_\infty = \sup_u |\nabla \tau(u)|$ [2].
Even though the group invariance formalism describes global invariance properties of the representation, it also provides a valid and useful framework to study local invariance properties. Indeed, if one replaces (1) by

$\|\Phi(\tilde{g}.x) - \Phi(x)\| \leq C \|x\| \left( d_G(e, P(\tilde{g})) + d(\tilde{g}, G) \right), \qquad (2)$

where $P(\tilde{g})$ is a projection of $\tilde{g}$ onto $G$, $e$ is the identity and $d_G$ is a metric on $G$ measuring the amount of transformation being applied, then the local invariance is expressed by adjusting the proportionality between the two metrics.
2.2 Convolutional Networks
A generic convolutional network defined on a space of square-integrable signals $L^2(\mathbb{R}^d)$ starts with a filter bank $\{h_k\}_{k \leq K}$, which for each input $x$ produces the collection

$\{x \ast h_k(u)\}_{k \leq K,\, u \in \mathbb{R}^d}.$

If the filter bank defines a stable, invertible frame, then there exist two constants $0 < A \leq B$ such that

$A \|x\|^2 \leq \sum_{k \leq K} \|x \ast h_k\|^2 \leq B \|x\|^2,$

where $\|x\|^2 = \int |x(u)|^2\, du$. By defining $Wx = \{x \ast h_k\}_{k \leq K}$, the first layer of the network can be written as the linear mapping $W : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d \times \{1, \dots, K\})$.
$Wx$ is then transformed with a pointwise nonlinear operator $M$ which is usually non-expansive, meaning that $\|Mx - Mx'\| \leq \|x - x'\|$. Finally, a local pooling operator can be defined as any linear or nonlinear operator $P$ which reduces the resolution of the signal along one or more coordinates and which avoids “aliasing”. If $N_u$ and $N_k$ denote the loss of resolution along each coordinate, it results that $P$ maps signals sampled on a grid of size $n \times K$ to signals sampled on a grid of size $n' \times K'$, with $n' = \alpha\, n / N_u$ and $K' = \alpha\, K / N_k$, where $\alpha \geq 1$ is an oversampling factor. Linear pooling operators are implemented as low-pass filters followed by a downsampling.
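For illustration, here is a minimal numpy sketch of a linear pooling operator; the box filter and the downsampling factor are arbitrary choices, not prescribed by the text:

```python
import numpy as np

# Linear pooling: a low-pass (anti-aliasing) filter followed by downsampling.
def linear_pool(x, N=4):
    h = np.ones(N) / N                       # box low-pass kernel
    smoothed = np.convolve(x, h, mode="same")
    return smoothed[::N]                     # reduce the resolution by N

x = np.random.default_rng(1).standard_normal(64)
print(linear_pool(x).shape)                  # (16,)
```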
Then, an $m$-layer convolutional network is a cascade

$\Phi(x) = P_m M_m W_m \cdots P_1 M_1 W_1\, x, \qquad (3)$

which produces successively the intermediate representations $x_j = P_j M_j W_j\, x_{j-1}$, $j = 1, \dots, m$, with $x_0 = x$.
The filter banks $W_j$, together with the pooling operators $P_j$, progressively transform the signal domain: filter bank steps lift the domain of definition by adding new coordinates, whereas pooling steps reduce the resolution along certain coordinates.
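For concreteness, the following numpy sketch instantiates the cascade (3) on one-dimensional signals. The random filters, the modulus nonlinearity and the pooling factors are placeholder choices; a trained network would learn the filters:

```python
import numpy as np

# One layer of (3): a filter bank W_j (convolutions, lifting a new coordinate k),
# a pointwise non-expansive nonlinearity M_j (modulus), and a pooling P_j.
def layer(x, filters, N=2):
    # x: (channels, length); filters: (K, channels, width)
    out = np.stack([
        sum(np.convolve(x[c], f[c], mode="same") for c in range(x.shape[0]))
        for f in filters
    ])                                        # W_j: K new coordinates
    out = np.abs(out)                         # M_j: pointwise modulus
    h = np.ones(N) / N                        # P_j: low-pass + downsample
    out = np.apply_along_axis(lambda s: np.convolve(s, h, mode="same"), 1, out)
    return out[:, ::N]

rng = np.random.default_rng(2)
x = rng.standard_normal((1, 256))
x1 = layer(x, rng.standard_normal((8, 1, 9)))     # x_1: shape (8, 128)
x2 = layer(x1, rng.standard_normal((16, 8, 9)))   # x_2: shape (16, 64)
print(x1.shape, x2.shape)
```

Each layer adds a coordinate $k$ (the filter index) and halves the resolution along $u$, matching the lifting/pooling description above.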
3 Invariance Properties of Convolutional Networks
3.1 The case of one-parameter transformation groups
Let us start by assuming the simplest form of variability produced by a transformation group. A one-parameter transformation group is a family $\{U_t\}_{t \in \mathbb{R}}$ of unitary linear operators of $L^2(\mathbb{R}^d)$ such that (i) it is strongly continuous: $\lim_{t \to t_0} \|U_t x - U_{t_0} x\| = 0$ for every $x$, and (ii) $U_{t+s} = U_t U_s$. One-parameter transformation groups are thus homomorphic to $\mathbb{R}$ (with the addition as group operation), and define an action which is continuous in the group variable. Unidimensional translations $U_t x(u) = x(u - t)$, frequency transpositions $U_t x = \mathcal{F}^{-1}\big(\mathcal{F}x(\omega - t)\big)$ (where $\mathcal{F}$, $\mathcal{F}^{-1}$ are respectively the forward and inverse Fourier transforms) or unitary dilations $U_t x(u) = e^{-t/2} x(e^{-t} u)$ are examples of one-parameter transformation groups.
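These axioms can be checked numerically for a discretization of the frequency transposition group; in the sketch below the transposition step is an integer number of frequency bins, so the discrete Fourier shift is exact:

```python
import numpy as np

# Discretized frequency transposition U_t x = F^{-1}((F x)(w - t)).
def U(t, x):
    return np.fft.ifft(np.roll(np.fft.fft(x), t))

x = np.random.default_rng(3).standard_normal(64)

# (ii) group law U_{t+s} = U_t U_s, and unitarity ||U_t x|| = ||x||.
print(np.allclose(U(5, U(3, x)), U(8, x)))                     # True
print(np.isclose(np.linalg.norm(U(5, x)), np.linalg.norm(x)))  # True
```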
One-parameter transformation groups are particularly simple to study thanks to Stone’s theorem [5], which states that unitary one-parameter transformation groups are uniquely generated by a complex exponential of a self-adjoint operator:

$U_t = e^{itA}, \quad A \text{ self-adjoint}.$

Here, the complex exponential of a self-adjoint operator should be interpreted in terms of its spectrum. In the finite-dimensional case (when the signal domain is discrete), this means that there exists an orthonormal basis $\{e_\omega\}_\omega$ of eigenvectors of $A$ with eigenvalues $\lambda_\omega$, such that if $\hat{x}(\omega) = \langle x, e_\omega \rangle$, then

$\widehat{U_t x}(\omega) = e^{it\lambda_\omega}\, \hat{x}(\omega). \qquad (4)$
In other words, the group action can be expressed as a linear phase change in the basis which diagonalizes the unique self-adjoint operator given by Stone’s theorem. In the particular case of translations, the change of basis is given by the Fourier transform. As a result, one can obtain a representation which is invariant to the action of $\{U_t\}_t$ with a single layer of a neural network: a linear decomposition which expresses the data in the basis $\{e_\omega\}_\omega$, followed by a pointwise complex modulus, $\Phi(x)(\omega) = |\hat{x}(\omega)|$. In the case of the translation group, this corresponds to taking the modulus of the Fourier transform.
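For cyclic translations of discrete signals, the basis in (4) is the discrete Fourier basis, and both the phase-change identity and the resulting invariance can be verified in a few lines of numpy (the test signal is an arbitrary random vector):

```python
import numpy as np

N = 64
x = np.random.default_rng(4).standard_normal(N)
k = np.arange(N)

# A cyclic shift becomes a linear phase change in the Fourier basis, as in (4).
lhs = np.fft.fft(np.roll(x, 1))
rhs = np.exp(-2j * np.pi * k / N) * np.fft.fft(x)
print(np.allclose(lhs, rhs))                             # True

# One linear layer (change of basis) + pointwise modulus = invariant map.
print(np.allclose(np.abs(lhs), np.abs(np.fft.fft(x))))   # True
```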
3.2 Presence of deformations
Stone’s theorem provides a recipe for global group invariance for strongly continuous group actions. In the absence of noise and deformations, an invariant representation can be obtained by taking complex moduli in a basis which diagonalizes the group action, which can be implemented with a shallow, one-layer architecture. However, the underlying low-dimensional assumption is rarely satisfied, due to the presence of more complex forms of variability.
This complex variability can be modeled as follows. If $\{e_\omega\}_\omega$ is the basis which diagonalizes a given one-parameter group, then the group action is expressed in this basis as the phase shift $\hat{x}(\omega) \mapsto e^{it\lambda_\omega} \hat{x}(\omega)$, a rigid displacement $t$ applied uniformly to every coordinate. Whereas the group action thus consists in rigid translations on this basis, by analogy a deformation is defined as a non-rigid warping in this domain, $\hat{x}(\omega) \mapsto e^{i\tau(\omega)\lambda_\omega} \hat{x}(\omega)$, where $\tau(\omega)$ is a displacement field along the indexes of the decomposition.
The amount of deformation can be measured with the regularity of $\tau$, for instance with $\sup_\omega |\nabla \tau(\omega)|$, which controls how distant the warping is from being a rigid translation and hence an element of the group. This suggests that, in order to obtain stability to deformations, rather than looking for eigenvectors of the infinitesimal group action, one should look for linear measurements which are well localized in the domain where deformations occur, and which nearly diagonalize the group action. In particular, these measurements can be implemented with convolutions using compactly supported filters, such as in convolutional networks.
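The contrast between the global eigenbasis and localized measurements can be illustrated numerically. In the toy numpy comparison below (an arbitrary oscillation and a hand-built Gabor-like filter, neither prescribed by the text), a 2% dilation produces an $O(1)$ change of the Fourier modulus, while a localized filter followed by modulus and average pooling varies by only a few percent:

```python
import numpy as np

N = 1024
u = np.arange(N)
x = np.cos(2 * np.pi * 100 * u / N)                  # oscillation at bin 100
eps = 0.02
x_def = np.cos(2 * np.pi * 100 * (1 - eps) * u / N)  # slightly dilated copy

rel = lambda a, b: np.linalg.norm(a - b) / np.linalg.norm(a)

# Global Fourier modulus: the spectral peak jumps to another bin -> O(1) change.
print(rel(np.abs(np.fft.fft(x)), np.abs(np.fft.fft(x_def))))

# Localized (Gabor-like) filter + modulus + average pooling: smooth response.
g = (np.exp(-0.5 * ((u - N // 2) / 16.0) ** 2)
     * np.exp(2j * np.pi * 100 * (u - N // 2) / N))
feat = lambda s: np.abs(np.convolve(s, g, mode="same")).mean()
print(abs(feat(x) - feat(x_def)) / feat(x))
```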
Let $x_1$ be an intermediate representation in a convolutional network whose first layer is fully connected. Suppose that $G$ is a group acting on $x_1$ via

$g.x_1(u_1, u_2, \dots, u_n) = x_1(u_1 - v(g), u_2, \dots, u_n), \qquad (5)$

where $v : G \to \mathbb{R}$. This corresponds to the idealized case where the transformation only modifies one component of the representation. A local pooling operator along the variable $u_1$, at a certain scale $\Delta$, attenuates the transformation by a factor of the order of $|v(g)|/\Delta$ as soon as $|v(g)| \leq \Delta$. It thus produces local invariance with respect to the action of $G$.
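This attenuation is easy to observe numerically. In the sketch below (arbitrary random signal, shift and pooling scales), the pooled representations of a signal and of its translate get closer as the pooling scale $\Delta$ grows:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.convolve(rng.standard_normal(1024), np.ones(8) / 8, mode="same")
v = 4                                       # translation, in samples

def pool(s, D):
    # local average pooling at scale D (low-pass + downsampling)
    return np.convolve(s, np.ones(D) / D, mode="same")[::D]

rel = lambda a, b: np.linalg.norm(a - b) / np.linalg.norm(a)
print(rel(x, np.roll(x, v)))                # O(1): raw signals differ
for D in (8, 32, 128):                      # error decays roughly like v / D
    print(D, rel(pool(x, D), pool(np.roll(x, v), D)))
```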
3.3 Group Factorization with Deep Networks
Deep convolutional networks have the capacity to learn complex relationships in the data and to build invariance with respect to a large family of transformations. These properties can be partly explained in terms of a factorization of the invariance group performed successively by the layers.
Whereas pooling operators efficiently produce stable local invariance, convolution operators preserve the invariance generated by previous layers. Indeed, suppose $x_j$ is an intermediate representation in a convolutional network, and that $G$ acts on $x_j$ via $g.x_j(u) = x_j(u - v(g))$. It follows that if the next layer is constructed as

$x_{j+1}(u, k) = M(x_j \ast h_k)(u),$

then $G$ acts on $x_{j+1}$ via $g.x_{j+1}(u, k) = x_{j+1}(u - v(g), k)$, since convolutions commute with the group action, which by construction is expressed as a translation in the coefficients of $x_j$. The new coordinates $k$ are thus unaffected by the action of $G$.
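The commutation can be verified exactly for circular convolutions; the random signal and filter bank below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(256)
filters = rng.standard_normal((4, 9))       # a small filter bank {h_k}

def W(s):
    # circular convolution of s with every filter, via the FFT
    return np.stack([np.real(np.fft.ifft(np.fft.fft(s) * np.fft.fft(h, len(s))))
                     for h in filters])

# Filtering a shifted input = shifting the filtered output; k is untouched.
print(np.allclose(W(np.roll(x, 11)), np.roll(W(x), 11, axis=1)))  # True
```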
As a consequence, this property enables a systematic procedure to generate invariance to groups of the form $G = G_1 \ltimes G_2 \ltimes \cdots \ltimes G_S$, where $\ltimes$ denotes the semidirect product of groups. In this decomposition, each factor $G_s$ is associated with a range of consecutive convolutional layers, along the coordinates where the action of $G_s$ is perceived.
4 Perspectives
The connections between group invariance and deep convolutional networks offer an interpretation of their efficiency on several recognition tasks. In particular, they might explain why the weight sharing induced by convolutions is a valid regularization method in the presence of group variability.
More concretely, we shall also concentrate on the following aspects:

Group Discovery. One might ask for the group of transformations which best explains the variability observed in a given dataset $\{x_i\}_i$. In the case where no geometric deformations are present, one can start by learning the (complex) eigenvectors of the group action, i.e. a decomposition $\{e_\omega\}_\omega$ satisfying $U_t e_\omega = e^{it\lambda_\omega} e_\omega$.
When the data corresponds to a uniform measure on the group, this decomposition can be obtained from the diagonalization of the covariance operator $\Sigma = \mathbb{E}\{x x^*\}$ (with $x$ centered). In that case, the real eigenvectors of $\Sigma$ are grouped into pairs of vectors with identical eigenvalues, which then define the complex decomposition diagonalizing the group action; a numerical sketch is given after this item.
In the presence of deformations, the global invariance is replaced by a measure of local invariance. This problem is closely related to the sparse coding with slowness approach of [6].
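The covariance-based discovery step can be sketched for the cyclic translation group. Below, the dataset consists of all circular shifts of an arbitrary template, which realizes the uniform measure on the group exactly; the eigenvalues then come in pairs, and the discrete Fourier basis diagonalizes the covariance:

```python
import numpy as np

N = 32
template = np.random.default_rng(7).standard_normal(N)
X = np.stack([np.roll(template, s) for s in range(N)])  # uniform on the group
X = X - X.mean(axis=0)
Sigma = X.T @ X / N                                     # circulant covariance

eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
print(np.round(eigvals[:6], 4))        # nonzero eigenvalues arrive in pairs

F = np.fft.fft(np.eye(N)) / np.sqrt(N)                  # unitary DFT matrix
D = F @ Sigma @ F.conj().T
print(np.allclose(D, np.diag(np.diag(D))))  # True: Fourier vectors diagonalize
```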

Structured Convolutional Networks. Groups offer a powerful framework to incorporate structure into the families of filters, similarly as in [7]. On the one hand, one can enforce global properties of the group by defining the convolutions accordingly; for instance, by wrapping the domain of the convolution, one enforces a periodic group to emerge. On the other hand, one could further regularize the learning by enforcing a group structure within a filter bank; for instance, one could ask a certain filter bank to have the form $\{h_k = R_{\theta_k} h\}_k$, where $R_\theta$ is a rotation by an angle $\theta$, as sketched below.
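A minimal sketch of such a structured filter bank, with an arbitrary Gabor-like mother filter and scipy's image rotation (the filter size and the number of orientations are illustrative choices):

```python
import numpy as np
from scipy.ndimage import rotate

# Mother filter h: an oriented Gabor-like patch.
u = np.linspace(-3, 3, 15)
xx, yy = np.meshgrid(u, u)
h = np.exp(-(xx**2 + yy**2) / 2) * np.cos(4 * xx)

# Structured bank {h_k = R_{k*theta_0} h}: every filter is a rotation of h,
# so only h is free and the group structure is imposed by construction.
K = 8
bank = [rotate(h, angle=k * 180 / K, reshape=False) for k in range(K)]
print(len(bank), bank[0].shape)        # 8 filters of shape (15, 15)
```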
References
[1] J. Bruna, S. Mallat, “Invariant Scattering Convolution Networks”, IEEE TPAMI, 2012.
[2] S. Mallat, “Group Invariant Scattering”, CPAM, 2012.
[3] L. Sifre, S. Mallat, “Combined Scattering for Rotation Invariant Texture Analysis”, ESANN, 2012.
[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, “Gradient-Based Learning Applied to Document Recognition”, Proceedings of the IEEE, 1998.
[5] M. H. Stone, “On One-Parameter Unitary Groups in Hilbert Space”, Annals of Mathematics, 1932.
[6] C. Cadieu, B. Olshausen, “Learning Transformational Invariants from Natural Movies”, NIPS, 2009.
[7] K. Gregor, A. Szlam, Y. LeCun, “Structured Sparse Coding via Lateral Inhibition”, NIPS, 2011.