Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning?


Fabio Anselmi 1,2, Joel Z. Leibo 1, Lorenzo Rosasco 1,2, Jim Mutch 1, Andrea Tacchetti 1,2 and Tomaso Poggio 1,2
1 Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139; 2 Istituto Italiano di Tecnologia, Genova, 16163
Abstract

The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples ($n \to \infty$). The next phase is likely to focus on algorithms capable of learning from very few labeled examples ($n \to 1$), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a “good” representation for supervised learning, characterized by small sample complexity ($n$). We consider the case of visual object recognition, though the theory applies to other domains. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translations, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and unique (discriminative) signature can be computed for each image patch, $I$, in terms of empirical distributions of the dot-products between $I$ and a set of templates stored during unsupervised learning. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such estimates. Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts. The theory extends existing deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and discriminative for recognition—and that this representation may be continuously learned in an unsupervised way during development and visual experience. Notes on versions and dates: the current paper evolved from one that first appeared online in Nature Precedings on July 20, 2011 (npre.2011.6117.1). It follows a CSAIL technical report which appeared on December 30, 2012 (MIT-CSAIL-TR-2012-035) and a CBCL paper, Massachusetts Institute of Technology, Cambridge, MA, April 1, 2013, with the title “Magic Materials: a theory of deep hierarchical architectures for learning sensory representations” ([5]). Shorter papers describing isolated aspects of the theory have also appeared [6, 7].

Invariance—Hierarchy—Convolutional networks—Visual cortex

http://cbmm.mit.edu (March 2014)

 

 

CBMM Memo No. 001            September 28, 2019

Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning?

by

Fabio Anselmi, Joel Z. Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti and Tomaso Poggio.

Abstract: The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples ($n \to \infty$). The next phase is likely to focus on algorithms capable of learning from very few labeled examples ($n \to 1$), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a “good” representation for supervised learning, characterized by small sample complexity ($n$). We consider the case of visual object recognition, though the theory applies to other domains. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translations, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and unique (discriminative) signature can be computed for each image patch, $I$, in terms of empirical distributions of the dot-products between $I$ and a set of templates stored during unsupervised learning. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such estimates. Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts. The theory extends existing deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and discriminative for recognition—and that this representation may be continuously learned in an unsupervised way during development and visual experience.

 

 

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF - 1231216.


It is known that Hubel and Wiesel’s original proposal [1] for visual area V1—of a module consisting of complex cells (C-units) combining the outputs of sets of simple cells (S-units) with identical orientation preferences but differing retinal positions—can be used to construct translation-invariant detectors. This is the insight underlying many networks for visual recognition, including HMAX [2] and convolutional neural nets [3, 4]. We show here how the original idea can be expanded into a comprehensive theory of visual recognition relevant for computer vision and possibly for visual cortex. The first step in the theory is the conjecture that a representation of images and image patches, with a feature vector that is invariant to a broad range of transformations—such as translation, scale, expression of a face, pose of a body, and viewpoint—makes it possible to recognize objects from only a few labeled examples, as humans do. The second step is proving that hierarchical architectures of Hubel-Wiesel (‘HW’) modules (indicated by $\bigwedge$ in Fig. 1) can provide such invariant representations while maintaining discriminative information about the original image. Each $\bigwedge$-module provides a feature vector, which we call a signature, for the part of the visual field that is inside its “receptive field”; the signature is invariant to affine transformations within the receptive field. The hierarchical architecture, since it computes a set of signatures for different parts of the image, is proven to be invariant to the rather general family of locally affine transformations (which includes globally affine transformations of the whole image). The basic HW-module is at the core of the properties of the architecture. This paper focuses first on its characterization and then outlines the rest of the theory, including its connections with machine learning, machine vision and neuroscience. Most of the theorems are in the supplementary information, where, in the interest of telling a complete story, we quote some results which are described more fully elsewhere [5, 6, 7].

Figure 1: A hierarchical architecture built from HW-modules. Each red circle represents the signature vector computed by the associated module (the outputs of complex cells) and double arrows represent its receptive field – the part of the (neural) image visible to the module (for translations this is also the pooling range). The “image” is at level 0, at the bottom. The vector computed at the top of the hierarchy consists of invariant features for the whole image and is usually fed as input to a supervised learning machine such as a classifier; in addition, signatures from modules at intermediate layers may also be inputs to classifiers for objects and parts.

1 Invariant representations and sample complexity

One could argue that the most important aspect of intelligence is the ability to learn. How do present supervised learning algorithms compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few labeled examples. A child, or a monkey, can learn a recognition task from just a few examples. The main motivation of this paper is the conjecture that the key to reducing the sample complexity of object recognition is invariance to transformations. Images of the same object usually differ from each other because of simple transformations such as translation, scale (distance) or more complex deformations such as viewpoint (rotation in depth) or change in pose (of a body) or expression (of a face).

The conjecture is supported by previous theoretical work showing that almost all the complexity in recognition tasks is often due to the viewpoint and illumination nuisances that swamp the intrinsic characteristics of the object [8]. It implies that in many cases, recognition—both identification (e.g., of a specific car relative to other cars) and categorization (e.g., distinguishing between cars and airplanes)—would be much easier (only a small number of training examples would be needed to achieve a given level of performance) if the images of objects were rectified with respect to all transformations, or equivalently, if the image representation itself were invariant. In SI Appendix, section 0 we provide a proof of the conjecture for the special case of translation (and for obvious generalizations of it).

The case of identification is obvious since the difficulty in recognizing exactly the same object, e.g., an individual face, is only due to transformations. In the case of categorization, consider the suggestive evidence from the classification task in Fig. 2. The figure shows that if an oracle factors out all transformations in images of many different cars and airplanes, providing “rectified” images with respect to viewpoint, illumination, position and scale, the problem of categorizing cars vs airplanes becomes easy: it can be done accurately with very few labeled examples. In this case, good performance was obtained from a single training image of each class, using a simple classifier. In other words, the sample complexity of the problem seems to be very low. We propose that the ventral stream in visual cortex tries to approximate such an oracle, providing a quasi-invariant signature for images and image patches.

Figure 2: Sample complexity for the task of categorizing cars vs airplanes from their raw pixel representations (no preprocessing). A. Performance of a nearest-neighbor classifier (distance metric = 1 - correlation) as a function of the number of examples per class used for training. Each test used 74 randomly chosen images to evaluate the classifier. Error bars represent +/- 1 standard deviation computed over 100 training/testing splits using different images out of the full set of 440 objects × number of transformation conditions. Solid line: The rectified task. Classifier performance for the case where all training and test images are rectified with respect to all transformations; example images shown in B. Dashed line: The unrectified task. Classifier performance for the case where variation in position, scale, direction of illumination, and rotation around any axis (including rotation in depth) is allowed; example images shown in C. The images were created using 3D models from the Digimation model bank and rendered with Blender.

2 Invariance and uniqueness

Consider the problem of recognizing an image, or an image patch, independently of whether it has been transformed by the action of a group like the affine group in $\mathbb{R}^2$. We would like to associate to each object/image $I$ a signature, i.e., a vector which is unique and invariant with respect to a group of transformations $G$. (Note that our analysis, as we will see later, is not restricted to the case of groups.) In the following, we will consider groups that are compact and, for simplicity, finite (of cardinality $|G|$). We indicate, with slight abuse of notation, a generic group element and its (unitary) representation with the same symbol $g$, and its action on an image as $gI(x) = I(g^{-1}x)$ (e.g., a translation, $gI(x) = I(x - \tau)$). A natural mathematical object to consider is the orbit $O_I$—the set of images $gI$ generated from a single image $I$ under the action of the group. We say that two images are equivalent when they belong to the same orbit: $I \sim I'$ if there exists $g \in G$ such that $I' = gI$. This equivalence relation formalizes the idea that an orbit is invariant and unique. Indeed, if two orbits have a point in common they are identical everywhere. Conversely, two orbits are different if none of the images in one orbit coincides with any image in the other [9].

How can two orbits be characterized and compared? There are several possible approaches. A distance between orbits can be defined in terms of a metric on images, but its computation is not obvious (especially by neurons). We follow here a different strategy: intuitively two empirical orbits are the same irrespective of the ordering of their points. This suggests that we consider the probability distribution $P_I$ induced by the group’s action on images ($gI$ can be seen as a realization of a random variable). It is possible to prove (see Theorem 2 in SI Appendix section 2) that if two orbits coincide then their associated distributions under the group are identical, that is

$$I \sim I' \;\;\Longleftrightarrow\;\; P_I = P_{I'}. \qquad (1)$$

The distribution $P_I$ is thus invariant and discriminative, but it also inhabits a high-dimensional space and is therefore difficult to estimate. In particular, it is unclear how neurons or neuron-like elements could estimate it.

As argued later, neurons can effectively implement (high-dimensional) inner products, $\langle I, t\rangle$, between inputs $I$ and stored “templates” $t$ which are neural images. It turns out that classical results (such as the Cramer-Wold theorem [10], see Theorems 3 and 4 in section 2 of SI Appendix) ensure that a probability distribution $P_I$ can be almost uniquely characterized by the one-dimensional probability distributions induced by the (one-dimensional) results of projections $\langle I, t^k\rangle$, where $t^k$, $k = 1, \dots, K$, are a set of randomly chosen images called templates. A probability function in $d$ variables (the image dimensionality) induces a unique set of 1-D projections which is discriminative; empirically, a small number of projections is usually sufficient to discriminate among a finite number of different probability distributions. Theorem 4 in SI Appendix section 2 says (informally) that an approximately invariant and unique signature of an image $I$ can be obtained from the estimates of $K$ 1-D probability distributions $P_{\langle I, t^k\rangle}$ for $k = 1, \dots, K$. The number $K$ of projections needed to discriminate $n$ orbits, induced by $n$ images, up to precision $\epsilon$ (and with confidence $1 - \delta^2$) is $K \geq \frac{2}{c\epsilon^2}\log\frac{n}{\delta}$, where $c$ is a universal constant.

Thus the discriminability question can be answered positively (up to $\epsilon$) in terms of empirical estimates of the one-dimensional distributions of projections of the image onto a finite number of templates under the action of the group.
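As a purely illustrative sketch (ours, not part of the paper's experiments), the following Python snippet instantiates this claim for the cyclic group of 1-D translations: the empirical 1-D distributions of projections onto a few random templates coincide for an image and any of its translates, and differ for an unrelated image. All names and parameter values are arbitrary choices made for the demonstration.

import numpy as np

rng = np.random.default_rng(0)
d, K, n_bins = 64, 16, 10                      # image size, number of templates, histogram bins

def orbit(I):
    # all cyclic translations of a 1-D "image": the group orbit of I
    return np.stack([np.roll(I, s) for s in range(len(I))])

def signature(I, templates, bins):
    # one histogram of <gI, t^k> over the group per template (the empirical 1-D distributions)
    dots = orbit(I) @ templates.T              # |G| x K dot products
    return np.stack([np.histogram(dots[:, k], bins=bins, density=True)[0]
                     for k in range(templates.shape[0])])

templates = rng.standard_normal((K, d))
templates /= np.linalg.norm(templates, axis=1, keepdims=True)
bins = np.linspace(-3, 3, n_bins + 1)

I, J = rng.standard_normal(d), rng.standard_normal(d)
print(np.linalg.norm(signature(I, templates, bins) - signature(np.roll(I, 17), templates, bins)))  # ~0: invariant
print(np.linalg.norm(signature(I, templates, bins) - signature(J, templates, bins)))               # > 0: discriminative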

3 Memory-based learning of invariance

Notice that the estimation of $P_I$ requires the observation of the image and “all” its transforms $gI$. Ideally, however, we would like to compute an invariant signature for a new object seen only once (e.g., we can recognize a new face at different distances after just one observation). It is remarkable and almost magical that this is also made possible by the projection step. The key is the observation that $\langle gI, t^k\rangle = \langle I, g^{-1}t^k\rangle$. The same one-dimensional distribution is obtained from the projections of the image and all its transformations onto a fixed template, as from the projections of the image onto all the transformations of the same template. Indeed, the distributions of the variables $\langle gI, t^k\rangle$ and $\langle I, g^{-1}t^k\rangle$ are the same. Thus it is possible for the system to store for each template $t^k$ all its transformations $gt^k$ for all $g \in G$ and later obtain an invariant signature for new images without any explicit knowledge of the transformations $g$ or of the group to which they belong. Implicit knowledge of the transformations, in the form of the stored templates, allows the system to be automatically invariant to those transformations for new inputs (see eq. (9) in SI Appendix).

Estimates of the one-dimensional probability density functions (PDFs) can be written in terms of histograms as $\mu_n^k(I) = \frac{1}{|G|}\sum_{i=1}^{|G|}\eta_n\big(\langle I, g_i t^k\rangle\big)$, where $\eta_n$, $n = 1, \dots, N$, is a set of nonlinear functions (see remark 1 in SI Appendix section 1 or Theorem 6 in section 2, but also [11]). A visual system need not recover the actual probabilities from the empirical estimate in order to compute a unique signature. The set of $\mu_n^k(I)$ values is sufficient, since it identifies the associated orbit (see box 1 in SI Appendix). Crucially, mechanisms capable of computing invariant representations under affine transformations for future objects can be learned and maintained in an unsupervised, automatic way by storing and updating sets of transformed templates which are unrelated to those future objects.
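A one-line numerical check of the key identity above (our sketch, assuming unitary cyclic translations implemented with np.roll): the set of values of the projections obtained by moving the image equals the set obtained by moving only the stored template.

import numpy as np

rng = np.random.default_rng(1)
d = 64
I, t = rng.standard_normal(d), rng.standard_normal(d)

# distribution of <gI, t> over the translation group ...
dots_image_moved    = np.sort([np.roll(I, s) @ t for s in range(d)])
# ... equals the distribution of <I, g^{-1} t>: only the stored templates need to move
dots_template_moved = np.sort([I @ np.roll(t, -s) for s in range(d)])

print(np.allclose(dots_image_moved, dots_template_moved))   # True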

4 A theory of pooling

The arguments above make a few predictions. They require an effective normalization of the elements of the inner product (e.g., $\langle I, g t^k\rangle \mapsto \langle I, g t^k\rangle / (\|I\|\,\|g t^k\|)$) for the property $\langle gI, t^k\rangle = \langle I, g^{-1}t^k\rangle$ to be valid (see remark 8 of SI Appendix section 1 for the affine transformations case). Notice that invariant signatures can be computed in several ways from one-dimensional probability distributions. Instead of the components $\mu_n^k(I)$ directly representing the empirical distribution, the moments of the same distribution can be used [12] (this corresponds to the choice $\eta_n(\cdot) = (\cdot)^n$). Under weak conditions, the set of all moments uniquely characterizes the one-dimensional distribution (and thus $P_I$). $n = 1$ corresponds to pooling via sum/average (and is the only pooling function that does not require a nonlinearity); $n = 2$ corresponds to ”energy models” of complex cells and $n \to \infty$ is related to max-pooling. In our simulations, just one of these moments usually seems to provide sufficient selectivity to a hierarchical architecture (see SI Appendix section 6). Other nonlinearities are also possible [5]. The arguments of this section begin to provide a theoretical understanding of “pooling”, giving insight into the search for the “best” choice in any particular setting—something which is normally done empirically [13]. According to this theory, these different pooling functions are all invariant, each one capturing part of the full information contained in the PDFs.
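A minimal sketch (ours) of the different pooling choices discussed above, for the cyclic translation group: mean (n = 1), energy (n = 2) and max pooling of the same normalized dot products all return the same value for an image and its translate.

import numpy as np

rng = np.random.default_rng(2)
d = 64
I = rng.standard_normal(d)
t = rng.standard_normal(d); t /= np.linalg.norm(t)

def dots(I, t):
    # normalized dot products of I with every translated copy of the template t
    return np.array([np.roll(t, s) @ I for s in range(len(I))]) / np.linalg.norm(I)

poolers = {
    "mean (n=1)":     np.mean,                        # sum/average pooling
    "energy (n=2)":   lambda v: np.mean(v ** 2),      # "energy model" of complex cells
    "max (n->inf)":   np.max,                         # max pooling
}

for name, pool in poolers.items():
    a, b = pool(dots(I, t)), pool(dots(np.roll(I, 23), t))   # original vs. translated image
    print(f"{name}: {a:.6f} {b:.6f}")                        # identical up to float error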

5 Implementations

The theory has strong empirical support from several specific implementations which have been shown to perform well on a number of databases of natural images. The main support is provided by HMAX, an architecture in which pooling is done with a max operation and invariance, to translation and scale, is mostly hardwired (instead of learned). Its performance on a variety of tasks is discussed in SI Appendix section 6. Good performance is also achieved by other very similar architectures [14]. This class of existing models inspired the present theory, and may now be seen as special cases of it. Using the principles of invariant recognition the theory makes explicit, we have now begun to develop models that incorporate invariance to more complex transformations which cannot be solved by the architecture of the network, but must be learned from examples of objects undergoing transformations. These include non-affine and even non-group transformations, allowed by the hierarchical extension of the theory (see below). Performance for one such model is shown in Figure 3 (see caption for details).

Figure 3: Performance of a recent model [7] (inspired by the present theory) on Labeled Faces in the Wild, a same/different person task for faces seen in different poses and in the presence of clutter. A layer which builds invariance to translation, scaling, and limited in-plane rotation is followed by another which pools over variability induced by other transformations.

6 Extensions of the Theory

6.1 Invariance Implies Localization and Sparsity

The core of the theory applies without qualification to compact groups such as rotations of the image in the image plane. Translation and scaling are, however, only locally compact, and in any case, each of the modules of Fig. 1 observes only a part of the transformation’s full range. Each $\bigwedge$-module has a finite pooling range, corresponding to a finite “window” over the orbit associated with an image. Exact invariance for each module, in the case of translations or scaling transformations, is equivalent to a condition of localization/sparsity of the dot product between image and template (see Theorem 6 and Fig. 5 in section 2 of SI Appendix). In the simple case of a group parameterized by one parameter $r$ the condition is (for simplicity $I$ and $t^k$ have support centered in zero):

$$\langle I, g_r t^k\rangle = 0 \quad \text{for } |r| > a. \qquad (2)$$

Since this condition is a form of sparsity of the generic image $I$ w.r.t. a dictionary of templates $t^k$ (under a group), this result provides a computational justification for sparse encoding in sensory cortex [15].

It turns out that localization yields the following surprising result (Theorem 7 and 8 in SI Appendix): optimal invariance for translation and scale implies Gabor functions as templates. Since a frame of Gabor wavelets follows from natural requirements of completeness, this may also provide a general motivation for the Scattering Transform approach of Mallat based on wavelets [16].

The same Equation 2, if relaxed to hold approximately, that is $\langle I, g_r t^k\rangle \approx 0$ for $|r| > a$, becomes a sparsity condition for the class of images $I$ w.r.t. the dictionary $t^k$ under the group when restricted to a subclass of similar images. This property (see SI Appendix, end of section 2), which is an extension of the compressive sensing notion of “incoherence”, requires that $I$ and $t^k$ have a representation with sharply peaked correlation and autocorrelation. When the condition is satisfied, the basic HW-module equipped with such templates can provide approximate invariance to non-group transformations such as rotations in depth of a face or its changes of expression (see Proposition 9, section 2, SI Appendix). In summary, Equation 2 can be satisfied in two different regimes. The first one, exact and valid for generic images, yields optimal Gabor templates. The second regime, approximate and valid for specific subclasses of images, yields highly tuned templates, specific for the subclass. Note that this argument suggests generic, Gabor-like templates in the first layers of the hierarchy and highly specific templates at higher levels. (Note also that incoherence improves with increasing dimensionality.)

6.2 Hierarchical architectures

We have focused so far on the basic HW-module. Architectures consisting of such modules can be single-layer as well as multi-layer (hierarchical) (see Fig. 1). In our theory, the key property of hierarchical architectures of repeated HW-modules—allowing the recursive use of modules in multiple layers—is the property of covariance. By a covariant response at layer $\ell$ we mean that the distribution of the values of each projection is the same whether we transform the image or the templates (see Property 1 and Proposition 10 in section 3, SI Appendix).
One-layer networks can achieve invariance to global transformations of the whole image while providing a unique global signature which is stable with respect to small perturbations of the image (see Theorem 5 in section 2 of SI Appendix and [5]). The two main reasons for a hierarchical architecture such as Fig. 1 are (a) the need to compute an invariant representation not only for the whole image but especially for all parts of it, which may contain objects and object parts, and (b) invariance to global transformations that are not affine, but are locally affine, that is, affine within the pooling range of some of the modules in the hierarchy. Of course, one could imagine local and global one-layer architectures used in the same visual system without a hierarchical configuration, but there are further reasons favoring hierarchies including compositionality and reusability of parts. In addition to the issues of sample complexity and connectivity, one-stage architectures are unable to capture the hierarchical organization of the visual world where scenes are composed of objects which are themselves composed of parts. Objects can move in a scene relative to each other without changing their identity and often changing the scene only in a minor way; the same is often true for parts within an object. Thus global and local signatures from all levels of the hierarchy must be able to access memory in order to enable the categorization and identification of whole scenes as well as of patches of the image corresponding to objects and their parts. Fig. 4 shows examples of invariance and stability for wholes and parts. In the architecture of Fig. 1, each HW-module provides uniqueness, invariance and stability at different levels, over increasing ranges from bottom to top. Thus, in addition to the desired properties of invariance, stability and discriminability, these architectures match the hierarchical structure of the visual world and the need to retrieve items from memory at various levels of size and complexity. The results described here are part of a general theory of hierarchical architectures which is beginning to take form (see [5, 16, 17, 18]) around the basic function of computing invariant representations.

Figure 4: Empirical demonstration of the properties of invariance, stability and uniqueness of the hierarchical architecture in a specific two-layer implementation (HMAX). Inset (a) shows the reference image on the left and a deformation of it (the eyes are closer to each other) on the right; (b) shows the relative change in signature provided by HW-modules at the top layer, whose receptive fields contain the whole face. This signature vector is (Lipschitz) stable with respect to the deformation. Error bars represent standard deviation. Two different images (c) are presented at various locations in the visual field; (d) shows the relative change of the signature vector for different values of translation. The signature vector is invariant to global translation and discriminative (between the two faces). In this example the HW-module represents the top of a hierarchical, convolutional architecture; error bars represent standard deviation.

The property of compositionality discussed above is related to the efficacy of hierarchical architectures vs. one-layer architectures in dealing with the problem of partial occlusion and the more difficult problem of clutter in object recognition. Hierarchical architectures are better at recognition in clutter than one-layer networks [19] because they provide signatures for image patches of several sizes and locations. However, hierarchical feedforward architectures cannot fully solve the problem of clutter. More complex (e.g. recurrent) architectures are likely needed for human-level recognition in clutter (see for instance [20, 21, 22]) and for other aspects of human vision. It is likely that much of the circuitry of visual cortex is required by these recurrent computations, not considered in this paper.

7 Visual Cortex

The theory described above effectively maps the computation of an invariant signature onto well-known capabilities of cortical neurons. A key difference between the basic elements of our digital computers and neurons is the number of connections: 3 vs. $10^3$–$10^4$ synapses per cortical neuron. Taking into account basic properties of synapses, it follows that a single neuron can compute high-dimensional ($10^3$–$10^4$) inner products between input vectors and the stored vector of synaptic weights [23].
Consider an HW-module of “simple” and “complex” cells [1] looking at the image through a window defined by their receptive fields (see SI Appendix, section 2, POG). Suppose that images of objects in the visual environment undergo affine transformations. During development—and more generally, during visual experience—a set of simple cells store in their synapses an image patch $t^k$ and its transformations $g_1 t^k, \dots, g_{|G|}t^k$ — one per simple cell. This is done, possibly at separate times, for $K$ different image patches (templates) $t^k$, $k = 1, \dots, K$. Each sequence $g_1 t^k, \dots, g_{|G|}t^k$, for fixed $k$, is a sequence of frames, literally a movie of an image patch transforming. There is a very simple, general, and powerful way to learn such unconstrained transformations. Unsupervised (Hebbian) learning is the main mechanism: for a “complex” cell to pool over several simple cells, the key is an unsupervised Foldiak-type rule: cells that fire together are wired together. At the level of complex cells this rule determines classes of equivalence among simple cells – reflecting observed time correlations in the real world, that is, transformations of the image. Time continuity, induced by the Markovian physics of the world, allows associative labeling of stimuli based on their temporal contiguity.

Later, when an image $I$ is presented, the simple cells compute $\langle I, g_i t^k\rangle$ for $i = 1, \dots, |G|$. The next step, as described above, is to estimate the one-dimensional probability distribution of such a projection, that is, the distribution of the outputs of the simple cells. It is generally assumed that complex cells pool the outputs of simple cells. Thus a complex cell could compute $\mu_n^k(I) = \frac{1}{|G|}\sum_{i=1}^{|G|}\sigma\big(\langle I, g_i t^k\rangle + n\Delta\big)$, where $\sigma$ is a smooth version of the step function ($\sigma(x) = 0$ for $x \leq 0$, $\sigma(x) = 1$ for $x > 0$) and $n = 1, \dots, N$ (this corresponds to the choice $\eta_n(\cdot) = \sigma(\cdot + n\Delta)$). Each of these complex cells would estimate one bin of an approximated CDF (cumulative distribution function) for $P_{\langle I, t^k\rangle}$. Following the theoretical arguments above, the complex cells could compute, instead of an empirical CDF, one or more of its moments. $n = 1$ is the mean of the dot products, $n = 2$ corresponds to an energy model of complex cells [24]; very large $n$ corresponds to a max operation. Conventional wisdom interprets available physiological data to suggest that simple/complex cells in V1 may be described in terms of energy models, but our alternative suggestion of empirical histogramming by sigmoidal nonlinearities with different offsets may fit the diversity of data even better.
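The following sketch (ours, with arbitrary constants) spells out this computation for one template under cyclic translation: each "complex cell" applies a sigmoid with its own offset to the pooled simple-cell responses, so that together the N cells approximate the empirical CDF; the same responses can instead be pooled into moments.

import numpy as np

rng = np.random.default_rng(3)
d, N = 64, 8                                          # image size, number of complex cells (CDF bins)
I = rng.standard_normal(d)
t = rng.standard_normal(d); t /= np.linalg.norm(t)

# simple-cell outputs: dot products of I with every translated template, normalized
simple = np.array([np.roll(t, s) @ I for s in range(d)]) / np.linalg.norm(I)

sigma = lambda x: 1.0 / (1.0 + np.exp(-x / 0.05))     # smooth version of the step function
offsets = np.linspace(-1, 1, N)                       # one threshold per complex cell

# each complex cell pools all simple cells through a sigmoid with its own offset;
# together the N values approximate the empirical CDF of the simple-cell responses
cdf_bins = np.array([sigma(simple + o).mean() for o in offsets])
moments  = np.array([np.mean(simple ** n) for n in (1, 2)])   # alternative pooling: moments

print(cdf_bins)    # unchanged if I is replaced by np.roll(I, any_shift)
print(moments)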

As described above, a template and its transformed versions may be learned from unsupervised visual experience through Hebbian plasticity. Remarkably, our analysis and empirical studies[5] show that Hebbian plasticity, as formalized by Oja, can yield Gabor-like tuning—i.e., the templates that provide optimal invariance to translation and scale (see SI Appendix section 2).
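As an illustration of the mechanics only (our sketch; it does not reproduce the simulations of [5], and Gabor-like tuning additionally requires the translation/scale setting analyzed there), Oja's rule applied to a "movie" of a translating patch converges to a unit-norm synaptic weight vector aligned with the leading principal component of the observed frames.

import numpy as np

rng = np.random.default_rng(4)
d, n_frames, epochs, lr = 64, 200, 50, 0.01

# unsupervised experience: a "movie" of one localized patch translating across the receptive field
patch = np.exp(-0.5 * ((np.arange(d) - d / 2) / 3.0) ** 2)
frames = np.stack([np.roll(patch, s) for s in rng.integers(0, d, n_frames)])
frames -= frames.mean(axis=1, keepdims=True)          # zero-mean inputs, as assumed by the theory

w = rng.standard_normal(d) * 0.01                     # initial synaptic weights
for _ in range(epochs):
    for x in frames:
        y = w @ x
        w += lr * y * (x - y * w)                     # Oja's rule: Hebbian term plus normalization

print(np.linalg.norm(w))                              # ~1: Oja's rule self-normalizes the weights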

The localization condition (Equation 2) can also be satisfied by images and templates that are similar to each other. The result is invariance to class-specific transformations. This part of the theory is consistent with the existence of class-specific modules in primate cortex such as a face module and a body module [25, 26, 6]. It is intriguing that the same localization condition suggests general Gabor-like templates for generic images in the first layers of a hierarchical architecture and specific, sharply tuned templates for the last stages of the hierarchy. This theory also fits physiology data concerning Gabor-like tuning in V1 and possibly in V4 (see [5]). It can also be shown that the theory, together with the hypothesis that storage of the templates takes place via Hebbian synapses, also predicts properties of the tuning of neurons in the face patch AL of macaque visual cortex [5, 27].

From the point of view of neuroscience, the theory makes a number of predictions, some obvious, some less so. One of the main predictions is that simple and complex cells should be found in all visual and auditory areas, not only in V1. Our definition of simple cells and complex cells is different from the traditional ones used by physiologists; for example, we propose a broader interpretation of complex cells, which in the theory represent invariant measurements associated with histograms of the outputs of simple cells or of moments of it. The theory implies that invariance to all image transformations could be learned, either during development or in adult life. It is, however, also consistent with the possibility that basic invariances may be genetically encoded by evolution but also refined and maintained by unsupervised visual experience. Studies on the development of visual invariance in organisms such as mice raised in virtual environments could test these predictions.

8 Discussion

The goal of this paper is to introduce a new theory of learning invariant representations for object recognition which cuts across levels of analysis [5, 28]. At the computational level, it gives a unified account of why a range of seemingly different models have recently achieved impressive results on recognition tasks. HMAX [2, 29, 30], Convolutional Neural Networks [3, 4, 31, 32] and Deep Feedforward Neural Networks [33, 34, 35] are examples of this class of architectures—as is, possibly, the feedforward organization of the ventral stream. At the algorithmic level, it motivates the development, now underway, of a new class of models for vision and speech which includes the previous models as special cases. At the level of biological implementation, its characterization of the optimal tuning of neurons in the ventral stream is consistent with the available data on Gabor-like tuning in V1[5] and the more specific types of tuning in higher areas such as in face patches.

Despite significant advances in sensory neuroscience over the last five decades, a true understanding of the basic functions of the ventral stream in visual cortex has proven to be elusive. Thus it is interesting that the theory of this paper follows from a novel hypothesis about the main computational function of the ventral stream: the representation of new objects/images in terms of a signature which is invariant to transformations learned during visual experience, thereby allowing recognition from very few labeled examples—in the limit, just one. A main contribution of our work to machine learning is a novel theoretical framework for the next major challenge in learning theory beyond the supervised learning setting which is now relatively mature: the problem of representation learning, formulated here as the unsupervised learning of invariant representations that significantly reduce the sample complexity of the supervised learning stage.

Acknowledgements.
We would like to thank the McGovern Institute for Brain Research for their support. We would also like to thank Yann LeCun, Ethan Meyers, Andrew Ng, Bernhard Schoelkopf and Alain Yuille for reading earlier versions of the manuscript. We also thank Michael Buice, Charles Cadieu, Robert Desimone, Leyla Isik, Christof Koch, Gabriel Kreiman, Lakshminarayanan Mahadevan, Stephane Mallat, Pietro Perona, Ben Recht, Maximilian Riesenhuber, Ryan Rifkin, Terrence J. Sejnowski, Thomas Serre, Steve Smale, Stefano Soatto, Haim Sompolinsky, Carlo Tomasi, Shimon Ullman and Lior Wolf for useful comments. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. This research was also sponsored by grants from the National Science Foundation (NSF-0640097, NSF-0827427), and AFSOR-THRL (FA8650-05-C-7262). Additional support was provided by the Eugene McDermott Foundation.

9 Supplementary Information

0. Invariance significantly reduces sample complexity

In this section we show how, in the simple case of transformations which are translations, an invariant representation of the image space considerably reduces the sample complexity of the classifier.
If we view images as vectors in $\mathbb{R}^d$, the sample complexity of a learning rule depends on the covering number of the ball $B \subset \mathbb{R}^d$ that contains all the image distribution. More precisely, the covering number, $\mathcal{N}(\epsilon, B)$, is defined as the minimum number of $\epsilon$-balls needed to cover $B$. Suppose $B$ has radius $R$; we have

$$\mathcal{N}(\epsilon, B) \sim \left(\frac{R}{\epsilon}\right)^{d}.$$

For example, in the case of linear learning rules, the sample complexity is proportional to the logarithm of the covering number.

Consider the simplest and most intuitive example: an image made of a single pixel and its translations within a square of $d$ pixels, where $d$ is the dimensionality of the image space. In the pixel basis, the space of the image and all its translates has dimension $d$, while the image itself spans a one-dimensional space. The associated covering numbers are therefore

$$\mathcal{N}(\epsilon) \sim \frac{R}{\epsilon}, \qquad \mathcal{N}^{T}(\epsilon) \sim \left(\frac{R}{\epsilon}\right)^{d},$$

where $\mathcal{N}(\epsilon)$ stands for the covering number of the image space and $\mathcal{N}^{T}(\epsilon)$ for the covering number of the translated image space. The sample complexity associated to the image space (see e.g. [36]) is $m \propto \log\mathcal{N}(\epsilon)$ and that associated to the translated images is $m^{T} \propto \log\mathcal{N}^{T}(\epsilon) = d\,\log\mathcal{N}(\epsilon)$. The sample complexity reduction of an invariant representation is therefore given by

$$\frac{m^{T}}{m} \sim d.$$

The above reasoning is independent of the choice of basis, since it depends only on the dimensionality of the ball containing all the images. For example, we could have determined the dimensionality by looking at the number of eigenvectors (with non-null eigenvalue) of the associated $d \times d$ circulant matrix, i.e. by using the Fourier basis. In the simple case above, this number is clearly $d$.
In general any transformation of an abelian group can be analyzed using the Fourier transform on the group. We conjecture that a similar reasoning holds for locally compact groups using a wavelet representation instead of the Fourier representation.
The example and the ideas above lead to the following theorem.
Theorem. Consider a space of images of dimension $p \times p$ pixels which may appear in any position within a window of size $rp \times rp$ pixels. The usual image representation yields a sample complexity (of a linear classifier) of order $m_{image} = \mathcal{O}(r^2 p^2)$; the invariant representation yields (because of much smaller covering numbers) a sample complexity of order $m_{inv} = \mathcal{O}(p^2) = \frac{m_{image}}{r^2}$.
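For concreteness, plugging illustrative numbers (ours, not from the paper) into the orders stated in the theorem, say a $32 \times 32$-pixel object that can appear anywhere in a $128 \times 128$ window (so $p = 32$, $r = 4$), gives

$$m_{image} = \mathcal{O}(r^2 p^2) = \mathcal{O}(16384), \qquad m_{inv} = \mathcal{O}(p^2) = \mathcal{O}(1024), \qquad \frac{m_{image}}{m_{inv}} = r^2 = 16,$$

i.e. roughly an order of magnitude fewer labeled examples for the same linear classifier.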

1. Setup and Definitions

Let $\mathcal{H}$ be a Hilbert space with norm and inner product denoted by $\|\cdot\|$ and $\langle\cdot, \cdot\rangle$, respectively. We can think of $\mathcal{H}$ as the space of images (our images are usually “neural images”). We typically consider $\mathcal{H} = L^2(\mathbb{R}^2)$, $L^2(\mathbb{R})$ or $\mathbb{R}^d$. We denote with $G$ a (locally) compact group and, with an abuse of notation, we denote by $g$ both a group element in $G$ and its action/representation on $\mathcal{H}$.
When useful we will make the following assumptions which are justified from a biological point of view.

Normalized dot products of signals (e.g. images or “neural activities”) are usually assumed throughout the theory, for convenience but also because they provide the most elementary invariances – to measurement units (origin and scale). We assume that the dot products are between functions or vectors that are zero-mean and of unit norm. Thus $I$ is replaced by $\frac{I - \bar I}{\|I - \bar I\|}$, with $\bar I$ the mean value of $I$. This normalization stage before each dot product is consistent with the convention that the empty surround of an isolated image patch has zero value (which can be taken to be the average “grey” value over the ensemble of images). In particular the dot product of a template – in general different from zero – and the “empty” region outside an isolated image patch will be zero. The dot product of two uncorrelated images – for instance of random 2D noise – is also approximately zero.
Remarks:

  1. The $n$-th component of the signature associated with a simple-complex module is (see Equation (10) or (13)) $\mu_n^k(I) = \frac{1}{|G_0|}\sum_{g \in G_0}\eta_n\big(\langle I, g t^k\rangle\big)$, where the functions $\eta_n$ are such that $\eta_n(0) = 0$: in words, the empirical histogram estimated for $\langle I, g t^k\rangle$ does not take into account the zero value, since it does not carry any information about the image patch. The functions $\eta_n$ are also assumed to be positive and bijective.

  2. Images have a maximum total possible support corresponding to a bounded region, which we refer to as the visual field, and which corresponds to the spatial pooling range of the module at the top of the hierarchy of Figure 1 in the main text. Neuronal images are inputs to the modules in higher layers and are usually supported in a higher dimensional space, corresponding to the signature components provided by lower-layer modules; isolated objects are images with support contained in the pooling range of one of the modules at an intermediate level of the hierarchy. We use the notation $\langle I, g t^k\rangle$ for the simple responses and $\mu_n^k(I)$ for the complex responses. To simplify the notation we suppose that the center of the support of the signature at each layer coincides with the center of the pooling range.

  3. The domain of the dot products corresponding to templates and to simple cells is in general different from the domain of the pooling. We will continue to use the commonly used term receptive field – even if it mixes these two domains.

  4. The main part of the theory characterizes properties of the basic HW module – which computes the components of an invariant signature vector from an image patch within its receptive field.

  5. It is important to emphasize that the basic module is always the same throughout the paper. We use different mathematical tools, including approximations, to study under which conditions (e.g. localization or linearization, see end of section 2) the signature computed by the module is invariant or approximately invariant.

  6. The pooling is effectively over a pooling window in the group parameters. In the case of 1D scaling and 1D translations, the pooling window corresponds to an interval of scales and an interval of translations, respectively.

  7. All the results in this paper are valid in the case of a discrete or a continuous compact (locally compact) group: in the first case we have a sum over the transformations, in the second an integral over the Haar measure of the group.

  8. Normalized dot products also eliminate the need for the explicit computation of the determinant of the Jacobian for affine transformations (which is a constant and cancels when dividing by the norms), assuring that $\frac{\langle gI, gt\rangle}{\|gI\|\,\|gt\|} = \frac{\langle I, t\rangle}{\|I\|\,\|t\|}$, where $g$ is an affine transformation.

2. Invariance and uniqueness: Basic Module

9.1 Compact Groups (fully observable)

Given an image $I \in \mathcal{H}$ and a group representation $g$, the orbit $O_I = \{gI \in \mathcal{H} \mid g \in G\}$ is uniquely associated to an image and all its transformations. The orbit provides an invariant representation of $I$, i.e. $O_I = O_{gI}$ for all $g \in G$. Indeed, we can view an orbit as all the possible realizations of a random variable with distribution induced by the group action. From this observation, a signature can be derived for compact groups, by using results characterizing probability distributions via their one-dimensional projections.
In this section we study the signature given by

$$\Sigma(I) = \big(\mu^1(I), \dots, \mu^K(I)\big),$$

where each component $\mu^k(I) \in \mathbb{R}^N$ is a histogram corresponding to a one-dimensional projection defined by a template $t^k \in \mathcal{H}$.

9.2 Orbits and probability distributions

If $G$ is a compact group, the associated Haar measure can be normalized to be a probability measure, so that, for any $I \in \mathcal{H}$, we can define the random variable

$$Z_I : G \to \mathcal{H}, \qquad Z_I(g) = gI.$$

The corresponding distribution $P_I$ is defined as $P_I(A) = dg\big(Z_I^{-1}(A)\big)$ for any Borel set $A \subset \mathcal{H}$ (with some abuse of notation we let $dg$ denote the normalized Haar measure).
Recall that we define two images $I, I' \in \mathcal{H}$ to be equivalent (and we indicate it with $I \sim I'$) if there exists $g \in G$ s.t. $I = gI'$. We have the following theorem:

Theorem 2. The distribution $P_I$ is invariant and unique, i.e. $I \sim I' \iff P_I = P_{I'}$.

Proof:
We first prove that $I \sim I' \Rightarrow P_I = P_{I'}$. By definition $P_I = P_{I'}$ iff $\int_A dP_I(s) = \int_A dP_{I'}(s)$ for all Borel sets $A \subset \mathcal{H}$, that is

$$\int_{Z_I^{-1}(A)} dg = \int_{Z_{I'}^{-1}(A)} dg, \qquad \text{where} \quad Z_I^{-1}(A) = \{g \in G \mid gI \in A\}.$$

Note that if $I \sim I'$ then $I' = \bar g I$ for some $\bar g \in G$, so that $gI' \in A \iff g\bar g I \in A$, i.e. $Z_{I'}^{-1}(A) = Z_I^{-1}(A)\,\bar g^{-1}$. Using this observation we have

$$\int_{Z_{I'}^{-1}(A)} dg = \int_{Z_I^{-1}(A)\,\bar g^{-1}} dg = \int_{Z_I^{-1}(A)} dg,$$

where in the last integral we used the change of variable $g \mapsto g\bar g$ and the invariance property of the Haar measure: this proves the implication.
To prove that $P_I = P_{I'} \Rightarrow I \sim I'$, note that $P_I = P_{I'}$ implies that the support of the probability distribution of $I$ has non-null intersection with that of $I'$, i.e. the orbits of $I$ and $I'$ intersect. In other words there exist $g, g' \in G$ such that $gI = g'I'$. This implies $I' = (g')^{-1}gI$, i.e. $I \sim I'$. Q.E.D.

9.3 Random Projections for Probability Distributions.

Given the above discussion, a signature may be associated to $I$ by constructing a histogram approximation of $P_I$, but this would require dealing with high-dimensional histograms. The following classic theorem gives a way around this problem.
For a template $t \in S(\mathcal{H})$, where $S(\mathcal{H})$ is the unit sphere in $\mathcal{H}$, let $I \mapsto \langle I, t\rangle$ be the associated projection. Moreover, let $P_{\langle I, t\rangle}$ be the distribution associated to the random variable $g \mapsto \langle gI, t\rangle$ (or equivalently $g \mapsto \langle I, g^{-1}t\rangle$, if $g$ is unitary).

Theorem 3 (Cramer-Wold, [10]). For any pair $P, Q$ of probability distributions on $\mathcal{H}$, we have that $P = Q$ if and only if $P_{\langle\cdot, t\rangle} = Q_{\langle\cdot, t\rangle}$ for all $t \in S(\mathcal{H})$.

In words, two probability distributions are equal if and only if their projections on any of the unit-sphere directions are equal. The above result can be equivalently stated as saying that the probability of choosing $t$ such that $P_{\langle\cdot, t\rangle} = Q_{\langle\cdot, t\rangle}$ is equal to 1 if and only if $P = Q$, and the probability of choosing $t$ such that $P_{\langle\cdot, t\rangle} = Q_{\langle\cdot, t\rangle}$ is equal to 0 if and only if $P \neq Q$ (see Theorem 3.4 in [37]). The theorem suggests a way to define a metric on distributions (orbits) in terms of

$$d(P_I, P_{I'}) = \int_{S(\mathcal{H})} d_0\big(P_{\langle I, t\rangle}, P_{\langle I', t\rangle}\big)\, d\lambda(t),$$

where $d_0$ is any metric on one-dimensional probability distributions and $\lambda$ is a distribution measure on the projections. Indeed, it is easy to check that $d$ is a metric. In particular note that, in view of the Cramer-Wold theorem, $d(P_I, P_{I'}) = 0$ if and only if $P_I = P_{I'}$. As mentioned in the main text, each one-dimensional distribution $P_{\langle I, t\rangle}$ can be approximated by a suitable histogram $\mu^t(I) = \big(\mu_n^t(I)\big)_{n=1,\dots,N} \in \mathbb{R}^N$, so that, in the limit in which the histogram approximation is accurate,

$$d(P_I, P_{I'}) \approx \int_{S(\mathcal{H})} d_\mu\big(\mu^t(I), \mu^t(I')\big)\, d\lambda(t), \qquad (3)$$

where $d_\mu$ is a metric on histograms induced by $d_0$.

A natural question is whether there are situations in which a finite number of projections suffice to discriminate any two probability distributions, that is, whether a finite set of templates yields a discriminative signature. Empirical results show that this is often the case with a small number of templates (see [38] and HMAX experiments, section 6). The problem of mathematically characterizing the situations in which a finite number of (one-dimensional) projections are sufficient is challenging. Here we provide a partial answer to this question.
We start by observing that the metric (3) can be approximated by uniformly sampling $K$ templates $t^k$, $k = 1, \dots, K$, and considering

$$\hat d_K(P_I, P_{I'}) = \frac{1}{K}\sum_{k=1}^{K} d_0\big(P_{\langle I, t^k\rangle}, P_{\langle I', t^k\rangle}\big), \qquad (4)$$

where the templates $t^k$ are sampled according to $\lambda$. The following result shows that a finite number $K$ of templates is sufficient to obtain an approximation within a given precision $\epsilon$. Towards this end let

$$d_0\big(P_{\langle I, t^k\rangle}, P_{\langle I', t^k\rangle}\big) = \big\|\mu^k(I) - \mu^k(I')\big\|, \qquad (5)$$

where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^N$. The following theorem holds:

Theorem 4. Consider $n$ images $\mathcal{X}_n$ in $\mathcal{H}$. Let $K \geq \frac{2}{c\epsilon^2}\log\frac{n}{\delta}$, where $c$ is a universal constant. Then

$$\left|d(P_I, P_{I'}) - \hat d_K(P_I, P_{I'})\right| \leq \epsilon, \qquad (6)$$

with probability $1 - \delta^2$, for all $I, I' \in \mathcal{X}_n$.

Proof:
The proof follows from an application of Hoeffding’s inequality and a union bound.
Fix $I, I' \in \mathcal{X}_n$. Define the real random variable $Z : S(\mathcal{H}) \to \mathbb{R}$,

$$Z(t^k) = \big\|\mu^k(I) - \mu^k(I')\big\|, \qquad k = 1, \dots, K.$$

From the definitions it follows that $|Z| \leq c$ and $\mathbb{E}(Z) = d(P_I, P_{I'})$. Then Hoeffding’s inequality implies

$$\left|\hat d_K(P_I, P_{I'}) - d(P_I, P_{I'})\right| = \left|\frac{1}{K}\sum_{k=1}^{K} Z(t^k) - \mathbb{E}(Z)\right| \geq \epsilon$$

with probability at most $e^{-c\epsilon^2 K}$. A union bound implies a result holding uniformly on $\mathcal{X}_n$; the probability becomes at most $n^2 e^{-c\epsilon^2 K}$. The desired result is obtained noting that this probability is less than $\delta^2$ as soon as $n^2 e^{-c\epsilon^2 K} < \delta^2$, that is $K \geq \frac{2}{c\epsilon^2}\log\frac{n}{\delta}$. Q.E.D.

The above result shows that the discriminability question can be answered in terms of empirical estimates of the one-dimensional distributions of projections of the image and its transformations induced by the group onto a finite number $K$ of templates.
Theorem 4 can be compared to a version of the Cramer-Wold theorem for discrete probability distributions. Theorem 1 in [39] shows that for a probability distribution consisting of $k$ atoms in $\mathbb{R}^d$, at most $k + 1$ directions are enough to characterize the distribution, thus a finite – albeit large – number of one-dimensional projections.
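As a rough illustration (our numbers; $c$ is the unspecified universal constant, set to 1 here), discriminating $n = 10^6$ images up to precision $\epsilon = 0.1$ with confidence parameter $\delta = 0.01$ requires on the order of

$$K \geq \frac{2}{c\,\epsilon^{2}}\,\log\frac{n}{\delta} = \frac{2}{0.01}\,\log\!\big(10^{8}\big) \approx 200 \times 18.4 \approx 3.7 \times 10^{3}$$

templates, growing only logarithmically with the number of images.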

9.4 Memory based learning of invariance

The signature $\Sigma(I) = \big(\mu^1(I), \dots, \mu^K(I)\big)$ is obviously invariant (and unique) since it is associated to an image and all its transformations (an orbit). Each component of the signature is also invariant – it corresponds to a group average. Indeed, each measurement can be defined as

$$\mu_n^k(I) = \frac{1}{|G|}\sum_{i=1}^{|G|}\eta_n\big(\langle I, g_i t^k\rangle\big) \qquad (7)$$

for a finite group, or equivalently

$$\mu_n^k(I) = \int_G dg\; \eta_n\big(\langle I, g t^k\rangle\big) \qquad (8)$$

when $G$ is a (locally) compact group. Here, the nonlinearity $\eta_n$ can be chosen to define a histogram approximation; in general $\eta_n$ is a bijective positive function. Then, it is clear that from the properties of the Haar measure we have

$$\mu_n^k(\bar g I) = \mu_n^k(I), \qquad \forall \bar g \in G. \qquad (9)$$

Note that in the r.h.s. of eq. (8) the transformations are on the templates: this mathematically trivial (for unitary transformations) step has a deeper computational aspect. Invariance is now in fact achieved through transformations of templates instead of those of the image, which are not always available.

9.5 Stability

With $\Sigma(I)$ denoting as usual the signature of an image, and $d(\Sigma(I), \Sigma(I'))$, $I, I' \in \mathcal{H}$, a metric, we say that a signature $\Sigma$ is stable if it is Lipschitz continuous (see [16]), that is

$$d\big(\Sigma(I), \Sigma(I')\big) \leq L\,\|I - I'\|, \qquad \forall I, I' \in \mathcal{H},\ L > 0. \qquad (10)$$

In our setting we let

$$d\big(\Sigma(I), \Sigma(I')\big) = \frac{1}{\sqrt{NK}}\,\big\|\Sigma(I) - \Sigma(I')\big\|,$$

and assume that the nonlinearities $\eta_n$ are Lipschitz continuous with constants $L_{\eta_n}$, for $n = 1, \dots, N$. If $L \leq 1$ we call the signature map contractive. In the following we prove a stronger form of eq. (10) where the norm $\|I - I'\|$ is substituted with the Hausdorff norm on the orbits (which is independent of the choice of $I$ and $I'$ in the orbits), defined as $\|I - I'\|_{H} = \min_{g, g' \in G}\|gI - g'I'\|$, i.e. we have:

Theorem 5. Assume normalized templates and let the nonlinearities $\eta_n$ be such that $L_{\eta_n} \leq 1$, where $L_{\eta_n}$ is the Lipschitz constant of the function $\eta_n$. Then

$$d\big(\Sigma(I), \Sigma(I')\big) \leq \|I - I'\|_{H} \qquad (11)$$

for all $I, I' \in \mathcal{H}$.

Proof:
By definition, if the nonlinearities $\eta_n$ are Lipschitz continuous with Lipschitz constant $L_{\eta_n}$, it follows that for each component of the signature we have

$$\big|\mu_n^k(I) - \mu_n^k(I')\big| \leq \frac{L_{\eta_n}}{|G|}\sum_{i=1}^{|G|}\big|\langle I - I', g_i t^k\rangle\big|,$$

where we used the linearity of the inner product and Jensen’s inequality. Applying Schwartz’s inequality we obtain

$$\big|\mu_n^k(I) - \mu_n^k(I')\big| \leq L_{\eta_n}\,\|I - I'\|\,\max_i\|g_i t^k\|.$$

If we assume the templates and their transformations to be normalized to unity then we finally have

$$\big|\mu_n^k(I) - \mu_n^k(I')\big| \leq L_{\eta_n}\,\|I - I'\|, \qquad (12)$$

from which we obtain (10) summing over all the $NK$ components, dividing by $\sqrt{NK}$, and using $L_{\eta_n} \leq 1$ by hypothesis. Note now that the l.h.s. of (12), each component of the signature being invariant, is independent of the choice of $I$ and $I'$ within their orbits. We can then choose $\bar g, \bar g' \in G$ such that

$$\|\bar g I - \bar g' I'\| = \min_{g, g' \in G}\|gI - g'I'\| = \|I - I'\|_{H}.$$

In particular, the map being non-expansive, summing over each component and dividing by $\sqrt{NK}$ we have eq. (11). Q.E.D.

The above result shows that the stability of the empirical signature

$$\Sigma(I) = \big(\mu^1(I), \dots, \mu^K(I)\big) \in \mathbb{R}^{NK},$$

provided with the metric (4) (together with (5)), holds for nonlinearities with Lipschitz constants $L_{\eta_n}$ such that $L_{\eta_n} \leq 1$.

Box 1: computing an invariant signature
1: procedure Signature(I)
2:     Given stored transformed templates $g t^k$, for all $g \in G$ and $k = 1, \dots, K$.
3:     for $k = 1, \dots, K$ do
4:         Compute $\langle I, g t^k\rangle$, the normalized dot products of the image with all the transformed templates (all $g \in G$).
5:         Pool the results: $\mu^k(I) = \mathrm{POOL}\big(\{\langle I, g t^k\rangle : g \in G\}\big)$.
6:     end for
7:     return $\Sigma(I) = \big(\mu^1(I), \dots, \mu^K(I)\big)$.       $\Sigma(I)$ is unique and invariant if there are enough templates.
8: end procedure
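A runnable counterpart of Box 1 (our sketch, assuming the group is cyclic 1-D translation and histogram pooling; function and variable names are ours):

import numpy as np

def signature(I, templates, n_bins=10):
    # Box 1 sketch: invariant signature of image I from stored transformed templates.
    # templates: array of shape (K, d); the group is cyclic 1-D translation.
    I = I - I.mean()
    I = I / np.linalg.norm(I)
    sig = []
    for t in templates:
        # step 4: normalized dot products of I with all transformed templates g t^k
        gt = np.stack([np.roll(t, s) for s in range(len(t))])
        gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
        dots = gt @ I
        # step 5: pool over the group (here a histogram, i.e. the empirical 1-D distribution)
        pooled, _ = np.histogram(dots, bins=n_bins, range=(-1, 1), density=True)
        sig.append(pooled)
    return np.concatenate(sig)           # step 7: the concatenated pooled results

# usage: the signatures of an image and of its translate coincide
rng = np.random.default_rng(5)
templates = rng.standard_normal((8, 64))
I = rng.standard_normal(64)
print(np.allclose(signature(I, templates), signature(np.roll(I, 11), templates)))   # True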

9.6 Partially Observable Groups case: invariance implies localization and sparsity

This section outlines invariance, uniqueness and stability properties of the signature obtained in the case in which transformations of a group are observable only within a window “over” the orbit. The term POG (Partially Observable Groups) emphasizes the properties of the group – in particular the associated invariants – as seen by an observer (e.g. a neuron) looking through a window at a part of the orbit. Let $G$ be a finite group and $G_0 \subseteq G$ a subset (note: $G_0$ is not usually a subgroup). The subset of transformations $G_0$ can be seen as the set of transformations that can be observed by a window on the orbit, that is, the transformations that correspond to a part of the orbit. A local signature associated to the partial observation of $G$ can be defined considering

$$\mu_n^k(I) = \frac{1}{|G_0|}\sum_{g \in G_0}\eta_n\big(\langle I, g t^k\rangle\big), \qquad (13)$$

and $\Sigma_{G_0}(I) = \big(\mu_n^k(I)\big)_{n,k}$. This definition can be generalized to any locally compact group considering

$$\mu_n^k(I) = \frac{1}{V_0}\int_{G_0} dg\; \eta_n\big(\langle I, g t^k\rangle\big), \qquad V_0 = \int_{G_0} dg. \qquad (14)$$

Note that the constant $V_0$ normalizes the Haar measure, restricted to $G_0$, so that it defines a probability distribution. The latter is the distribution of the images subject to the group transformations which are observable, that is, those in $G_0$. The above definitions can be compared to definitions (7) and (8) in the fully observable group case. In the next sections we discuss the properties of the above signature. While stability and uniqueness follow essentially from the analysis of the previous section, invariance requires developing a new analysis.

9.7 POG: Stability and Uniqueness

A direct consequence of Theorem 2 is that any two orbits with a common point are identical. This follows from the fact that if $\bar I = gI = g'I'$ is a common point of the orbits, then $I' = (g')^{-1}gI$; thus the two images are transformed versions of one another and $O_I = O_{I'}$.
Suppose now that only a fragment of the orbits – the part within the window – is observable; the reasoning above is still valid, since if the orbits are different or equal so must be any of their “corresponding” parts.
Regarding the stability of POG signatures, note that the reasoning in the previous section, Theorem 5, can be repeated without any significant change. In fact, only the normalization over the transformations is modified accordingly.

9.8 POG: Partial Invariance and Localization

Since the group is only partially observable, we introduce the notion of partial invariance for images and transformations that are within the observation window. Partial invariance is defined in terms of invariance of

$$\mu_n^k(I) = \frac{1}{V_0}\int_{G_0} dg\; \eta_n\big(\langle I, g t^k\rangle\big). \qquad (15)$$

We recall that when $I$ and $t^k$ do not share any common support on the plane, or $I$ and $t^k$ are uncorrelated, then $\langle I, t^k\rangle = 0$. The following theorem, where $G_0$ corresponds to the pooling range, states a sufficient condition for partial invariance in the case of a locally compact group:

Theorem 6 (Localization and Invariance). Let $I, t \in \mathcal{H}$, a Hilbert space, $\eta_n : \mathbb{R} \to \mathbb{R}^{+}$ a set of bijective (positive) functions, and $G$ a locally compact group. Let $G_0 \subseteq G$ and suppose $\mathrm{supp}\big(\langle I, g t\rangle\big) \subseteq G_0$. Then, for any given $\bar g \in G$, the following condition holds:

$$\langle I, g t\rangle = 0 \quad \forall g \in G_0\,\Delta\,\bar g^{-1}G_0 \;\;\Longrightarrow\;\; \mu_n(I) = \mu_n(\bar g I). \qquad (16)$$

Proof:
To prove the implication, note that if $\langle I, g t\rangle = 0$ for all $g \in G_0\,\Delta\,\bar g^{-1}G_0$ ($\Delta$ is the symbol for symmetric difference, $A\,\Delta\,B = (A \cup B)\setminus(A \cap B)$), we have, using $\langle \bar g I, g t\rangle = \langle I, \bar g^{-1}g t\rangle$ and the invariance of the Haar measure,

$$\big|\mu_n(I) - \mu_n(\bar g I)\big| = \frac{1}{V_0}\left|\int_{G_0} dg\;\eta_n\big(\langle I, g t\rangle\big) - \int_{\bar g^{-1}G_0} dg\;\eta_n\big(\langle I, g t\rangle\big)\right| \leq \frac{1}{V_0}\int_{G_0\,\Delta\,\bar g^{-1}G_0} dg\;\eta_n\big(\langle I, g t\rangle\big) = 0. \qquad (17)$$

The last equality is true since, $\eta_n$ being positive with $\eta_n(0) = 0$ (remark 1), the vanishing of the dot product on $G_0\,\Delta\,\bar g^{-1}G_0$ implies that the integrand vanishes there, and therefore in particular the integral is zero. The r.h.s. of the inequality being zero and the l.h.s. positive, we have

$$\mu_n(I) = \mu_n(\bar g I), \qquad (18)$$

i.e. partial invariance (see also Fig. 5 for a visual explanation). Q.E.D.
Equation (16) describes a localization condition on the inner product of the transformed image and the template. The above result naturally raises the question of whether the localization condition is also necessary for invariance. Clearly, this would be the case if the inequality in eq. (17) could be turned into an equality, that is

$$\big|\mu_n(I) - \mu_n(\bar g I)\big| = \frac{1}{V_0}\int_{G_0\,\Delta\,\bar g^{-1}G_0} dg\;\eta_n\big(\langle I, g t\rangle\big). \qquad (19)$$

Indeed, in this case, if $\mu_n(I) = \mu_n(\bar g I)$, and we further assume the natural condition $\eta_n(\langle I, g t\rangle) = 0$ if and only if $\langle I, g t\rangle = 0$, then the localization condition (16) would be necessary, since $\eta_n$ is a positive bijective function.
The equality in eq. (19) in general is not true. However, this is clearly the case if we consider the group of transformations to be translations, as illustrated in Fig. 7 a). We discuss this latter case in some detail.

Assume that $G$ is the group of 1D translations, with $T_x$ a unitary representation of the translation operator. Let

(20)

denote the sets of simple responses $\langle I, T_x t\rangle$ to a given template seen through two receptive fields, i.e. through two pooling windows, one shifted by $\bar x$ with respect to the other. We assume the ranges of the dot products over each window to be closed intervals. Then, recall that a bijective function (in this case $\eta_n$) is strictly monotonic on any closed interval, so that the difference of integrals in eq. (19) is zero if and only if the two sets of responses coincide. Since we are interested in considering all the values of $\bar x$ up to some maximum translation, we can consider the condition

(21)

The above condition can be satisfied in two cases: 1) both dot products are zero, which is the localization condition, or 2) the responses repeat over the shift $\bar x$, i.e. the image or the template is periodic. A similar reasoning applies to the case of scale transformations.

Figure 5: A sufficient condition for invariance for locally compact groups: if the support of $\langle I, g t\rangle$ is sufficiently localized it will be completely contained in the pooling interval even if the image is group shifted, or, equivalently (as shown in the figure), if the pooling interval is group shifted by the same amount.

In the next paragraph we will see how localization conditions for scale and translation transformations imply a specific form of the templates.

Figure 6: An HW-module pooling the dot products of transformed templates with the image. The input image $I$ is shown centered on the template $t$; the same module is shown above for a group shift of the input image, which now localizes around the transformed template $gt$. Images and templates satisfy the localization condition. The interval indicates the pooling window. The shift shown in the figure is a special case: the reader should consider the case in which the transformation parameter, instead of translation, is for instance rotation in depth.

The Localization condition: Translation and Scale

In this section we identify $G_0$ with subsets of the affine group. In particular, we study separately the case of scale and translations (in 1D for simplicity).

In the following it is helpful to assume that all images and templates are strictly contained in the range of translation or scale pooling, since image components outside it are not measured. We will consider images restricted to this range: for translation this means that the support of $I$ is contained in the pooling interval; for scaling, since scaling an image corresponds to an inverse scaling of its Fourier transform, a given scale pooling range implies a range of spatial frequencies for the maximum support of $I$ and $t$. As we will see, because of Theorem 6, invariance to translation requires spatial localization of images and templates and, less obviously, invariance to scale requires bandpass properties of images and templates. Thus images and templates are assumed to be localized from the outset in either space or frequency. The corollaries below show that a stricter localization condition is needed for invariance and that this condition determines the form of the template. Notice that in our framework images and templates are bandpass because of being zero-mean. Notice that, in addition, neural “images” which are input to the hierarchical architecture are spatially bandpass because of retinal processing.

We now state the result of Theorem 6 for one-dimensional signals under the translation group and – separately – under the dilation group.
    Let $G$ be the one-dimensional locally compact group of translations ($T_x$ is a unitary representation of the translation operator as before), and suppose that the supports of $I$ and $t$ are contained in the pooling range. Then eq. (16) (and the discussion following it for the translation and scale transformations) leads to:
Corollary 1: Localization in the spatial domain is necessary and sufficient for translation invariance. For any fixed template $t$ and image $I$ we have:

$$\mu_n(I) = \mu_n(T_{\bar x} I) \quad \Longleftrightarrow \quad \langle I, T_x t\rangle = 0, \quad |x| > a, \qquad (22)$$

with $a$ determined by the pooling range.
Similarly, let $G$ be the one-dimensional locally compact group of dilations and denote with $D_s$ a unitary representation of the dilation operator, and suppose that the spectral supports of $I$ and $t$ are contained in the frequency range associated with the scale pooling. Then:
Corollary 2: Localization in the spatial frequency domain is necessary and sufficient for scale invariance. For any fixed template $t$ and image $I$ we have:

$$\mu_n(I) = \mu_n(D_{\bar s} I) \quad \Longleftrightarrow \quad \langle I, D_s t\rangle = 0 \ \text{ for } s \text{ outside a bounded interval of scales}, \qquad (23)$$

with the interval determined by the pooling range.
Localization conditions of the support of the dot product for translation and scale are depicted in Figure 7 a), b).
As shown by the following lemma, Eqs. (22) and (23) give interesting conditions on the supports of $t$ and its Fourier transform $\hat t$. For translation, the corollary is equivalent to zero overlap of the compact supports of $I$ and $t$. In particular, using Theorem 6, the maximal invariance in translation implies the following localization conditions on