Normative theory of visual receptive fields

Tony Lindeberg Computational Brain Science Lab, Department of Computational Science and Technology,
KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden. Email: tony@kth.se
Abstract

This article gives an overview of a normative computational theory of visual receptive fields. It is described how idealized functional models of early spatial, spatio-chromatic and spatio-temporal receptive fields can be derived in an axiomatic way based on structural properties of the environment in combination with assumptions about the internal structure of a vision system to guarantee consistent handling of image representations over multiple spatial and temporal scales. Interestingly, this theory leads to predictions about visual receptive field shapes with qualitatively very good similarity to biological receptive fields measured in the retina, the LGN and the primary visual cortex (V1) of mammals.

Keywords—Receptive field, Functional model, Gaussian derivative, Scale covariance, Affine covariance, Galilean covariance, Temporal causality, Illumination invariance, Retina, LGN, Primary visual cortex, Simple cell, Double-opponent cell, Vision.

I Introduction

When light reaches a visual sensor such as the retina, the information necessary to infer properties about the surrounding world is not contained in the measurement of image intensity at a single point, but in the relations between intensity values at different points. A main reason for this is that the incoming light constitutes an indirect source of information depending on the interaction between geometric and material properties of objects in the surrounding world and on external illumination sources. Another fundamental reason why cues to the surrounding world need to be collected over regions in the visual field, as opposed to at single image points, is that the measurement process by itself requires the accumulation of energy over non-infinitesimal support regions over space and time. Such a region in the visual field for which a neuron responds to visual stimuli is traditionally referred to as a receptive field (Hubel and Wiesel [1, 2, 3]) (see Figure 1). In this work, we focus on a functional description of receptive fields, regarding how a neuron with a purely spatial receptive field responds to visual stimuli over image space, and regarding how a neuron with a spatio-temporal receptive field responds to visual stimuli over space and time (DeAngelis et al. [4, 5]).

Fig. 1: A receptive field is traditionally defined as a region in the visual field for which a visual sensor/neuron/operator responds to visual stimuli. This figure shows a set of partially overlapping receptive fields over the spatial domain with all the receptive fields having the same spatial extent. More generally, one can conceive distributions of receptive fields over space or space-time with the receptive fields of different size, different shape and orientation in space as well as different directions in space-time, where adjacent receptive fields may also have significantly larger relative overlap than shown in this schematic illustration. In this work, we focus on a functional description of linear receptive fields, concerning how a neuron responds to visual stimuli over image space regarding spatial receptive fields or over joint space-time regarding spatio-temporal receptive fields.

If one considers the theoretical and computational problem of designing a vision system that is going to make use of incoming reflected light to infer properties of the surrounding world, one may ask what types of image operations should be performed on the image data. Would any type of image operation be reasonable? Specifically regarding the notion of receptive fields one may ask what types of receptive field profiles would be reasonable. Is it possible to derive a theoretical model of how receptive fields “ought to” respond to visual data?

Initially, such a problem might be regarded as intractable unless the question can be further specified. It is, however, possible to address this problem systematically using approaches that have been developed in the area of computer vision known as scale-space theory (Iijima [6]; Witkin [7]; Koenderink [8]; Koenderink and van Doorn [9, 10]; Lindeberg [11, 12, 13, 14]; Florack [15]; Sporring et al. [16]; Weickert et al. [17]; ter Haar Romeny [18]). A paradigm that has been developed in this field is to impose structural constraints on the first stages of visual processing that reflect symmetry properties of the environment. Interestingly, it turns out to be possible to substantially reduce the class of permissible image operations from such arguments.

The subject of this article is to describe how structural requirements on the first stages of visual processing as formulated in scale-space theory can be used for deriving idealized functional models of visual receptive fields and implications of how these theoretical results can be used when modelling biological vision. A main theoretical argument is that idealized functional models for linear receptive fields can be derived by necessity given a small set of symmetry requirements that reflect properties of the world that one may naturally require an idealized vision system to be adapted to. In this respect, the treatment bears similarities to approaches in theoretical physics, where symmetry properties are often used as main arguments in the formulation of physical theories of the world. The treatment that will follow will be general in the sense that spatial, spatio-chromatic and spatio-temporal receptive fields are encompassed by the same unified theory.

This paper gives a condensed summary of a more general theoretical framework for receptive fields derived and presented in [13, 19, 20, 21], which was in turn developed to enable a consistent handling of receptive field responses in terms of provable covariance or invariance properties under natural image transformations (see Figure 2). In relation to the early publications on this topic [13, 19, 20], this paper presents an improved version of that theory, leading to an improved model for the temporal smoothing operation for the specific case of a time-causal image domain [21], where the future cannot be accessed and the receptive fields have to be solely based on information from the present moment and a compact buffer of the past. Specifically, this paper presents the improved axiomatic structure in a compact form that is easier to access than the original publications and that also encompasses the better time-causal model.

Fig. 2: Basic factors that influence the formation of images for an eye with a two-dimensional retina that observes objects in the three-dimensional world. In addition to the position, the orientation and the motion of the object in 3-D, the perspective projection onto the retina is affected by the viewing distance, the viewing direction and the relative motion of the eye in relation to the object, the spatial and the temporal sampling characteristics of the neurons in the retina as well as the usually unknown external illumination field in relation to the geometry of the scene and the observer.

It will be shown that the presented framework leads to predictions of receptive field profiles in good agreement with receptive field measurements reported in the literature (Hubel and Wiesel [1, 2, 3]; DeAngelis et al. [4, 5]; Conway and Livingstone [22]; Johnson et al. [23]). Specifically, explicit phenomenological models will be given of LGN neurons and simple cells in V1 and will be compared to related models in terms of Gabor functions (Marčelja [24]; Jones and Palmer [25, 26]; Ringach [27, 28]), differences of Gaussians (Rodieck [29]) and Gaussian derivatives (Koenderink and van Doorn [9]; Young [30]; Young et al. [31, 32]). Notably, the evolution properties of the receptive field profiles in this model can be described by diffusion equations and are therefore suitable for implementation on a biological architecture, since the computations can be expressed in terms of communications between neighbouring computational units, where either a single computational unit or a group of computational units may be interpreted as corresponding to a neuron or a group of neurons. Specifically, computational models involving diffusion equations arise in mean field theory for approximating the computations that are performed by populations of neurons (Omurtag et al. [33]; Mattia and Guidic [34]; Faugeras et al. [35]).

I-A Structure of this article

This paper is organized as follows: Section II gives an overview of and motivation to the assumptions that the theory is based on. A set of structural requirements is formulated to capture the effect of natural image transformations onto the illumination field that reaches the retina and to guarantee internal consistency between image representations that are computed from receptive field responses over multiple spatial and temporal scales.

Section III describes linear receptive field families that arise as consequences of these assumptions for the cases of either a purely spatial domain or a joint spatio-temporal domain. The issue of how to perform relative normalization between receptive field responses over multiple spatial and temporal scales is treated, so as to enable comparisons between receptive field responses at different spatial and temporal scales. We also show how the influence of illumination transformations and exposure control mechanisms on the receptive field responses can be handled, by describing invariance properties obtained by applying the derived linear receptive fields over a logarithmically transformed intensity domain.

Section IV shows examples of how spatial, spatio-chromatic and spatio-temporal receptive fields in the retina, the LGN and the primary visual cortex can be well modelled by the derived receptive field families.

Section V gives relations to previous work, including conceptual and theoretical comparisons to previous use of Gabor models of receptive fields, approaches for learning receptive fields from image data and previous applications of a logarithmic transformation of the image intensities. Finally, Section VI summarizes some of the main results.

II Assumptions underlying the theory: Structural requirements

In the following, we shall describe a set of structural requirements that can be stated concerning: (i) spatial geometry, (ii) spatio-temporal geometry, (iii) the image measurement process with its close relationship to the notion of scale, (iv) internal representations of image data that are to be computed by a general purpose vision system and (v) the parameterization of image intensity with regard to the influence of illumination variations.

For modelling the image formation process, we will at any point on the retina approximate the spherical retina by a perspective projection onto the tangent plane of the retinal surface at that image point, below represented as the image plane. Additionally, we will approximate the possibly non-linear geometric transformations regarding spatial and spatio-temporal geometry by local linearizations at every image point, and corresponding to the derivative of the possibly non-linear transformation. In these ways, the theoretical analysis can be substantially simplified, while still enabling accurate modelling of essential functional properties of receptive fields in relation to the effects of natural image transformations as arising from interactions with the environment.

II-A Static image data over a spatial domain

In the following, we will describe a theoretical model for the computational function of applying visual receptive fields to local image patterns.

For time-independent data $f$ over a two-dimensional spatial image domain, we would like to define a family of image representations $L(\cdot;\, s)$ over a possibly multi-dimensional scale parameter $s$, where the internal image representations $L(\cdot;\, s)$ are computed by applying some parameterized family of image operators $\mathcal{T}_s$ to the image data $f$:

$$L(\cdot;\, s) = \mathcal{T}_s\, f(\cdot) \tag{1}$$

Specifically, we will assume that the family of image operators $\mathcal{T}_s$ should satisfy:

II-A0a Linearity

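For the earliest processing stages to make as few irreversible decisions as possible, we assume that the image operators $\mathcal{T}_s$ should be linear:

$$\mathcal{T}_s(a_1 f_1 + a_2 f_2) = a_1\, \mathcal{T}_s f_1 + a_2\, \mathcal{T}_s f_2 \tag{2}$$

Specifically, linearity implies that any particular scale-space properties (to be detailed below) that we derive for the zero-order image representation $L$ will transfer to any spatial derivative $\partial_{x_1^{m_1} x_2^{m_2}} L$ of $L$, so that

$$\partial_{x_1^{m_1} x_2^{m_2}} L(\cdot;\, s) = \partial_{x_1^{m_1} x_2^{m_2}} (\mathcal{T}_s f) = \mathcal{T}_s\, (\partial_{x_1^{m_1} x_2^{m_2}} f) \tag{3}$$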

In this sense, the assumption of linearity reflects the requirement of a lack of bias to particular types of image structures, with the underlying aim that the processing performed in the first processing stages should be generic, to be used as input for a large variety of visual tasks. By the assumption of linearity, local image structures that are captured by e.g. first- or second-order derivatives will be treated in a structurally similar manner, which would not necessarily be the case if the first local neighbourhood processing stage of the first layer of receptive fields were instead genuinely non-linear (see Footnote 1).

Footnote 1: Note that the assumption about linearity of some first layers of receptive fields does not exclude the possibility of defining later stage non-linear receptive fields that operate on the output from the linear receptive fields, such as the computations performed by complex cells in the primary visual cortex. Neither does this assumption of linearity exclude the possibility of transforming the raw image intensities by a pointwise non-linear mapping function prior to the application of linear receptive fields based on processing over local neighbourhoods. In Section III-D it will be specifically shown that a pointwise logarithmic transformation of the image intensities prior to the application of linear receptive fields has theoretical advantages in terms of invariance properties of derivative-based receptive field responses under local multiplicative illumination transformations.
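The invariance under multiplicative illumination transformations mentioned in the footnote can be illustrated with a minimal sketch. It assumes a central difference as a crude stand-in for a derivative-based receptive field; the intensity profile `f` and the illumination factor `c` are hypothetical values chosen only for illustration.

```python
import math

def central_diff(values):
    """Central-difference approximation of the first derivative (unit grid spacing)."""
    return [(values[i + 1] - values[i - 1]) / 2.0 for i in range(1, len(values) - 1)]

# Hypothetical strictly positive 1-D intensity profile and illumination factor.
f = [1.0 + 0.5 * math.sin(0.3 * n) for n in range(40)]
c = 7.5

# Derivative responses over the logarithmic intensity domain.
d_log_f = central_diff([math.log(v) for v in f])
d_log_cf = central_diff([math.log(c * v) for v in f])

# Derivative responses over the raw (linear) intensity domain.
d_lin_f = central_diff(f)
d_lin_cf = central_diff([c * v for v in f])

# Largest change in response caused by the illumination factor.
max_log_gap = max(abs(a - b) for a, b in zip(d_log_f, d_log_cf))
max_lin_gap = max(abs(a - b) for a, b in zip(d_lin_f, d_lin_cf))
```

Since log(c f) = log c + log f, the constant log c is removed by any derivative operator, so `max_log_gap` vanishes up to rounding, whereas the responses over the raw intensities are scaled by the factor c.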

This genericity property is closely related to the basic property of the mammalian vision system, that the computations performed in the retina, the LGN and the primary visual cortex provide general purpose output that is used as input to higher-level visual areas.

II-A0b Shift invariance

To ensure that the visual interpretation of an object should be the same irrespective of its position in the image plane, we assume that the first processing stages should be shift invariant, so that if an object is moved a distance $\Delta x = (\Delta x_1, \Delta x_2)^T$ in the image plane, the receptive field response should retain the same form while being shifted by the same distance. Formally, this requirement can be stated as the condition that the family of image operators $\mathcal{T}_s$ should commute with the shift operator $\mathcal{S}_{\Delta x}$ defined by $(\mathcal{S}_{\Delta x} f)(x) = f(x - \Delta x)$:

$$\mathcal{T}_s\, (\mathcal{S}_{\Delta x} f) = \mathcal{S}_{\Delta x}\, (\mathcal{T}_s f) \tag{4}$$

In other words, if we shift the input $f$ by a translation $\Delta x$ and then apply the receptive field operator $\mathcal{T}_s$, the result should be the same as applying the receptive field operator to the original input and then shifting the result.

II-A0c Convolution structure

Together, the assumptions about linearity and shift-invariance imply that $\mathcal{T}_s$ will correspond to a convolution operator [36]. This implies that the representation $L(\cdot;\, s)$ can be computed from the image data $f$ by convolution with some parameterized family of convolution kernels $T(\cdot;\, s)$:

$$L(\cdot;\, s) = T(\cdot;\, s) * f(\cdot) \tag{5}$$
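As a minimal sketch of these two properties, the following code checks numerically that a convolution operator is both linear and shift invariant. It assumes a periodic 1-D signal, so that shifts are exact; the kernel, the signals and the coefficients are hypothetical values chosen only for illustration.

```python
def circular_convolve(kernel, signal):
    """Circular convolution: out[n] = sum_k kernel[k] * signal[(n - k) mod N]."""
    n_samples = len(signal)
    return [
        sum(kernel[k] * signal[(n - k) % n_samples] for k in range(len(kernel)))
        for n in range(n_samples)
    ]

def shift(signal, delta):
    """Shift operator on a periodic signal: (S f)[n] = f[n - delta]."""
    n_samples = len(signal)
    return [signal[(n - delta) % n_samples] for n in range(n_samples)]

# Hypothetical smoothing kernel and input signals.
kernel = [0.25, 0.5, 0.25]
f1 = [0.0, 1.0, 4.0, 2.0, 0.0, 3.0, 1.0, 0.0]
f2 = [1.0, 0.0, 2.0, 5.0, 1.0, 0.0, 0.0, 2.0]

# Shift invariance: filtering a shifted signal equals shifting the filtered signal.
shift_then_filter = circular_convolve(kernel, shift(f1, 3))
filter_then_shift = shift(circular_convolve(kernel, f1), 3)

# Linearity: filtering a linear combination equals combining the filtered signals.
a1, a2 = 2.0, -0.5
lhs = circular_convolve(kernel, [a1 * u + a2 * v for u, v in zip(f1, f2)])
rhs = [a1 * u + a2 * v
       for u, v in zip(circular_convolve(kernel, f1), circular_convolve(kernel, f2))]
```

The converse direction, that every linear and shift-invariant operator can be written as a convolution, is the statement used in the text and is not something this sketch demonstrates.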

II-A0d Semi-group structure over spatial scales

To ensure that the transformation from any finer scale $s_1$ to any coarser scale $s_2 \geq s_1$ should be of the same form for any $s_2 \geq s_1$ (a requirement of algebraic closedness), we assume that the result of convolving two kernels $T(\cdot;\, s_1)$ and $T(\cdot;\, s_2)$ from the family with each other should be a kernel within the same family of kernels and with added parameter values $s_1 + s_2$:

$$T(\cdot;\, s_1) * T(\cdot;\, s_2) = T(\cdot;\, s_1 + s_2) \tag{6}$$

This assumption specifically implies that the representation $L(\cdot;\, s_2)$ at a coarse scale $s_2$ can be computed from the representation $L(\cdot;\, s_1)$ at a finer scale $s_1$ by a convolution operation of the same form (5) as the transformation from the original image data $f$, while using the difference in scale levels $s_2 - s_1$ as the parameter:

$$L(\cdot;\, s_2) = T(\cdot;\, s_2 - s_1) * L(\cdot;\, s_1) \tag{7}$$

This property does in turn imply that if we are able to derive specific properties of the family of transformations $\mathcal{T}_s$ (to be detailed below), then these properties will not only hold for the transformation from the original image data $f$ to the representations $L(\cdot;\, s)$ at coarser scales, but also between any pair of scale levels $s_2 \geq s_1$, with the aim that image representations at coarser scales should be possible to regard as simplifications of corresponding image representations at finer scales.

In terms of mathematical concepts, this form of algebraic structure is referred to as a semi-group structure over spatial scales:

$$\mathcal{T}_{s_1}\, \mathcal{T}_{s_2} = \mathcal{T}_{s_1 + s_2} \tag{8}$$
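The semi-group property can be checked numerically with a minimal sketch. It assumes 1-D Gaussian kernels, parameterized by their variance and sampled at the integers, as the kernel family; the scale values and the truncation radius are hypothetical choices. For sampled kernels the property holds only approximately, although the discrepancy is very small at these scales.

```python
import math

def sampled_gaussian(s, radius):
    """Gaussian with variance s sampled at the integers -radius..radius."""
    return [math.exp(-n * n / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)
            for n in range(-radius, radius + 1)]

def convolve(a, b):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

s1, s2 = 2.0, 3.0   # hypothetical scale values (variances)
radius = 30         # truncation radius, far out in the tails of both kernels

# Left-hand side: kernel at scale s1 convolved with kernel at scale s2.
composed = convolve(sampled_gaussian(s1, radius), sampled_gaussian(s2, radius))

# Right-hand side: kernel at scale s1 + s2 on the same support (-60..60).
direct = sampled_gaussian(s1 + s2, 2 * radius)

semigroup_error = max(abs(c - d) for c, d in zip(composed, direct))
```

Under the same assumption, the cascade relation between scale levels follows directly: smoothing a representation at scale `s1` with the kernel at scale `s2 - s1` gives the representation at scale `s2`.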

II-A0e Scale covariance under spatial scaling transformations

If a visual observer looks at the same object from different distances, we would like the internal representations derived from the receptive field responses to be sufficiently similar, so that the object can be recognized as the same object while appearing with a different size on the retina. Specifically, it is thereby natural to require that the receptive field responses should be of a similar form while resized in the image plane.

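This corresponds to a requirement of spatial scale covariance under uniform scaling transformations of the spatial domain $x' = S\, x$:

$$L'(x';\, s') = L(x;\, s) \quad \Longleftrightarrow \quad \mathcal{T}_{S(s)}\, \mathcal{S}\, f = \mathcal{S}\, \mathcal{T}_s\, f \tag{9}$$

to hold for some transformation $s' = S(s)$ of the scale parameter $s$.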

II-A0f Affine covariance under affine transformations

If a visual observer looks at the same local surface patch from two different viewing directions, then the local surface patch may be deformed in different ways onto the different views and with different amounts of perspective foreshortening from the different viewing directions. If we approximate the local deformations caused by the perspective mapping by local affine transformations, then the transformation between the two differently deformed views of the local surface patch can in turn be described by a composed local affine transformation $x' = A\, x$. If we are to use receptive field responses as a basis for higher level visual operations, it is natural to require that the receptive field response of an affine deformed image patch should retain a similar form while being reshaped by a corresponding affine transformation.

This corresponds to a requirement of affine covariance under general affine transformations $x' = A\, x$:

$$L'(x';\, s') = L(x;\, s) \quad \Longleftrightarrow \quad \mathcal{T}_{A(s)}\, \mathcal{A}\, f = \mathcal{A}\, \mathcal{T}_s\, f \tag{10}$$

to hold for some transformation $s' = A(s)$ of the scale parameter.

II-A0g Non-creation of new structure with increasing scale

If we apply the family of transformations $\mathcal{T}_s$ for computing representations at coarser scales from representations at finer scales according to (1) and (7), there could be a potential risk that the family of transformations could amplify spurious structures in the input to produce macroscopic amplifications in the representations at coarser scales that do not directly correspond to simplifications of corresponding structures in the original image data. To prevent such undesirable phenomena from occurring, we require that local spurious structures must not be amplified and express this condition in terms of the evolution properties over scales at local maxima and minima in the image intensities as smoothed by the family of convolution kernels $T(\cdot;\, s)$: If a point $x_0$ for some scale $s_0$ is a local maximum point in the image plane, then the value at this maximum point must not increase to coarser scales $s \geq s_0$. Similarly, if a point is a local minimum point in the image plane, then the value at this minimum point must not decrease to coarser scales $s \geq s_0$.

Formally, this requirement that new structures should not be created from finer to coarser scales can be formalized into the requirement of non-enhancement of local extrema, which implies that if at some scale $s_0$ a point $x_0$ is a local maximum (minimum) for the mapping from $x$ to $L(x;\, s_0)$, then (see Figure 3):

  • $(\partial_s L)(x_0;\, s_0) \leq 0$ at any spatial maximum,

  • $(\partial_s L)(x_0;\, s_0) \geq 0$ at any spatial minimum.

This requirement imposes a strong restriction on the class of possible smoothing kernels $T(\cdot;\, s)$.
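A weaker, global consequence of this requirement can be illustrated with a minimal sketch: under smoothing with non-negative, normalized Gaussian kernels, the largest value of a signal should not increase and the smallest value should not decrease towards coarser scales. The sketch assumes a periodic 1-D signal and truncated sampled Gaussian kernels; the test signal and the scale values are hypothetical. It does not verify the differential condition at every local extremum.

```python
import math

def gaussian_kernel(s):
    """Normalized sampled Gaussian with variance s, truncated at about 4 standard deviations."""
    radius = int(4.0 * math.sqrt(s)) + 1
    weights = [math.exp(-n * n / (2.0 * s)) for n in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights], radius

def smooth_periodic(signal, s):
    """Circular convolution of a periodic signal with the Gaussian kernel at scale s."""
    kernel, radius = gaussian_kernel(s)
    n_samples = len(signal)
    return [
        sum(kernel[k + radius] * signal[(n - k) % n_samples]
            for k in range(-radius, radius + 1))
        for n in range(n_samples)
    ]

# Hypothetical test signal: a slow oscillation plus a faster, weaker one (period 80).
signal = [math.sin(2 * math.pi * n / 16) + 0.5 * math.sin(2 * math.pi * n / 5)
          for n in range(80)]

scales = [2.0, 4.0, 8.0, 16.0]
# Extreme values of the original signal followed by those at increasingly coarse scales.
maxima = [max(signal)] + [max(smooth_periodic(signal, s)) for s in scales]
minima = [min(signal)] + [min(smooth_periodic(signal, s)) for s in scales]
```

For this signal, `maxima` decreases and `minima` increases from each scale level to the next, in line with the interpretation of coarser-scale representations as simplifications of finer-scale ones.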

Fig. 3: The requirement of non-enhancement of local extrema is a way of restricting the class of possible image operations by formalizing the notion that new image structures must not be created with increasing scale, by requiring that the value at a local maximum must not increase and that the value at a local minimum must not decrease from finer to coarser scales $s$.

II-B Time-dependent image data over space-time

To model the computational function of spatio-temporal receptive fields in time-dependent image patterns, we do for a time-dependent spatio-temporal domain first inherit the structural requirements regarding a spatial domain and complement the spatial scale parameter $s$ by a temporal scale parameter $\tau$. In addition, we assume:

II-B0a Scale covariance under temporal scaling transformations

If a similar type of spatio-temporal event occurs at different speeds, faster or slower, it is natural to require that the receptive field responses should be of a similar form, while occurring correspondingly faster or slower.

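This corresponds to a requirement of temporal scale covariance under a temporal scaling transformation of the temporal domain $t' = S_t\, t$:

$$L'(x', t';\, s', \tau') = L(x, t;\, s, \tau) \quad \Longleftrightarrow \quad \mathcal{T}_{S_t(s, \tau)}\, \mathcal{S}_t\, f = \mathcal{S}_t\, \mathcal{T}_{s, \tau}\, f \tag{11}$$

to hold for some transformation $(s', \tau') = S_t(s, \tau)$ of the spatio-temporal scale parameters $(s, \tau)$.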

II-B0b Galilean covariance under Galilean transformations

If an observer looks at the same object in the world for different relative motions between the object and the observer, it is natural to require that the internal representations of the object should be sufficiently similar so as to enable a coherent perception of the object under different relative motions relative to the observer. Specifically, we may require that the receptive field responses under relative motions should remain on the same form while being transformed in a corresponding way as the relative motion pattern.

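If we at any point in space-time locally linearize the possibly non-linear motion pattern by a local Galilean transformation over space-time

$$x' = x + v\, t, \qquad t' = t \tag{12}$$

then the requirement of guaranteeing a consistent visual interpretation under different relative motions between the object and the observer can be stated as a requirement of Galilean covariance:

$$L'(x', t';\, s', \tau') = L(x, t;\, s, \tau) \quad \Longleftrightarrow \quad \mathcal{T}_{G_v(s, \tau)}\, \mathcal{G}_v\, f = \mathcal{G}_v\, \mathcal{T}_{s, \tau}\, f \tag{13}$$

to hold for some transformation $(s', \tau') = G_v(s, \tau)$ of the spatio-temporal scale parameters $(s, \tau)$.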

II-B0c Semi-group structure over temporal scales in the case of a non-causal temporal domain

To ensure that the representations between different spatio-temporal scale levels $(s_1, \tau_1)$ and $(s_2, \tau_2)$ should be sufficiently well-behaved internally, we will make use of different types of assumptions depending on whether the temporal domain is regarded as time-causal or non-causal. Over a time-causal temporal domain, the future cannot be accessed, which is the basic condition for real-time visual perception by a biological organism. Over a non-causal temporal domain, the temporal kernels may extend to the relative future in relation to any pre-recorded time moment, which is sometimes used as a conceptual simplification when analysing pre-recorded time-dependent data, although not at all realistic in a real-world setting.

For the case of a non-causal temporal domain, we make use of a similar type of semi-group property (6) as formulated over a purely spatial domain, while extending the semi-group property over both the spatial scale parameter $s$ and the temporal scale parameter $\tau$:

$$T(\cdot, \cdot;\, s_1, \tau_1) * T(\cdot, \cdot;\, s_2, \tau_2) = T(\cdot, \cdot;\, s_1 + s_2, \tau_1 + \tau_2) \tag{14}$$

In analogy with the case of a purely spatial domain, this requirement guarantees that the transformation from any finer spatio-temporal scale level $(s_1, \tau_1)$ to any coarser spatio-temporal scale level $(s_2, \tau_2)$ will always be of the same form (algebraic closedness):

$$L(\cdot, \cdot;\, s_2, \tau_2) = T(\cdot, \cdot;\, s_2 - s_1, \tau_2 - \tau_1) * L(\cdot, \cdot;\, s_1, \tau_1) \tag{15}$$

Specifically, this assumption implies that if we are able to establish desirable properties of the family of transformations $\mathcal{T}_{s, \tau}$ (to be detailed below), then these relations hold between any pair of spatio-temporal scale levels $(s_1, \tau_1)$ and $(s_2, \tau_2)$ with $s_2 \geq s_1$ and $\tau_2 \geq \tau_1$.

II-B0d Cascade structure over temporal scales in the case of a time-causal temporal domain

Since it can be shown that the assumption of a semi-group structure over temporal scales leads to undesirable temporal dynamics in terms of e.g. longer temporal delays for a time-causal temporal domain [37, Appendix A], we do for a time-causal temporal domain instead assume a weaker cascade smoothing property over temporal scales for the temporal smoothing kernel:

$$L(\cdot;\, \tau_2) = h(\cdot;\, \tau_1 \mapsto \tau_2) * L(\cdot;\, \tau_1) \tag{16}$$

where the temporal kernels $h(t;\, \tau)$ should for any triplets of temporal scale values and temporal delays $\tau_1$, $\tau_2$ and $\tau_3$ obey the transitive property

$$h(\cdot;\, \tau_1 \mapsto \tau_2) * h(\cdot;\, \tau_2 \mapsto \tau_3) = h(\cdot;\, \tau_1 \mapsto \tau_3) \tag{17}$$

This weaker assumption of a cascade smoothing property (16) still ensures that a representation at a coarser temporal scale $\tau_2$ should, with a corresponding requirement of an accompanying simplifying condition on the family of kernels $h$ (to be detailed below), constitute a simplification of the representation at a finer temporal scale $\tau_1$, while not implying as hard constraints as a semi-group structure.

II-B0e Non-enhancement of local space-time extrema in the case of a non-causal temporal domain

In the case of a non-causal temporal domain, we again build on the notion of non-enhancement of local extrema to guarantee that the representations at coarser spatio-temporal scales should constitute true simplifications of corresponding representations at finer scales. Over a spatio-temporal domain, we do, however, state the requirement in terms of local extrema over joint space-time instead of local extrema over image space. If a point $(x_0, t_0)$ for some scale $(s_0, \tau_0)$ is a local maximum point over space-time, then the value at this maximum point must not increase to coarser scales $(s, \tau) \geq (s_0, \tau_0)$. Similarly, if a point is a local minimum point over space-time, then the value at this minimum point must not decrease to coarser scales $(s, \tau) \geq (s_0, \tau_0)$.

Formally, this requirement of non-creation of new structure from finer to coarser spatio-temporal scales can be stated as follows: If at some scale $(s_0, \tau_0)$ a point $(x_0, t_0)$ is a local maximum (minimum) for the mapping from $(x, t)$ to $L(x, t;\, s_0, \tau_0)$, then

  • $\alpha\, (\partial_s L)(x_0, t_0;\, s_0, \tau_0) + \beta\, (\partial_\tau L)(x_0, t_0;\, s_0, \tau_0) \leq 0$ at any spatio-temporal maximum

  • $\alpha\, (\partial_s L)(x_0, t_0;\, s_0, \tau_0) + \beta\, (\partial_\tau L)(x_0, t_0;\, s_0, \tau_0) \geq 0$ at any spatio-temporal minimum

should hold in any positive spatio-temporal scale direction, that is, for any non-negative coefficients $\alpha$ and $\beta$. This requirement imposes a strong restriction on the class of possible smoothing kernels $T(\cdot, \cdot;\, s, \tau)$.

II-B0f Non-creation of new local extrema or zero-crossings for a purely temporal signal in the case of a time-causal temporal domain

In the case of a time-causal temporal domain, we do instead state a requirement for purely temporal signals, based on the cascade smoothing property (16). We require that for a purely temporal signal $f(t)$, the transformation from a finer temporal scale $\tau_1$ to a coarser temporal scale $\tau_2$ must not increase the number of local extrema or the number of zero-crossings in the signal.
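As a minimal sketch of this requirement, the following code counts the local extrema of a temporal signal before and after time-causal smoothing. The choice of a first-order recursive filter as the smoothing primitive is an assumption made here for illustration and is not specified in this section; the time constant `mu` and the test signal are likewise hypothetical.

```python
import math

def recursive_smooth(signal, mu):
    """Time-causal first-order recursive filter: each output uses only the current
    input and the previous output, so no future samples are accessed."""
    out = []
    state = signal[0]  # assume the signal was constant before the first sample
    for value in signal:
        state += (value - state) / (1.0 + mu)
        out.append(state)
    return out

def count_local_extrema(signal):
    """Number of sign changes in the first differences (zero differences ignored)."""
    signs = [1 if b > a else -1 for a, b in zip(signal, signal[1:]) if b != a]
    return sum(1 for p, q in zip(signs, signs[1:]) if p != q)

# Hypothetical temporal signal: a slow oscillation with a faster one superimposed.
signal = [math.sin(0.2 * n) + 0.3 * math.sin(1.7 * n) for n in range(200)]

finer = recursive_smooth(signal, mu=2.0)    # representation at a finer temporal scale
coarser = recursive_smooth(finer, mu=2.0)   # cascade step to a coarser temporal scale

counts = [count_local_extrema(x) for x in (signal, finer, coarser)]
```

Applying the filter to its own output corresponds to one step of a cascade over temporal scales, and the number of local extrema in `counts` does not increase from one level to the next.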

Fig. 4: Spatial receptive fields formed by the 2-D Gaussian kernel with its partial derivatives up to order two. The corresponding family of receptive fields is closed under translations, rotations and scaling transformations, meaning that if the underlying image is subject to a set of such image transformations then it will always be possible to find some possibly other receptive field such that the receptive field responses of the original image and the transformed image can be matched.
Fig. 5: Spatial receptive fields formed by affine Gaussian kernels and directional derivatives of these, here using three different covariance matrices $\Sigma_1$, $\Sigma_2$ and $\Sigma_3$ corresponding to the directions $\theta_1$, $\theta_2$ and $\theta_3$ of the major eigendirection of the covariance matrix and with first- and second-order directional derivatives computed in the corresponding orthogonal directions $\varphi_1$, $\varphi_2$ and $\varphi_3$. The corresponding family of receptive fields is closed under general affine transformations of the spatial domain, including translations, rotations, scaling transformations and perspective foreshortening (although this figure only illustrates variabilities in the orientation of the filter, thereby disregarding variations in both the size and the degree of elongation). This closedness property implies that receptive field responses computed from different views of a smooth local surface patch can be perfectly matched, if the transformation between the two views can be modelled as a local affine transformation.

III Idealized receptive field families

III-A Spatial image domain

Based on the above assumptions in Section II-A, it can be shown [13] that when complemented with certain regularity assumptions in terms of Sobolev norms, they imply that a spatial scale-space representation $L$ as determined by these must satisfy a diffusion equation of the form

$$\partial_s L = \frac{1}{2} \nabla^T (\Sigma_0 \nabla L) - \delta_0^T \nabla L \tag{18}$$

for some positive semi-definite covariance matrix $\Sigma_0$ and some translation vector $\delta_0$. In terms of convolution kernels, this corresponds to Gaussian kernels of the form

$$g(x;\, \Sigma_s, \delta_s) = \frac{1}{2 \pi \sqrt{\det \Sigma_s}}\, e^{-(x - \delta_s)^T \Sigma_s^{-1} (x - \delta_s)/2} \tag{19}$$

which for a given $\Sigma_s = s\, \Sigma_0$ and a given $\delta_s = s\, \delta_0$ satisfy (18). If we additionally require these kernels to be mirror symmetric through the origin, then we obtain affine Gaussian kernels

$$g(x;\, \Sigma) = \frac{1}{2 \pi \sqrt{\det \Sigma}}\, e^{-x^T \Sigma^{-1} x/2} \tag{20}$$

Their spatial derivatives constitute a canonical family for expressing receptive fields over a spatial domain that can be summarized on the form

(21)
Fig. 6: Distribution of affine Gaussian receptive fields corresponding to a uniform distribution on a hemisphere regarding zero-order smoothing kernels. In the most idealized version of the theory, one can think of all affine receptive fields with their directional derivatives in preferred directions aligned to the eigendirections of the covariance matrix as being present at any position in the image domain. When restricted to a limited number of receptive fields in an actual implementation, there is also an issue of distributing a fixed number of receptive fields over the spatial coordinates $x$ and the filter parameters $s$ and $\Sigma$.

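Incorporating the fact that spatial derivatives of these kernels are also compatible with the assumptions underlying this theory, this does specifically for the case of a two-dimensional spatial image domain lead to spatial receptive fields that can be compactly summarized on the form

$$T_{\varphi^{m_1} \bot\varphi^{m_2}}(x_1, x_2;\, s, \Sigma) = \partial_{\varphi}^{m_1}\, \partial_{\bot\varphi}^{m_2} \left( g(x_1, x_2;\, s, \Sigma) \right) \tag{22}$$

where

  • $x = (x_1, x_2)^T$ denote the spatial coordinates,

  • $s$ denotes the spatial scale,

  • $\Sigma$ denotes a spatial covariance matrix determining the shape of a spatial affine Gaussian kernel,

  • $m_1$ and $m_2$ denote orders of spatial differentiation,

  • $\partial_{\varphi} = \cos\varphi\, \partial_{x_1} + \sin\varphi\, \partial_{x_2}$, $\partial_{\bot\varphi} = \sin\varphi\, \partial_{x_1} - \cos\varphi\, \partial_{x_2}$ denote spatial directional derivative operators in two orthogonal directions $\varphi$ and $\bot\varphi$ aligned with the eigenvectors of the covariance matrix $\Sigma$,

  • $g(x;\, s, \Sigma) = \frac{1}{2 \pi s \sqrt{\det \Sigma}}\, e^{-x^T \Sigma^{-1} x/2s}$ is an affine Gaussian kernel with its size determined by the spatial scale parameter $s$ and its shape by the spatial covariance matrix $\Sigma$.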
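The receptive field model above can be sketched numerically as follows. The sketch assumes that the affine Gaussian kernel is the zero-mean Gaussian with covariance matrix s Σ, that is g(x; s, Σ) = exp(-xᵀ Σ⁻¹ x / (2s)) / (2π s √det Σ), and evaluates a first-order directional derivative in closed form from the gradient ∇g = -(Σ⁻¹ x / s) g. The function names and all parameter values are hypothetical.

```python
import math

def affine_gaussian(x1, x2, s, cov):
    """g(x; s, Sigma) = exp(-x^T Sigma^{-1} x / (2 s)) / (2 pi s sqrt(det Sigma))."""
    (a, b), (_, d) = cov              # symmetric 2x2 matrix [[a, b], [b, d]]
    det = a * d - b * b
    quad = (d * x1 * x1 - 2.0 * b * x1 * x2 + a * x2 * x2) / det  # x^T Sigma^{-1} x
    return math.exp(-quad / (2.0 * s)) / (2.0 * math.pi * s * math.sqrt(det))

def directional_derivative(x1, x2, s, cov, phi):
    """First-order derivative of g in the direction (cos phi, sin phi)."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    # Components of the gradient factor -(Sigma^{-1} x) / s.
    grad1 = -(d * x1 - b * x2) / (det * s)
    grad2 = -(-b * x1 + a * x2) / (det * s)
    g = affine_gaussian(x1, x2, s, cov)
    return (math.cos(phi) * grad1 + math.sin(phi) * grad2) * g

# Hypothetical parameters: elongated kernel with eigendirections along the axes.
cov = ((4.0, 0.0), (0.0, 1.0))
s = 2.0
phi = 0.0   # differentiate along the first eigendirection

grid = range(-30, 31)
mass = sum(affine_gaussian(x1, x2, s, cov) for x1 in grid for x2 in grid)
peak = affine_gaussian(0.0, 0.0, s, cov)
left = directional_derivative(-1.0, 0.5, s, cov, phi)
right = directional_derivative(1.0, -0.5, s, cov, phi)
```

The zero-order kernel integrates to approximately one over the grid, while the first-order derivative kernel is odd under reflection through the origin, with a positive lobe on one side and a negative lobe on the other.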

Figure 4 and Figure 5 show examples of spatial receptive fields from this family up to second order of spatial differentiation. Figure 4 shows partial derivatives of the Gaussian kernel for the specific case when the covariance matrix is restricted to a unit matrix and the Gaussian kernel thereby becomes rotationally symmetric. The resulting family of receptive fields is closed under scaling transformations over the spatial domain, implying that if an object is seen from different distances to the observer, then it will always be possible to find a transformation of the scale parameter between the two image domains so that the receptive field responses computed from the two image domains can be matched. Figure 5 shows examples of affine Gaussian receptive fields for covariance matrices that do not correspond to rescaled copies of the unit matrix. The resulting full family of affine Gaussian derivative kernels is closed under general affine transformations, implying that for two different perspective views of a local smooth surface patch, it will always be possible to find a transformation of the covariance matrices between the two domains so that the receptive field responses can be matched, if the transformation between the two image domains is approximated by a local affine transformation.
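The closedness under affine transformations can be made concrete with a minimal sketch. For zero-mean Gaussian kernels, the affine map x' = A x corresponds to transforming the covariance matrix as Σ' = A Σ Aᵀ, and the kernel with covariance Σ' evaluated at A x equals the original kernel up to the Jacobian factor 1/|det A|, which is absorbed by the change of variables in the convolution integral. The matrix `A`, the covariance matrix and the sample points are hypothetical values.

```python
import math

def gaussian(x, cov):
    """Zero-mean 2-D Gaussian with symmetric 2x2 covariance matrix cov."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    quad = (d * x[0] ** 2 - 2.0 * b * x[0] * x[1] + a * x[1] ** 2) / det
    return math.exp(-quad / 2.0) / (2.0 * math.pi * math.sqrt(det))

def mat_vec(A, x):
    """Matrix-vector product A x for a 2x2 matrix."""
    return (A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1])

def transform_covariance(A, cov):
    """Covariance matrix after the affine map x' = A x: Sigma' = A Sigma A^T."""
    M = [[sum(A[i][k] * cov[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    return tuple(tuple(sum(M[i][k] * A[j][k] for k in range(2)) for j in range(2))
                 for i in range(2))

# Hypothetical affine transformation and covariance matrix.
A = ((1.5, 0.4), (-0.2, 0.8))
cov = ((2.0, 0.5), (0.5, 1.0))
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
cov_prime = transform_covariance(A, cov)

# Compare the transformed kernel at A x with the original kernel at x.
points = [(0.0, 0.0), (1.0, -0.5), (-2.0, 1.5), (0.3, 2.0)]
mismatch = max(
    abs(gaussian(mat_vec(A, x), cov_prime) * abs(det_A) - gaussian(x, cov))
    for x in points
)
```

A uniform scaling is the special case A = S I, for which the covariance matrix, and thereby the scale parameter, is multiplied by S².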

In the most idealized version of the theory, one should think of receptive fields for all combinations of filter parameters as being present at every image point, as illustrated in Figure 6 concerning affine Gaussian receptive fields over different orientations in image space and different eccentricities.

III-B Spatio-temporal image domain

Over a non-causal spatio-temporal domain, corresponding arguments as in Section III-A lead to a similar form of diffusion equation as in Equation (III-A), while expressed over the joint space-time domain and with $v$ interpreted as a local drift velocity. After splitting the corresponding composed affine Gaussian spatio-temporal smoothing kernel into separate smoothing operations over space and time, this leads to zero-order spatio-temporal receptive fields of the form [13, 19]:

$$h(x_1, x_2, t;\; s, \tau;\; v, \Sigma) = g(x_1 - v_1 t, x_2 - v_2 t;\; s, \Sigma) \, h(t;\; \tau) \tag{23}$$

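After combining that result with the results from the corresponding theoretical analysis for a time-causal spatio-temporal domain in [13, 21], the resulting spatio-temporal derivative kernels, which constitute the spatio-temporal extension of the spatial receptive field model (22), can be reparametrised and summarized in the following form (see [13, 19, 20, 21]):

$$T(x_1, x_2, t;\; s, \tau;\; v, \Sigma) = \partial_{\varphi}^{m_1} \, \partial_{\bot\varphi}^{m_2} \, \partial_{\bar{t}}^{n} \left( g(x_1 - v_1 t, x_2 - v_2 t;\; s, \Sigma) \, h(t;\; \tau) \right) \tag{24}$$

where

  • $x = (x_1, x_2)^T$ denote the spatial coordinates,

  • $t$ denotes time,

  • $s$ denotes the spatial scale,

  • $\tau$ denotes the temporal scale,

  • $v = (v_1, v_2)^T$ denotes a local image velocity,

  • $\Sigma$ denotes a spatial covariance matrix determining the shape of a spatial affine Gaussian kernel,

  • $m_1$ and $m_2$ denote orders of spatial differentiation,

  • $n$ denotes the order of temporal differentiation,

  • $\partial_{\varphi} = \cos\varphi \, \partial_{x_1} + \sin\varphi \, \partial_{x_2}$ and $\partial_{\bot\varphi} = \sin\varphi \, \partial_{x_1} - \cos\varphi \, \partial_{x_2}$ denote spatial directional derivative operators in two orthogonal directions $\varphi$ and $\bot\varphi$ aligned with the eigenvectors of the covariance matrix $\Sigma$,

  • $\partial_{\bar{t}} = v_1 \, \partial_{x_1} + v_2 \, \partial_{x_2} + \partial_t$ is a velocity-adapted temporal derivative operator aligned to the direction of the local image velocity $v$,

  • $g(x;\; s, \Sigma) = \frac{1}{2 \pi s \sqrt{\det \Sigma}} \, e^{-x^T \Sigma^{-1} x / 2s}$ is an affine Gaussian kernel with its size determined by the spatial scale parameter $s$ and its shape determined by the spatial covariance matrix $\Sigma$,

  • $g(x_1 - v_1 t, x_2 - v_2 t;\; s, \Sigma)$ denotes a spatial affine Gaussian kernel that moves with image velocity $v$ in space-time and

  • $h(t;\; \tau)$ is a temporal smoothing kernel over time, corresponding to a Gaussian kernel in the case of non-causal time, or to a cascade of first-order integrators (equivalently, truncated exponential kernels coupled in cascade) according to (26) over a time-causal temporal domain.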

This family of spatio-temporal scale-space kernels can be seen as a canonical family of linear receptive fields over a spatio-temporal domain.

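For the case of a time-causal temporal domain, the result states that truncated exponential kernels of the form

$$h_{exp}(t;\; \mu_k) = \begin{cases} \frac{1}{\mu_k} \, e^{-t/\mu_k} & t \geq 0 \\ 0 & t < 0 \end{cases} \tag{25}$$

coupled in cascade constitute the natural temporal smoothing kernels. These in turn lead to a composed temporal convolution kernel of the form

$$h_{composed}(\cdot;\; \mu) = *_{k=1}^{K} \, h_{exp}(\cdot;\; \mu_k) \tag{26}$$

corresponding to a set of first-order integrators with time constants $\mu_k$ coupled in cascade (see Figure 7).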
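The cascade structure can be explored numerically. The sketch below, which is only an illustration under stated assumptions, samples truncated exponential kernels $e^{-t/\mu_k}/\mu_k$ (for $t \geq 0$) on a fine time grid, convolves them, and measures the mass, mean and variance of the composed kernel. The time constants, the grid resolution and the helper names are hypothetical choices made for the example.

```python
import numpy as np

def truncated_exponential(t, mu):
    """Truncated exponential kernel: exp(-t/mu)/mu for t >= 0, zero otherwise."""
    return np.where(t >= 0, np.exp(-t / mu) / mu, 0.0)

def composed_kernel(mus, t_max=40.0, dt=0.01):
    """Numerically convolve truncated exponentials with time constants `mus`."""
    t = np.arange(0.0, t_max, dt)
    h = truncated_exponential(t, mus[0])
    for mu in mus[1:]:
        # Discrete approximation of the continuous convolution integral,
        # truncated to the sampled time window.
        h = np.convolve(h, truncated_exponential(t, mu))[: t.size] * dt
    return t, h

def temporal_moments(t, h):
    """Mass, mean and variance of a sampled kernel (mean and variance mass-normalized)."""
    dt = t[1] - t[0]
    mass = h.sum() * dt
    mean = (t * h).sum() * dt / mass
    var = ((t - mean) ** 2 * h).sum() * dt / mass
    return mass, mean, var

mus = [0.5, 1.0, 2.0]  # illustrative time constants
t, h = composed_kernel(mus)
mass, mean, var = temporal_moments(t, h)
```

Up to discretization error, the temporal mean of the composed kernel equals the sum of the time constants and its variance equals the sum of their squares, which is why the variance is a natural measure of temporal scale for such a cascade.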

Fig. 7: Electric wiring diagram consisting of a set of resistors and capacitors that emulate a series of first-order integrators coupled in cascade, if we regard the time-varying voltage as representing the time varying input signal and the resulting output voltage as representing the time varying output signal at a coarser temporal scale. According to the theory of temporal scale-space kernels for one-dimensional signals (Lindeberg [38, 21]; Lindeberg and Fagerström [39]), the corresponding equivalent truncated exponential kernels are the only primitive temporal smoothing kernels that guarantee both temporal causality and non-creation of local extrema (alternatively zero-crossings) with increasing temporal scale.
Fig. 8: Space-time separable kernels up to order two obtained as the composition of Gaussian kernels over the spatial domain and a cascade of truncated exponential kernels over the temporal domain with a logarithmic distribution of the intermediate temporal scale levels that approximates the time-causal limit kernel. The corresponding family of spatio-temporal receptive fields is closed under spatial scaling transformations as well as under temporal scaling transformations for temporal scaling factors that are integer powers of the distribution parameter $c$ of the temporal smoothing kernel. (Horizontal axis: space $x$. Vertical axis: time $t$.)
Fig. 9: Velocity-adapted spatio-temporal kernels up to order two obtained as the composition of Gaussian kernels over the spatial domain and a cascade of truncated exponential kernels over the temporal domain with a logarithmic distribution of the intermediate temporal scale levels that approximates the time-causal limit kernel. In addition to spatial and temporal scaling transformations, the corresponding family of receptive fields is also closed under Galilean transformations. (Horizontal axis: space $x$. Vertical axis: time $t$.)

Two natural ways of distributing the discrete time constants $\mu_k$ over temporal scales are studied in detail in [21, 37], corresponding to either a uniform or a logarithmic distribution in terms of the composed variance

$$\tau_K = \sum_{k=1}^{K} \mu_k^2 \tag{27}$$

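Specifically, it is shown in [21] that in the case of a logarithmic distribution of the discrete temporal scale levels, it is possible to consider an infinite number of temporal scale levels that cluster infinitely densely near zero temporal scale

$$\dots, \; \frac{\tau_0}{c^6}, \; \frac{\tau_0}{c^4}, \; \frac{\tau_0}{c^2}, \; \tau_0, \; c^2 \tau_0, \; c^4 \tau_0, \; c^6 \tau_0, \; \dots \tag{28}$$

so that a scale-invariant time-causal limit kernel can be defined, obeying self-similarity and scale covariance over temporal scales and with a Fourier transform of the form

$$\hat{\Psi}(\omega;\; \tau, c) = \prod_{k=1}^{\infty} \frac{1}{1 + i \, c^{-k} \sqrt{c^2 - 1} \, \sqrt{\tau} \, \omega} \tag{29}$$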
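For a finite number of levels, a logarithmic distribution can be sketched as follows. The example assumes that the temporal scale levels, measured as variances, are spaced by a constant factor $c^2$ up to a maximum level, here called `tau_max`, and that each time constant is obtained from the difference between adjacent levels, so that the variances add up as in (27). The function name and parameter values are illustrative assumptions.

```python
import numpy as np

def log_distributed_time_constants(tau_max, c, K):
    """Temporal scale levels tau_k and time constants mu_k for K levels."""
    k = np.arange(1, K + 1)
    # Scale levels (variances) spaced by the factor c^2, ending at tau_max.
    tau = tau_max * c ** (2.0 * (k - K))
    # Variances add under cascade coupling: mu_1^2 = tau_1, mu_k^2 = tau_k - tau_{k-1}.
    mu = np.sqrt(np.diff(tau, prepend=0.0))
    return tau, mu

c = np.sqrt(2.0)  # illustrative distribution parameter
tau, mu = log_distributed_time_constants(tau_max=1.0, c=c, K=7)
```

Under these assumptions the time constants for $k \geq 2$ take the closed form $\mu_k = c^{k-K-1} \sqrt{c^2 - 1} \sqrt{\tau_{max}}$, and their squares sum to `tau_max`.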

Figure 8 and Figure 9 show spatio-temporal kernels over a 1+1-dimensional spatio-temporal domain, using approximations of the time-causal limit kernel for temporal smoothing over the temporal domain and the Gaussian kernel for spatial smoothing over the spatial domain. Figure 8 shows space-time separable receptive fields corresponding to image velocity $v = 0$, whereas Figure 9 shows non-separable velocity-adapted receptive fields corresponding to a non-zero image velocity $v \neq 0$.

The family of space-time separable receptive fields for zero image velocities is closed under spatial scaling transformations for arbitrary spatial scaling factors as well as under temporal scaling transformations with temporal scaling factors that are integer powers of the distribution parameter $c$ of the time-causal limit kernel. The full family of velocity-adapted receptive fields for general non-zero image velocities is additionally closed under Galilean transformations, corresponding to variations in the relative motion between the objects in the world and the observer. Given that the full families of receptive fields are explicitly represented in the vision system, this means that it will be possible to perfectly match receptive field responses computed under the following types of natural image transformations: (i) objects of different size in the image domain, as arising from e.g. viewing the same object from different distances, (ii) spatio-temporal events that occur with different speed, faster or slower, and (iii) objects and spatio-temporal events that are viewed with different relative motions between the objects/events and the visual observer.

If additionally the spatial smoothing is performed over the full family of spatial covariance matrices $\Sigma$, then receptive field responses can also be matched (iv) between different views of the same smooth local surface patch.

III-C Scale normalisation of spatial and spatio-temporal receptive fields

When computing receptive field responses over multiple spatial and temporal scales, there is an issue about how the receptive field responses should be normalized so as to enable appropriate comparisons between receptive field responses at different scales. Issues of scale normalisation of the derivative based receptive fields defined from scale-space operations are treated in [40, 41, 42] regarding spatial receptive fields and in [21, 37, 43] regarding spatio-temporal receptive fields.

III-C0a Scale-normalized spatial receptive fields

Let $\lambda_{\varphi}$ and $\lambda_{\bot\varphi}$ denote the eigenvalues of the composed affine covariance matrix $\Sigma$ in the spatial receptive field model (22) and let $\partial_{\varphi}$ and $\partial_{\bot\varphi}$ denote directional derivative operators along the corresponding eigendirections. Then, the scale-normalized spatial derivative kernel corresponding to the receptive field model (22) is given by

(30)

where $\gamma_s$ denotes the spatial scale normalization parameter of $\gamma$-normalized derivatives. Specifically, the choice $\gamma_s = 1$ leads to maximum scale invariance, in the sense that the magnitude response of the spatial receptive field will be covariant under uniform spatial scaling transformations $x' = S_s \, x$, provided that the spatial scale levels are appropriately matched, $s' = S_s^2 \, s$.
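The effect of this normalization can be illustrated in a simplified one-dimensional setting, which is an assumption made here for brevity and not the affine two-dimensional model itself. For a sine wave of angular frequency $\omega$, the sketch below computes the amplitude of the first-order Gaussian derivative response multiplied by $s^{\gamma/2}$ with $\gamma = 1$, and locates the scale at which this amplitude peaks. All names and parameter values are illustrative.

```python
import numpy as np

def normalized_derivative_amplitude(omega, s, gamma=1.0, dx=0.05):
    """Amplitude of the gamma-normalized first-order Gaussian derivative response to sin(omega x)."""
    radius = 6.0 * np.sqrt(s)
    u = np.arange(-radius, radius + dx, dx)
    g = np.exp(-u**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
    # Response at x = 0: integral of g_x(0 - u) f(u) du, where g_x(-u) = (u / s) g(u).
    response = np.sum((u / s) * g * np.sin(omega * u)) * dx
    return s ** (gamma / 2.0) * abs(response)

scales = 2.0 ** np.arange(-4.0, 6.01, 0.25)  # geometric grid of scale values (variances)

def peak_over_scales(omega):
    """Scale and value at which the normalized amplitude is largest on the grid."""
    amplitudes = np.array([normalized_derivative_amplitude(omega, s) for s in scales])
    i = int(np.argmax(amplitudes))
    return scales[i], amplitudes[i]

s_fine, peak_fine = peak_over_scales(omega=1.0)
s_coarse, peak_coarse = peak_over_scales(omega=0.5)  # pattern stretched by S = 2
```

Stretching the pattern by a factor of two moves the selected scale by a factor of four, in line with $s' = S_s^2 \, s$, while the peak magnitude of the normalized response stays the same.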

III-C0b Scale-normalized spatio-temporal receptive fields in the case of a non-causal spatio-temporal domain

For the case of a non-causal spatio-temporal domain, where the temporal smoothing operation in the spatio-temporal receptive field model is performed by a non-causal Gaussian temporal kernel, the scale-normalized spatio-temporal derivative kernel corresponding to the spatio-temporal receptive field model (24) is, with corresponding notation regarding the spatial domain as in (30), given by

(31)

where $\gamma_s$ and $\gamma_\tau$ denote the spatial and temporal scale normalization parameters of $\gamma$-normalized derivatives. Specifically, the choice $\gamma_s = 1$ and $\gamma_\tau = 1$ leads to maximum scale invariance, in the sense that the magnitude response of the spatio-temporal receptive field will be invariant under independent scaling transformations of the spatial and the temporal domains $(x_1', x_2', t') = (S_s \, x_1, S_s \, x_2, S_\tau \, t)$, provided that both the spatial and temporal scale levels are appropriately matched, $(s', \tau') = (S_s^2 \, s, S_\tau^2 \, \tau)$.

III-C0c Scale-normalized spatio-temporal receptive fields in the case of a time-causal spatio-temporal domain

For the case of a time-causal spatio-temporal domain, where the temporal smoothing operation in the spatio-temporal receptive field model is performed by truncated exponential kernels coupled in cascade (26), the scale-normalized spatio-temporal derivative kernel corresponding to the spatio-temporal receptive field model (24) is given by

(32)

where $\gamma_s$ and $\gamma_\tau$ denote the spatial and temporal scale normalization parameters of $\gamma$-normalized derivatives and $\alpha_{n,\gamma_\tau}(\tau)$ is the temporal scale normalization factor, which for the case of variance-based normalization is given by

(33)

in agreement with (31), while for the case of $L_p$-normalization it is given by [21, Equation (76)]

(34)

with $G_{n,\gamma_\tau}$ denoting the $L_p$-norm of the $n$th-order scale-normalized derivative of a non-causal Gaussian temporal kernel with scale normalization parameter $\gamma_\tau$. In the specific case when the temporal smoothing is performed using the scale-invariant limit kernel (29), the magnitude response will, for the maximally scale invariant choice of scale normalization parameters $\gamma_s = 1$ and $\gamma_\tau = 1$, be invariant under independent scaling transformations of the spatial and the temporal domains for general spatial scaling factors $S_s$ and for temporal scaling factors $S_\tau = c^j$ that are integer powers of the distribution parameter $c$ of the scale-invariant limit kernel, provided that both the spatial and temporal scale levels are appropriately matched.

III-D Invariance to local multiplicative illumination variations or variations in exposure parameters

The treatment so far has been concerned with modelling receptive fields under natural geometric image transformations, modelled as local scaling transformations, local affine transformations and local Galilean transformations representing the essential dimensions in the variability of a local linearization of the perspective mapping from a local surface patch in the world to the tangent plane of the retina. A complementary issue concerns how to model receptive field responses under variations in the external illumination and under variations in the internal exposure mechanisms of the eye that adapt the diameter of the pupil and the sensitivity of the photoreceptors to the external illumination. In this section, we will present a solution for this problem regarding the subset of intensity transformations that can be modelled as local multiplicative intensity transformations.

To obtain theoretically well-founded handling of image data under illumination variations, it is natural to represent the image data $f$ on a logarithmic luminosity scale in terms of the image intensity $I$

$$f(x_1, x_2, t) \sim \log I(x_1, x_2, t) \tag{35}$$

Specifically, receptive field responses that are computed from such a logarithmic parameterization of the image luminosities can be interpreted physically as a superposition of relative variations of surface structure and illumination variations. Let us assume: (i) a perspective camera model extended with (ii) a thin circular lens for gathering incoming light from different directions and (iii) a Lambertian illumination model extended with (iv) a spatially varying albedo factor for modelling the light that is reflected from surface patterns in the world. Then, it can be shown [19, Section 2.3] that a spatio-temporal receptive field response

(36)

of the image data $f$, where $\mathcal{T}$ represents the spatio-temporal smoothing operator (here corresponding to a spatio-temporal smoothing kernel of the form (24)), can be expressed as

(37)

where

  • $\rho$ is a spatially dependent albedo factor that reflects properties of surfaces of objects in the environment, with the implicit understanding that this entity may in general refer to points on different surfaces in the world depending on the viewing direction and thus the (possibly time-dependent) image position,

  • $i$ denotes a spatially dependent illumination field, with the implicit understanding that the amount of incoming light on different surfaces may be different for different points in the world as mapped to corresponding image coordinates over time,

  • $C_{cam}$ represents the possibly time-dependent internal camera parameters, with the ratio $\tilde{f} = f/d$ referred to as the effective $f$-number, where $d$ denotes the diameter of the lens and $f$ the focal distance, and

  • $V$ represents a geometric natural vignetting effect corresponding to the factor $\cos^4 \phi$ for a planar image plane, with $\phi$ denoting the angle between the viewing direction and the surface normal of the image plane. This vignetting term disappears for a spherical camera model.

From the structure of Equation (37) we can note that for any non-zero order of spatial differentiation, with at least either $m_1 > 0$ or $m_2 > 0$, the influence of the internal camera parameters will disappear because of the spatial differentiation with respect to $x_1$ or $x_2$, and so will the effects of any other multiplicative exposure control mechanism. Furthermore, for any multiplicative variation of the illumination field, $i' = C \, i$ with $C$ a scalar constant, the logarithmic luminosity will be transformed as $\log i' = \log C + \log i$, which implies that the dependency on $C$ will disappear after spatial or temporal differentiation.

Thus, given that the image measurements are performed on a logarithmic brightness scale, the spatio-temporal receptive field responses will be automatically invariant under local multiplicative illumination variations as well as under local multiplicative variations in the exposure parameters of the retina and the eye.
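This invariance property can be checked numerically. The following sketch, which uses a hypothetical random test image and a simple separable Gaussian smoothing written only for this illustration, compares first-order derivative responses for an image and for a copy of it multiplied by a constant factor, on both a logarithmic and a linear intensity scale.

```python
import numpy as np

def gaussian_kernel_1d(s):
    """Sampled and renormalized 1-D Gaussian kernel with variance s."""
    radius = int(np.ceil(4.0 * np.sqrt(s)))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * s))
    return g / g.sum()

def smooth(image, s):
    """Separable Gaussian smoothing; 'valid' mode avoids boundary effects."""
    g = gaussian_kernel_1d(s)
    out = np.apply_along_axis(np.convolve, 0, image, g, mode="valid")
    return np.apply_along_axis(np.convolve, 1, out, g, mode="valid")

def first_order_response(image, s, log_scale=True):
    """First-order derivative along the second image axis of the smoothed data."""
    data = np.log(image) if log_scale else image
    return np.gradient(smooth(data, s), axis=1)

rng = np.random.default_rng(0)
intensity = 1.0 + rng.random((64, 64))  # strictly positive test image (assumption)
C = 7.3                                 # multiplicative illumination or exposure factor

log_a = first_order_response(intensity, 4.0)
log_b = first_order_response(C * intensity, 4.0)
lin_a = first_order_response(intensity, 4.0, log_scale=False)
lin_b = first_order_response(C * intensity, 4.0, log_scale=False)
```

On the logarithmic scale the two responses agree up to rounding error, whereas on the linear scale the response is multiplied by the factor `C`.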

IV Computational modelling of biological receptive fields

In two comprehensive reviews, DeAngelis et al. [4, 5] present overviews of spatial and temporal response properties of (classical) receptive fields in the central visual pathways. Specifically, the authors point out the limitations of defining receptive fields in the spatial domain only and emphasize the need to characterize receptive fields in the joint space-time domain, to describe how a neuron processes the visual image. Conway and Livingstone [22] and Johnson et al. [23] show results of corresponding investigations concerning spatio-chromatic receptive fields.

In the following, we will describe how the above derived idealized functional models of linear receptive fields can be used for modelling the spatial, spatio-chromatic and spatio-temporal response properties of biological receptive fields. Indeed, it will be shown that the derived idealized functional models lead to predictions of receptive field profiles that are qualitatively very similar to all the receptive field types presented in (DeAngelis et al. [4, 5]) and schematic simplifications of most of the receptive fields shown in (Conway and Livingstone [22]) and (Johnson et al. [23]).

  
  
Fig. 10: Computational modelling of space-time separable receptive field profiles in the lateral geniculate nucleus (LGN) as reported by DeAngelis et al. [4] using idealized spatio-temporal receptive fields according to Equation (24) with the temporal smoothing function modelled as a cascade of first-order integrators/truncated exponential kernels of the form (26): (left) a “non-lagged cell” modelled using first-order temporal derivatives, (right) a “lagged cell” modelled using second-order temporal derivatives. Parameter values with and : (a) :  degrees,  ms. (b) :  degrees,  ms. (Horizontal dimension: space $x$. Vertical dimension: time $t$.)

Fig. 11: Computational modelling of the spatial component of receptive fields in the LGN using the Laplacian of the Gaussian: (left) Receptive fields in the LGN have approximately circular center-surround responses in the spatial domain, as reported by DeAngelis et al. [4]. (right) In terms of Gaussian derivatives, this spatial response profile can be modelled by the Laplacian of the Gaussian , here with in units of degrees of visual angle.
Fig. 12: Spatio-chromatic receptive field response of a double-opponent neuron as reported by Conway and Livingstone [22, Figure 2, Page 10831], with the colour channels L, M and S essentially corresponding to red, green and blue, respectively. (From these L, M and S colour channels, corresponding red/green and yellow/blue colour-opponent channels can be formed from the differences between L and M and between L+M and S.)

Fig. 13: Idealized spatio-chromatic receptive fields over the spatial domain corresponding to the application of the Laplacian operator to positive and negative red/green and yellow/blue colour opponent channels, respectively. These receptive fields can be seen as idealized models of the spatial component of double-opponent spatio-chromatic receptive fields in the LGN.

IV-A Spatial and spatio-temporal receptive fields in the LGN

Regarding visual receptive fields in the lateral geniculate nucleus (LGN), DeAngelis et al. [4, 5] report that most neurons (i) have approximately circular center-surround organization in the spatial domain and that (ii) most of the receptive fields are separable in space-time. There are two main classes of temporal responses for such cells: a “non-lagged cell” is defined as a cell for which the first temporal lobe is the largest one (Figure 10(left)), whereas a “lagged cell” is defined as a cell for which the second lobe dominates (Figure 10(right)).

When using a time-causal temporal smoothing kernel, the first peak of a first-order temporal derivative will be strongest, whereas the second peak of a second-order temporal derivative will be strongest (see [21, Figure 2]). Thus, according to this theory, non-lagged LGN cells can be seen as corresponding to first-order time-causal temporal derivatives, whereas lagged LGN cells can be seen as corresponding to second-order time-causal temporal derivatives.

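The spatial response, on the other hand, shows a high similarity to a Laplacian of a Gaussian, leading to an idealized receptive field model of the form [19, Equation (108)]

$$h_{LGN}(x_1, x_2, t;\; s, \tau) = \pm \left( \partial_{x_1 x_1} + \partial_{x_2 x_2} \right) g(x_1, x_2;\; s) \; \partial_{t^n} h(t;\; \tau) \tag{38}$$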

Figure 11 shows a comparison between the spatial component of a receptive field in the LGN and a Laplacian of the Gaussian. This model can also be used for modelling spatial on-center/off-surround and off-center/on-surround receptive fields in the retina. Figure 10 shows results of modelling space-time separable receptive fields in the LGN in this way, using a cascade of truncated exponential kernels of the form (26) for temporal smoothing over the temporal domain.

Regarding the spatial domain, the model in terms of spatial Laplacians of Gaussians is closely related to differences of Gaussians, which have previously been shown to constitute a good approximation of the spatial variation of receptive fields in the retina and the LGN (Rodieck [29]). This property follows from the fact that the rotationally symmetric Gaussian satisfies the isotropic diffusion equation

$$\partial_s \, g(x_1, x_2;\; s) = \frac{1}{2} \nabla^2 g(x_1, x_2;\; s) \tag{39}$$

which implies that differences of Gaussians can be interpreted as approximations of derivatives over scale and hence of Laplacian responses. Conceptually, this implies very good agreement with the spatial component of the LGN model (38) in terms of Laplacians of Gaussians. More recently, Bonin et al. [44] have found that LGN responses in cats are well described by difference-of-Gaussians and temporal smoothing complemented by a non-linear contrast gain control mechanism (not modelled here).
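The relation between differences of Gaussians and Laplacians of Gaussians can be verified numerically. The sketch below assumes a rotationally symmetric Gaussian parameterized by its variance $s$, for which the diffusion equation reads $\partial_s g = \frac{1}{2} \nabla^2 g$, and compares a difference of Gaussians at nearby scales with the corresponding first-order approximation. The scale values and grid are illustrative choices.

```python
import numpy as np

def gaussian_2d(x1, x2, s):
    """Rotationally symmetric 2-D Gaussian with variance s."""
    return np.exp(-(x1**2 + x2**2) / (2.0 * s)) / (2.0 * np.pi * s)

def laplacian_of_gaussian(x1, x2, s):
    """Closed-form Laplacian of the rotationally symmetric 2-D Gaussian."""
    r2 = x1**2 + x2**2
    return (r2 - 2.0 * s) / s**2 * gaussian_2d(x1, x2, s)

x = np.linspace(-12.0, 12.0, 97)
X1, X2 = np.meshgrid(x, x, indexing="ij")

s, ds = 4.0, 0.04  # base scale and small scale increment (assumed values)
dog = gaussian_2d(X1, X2, s + ds) - gaussian_2d(X1, X2, s)
# First-order approximation: g(s + ds) - g(s) ~ ds * d/ds g = (ds / 2) * Laplacian of g.
approx = 0.5 * ds * laplacian_of_gaussian(X1, X2, s)
rel_error = np.max(np.abs(dog - approx)) / np.max(np.abs(approx))
```

For a small scale increment the relative deviation is of the order of `ds / s`, so the difference of Gaussians approaches a scaled Laplacian of the Gaussian as the two scales approach each other.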

Fig. 14: Computational modelling of a receptive field profile over the spatial domain in the primary visual cortex (V1) as reported by DeAngelis et al. [4, 5] using affine Gaussian derivatives: (middle) Receptive field profile of a simple cell over image intensities as reconstructed from cell recordings, with positive weights represented as green and negative weights by red. (left) Stylized simplification of the receptive field shape. (right) Idealized model of the receptive field from a first-order directional derivative of an affine Gaussian kernel