# Provably scale-covariant networks from oriented quasi quadrature measures in cascade
In Proc. SSVM 2019: Scale Space and Variational Methods in Computer Vision, Springer LNCS volume 11603, pages 328–340. The support from the Swedish Research Council (contract 2018-03586) is gratefully acknowledged.

###### Abstract

This article presents a continuous model for hierarchical networks based on a combination of mathematically derived models of receptive fields and biologically inspired computations. Based on a functional model of complex cells in terms of an oriented quasi quadrature combination of first- and second-order directional Gaussian derivatives, we couple such primitive computations in cascade over combinatorial expansions over image orientations. Scale-space properties of the computational primitives are analysed and it is shown that the resulting representation allows for provable scale and rotation covariance. A prototype application to texture analysis is developed and it is demonstrated that a simplified mean-reduced representation of the resulting QuasiQuadNet leads to promising experimental results on three texture datasets.

## 1 Introduction

The recent progress with deep learning architectures has demonstrated that hierarchical feature representations over multiple layers have much higher potential compared to approaches based on single layers of receptive fields. A limitation of current deep nets, however, is that they are not truly scale covariant. A deep network constructed by repeated application of compact $3 \times 3$ or $5 \times 5$ kernels, such as AlexNet [1], VGG-Net [2] or ResNet [3], implies an implicit assumption of a preferred size in the image domain, as induced by the discretization in terms of local $3 \times 3$ or $5 \times 5$ kernels of a fixed size. Thereby, due to the non-linearities in the deep net, the output from the network may be qualitatively different depending on the specific size of the object in the image domain, which varies with e.g. the distance between the object and the observer. To handle this lack of scale covariance, approaches have been developed such as spatial transformer networks [4], using sets of subnetworks in a multi-scale fashion [5] or combining deep nets with image pyramids [6]. Since the size normalization performed by a spatial transformer network is not guaranteed to be truly scale covariant, and since traditional image pyramids imply a loss of image information that can be interpreted as corresponding to undersampling, it is of interest to develop continuous approaches for deep networks that guarantee true scale covariance or better approximations thereof.

The subject of this article is to develop a continuous model for capturing non-linear hierarchical relations between features over multiple scales in such a way that the resulting feature representation is provably scale covariant. Building upon axiomatic modelling of visual receptive fields in terms of Gaussian derivatives and affine extensions thereof, which can serve as idealized models of simple cells in the primary visual cortex [7, 8, 9], we will propose a functional model for complex cells in terms of an oriented quasi quadrature measure. Then, we will combine such oriented quasi quadrature measures in cascade, building upon the early idea of Fukushima [10] of using Hubel and Wiesel’s findings regarding receptive fields in the primary visual cortex [11] to build a hierarchical neural network from repeated application of models of simple and complex cells.

We will show how the scale-space properties of the quasi quadrature primitive in this representation can be theoretically analyzed and how the resulting hand-crafted network becomes provably scale and rotation covariant, in such a way that the multi-scale and multi-orientation network commutes with scaling transformations and rotations over the spatial image domain. Experimentally, we will investigate a prototype application to texture classification based on a substantially mean-reduced representation of the resulting QuasiQuadNet.

## 2 The quasi quadrature measure over a 1-D signal

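Consider the scale-space representation $L(x;\, s) = \int_{\xi \in \mathbb{R}} g(x - \xi;\, s)\, f(\xi)\, d\xi$ of a 1-D signal $f(x)$, defined by convolution with Gaussian kernels $g(x;\, s) = e^{-x^2/2s}/\sqrt{2 \pi s}$, and with scale-normalized derivatives $\partial_{\xi^n} = s^{n \gamma/2}\, \partial_x^n$ according to [12].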

### Quasi quadrature in 1-D.

Motivated by the fact that the first-order derivatives primarily respond to the locally odd component of the signal, whereas the second-order derivatives primarily respond to the locally even component of a signal, it is natural to aim at a differential feature detector that combines locally odd and even components in a complementary manner. By specifically combining the first- and second-order scale-normalized derivative responses in a Euclidean way, we obtain a quasi quadrature measure of the form

$$\mathcal{Q}_{x,norm} L = \sqrt{\frac{s\, L_x^2 + C\, s^2\, L_{xx}^2}{s^{\Gamma}}} \tag{1}$$

as a modification of the quasi quadrature measures previously proposed and studied in [12, 13], with the scale normalization parameters $\gamma_1$ and $\gamma_2$ of the first- and second-order derivatives coupled according to $\gamma_1 = 1 - \Gamma$ and $\gamma_2 = 1 - \Gamma/2$ to enable scale covariance by adding derivative expressions of different orders only for the scale-invariant choice of $\gamma = 1$. This differential entity can be seen as an approximation of the notion of a quadrature pair of an odd and even filter as more traditionally formulated based on a Hilbert transform, while confined within the family of differential expressions based on Gaussian derivatives.

Figure 1 shows the result of computing this quasi quadrature measure for a Gaussian peak as well as its first- and second-order derivatives. As can be seen, the quasi quadrature measure is much less sensitive to the position of the peak compared to e.g. the first- or second-order derivatives. Additionally, the quasi quadrature measure also has some degree of spatial insensitivity for a first-order derivative (a local edge model) and a second-order derivative.
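The behaviour described above can be reproduced numerically. The following is a minimal sketch, assuming the measure has the form $\sqrt{(s L_x^2 + C s^2 L_{xx}^2)/s^{\Gamma}}$, a unit-spaced sampling grid and sampled Gaussian derivative kernels; the function names, the truncation at five standard deviations and the default $C = 8/11$ are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def gaussian_derivative_kernels(s, truncate=5.0):
    """Sampled first- and second-order derivatives of a Gaussian with variance s."""
    radius = int(np.ceil(truncate * np.sqrt(s)))
    u = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-u**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
    return -u / s * g, (u**2 - s) / s**2 * g

def quasi_quadrature_1d(f, s, C=8.0 / 11.0, Gamma=0.0):
    """Quasi quadrature sqrt((s L_x^2 + C s^2 L_xx^2) / s^Gamma) at scale s."""
    gx, gxx = gaussian_derivative_kernels(s)
    Lx = np.convolve(f, gx, mode="same")    # first-order scale-space derivative
    Lxx = np.convolve(f, gxx, mode="same")  # second-order scale-space derivative
    return np.sqrt((s * Lx**2 + C * s**2 * Lxx**2) / s**Gamma)

# Gaussian peak with variance s0 as a model signal (hypothetical test values)
x = np.arange(-200, 201, dtype=float)
s0 = 16.0
f = np.exp(-x**2 / (2.0 * s0)) / np.sqrt(2.0 * np.pi * s0)

Q = quasi_quadrature_1d(f, s=16.0)
Lx_abs = np.abs(np.convolve(f, gaussian_derivative_kernels(16.0)[0], mode="same"))

core = np.abs(x) <= 5                       # central part of the peak
print("Q min/max over the core:   ", Q[core].min() / Q[core].max())
print("|L_x| min/max over the core:", Lx_abs[core].min() / Lx_abs[core].max())
```

In this example the quasi quadrature response stays nearly constant across the central part of the peak, whereas the first-order response vanishes at the peak centre.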

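### Determination of $C$.

To determine the weighting parameter $C$ between local second-order and first-order information, let us consider a Gaussian blob $f(x) = g(x;\, s_0)$ with spatial extent given by $s_0$ as input model signal. By using the semi-group property of the Gaussian kernel $g(\cdot;\, s_1) * g(\cdot;\, s_2) = g(\cdot;\, s_1 + s_2)$, the quasi quadrature measure can be computed in closed form

$$\mathcal{Q}_{x,norm} L = \frac{s^{\frac{1-\Gamma}{2}}\, e^{-\frac{x^2}{2 (s + s_0)}} \sqrt{x^2 (s + s_0)^2 + C\, s\, (s + s_0 - x^2)^2}}{\sqrt{2 \pi}\, (s + s_0)^{5/2}} \tag{2}$$

By determining the weighting parameter $C$ such that it minimizes the overall ripple in the squared quasi quadrature measure for a Gaussian input

$$\hat{C} = \operatorname{argmin}_{C \geq 0} \int_{x=-\infty}^{\infty} \left( \partial_x \left( \mathcal{Q}_{x,norm}^2 L \right) \right)^2 dx \tag{3}$$

we obtain

$$\hat{C} = \frac{4\, (s + s_0)}{11\, s} \tag{4}$$

which in the special case of choosing $s = s_0$ corresponds to $C = 8/11 \approx 0.727$. This value is very close to the value derived from an equal contribution condition in [13, Eq. (27)] for the special case of choosing .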
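The ripple-minimizing weight can be checked numerically. The sketch below assumes the squared measure is $(s L_x^2 + C s^2 L_{xx}^2)/s^{\Gamma}$ for the Gaussian input $L = g(x;\, s + s_0)$ and that the ripple is measured as the integral of the squared spatial derivative of this quantity. Since the squared measure is affine in $C$, the minimizer follows from a one-parameter least-squares fit; the function name and grid settings are hypothetical, and the constant factor $s^{-\Gamma}$ is omitted because it does not affect the minimizer.

```python
import numpy as np

def ripple_minimizing_C(s, s0, half_width=8.0, n=40001):
    """Least-squares minimizer over C of the integral of (d/dx (A + C B))^2."""
    v = s + s0                                   # variance of the smoothed blob
    x = np.linspace(-half_width, half_width, n) * np.sqrt(v)
    g2 = np.exp(-x**2 / v) / (2.0 * np.pi * v)   # squared Gaussian g(x; v)^2
    A = s * (x / v)**2 * g2                      # s L_x^2
    B = s**2 * ((x**2 - v) / v**2)**2 * g2       # s^2 L_xx^2
    dA = np.gradient(A, x)
    dB = np.gradient(B, x)
    # On a uniform grid the ratio of sums approximates the ratio of integrals
    return -np.sum(dA * dB) / np.sum(dB * dB)

for s, s0 in [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0)]:
    print(s, s0, ripple_minimizing_C(s, s0), 4.0 * (s + s0) / (11.0 * s))
```

The two printed columns compare the numerical minimizer with the expression $4 (s + s_0)/(11 s)$.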

### Scale selection properties.

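To analyze the scale selection properties of the quasi quadrature measure, let us consider the result of using Gaussian derivatives of orders 0, 1 and 2 as input signals, i.e., $f(x) = g_{x^n}(x;\, s_0)$ for $n \in \{0, 1, 2\}$.

For the zero-order Gaussian kernel, the scale-normalized quasi quadrature measure at the origin is given by

$$\left. \mathcal{Q}_{x,norm} L \right|_{x=0,\, n=0} = \frac{\sqrt{C}\, s^{1 - \Gamma/2}}{\sqrt{2 \pi}\, (s + s_0)^{3/2}} \tag{5}$$

For the first-order Gaussian derivative kernel, the scale-normalized quasi quadrature measure at the origin is

$$\left. \mathcal{Q}_{x,norm} L \right|_{x=0,\, n=1} = \frac{s^{(1 - \Gamma)/2}}{\sqrt{2 \pi}\, (s + s_0)^{3/2}} \tag{6}$$

whereas for the second-order Gaussian derivative kernel, the scale-normalized quasi quadrature measure at the origin is

$$\left. \mathcal{Q}_{x,norm} L \right|_{x=0,\, n=2} = \frac{3 \sqrt{C}\, s^{1 - \Gamma/2}}{\sqrt{2 \pi}\, (s + s_0)^{5/2}} \tag{7}$$

By differentiating these expressions with respect to scale, we find that for a zero-order Gaussian kernel the maximum response over scale is assumed at

$$\left. \hat{s} \right|_{n=0} = \frac{s_0\, (2 - \Gamma)}{1 + \Gamma} \tag{8}$$

whereas for first- and second-order derivatives, respectively, the maximum response over scale is assumed at

$$\left. \hat{s} \right|_{n=1} = \frac{s_0\, (1 - \Gamma)}{2 + \Gamma}, \qquad \left. \hat{s} \right|_{n=2} = \frac{s_0\, (2 - \Gamma)}{3 + \Gamma} \tag{9}$$

In the special case of choosing $\Gamma = 0$, these scale estimates correspond to

$$\left. \hat{s} \right|_{n=0} = 2 s_0, \qquad \left. \hat{s} \right|_{n=1} = \frac{s_0}{2}, \qquad \left. \hat{s} \right|_{n=2} = \frac{2 s_0}{3} \tag{10a-c}$$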

Thus, for a Gaussian input signal, the selected scale level will for the most scale-invariant choice of using reflect the spatial extent of the blob, whereas if we would like the scale estimate to reflect the scale parameter of first- and second-order derivatives, we would have to choose . An alternative motivation for using finer scale levels for the Gaussian derivative kernels is to regard the positive and negative lobes of the Gaussian derivative kernels as substructures of a more complex signal, which would then warrant the use of finer scale levels to reflect the substructures of the signal ((10b) and (10c)).
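The scale selection behaviour can be explored with a small numerical experiment. The sketch below assumes the measure $\sqrt{(s L_x^2 + C s^2 L_{xx}^2)/s^{\Gamma}}$ and uses the values of the Gaussian derivatives at the origin, so that only one of the two terms is non-zero for each input order $n$. The maximum over scale is located by a plain grid search; function names and grid settings are illustrative.

```python
import numpy as np

def origin_response(s, s0, n, C=8.0 / 11.0, Gamma=0.0):
    """Quasi quadrature response at x = 0 for the input g_{x^n}(x; s0)."""
    v = s + s0                          # variance after smoothing (semi-group)
    norm = np.sqrt(2.0 * np.pi)
    if n == 0:                          # only L_xx = g_xx(0; v) contributes
        return np.sqrt(C) * s**(1.0 - Gamma / 2.0) / (norm * v**1.5)
    if n == 1:                          # only L_x = g_xx(0; v) contributes
        return s**((1.0 - Gamma) / 2.0) / (norm * v**1.5)
    if n == 2:                          # only L_xx = g_xxxx(0; v) contributes
        return 3.0 * np.sqrt(C) * s**(1.0 - Gamma / 2.0) / (norm * v**2.5)
    raise ValueError("n must be 0, 1 or 2")

def selected_scale(s0, n, Gamma=0.0):
    """Scale at which the origin response is maximal (log-spaced grid search)."""
    s = np.logspace(-3, 3, 600001) * s0
    return s[np.argmax(origin_response(s, s0, n, Gamma=Gamma))]

def predicted_scale(s0, n, Gamma=0.0):
    """Closed-form maxima obtained by differentiating with respect to s."""
    num = {0: 2.0 - Gamma, 1: 1.0 - Gamma, 2: 2.0 - Gamma}[n]
    den = {0: 1.0 + Gamma, 1: 2.0 + Gamma, 2: 3.0 + Gamma}[n]
    return s0 * num / den

for n in (0, 1, 2):
    print(n, selected_scale(1.0, n), predicted_scale(1.0, n))
```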

## 3 Oriented quasi quadrature modelling of complex cells

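In this section, we will consider an extension of the 1-D quasi quadrature measure (1) into an oriented quasi quadrature measure of the form

$$\mathcal{Q}_{\varphi,norm} L = \sqrt{\frac{\lambda_{\varphi}\, L_{\varphi}^2 + C\, \lambda_{\varphi}^2\, L_{\varphi\varphi}^2}{s^{\Gamma}}} \tag{11}$$

where $L_{\varphi}$ and $L_{\varphi\varphi}$ denote directional derivatives of an affine Gaussian scale-space representation [14, ch. 15] of the form $L_{\varphi} = \cos\varphi\, L_{x_1} + \sin\varphi\, L_{x_2}$ and $L_{\varphi\varphi} = \cos^2\varphi\, L_{x_1 x_1} + 2 \cos\varphi \sin\varphi\, L_{x_1 x_2} + \sin^2\varphi\, L_{x_2 x_2}$, and with $\lambda_{\varphi}$ denoting the variance of the affine Gaussian kernel (with $x = (x_1, x_2)^T$)

$$g(x;\, s, \Sigma) = \frac{1}{2 \pi s \sqrt{\det \Sigma}}\, e^{-x^T \Sigma^{-1} x/2s} \tag{12}$$

in direction $\varphi$, preferably with the orientation $\varphi$ aligned with the direction $\alpha$ of either of the eigenvectors of the composed spatial covariance matrix $s\, \Sigma$, with

$$\Sigma = \frac{1}{\max(\lambda_1, \lambda_2)} \begin{pmatrix} \lambda_1 \cos^2\alpha + \lambda_2 \sin^2\alpha & (\lambda_1 - \lambda_2) \cos\alpha \sin\alpha \\ (\lambda_1 - \lambda_2) \cos\alpha \sin\alpha & \lambda_1 \sin^2\alpha + \lambda_2 \cos^2\alpha \end{pmatrix} \tag{13}$$

normalized such that the main eigenvalue is equal to one.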
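As a concrete illustration of the normalization, the sketch below builds a covariance matrix from two eigenvalues and an orientation, rescales it so that the main eigenvalue is one, and samples the corresponding affine Gaussian kernel. It assumes the kernel $g(x;\, s, \Sigma) = e^{-x^T \Sigma^{-1} x/2s}/(2 \pi s \sqrt{\det \Sigma})$ and interprets the variance in direction $\varphi$ as $s\, e_{\varphi}^T \Sigma\, e_{\varphi}$; all function names and parameter values are hypothetical.

```python
import numpy as np

def normalized_covariance(lambda1, lambda2, alpha):
    """Covariance with eigenvalues lambda1, lambda2 and first eigenvector at
    angle alpha, rescaled so that the largest eigenvalue equals one."""
    R = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    return R @ np.diag([lambda1, lambda2]) @ R.T / max(lambda1, lambda2)

def affine_gaussian(x1, x2, s, Sigma):
    """Affine Gaussian kernel with covariance matrix s * Sigma."""
    P = np.linalg.inv(Sigma)
    quad = P[0, 0] * x1**2 + 2.0 * P[0, 1] * x1 * x2 + P[1, 1] * x2**2
    return np.exp(-quad / (2.0 * s)) / (2.0 * np.pi * s * np.sqrt(np.linalg.det(Sigma)))

def directional_variance(s, Sigma, phi):
    """Variance of the kernel along the unit vector at angle phi."""
    e = np.array([np.cos(phi), np.sin(phi)])
    return s * (e @ Sigma @ e)

alpha = np.pi / 6
Sigma = normalized_covariance(4.0, 1.0, alpha)
x1, x2 = np.meshgrid(np.arange(-40, 41.0), np.arange(-40, 41.0))
kernel = affine_gaussian(x1, x2, 16.0, Sigma)

print("eigenvalues:", np.linalg.eigvalsh(Sigma))
print("kernel sum: ", kernel.sum())
print("variance along / across alpha:",
      directional_variance(16.0, Sigma, alpha),
      directional_variance(16.0, Sigma, alpha + np.pi / 2))
```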

### Affine Gaussian derivative model for linear receptive fields.

According to the normative theory for visual receptive fields in Lindeberg [8, 9], directional derivatives of affine Gaussian kernels constitute a canonical model for visual receptive fields over a 2-D spatial domain. Specifically, it was proposed that simple cells in the primary visual cortex (V1) can be modelled by directional derivatives of affine Gaussian kernels, termed affine Gaussian derivatives, of the form

(14) |

Figure 2 shows an example of the spatial dependency of a colour-opponent simple cell that can be well modelled by a first-order affine Gaussian derivative over an R-G colour-opponent channel over image intensities. Corresponding modelling results for non-chromatic receptive fields can be found in [8, 9].

### Affine quasi quadrature modelling of complex cells.

Figure 3 shows functional properties of a complex cell as determined from its response properties to natural images, using a spike-triggered covariance method (STC), which computes the eigenvalues and the eigenvectors of a second-order Wiener kernel (Touryan et al. [16]). As can be seen from this figure, the shapes of the eigenvectors determined from the non-linear Wiener kernel model of the complex cell do qualitatively agree very well with the shapes of corresponding affine Gaussian derivative kernels of orders 1 and 2. Motivated by this property and theoretical and experimental motivations for modelling receptive field profiles of simple cells by affine Gaussian derivatives, we propose to model complex cells by a possibly post-smoothed (spatially pooled) oriented quasi quadrature measure of the form (11)

(15) |

where represents an affine covariance matrix in direction for computing directional derivatives and represents an affine covariance matrix in the same direction for integrating pointwise affine quasi quadrature measures over a region in image space.

The pointwise affine quasi quadrature measure can be seen as a Gaussian derivative based analogue of the energy model for complex cells as proposed by Adelson and Bergen [17] and Heeger [18]. It is closely related to a proposal by Koenderink and van Doorn [19] of summing up the squares of first- and second-order derivative responses and nicely compatible with results by De Valois et al. [20], who showed that first- and second-order receptive fields typically occur in pairs that can be modelled as approximate Hilbert pairs.

The addition of a complementary post-smoothing stage as determined by the affine Gaussian weighting function is closely related to recent results by Westö and May [21], who have shown that complex cells are better modelled as a combination of two spatial integration steps.

By choosing these spatial smoothing and weighting functions as affine Gaussian kernels, we ensure an affine covariant model of the complex cells, to enable the computation of affine invariants at higher levels in the visual hierarchy.

The use of multiple affine receptive fields over different shapes of the affine covariance matrices and can be motivated by results by Goris et al. [22], who show that there is a large variability in the orientation selectivity of simple and complex cells. With respect to this model, this means that we can think of affine covariance matrices of different eccentricity as being present from isotropic to highly eccentric. By considering the full family of positive definite affine covariance matrices, we obtain a fully affine covariant image representation able to handle local linearizations of the perspective mapping for all possible views of any smooth local surface patch.

## 4 Hierarchies of oriented quasi quadrature measures

Let us in this first study disregard the variability due to different shapes of the affine receptive fields for different eccentricities and assume that $\Sigma = I$, i.e., that the underlying Gaussian kernels are isotropic. This restriction enables covariance to scaling transformations and rotations, whereas a full treatment of affine quasi quadrature measures over all positive definite covariance matrices would have the potential to enable full affine covariance.

An approach that we shall pursue is to build feature hierarchies by coupling oriented quasi quadrature measures (11) or (15) in cascade

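$$F_1(x, \varphi_1) = (\mathcal{Q}_{\varphi_1,norm}\, L)(x) \tag{16}$$

$$F_k(x, \varphi_1, \dots, \varphi_{k-1}, \varphi_k) = (\mathcal{Q}_{\varphi_k,norm}\, F_{k-1})(x, \varphi_1, \dots, \varphi_{k-1}) \tag{17}$$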

where we have suppressed the notation for the scale levels assumed to be distributed such that the scale parameter at level is for some , e.g., . Assuming that the initial scale-space representation is computed at scale , such a network can in turn be initiated for different values of , also distributed according to a geometric distribution.

This construction builds upon an early proposal by Fukushima [10] of building a hierarchical neural network from repeated application of models of simple and complex cells [11], which has later been explored in a hand-crafted network based on Gabor functions by Serre et al. [23] and in the scattering convolution networks by Bruna and Mallat [24]. This idea is also consistent with a proposal by Yamins and DiCarlo [25] of using repeated application of a single hierarchical convolution layer for explaining the computations in the mammalian cortex. With this construction, we obtain a way to define continuous networks that express a corresponding hierarchical architecture based on Gaussian derivative based models of simple and complex cells within the scale-space framework.
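A discrete counterpart of such a cascade can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes isotropic Gaussian kernels, an oriented measure of the form $\sqrt{(s L_{\varphi}^2 + C s^2 L_{\varphi\varphi}^2)/s^{\Gamma}}$, sampled Gaussian derivative kernels with zero padding, and a hypothetical scale schedule $s_k = s_0\, r^{2(k-1)}$. Each layer applies the oriented measure to every feature map of the previous layer for every orientation, which gives the combinatorial expansion over angles.

```python
import numpy as np

def _kernels(s, truncate=4.0):
    """Sampled Gaussian with variance s and its first two derivatives."""
    r = int(np.ceil(truncate * np.sqrt(s)))
    u = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-u**2 / (2.0 * s))
    g /= g.sum()
    return g, -u / s * g, (u**2 - s) / s**2 * g

def _separable(img, k_rows, k_cols):
    """Filter with k_rows along axis 0 (x2) and k_cols along axis 1 (x1)."""
    tmp = np.apply_along_axis(np.convolve, 0, img, k_rows, mode="same")
    return np.apply_along_axis(np.convolve, 1, tmp, k_cols, mode="same")

def oriented_quasi_quadrature(img, s, phi, C=8.0 / 11.0, Gamma=0.0):
    """Oriented quasi quadrature measure for an isotropic Gaussian (Sigma = I)."""
    g, g1, g2 = _kernels(s)
    L1, L2 = _separable(img, g, g1), _separable(img, g1, g)
    L11, L12, L22 = (_separable(img, g, g2), _separable(img, g1, g1),
                     _separable(img, g2, g))
    c, sn = np.cos(phi), np.sin(phi)
    L_phi = c * L1 + sn * L2
    L_phiphi = c * c * L11 + 2.0 * c * sn * L12 + sn * sn * L22
    return np.sqrt((s * L_phi**2 + C * s**2 * L_phiphi**2) / s**Gamma)

def cascade(img, s0, angles, depth, r=2.0):
    """Return a list of layers; each layer maps a tuple of orientation
    indices (one per layer so far) to a feature map."""
    layers = [{(): img}]
    for k in range(1, depth + 1):
        s_k = s0 * r**(2 * (k - 1))        # assumed geometric scale schedule
        layers.append({path + (i,): oriented_quasi_quadrature(F, s_k, phi)
                       for path, F in layers[-1].items()
                       for i, phi in enumerate(angles)})
    return layers[1:]

rng = np.random.default_rng(0)
image = rng.random((48, 48))               # placeholder input image
angles = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
layers = cascade(image, s0=1.0, angles=angles, depth=2)
print([len(layer) for layer in layers])    # number of feature maps per layer
```

Storing the feature maps in dictionaries keyed by orientation tuples keeps the combinatorial structure explicit, at the cost of memory that grows geometrically with depth.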

Each new layer in this model implies an expansion of combinations of angles over the different layers in the hierarchy. For example, if we in a discrete implementation discretize the angles into discrete spatial orientations, we will then obtain different features at level in the hierarchy. To keep the complexity down at higher levels, we will for in a corresponding way as done by Hadji and Wildes [26] introduce a pooling stage over orientations

(18) |

and instead define the next successive layer as

(19) |

to limit the number of features at any level to maximally . The proposed hierarchical feature representation is termed QuasiQuadNet.

### Scale covariance.

A theoretically attractive property of this family of networks is that the networks are provably scale covariant. Given two images and that are related by a uniform scaling transformation for some , their corresponding scale-space representations and will be equal and so will the scale-normalized derivatives based on if the spatial positions are related according to and the scale levels according to [12, Eqns. (16) and (20)]. This implies that if the initial scale levels and underlying the construction in (16) and (17) are related according to , then the first layers of the feature hierarchy will be related according to [13, Eqns. (55) and (63)]. Higher layers in the feature hierarchy are in turn related according to

(20) |

and are specifically equal if . This means that it will be possible to perfectly match such hierarchical representations under uniform scaling transformations.
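The scale covariance property can be checked numerically for a single layer in the 1-D case. The sketch below assumes the measure $\sqrt{(s L_x^2 + C s^2 L_{xx}^2)/s^{\Gamma}}$, a rescaled signal $f'(x') = f(x'/S)$ with an integer factor $S$ so that matching points stay on the sampling grid, and matching scale levels $s' = S^2 s$. Under these assumptions the two responses should agree up to the factor $S^{-\Gamma}$, and thus be equal for $\Gamma = 0$. The test signal and all names are hypothetical.

```python
import numpy as np

def quasi_quadrature_1d(f, s, C=8.0 / 11.0, Gamma=0.0, truncate=6.0):
    """Quasi quadrature sqrt((s L_x^2 + C s^2 L_xx^2) / s^Gamma) at scale s."""
    r = int(np.ceil(truncate * np.sqrt(s)))
    u = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-u**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
    Lx = np.convolve(f, -u / s * g, mode="same")
    Lxx = np.convolve(f, (u**2 - s) / s**2 * g, mode="same")
    return np.sqrt((s * Lx**2 + C * s**2 * Lxx**2) / s**Gamma)

def signal(x):
    """Smooth test signal composed of two Gaussian bumps."""
    return np.exp(-(x - 10.0)**2 / 100.0) + 0.5 * np.exp(-(x + 30.0)**2 / 40.0)

S = 2                                   # spatial scaling factor
x = np.arange(-400, 401, dtype=float)
f = signal(x)                           # original signal
f_scaled = signal(x / S)                # rescaled signal f'(x') = f(x'/S)

s, Gamma = 16.0, 0.25
Q = quasi_quadrature_1d(f, s, Gamma=Gamma)
Q_scaled = quasi_quadrature_1d(f_scaled, S**2 * s, Gamma=Gamma)

idx = np.arange(-100, 101)              # positions x; matching x' = S x
lhs = Q_scaled[S * idx + 400]
rhs = S**(-Gamma) * Q[idx + 400]
max_rel_err = np.max(np.abs(lhs - rhs)) / np.max(rhs)
print("maximum relative deviation:", max_rel_err)
```

The remaining deviation reflects only the spatial discretization and the truncation of the kernels.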

### Rotation covariance.

Under a rotation of image space by an angle , for , the corresponding feature hierarchies are in turn equal if the orientation angles are related according to ()

(21) |

## 5 Application to texture analysis

In the following, we will use a substantially reduced version of the proposed quasi quadrature network for building an application to texture analysis.

If we make the assumption that a spatial texture should obey certain stationarity properties over image space, we may regard it as reasonable to construct texture descriptors by accumulating statistics of feature responses over the image domain, in terms of e.g. mean values or histograms. Inspired by the way the SURF descriptor [27] accumulates mean values and mean absolute values of derivative responses and the way Bruna and Mallat [24] and Hadji and Wildes [26] compute mean values of their hierarchical feature representations, we will initially explore reducing the QuasiQuadNet to just the mean values over the image domain of the following 5 features

(22) |

These types of features are computed for all layers in the feature hierarchy (with ), which leads to a 4000-D descriptor based on uniformly distributed orientations in , 4 layers in the hierarchy delimited in complexity by directional pooling for with 4 initial scale levels .
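The descriptor dimension can be accounted for with a short calculation. The sketch below assumes that the number of orientation-indexed feature maps grows geometrically with the layer index until orientation pooling caps it, and that 5 mean values are stored per feature map and initial scale level. The values of 8 orientations and pooling from level $K = 3$ are assumptions, chosen here because they reproduce the stated 4000-D figure under this counting; they are not confirmed by the text.

```python
def features_per_layer(n_orient, n_layers, K):
    """Feature maps per layer: n_orient^k below the pooling level K,
    capped at n_orient^(K-1) from level K onwards (assumed counting)."""
    return [n_orient ** min(k, K - 1) for k in range(1, n_layers + 1)]

def descriptor_dimension(n_orient, n_layers, K, n_scales, n_stats=5):
    """Total number of mean values over all layers and initial scale levels."""
    return n_stats * n_scales * sum(features_per_layer(n_orient, n_layers, K))

print(features_per_layer(8, 4, 3))
print(descriptor_dimension(n_orient=8, n_layers=4, K=3, n_scales=4))
```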

| Method | KTH-TIPS2b | CUReT | UMD |
|---|---|---|---|
| FV-VGGVD [28] (SVM) | 88.2 | 99.0 | 99.9 |
| FV-VGGM [28] (SVM) | 79.9 | 98.7 | 99.9 |
| MRELBP [29] (SVM) | 77.9 | 99.0 | 99.4 |
| FV-AlexNet [28] (SVM) | 77.9 | 98.4 | 99.7 |
| mean-reduced QuasiQuadNet LUV (SVM) | 78.3 | 98.6 | |
| mean-reduced QuasiQuadNet grey (SVM) | 75.3 | 98.3 | 97.1 |
| ScatNet [24] (PCA) | 68.9 | 99.7 | 98.4 |
| MRELBP [29] | 69.0 | 97.1 | 98.7 |
| BRINT [30] | 66.7 | 97.0 | 97.4 |
| MDLBP [31] | 66.5 | 96.9 | 97.3 |
| mean-reduced QuasiQuadNet LUV (NNC) | 72.1 | 94.9 | |
| mean-reduced QuasiQuadNet grey (NNC) | 70.2 | 93.0 | 93.3 |
| LBP [32] | 62.7 | 97.0 | 96.2 |
| ScatNet [24] (NNC) | 63.7 | 95.5 | 93.4 |
| PCANet [33] (NNC) | 59.4 | 92.0 | 90.5 |
| RandNet [33] (NNC) | 56.9 | 90.9 | 90.9 |

The second column in Table 1 shows the result of applying this approach to the KTH-TIPS2b dataset [35] for texture classification, consisting of 11 classes (“aluminum foil”, “cork”, “wool”, “lettuce leaf”, “corduroy”, “linen”, “cotton”, “brown bread”, “white bread”, “wood” and “cracker”) with 4 physical samples from each class and photos of each sample taken from 9 distances, leading to 9 relative scales labelled “2”, …, “10” over a factor of 4 in scaling transformations, and additionally 12 different pose and illumination conditions for each scale, leading to a total number of $11 \times 4 \times 9 \times 12 = 4752$ images. The regular benchmark setup implies that the images from 3 samples in each class are used for training and the remaining sample in each class is used for testing over 4 permutations. Since several of the samples from the same class are quite different from each other in appearance, this implies a non-trivial benchmark which has not yet been saturated.
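The structure of this benchmark can be made explicit with a few lines of code. The sketch below enumerates image indices only (no image data) and builds the sample-based train/test partitions, as well as a split over alternating scales. It makes the simplifying assumption that the same sample index is held out in every class within one permutation; all names are hypothetical.

```python
from itertools import product

N_CLASSES, N_SAMPLES, N_CONDITIONS = 11, 4, 12
SCALES = list(range(2, 11))             # relative scales labelled 2, ..., 10

# One tuple (class, sample, scale, condition) per image
images = list(product(range(N_CLASSES), range(N_SAMPLES),
                      SCALES, range(N_CONDITIONS)))

def sample_splits():
    """Yield (train, test): three samples per class for training and the
    remaining sample for testing, over the 4 possible held-out samples."""
    for held_out in range(N_SAMPLES):
        train = [im for im in images if im[1] != held_out]
        test = [im for im in images if im[1] == held_out]
        yield train, test

def scale_split():
    """Train on the even-labelled scales, test on the odd-labelled ones."""
    train = [im for im in images if im[2] % 2 == 0]
    test = [im for im in images if im[2] % 2 == 1]
    return train, test

print(len(images))
print([(len(tr), len(te)) for tr, te in sample_splits()])
print(tuple(len(part) for part in scale_split()))
```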

When using nearest-neighbour classification on the mean-reduced grey-level descriptor, we get 70.2 % accuracy, and 72.1 % accuracy when computing corresponding features from the LUV channels of a colour-opponent representation. When using SVM classification, the accuracy becomes 75.3 % and 78.3 %, respectively. Comparing with the results of an extensive set of other methods in Liu et al. [34], out of which a selection of the better results are listed in Table 1, the results of the mean-reduced QuasiQuadNet are better than classical texture classification methods such as local binary patterns (LBP) [32], binary rotation invariant and noise tolerant texture descriptors (BRINT) [30] and multi-dimensional local binary patterns (MDLBP) [31], and also better than other hand-crafted networks, such as ScatNet [24], PCANet [33] and RandNet [33]. The performance of the mean-reduced QuasiQuadNet descriptor does, however, not reach the performance of applying SVM classification to Fisher vectors of the filter output in learned convolutional networks (FV-VGGVD, FV-VGGM [28]).

By instead performing the training on every second scale in the dataset (scales 2, 4, 6, 8, 10) and the testing on the other scales (3, 5, 7, 9), such that the benchmark does not primarily test the generalization properties between the different very few samples in each class, the classification performance is 98.8 % for the grey-level descriptor and 99.6 % for the LUV descriptor.

The third and fourth columns in Table 1 show corresponding results of texture classification on the CUReT [36] and UMD [37] texture datasets, with random equally sized partitionings of the images into training and testing data. Also for these datasets, the performance of the mean-reduced descriptor is reasonable compared to other methods.

## 6 Summary and discussion

We have presented a theory for defining hand-crafted hierarchical networks by applying quasi quadrature responses of first- and second-order directional Gaussian derivatives in cascade. The purpose behind this study has been to investigate if we could start building a bridge between the well-founded theory of scale-space representation and the recent empirical developments in deep learning, while at the same time being inspired by biological vision. The present work is intended as an initial work in this direction, where we propose the family of quasi quadrature networks as a new baseline for hand-crafted networks with associated provable covariance properties under scaling and rotation transformations.

By early experiments with a substantially mean-reduced representation of the resulting QuasiQuadNet, we have demonstrated that it is possible to get quite promising performance on texture classification, comparable to or better than that of other hand-crafted networks, although not reaching the performance of learned CNNs. By inspection of the full non-reduced feature maps, which could not be shown here because of the space limitations, we have also observed that some representations in higher layers may respond to irregularities in regular textures (defect detection) or to corners or end-stoppings in regular scenes.

Concerning extensions of the approach, we propose to: (i) complement the computation of quasi quadrature responses by divisive normalization [38] to enforce a competition between multiple feature responses, (ii) explore the spatial relationships in the full feature maps that are suppressed in the mean-reduced representation and (iii) incorporate learning mechanisms.

## References

- [1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012) 1097–1105
- [2] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015) arXiv:1409.1556.
- [3] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. Computer Vision and Pattern Recognition (CVPR 2016). (2016) 770–778
- [4] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS. (2015) 2017–2025
- [5] Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. In: ECCV. Volume 9908 of Springer LNCS. (2016) 354–370
- [6] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. (2017)
- [7] Koenderink, J.J., van Doorn, A.J.: Generic neighborhood operators. IEEE-TPAMI 14 (1992) 597–605
- [8] Lindeberg, T.: Generalized Gaussian scale-space axiomatics comprising linear scale-space, affine scale-space and spatio-temporal scale-space. J. Math. Im. Vis. 40 (2011) 36–81
- [9] Lindeberg, T.: A computational theory of visual receptive fields. Biol. Cyb. 107 (2013) 589–635
- [10] Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cyb. 36 (1980) 193–202
- [11] Hubel, D.H., Wiesel, T.N.: Brain and Visual Perception. Oxford Univ. Pr. (2005)
- [12] Lindeberg, T.: Feature detection with automatic scale selection. Int. J. Comp. Vis. 30 (1998) 77–116
- [13] Lindeberg, T.: Dense scale selection over space, time and space-time. SIAM Journal on Imaging Sciences 11 (2018) 407–441
- [14] Lindeberg, T.: Scale-Space Theory in Computer Vision. Springer (1993)
- [15] Johnson, E.N., Hawken, M.J., Shapley, R.: The orientation selectivity of color-responsive neurons in Macaque V1. J. Neurosci. 28 (2008) 8096–8106
- [16] Touryan, J., Felsen, G., Dan, Y.: Spatial structure of complex cell receptive fields measured with natural images. Neuron 45 (2005) 781–791
- [17] Adelson, E., Bergen, J.: Spatiotemporal energy models for the perception of motion. JOSA A 2 (1985) 284–299
- [18] Heeger, D.J.: Normalization of cell responses in cat striate cortex. Visual Neuroscience 9 (1992) 181–197
- [19] Koenderink, J.J., van Doorn, A.J.: Receptive field families. Biol. Cyb. 63 (1990) 291–298
- [20] De Valois, R.L., Cottaris, N.P., Mahon, L.E., Elfer, S.D., Wilson, J.A.: Spatial and temporal receptive fields of geniculate and cortical cells and directional selectivity. Vis. Res. 40 (2000) 3685–3702
- [21] Westö, J., May, P.J.C.: Describing complex cells in primary visual cortex: A comparison of context and multi-filter LN models. J. Neurophys. 120 (2018) 703–719
- [22] Goris, R.L.T., Simoncelli, E.P., Movshon, J.A.: Origin and function of tuning diversity in Macaque visual cortex. Neuron 88 (2015) 819–831
- [23] Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE-TPAMI 29 (2007) 411–426
- [24] Bruna, J., Mallat, S.: Invariant scattering convolution networks. IEEE-TPAMI 35 (2013) 1872–1886
- [25] Yamins, D.L.K., DiCarlo, J.J.: Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience 19 (2016) 356–365
- [26] Hadji, I., Wildes, R.P.: A spatiotemporal oriented energy network for dynamic texture recognition. In: ICCV. (2017) 3066–3074
- [27] Bay, H., Ess, A., Tuytelaars, T., van Gool, L.: Speeded up robust features (SURF). CVIU 110 (2008) 346–359
- [28] Cimpoi, M., Maji, S., Vedaldi, A.: Deep filter banks for texture recognition and segmentation. In: CVPR. (2015) 3828–3836
- [29] Liu, L., Lao, S., Fieguth, P.W., Guo, Y., Wang, X., Pietikäinen, M.: Median robust extended local binary pattern for texture classification. IEEE-TIP 25 (2016) 1368–1381
- [30] Liu, L., Long, Y., Fieguth, P.W., Lao, S., Zhao, G.: BRINT: Binary rotation invariant and noise tolerant texture classification. IEEE-TIP 23 (2014) 3071–3084
- [31] Schaefer, G., Doshi, N.P.: Multi-dimensional local binary pattern descriptors for improved texture analysis. In: ICPR. (2012) 2500–2503
- [32] Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE-TPAMI 24 (2002) 971–987
- [33] Chan, T.H., Jia, K., Gao, S., Lu, J., Zeng, Z., Ma, Y.: PCANet: A simple deep learning baseline for image classification? IEEE-TIP 24 (2015) 5017–5032
- [34] Liu, L., Fieguth, P., Guo, Y., Wang, Z., Pietikäinen, M.: Local binary features for texture classification: Taxonomy and experimental study. Pattern Recognition 62 (2017) 135–160
- [35] Mallikarjuna, P., Targhi, A.T., Fritz, M., Hayman, E., Caputo, B., Eklundh, J.O.: The KTH-TIPS2 database. KTH Royal Institute of Technology (2006)
- [36] Varma, M., Zisserman, A.: A statistical approach to material classification using image patch exemplars. IEEE-TPAMI 31 (2009) 2032–2047
- [37] Xu, Y., Yang, X., Ling, H., Ji, H.: A new texture descriptor using multifractal analysis in multi-orientation wavelet pyramid. In: CVPR. (2010) 161–168
- [38] Carandini, M., Heeger, D.J.: Normalization as a canonical neural computation. Nature Reviews Neuroscience 13 (2012) 51–62