# Convolutional neural networks with extra-classical receptive fields

###### Abstract

Convolutional neural networks (CNNs) have had great success in many real-world applications and have also been used to model visual processing in the brain. However, these networks are quite brittle – small changes in the input image can dramatically change a network’s output prediction. In contrast to what is known from biology, these networks largely rely on feedforward connections, ignoring the influence of recurrent connections. They also focus on supervised rather than unsupervised learning. To address these issues, we combine traditional supervised learning via backpropagation with a specialized unsupervised learning rule to learn lateral connections between neurons within a convolutional neural network. These connections have been shown to optimally integrate information from the surround, generating extra-classical receptive fields for the neurons in our new proposed model (CNNEx). Models with optimal lateral connections are more robust to noise and achieve better performance on noisy versions of the MNIST and CIFAR-10 datasets. Resistance to noise can be further improved by combining our model with additional regularization techniques such as dropout and weight decay. Although the image statistics of MNIST and CIFAR-10 differ greatly, the same unsupervised learning rule generalized to both datasets. Our results demonstrate the potential usefulness of combining supervised and unsupervised learning techniques and suggest that the integration of lateral connections into convolutional neural networks is an important area of future research.

Brian Hu, Allen Institute for Brain Science, Seattle, WA 98109, brianh@alleninstitute.org

Stefan Mihalas, Allen Institute for Brain Science, Seattle, WA 98109, stefanm@alleninstitute.org

Preprint. Work in progress.

## 1 Introduction

The visual response of a neuron is traditionally characterized by its classical receptive field (RF). However, such a picture of response tuning is incomplete, as neurons can also integrate contextual information from other sources such as their surround. Contextual modulation refers to the ability of visual stimuli far outside the classical RF of a neuron to modulate the activity of the neuron. Examples of contextual modulation include surround suppression [1, 2], contour integration [3, 4], and figure-ground segmentation [5, 6]. These phenomena cannot be simply explained by feedforward mechanisms, but instead suggest an influence from extra-classical RFs. Contextual modulation is thought to be mediated in part by lateral connections between neurons in the same visual area [7]. Experimental and computational modeling studies suggest that both excitatory and inhibitory cell types play an important role in this process [8, 9, 10].

The field of deep learning has traditionally focused on feedforward models of visual processing, and these models have been used to describe neural activity in the ventral stream of humans and other primates [11, 12, 13, 14]. Deep learning is a form of supervised learning that relies on backpropagation of a global error signal, but whether the brain actually uses backpropagation is controversial given the requirement for large amounts of labeled data and non-local updates to synaptic weights (for recent reviews on the connection between deep learning and neuroscience, see [15, 16]). While convolutional neural networks have resulted in many practical successes [17], they can be highly susceptible to adversarial examples. In one extreme case, changing a single pixel of the input image can change the output prediction of the network with high confidence [18]. In contrast, visual processing in the brain makes use of recurrent connections, including top-down and lateral connections, which may provide some level of immunity to these adversarial attacks (for recent results on human adversarial examples, see [19]). More recently, convolutional neural networks that include recurrent connections have also been proposed [20].

However, most of these models still largely rely on supervised learning. The brain is able to build rich internal representations of information with little to no labeled data, which is a form of unsupervised learning. Recent work proposed Bayes optimal context integration as a canonical cortical computation, showing that optimal lateral connections can be learned using a modified Hebbian learning rule [22]. We extend this work by incorporating these types of lateral connections, learned in an unsupervised manner, into convolutional neural networks, which are trained in a supervised manner. We first train convolutional neural networks using standard backpropagation techniques. After training, we learn the optimal lateral connections between neurons within a layer in an unsupervised manner. We then test our models on two standard computer vision datasets, MNIST [23] and CIFAR-10 [21]. When applying different noise perturbations to the input images, the optimal lateral connections improve the overall performance and robustness of these networks. Our results suggest that incorporating lateral connections within convolutional neural networks is an important area of future research.

[Figure: example images from (A) MNIST and (B) CIFAR-10. Rows: original, AWGN, and SPN.]

## 2 Methods

### 2.1 Image datasets: MNIST and CIFAR-10

We trained and evaluated our models on two image datasets: MNIST [23] and CIFAR-10 [21]. MNIST contains grayscale images (28x28 pixels) of handwritten digits (10 classes, for the digits 0-9). MNIST contains a total of 70K images, split into a training set (60K images) and a test set (10K images). We used 10% of the training data (6K images) for validation. CIFAR-10 contains color images (32x32 pixels) of objects from ten different classes (e.g. car, ship). CIFAR-10 contains a total of 60K images, again split into a training set (50K images) and a test set (10K images). We again used 10% of the training data (5K images) for validation.

To test the generalization of our models under noise perturbations, we added two types of noise to the original images: additive white Gaussian noise (AWGN) and salt-and-pepper noise (SPN). The mean of the AWGN was set to zero and the standard deviation was increased from 0.1 to 0.5 in steps of 0.1. For the SPN, the fraction of noisy pixels was likewise increased from 0.1 to 0.5 in steps of 0.1. The addition of noise can be viewed as a random, non-targeted adversarial attack, which changes the input image in such a way that it may be classified incorrectly. The degree of misclassification depends on the noise level. Example stimuli from each dataset (original and noisy images) are shown in Figure 2.
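The two perturbations can be sketched as follows. This is a minimal NumPy illustration, not the code used for the experiments: the function names, the assumption that pixel intensities lie in [0, 1], the clipping after AWGN, the equal split between salt and pepper, and corrupting each array element independently (rather than whole color pixels) are all assumptions made here.

```python
import numpy as np

def add_awgn(img, sigma, rng):
    """Add zero-mean white Gaussian noise with standard deviation sigma."""
    noisy = img + rng.normal(loc=0.0, scale=sigma, size=img.shape)
    # Assumption: intensities live in [0, 1], so clip back to that range.
    return np.clip(noisy, 0.0, 1.0)

def add_spn(img, fraction, rng):
    """Salt-and-pepper noise: set roughly `fraction` of the entries to 0 or 1."""
    noisy = img.copy()
    corrupt = rng.random(img.shape) < fraction   # entries to overwrite
    salt = rng.random(img.shape) < 0.5           # assumption: salt and pepper equally likely
    noisy[corrupt & salt] = 1.0
    noisy[corrupt & ~salt] = 0.0
    return noisy

rng = np.random.default_rng(0)
img = np.full((28, 28), 0.5)          # stand-in for an MNIST-sized image
awgn_img = add_awgn(img, 0.3, rng)    # AWGN with standard deviation 0.3
spn_img = add_spn(img, 0.2, rng)      # SPN with 20% noisy pixels
```

Sweeping `sigma` or `fraction` over the tested noise levels would produce the family of noisy test sets described above.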

### 2.2 Network architecture and training

We used a simple network architecture to study the influence of optimal lateral connections in convolutional neural networks. The baseline network consisted of two convolutional (conv) layers with the ReLU nonlinearity, each followed by a max-pooling (maxpool) layer with a 2x2 pooling window, which effectively downsamples the input by a factor of 2. Following the two convolutional layers are two fully connected (FC) layers, with the final output passed through a soft-max nonlinearity for the 10 classes in each dataset. We used the same baseline model architecture for both MNIST and CIFAR-10 (Table 1). We also used the same set of hyperparameters for training both models, namely stochastic gradient descent with a learning rate of 0.01 and a momentum value of 0.5. We used a minibatch size of 64 and trained our models for a total of 10 epochs. We trained 10 different instantiations of each model using different random seeds to ensure the robustness of our results. All experiments were performed using PyTorch (0.3.1) on an NVIDIA GTX 1080 Ti GPU.

Table 1: Network architectures of the baseline model (CNN) and the model with lateral connections (CNNEx).

| Model | Network architecture |
|---|---|
| CNN | conv5-10, maxpool, conv5-20, maxpool, FC-50, FC-10, soft-max |
| CNNEx | conv5-10, conv7-10, maxpool, conv5-20, conv3-20, maxpool, FC-50, FC-10, soft-max |
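As a quick check on the architecture, the spatial sizes of the feature maps can be traced layer by layer. The sketch below reads `conv5-10` as a 5x5 convolution with 10 output channels and assumes unpadded, stride-1 feedforward convolutions; neither the padding nor the stride is stated in the text, so the resulting sizes are illustrative only. The helper names are hypothetical.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(image_size):
    """Trace the baseline CNN and return the input width of FC-50."""
    s = conv_out(image_size, kernel=5)       # conv5-10, assumed unpadded
    s = conv_out(s, kernel=2, stride=2)      # 2x2 maxpool
    s = conv_out(s, kernel=5)                # conv5-20, assumed unpadded
    s = conv_out(s, kernel=2, stride=2)      # 2x2 maxpool
    return 20 * s * s                        # 20 channels after the second conv layer

mnist_features = flattened_features(28)   # 28 -> 24 -> 12 -> 8 -> 4
cifar_features = flattened_features(32)   # 32 -> 28 -> 14 -> 10 -> 5

# The lateral layers of CNNEx (conv7, conv3) leave the map size unchanged
# if they are zero-padded by (kernel - 1) / 2, which is assumed here.
lateral1_size = conv_out(24, kernel=7, padding=3)
lateral2_size = conv_out(8, kernel=3, padding=1)
```

Under these assumptions the first fully connected layer would receive 320 inputs for MNIST and 500 for CIFAR-10, and the lateral layers would preserve the 24x24 and 8x8 map sizes of the MNIST model.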

### 2.3 Optimal lateral connections

After the initial phase of supervised learning, we freeze the feedforward synaptic weights of the network. The classical receptive field response of a neuron representing feature $k$ in layer $l$ at image location $n$, given image $x$, can be represented by the activation of a standard artificial neuron model:

$$f_k^{n,l}(x) = g\Big(b_k^{l} + \sum_{j,m} w_{jk}^{mn,l}\, f_j^{m,l-1}(x)\Big) \tag{1}$$

where $g$ represents a nonlinear activation function, $b_k^{l}$ represents a bias term, $j$ represents features, $m$ represents image locations, and $w_{jk}^{mn,l}$ are the feedforward synaptic weights from layer $l-1$ to layer $l$.

We then apply optimal lateral connections within the first two convolutional layers of the network. Here, we drop the layer superscript, since the proposed lateral connections are intracortical and occur within the same layer. The lateral connections are between neurons with the same feature (within-channel) and neurons with different features (between-channel) over a fixed spatial extent. The activity of a neuron representing feature $k$ at image location $n$, given image $x$, can then be written as:

$$\hat{f}_k^{n}(x) = f_k^{n}(x)\Big(1 + \alpha \sum_{j,m} W_{jk}^{mn}\, f_j^{m}(x)\Big) \tag{2}$$

where $\hat{f}_k^{n}$ represents the full response of the neuron with contributions from extra-classical receptive fields, $f_k^{n}$ represents the classical receptive field response of the neuron, $\alpha$ represents a hyperparameter that tunes the strength of the lateral connections, and $W_{jk}^{mn}$ are the synaptic weights from surrounding neurons (feature $j$ at location $m$). The lateral connections have a modulatory effect on the feedforward response, and setting $\alpha = 0$ is equivalent to a model with no lateral connections.
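To make the modulatory role of the lateral connections concrete, the sketch below applies them to a stack of feature maps. It assumes a multiplicative form in which each feedforward response is scaled by one plus the strength hyperparameter times the weighted sum of surrounding responses; this form, the function name `lateral_modulation`, the weight layout `W[k, j, dy, dx]`, and zero padding at the image border are assumptions of this illustration rather than details confirmed by the text.

```python
import numpy as np

def lateral_modulation(f, W, alpha):
    """Scale feedforward responses by a weighted sum of their surround.

    f:     (C, H, Wd) classical receptive field responses of one layer
    W:     (C, C, K, K) lateral weights, W[k, j, dy, dx] from feature j to feature k
    alpha: strength of the lateral connections (alpha = 0 disables them)
    """
    C, H, Wd = f.shape
    K = W.shape[-1]
    r = K // 2
    # Assumption: responses outside the feature map are treated as zero.
    padded = np.pad(f, ((0, 0), (r, r), (r, r)))
    surround = np.zeros(f.shape, dtype=float)
    for dy in range(K):
        for dx in range(K):
            shifted = padded[:, dy:dy + H, dx:dx + Wd]   # neighbours at offset (dy - r, dx - r)
            # Sum over source features j for this spatial offset.
            surround += np.einsum("kj,jhw->khw", W[:, :, dy, dx], shifted)
    return f * (1.0 + alpha * surround)

# Tiny example: one feature, 3x3 map, uniform weights without a self-connection.
f = np.ones((1, 3, 3))
W = np.ones((1, 1, 3, 3))
W[0, 0, 1, 1] = 0.0
modulated = lateral_modulation(f, W, alpha=1.0)
```

In this toy case the centre unit has eight active neighbours and a corner unit has three, so their responses are scaled by 9 and 4 respectively; with `alpha=0.0` the input is returned unchanged.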

The synaptic weights are learned in an unsupervised manner using the following rule:

$$W_{jk}^{mn} = \frac{\langle f_j^{m}(x)\, f_k^{n}(x)\rangle_x}{\langle f_j^{m}(x)\rangle_x \, \langle f_k^{n}(x)\rangle_x} - 1 \tag{3}$$

where $W_{jk}^{mn}$ is the synaptic weight between each pair of features $j$ and $k$ located at $m$ and $n$, $f_j^{m}(x)$ is the feedforward response of feature $j$ at location $m$, and $\langle \cdot \rangle_x$ denotes an average over $x$, which spans a set of images. We used the same set of training images originally shown to the network during supervised training to learn the optimal lateral connections. However, for this phase of learning, we do not need the image labels, as our method is completely unsupervised. It is important to note that this formula differs from a Hebbian learning rule, in that only the covariance between the feedforward responses of neurons leads to changes in the lateral connections. A more detailed derivation of the above equations can be found in [22].
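A possible way to estimate such weights from recorded activations is sketched below. It assumes the rule takes the form of a normalized co-activation, the mean product of two feedforward responses divided by the product of their means, minus one, so that statistically independent responses receive a weight near zero. Sharing weights across image locations (as in a convolution), using only positions where the full neighbourhood fits inside the map, the small `eps` guard, and the name `learn_lateral_weights` are further assumptions of this illustration.

```python
import numpy as np

def learn_lateral_weights(F, K, eps=1e-8):
    """Estimate lateral weights W[k, j, dy, dx] from feedforward responses.

    F: (N, C, H, Wd) responses of one layer to N unlabeled images
    K: spatial extent of the lateral connections (odd)
    """
    N, C, H, Wd = F.shape
    r = K // 2
    center = F[:, :, r:H - r, r:Wd - r]            # target neurons (feature k)
    mean_center = center.mean(axis=(0, 2, 3))      # average response of each target feature
    n_samples = N * (H - 2 * r) * (Wd - 2 * r)
    W = np.zeros((C, C, K, K))
    for dy in range(K):
        for dx in range(K):
            shifted = F[:, :, dy:dy + H - 2 * r, dx:dx + Wd - 2 * r]   # surround neurons (feature j)
            mean_shifted = shifted.mean(axis=(0, 2, 3))
            joint = np.einsum("nkhw,njhw->kj", center, shifted) / n_samples
            W[:, :, dy, dx] = joint / (np.outer(mean_center, mean_shifted) + eps) - 1.0
    idx = np.arange(C)
    W[idx, idx, r, r] = 0.0                        # no self-connections
    return W

# Two identical binary feature maps: strongly positive weight at zero offset,
# weights near zero at other offsets where the responses are independent.
rng = np.random.default_rng(0)
a = (rng.random((50, 1, 9, 9)) < 0.5).astype(float)
W_demo = learn_lateral_weights(np.concatenate([a, a], axis=1), K=3)
```

No labels enter this computation; only the activations of the frozen feedforward network are needed.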

### 2.4 Network regularization

To understand how other commonly used regularization techniques could impact model performance on noisy images, we also tested the use of weight decay ($L_2$ regularization) and dropout. For our experiments, we chose a weight decay value of 0.005 and a dropout fraction of 0.5. Weight decay acted on all non-bias parameters of the model, while dropout was applied after each convolutional layer in the baseline model, as well as after the first fully connected layer. We also tested the combination of these regularization techniques with optimal lateral connections.
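For readers less familiar with these two regularizers, the sketch below shows one common way they are implemented, using the hyperparameter values quoted above as defaults. The update order (weight decay added to the gradient before the momentum step) and the inverted-dropout rescaling are widely used conventions assumed here, not details taken from the text, and the function names are hypothetical.

```python
import numpy as np

def sgd_step(param, grad, velocity, lr=0.01, momentum=0.5,
             weight_decay=0.005, is_bias=False):
    """One SGD-with-momentum update; L2 weight decay skips bias terms."""
    if not is_bias:
        # Gradient of (weight_decay / 2) * param**2 added to the loss gradient.
        grad = grad + weight_decay * param
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero a fraction p of units and rescale the survivors."""
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

# With a zero loss gradient, weight decay alone shrinks a weight slightly,
# while a bias is left untouched.
w, v = sgd_step(np.array(1.0), np.array(0.0), np.array(0.0))
b, _ = sgd_step(np.array(1.0), np.array(0.0), np.array(0.0), is_bias=True)

rng = np.random.default_rng(0)
dropped = dropout(np.ones(1000), 0.5, rng)
```

The rescaling by `1 / (1 - p)` keeps the expected activation unchanged, so no correction is needed at test time.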

### 2.5 Validation and testing

Optimal lateral connections had a spatial extent of 7x7 pixels in the first convolutional layer and 3x3 pixels in the second convolutional layer. We did not include any self-connections, so these were all set to zero. We chose the optimal lateral connection hyperparameters for each of the two convolutional layers based on a coarse grid search over the parameter range using the validation dataset. We did not use lateral connections for the two fully-connected layers. We report final accuracies of each model on the original dataset and for all levels of the two different types of noise perturbations. All final results are averages over each of the 10 pre-trained models with different random seeds.

[Figure: panels (A) MNIST and (B) CIFAR-10.]

## 3 Results

### 3.1 Learned optimal lateral connections

We show example learned optimal lateral connections between different filters in the first convolutional layer for both MNIST and CIFAR-10 (Figure 3). For both datasets, we find optimal lateral connections containing both excitatory and inhibitory synaptic weights, which are balanced on average. In particular, we find excitatory weights between cells with similar tuning properties and inhibitory weights between cells with different selectivities, which is consistent with predictions from Hebbian plasticity. Interestingly, although the image statistics between MNIST and CIFAR-10 are vastly different, some of the learned optimal connections are qualitatively similar, emphasizing properties such as contour integration which may be beneficial in the context of noise or occlusion.

### 3.2 Accuracy on deep learning models

Table 2: Test accuracy (%) on the original images and at increasing levels of AWGN (standard deviation) and SPN (fraction of noisy pixels). wd: weight decay; d: dropout.

| Dataset | Model | Original | AWGN 0.1 | AWGN 0.2 | AWGN 0.3 | AWGN 0.4 | AWGN 0.5 | SPN 0.1 | SPN 0.2 | SPN 0.3 | SPN 0.4 | SPN 0.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | CNN | 98.77 | 98.68 | 98.31 | 96.87 | 91.58 | 80.96 | 97.40 | 92.00 | 80.04 | 64.00 | 46.94 |
| | CNNEx | 97.13 | 97.10 | 96.81 | 95.96 | 93.73 | 88.95 | 96.09 | 93.92 | 88.61 | 78.55 | 63.30 |
| | CNN (wd) | 98.57 | 98.48 | 98.18 | 97.37 | 94.48 | 87.21 | 97.70 | 95.08 | 87.25 | 73.57 | 56.25 |
| | CNNEx (wd) | 97.53 | 97.44 | 97.17 | 96.50 | 94.93 | 91.55 | 96.75 | 95.10 | 91.49 | 84.01 | 70.58 |
| | CNN (d) | 97.44 | 97.34 | 96.96 | 96.25 | 94.47 | 90.48 | 96.24 | 94.00 | 88.86 | 79.86 | 66.08 |
| | CNNEx (d) | 96.63 | 96.61 | 96.36 | 95.80 | 94.74 | 92.65 | 95.92 | 94.45 | 91.86 | 86.57 | 76.64 |
| | CNN (wd+d) | 97.22 | 97.07 | 96.79 | 96.24 | 95.02 | 92.43 | 96.46 | 94.89 | 91.55 | 84.71 | 72.71 |
| | CNNEx (wd+d) | 96.97 | 96.88 | 96.65 | 96.12 | 95.14 | 93.37 | 96.32 | 95.02 | 92.88 | 88.25 | 78.93 |
| CIFAR-10 | CNN | 59.60 | 51.53 | 36.56 | 26.00 | 20.59 | 17.68 | 37.87 | 26.11 | 20.31 | 17.23 | 15.28 |
| | CNNEx | 58.04 | 52.20 | 38.97 | 28.34 | 22.25 | 18.86 | 40.17 | 28.38 | 21.96 | 18.18 | 15.94 |
| | CNN (wd) | 57.62 | 51.41 | 38.68 | 28.15 | 21.94 | 18.39 | 40.02 | 28.26 | 21.63 | 17.79 | 15.57 |
| | CNNEx (wd) | 56.57 | 51.89 | 40.85 | 30.61 | 23.95 | 19.95 | 42.20 | 30.83 | 23.67 | 19.16 | 16.48 |
| | CNN (d) | 43.22 | 42.12 | 38.95 | 34.22 | 29.50 | 25.64 | 39.30 | 34.30 | 29.29 | 24.70 | 20.84 |
| | CNNEx (d) | 43.68 | 42.51 | 39.20 | 34.36 | 29.65 | 25.76 | 39.70 | 34.57 | 29.55 | 24.78 | 20.85 |
| | CNN (wd+d) | 43.05 | 42.52 | 40.35 | 36.87 | 32.55 | 28.85 | 40.64 | 37.14 | 32.87 | 28.07 | 23.47 |
| | CNNEx (wd+d) | 43.12 | 42.54 | 40.38 | 36.87 | 32.53 | 28.83 | 40.67 | 37.15 | 32.86 | 28.05 | 23.46 |

We tested our trained models with and without optimal lateral connections on the original MNIST and CIFAR-10 test datasets, as well as on these datasets with the addition of noise. For the MNIST dataset, we find that both the baseline network and the network with optimal lateral connections achieve high accuracy on the original test images (97-99%). We also find that performance decreases gradually with increasing noise levels. In general, accuracy is lower for the SPN images than for the AWGN images, suggesting that SPN images may be more difficult for the baseline model to handle. Our results show that optimal lateral connections improve model performance at higher levels of AWGN (standard deviations above 0.3) and also at higher levels of SPN (fraction of changed pixels above 0.1). Our results also show that the combination of optimal lateral connections with additional regularization techniques such as dropout or weight decay often resulted in even better performance at high noise levels. For example, at the highest noise levels, models with optimal lateral connections and regularized by both weight decay and dropout achieved the highest accuracy (93.37% on AWGN images and 78.93% on SPN images). Interestingly, the gap in accuracy between models with and without optimal lateral connections narrowed with additional regularization (e.g. on the SPN images at the highest noise level, a difference of about 16 percentage points for models without regularization and about 6 percentage points for models with weight decay and dropout).
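The narrowing gap can be checked directly from the reported accuracies. The short calculation below copies the MNIST values at the highest SPN level (0.5) from Table 2 and computes the difference between CNNEx and CNN for each regularization setting; the dictionary layout and variable names are only illustrative.

```python
# MNIST accuracy (%) at SPN level 0.5, copied from Table 2.
spn_05 = {
    "none": {"CNN": 46.94, "CNNEx": 63.30},
    "wd":   {"CNN": 56.25, "CNNEx": 70.58},
    "d":    {"CNN": 66.08, "CNNEx": 76.64},
    "wd+d": {"CNN": 72.71, "CNNEx": 78.93},
}

# Gain from lateral connections, in percentage points, per regularization setting.
gaps = {reg: round(acc["CNNEx"] - acc["CNN"], 2) for reg, acc in spn_05.items()}
```

The gain shrinks from roughly 16 points without regularization to roughly 6 points with both weight decay and dropout, while remaining positive in every setting.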

For the CIFAR-10 dataset, the baseline model achieves an accuracy around 60%. We find a slight decrease in the performance of the model with optimal lateral connections compared to the baseline model on the original set of images (which we also saw on the MNIST dataset). Models with optimal lateral connections again outperform the baseline models for both types of noise and at different noise levels. However, the increase in performance with optimal lateral connections is relatively small (1-2%) on the CIFAR-10 dataset compared to the MNIST dataset. We also find that for the model which is regularized by both weight decay and dropout, optimal lateral connections do not provide much additional benefit on the CIFAR-10 dataset. This may be due to our models being underfit, as even our best performing model does not achieve close to state-of-the-art test accuracies on the CIFAR-10 dataset. Our results are summarized in Table 2.

## 4 Discussion

Our proposed model adds two novel contributions to traditional convolutional neural networks when considered together: 1) the incorporation of recurrent lateral connections modeling the influence of extra-classical receptive fields, and 2) the ability to learn these connections in a completely unsupervised manner.

The vast majority of deep neural networks are feedforward in nature, although recurrent connections have been added to convolutional neural networks [20, 24]. Recurrent connections have also been used to implement different visual attention mechanisms [25, 26]. However, these networks are still all trained in a supervised manner. Ladder networks are an exception: they have been proposed as a means to combine supervised and unsupervised learning in deep neural networks [27]. However, unlike our approach, ladder networks use noise injection to introduce an unsupervised cost function based on reconstruction of the internal activity of the network. Our model instead relies on a modified Hebbian learning rule which learns the optimal connections between features within each layer based solely on the activations of these neurons.

The proposed computation carried out by optimal lateral connections can be mapped to a collection of cell types found in the cortical microcircuit [9]. Connections between pyramidal cells often show like-to-like connectivity, and in our model, the strength of these connections is proportional to the correlation in the activations of these neurons. This mapping also suggests two forms of inhibition: local normalization of excitatory neuronal activity in a patch (corresponding to the classical receptive field) and inhibition arising from the surround (extra-classical receptive fields). We attribute these to parvalbumin-expressing and somatostatin-expressing interneurons, respectively.

Neurons are inherently noisy, and their responses can vary even to the same stimulus. These neurons are embedded in cortical circuits that must perform computations in the absence of information, such as under visual occlusion. Optimal lateral connections can provide additional robustness to these networks by allowing for integration of information from multiple sources. This type of computation is also potentially useful for applications in which artificial neurons are not simulated with high fidelity, e.g. in neuromorphic computing.

We chose a relatively simple network architecture as a proof-of-concept for our model. As such, we did not achieve state-of-the-art performance on either image dataset. This accuracy could be further improved by either fine-tuning models with optimal lateral connections or using deeper model architectures with more parameters. Future experiments will also have to test the scalability of learning optimal lateral connections on more complex network architectures and larger image datasets (e.g. ImageNet), and whether these connections provide any benefit against noise or other types of perturbations such as adversarial images.

### Acknowledgements

We wish to thank the Allen Institute founder, Paul G. Allen, for his vision, encouragement, and support. We also thank Ram Iyer for helpful discussions.

## References

- [1] Wyeth Bair, James R Cavanaugh, and J Anthony Movshon. Time course and time-distance relationships for surround suppression in macaque V1 neurons. Journal of Neuroscience, 23(20):7690–7701, 2003.
- [2] HE Jones, KL Grieve, W Wang, and AM Sillito. Surround suppression in primate V1. Journal of Neurophysiology, 86(4):2011–2028, 2001.
- [3] RF Hess, A Hayes, and DJ Field. Contour integration and cortical processing. Journal of Physiology-Paris, 97(2-3):105–119, 2003.
- [4] Wu Li, Valentin Piëch, and Charles D Gilbert. Contour saliency in primary visual cortex. Neuron, 50(6):951–962, 2006.
- [5] Hong Zhou, Howard S Friedman, and Rüdiger Von Der Heydt. Coding of border ownership in monkey visual cortex. Journal of Neuroscience, 20(17):6594–6611, 2000.
- [6] Victor A.F. Lamme. The neurophysiology of figure-ground segregation in primary visual cortex. Journal of Neuroscience, 15(2):1605–1615, 1995.
- [7] Alessandra Angelucci and Paul C Bressloff. Contribution of feedforward, lateral and feedback connections to the classical receptive field center and extra-classical receptive field surround of primate V1 neurons. Progress in Brain Research, 154:93–120, 2006.
- [8] Ho Ko, Sonja B Hofer, Bruno Pichler, Katherine A Buchanan, P Jesper Sjöström, and Thomas D Mrsic-Flogel. Functional specificity of local synaptic connections in neocortical networks. Nature, 473(7345):87, 2011.
- [9] Xiaolong Jiang, Shan Shen, Cathryn R Cadwell, Philipp Berens, Fabian Sinz, Alexander S Ecker, Saumil Patel, and Andreas S Tolias. Principles of connectivity among morphologically defined cell types in adult neocortex. Science, 350(6264):aac9462, 2015.
- [10] Wei-Chung Allen Lee, Vincent Bonin, Michael Reed, Brett J Graham, Greg Hood, Katie Glattfelder, and R Clay Reid. Anatomy and function of an excitatory network in the visual cortex. Nature, 532(7599):370, 2016.
- [11] Charles F Cadieu, Ha Hong, Daniel LK Yamins, Nicolas Pinto, Diego Ardila, Ethan A Solomon, Najib J Majaj, and James J DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12):e1003963, 2014.
- [12] Umut Güçlü and Marcel AJ van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.
- [13] Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356, 2016.
- [14] Panqu Wang and Garrison W Cottrell. Central and peripheral vision for scene recognition: a neurocomputational modeling exploration. Journal of vision, 17(4):9–9, 2017.
- [15] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience, 10:94, 2016.
- [16] Tim Christian Kietzmann, Patrick McClure, and Nikolaus Kriegeskorte. Deep neural networks in computational neuroscience. bioRxiv, page 133504, 2017.
- [17] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 2017.
- [18] Jiawei Su, Danilo Vasconcellos Vargas, and Sakurai Kouichi. One pixel attack for fooling deep neural networks. arXiv preprint arXiv:1710.08864, 2017.
- [19] Gamaleldin F Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that fool both human and computer vision. arXiv preprint arXiv:1802.08195, 2018.
- [20] Courtney J Spoerer, Patrick McClure, and Nikolaus Kriegeskorte. Recurrent convolutional neural networks: a better model of biological object recognition. Frontiers in psychology, 8:1551, 2017.
- [21] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- [22] Ramakrishnan Iyer and Stefan Mihalas. Cortical circuits implement optimal context integration. bioRxiv, 2017.
- [23] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- [24] Ming Liang and Xiaolin Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3367–3375, 2015.
- [25] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
- [26] Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1209, 2017.
- [27] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.