Recurrent Connectivity Aids Recognition of Partly Occluded Objects
Feedforward convolutional neural networks are the prevalent model of core object recognition. For challenging conditions, such as occlusion, neuroscientists believe that the recurrent connectivity in the visual cortex aids object recognition. In this work we investigate if and how artificial neural networks can also benefit from recurrent connectivity. For this we systematically compare architectures comprised of bottom-up (B), lateral (L) and top-down (T) connections. To evaluate performance, we introduce two novel stereoscopic occluded object datasets, which bridge the gap from classifying digits to recognizing 3D objects. The task consists of recognizing one target object occluded by multiple occluder objects. We find that recurrent models perform significantly better than their feedforward counterparts, which were matched in parametric complexity. We show that for challenging stimuli, the recurrent feedback is able to correctly revise the initial feedforward guess of the network. Overall, our results suggest that both artificial and biological neural networks can exploit recurrence for improved object recognition.
The primate visual system is able to correctly identify an object within 200 ms of its presentation (thorpe1996speed; potter1976shortterm). Given this rapidness, object recognition has been widely assumed to be a purely feedforward process (dicarlo2012visualobject). This idea has been corroborated by the recent success of feedforward convolutional neural networks in the realm of object recognition (krizhevsky2012imagenet; lecun2015deeplearning). However, both anatomical and physiological evidence point to the importance of recurrent connectivity within this process. The densities of feedforward and recurrent connections in the ventral visual pathway are comparable in magnitude (felleman1991distributed; sporns2004small), and electrophysiological experiments have shown that processing of object information evolves in time, beyond what would normally be attributed to a feedforward process (cichy2014resolving; brincat2006dynamic; rajaei2019beyond). In particular, (johnson2005recognition; tang2014spatiotemporal) observed that recognition of degraded or occluded objects produces delayed behavioural and neural responses, which are believed to be a result of competitive processing within lateral recurrent connections (adesnik2010lateral). Late phase IT response patterns have been shown to be reliably predicted only by shallow recurrent networks and very deep feedforward networks (kar2019evidence).
For partially occluded images, (smith2010nonstimulated; tang2018completion; tang2014spatiotemporal) suggest that recurrent top-down connections are able to reconstruct missing information. Whether object recognition in artificial neural networks can benefit from recurrent connections is less clear, however. Early investigations of this question used highly restricted datasets, where artificial inputs were partly faded out or masked (spoerer2017recurrent; oreilly2013recurrent). Under natural conditions, however, occlusion is highly dependent on viewing angle and primates perceive it stereoscopically with two eyes. Thus we here developed two novel occluded image datasets that capture the full range of disparity and perspective cues for both natural (handwritten digits) and computer rendered (full 3-D objects) stimuli.
We test and compare a range of recurrent convolutional neural networks with different connection properties. Assuming the naming scheme of (spoerer2017recurrent), we distinguish bottom-up (B), top-down (T) and lateral (L) connections. Bottom-up and top-down correspond to information processing from lower and higher regions, whereas lateral connections process information within the same region of the ventral visual hierarchy. To test whether recurrent networks outperform their feedforward counterparts in a natural occlusion scenario, the network architectures were tasked with classifying objects under different levels of occlusion. Our results show significant performance gains for recurrent architectures. Finally, we explore how recurrent connections shape the networks’ predictions by analyzing how the probability distribution over class labels evolves with time, providing evidence that feedback connections are effective in diminishing the effects of occluders.
To investigate the effects of occlusion in object recognition we present two novel stereoscopic image data sets.
Occluded Stereo Multi-MNIST
Occluded Stereo Multi-MNIST (OS-MNIST) is a stereoscopic occluded digit recognition dataset. We chose digit recognition because it is deemed a solved problem in computer vision, yet poses many of the challenges present in a realistic occlusion scenario. The use of MNIST digits additionally encourages the network to learn a representation that generalizes to different variants of a particular class (lecun1998gradient). Contrary to past studies (oreilly2013recurrent; smith2010nonstimulated; tang2014spatiotemporal; wyatte2012limits), occlusion is generated by overlaying the target digit with other digit instances in a pseudo-3D environment. Every image of OS-MNIST contains three MNIST digit instances. Occlusion is generated by overlaying digits on top of each other as shown in Fig. 1 A. The target object, i.e. the hindmost digit, is centered in the middle of the square canvas. Additional digits are then sequentially placed on top of the target object. These occluding objects remain fixed along the y-axis as if standing on a surface 5 cm below the viewer. The x-coordinate is drawn from a uniform distribution. The size of the digits was scaled to give the impression of objects with 20 cm height placed at different depths. We assumed a distance of 50 cm from the target object to the viewer, and 10 cm less for every added object. Images for the left and right eye were taken given an interocular distance of 6.8 cm. For each MNIST digit we generated 10 random occluder combinations of the remaining classes, resulting in a total of images for training and images for testing. All stimuli were rendered at pixels. The generative source code for OS-MNIST will be made available.
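The depth layout described above can be translated into approximate image quantities. The following sketch is illustrative only and is not the dataset generator: it assumes a simple pinhole model in which apparent size scales with inverse distance, assumes that the viewer fixates the target, and places every object on the midline (in the dataset the occluders' x-coordinate is random). All function names are hypothetical; the numeric constants are the ones stated in the text.

```python
import math

INTEROCULAR_CM = 6.8        # interocular distance (from the text)
TARGET_DISTANCE_CM = 50.0   # viewer-to-target distance (from the text)
DEPTH_STEP_CM = 10.0        # each added object is 10 cm closer (from the text)


def object_distance(index):
    """Distance of the index-th object; index 0 is the target (hindmost)."""
    return TARGET_DISTANCE_CM - index * DEPTH_STEP_CM


def relative_scale(distance_cm):
    """Image size relative to the target, assuming size ~ 1 / distance."""
    return TARGET_DISTANCE_CM / distance_cm


def vergence_angle(distance_cm):
    """Angle (radians) between both lines of sight to a midline point."""
    return 2.0 * math.atan(INTEROCULAR_CM / (2.0 * distance_cm))


def relative_disparity_deg(distance_cm):
    """Disparity relative to the fixated target, in degrees."""
    delta = vergence_angle(distance_cm) - vergence_angle(TARGET_DISTANCE_CM)
    return math.degrees(delta)


for i in range(3):
    d = object_distance(i)
    print(f"object {i}: distance {d:.0f} cm, "
          f"scale {relative_scale(d):.2f}, "
          f"disparity {relative_disparity_deg(d):.2f} deg")
```

Under these assumptions the target sits at zero disparity, while nearer occluders appear larger and carry increasingly crossed disparity.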
Occluded Stereo YCB-Objects
The Occluded Stereo YCB-Object image dataset (OS-YCB) constitutes the second novel dataset for stereoscopic occluded object recognition. OS-YCB contains stereo image stimuli of common household objects occluding each other. We chose 79 objects from an assortment of items suitable for robotics applications, i.e. the YCB-Object set (calli2015benchmarking; calli2015ycb). For each image, we placed three virtual 3D objects according to Fig. 1 and using a repurposed robot simulator as a stereoscopic camera (metta2010icub). The dataset is set to be publicly released alongside this publication.
In line with our other dataset, OS-MNIST, the objects are placed at a distance of 50 cm from the viewer on a plane 5 cm below line of sight, see Fig. 1 A. All objects are placed in upright position and turned by a random yaw angle, again encouraging the model to learn a representation that generalizes to different properties of a particular class. A background was chosen to simulate a context with natural image statistics. The occlusion percentage of each image is defined as the ratio of occluded to visible pixels averaged over the two stereo images.
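As an illustration of how such an occlusion percentage could be computed from binary masks, consider the sketch below. It rests on an assumption: the occluded pixel count is divided by the target's total pixel count, so that values fall between 0 and 100 %. The text does not spell out the denominator, and the function names and mask layout (nested lists of 0/1) are hypothetical.

```python
def occluded_fraction(target_mask, occluder_mask):
    """Fraction of target pixels hidden by occluders in a single view."""
    total = 0
    hidden = 0
    for target_row, occluder_row in zip(target_mask, occluder_mask):
        for is_target, is_occluder in zip(target_row, occluder_row):
            if is_target:
                total += 1
                if is_occluder:
                    hidden += 1
    if total == 0:
        raise ValueError("target mask contains no pixels")
    return hidden / total


def stereo_occlusion_percentage(left, right):
    """Average over both views; each argument is (target_mask, occluder_mask)."""
    mean_fraction = (occluded_fraction(*left) + occluded_fraction(*right)) / 2.0
    return 100.0 * mean_fraction


# Toy 2x2 example: one of four target pixels is hidden in the left view,
# two of four in the right view.
target = [[1, 1], [1, 1]]
left_occluder = [[1, 0], [0, 0]]
right_occluder = [[1, 1], [0, 0]]
percentage = stereo_occlusion_percentage((target, left_occluder),
                                         (target, right_occluder))
print(percentage)
```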
We generated images per object and occlusion percentage (20, 40, 60, 80 %) resulting in stereo images. The occluders were chosen in a way that no two instances of one class would appear in the same image. The images were rendered at , and pixels. For all experiments conducted we used a pixel central crop of the version. The data were randomly split into 80 % training and 20 % testing data.
2.2 Network Models
The three aforementioned connection types enable four basic network architectures as shown in Fig. 1 B: bottom-up connections only (B), bottom-up and top-down connections (BT), bottom-up and lateral connections (BL), and bottom-up, lateral, and top-down connections (BLT). As lateral and top-down connections introduce cycles into the computational graph, these models represent recurrent neural networks and allow for information to be retained within a layer or to flow back into previous layers.
Each of the models consists of an input layer, two hidden recurrent layers and an output layer. Bottom-up and lateral connections are implemented as convolutional layers (lecun1998gradient) with a stride of followed by a maxpooling operation with a stride of . Top-down connections are implemented as transposed convolutions (zeiler2010deconvolutional) with output stride to match the input size of the convolutional layer that came before it. Each of the recurrent network models is unrolled and trained for four time steps by backpropagation (rumelhart1986learning). When reporting accuracy, the output at the last unrolled time step available for the particular architecture is used. Recurrent network models naturally have more learnable parameters than their feedforward counterparts, due to their increased connectivity. To compensate for this, we introduce two additional feedforward models B-F and B-K as in (spoerer2017recurrent). B-F doubles the number of convolutional filters in each of the hidden layers from 32 to 64. B-K has larger convolutional kernels compared to of the standard B model. The larger kernel size effectively increases the number of connections that each unit has and makes B-K a more appropriate model for control. The additional feature maps in B-F on the other hand alter the representational power of the model. The number of learnable parameters for each of the architectures can be found in Tab. 1.
Table 1: Number of learnable parameters for each architecture, listed by hidden layer units and image channels, for OS-MNIST (10 classes) and OS-YCB (80 classes).
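To make the parameter-matching argument concrete, the sketch below counts convolutional weights for the different connection schemes. It is a rough illustration and does not reproduce Tab. 1: the kernel sizes (3 and 5) and the single input channel are hypothetical placeholders, the readout layer and batch-normalization parameters are ignored, and it is assumed that only bottom-up convolutions carry a bias. The function names are hypothetical as well; only the filter counts (32 and 64) are taken from the text.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Weights of one square convolution (or transposed convolution)."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def count_conv_params(kernel, image_channels, hidden_units,
                      lateral=False, top_down=False):
    """Convolutional parameters of a network with two hidden layers."""
    # Bottom-up connections: input -> layer 1 and layer 1 -> layer 2.
    total = conv_params(kernel, image_channels, hidden_units)
    total += conv_params(kernel, hidden_units, hidden_units)
    if lateral:
        # One lateral convolution within each of the two hidden layers.
        total += 2 * conv_params(kernel, hidden_units, hidden_units, bias=False)
    if top_down:
        # One transposed convolution from layer 2 back to layer 1.
        total += conv_params(kernel, hidden_units, hidden_units, bias=False)
    return total


# Hypothetical settings: kernel size 3 for the standard models, 5 for B-K.
models = {
    "B": count_conv_params(3, 1, 32),
    "B-F": count_conv_params(3, 1, 64),
    "B-K": count_conv_params(5, 1, 32),
    "BT": count_conv_params(3, 1, 32, top_down=True),
    "BL": count_conv_params(3, 1, 32, lateral=True),
    "BLT": count_conv_params(3, 1, 32, lateral=True, top_down=True),
}
for name, n in models.items():
    print(name, n)
```

With these placeholder values, doubling the filters (B-F) or enlarging the kernels (B-K) brings the feedforward parameter count into the range of the recurrent models, which is the purpose of the two control architectures.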
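The central element of the investigated models is the hidden recurrent convolutional layer. The inputs to these layers are denoted $\mathbf{h}^{(t,l)}_{i,j}$. This notation represents the vectorized input of a patch centered on location $(i,j)$ in layer $l$ computed at time step $t$ across all feature maps indexed by $k$. Thus an input stimulus presented to the network is denoted as $\mathbf{h}^{(t,0)}_{i,j}$. The activation of a hidden recurrent layer can then be written as
$$z^{(t,l)}_{i,j,k} = \left(\mathbf{w}^{(l),B}_{k}\right)^{\top} \mathbf{h}^{(t,l-1)}_{i,j} + \left(\mathbf{w}^{(l),L}_{k}\right)^{\top} \mathbf{h}^{(t-1,l)}_{i,j} + \left(\mathbf{w}^{(l),T}_{k}\right)^{\top} \mathbf{h}^{(t-1,l+1)}_{i,j},$$
where $\mathbf{w}^{(l),B}_{k}$, $\mathbf{w}^{(l),L}_{k}$, and $\mathbf{w}^{(l),T}_{k}$ are the vectorized convolutional kernels at feature map $k$ in layer $l$ for bottom-up (B), lateral (L), and top-down (T) connections, respectively. Each of these kernels is only active for architectures using the particular connection type and is otherwise set to zero. Note that the lateral and top-down connections depend on values of the previous time step; we therefore define their inputs to be a vector of zeros at the first time step, where there is no preceding time step. Top-down connections are only present between the two hidden layers (Fig. 1 B).
Following the flow of information, the activations $z^{(t,l)}_{i,j,k}$ of the hidden layer are batch-normalized (ioffe2015batch). This technique normalizes an activation using the mean and standard deviation over a mini-batch of activations and adds multiplicative and additive noise:
$$\mathrm{BN}_{\gamma,\beta}(z) = \gamma \, \frac{z - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}} + \beta,$$
where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mini-batch mean and standard deviation, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are additional learnable parameters.
The normalized activations are then passed on to rectified linear units (ReLU, $\sigma$),
$$\sigma(x) = \max(0, x),$$
and go through local response normalization (LRN, $\omega$),
$$\omega(x_{k}) = x_{k} \left( c + \alpha \sum_{k' \in \mathcal{N}(k)} x_{k'}^{2} \right)^{-\beta},$$
where $\mathcal{N}(k)$ denotes the feature maps adjacent to $k$ at the same spatial location and $c$, $\alpha$, and $\beta$ are fixed constants. Inspired by lateral inhibition, LRN induces competition for large activities amongst the closest features within a spatial location (krizhevsky2012imagenet). Finally, the output of the hidden layer can be written as
$$h^{(t,l)}_{i,j,k} = \omega\!\left(\sigma\!\left(\mathrm{BN}_{\gamma,\beta}\!\left(z^{(t,l)}_{i,j,k}\right)\right)\right).$$
After passing the second hidden layer, the activations are relayed to a fully-connected segment with one output unit per class and a softmax activation layer, defined as
$$\hat{y}_{m} = \mathrm{softmax}(\mathbf{a})_{m} = \frac{\exp(a_{m})}{\sum_{m'} \exp(a_{m'})},$$
where $a_{m}$ is the activation of output unit $m$.
This final activation function normalizes the sum over output units to one and therefore makes the output interpretable as the probability distribution over all possible classes.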
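The temporal unrolling described above can be illustrated with a deliberately simplified toy model. The sketch below is not the architecture used in the experiments: it replaces convolutions with small dense matrices, omits max-pooling, batch normalization and LRN, and uses arbitrary example weights. The function `run_blt` and the weight dictionary keys are hypothetical names. What it does preserve is the update order: bottom-up input comes from the current time step, lateral and top-down input from the previous one, with zero state before the first step.

```python
import math


def matvec(W, x):
    """Dense matrix-vector product for nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def add(*vectors):
    """Element-wise sum of equally long vectors."""
    return [sum(values) for values in zip(*vectors)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    shift = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def run_blt(x, W, steps=4):
    """Unroll a two-hidden-layer BLT toy network; returns one softmax per step."""
    h1 = [0.0] * len(W["B1"])  # zero state: no preceding time step
    h2 = [0.0] * len(W["B2"])
    outputs = []
    for _ in range(steps):
        # Layer 1: bottom-up from the input, lateral and top-down from t-1.
        new_h1 = relu(add(matvec(W["B1"], x),
                          matvec(W["L1"], h1),
                          matvec(W["T1"], h2)))
        # Layer 2: bottom-up from layer 1 at the current step, lateral from t-1.
        new_h2 = relu(add(matvec(W["B2"], new_h1),
                          matvec(W["L2"], h2)))
        h1, h2 = new_h1, new_h2
        outputs.append(softmax(matvec(W["out"], h2)))
    return outputs


# Arbitrary example weights (two units per layer, two classes).
weights = {
    "B1": [[1.0, 0.0], [0.0, 1.0]],
    "L1": [[0.5, 0.0], [0.0, 0.0]],
    "T1": [[0.0, 0.0], [0.0, 0.0]],
    "B2": [[1.0, 0.0], [0.0, 1.0]],
    "L2": [[0.0, 0.0], [0.0, 0.0]],
    "out": [[1.0, -1.0], [-1.0, 1.0]],
}
for t, probs in enumerate(run_blt([1.0, 0.5], weights)):
    print(t, [round(p, 3) for p in probs])
```

Setting the lateral and top-down matrices to zero turns the toy model into a purely feedforward network (B) whose output is identical at every time step.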
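The labels to be predicted by the network are encoded as one-hot vectors. To quantify the mismatch between the networks’ output and the target label we compute the cross-entropy cost function summed across all time steps and all output units:
$$E = -\sum_{t} \sum_{m} y_{m} \log \hat{y}^{(t)}_{m},$$
where $y_{m}$ is the one-hot target for class $m$ and $\hat{y}^{(t)}_{m}$ is the softmax output of unit $m$ at time step $t$.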
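Because the labels are one-hot, the sum over output units keeps only the log-probability of the target class, so the cost reduces to a sum of negative log-likelihoods over time steps. A minimal sketch follows, assuming the per-step softmax outputs are available as lists of probabilities; the function name and the example numbers are made up for illustration.

```python
import math


def cross_entropy_over_time(outputs, target_index):
    """Cross-entropy with a one-hot target, summed over all time steps."""
    return -sum(math.log(probs[target_index]) for probs in outputs)


# Two time steps, two classes, target class 1 (illustrative numbers only).
example_outputs = [[0.5, 0.5], [0.25, 0.75]]
loss = cross_entropy_over_time(example_outputs, target_index=1)
print(loss)
```

Summing over time steps means that every unrolled step, not only the last one, is pushed towards the correct label during training.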
The network parameters are adapted by gradient descent using Adam (kingma2014adam) with a preset initial learning rate. Unless stated otherwise, training ran for 25 epochs with mini-batches of size 400. Bottom-up weights were initialized from one truncated normal distribution and all other weights from another.
2.3 Model Performance Metrics
The different models were evaluated in terms of classification accuracy averaged across the test set. We use pair-wise McNemar’s tests (mcnemar1947note) to compare test performances with each other. McNemar’s test makes use of the variability in performance across stimuli for statistical inference and thus does not require repeated training (dietterich1998approximate). This method enables us to evaluate and compare a variety of different models in a computationally efficient manner. As multiple comparisons increase the risk of false positives, we control the false discovery rate (FDR, the expected proportion of false positives among the positive outcomes) at 0.05 using a Bonferroni-type correction procedure developed in (benjamini1995controlling).
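The statistical procedure can be sketched in a few lines. The code below is an illustration and not the analysis script used for the reported results: it assumes the chi-square version of McNemar's test with continuity correction (the text does not say which variant was used) and implements the step-up procedure of (benjamini1995controlling). Function names are hypothetical.

```python
import math


def mcnemar_p(correct_a, correct_b):
    """Two-sided McNemar test on per-stimulus correctness of two models."""
    # Discordant pairs: exactly one of the two models is correct.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    stat = max(abs(only_a - only_b) - 1, 0) ** 2 / n
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(p_values, q=0.05):
    """Return, per hypothesis, whether it is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            largest_rank = rank  # step-up: keep the largest passing rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= largest_rank
    return rejected


# Toy data: model A is correct on 20 stimuli that B misses, B on 2 that A misses.
model_a = [True] * 20 + [False] * 2 + [True] * 8
model_b = [False] * 20 + [True] * 2 + [True] * 8
p_value = mcnemar_p(model_a, model_b)
decisions = benjamini_hochberg([p_value, 0.02, 0.03, 0.5])
print(p_value, decisions)
```

Only the stimuli on which the two models disagree enter the test statistic, which is why a single trained instance of each model suffices.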
3.1 Model Performance
Every considered model was trained for 25 epochs on OS-MNIST and on OS-YCB with all four occlusion percentage levels combined. Training was conducted using an NVIDIA Tesla K40c GPU. Fig. 2 depicts the error-rate for the models trained with monocular (A, C) and stereoscopic (B, D) input.
While the task to be accomplished by the network is the same for both datasets, OS-YCB offers 79 possible classes compared to only 10 in OS-MNIST. Additionally, OS-YCB requires the models to recognize 3D objects shown from different angles, making it the more complex of the two datasets. Given that we did not change the architectural complexity, better overall performance was to be expected for OS-MNIST. The error rates for OS-MNIST (mono., range: .087 – .126) and OS-YCB (mono., range: .182 – .212) confirm this assumption.
Overall, recurrent architectures show increased performance on the given task compared to feedforward networks of near-equal complexity. For OS-MNIST with monocular inputs, significant differences (FDR = 0.05) are found for all model pairs except (B-K, B-F) and (BL, BLT).
The lower left square, highlighted by a white line, indicates that all but two pair-wise tests between feedforward and recurrent models show a significant performance gain for recurrent architectures. Only for OS-YCB data (Fig. 2 C, D) does BT fail to significantly outperform B-F. For stereoscopic input, B-F outperforms BT, but BLT still performs best. Notably, B-K performs significantly worse than B-F, which calls into question the benefits of the increased kernel size.
Qualitatively, we observe similarities when comparing OS-MNIST and OS-YCB performances: Recurrent networks reliably outperform their non-recurrent counterparts and BT produces the highest error-rates amongst recurrent models. Notably, the relative decrease in error-rate from feedforward to recurrent models is elevated for the stereoscopic case in OS-MNIST. However, significant differences between BL and BLT can only be observed for the more complex OS-YCB data. Also, the relative performance of B-F is substantially better for the OS-YCB dataset. When trained separately on the four subsets of OS-YCB with specific occlusion percentages, we observe the same overall patterns. Error rates rise with higher occlusion percentages, but the BLT model consistently ranks highest in classification performance, see Tab. 2.
3.2 Impact of Recurrent Connectivity
The softmax output indicates how confident the network “feels” about each class being the target. Thus, evaluating the output of BLT at each time step yields insight into how recurrent connections revise the networks’ belief over time. In fact, we observe that wrong initial guesses at the first time step can be corrected at later time steps, while correct initial guesses are reinforced. Examples of this can be found in Fig. 3 A: While the network initially estimates the target to be 8, the final output is the correct class 9 (left panel). The average softmax output over each target is depicted in Fig. 3 B. The visualization reveals that the probabilities assigned to incorrect classes decrease over time and also shows that the model has discovered systematic similarities between digits 3, 5 and 8, and digits 4 and 1.
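This kind of analysis amounts to tracking the most probable class at every time step. A minimal sketch, assuming the per-step softmax outputs are available as lists of probabilities, is given below. The function names are hypothetical and the numbers are invented for illustration; they are not results from the experiments.

```python
def prediction_trajectory(outputs):
    """Index of the most probable class at each time step."""
    return [max(range(len(probs)), key=lambda k: probs[k]) for probs in outputs]


def is_corrected(outputs, target_index):
    """True if the first guess is wrong but the final guess is right."""
    trajectory = prediction_trajectory(outputs)
    return trajectory[0] != target_index and trajectory[-1] == target_index


# Invented softmax outputs for four time steps and three classes.
example = [
    [0.20, 0.50, 0.30],
    [0.30, 0.40, 0.30],
    [0.45, 0.35, 0.20],
    [0.60, 0.30, 0.10],
]
print(prediction_trajectory(example), is_corrected(example, target_index=0))
```

Counting the stimuli for which `is_corrected` holds, and those for which the opposite happens, would give a simple summary of how often recurrence helps or hurts after the initial feedforward pass.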
We investigated whether recurrent connectivity benefits occluded object recognition. Previous attempts at answering this question have been limited by using very simplistic and unnatural stimuli. On the one hand, the stimuli used by (spoerer2017recurrent) were computer rendered digits without any variability in individual digit appearance. On the other hand, the stimuli used by (oreilly2013recurrent; tang2014spatiotemporal) only blurred out image parts rather than introducing occluding objects. To overcome these limitations, we introduced two novel datasets that capture the natural variability of object appearance and a range of disparity and perspective cues, namely the Occluded Stereo Multi-MNIST (OS-MNIST) and the Occluded Stereo YCB-Objects (OS-YCB) datasets. We demonstrated that feedback connections significantly improve occluded object recognition for these much more complex datasets, providing strong evidence for a general benefit of recurrence for occluded object recognition. Furthermore, recurrent architectures similar to the ones presented here have been shown to also outperform parameter-matched control models when no occlusion is present (liang2015recurrent; han2018deeppredictive), suggesting a rather general benefit of recurrence for object recognition tasks. This would be consistent with biological observations of how object information in the brain unfolds over time during recognition (brincat2006dynamic).
In our experiments, the fully recurrent model comprising lateral and top-down connections (BLT) performed best in all runs. The BL model came in second, while BT performed worst among the recurrent models, suggesting that lateral connections are particularly important for the observed performance advantage.
A second finding is that for stereoscopic input we observed higher recognition rates. This is likely due to the fact that the additional image introduces a new perspective of the scene, potentially revealing additional information about the target. Furthermore, for the stereoscopic input, the target is presented at zero disparity, while the occluder objects are not, which provides the network with a cue regarding what part of the input should be ignored. Qualitatively, the results of the statistical network comparisons resemble the ones obtained for monocular stimuli. Interestingly, however, the relative performance difference between recurrent and feedforward models was usually higher for stereoscopic stimuli. This suggests that the recurrent connections are effective in utilizing the additional cues provided by the binocular presentation of the scene.
During training, we consistently observed that the sum of recurrent weights (lateral and top-down) became slightly negative. We hypothesize that this bias towards negative weights might also contribute to inhibiting or discounting occluders. With the network’s dynamics being determined by the ReLU activation function, a slight bias towards inhibitory weights might also be important for keeping activations centered around the non-linearity. Finally, our analysis revealed that recurrent connections revise the network’s output over time, sometimes correcting an incorrect initial output after the first feedforward pass through the network, providing further evidence for the effectiveness of recurrence.
Evidently, any recurrent computation could also be performed by an appropriately unfolded (and therefore deeper) feedforward network. The recurrent network can be viewed as equivalent to such a deeper feedforward version, with certain weights constrained to be identical. Thus, recurrence implies a form of weight sharing in the temporal domain similar to how convolutional layers implement a form of weight sharing in the spatial domain. We speculate that this is the chief reason for the observed performance gains of recurrent networks (liao2016residual).
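The equivalence between a recurrent layer and a deeper feedforward network with tied weights can be checked on a toy example. The sketch below uses a single dense layer with bottom-up and lateral weights, which is a strong simplification of the convolutional models studied here. The weights and all function names are arbitrary and hypothetical.

```python
def matvec(W, x):
    """Dense matrix-vector product for nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def step(x, h, B, L):
    """One update: bottom-up input plus lateral input from the previous state."""
    return relu([b + l for b, l in zip(matvec(B, x), matvec(L, h))])


def recurrent(x, B, L, steps):
    """Recurrent layer: the same weights are reused at every time step."""
    h = [0.0] * len(B)
    for _ in range(steps):
        h = step(x, h, B, L)
    return h


def unrolled(x, layers):
    """Feedforward stack: one (B, L) pair per depth level, possibly distinct."""
    h = [0.0] * len(layers[0][0])
    for B_t, L_t in layers:
        h = step(x, h, B_t, L_t)
    return h


def n_params(layers, tied):
    """Free parameters of the stack, with or without weight sharing."""
    per_level = [sum(len(r) for r in B) + sum(len(r) for r in L)
                 for B, L in layers]
    return per_level[0] if tied else sum(per_level)


B = [[1.0, 0.5], [0.0, 1.0]]
L = [[0.0, -0.5], [0.25, 0.0]]
x = [1.0, 2.0]
stack = [(B, L)] * 4  # four levels that all share the same weights

same_output = recurrent(x, B, L, 4) == unrolled(x, stack)
print(same_output, n_params(stack, tied=True), n_params(stack, tied=False))
```

The unrolled stack reproduces the recurrent output exactly, yet an untied version of the same depth would need four times as many free parameters.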
In conclusion, we have shown that recurrent neural network architectures show significant advantages on complex occluded object recognition problems. Given their improved performance and greater biological plausibility they deserve more thorough analysis.
This work was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 713010 (GOAL-Robots, Goal-based Open-ended Autonomous Learning Robots).