MuPNet: Multi-modal Predictive Coding Network for Place Recognition by Unsupervised Learning of Joint Visuo-Tactile Latent Representations
Extracting and binding salient information from different sensory modalities to determine common features in the environment is a significant challenge in robotics. Here we present MuPNet (Multi-modal Predictive Coding Network), a biologically plausible network architecture for extracting joint latent features from visuo-tactile sensory data gathered from a biomimetic mobile robot. In this study we evaluate MuPNet applied to place recognition as a simulated biomimetic robot platform explores visually aliased environments. F1 scores show that, under controlled conditions, MuPNet performs on par with prior hand-crafted sensory feature extraction techniques, and that it improves on them significantly when operating in novel environments.
Place recognition is an important ability for autonomous systems that navigate and interact with their environment. The core requirements for successful place recognition, such as evaluating the similarity between scenes and comparing them to a set of internal representations, have been studied extensively in the recent computer vision and robotics literature. Advances in visual sensors, computer vision, and deep learning have shifted place-recognition research towards vision as the primary sensory modality [lowry2015visual]. However, central challenges such as recognizing places in changing, cluttered, or aliased environments remain unsolved and are the subject of ongoing research [chen2014convolutional, kendall2015posenet].
Previous works such as [milford2013brain] have shown that fusing complementary sensory modalities improves place recognition and simultaneous localization and mapping (SLAM) performance, especially in cluttered and aliased environments. In such environments, tactile sensors can interact with the surroundings at close range, helping to disambiguate visual landmarks and prevent false place recognitions [struckmeier2019vita]. Recently, new tactile sensors have pushed spatio-temporal acuity to the level of a human fingertip [yuan2017gelsight, Oddo2011roughness, lepora2015super], and tactile sensing has been shown to support close-range object recognition [salman2018whisker, Bauza2019TactileMA]. For obtaining information about object geometry, biomimetic rat whiskers provide a robust measure of surface proximity, and information from whisker arrays can be used to determine surface form, texture, compliance, and friction [kim2007biomimetic, pearson2007whiskerbot, solomon2006biomechanics]. However, feature extraction from these sensors is typically performed using hand-crafted features, an approach found inferior in visual processing, where convolutional neural network architectures learn the features instead. Moreover, the existing features typically represent only visual or only tactile signatures, and the usual approach of combining them by weighting does not capture their correlations.
We propose a new biologically plausible feature extraction method called MuPNet (Multi-modal Predictive Coding Network) that implicitly fuses visual and tactile information into a single feature. The method uses neurobiologically plausible predictive coding illustrated in Fig. 1 to infer latent visuo-tactile representations of the sensory input. Using three settings, we empirically study the robustness of place recognition with the developed features compared to hand-crafted sensory pre-processing techniques adopted in prior works [struckmeier2019vita]. Although this study concerns place recognition using vision and touch, we contend that MuPNet is applicable to learning the joint latent representations from any co-incident multi-sensory input.
The main contributions of the work are: (i) extension of predictive coding to multi-modal sensory information; (ii) the method for biologically plausible visuo-tactile feature extraction; and (iii) demonstration of improved robustness of place recognition compared to prior works when faced with contextual changes.
Previous research such as [lee2019making] has inferred representations of multiple sensory modalities using a hierarchical autoencoder with individual encoder/decoder blocks for each sensory modality. One of the main differences between predictive coding and existing machine learning models like the autoencoder is the direction in which information and errors propagate. An autoencoder consists of an encoder and a decoder which together form a feedforward network trained end-to-end with error backpropagation. However, error backpropagation is biologically implausible [lillicrap2016random], and predictive coding is a biologically plausible alternative. During inference, autoencoders propagate information sequentially towards the output layer of the network, whereas in predictive coding all layers of the network transmit information in parallel only towards the input layer (without any further propagation across layers). For learning, autoencoders require a backward pass through the network from output to input layer, whereas in predictive coding each layer transmits prediction errors in parallel towards the multi-sensory module, as shown in Fig. 1. Furthermore, in autoencoders, neuronal activity in intermediate layers is derived from the feedforward propagation of the input. In predictive coding, the neuronal activity in each layer is initialized randomly and then adapted such that it best represents the features of a given multi-modal input. Thus, the predictive coding architecture can infer representations without an explicit encoding block.
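The inference scheme described above can be sketched numerically. The following is a minimal single-layer illustration, not MuPNet's actual architecture: a randomly initialized latent activity vector is iteratively adapted so that its top-down prediction explains a toy concatenated visuo-tactile input, using only the local prediction error (all dimensions, step sizes, and iteration counts are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coincident multi-modal input: concatenated "visual" and "tactile" vectors.
x = np.concatenate([rng.random(16), rng.random(8)])

n_latent = 6
W = 0.1 * rng.standard_normal((x.size, n_latent))  # top-down generative weights
r = rng.standard_normal(n_latent)                  # latent activity, random init

init_err = np.linalg.norm(x - W @ r)
for _ in range(200):
    err = x - W @ r          # prediction error, sent back towards the input layer
    r += 0.05 * (W.T @ err)  # adapt activity to better explain the input
    # Learning would update W from the same local error signal,
    # e.g. W += lr_w * np.outer(err, r), with no backward pass required.

final_err = np.linalg.norm(x - W @ r)
```

After the iterative updates the residual prediction error is substantially smaller than at initialization, which is the sense in which the latent activity comes to "represent" the joint input without an explicit encoder.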
II Multi-modal feature extraction
We begin this section by presenting an existing hand-crafted baseline that has been proposed for bio-inspired SLAM. We then continue by presenting the predictive coding based features.
II-A Hand-crafted baseline
ViTa-SLAM [struckmeier2019vita] is a visuo-tactile extension of the vision-only RatSLAM [ball2013openratslam] and tactile-only WhiskerRatSLAM [salman2018whisker] methods. ViTa-SLAM extracts visual and tactile features independently, as illustrated in Fig. LABEL:fig:ViTaSLAM_Structure, which shows a block diagram of ViTa-SLAM with the place-recognition front-end highlighted by the dashed square. The visual feature is an intensity profile represented as a vector, and the tactile data are represented using a point feature histogram and a slope distribution array [struckmeier2019vita]. The distance between features is defined as a weighted combination of the per-feature differences,

d = λ_v d_v + λ_pfh d_pfh + λ_sda d_sda, (2)

where λ_v, λ_pfh, and λ_sda are scaling factors that normalize the respective distances based on their standard deviations.
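As an illustration, such a weighted combination can be sketched as follows. This is a simplified version with one distance term per modality; the function name and the exact normalization are our assumptions, not ViTa-SLAM's published implementation.

```python
import numpy as np

def vita_distance(vis_a, vis_b, tac_a, tac_b, sigma_vis, sigma_tac):
    """Weighted visuo-tactile feature distance: each modality's difference
    is normalized by its (empirically estimated) standard deviation."""
    d_vis = np.linalg.norm(np.asarray(vis_a) - np.asarray(vis_b))
    d_tac = np.linalg.norm(np.asarray(tac_a) - np.asarray(tac_b))
    return d_vis / sigma_vis + d_tac / sigma_tac

# Identical features in both modalities yield zero distance.
v = [0.2, 0.5, 0.1]
t = [1.0, 3.0]
print(vita_distance(v, v, t, t, sigma_vis=0.3, sigma_tac=1.2))  # -> 0.0
```

Note that the scaling factors sigma_vis and sigma_tac must be estimated from data for each environment, which is precisely the tuning burden discussed below.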
This hand-crafted approach to combining coincident visuo-tactile sensory information was shown to be beneficial for determining place within visually aliased environments. However, its generality across different environments and experimental conditions was limited for the following reasons:
The scaling factors to normalize the error components shown in Eq. (2) had to be determined empirically.
A large number of parameters related to feature extraction and pre-processing had to be tuned.