A Sensorimotor Perspective on Grounding the Semantic of Simple Visual Features
Abstract
In Machine Learning and Robotics, the semantic content of visual features is usually provided to the system by a human who interprets its content. Conversely, strictly unsupervised approaches have difficulties relating the statistics of sensory inputs to their semantic content without also relying on prior knowledge introduced in the system. We propose in this paper to tackle this problem from a sensorimotor perspective. In line with the Sensorimotor Contingencies Theory, we make the fundamental assumption that the semantic content of sensory inputs at least partially stems from the way an agent can actively transform them. We illustrate our approach by formalizing how simple visual features can induce invariants in a naive agent’s sensorimotor experience, and evaluate it on a simple simulated visual system. Without any a priori knowledge about the way its sensorimotor information is encoded, we show how an agent can characterize the uniformity and edgeness of the visual features it interacts with.
1 Introduction
Artificial visual perception has made great progress in the last few years, in particular thanks to the development of large image databases and neural network architectures. Over the past three years, computer vision algorithms have even surpassed human performance in classification tasks on specific databases [1]. Despite these impressive achievements, current artificial vision systems still exhibit important limitations. As exemplified by the existence of adversarial examples, their high performance can prove surprisingly brittle [2]. But, even more concerning for the Developmental Robotics community, another strong limitation of these systems is their lack of autonomy. Indeed, current efficient machine learning systems are supervised. They rely on humans to collect and preprocess the adequate task-related data, to define a suitable network architecture, and most importantly to provide an interpretation of the semantic content of the visual scene in the form of labels or rewards.
The question of how to build a completely autonomous artificial vision system remains open. In particular, how can a robot create or discover the semantic content of the visual input it receives? Unsupervised approaches to the problem have been proposed [3], but they still eventually require a human to interpret the patterns that have been statistically extracted from the data. The grounding of semantics is a deep philosophical question that has arguably been investigated for centuries, and that roboticists have practically bumped into since the early years of artificial intelligence. Quite evidently, it will not be solved easily. We can nonetheless start addressing the problem by looking at simple problems that could provide some insight on how to solve more complex ones, and in particular on how humans perceive their environment autonomously.
We follow such an approach and investigate the grounding of the perception of simple visual features.
We base our study on the SensoriMotor Contingencies Theory (SMCT), a theory of perception that was introduced with a particular consideration for visual experience [4]. This theory suggests that the subjective experience of perception emerges from regularities in our sensorimotor flow. More precisely, it argues that perception does not come directly from the processing of passive sensory inputs, but from the knowledge of the way one’s actions would transform these sensory inputs.
This philosophical perspective has multiple interesting consequences for robotics, and in particular autonomous and developmental systems. It suggests that a robot can acquire perceptive abilities by actively exploring its environment and identifying regularities in its sensorimotor experience. But more interestingly, it suggests that the subjective perceptive experiences themselves can be characterized by the properties of the sensorimotor regularities they are associated with.
A typical example of this idea is the one of a line, or more generally an edge. There is a sensorimotor regularity when one looks at an edge: regardless of the way the sensorimotor information is encoded, actions that move the eye generate sensory variations, except when the eye moves along the edge. This specific sensorimotor invariant characterizes the visual input through the way one can interact with it, and independently from the static properties of the visual input itself.
In this paper, we propose to investigate the practical relevance of this philosophical claim by evaluating the sensorimotor invariance associated with simple visual features. To do so, we propose a mathematical formalization of the problem, as well as simple simulations of an agent exploring its environment with a small retina-like sensor (see Fig. 1).
Previous works have developed approaches inspired by the SMCT. They studied different components of perceptive experience such as space [5], color [6], objects [7], field of view [8], tactile space [9], auditory space [10], or containment [11]. Despite some of them being in part related to visual experience, none directly addresses the problem of characterizing visual features. Nonetheless, a similar approach has previously been proposed in [12]. Our work differs from it in that we propose a mathematical formalization of the visual sensorimotor invariances, instead of casting the problem in a Reinforcement Learning framework which relies on a hand-designed reward function.
In the following sections, we introduce a mathematical formalization of the problem, propose a method to identify sensorimotor invariants induced by simple visual features, and evaluate it on simple experiments. Finally, we discuss our results and the practicality and limitations of the approach.
2 Problem
In our study, we avoid as much as possible any bias that is usually introduced in the processing of sensorimotor experiences. To do so, we consider naive (tabula rasa) agents which do not have any a priori knowledge about their environment, nor about the sensorimotor apparatus they use to explore it. The agent itself is considered to be the information processing system, which only accesses the environment indirectly through the interface formed by its physical sensors and motors (see Fig. 1). As a consequence, the agent has to estimate the environment’s properties by looking at the instantaneous sensory and motor states that it receives and generates. We respectively define them as:
The sensory state is the vector of the individual sensations produced by each sensor, and the motor state is the vector of the individual motor commands sent to each motor. Each state lives in its own vector space: the sensory space and the motor space, respectively.
Although this formalization is relatively general, we limit in this paper our study to visual sensations and assume that the agent is equipped with a type of visual sensor. This way, each sensation can be thought of as produced by an individual cell (cone or rod) in a retina or a camera (pixel). The topological organization of those elementary sensors, as well as the way they encode the information are however unspecified.
Similarly, we assume that the motor commands correspond to displacements of the sensor in the visual scene, locally akin to translations in the plane.
Based on this basic formalism, we address the question of the identification of properties which could characterize sensory inputs.
2.1 Passive approach
In the absence of prior knowledge or external inputs (label, reward), the common way to address the problem is to perform a statistical analysis of a collection of static sensory inputs. This way, one can for instance estimate the probability of occurrence of a sensory input. One can also evaluate the correlation between the different components of the sensory state. This type of approach leads to the extraction of sensory statistics which can be very useful for bootstrapping the solving of computer vision tasks [13]. For instance, by analyzing images from the internet, such a system can create features, or representations, specific to “cats” and “faces”, as presented in [14]. Yet, a human is still required to interpret the semantic content of these statistical representations. For instance, such a system can capture the fact that an oriented edge or a uniform input are highly probable, but cannot make explicit in what way each of them is particular.
In order to estimate semantic content from static sensory inputs, one needs to incorporate prior knowledge into the system. For instance, it is possible to evaluate the ‘uniformity’ of a visual input if the excitation function of each sensor is provided. This way, one can trace back the local state of the environment captured by each sensor from its sensation, and compare these local states to see how much they differ from one another. Another example is the possibility to estimate the presence of an edge if the excitation functions are known, and if the topological organization of the sensors is known. In such a case, one can evaluate if two linearly separable sets of pixels encode two different environmental states. Besides the prior knowledge about the sensory apparatus that these evaluations require, it is important to notice that they are defined by a human who deems them meaningful. The sensory inputs in themselves do not exhibit particularly interesting properties for the agent. For instance, given different unspecified excitation functions, a uniform visual input would be encoded as a vector of different sensations which has no more semantic content for the agent than a random visual input would have for a human.
2.2 Active approach
As suggested by the SMCT, a sensorimotor approach is possible to characterize visual inputs. Unlike the typical computer vision approach which relies on a collection of static sensory inputs, it takes into account the spatiotemporality of the data and the link between the motor and sensory streams. Adding this motor component to the problem expands the space (now sensorimotor) in which the data can be analyzed. In particular, it is possible to look at how actions transform sensory inputs. For an autonomous agent that needs to act in the world, a strong argument can be made that regularities in the way actions can transform sensations are more useful to extract than passive regularities in the sensory states only.
We denote the unspecified sensorimotor function which maps the motor commands to the sensory states:
(1) 
It is parametrized by which represents the state of the environment the agent is currently interacting with, that we also refer to as visual input in the context of this paper. This visual input is to be distinguished from the sensory input that generates. The mapping is characteristic of the visual input , as the agent’s sensorimotor experience varies depending on . In particular, some mappings can induce invariants in the sensorimotor experience.
We hypothesize that simple visual features can be characterized through their associated sensorimotor invariants. As mentioned in Section 1, an edge is for example associated with a specific sensorimotor invariance: regardless of the way sensory and motor information are encoded, there is a set of motor commands which leave this sensory input unchanged. Moreover, the orientation of the edge is characterized by the motor commands to which the sensory input is invariant. Similarly, a uniform visual input exhibits this kind of invariance for any motor command.
Formally, the function exhibits such a pointwise invariant if there exists an such that:
(2) 
where is arbitrarily considered as a reference motor state
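The invariance described above can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the sigmoid-shaped edge, the `tanh` encoding, the number of cells, and the names `edge_input` and `sensory_state`. It only shows that a displacement along an edge leaves the sensory state unchanged whatever the (unknown) encoding of each cell, while a displacement across the edge does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_input(pos, normal):
    """Hypothetical visual input: a smooth edge whose intensity in [0, 1]
    depends only on the position component along `normal`."""
    return 1.0 / (1.0 + np.exp(-4.0 * (pos @ normal)))

# Hypothetical sensor: 9 cells at random offsets, each with its own
# arbitrary encoding (gain and bias) that the agent does not know.
cell_offsets = rng.uniform(-0.5, 0.5, size=(9, 2))
gains = rng.uniform(0.5, 2.0, size=9)
biases = rng.uniform(-1.0, 1.0, size=9)

def sensory_state(motor, normal):
    """Sensory state obtained when the sensor is displaced by `motor`."""
    local_input = edge_input(cell_offsets + motor, normal)
    return np.tanh(gains * local_input + biases)

normal = np.array([np.cos(0.3), np.sin(0.3)])  # direction across the edge
along = np.array([-normal[1], normal[0]])      # direction along the edge

# Reference motor state: no displacement.
s_ref = sensory_state(np.zeros(2), normal)
s_along = sensory_state(0.2 * along, normal)
s_across = sensory_state(0.2 * normal, normal)

change_along = np.linalg.norm(s_along - s_ref)    # numerically zero
change_across = np.linalg.norm(s_across - s_ref)  # clearly non-zero
```

Changing the gains, biases, or cell positions alters the sensory vectors themselves but not which motor direction leaves them unchanged, which is the property the invariance definition relies on.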
2.3 Experimental setup
In the following sections, we investigate how a naive agent can extract sensorimotor invariants of the type of Eq. 2 and characterize simple visual features this way. To do so, we propose a simple experiment simulating the exploration of visual inputs by an agent (see Fig. 1).
The environment explored by the agent consists of visual inputs. They can be conceptualized as functions which take as input a position in the plane and generate an output:
(3) 
We arbitrarily define the functions such that their potential output space is limited to .
Typically, a grayscale image captured by a camera would be a sampling of such a function over a regular grid, with each pixel taking values in .
The size of each visual input is set to units
The agent is equipped with a generic visual sensor made up of elementary cells spread in a disk of diameter units. For each cell, the direction and distance to the center of the disk are randomly drawn from uniform distributions at the beginning of the simulation, yielding a nonhomogeneous topological organization of the overall visual sensor. Each cell captures the local visual input
(4) 
at the position of the cell in the plane. The excitation function of each cell is independently defined as an arbitrary continuous function:
(5) 
with as fixed parameters drawn at the beginning of the simulation. It prevents a direct comparison of the different sensations .
The agent is equipped with two motors which respectively control the horizontal and vertical displacements of the sensor in the plane. The sensor can be moved continuously in the environment, which effectively changes the position of each cell.
The motor exploration of each visual input is limited to a disk of units diameter. For each displacement of the sensor, the direction and amplitude are randomly drawn from uniform distributions. This choice is arbitrary and could be replaced by any other sampling of the motor space.
During the motor exploration, we assume that the agent is recentered on the visual input after each movement. This constraint can seem artificial, but it ensures that only a single environmental state is locally explored and characterized. It is to be seen as a simplification of a more realistic scenario in which the agent iteratively moves around and encounters a different environmental state at each iteration. The constrained motor exploration we propose here is simply the collection of all the sensorimotor experiences that such a free agent would get when encountering a specific visual input .
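The setup described in this section can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' simulation: the number of cells, the sensor and motor disk diameters, the `[0, 1]` output range, the linear-gradient visual input, and the quadratic form of the excitation function are all hypothetical placeholders, since their actual values and forms are not recoverable from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CELLS = 9            # hypothetical number of elementary cells
SENSOR_DIAMETER = 1.0  # hypothetical sensor diameter (arbitrary units)
MOTOR_DIAMETER = 0.5   # hypothetical diameter of the explored motor disk

def sample_disk(n, diameter):
    """Draw n points whose direction and distance to the center are uniform."""
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    radius = rng.uniform(0.0, diameter / 2.0, size=n)
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

def gradient_input(pos, direction=(1.0, 0.0), slope=0.3, bias=0.5):
    """Hypothetical visual input: a linear gradient clipped to [0, 1]."""
    return np.clip(bias + slope * (pos @ np.asarray(direction)), 0.0, 1.0)

# Cell positions are fixed once, at the beginning of the simulation.
cell_pos = sample_disk(N_CELLS, SENSOR_DIAMETER)

# Assumed excitation function of each cell: a * x + b + c * x**2,
# with parameters drawn once and unknown to the agent.
a, b, c = rng.uniform(-1.0, 1.0, size=(3, N_CELLS))

def sense(motor):
    """Sensory state for a sensor displaced by `motor` from the center."""
    x = gradient_input(cell_pos + motor)
    return a * x + b + c * x**2

# Random motor exploration; the sensor is recentered after each movement,
# so every displacement is expressed relative to the same reference.
motors = sample_disk(200, MOTOR_DIAMETER)
sensations = np.array([sense(m) for m in motors])  # shape (200, N_CELLS)
```

Because each cell has its own parameters, two cells observing the same local intensity generally report different sensations, which is what prevents a direct comparison of the individual sensations.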
3 Linear sensorimotor function
In any nontrivial system, the function is very complex. It implicitly embodies all the unknown properties of the visual input and the agent’s sensorimotor apparatus. We can however study it locally to extract its potential invariants.
3.1 Linear approximation
Let us assume that the function is smooth (in the mathematical analysis sense), and re-express it locally as a linear function:
(6) 
where is a matrix, and is a bias vector of size . Assuming that the number of sensors is at least equal to the number of motors , and that the motors are independent, the rank of is at most equal to . It can however be smaller if there exists a direction in along which does not induce any sensory change in .
In practice, the agent does not have access to , but only to and . Nonetheless, if we denote the image subspace of , the intrinsic dimension of is equal to the rank of . It is thus possible to randomly generate a matrix of samples in to create a sampling matrix of size containing the resulting sensory variations:
and to perform a singular value decomposition (SVD) of :
(7) 
where the left and right factors are unitary matrices, and the middle factor is a diagonal matrix with the singular values of the sampling matrix in decreasing order. The number of significant (non-null) singular values corresponds to the intrinsic dimension of the image subspace, and thus to the rank of the matrix of the linear approximation. Moreover, the first columns of R (right-singular vectors), associated with those significant singular values, correspond to the combinations of motor samples which induce sensory changes, while the other columns of R correspond to the combinations of samples which leave the sensory state invariant. In other words, based on the sampling, we can estimate the rank of the linear approximation associated with a visual input, as well as the potential motor commands which leave the sensory state invariant. Those properties can be used to intrinsically characterize the visual input.
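The procedure above can be sketched numerically. The following is a hedged illustration rather than the authors' code: the rank-1 map `A`, the bias `b`, the sample counts, and the relative threshold used to call a singular value significant are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

n_sensors, n_motors, n_samples = 9, 2, 50

# Hypothetical rank-1 linear sensorimotor map: sensations only change
# along one motor direction, as for a gradient or an edge.
sensitive_dir = np.array([np.cos(0.8), np.sin(0.8)])
A = np.outer(rng.normal(size=n_sensors), sensitive_dir)  # (n_sensors, n_motors)
b = rng.normal(size=n_sensors)

# Random motor samples (columns) and the resulting sensory states.
motor_samples = rng.uniform(-1.0, 1.0, size=(n_motors, n_samples))
sensory_samples = A @ motor_samples + b[:, None]

# Sensory variations relative to the reference motor state m0 = 0,
# whose sensory state is observable by the agent.
reference = A @ np.zeros(n_motors) + b
variations = sensory_samples - reference[:, None]

# Full SVD so that the right-singular vectors span the whole sample space.
U, sing, Vt = np.linalg.svd(variations, full_matrices=True)

# Assumed significance criterion: relative to the largest singular value.
rank = int(np.sum(sing > 1e-8 * sing[0]))

# Right-singular vectors beyond the rank weight the motor samples into
# combinations that generate no sensory change.
null_weights = Vt[rank:]                     # (n_samples - rank, n_samples)
candidates = null_weights @ motor_samples.T  # (n_samples - rank, n_motors)
best = candidates[np.argmax(np.linalg.norm(candidates, axis=1))]
invariant_dir = best / np.linalg.norm(best)
```

In this sketch `rank` plays the role of the estimated rank of the linear map, and `invariant_dir` is a motor direction recovered only from motor samples and sensory variations, without access to `A` itself.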
3.2 Experiment 1
In the first experiment, we simulate linear sensorimotor mappings . To do so, we create visual features such that is linear. As illustrated in Fig 2a, they correspond to gradients with various orientations, slopes, and biases. We also ensure that all excitation functions are linear by drawing and from a uniform distribution , but setting to . Examples of the resulting sensory encoding of visual features are illustrated in Fig. 2b.
The agent explores each visual input with random motor commands. The sensory sampling generated this way is analyzed through an SVD. Potential sensorimotor invariances are estimated by looking at the number of significant singular values:
none means the sensory input is invariant to any motor command,
one means the sensory input is invariant to one direction in the motor space,
two means the sensory input is not invariant to any motor command.
Note that because the motor space is 2D, no more than two singular values can be significant. For a given non-significant singular value, the motor direction associated with the related invariance is:
(8) 
Results of the simulation are presented in Fig. 2c.
For all visual inputs, only one singular value is significant. This result is expected as the visual gradients all exhibit one invariant. Moreover, this singular value tends towards zero for visual inputs which are close to uniform.
We can also see that the direction of the invariance is correctly estimated via the SVD.
Despite having no information about the encoding of its sensorimotor information, the agent is thus able to characterize the ‘uniformity’ of visual inputs via the value of the first singular value, and their ‘edgeness’ via the value of the second singular value and its corresponding right-singular vector.
Note that, due to the linearity of the visual input, there cannot exist two simultaneous dimensions of variations for . As a consequence, the second singular value is always nonsignificant.
The last panel of Fig. 2 displays a larger set of visual patches characterized by the agent. They are organized horizontally according to their estimated uniformity, and vertically according to their estimated direction of motor invariance. We can see that the agent can build this way a topological representation of the visual inputs which reflects their invariances.
4 Extension to nonlinear functions
Although mathematically convenient to manipulate, the linear approximation of the sensorimotor function rarely holds in realistic scenarios. For real visual interactions, the sensorimotor mapping can be strongly nonlinear. The linear method proposed to characterize sensory inputs can however be extended to a nonlinear setting by taking inspiration from differential geometry.
4.1 Analyzing the sensory manifold
The subspace spanned by a smooth nonlinear function is a manifold whose intrinsic dimension can be thought of as a nonlinear extension of the rank of . Instead of estimating the rank of the sample matrix produced by a nonlinear function, one can thus estimate its intrinsic dimension. Numerous methods could be considered to perform such an estimation. We propose in this work to use the Curvilinear Component Analysis (CCA) [?] to project the data in spaces of lower dimensions , and to monitor the projection error to determine the smallest dimension for which the error is nonsignificant. Let denote the projection of in dimension , and be the projection error in dimension . The intrinsic dimension of is estimated as:
(9) 
Note that it is impossible to estimate with this method.
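Read together with the description above, one plausible way to write this selection criterion is given below. The symbols $\hat{d}$, $e_d$, $d_{\max}$ and the significance threshold $\tau$ are notation assumed here for illustration and are not necessarily the authors'. Since the candidate projection dimensions start at 1, a dimension of 0 can never be returned by such a criterion.

```latex
\hat{d} \;=\; \min \left\{\, d \in \{1, \dots, d_{\max}\} \;:\; e_d < \tau \,\right\}
```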
To determine the motor commands to which the nonlinear function might be invariant, one has to estimate its zero set. This is a more complicated problem as there is no common way to determine the zero set of unspecified nonlinear functions. In this work we propose to use the optimal low-dimensional projection of the sensory samples as a linear approximation of the unfolded manifold. As in the linear case, we can then perform an SVD of this projection:
(10) 
and determine the motor combinations that do not generate sensory changes by looking at the motor combinations defined by the right-singular vectors associated with non-significant singular values.
4.2 Experiment 2
In the second experiment, we simulate nonlinear sensorimotor mappings. To do so, we create nonlinear visual features by generating simple gradients and passing them through functions with random slopes and biases. As illustrated in Fig. 3a, they correspond to sharper edges for which the visual input is not linear with regard to the position. Moreover, nonlinear excitation functions are generated by independently drawing their parameters from a normal distribution. Examples of the resulting sensory encoding of visual features are illustrated in Fig. 3b.
As in the previous experiment, the agent explores each visual input with random motor commands. The resulting sampling is then projected in low dimension via a CCA. Potential sensorimotor invariances are identified by looking at the estimated intrinsic dimension of the sensory manifold, and the right-singular vector associated with the first non-significant dimension.
Results of the simulation are presented in Fig. 3c. For all visual inputs in the simulation, the intrinsic dimension is estimated to be equal to 1 by the agent, whereas the raw sensory sampling exhibits a greater number of significant singular values. The nonlinear analysis of the manifold’s dimensionality is thus conclusive as all visual features exhibit at least one invariant. Note that the intrinsic dimension might be equal to 0 for some uniform visual inputs, but our dimension estimation method is unable to detect it (see Eq. (9)). Uniformity can however be estimated by looking at the first singular value of the projection, as displayed in Fig. 3d. Moreover, we can see that the direction of the invariance in the motor space is correctly estimated via the SVD of the projection. The naive agent is thus able to characterize the uniformity of the visual features, as well as their edgeness and orientation, even when the sensorimotor mapping is nonlinear. Figure 3d displays a larger set of visual patches organized according to their uniformity and orientation. We can see that the agent can build a topological representation of the visual inputs characterizing their invariances.
5 Discussion
We presented in this work a mathematical formalization and a preliminary experimental evaluation of the sensorimotor characterization of simple visual features. In line with the SMCT, we propose that visual inputs can be characterized, without any a priori knowledge, by looking at the properties of the sensorimotor regularities they induce. With the formalization and simple simulation proposed in this paper, we have shown how a naive agent can characterize their uniformity and edgeness by locally exploring visual inputs and detecting their potential sensorimotor invariants. Based on those invariants, the agent can internally build its own low-dimensional topological representation of the visual inputs it encounters in the environment. In contrast with typical passive analysis of sensory inputs, this representation intrinsically informs the agent on its ability to transform (or not) the related sensory input, knowledge that would be directly useful for planning future actions. Such a sensorimotor characterization of visual inputs also seems to lead to basic abstraction. For instance, the uniformity of visual inputs can be characterized independently from their intensity (light or dark). Similarly, edges between areas of different intensity can be clustered in more abstract groups based on their orientation.
Despite these encouraging preliminary results, many challenges need to be overcome before a complete formalization of the grounding of visual experience is proposed. Firstly, the system simulated in this work is simplistic. It only represents a local interaction between a simple visual feature and what would correspond to a small receptive field in our large field of view. Understanding how this sensorimotor approach can be scaled up to multiple receptive fields observing a visual scene in parallel is a natural question to investigate in the future. Some preliminary work has already been proposed in this direction [8]. This will also naturally raise the question of motor actions of greater amplitude, and how information can circulate between receptive fields to deal with greater displacements. Secondly, sensory ambiguity has not been addressed in this work. Indeed, it is possible that multiple environmental states generate the same sensory experience for a given motor state. This means that the agent’s sensory experience is potentially ambiguous, and that a probabilistic extension of the current formalism is necessary. Given a sensory input in its receptive field, the agent would thus estimate a distribution over the probable associated environmental states, which it could disambiguate by performing a motor action or collecting information from surrounding receptive fields. Thirdly, we assumed the existence of perfect pointwise invariants in our formalism. However, real sensorimotor interactions can be noisy and break this assumption. This problem could be tackled by instead looking for setwise invariants, for which the set corresponds to a small neighborhood around the sensory state. This way, invariants could be identified by looking for motor actions which map a set of noisy data to itself.
Finally, a long term goal is to investigate how the unsupervised capturing of sensorimotor invariants can be coupled with a planning or reinforcement learning module in order to perform guided interactions with the world. In particular, one would need to formalize how the simple abstraction induced by our approach is beneficial for guiding an agent towards its goal.
Footnotes
 Given the type of visual interaction we consider, we can assume that any visual input can be experienced when .
 Any unit of length could be considered, as all distances in the simulation are relative.
References
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[3] T. Lesort, N. Díaz-Rodríguez, J.-F. Goudou, and D. Filliat, “State representation learning for control: An overview,” arXiv preprint arXiv:1802.04181, 2018.
[4] J. K. O’Regan and A. Noë, “A sensorimotor account of vision and visual consciousness,” Behavioral and brain sciences, vol. 24, no. 5, pp. 939–973, 2001.
[5] A. V. Terekhov and J. K. O’Regan, “Space as an invention of biological organisms,” arXiv preprint arXiv:1308.2124, 2013.
[6] C. Witzel, F. Cinotti, and J. K. O’Regan, “What determines the relationship between color naming, unique hues, and sensory singularities: Illuminations, surfaces, or photoreceptors?” Journal of vision, vol. 15, no. 8, pp. 19–19, 2015.
[7] A. Maye and A. K. Engel, “A discrete computational model of sensorimotor contingencies for object perception and control of behavior,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 3810–3815.
[8] A. Laflaquiere, “Grounding the experience of a visual field through sensorimotor contingencies,” Neurocomputing, vol. 268, pp. 142–152, 2017.
[9] V. Marcel, S. Argentieri, and B. Gas, “Building a sensorimotor representation of a naive agent’s tactile space,” IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 2, pp. 141–152, 2017.
[10] M. Bernard, P. Pirim, A. de Cheveigné, and B. Gas, “Sensorimotor learning of sound localization from an auditory evoked behavior,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 91–96.
[11] N. Hay, M. Stark, A. Schlegel, C. Wendelken, D. Park, E. Purdy, T. Silver, D. S. Phoenix, and D. George, “Behavior is everything–towards representing concepts with sensorimotor contingencies,” 2018.
[12] Y. Choe, “Motor system’s role in grounding, development, and recognition in vision,” 2010.
[13] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pretraining help deep learning?” Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010.
[14] Q. V. Le, “Building high-level features using large scale unsupervised learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8595–8598.