Identification of Invariant Sensorimotor Structures as a Prerequisite for the Discovery of Objects
Perceiving the surrounding environment in terms of objects is useful for any general purpose intelligent agent. In this paper, we investigate a fundamental mechanism making object perception possible, namely the identification of spatio-temporally invariant structures in the sensorimotor experience of an agent. We take inspiration from the Sensorimotor Contingencies Theory to define a computational model of this mechanism through a sensorimotor, unsupervised and predictive approach. Our model is based on processing the unsupervised interaction of an artificial agent with its environment. We show how spatio-temporally invariant structures in the environment induce regularities in the sensorimotor experience of an agent, and how this agent, while building a predictive model of its sensorimotor experience, can capture them as densely connected subgraphs in a graph of sensory states connected by motor commands. Our approach is focused on elementary mechanisms, and is illustrated with a set of simple experiments in which an agent interacts with an environment. We show how the agent can build an internal model of moving but spatio-temporally invariant structures by performing a Spectral Clustering of the graph modeling its overall sensorimotor experiences. We systematically examine properties of the model, shedding light more globally on the specificities of the paradigm with respect to methods based on the supervised processing of collections of static images.
Humans flexibly interpret their rich sensorimotor experience of the world in terms of objects in the environment. In that respect, we assume that this ability to discover, identify, and manipulate objects is required for any general purpose intelligent robot. Despite great progress in object detection or classification in the last few years, the computer vision community still lacks a clear formalization of the problem of autonomous object identification by an artificial agent. Understanding the fundamental nature of objects and their perception is a core philosophical question that we do not pretend to fully address in this work. Rather, we focus on a specific property that we assume plays an important role in the above question: the spatio-temporal invariance of objects. More precisely, we propose to investigate a mechanism assumed to be fundamental for autonomous object perception, namely the unsupervised identification of invariant spatio-temporal structures in the sensorimotor flow of an agent.
Perception, and in particular artificial perception, is traditionally considered as a passive process in which the sensory state obtained through sensors is projected onto higher-level representations, which in turn inform higher-level cognitive processes which generate actions. This perspective has however been challenged by multiple philosophers and neuroscientists who claim that perceptive experience emerges from internal predictive modeling of the sensorimotor interaction with the environment [3, 4, 5, 6].
Our work fits in with such a predictive and sensorimotor description of perception. It is based on two prominent theories, namely the Sensorimotor Contingencies Theory (SMCT) and Predictive Coding [8, 9]. The former claims that perception is based not only on sensory information but also on the knowledge of regularities in the way an agent's actions can transform its sensory inputs. The latter suggests that the brain hierarchically builds a predictive model of the causes of its sensory experience. The two viewpoints align nicely when considering that regularities in the sensorimotor flow can be used as support for a predictive model.
In this framework, we focus on an elementary property of objects and we study how this property can be exploited to contribute to their discovery by extracting regularities in the sensorimotor experience of an artificial agent. Namely, we assume that objects have an intrinsic structure which is spatio-temporally invariant, and limited in space. In that respect, we assume on the one hand that the intrinsic properties of objects, such as shape, size, or appearance, are preserved across time and space. On the other hand, being limited in space simply means that the objects are smaller than the world explored by the agent.
This spatio-temporal stability of objects implies structure in the sensorimotor experience an agent has when interacting with them. This way, after observing one part of a known object, the agent can predict what would be observed on other parts of this object. For example, seeing one side of a tomato, it can predict what the other side of the tomato would look like, as put forward through the concept of perceptual presence. According to the SMCT, this property is constitutive of the experience of objects.
In this paper, we propose a minimalistic simulation in which an agent visually explores in a random way an environment containing spatio-temporally invariant structures. We assume having a spatio-temporally invariant structure is one generic property of objects, but it may not be the only one. Hence we refer to identifying these spatio-temporally invariant structures as identifying proto-objects in the rest of the paper. Admittedly, as long as the decisions of our agent are random and its actions only consist of visual exploration, our work could be considered from the perspective of pattern identification in signal processing. However, we present this work from an agent-based perspective for three reasons. First, in our framework, an agent is generically "that which acts": it is sufficient that it produces actions to be considered as an agent. Second, in Section 3.3.6, we investigate a case where the agent actively rotates objects in its environment. Third, the case of an agent deciding which future action is optimal according to a goal is an important step in our future work agenda.
In our simulations, the world explored by the agent can change in two ways. First, the proto-objects, while keeping their internal structure, can move randomly in the world, or even be introduced/removed. Second, the rest of the environment can itself change randomly. Importantly, despite these changes, the world is statistically invariant enough so that the agent is able to partially explore it between two successive changes. This setup, illustrated in Figure 1, can intuitively be interpreted as having proto-objects that can move in the environment, and can be encountered in different contexts. Our model is minimalistic in the sense that we assume no prior knowledge on the world or on the agent itself, neither on its spatial structure, on the environment structure, nor on the proto-objects. The naive agent follows a random exploration policy, and interacts in a generic way with an external environment through an interface of uninterpreted sensorimotor information [13, 14]. In line with Predictive Coding, we propose a method for the agent to build a sensorimotor predictive model of its exploratory experience, and to identify the sensorimotor regularities induced by the proto-objects. More precisely, we model the sensorimotor experience as a weighted multigraph in which the nodes correspond to sensory states, and each pair of states is linked by several edges representing different motor commands. The weight of each edge corresponds to the conditional probability of the corresponding sensorimotor transition. Regularities in the sensorimotor interaction with the environment should then appear as stronger connections between some pairs of nodes. In particular, we hypothesize that the presence of proto-objects should induce the presence of some densely intra-connected subgraphs that the agent can identify as its own experience of these proto-objects. 
The representation of these regularities can then be used by the agent for counterfactual prediction, which makes the identification of proto-objects a worthy objective.
The paper is organized as follows. In Section 2, we describe a simple simulation to illustrate the approach, as well as a computational method to identify the sensorimotor regularities induced by proto-objects. In Section 3, the results produced by the method applied to the simulated system are thoroughly presented. Additional experiments are also designed to highlight the properties and limitations of the approach. Finally, in Section 4, we discuss the benefit of our paradigm with regards to the perception of objects. We also consider the future steps that would extend the current illustrative simulation towards more complex and realistic setups. This work is a direct extension to the preliminary results presented in [15, 16].
In this section, we introduce a simplistic simulation in which an agent explores an environment containing proto-objects. We then propose a method to process its sensorimotor experience and identify the regularities induced by these structures.
The simulation we propose consists of an agent exploring an environment containing proto-objects. The environment is a two-dimensional square gridworld of fixed size, made of discrete elements, or "pixels". Each pixel can take values in a fixed discrete set, and is initialized randomly at the beginning of the simulation.
At each time step, the environment can change with a fixed probability, in which case the values of all its pixels are randomly redrawn. At the beginning of the simulation, proto-objects are created in the environment. They correspond to sets of contiguous pixels drawn from the same distribution as pixels of the environment, but which keep the same internal structure during the whole simulation. Their size is bounded between a minimum and a maximum, and they do not necessarily have a square shape, as illustrated in Figure 1.
During the simulation, the proto-objects are moved in the environment with a fixed probability at each time step. Furthermore, they can be independently removed from the environment with another fixed probability.
If present, the proto-objects' pixels occlude those of the environment, and also potentially occlude each other, as illustrated in Figure 1.
Note that an agent cannot distinguish proto-objects from the environment simply based on a single sensory input, since the pixels that constitute them are drawn from the same distribution. They only differ in the spatio-temporal consistency that proto-objects maintain in contrast with the environment during the simulation.
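The world dynamics described above can be sketched as follows. All parameter values and function names are illustrative placeholders, not the exact values used in our simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only.
WORLD = 20       # side length of the square gridworld
N_VALUES = 10    # number of discrete pixel values
P_ENV = 0.05     # probability of redrawing the whole environment
P_MOVE = 0.05    # probability of moving a proto-object
P_ABSENT = 0.1   # probability of a proto-object being absent

def make_proto_object(min_size=2, max_size=4):
    """A proto-object: a small patch of pixels drawn from the same
    distribution as the environment, plus a (mutable) position."""
    h, w = rng.integers(min_size, max_size + 1, size=2)
    return {"pixels": rng.integers(0, N_VALUES, size=(h, w)),
            "pos": rng.integers(0, WORLD - max_size, size=2)}

def render(env, objects):
    """Proto-object pixels occlude the environment (and each other,
    in list order)."""
    frame = env.copy()
    for obj in objects:
        if obj.get("absent"):
            continue
        r, c = obj["pos"]
        h, w = obj["pixels"].shape
        frame[r:r + h, c:c + w] = obj["pixels"]
    return frame

def step(env, objects):
    """One world update: possibly redraw the environment, move the
    proto-objects, and toggle their presence."""
    if rng.random() < P_ENV:
        env = rng.integers(0, N_VALUES, size=env.shape)
    for obj in objects:
        if rng.random() < P_MOVE:
            obj["pos"] = rng.integers(0, WORLD - 4, size=2)
        obj["absent"] = rng.random() < P_ABSENT
    return env
```

Note that, as in the simulation, a rendered frame gives no pixel-level cue distinguishing a proto-object from the background: only the internal structure of the patches persists across updates.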
The agent observes this two-dimensional world with a limited sensor, a small patch window through which it receives sensory inputs. It can move its sensor anywhere in the environment using motor commands. At each time step, the sensorimotor input of the agent contains a sensory input (a 9-dimensional vector of pixels) and a motor command (a 2-dimensional vector representing the horizontal and vertical components of the sensor displacement in the visual scene). Together with the sensory input experienced after performing the motor command, this sensorimotor experience forms a sensorimotor transition triplet.
At each time step of the simulation, the agent moves its sensor by randomly picking a new position in the environment (possibly the same as the current one), and stores the experienced sensorimotor triplet. Since the environment and the proto-objects change less often than the agent's sensor position, the agent can statistically explore their content over several time steps and extract the regularities they induce.
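The random exploration policy can be sketched as follows, assuming a 3×3 sensor; the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def explore(frames, patch=3):
    """Random exploration: at each step the sensor jumps to a uniformly
    random position and the (s_t, m_t, s_{t+1}) triplet is recorded.
    `frames` is an iterable of world states, one per time step."""
    triplets = []
    prev_sensation, prev_pos = None, None
    for frame in frames:
        side = frame.shape[0] - patch
        pos = rng.integers(0, side + 1, size=2)           # new sensor position
        sensation = frame[pos[0]:pos[0] + patch,
                          pos[1]:pos[1] + patch].ravel()  # 9-dim sensory input
        if prev_sensation is not None:
            motor = pos - prev_pos                        # 2-dim displacement
            triplets.append((prev_sensation, tuple(motor), sensation))
        prev_sensation, prev_pos = sensation, pos
    return triplets
```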
2.2 Processing method
We now describe the way the agent processes its sensorimotor experience in order to identify proto-objects in the environment. First, the data are compacted by a clustering step. Then, the sensorimotor transitions are stored in a three-dimensional tensor, representing a statistical model of the agent’s sensorimotor experience. This tensor is analyzed to extract densely connected subgraphs.
2.2.1 Storing of the sensorimotor experience
The agent interacts with the environment during a fixed number of steps, and its sensorimotor experience is stored and processed off-line. We store the empirical conditional probabilities of each sensorimotor triplet experienced by the agent in a three-dimensional tensor. In this tensor, the sensory states experienced before and after a transition correspond to the row and the column respectively, while a one-dimensional encoding of the movement performed during the transition corresponds to the depth. However, in order to limit the size of the tensor and the computational cost of the simulation, the representation of the sensory experience is compacted beforehand by clustering together similar sensations. We use a simple k-means algorithm to perform this clustering, where the number of clusters is set arbitrarily. These clusters group together the sensory inputs considered by the agent to build its predictive model, as illustrated in Figure 2. In the following, the resulting centroids produced by the k-means clustering algorithm are called "states". The number of possible movements the agent can perform in this environment is finite. Thus, the size of the tensor is the number of states, squared, times the number of possible movements.
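A minimal sketch of this step, using scikit-learn's KMeans; the state count and normalization details here are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_tensor(triplets, n_states=50):
    """Compact sensations with k-means, then fill a (state, state, motor)
    tensor with empirical conditional probabilities P(s' | s, m)."""
    sensations = np.array([s for s, _, _ in triplets] +
                          [s2 for _, _, s2 in triplets])
    km = KMeans(n_clusters=n_states, n_init=4, random_state=0).fit(sensations)
    n = len(triplets)
    pre = km.labels_[:n]        # state index of s_t
    post = km.labels_[n:]       # state index of s_{t+1}
    motors = sorted({m for _, m, _ in triplets})
    m_index = {m: k for k, m in enumerate(motors)}
    counts = np.zeros((n_states, n_states, len(motors)))
    for (s, m, s2), i, j in zip(triplets, pre, post):
        counts[i, j, m_index[m]] += 1
    # Normalize each (state, motor) slice into a conditional distribution.
    totals = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, totals, out=np.zeros_like(counts),
                      where=totals > 0)
    return probs, totals.squeeze(1), km, motors
```

The returned count matrix (state × motor) is kept alongside the probabilities, since the statistical-relevance filtering of the next section needs it.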
2.2.2 Densely connected subgraph identification
The tensor can be seen as an approximation of the weighted graph mentioned in Section 1, in which the weights of the multiple edges between two nodes are the conditional probabilities of the corresponding transition, labeled by the action. We want to identify densely connected subgraphs of sensory states in this graph.
To do so, we propose to use Spectral Clustering [17, 18], which requires the definition of a similarity between each pair of nodes in the graph.
Intuitively, two nodes will be considered similar if a transition between them is experienced with a high enough probability. In order to define the similarity, we first filter out transitions lacking statistical relevance, by discarding rows for which the corresponding movement has been performed fewer than a minimum number of times while experiencing the corresponding state. Then, among the remaining triplets, we keep the subset of sensorimotor transitions whose conditional probability exceeds a fixed threshold. We define the sensorimotor similarity between each pair of states from these retained transition probabilities.
Applying this method to all pairs of states, we derive the 2D sensorimotor similarity matrix. Finally, since similarities are usually defined for undirected graphs, we make the matrix symmetric by averaging it with its transpose. This procedure is formally summarized in Algorithm 1. We then apply Spectral Clustering to the graph defined by this similarity. Spectral Clustering is a graph clustering method that is often used when the relation between the nodes of the graph is quantified by a general measure of similarity, which is not necessarily a distance. To define the clusters, the eigenvectors of the Laplacian of the graph are computed. A change of representation is then performed by building new vectors from the components of the main eigenvectors of the Laplacian. A regular clustering is then performed in the space corresponding to these new vectors, yielding the final clusters. More details can be found in [17, 18].
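A possible instantiation of this procedure is sketched below. The exact aggregation of the retained transition probabilities is the one summarized in Algorithm 1; the averaging used here, like the threshold values, is an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def sensorimotor_similarity(probs, counts, n_min=5, theta=0.5):
    """Keep only statistically relevant, high-probability transitions,
    aggregate them into a state-state similarity, and symmetrize it.
    `probs` is the (state, state, motor) tensor of P(s' | s, m);
    `counts` is the (state, motor) matrix of occurrence counts."""
    n_states = probs.shape[0]
    sim = np.zeros((n_states, n_states))
    for i in range(n_states):
        for j in range(n_states):
            kept = [probs[i, j, m] for m in range(probs.shape[2])
                    if counts[i, m] >= n_min and probs[i, j, m] >= theta]
            if kept:
                sim[i, j] = np.mean(kept)
    return (sim + sim.T) / 2  # undirected similarity

def cluster_states(sim, n_clusters):
    """Spectral Clustering on the precomputed similarity matrix."""
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(sim)
```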
2.2.3 Extracting the number of clusters
Spectral Clustering requires the specification of the returned number of clusters. Since we wish to introduce as little supervision as possible in our algorithm, we propose to determine it automatically. There is no universal criterion to determine the relevant number of clusters in a general situation, and most criteria are heuristics. We propose to use the cut gap criterion. The cut gap is identified by finding a knee in the curve of the normalized cut as a function of the number of clusters. Consider a graph clustered into $K$ clusters, forming a clustering denoted $\mathcal{C} = \{C_1, \dots, C_K\}$. Given $\mathcal{C}$ and a graph similarity $S(i,j)$ between each pair of nodes $i$ and $j$, the normalized cut $\mathrm{Ncut}(\mathcal{C})$ is a measure of the quality of $\mathcal{C}$. The lower $\mathrm{Ncut}(\mathcal{C})$, the better the clustering: if it is very low, it means that the clusters are very weakly connected to each other. It is defined as:
\[
\mathrm{Ncut}(\mathcal{C}) = \sum_{k=1}^{K} \frac{\mathrm{cut}(C_k, \bar{C}_k)}{\mathrm{vol}(C_k)},
\]
where the numerator $\mathrm{cut}(C_k, \bar{C}_k) = \sum_{i \in C_k} \sum_{j \notin C_k} S(i,j)$ is the cut between cluster $C_k$ and the rest of the graph, a measure of the strength of the connection between $C_k$ and its complement. The denominator $\mathrm{vol}(C_k) = \sum_{i \in C_k} \sum_{j} S(i,j)$ is the degree of $C_k$, which represents the "weight" of the cluster in the graph. Having low cut terms encourages clusters to be weakly interconnected, while having high degree terms favors large clusters, which prevents the criterion from yielding trivial isolated outliers as clusters. Thus, the normalized cut leads to a compromise between these two tendencies. In order to find the optimal number of clusters $K^*$, we automatically detect the largest $K$ which still leads to a low $\mathrm{Ncut}$. To do so, we compute the second order finite difference of $\mathrm{Ncut}$ as a function of $K$,
\[
\Delta^2 \mathrm{Ncut}(K) = \mathrm{Ncut}(K+1) - 2\,\mathrm{Ncut}(K) + \mathrm{Ncut}(K-1),
\]
and we take the value that yields the maximum result, that is $K^* = \arg\max_K \Delta^2 \mathrm{Ncut}(K)$. Thus, the minimal value that can be returned by this criterion is 2.
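The cut gap criterion can be sketched as follows; `cluster_fn` stands for any clustering routine (e.g. Spectral Clustering) returning labels for a given number of clusters:

```python
import numpy as np

def normalized_cut(sim, labels):
    """Ncut of a clustering on a graph given by a similarity matrix:
    sum over clusters of cut(C_k, complement) / vol(C_k)."""
    total = 0.0
    for k in np.unique(labels):
        mask = labels == k
        cut = sim[mask][:, ~mask].sum()   # links leaving the cluster
        vol = sim[mask].sum()             # total degree of the cluster
        if vol > 0:
            total += cut / vol
    return total

def cut_gap(sim, cluster_fn, k_max=8):
    """Pick K maximizing the second-order finite difference of Ncut(K),
    i.e. the knee of the curve. Returns at least 2."""
    ks = list(range(2, k_max + 1))
    ncuts = [normalized_cut(sim, cluster_fn(sim, k)) for k in ks]
    diffs = [ncuts[i + 1] - 2 * ncuts[i] + ncuts[i - 1]
             for i in range(1, len(ncuts) - 1)]
    return ks[1 + int(np.argmax(diffs))]
```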
2.2.4 Visualizing predictions from the tensor
We can also use the tensor as a predictive model of the agent's sensorimotor experience. When it receives a certain sensory input, the agent can use the tensor to try to predict the next sensory input for each possible motor command. More formally, suppose the agent experiences a given state at time $t$. For each motor command, the agent has learned a conditional probability distribution on the next state, and it can use this distribution to make predictions.
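Prediction from the tensor then amounts to reading out the learned conditional distribution; a minimal sketch:

```python
import numpy as np

def predict(probs, state, motor_index):
    """Most likely next state and its probability under P(. | s, m).
    `probs` is the (state, state, motor) tensor of conditional
    probabilities built during exploration."""
    dist = probs[state, :, motor_index]
    return int(np.argmax(dist)), float(dist.max())
```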
However, the visualization of the predictive model is non-trivial, since the predictions associated with two distinct motor commands may overlap each other, given that the receptive field of the agent is made of several pixels. In order to illustrate some predictions below, we use a mixture of the distributions, in the following way. Let us consider a pixel position out of the scope of the agent's sensor. Given the size of the sensor, several motor commands predict the future value of this pixel. We manually average the predictions of these motor commands by computing the weighted average of the states predicted with most certainty by each of these movements. This process requires some knowledge of the sensorimotor structure of the agent, but it is used for illustration purposes only, and not by the agent itself.
We now present and analyze the experimental results of the simulations. We then explore alternative setups where the performance of the algorithm is more variable, in order to illustrate the robustness of the approach, but also its limitations.
3.1 Subgraphs extracted from the predictive model
As a reminder, in the simulation, two proto-objects are placed in the environment, with fixed probabilities of movement of the proto-objects, of their absence, and of change of the environment. Figure 3 a) presents the normalized cut as a function of the number of clusters, as well as its second order finite difference. The curve presents a clear knee, at which the finite difference is maximal. Thus, 3 subgraphs have been identified through Spectral Clustering. This is a good result, since we expect 3 subgraphs to emerge: two subgraphs corresponding to the proto-objects and a third subgraph corresponding to the environment. Figure 3 b) shows the similarity matrix, whose rows and columns have been reordered to group together sensory states belonging to the same cluster. The color of each entry in the matrix corresponds to the similarity between the two states. Thus, we can visually see that two subgraphs are strongly connected and a third subgraph is weakly connected.
One can also remark that the probabilities on the diagonal of the matrix are higher than elsewhere. This reflects the stability of the world: there is always a non-zero probability that the agent will encounter two identical sensory states consecutively. Consider the probability that the agent receives the same sensory input after a time step, given that its sensor did not move. If neither of the proto-objects moved and the environment did not change, the agent will receive the same input. These conditions being independent but not simultaneously necessary, we can write a corresponding lower bound on this probability. This explains the high values found on the diagonal of the similarity matrix.
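With assumed notation (each of the two proto-objects moving with probability $p_{\text{move}}$ at each time step, the environment being redrawn with probability $p_{\text{env}}$, and $p_{\text{same}}$ denoting the probability of receiving the same sensory input after a time step without moving the sensor), the bound described above can be written as:

```latex
p_{\text{same}} \;\geq\; (1 - p_{\text{move}})^{2}\,(1 - p_{\text{env}})
```

The inequality is not an equality because the three independent events are sufficient but not necessary: for instance, a proto-object may move without affecting the input when it lies outside the sensor's field of view.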
3.2 Sensorimotor prediction
As explained in 2.2.4, the three-dimensional tensor built by the agent can be used as a predictive model of its sensorimotor experience. We illustrate it in Figure 3 c), for three input states. To clarify the visualization, the size of each pixel depends on the probability of the corresponding prediction: the largest predicted pixels in the figure are the ones predicted with most certainty. In order to compare the prediction with a ground truth, we also show the ground-truth proto-objects introduced in the environment. If the current state was categorized in one of the densely connected clusters, the model successfully reconstructs the total structure of the corresponding proto-object from the small patch it receives: this is the case for the first two states. On the contrary, for a state categorized in the third, weakly connected cluster, the model predicts no future sensory state with certainty: this happens for the third state. For such a naïve agent, the objects are thus structures that allow sensorimotor prediction.
3.3 Additional experiments
We propose additional experiments to illustrate the properties and limits of the simulation, the overall approach, and the computational method letting the agent discover proto-objects from its sensorimotor flow.
3.3.1 Importance of the motor flow
In order to illustrate the importance of taking the motor commands into account for discovering proto-objects, we apply a similar processing to the experience of the agent, where motor commands are not recorded. Instead of the sensorimotor similarity, we derive through Algorithm 1 a sensory similarity, based on the probability of transitioning from one state to another regardless of the motor command. Spectral Clustering of this similarity matrix leads to the results shown in Figure 4. The agent is no longer able to detect the correct number of clusters. No clear knee in the cut curve is detected, and the cut gap criterion returns two clusters, which is the default outcome of our method when it does not find a cut. In Figure 4 b), we see that this "sensory" similarity matrix, even reorganized, does not display any densely connected subgraph. In our setting, the agent is thus unable to extract the structure of proto-objects without using its motor commands.
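This control similarity simply marginalizes the transition tensor over the motor dimension; a minimal sketch (the uniform weighting over motor commands is an assumption):

```python
import numpy as np

def sensory_similarity(probs):
    """Marginalize the (state, state, motor) tensor over motor commands,
    yielding an approximation of P(s' | s) irrespective of the action,
    then symmetrize it as for the sensorimotor similarity."""
    p = probs.mean(axis=2)   # average over the motor axis
    return (p + p.T) / 2
```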
3.3.2 Influence of the number of proto-objects
We now propose to study the influence of the meta-parameters of the simulation on the results. We first investigate the impact of the number of proto-objects introduced in the environment on the identification of the densely connected subgraphs. The results are shown in Figure 5 a). For values up to 4, the number of proto-objects is correctly estimated and the clusters are well defined and densely connected. As the number of proto-objects increases, it becomes harder to detect the correct number of proto-objects. If the number is very large, the sensorimotor experience of the agent contains too much randomness and is poorly predictable, since the proto-objects constantly occlude each other in a random order. As a consequence, the probability of consistently experiencing the sensorimotor regularities associated with a given proto-object becomes very low. Here, we see that for larger numbers of proto-objects, the Spectral Clustering algorithm does not yield well defined clusters. Note that if the environment were bigger, this overlapping problem would arise for a greater number of proto-objects. It must also be noted that our simulation is not sophisticated enough to deal with object occlusions in a consistent way, as a 3D simulation taking the perspective of the agent into account would do. Better dealing with these occlusion issues is left for future work, as it requires tackling more difficult questions about memory and the perception of space.
3.3.3 Influence of the probabilities of movement, absence, and environment change
We investigate the impact of the probability of displacement of the proto-objects on the result of the clustering. We run the simulation for several values of this probability, and we show the results in Figure 5 b). The difficulty of proto-object discovery increases with the probability of movement. This result is expected because the discovery of proto-objects depends on the probabilities of the sensorimotor regularities implied by their structure. These regularities vanish when the expected structure cannot be statistically differentiated from randomness, which happens when the proto-objects never keep the same position between time steps. Intuitively, this means that if the world around us were to change constantly, we would not be able to discover objects.
We also investigate the impact of the probability of updating the environment on the result of Spectral Clustering. The simulation is run with this probability ranging over several values, with results presented in Figure 5 c). When it is high, the diagonal of the matrix no longer contains high probabilities, since the environment changes too frequently. Although convenient for our simulation, this setup is not realistic considering our own sensorimotor experience, where an environment with no spatio-temporal structure at all is rarely encountered. Another special case arises when this probability is zero, which means that the environment never changes. Then, the sensorimotor experience while interacting with the environment is completely predictable, and the environment should be identified as a third proto-object, as illustrated in the first column of Figure 5 d). Figure 6 shows sensorimotor predictions for this setup. Since the environment never changes, this specific setup highlights sensory ambiguity as one potential limitation of the simulation. Indeed, it is possible for a sensory state to appear in multiple proto-objects, or multiple times in a single object, making it ambiguous. The probability of such a situation is low in the standard setup of the simulation due to the limited size of the proto-objects. However, when the environment never changes, it appears as one big proto-object, which significantly increases the probability of encountering ambiguous sensory states. Spectral Clustering is robust to this kind of ambiguity, as it assigns the sensory state to one cluster only, but we can see in the third panel of Figure 6 that ambiguity can interfere with sensorimotor prediction. Indeed, the reference sensory input seems to appear twice in the constant environment.
As a consequence, the sensory prediction of the agent is a mixture of two contributions that overlap. The pixels which correspond to an ambiguous prediction are highlighted in pink. To disambiguate such a situation, the agent would need a memory, or a way to hierarchically extract contexts from its sensorimotor experience, as has been proposed in the literature.
Finally, we analyze the effect of varying the probability of the proto-objects being absent from the environment. To do so, we run the simulation with varying values of this probability and show the results in Figure 5 d). Intuitively, the identification of densely connected subgraphs is easier when the proto-objects are present at each time step. On the contrary, it becomes harder when the probability of absence is high, since the sensorimotor regularities associated with the proto-objects are encountered with less consistency.
3.3.4 Rigidly linked proto-objects
Here we illustrate a property of our definition of proto-objects as spatio-temporally invariant structures. We run a simulation where two proto-objects are rigidly linked: they move together and thus keep their relative spatial position constant during exploration. In Figure 7, we see that the agent extracts only one densely connected subgraph. This experience of the agent with two linked proto-objects is thus interpreted as an interaction involving a single proto-object, as the agent extracts a single densely connected graph of sensorimotor transitions. Indeed, the agent looks for sensorimotor regularities without having a notion of spatial contiguity. Thus it does not distinguish the two components of the linked proto-objects. Intuitively, this suggests that if we were to live in a world where objects are made of several rigidly linked but disconnected parts, we might interpret them as single entities.
3.3.5 Identical proto-objects
We investigate the special case where both proto-objects in the environment are identical instances of the same proto-object. We run the standard simulation where the second proto-object is a copy of the first one, and show the results in Figure 8. The agent extracts a single densely connected subgraph. This is expected, as the agent cannot separate the sensory inputs coming from one instance of the proto-object from the inputs coming from the other instance. The method can only distinguish types of proto-objects, but not identical instances. A possible solution to separate inputs coming from different instances would be to give the agent a memory and a notion of position in the environment, which it does not currently have.
3.3.6 Agent rotating the proto-objects
An important aspect of our approach is that the extraction of proto-objects from the environment should not depend on their visual appearance, which means that it does not depend on their pattern of pixels. Additionally, the actions performed by the agent could be of any nature, meaning they are not limited to sensor movements. In order to illustrate these properties, we run a simulation where the agent can move its sensor and also rotate the proto-objects. This action has no effect on the pixels of the environment, but has the consequence of rotating both proto-objects by 90 degrees. Thus, such a rotation changes the appearance of the proto-objects, and the set of sensory inputs that the agent can receive by interacting with them is larger than when it cannot rotate them. Results of this simulation are presented in Figure 9. After exploration and processing of the sensorimotor data, two densely connected subgraphs are still correctly extracted from the experience of the agent. However, it appears that the clusters are slightly less densely connected than in previous simulations. This might come from the k-means clustering step, since the sensory inputs are distributed differently in the input space, and from the larger number of possible movements.
This shows that even if the agent performs non-spatial actions, it can still extract the structure induced in its sensorimotor flow by the presence of invariant proto-objects. More generally, any type of action could be used to learn structure in the interaction with the world, as long as its effect on the sensory flow of the agent generates some statistical regularities; changing the light projected onto the global scene, resulting in different pixel values, is one such example.
3.3.7 Small proto-objects
We propose a last simulation in which the proto-objects are smaller than the receptive field of the agent. Results are shown in Figure 10. No densely connected subgraph is detected by the agent, and it is not able to predict pixels outside the scope of its own receptive field. Since the proto-objects are smaller than the receptive field, the states obtained after the k-means clustering cannot represent the proto-objects accurately, because they also represent pixels that come from the randomly changing environment. Thus, it is likely that these states mix together sensory inputs coming from proto-objects with sensory inputs coming from the environment. Hence, the sensorimotor structure induced by the presence of proto-objects in the world is blurred. Thus, proto-objects smaller than the receptive field cannot be discovered by the agent. A possible way to overcome this limitation could be to consider a set of smaller receptive fields and to process them collectively. This is left for future work.
In this work, we addressed object discovery from a sensorimotor perspective. Taking inspiration from the SMCT and predictive coding, we defined proto-objects as spatio-temporally invariant structures that an autonomous agent can detect through regularities in its sensorimotor experience when interacting with its environment. More precisely, the agent discovers such proto-objects by collecting sensorimotor transitions and clustering together sensory states according to a sensorimotor similarity, which we derived from a statistical analysis of those transitions. We illustrated the method by applying it to simplistic simulations and outlined some limitations. We now discuss the specificities of our approach with respect to the standard computer vision paradigm and other related work, highlight some key properties of the model, and point to future work addressing the limitations we identified.
4.1 Specificities of the paradigm
In the standard computer vision paradigm, the problem of object identification is generally tackled in a supervised way by training a representation learning algorithm, for instance Deep Convolutional Neural Networks . These algorithms are trained on a large database of static images containing objects, where the identity of each object is provided as a label (see for instance the well-known ImageNet database ). Labelling such databases requires a large human effort, which can be mitigated by using semi-supervised or transfer learning approaches, without fundamentally changing the underlying object perception paradigm. In this paradigm, identifying an object consists of extracting, from a collection of static images of the same object, an invariant set of visual features sufficient to discriminate this object from any other. From an engineering point of view, this paradigm is quite efficient, as it provides a working solution for many concrete applications. From a more fundamental standpoint, it captures some important aspects of perception in terms of invariant visual features which are not captured in our work. But this paradigm comes with some issues, as revealed for instance by failures on adversarial examples . Another well-known issue is that relying on external labels restricts the agent to recognizing the objects present in the database. In that respect, using unsupervised learning methods is mandatory if one wishes to design a truly autonomous learning agent. Our work reveals a third issue: any approach processing static images individually cannot extract any object from our simulations, since the distribution of pixel values is the same in proto-objects and in the environment. Thus, in our work, we are not interested in the visual features characterizing the appearance of an object, but rather in its spatio-temporal consistency.
Our approach thus focuses on a property of objects that is orthogonal to the one captured by the standard computer vision paradigm. Instead of extracting discriminative spatial features from static images, we extract spatio-temporally invariant patterns from the sensorimotor flow of the agent. Our approach has several assets. First, it is unsupervised, as opposed to most approaches to object detection and classification outlined above. The agent relies neither on externally provided labels nor on rewards, and does not solve a specific task. It discovers the presence of proto-objects, fundamentally driven by the prediction of its sensorimotor experience, and without knowing the structure of these proto-objects in advance. The agent has no prior knowledge of its sensory structure, of the environment, or even of the structure of the proto-objects: their number, sizes, shapes, appearances, and positions are all unknown.
Importantly, the interaction of the agent with the environment does not have to be spatial: the actions performed do not have to be spatial displacements, such as the translations of the sensor used in the simulations, and the agent does not need to know its spatial position. More generally, the actions performed by the agent and their effects in the environment can be of any nature, as long as they remain consistent in time. As an example, we have shown in Figure 9 that the agent can extract proto-objects by performing actions that modify their appearance, namely by rotating them. Determining the class of actions that are necessary and sufficient to build an artificial object perception system following our approach is an important and open question, left for future work.
4.2 Related work
There are other approaches to the problem of artificial perception that exploit either unsupervised learning, the temporal information in the sensory flow of an agent, or the interaction between an agent and its environment.
Unsupervised learning algorithms typically capture statistical structure in the data in order to compress them, hopefully creating more abstract representations . Despite some interesting attempts around generative models , it is still unclear how such statistical methods applied to static images could lead to the development of a complete autonomous perceptual system. Moreover, although pretraining a neural network in an unsupervised way can bootstrap a supervised learning system , the representations built this way are, most of the time, interpreted a posteriori by a human.
Some implementations of unsupervised learning exploit the temporal link between two successive images. As an example, in , tracked patches in a video stream are constrained to have similar internal representations. In , the learned representations are used to predict future states, whereas in  they are used to define a probability over the trajectories of pixels in an image.
Other approaches to the problem of artificial perception claim that, in order to build a truly perceiving agent, it is essential to take its actions into account. Instead of exploiting a mere sensory flow, the actions performed by the agent are processed in parallel. These approaches have been gathered under the term Interactive Perception . Namely, the actions are used to learn representations consistent with ego-motion in , or to predict ego-motion from two successive images in . In , representations allowing the prediction of the next image conditioned on the agent’s action are learned, while the effect of a physical action on an object is learned in . Both motor and sensory information have also been considered to build state representations consistent with robotic priors in .
Our approach is in line with these three paradigms: we process the temporal information between sensory inputs and the interaction of an agent with its environment, through unsupervised learning and a drive for prediction. Compared to the works previously cited, the specificity of ours is that we focus on the identification of spatio-temporally invariant structures from the sensorimotor flow of the agent.
Learning sensorimotor transition triplets shares some similarity with learning triplets in the affordance learning literature [34, 35], but those triplets are learned on top of lower-level modules extracting independent visual features for objects, effects, and possibly actions. In that respect, our positioning is more radical than most works in this literature, since we do not call upon such low-level feature extraction.
Other attempts have also been made to propose a computational model allowing for a sensorimotor grounding of knowledge for an artificial agent. One of the works closest to ours in its investigation of the nature of perception is . In this work, the authors try to demonstrate how a naive agent may extract useful concepts from its sensorimotor experience. However, their concept learning framework assumes that there exists a separate reward function for each concept, an assumption that we consider too strong. In earlier works investigating the sensorimotor grounding of knowledge, such as  and , an external reinforcement signal was also used. In , a large amount of semantics is associated a priori with the actions performed by the agent and with its sensory stimulation, putting this work at a different level of abstraction. Finally, in , an agent uses a model of its sensorimotor interaction with the world in order to optimise both its own structure (the parameters of its body) and the model itself . However, this work does not propose a mechanism to process the sensorimotor flow of the agent in order to build more abstract knowledge, as our agent does when learning to identify proto-objects as subgraphs in its overall sensorimotor experience. In , an agent learns to predict the effect of its actions on its sensorimotor flow, depending on previous actions and states, learning a model which is very similar to ours. However, while the agent can learn by random exploration, the experimental setup contains no randomness, and the actions performed by the agent and its sensory inputs are defined at a more abstract level than ours. Importantly, clustering together sensory states to identify proto-objects is absent from these works, and robustness to randomness in the environment is not studied.
4.3 Limitations and future work
Despite its versatility, the approach presented in this paper also suffers from several limitations. As revealed in the experiments, our algorithm cannot handle the case where proto-objects are smaller than the receptive field. In the real world, however, objects typically appear smaller than our whole field of view. As a consequence, instead of a single receptive field, several elementary receptive fields could be combined to define a visual field, as is the case in our own visual system. This should also open possibilities to tackle the problem of distinguishing multiple instances of the same proto-object, and to reduce the ambiguity of a visual scene. Some preliminary results in this direction have already been published . Instead of considering a small sensor moving in the environment, one could also imagine a larger sensor with an attention mechanism focusing on a small part of it.
Besides, the implementation presented here was intended to illustrate the fundamental mechanisms making it possible to extract proto-objects, but would not scale to a more realistic setting. In a real-life context, the quantization of the sensorimotor experience of the agent would require a far larger amount of memory and computation time, making the method intractable. A more relevant way to process the sensorimotor experience might require an algorithm able to process the sensorimotor data directly, without a preliminary k-means clustering stage. It should be clear from Section 4.1 that combining some properties of the standard computer vision paradigm with ours is a promising path toward addressing the discovery of objects in real-world environments. As an immediate example, a neural network taking the sensory states and motor commands as input and predicting the next sensory input could be a promising alternative to the initial k-means clustering stage. However, given their very different natures and underlying assumptions, combining both paradigms into a more general framework is a difficult problem which will require careful examination in the future. In addition, learning a more compact representation of the sensorimotor experience, with a tool such as a deep neural network instead of a graph, might make it possible to compare our approach to common benchmarks used in the computer vision community.
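A minimal sketch of such a predictive forward model, under strong simplifying assumptions of our own (a single linear layer trained with plain numpy, placeholder dimensions, and synthetic linear world dynamics rather than any real sensorimotor data), could look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: 8-pixel sensory state, 2-dim motor command.
# The simulated "world" applies fixed linear dynamics that the model
# should recover from sensorimotor transitions.
W_true = 0.5 * rng.normal(size=(10, 8))

S = rng.normal(size=(500, 8))   # sensory states s_t
M = rng.normal(size=(500, 2))   # motor commands m_t
X = np.hstack([S, M])           # concatenated inputs (s_t, m_t)
Y = X @ W_true                  # next sensory states s_{t+1}

# One linear layer trained by gradient descent on the squared
# prediction error, i.e. a drive for sensorimotor prediction.
W = np.zeros((10, 8))
lr = 0.05
for _ in range(1000):
    grad = X.T @ (X @ W - Y) / len(X)
    W -= lr * grad

print(np.abs(W - W_true).max())  # near 0: the dynamics are recovered
```

In a realistic setting the linear layer would be replaced by a deep network and the dynamics would be nonlinear and stochastic, but the training signal, predicting the next sensory input from the current state and motor command, is the same.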
Finally, when the complexity of the problem increases, or in order to deal with locally ambiguous sensory inputs, a hierarchical processing of the experience might be necessary. On the one hand, it has been shown that the hierarchical processing of information is probably one of the reasons for the success of deep networks [43, 22]. On the other hand, from a more biological point of view, it has been shown that biological brains are organized hierarchically , while the reasons for this hierarchical processing have been investigated but remain subject to debate [45, 46]. A proto-object, or even an object, could then be detected through a hierarchy of features. This approach could also be followed to tackle the problem of ambiguity, by exploiting sequences of transitions to define contexts, instead of exclusively exploiting instantaneous transitions .
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Author Contributions
NLH implemented the model, designed and performed the experiments, and wrote the paper. AL and OS designed the experiments and wrote the paper.
Funding
This work was supported by the Association Nationale Recherche Technologie (ANRT) through a CIFRE convention (contract №2016/0946).
-  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. arXiv preprint arXiv:1506.02640v5, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  Hermann Helmholtz. Handbuch der physiologischen Optik. Monatshefte für Mathematik und Physik, 7(1):A60–A61, 1896.
-  J J Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin- Boston, 1979.
-  Karl Friston, James Kilner, and Lee Harrison. A free energy principle for the brain. Journal of Physiology Paris, 100(1-3):70–87, 2006.
-  Andy Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204, 2013.
-  Kevin O’Regan and Alva Noë. A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24:939–1031, 2001.
-  Rajesh P N Rao and Dana H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.
-  Rajesh P N Rao and Dana H. Ballard. Probabilistic models of attention based on iconic representations and predictive coding. Neurobiology of Attention, (July):553–561, 2005.
-  Anil K Seth. A predictive processing theory of sensorimotor contingencies: Explaining the puzzle of perceptual presence and its absence in synesthesia. Cognitive neuroscience, 5(2):97–118, 2014.
-  J. Kevin O’Regan and Alva Noë. What it is like to see: A sensorimotor theory of perceptual experience. Synthese, 129(1):79–103, 2001.
-  A. K. Jain, R. P. W. Duin, and Jianchang Mao. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, Jan 2000.
-  Donald D Hoffman. The Interface Theory of Perception: Natural Selection Drives True Perception To Swift Extinction. Scientific reports, 24(1):1–1, 2015.
-  Ensor Rafael Palacios, Adeel Razi, Thomas Parr, Michael Kirchhoff, and Karl J. Friston. Biological self-organisation and Markov blankets. bioRxiv, pages 1–21, 2017.
-  Alban Laflaquière and Nikolas Hemion. Grounding object perception in a naive agent’s sensorimotor experience. 5th Joint International Conference on Development and Learning and Epigenetic Robotics, ICDL-EpiRob 2015, pages 276–282, 2015.
-  Nikolas J. Hemion. Context discovery for model learning in partially observable environments. In 2016 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics, ICDL-EpiRob 2016, pages 294–299, 2017.
-  Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
-  Marina Meila. Spectral Clustering: a Tutorial for the 2010’s. Handbook of cluster analysis, page 753, 2015.
-  Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, and Gang Wang. Recent Advances in Convolutional Neural Networks. arXiv preprint arXiv:1512.07108v6, pages 1–14, 2015.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013.
-  Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised Visual Representation Learning by Context Prediction. arXiv preprint arXiv:1505.05192v3, pages 1422–1430, 2015.
-  Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Sammy Bengio. Why Does Unsupervised Pre-training Help Deep Learning ? Journal of Machine Learning Research, 11:625–660, 2010.
-  Xiaolong Wang and Abhinav Gupta. Unsupervised Learning of Visual Representations Using Videos. arXiv preprint arXiv:1505.00687v2, 2015.
-  Carl Vondrick, Antonio Torralba, and Hamed Pirsiavash. Anticipating Visual Representations from Unlabeled Video. arXiv preprint arXiv:1504.08023v2, 2015.
-  Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 9911 LNCS, pages 835–851, 2016.
-  Jeannette Bohg, Karol Hausman, Bharath Sankaran, Oliver Brock, Danica Kragic, Stefan Schaal, and Gaurav Sukhatme. Interactive Perception: Leveraging Action in Perception and Perception in Action. arXiv preprint arXiv:1604.03670, pages 1–18, 2016.
-  Dinesh Jayaraman and Kristen Grauman. Learning image representations equivariant to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1413–1421, 2015.
-  Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, volume 2015 Inter, pages 37–45, 2015.
-  Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games. arXiv preprint arXiv:1507.08750, 2015.
-  Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-lae Park, and Abhinav Gupta. The Curious Robot: Learning Visual Representations via Physical Interactions. arXiv preprint arXiv:1604.01360v2, 2016.
-  Rico Jonschkowski and Oliver Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.
-  L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor. Learning object affordances: From sensory–motor coordination to imitation. IEEE Trans. on Robotics, 24(1):15–26, 2008.
-  Philipp Zech, Simon Haller, Safoura Rezapour Lakani, Barry Ridge, Emre Ugur, and Justus Piater. Computational models of affordance in robotics: a taxonomy and systematic classification. Adaptive Behavior, 25(5):235–271, 2017.
-  Nicholas Hay, Michael Stark, Alexander Schlegel, Carter Wendelken, Dennis Park, Eric Purdy, Tom Silver, D. Scott Phoenix, and Dileep George. Behavior is everything–towards representing concepts with sensorimotor contingencies. In AAAI, 2018.
-  Marco Dorigo and Marco Colombetti. Robot shaping: developing autonomous agents through learning. Artificial Intelligence, 71(2):321–370, 1994.
-  Christian Scheier and Rolf Pfeifer. Classification as Sensory-Motor Coordination: A Case Study on Autonomous Agents. European Conference on Artificial Life, pages 657–667, 1995.
-  P R Cohen, M S Atkin, T Oates, and C R Beal. NEO: Learning Conceptual Knowledge by Sensorimotor Interaction with an Environment. Proceedings of the First International Conference on Autonomous Agents, pages 170–177, 1997.
-  Ralf Der, Ulrich Steinmetz, and Frank Pasemann. Homeokinesis-a new principle to back up evolution with learning. Computational intelligence for modelling, control, and automation, 55:43–47, 1999.
-  Alexander Maye and Andreas K. Engel. A discrete computational model of sensorimotor contingencies for object perception and control of behavior. Proceedings - IEEE International Conference on Robotics and Automation, pages 3810–3815, 2011.
-  Alban Laflaquière. Autonomous Grounding of Visual Field Experience through Sensorimotor Prediction. arXiv preprint arXiv:1608.01127, 2016.
-  Henry W. Lin, Max Tegmark, and David Rolnick. Why Does Deep and Cheap Learning Work So Well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
-  Dharmendra S Modha and Raghavendra Singh. Network architecture of the long-distance pathways in the macaque brain. Proceedings of the National Academy of Sciences, 107(30):13485–13490, 2010.
-  Antonio R Damasio. Time-locked multiregional retroactivation: A systems- level proposal for the neural substrates of recall and recognition. Cognition, 33(1):25–62, 1989.
-  Joaquin M. Fuster. The cognit: A network model of cortical representation. International Journal of Psychophysiology, 60(2):125–132, 2006.