
The evolution of representation in simple cognitive networks

Lars Marstaller, Arend Hintze & Christoph Adami

Department of Cognitive Science, Macquarie University, Sydney, Australia

Microbiology & Molecular Genetics, Michigan State University, East Lansing, MI

BEACON Center for the Study of Evolution in Action, Michigan State University, East Lansing, MI

Computer Science & Engineering, Michigan State University, East Lansing, MI
These authors contributed equally.

Keywords: Information, Representation, Cognition, Evolution

Abstract

Representations are internal models of the environment that can provide guidance to a behaving agent, even in the absence of sensory information. It is not clear how representations are developed and whether or not they are necessary or even essential for intelligent behavior. We argue here that the ability to represent relevant features of the environment is the expected consequence of an adaptive process, give a formal definition of representation based on information theory, and quantify it with a measure $R$. To measure how $R$ changes over time, we evolve two types of networks (an artificial neural network and a network of hidden Markov gates) to solve a categorization task using a genetic algorithm. We find that the capacity to represent increases during evolutionary adaptation, and that agents form representations of their environment during their lifetime. This ability allows the agents to act on sensorial inputs in the context of their acquired representations and enables complex and context-dependent behavior. We examine which concepts (features of the environment) our networks are representing, how the representations are logically encoded in the networks, and how they form as an agent behaves to solve a task. We conclude that $R$ should be able to quantify the representations within any cognitive system, and should be predictive of an agent’s long-term adaptive success.

## 1 Introduction

The notion of representation is as old as cognitive science itself (see, e.g., Chomsky, 1965; Newell and Simon, 1972; Fodor, 1975; Johnson-Laird and Wason, 1977; Marr, 1982; Pinker, 1989; Pitt, 2008), but its usefulness for Artificial Intelligence (AI) research has been doubted (Brooks, 1991). In his widely cited article “Intelligence without representation”, Brooks argued instead for a subsumption architecture where the autonomous behavior producing components (or layers) of the cognitive system directly interface with the world and with each other rather than with a central symbol processor dealing in explicit representations of the environment. In particular, inspired by the biological path to intelligence, Brooks argued that AI research needs to be rooted in mobile autonomous robotics and a direct interaction between action and perception. Echoing Moravec (1984), he asserted that the necessary elements for the development of intelligence are mobility, acute vision, and the ability to behave appropriately in a dynamic environment (Brooks, 1991). This architecture achieved insect-level intelligence and Brooks argued that a path to higher level AI could be forged by incrementally increasing the complexity of subsumption architecture.

However, 20 years after advocating such a radical departure from the classical approach to AI, the subsumption approach seems to have stalled as well. We believe that the reason for the lack of progress does not lie in the attempt to base AI research in mobile autonomous robots, but that instead representations (also sometimes called “internal models”; Craik, 1943; Wolpert et al., 1995; Kawato, 1999) are key to complex adaptive behavior. Indeed, while representation-free robotics has made some important strides (Nolfi, 2002), it is limited to problems that are not “representation hungry” (Clark, 1997), i.e., problems that do not require past information or additional (external) knowledge about the current context. In addition, the technical difficulty of developing a subsumption architecture increases with the number of layers or subsystems. This problem of subsumption architecture mirrors the difficulties that classic representational AI approaches face in building accurate and appropriate models of the world.

An alternative approach to engineering cognitive architectures and internal models is evolutionary robotics (Nolfi and Floreano, 2000). Instead of designing the structure or functions of a control architecture, principles of Darwinian evolution are used to create complex networks that interface perception and action in non-obvious and often surprising ways. Such structures can give rise to complex representations of the environment that are hard to engineer and equally hard to analyze (see, e.g., Floreano and Mondada, 1996). Evolved representations provide context, are flexible, and can be readjusted given new stimuli that contradict the current assumptions. Representations can be updated during the lifetime or over the course of evolution and thus are able to handle even new sensory input (Bongard et al., 2006). We argue that as robots evolve to behave appropriately (and survive) in a dynamic and noisy world, representations of the environment emerge within the cognitive apparatus, and are integrated with the perceived sensory data to create intelligent behavior, using not only the current state of the environment but crucially taking into account historical data (memory) as well.

To test this hypothesis and make internal representations of evolved systems accessible to analysis, we propose a new information-theoretic measure of the degree to which an embodied agent represents its environment within its internal states and show how the capacity to represent environmental features emerges over thousands of generations of simulated evolution. The main idea is that representations encode environmental features because of their relevance for the cognitive system in question (Clark and Toribio, 1994). Hence, for our purposes, representations can be symbolic or sub-symbolic (e.g., neural states) as long as they have a physical basis, i.e., as long as they are encoded in measurable internal states. However, we distinguish representations from sensorial input because sensor inputs cannot provide the same past or external context as internal states. We thus explicitly define representations as that information about relevant features of the environment which is encoded in the internal states of an organism and which goes beyond the information present in its sensors (Haugeland, 1991; Clark, 1997). In particular, this implies that representations can, at times, misrepresent (Haugeland, 1991), unlike information present in sensors, which always truthfully correlates with the environment. To illuminate the functioning of evolved cognitive systems, we show how it is in principle possible to determine what a representation is about, and how representations form during the lifetime of an agent. We argue that our measure provides a valuable tool to investigate the organization of evolved cognitive systems, especially in cases where internal representations are “epistemically opaque”.

## 2 Methods

### 2.1 Information-theoretic measure of representation

Information theory has been used previously to quantify how context can modulate decisions based on sensory input (Phillips et al., 1994; Phillips and Singer, 1997; Kay and Phillips, 2011). Here, we present an information-theoretic construction that explicitly takes the entropy of environmental states into account. To quantify representation, we first define the relationship between the representing system and the represented environment in terms of information (shared, or mutual, entropy). Information measures the correlation between two random variables, while the entropy is a measure of the uncertainty we have about a random variable in the absence of information (uncertainty is therefore potential information). For a random variable $X$ that can take on the states $x_i$ with probabilities $p(x_i)$, the entropy is given by (Shannon, 1948):

$$H(X) = -\sum_{i=1}^{N} p(x_i)\,\log p(x_i), \tag{1}$$

where $N$ is the number of possible states that $X$ can take on.

The information between two random variables characterizes how much the degree of order in one of the variables is predictive of the regularity in the other variable. It can be defined using entropy as the difference between the sum of the entropies of two random variables $X$ and $Y$ [written as $H(X)$ and $H(Y)$] and the joint entropy of $X$ and $Y$, written as $H(X,Y)$:

$$H(X:Y) = H(X) + H(Y) - H(X,Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}. \tag{2}$$

In Eq. (2), $p(x)$ and $p(y)$ are the probability distributions for the random variables $X$ and $Y$ respectively [that is, $p(x) = P(X = x)$], while $p(x,y)$ is the joint probability distribution of the (joint) random variable $XY$. The shared entropy $H(X:Y)$ can also be written in terms of a difference between unconditional and conditional entropies, as

$$H(X:Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). \tag{3}$$

This definition reminds us that information is that which reduces our uncertainty about a system. In other words, it is that which allows us to make predictions about a system with an accuracy that is higher than when we did not have that information. In Eq. (3), we introduced the concept of a conditional entropy (Shannon, 1948). For example, $H(X|Y)$ (read as “$H$ of $X$ given $Y$”) is the entropy of $X$ when the state of the variable $Y$ is known, and is calculated as

$$H(X|Y) = -\sum_{x,y} p(x,y)\,\log p(x|y), \tag{4}$$

using the conditional probability $p(x|y) = p(x,y)/p(y)$.
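As a quick numerical check of Eqs. (1) to (4), the following minimal sketch computes the entropy, the shared entropy, and the conditional entropy for a small joint distribution. The distribution itself (a fair bit and a noisy copy of it) and all function names are illustrative assumptions, not taken from the experiments.

```python
from collections import defaultdict
from math import log2

def entropy(p):
    """Shannon entropy in bits of a distribution {state: probability}, Eq. (1)."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

def marginal(joint, axis):
    """Marginal of a joint distribution {(x, y): probability} along one axis."""
    out = defaultdict(float)
    for states, q in joint.items():
        out[states[axis]] += q
    return dict(out)

def shared_entropy(joint):
    """H(X:Y) = H(X) + H(Y) - H(X,Y), Eq. (2)."""
    return entropy(marginal(joint, 0)) + entropy(marginal(joint, 1)) - entropy(joint)

def conditional_entropy(joint):
    """H(X|Y) = -sum_{x,y} p(x,y) log p(x|y), Eq. (4)."""
    p_y = marginal(joint, 1)
    return -sum(q * log2(q / p_y[y]) for (x, y), q in joint.items() if q > 0)

# Hypothetical example: Y is a noisy copy of a fair bit X.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h_x = entropy(marginal(joint, 0))
info = shared_entropy(joint)
h_x_given_y = conditional_entropy(joint)
```

In this toy case the two routes to the shared entropy, Eq. (2) and Eq. (3), agree to numerical precision.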

In general, information is able to detect arbitrary correlations between signals or sets of events. We assume here that such correlations instantiate semiotic or information relationships between a representing and a represented system, and use mutual information to measure the correlation between a network’s internal states and its environment [see also Marstaller et al. (2010)]. So, for example, we could imagine that $X$ stands for the states of an environment, whereas $Y$ is a variable that represents those states of the environment. We need to be careful, however, to exclude from possible representational variables those that are mere images of the environment, such as the trace that the world leaves in an agent’s sensors. Indeed, mere correlations between internal states and the environment are not sufficient to be treated as representational because they could be due to behavior that is entirely reactive (Clark, 1997). Haugeland, for example, understands representation as something that “stands in” for something in the environment, but that is no longer reflected in the perceptual system of the agent (Haugeland, 1991). Indeed, representation should be different from a mere translation. Consider a digital camera’s relationship with its environment. The photo chip guarantees a one-to-one mapping between the environment structure and the camera’s state patterns. But a camera is not able to adapt to its environment. By taking a picture, the camera has not ‘learned’ anything about its environment that will affect its future state. It simply stores what it received through its inputs without extracting information from it, i.e., the camera’s internal states are fully determined by its sensor inputs. Representation goes beyond mere translation because the content, i.e., which feature of the environment is represented, depends on the goals of the system. Not everything is represented in the same way. A camera does not have this functional specification of its internal states.

To rule out trivial representations like a camera’s internal states, we define representation as the shared entropy between environment states and internal states, but given the sensor states, i.e., conditioned on the sensors. Thus, representation is that part of the shared entropy between environment states and internal states that goes beyond what is seen in the sensors (see Fig. 1). For the following, we take $E$, given by its probability distribution $p(e)$, as the random variable to describe environmental states, while $S$ describes sensor states. If the internal states of the agent (hidden and output states) are characterized by the random variable $M$ with probability distribution $p(m)$, then we define the representation $R$ as (for an earlier version, see Marstaller et al., 2010):

$$R = H(E:M|S) = I - H(S:E) - H(S:M), \tag{5}$$

where the correlation entropy $I$ of the three variables $E$, $S$, and $M$ [also called “total correlation” (Watanabe, 1960) or “multi-information” (McGill, 1954; Schneidman et al., 2003)] is the amount of information they all three share:

$$I = H(E) + H(S) + H(M) - H(E,S,M). \tag{6}$$

In Eq. (5), we introduced the shared conditional entropy $H(E:M|S)$ between three variables that is defined as the difference between an information that is unshared and one that is shared (with a third system), just as the conditional entropy $H(X|Y)$ from Eq. (4) equals $H(X) - H(X:Y)$. Thus, the representation of the world $E$ within internal states $M$ is the total correlation between the three, but without what is reflected in $S$ about $E$ and $M$, respectively [measured by $H(S:E)$ and $H(S:M)$]. The relationship between $R$ and the entropies of the three variables $E$, $S$, and $M$ is most conveniently summarized by an entropy Venn diagram, as in Fig. 1. In these diagrams, a circle is a quantitative measure of the entropy of the associated variable, and the shared entropy between two variables is represented by the intersection of the variables, and so on (see, e.g., Cover and Thomas, 1991).
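To illustrate the definition, the sketch below evaluates the representation as the total correlation of environment, sensor, and internal states minus the two shared entropies with the sensor, for two hypothetical toy systems. In the "camera" the internal state merely copies the sensor, whereas in the "memory" system the internal state holds a feature of the world that the sensor does not show. The toy distributions, the variable labels `E`, `S`, `M`, and the function names are assumptions made only for this example.

```python
from collections import defaultdict
from math import log2

E, S, M = 0, 1, 2  # positions of environment, sensor, internal state in a tuple

def H(joint, axes):
    """Entropy in bits of the marginal of {(e, s, m): p} over the given axes."""
    marg = defaultdict(float)
    for state, p in joint.items():
        marg[tuple(state[a] for a in axes)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def shared(joint, a, b):
    """Shared entropy between two of the three variables."""
    return H(joint, [a]) + H(joint, [b]) - H(joint, [a, b])

def representation(joint):
    """Total correlation minus what the sensor shares with E and with M."""
    total_corr = (H(joint, [E]) + H(joint, [S]) + H(joint, [M])
                  - H(joint, [E, S, M]))
    return total_corr - shared(joint, S, E) - shared(joint, S, M)

bits = (0, 1)
# Toy world of two fair bits (e1, e2); the sensor only ever shows e1.
# "Camera": the internal state is a copy of the sensor.
camera = {((e1, e2), e1, e1): 0.25 for e1 in bits for e2 in bits}
# "Memory": the internal state holds e2, which the sensor does not show.
memory = {((e1, e2), e1, e2): 0.25 for e1 in bits for e2 in bits}

r_camera = representation(camera)
r_memory = representation(memory)
```

Under these assumptions the camera represents nothing (zero bits), while the memory system represents exactly the one bit that is absent from its sensor.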

Our information-theoretic definition of representation carries over from discrete variables to continuous variables unchanged, as can be seen as follows. Let $E$, $S$, and $M$ be random variables defined with normalized probability density functions $f(e)$, $f(s)$, and $f(m)$. The differential entropy $h(X)$ of such a variable $X$ with density $f(x)$ (Cover and Thomas, 1991), defined as

$$h(X) = -\int_{\Omega_X} f(x)\,\log f(x)\,dx, \tag{7}$$

where $\Omega_X$ is the support of random variable $X$, is related to the discretized version by noting that

$$h(X) = \lim_{\Delta\to 0}\left[H(X^\Delta) + \log\Delta\right], \tag{8}$$

where we introduced the discretization

$$p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx \tag{9}$$

in order to define the discretized Shannon entropy $H(X^\Delta) = -\sum_i p_i \log p_i$. This implies that the entropy of an $n$-bit quantization of a continuous random variable $X$ (that is, $\Delta = 2^{-n}$) is approximately $h(X) + n$ (Cover and Thomas, 1991). Let us now assume that the variables $E$, $S$, and $M$ are each quantized by $n_E$, $n_S$, and $n_M$ bits respectively. Because the joint variable $ESM$ is then quantized by $n_E + n_S + n_M$ bits, it follows that $I \approx i$, where $i = h(E) + h(S) + h(M) - h(E,S,M)$ is built from differential entropies; that is, the continuous and discrete variable correlation entropies are (in the limit of sufficiently small $\Delta$) approximately the same because the discretization correction cancels. The same is true for the informations $H(S:E)$ and $H(S:M)$, as these are correlation entropies between two variables. Thus, $R \approx r$, the differential entropy version of $R$. We stress that while an exact identity between discrete and continuous variable definitions of $R$ is only ensured in the limit of vanishing discretization, the cancellation of the correction terms implies that the discrete version is not biased with respect to the continuous version.
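The cancellation invoked above can be written out term by term. The following is a sketch under the assumption that each quantized entropy is well approximated by the differential entropy plus the number of quantization bits; the symbols $n_E$, $n_S$, $n_M$ (bits per variable) and the superscript $\Delta$ (quantized variable) are notation chosen here for illustration.

```latex
\begin{aligned}
I^{\Delta} &= H(E^{\Delta}) + H(S^{\Delta}) + H(M^{\Delta})
              - H(E^{\Delta},S^{\Delta},M^{\Delta}) \\
           &\approx \bigl[h(E) + n_E\bigr] + \bigl[h(S) + n_S\bigr]
              + \bigl[h(M) + n_M\bigr]
              - \bigl[h(E,S,M) + n_E + n_S + n_M\bigr] \\
           &= h(E) + h(S) + h(M) - h(E,S,M).
\end{aligned}
```

The same bookkeeping applies to each pairwise shared entropy, where the two single-variable corrections cancel against the correction of the joint entropy.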

$R$ defines a relation between a network’s activity patterns and its environment as the result of information processing. $R$ is a non-negative quantity, measured in bits (if logarithms are taken to base 2). In order to show that this measure of representation reflects functional purpose (Clark, 1997), we evolve cognitive systems (networks) that control the behavior of an embodied agent, and show that fitness, a measure for the agent’s functional prowess, is correlated with $R$. In other words, we show that when the environment (and task) is complex enough, agents react to this challenge by evolving representations of that environment.

### 2.2 Evolution of Active Categorical Perception

We study the evolution of an agent that solves an active categorical perception (ACP) task (Beer, 1996, 2003), but with modifications suggested by van Dartel et al. (2005) (see also van Dartel, 2005). Categorization is thought to be one of the key elements in cognition (see Harnad, 1987; Cohen and Lefebvre, 2005). In categorical perception, an agent has to partition objects in the world into different discrete categories based on their visual appearance. In active categorical perception, the agent takes an active role in choosing which aspects of the object to view; that is, perception is intimately linked with action. Whether or not this task requires internal representations may depend on the specific nature of the task, and it is in general not simple to determine whether the agent uses internal states to represent the environment, and in particular which features of the world are represented by which computational units (Beer, 2003; Ward and Ward, 2009). In the task studied by Beer (1996, 2003), the agent has to discriminate between circles and diamonds that fall vertically towards the agent, which can move laterally to change its perception of the object. In the version we study here, the agent has to discriminate between large and small blocks, by catching the small blocks while avoiding the large ones. In order to create visual ambiguity, the agent’s four upward-looking sensors have a blind spot (two non-sensing units) between them. Furthermore, blocks are falling diagonally (from left to right or right to left), so that the agent not only has to categorize, but also to predict (see Fig. 2A).

We evolve active categorical perception in two experiments, using two different control architectures. One is an artificial neural network (ANN), and one is a Markov network, specifically, a network of Hidden Markov Gates (HMGs) (Edlund et al., 2011), described in more detail in section 2.3. The agent is located at the bottom row of a world with periodic boundary conditions, and is able to move horizontally one unit per time step (see Fig. 2A). Note that compared to van Dartel et al. (2005) we doubled the vertical size of the arena in order to give the agent more time to assess the direction of the falling block. The agent has four upward-looking sensors. The state of a sensor is 1 when a block is detected, and 0 if no block is seen. The arrangement of the agent’s sensors allows it to unambiguously identify the falling block from only three out of eight possible input states (counting the input where no sensors are active). Writing the four sensor states from left to right, only the input states 0110, 0100, and 0010 are unambiguous. The first classifies a large block that is positioned right above the agent so that the second and the third sensor are in state 1 while the remaining two units of the block are positioned above the two non-sensing units. The other two patterns identify similarly positioned small blocks. All other five input states can be created by either a small block, a big block, or no block at all (see Fig. 2B). At each trial, a block of either small (two units) or large size (four units) falls from top to bottom in 20 time steps. The blocks move continuously downwards and sideways one unit per time step. A given block moves either always to the right or always to the left. An object is caught if the position of the block’s units and of the agent’s units at time step 20 overlap in at least one unit.
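The ambiguity of the sensor readings can be checked by brute force. The sketch below assumes that the agent is six units wide, with two sensors, a two-unit blind spot, and two more sensors from left to right, that the world is 20 units wide, and that a sensor fires whenever any unit of the block is in its column; the names `SENSOR_OFFSETS` and `sensor_pattern` are made up for this illustration.

```python
WORLD_WIDTH = 20
# Assumed layout, left to right: sensor, sensor, blind, blind, sensor, sensor.
SENSOR_OFFSETS = (0, 1, 4, 5)

def sensor_pattern(block_left, block_size, agent_left=0):
    """Four-character sensor reading for a block whose leftmost unit is at block_left."""
    covered = {(block_left + k) % WORLD_WIDTH for k in range(block_size)}
    return "".join(
        "1" if (agent_left + off) % WORLD_WIDTH in covered else "0"
        for off in SENSOR_OFFSETS
    )

def patterns(block_size):
    """All readings a block of the given size can produce at some horizontal offset."""
    return {sensor_pattern(x, block_size) for x in range(WORLD_WIDTH)}

small, large = patterns(2), patterns(4)
all_patterns = small | large
# A reading is unambiguous if only one block size can produce it.
unambiguous = (small ^ large) - {"0000"}
```

With this layout, eight distinct readings occur in total, and only `0110` (large block) and `0100`, `0010` (small blocks) identify the block size on their own.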

For the information-theoretic characterization of correlations, we have to assign probabilities to the possible states of the world. Theoretically, a falling block can be in any of 20 different starting positions, large or small, and falling left or right, giving rise to 80 possible experimental initial conditions. While the agent can be in any of 20 initial positions, the periodic boundary conditions ensure that each of them is equivalent, given the 20 initial positions of the falling block. Because there are 20 time steps before the block reaches the bottom row, there are in total 1,600 possible different states the world can be in. We do not expect that all of these states will be discriminated by the agent, so instead we introduce a coarse-graining of the world by introducing four bits that we believe capture salient aspects of the world. We define the environmental (joint) variable $E = E_1E_2E_3E_4$ to take on states as defined in Table 1.

| World state | World character |
|---|---|
| no sensor activated | $E_1 = 0$ |
| at least one sensor activated | $E_1 = 1$ |
| block is to the left of agent | $E_2 = 0$ |
| block is to the right of agent | $E_2 = 1$ |
| block is two units (small) | $E_3 = 0$ |
| block is four units (large) | $E_3 = 1$ |
| block is moving left | $E_4 = 0$ |
| block is moving right | $E_4 = 1$ |

Of course, this encoding reveals a bias in what we, the experimenters, believe are salient states of the world, and certainly underestimates the amount of “discoverable” entropy. However, in hindsight this coarse-graining appears to be sufficient to capture the essential variations in the world, and furthermore lends itself to study which aspects of the world are being represented within the agent’s network controller, by defining representations about different aspects of the world as the representation $R_k = H(E_k:M|S)$ of the individual world bit $E_k$ ($k = 1,\dots,4$, in the order of Table 1). Thus, we will study the four representations

$$R_1 = H(E_1:M|S), \tag{10}$$

$$R_2 = H(E_2:M|S), \tag{11}$$

$$R_3 = H(E_3:M|S), \tag{12}$$

$$R_4 = H(E_4:M|S), \tag{13}$$

that represent whether the sensor has been activated [Eq. (10)], whether the block is to the left or the right of the agent [Eq. (11)], if the block is large (size 4) or small (size 2) [Eq. (12)], or whether the block is moving to the left or right [Eq. (13)]. We can also measure how much (measured in bits) of each binary concept is represented in any particular variable. For example, $H(E_2:M_{12}|S)$, where $M_{12}$ denotes the state of node 12, measures how much of the “block is to my left or to my right” concept is encoded in variable 12.
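The counting of world states and their coarse-graining into four bits can be sketched as follows. Several details are assumptions made only for this example: the agent is held fixed at position 0 (justified for counting purposes by the periodic boundaries), the sensor layout has a two-unit blind spot in the middle of a six-unit agent, "left" and "right" are decided by the wrapped distance between block center and agent center, and each bit is set to 1 for the second alternative of its pair.

```python
from itertools import product

WORLD_WIDTH, TIME_STEPS = 20, 20
SENSOR_OFFSETS = (0, 1, 4, 5)  # assumed layout with a two-unit blind spot
AGENT_WIDTH = 6                # assumed: four sensors plus two blind units

def coarse_grain(block_left, size, direction, agent_left=0):
    """Map a world state to the four bits (hit, side, size, direction)."""
    covered = {(block_left + k) % WORLD_WIDTH for k in range(size)}
    hit = int(any((agent_left + o) % WORLD_WIDTH in covered
                  for o in SENSOR_OFFSETS))
    # Signed center-to-center distance wrapped into [-W/2, W/2): an assumption.
    delta = (block_left + size / 2) - (agent_left + AGENT_WIDTH / 2)
    delta = (delta + WORLD_WIDTH / 2) % WORLD_WIDTH - WORLD_WIDTH / 2
    side = int(delta > 0)  # 0: block left of (or centered on) agent, 1: right
    return hit, side, int(size == 4), int(direction == +1)

# 20 starting positions x 2 sizes x 2 directions
initial_conditions = list(product(range(WORLD_WIDTH), (2, 4), (-1, +1)))

world_states, coarse_states = set(), set()
for x0, size, direction in initial_conditions:
    for t in range(TIME_STEPS):
        x = (x0 + direction * t) % WORLD_WIDTH  # one unit sideways per time step
        world_states.add((x, size, direction, t))
        coarse_states.add(coarse_grain(x, size, direction))
```

Under these assumptions the 1,600 fine-grained world states collapse onto at most 16 coarse-grained states, which keeps the probability estimates needed for the entropies manageable.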

### 2.3 Two Architectures for Cognitive Systems

The agent is controlled by a cognitive system, composed of computational units (loosely referred to as “neurons” from here on) that map sensor inputs into motor outputs. The cognitive system also has neurons that are internal (a hidden layer), which are those neurons that are not part of the input or of the output layer. We further define sensor neurons as those neurons that directly process the input (the input layer) and we define output neurons as those units that do not map to other units in the network or to themselves (the output layer).

Artificial Neural Networks (ANN) with evolvable topology. In our first experiment, the robot’s movements are controlled by an artificial neural network that consists of 16 nodes: four input units (one for each sensor), two output units, and ten hidden units. The states of the input units are discrete, with binary values specifying whether an object is detected or not. The states of the output units (or actuators) are discrete, with integer values encoding one of three possible actions: move one unit to the right, move one unit to the left, or do not move. While the hidden units’ states are continuous, when evaluating these states to calculate $R$ we discretize them to binary (values below a fixed threshold become 0, every other value becomes 1). As discussed earlier, this discretization does not introduce a bias in the value of $R$.

Usually, classic artificial neural networks have a fixed topology, i.e., one or more layers and their connections are defined and associated with a weight. In a previous experiment, we found that such fixed topologies lead to approximately constant $R$ even as the fitness of the agent increases (data not shown). One way to increase the complexity of the network and the information it represents is to evolve a network’s topology as well as the connection weights. We make the network topology evolvable beyond searching the connection weights by using neuronal gates (NG). An NG can arbitrarily connect nodes of any type (input, hidden, and output nodes) without the fixed layered topology of classic ANNs. Each connection is associated with a certain weight. An NG calculates the sum of the values $x_i$ from a set of incoming nodes via gate $j$, each multiplied by the associated weight $w_{ij}$, and applies a sigmoid function $\sigma$ to calculate its output $y_j$:

$$y_j = \sigma\Bigl(\sum_i w_{ij}\,x_i\Bigr), \tag{14}$$

where the sum over $i$ runs over all the neurons that feed into gate $j$. This value is then propagated to every node this NG is connected to.
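A single neuronal gate can be sketched in a few lines. The sketch assumes `tanh` as the sigmoid and a binarization threshold of zero; neither choice is specified above, and the function names and the example wiring are hypothetical.

```python
from math import tanh

def neuronal_gate(values, inputs, weights, outputs, squash=tanh):
    """Apply one gate: squash the weighted sum of its input nodes and
    write the result to every output node. Returns a new list of node values."""
    activation = squash(sum(w * values[i] for i, w in zip(inputs, weights)))
    updated = list(values)
    for j in outputs:
        updated[j] = activation
    return updated

def binarize(value, threshold=0.0):
    """Discretize a continuous node value (the threshold is an assumption)."""
    return 0 if value < threshold else 1

# Hypothetical wiring: the gate reads nodes 0 and 2 and writes to node 3.
nodes = [1.0, 0.0, 1.0, 0.0]
after = neuronal_gate(nodes, inputs=[0, 2], weights=[2.0, 1.0], outputs=[3])
```

Because a gate names its own input and output nodes, the same function covers feed-forward, lateral, and recurrent connections; the topology is whatever the set of gates happens to specify.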

To apply a Genetic Algorithm to this system, each ANN is encoded in a genome as follows: A start codon of two loci marks the beginning of an NG, the subsequent two loci encode the NG’s number of inputs and outputs, while two further loci specify the origin of the inputs (which neurons feed into the gate) and the outputs (where the NG writes into). This information is then followed by an encoding of the weights of an $n$-input NG (see Fig. 3). The number of gates in the network can change as it evolves, and is only determined by the number of start codons in the genome. The genomes encoding these ANNs can undergo the same mutational changes as described later for Markov brains. In this respect, evolving open-topology ANNs is similar to using evolutionary algorithms (such as NEAT, see, e.g., Stanley and Miikkulainen (2002)) to evolve neural networks with augmented topology.

Markov Brains (MB). In our second experiment, the agent is controlled by a network of 16 nodes (four input, two output, and ten internal nodes, i.e., with the same types and number of nodes as the ANNs) which are connected via Hidden Markov Gates (HMGs, see Edlund et al. 2011). Networks of HMGs (Markov brains or MBs for short) are a type of stochastic Markov network (see, e.g., Koller and Friedman 2009) and are related to the hierarchical temporal memory model of neocortical function (Hawkins and Blakeslee, 2004; George and Hawkins, 2005, 2009) and the HMAX algorithm (Riesenhuber and Poggio, 1999), except that Markov brains need not be organized in a strictly hierarchical manner as their connectivity is evolved rather than designed top-down.

Each HMG can be understood as a finite-state machine that is defined by its input/output structure (Fig. 4A) and a state transition table (Fig. 4B). All nodes in Markov brains are binary, and in principle the HMGs are stochastic, that is, the output nodes fire (that is, are set to state ‘1’) with a probability determined by the state-to-state transition table. Here, each HMG can receive up to four inputs and distribute signals to up to four nodes, with a minimum of one input and one output node (these settings are configurable). For the evolution of the ACP task, we consider only deterministic HMGs (each row of the transition table contains only one value of 1.0 and all other transitions have a probability of 0.0), turning our hidden Markov gates into classical logic gates. In order to apply an evolutionary algorithm, each HMG is encoded in a similar way as the NGs, using a genome that specifies the network as a whole. Each locus of the genome is an integer variable in the range $[0, 255]$. Following a start codon (marking the beginning of a gene, where each gene encodes a single HMG), the next two loci encode the number of inputs and outputs of the gate respectively, followed by a specification of the origin of the inputs, and the identity of the nodes being written to.
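A deterministic gate of this kind reduces to a lookup table, which the following sketch makes explicit. The bit-packing convention, the 16-node state list, and the example gate (three inputs, two outputs, computing parity and conjunction) are assumptions chosen for illustration and do not correspond to any evolved gate.

```python
def hmg_step(state, inputs, outputs, table):
    """Update a binary state list with one deterministic gate.

    table[row] is the output pattern (an integer) selected by the input
    pattern `row`; a deterministic gate has exactly one such entry per row.
    """
    row = 0
    for node in inputs:  # pack input bits, first input = most significant bit
        row = (row << 1) | state[node]
    out = table[row]
    new_state = list(state)
    for k, node in enumerate(reversed(outputs)):  # unpack output bits
        new_state[node] = (out >> k) & 1
    return new_state

# Hypothetical 3-input, 2-output gate: first output is the parity of the
# inputs, second output is their conjunction.
table = []
for row in range(8):
    a, b, c = (row >> 2) & 1, (row >> 1) & 1, row & 1
    table.append(((a ^ b ^ c) << 1) | (a & b & c))

state = [0] * 16
state[1] = 1
next_state = hmg_step(state, inputs=(1, 2, 3), outputs=(3, 4), table=table)
```

A stochastic gate would replace the single table entry per row with a probability distribution over output patterns and sample from it; the deterministic case shown here is the one used for the ACP task.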

For example, for the HMG depicted in Fig. 4, the loci following the start codon would specify ‘3 inputs’, ‘2 outputs’, ‘read from 1,2,3’, ‘write to 3,4’. This information is then followed by an encoding of the probabilities of an $n$-input and $m$-output state transition table (see Supplementary Fig. S1 in Edlund et al. 2011 for more details). For the example given in Fig. 4, the particular HMG is specified by a circular genome with 39 loci (not counting the start codon). The start codon is universally (but arbitrarily) chosen as the consecutive loci (42,213). Because this combination only occurs by chance once every 65,536 pairs of loci (making start codons rare), we insert four start codons at arbitrary positions into a 5,000 loci initial genome to jump-start evolution. Thus, the ancestral genomes of all experiments with Markov brains encode at least 4 HMGs. A set of HMGs encoded in this manner uniquely specifies the Markov brain. The encoding is robust in the sense that mutations that change the input-output structure of an HMG leave the probability table intact, while either adding or removing parts of the table. This flexibility also implies that there is considerable neutrality in the genome, as each gene has 256 loci reserved for the probability table even if many fewer loci are used.
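The seeding of an ancestral genome and the search for genes can be sketched as follows. The sketch assumes that loci take integer values from 0 to 255 (consistent with a start codon appearing by chance once in 65,536 pairs) and that the four start codons overwrite existing loci at non-overlapping positions; the function names are hypothetical.

```python
import random

START_CODON = (42, 213)

def find_start_codons(genome):
    """Positions at which a start codon begins, treating the genome as circular."""
    n = len(genome)
    return [i for i in range(n)
            if (genome[i], genome[(i + 1) % n]) == START_CODON]

def seeded_random_genome(length=5000, n_codons=4, rng=random):
    """Random genome with n_codons start codons written at arbitrary positions."""
    genome = [rng.randrange(256) for _ in range(length)]
    # Distinct even positions guarantee that the written codons never overlap
    # (overwriting rather than splicing is an assumption of this sketch).
    for pos in rng.sample(range(0, length - 1, 2), n_codons):
        genome[pos], genome[pos + 1] = START_CODON
    return genome

genome = seeded_random_genome(rng=random.Random(0))
gene_starts = find_start_codons(genome)
```

Each position in `gene_starts` would mark the beginning of one gene, so the number of gates in the decoded network equals the length of that list.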

MBs and ANNs differ with respect to the gates connecting the nodes in each network. ANNs use weights, sums, and a sigmoid function together with continuous variables to compute their actions. In contrast, MBs use discrete states and Boolean logic to perform their computation. Using a very similar encoding of the topology means that mutations will have a similar effect on the topology of both systems, but different effects on the computations each system performs.

### 2.4 Evolutionary Algorithm

We evolve the two types of networks (ANNs and MBs) using a Genetic Algorithm (GA). A GA can find solutions to problems by using evolutionary search [see, e.g., Michalewicz (1996)]. The GA operates on the specific genetic encoding of the networks’ structure (the genotype), by iterating through a cycle of assessing each network’s fitness in a population of 100 candidates, selecting the successful ones for differential replication, and finally mutating the new candidate pool. When testing a network’s performance in controlling the agent, each network is faced with all 80 possible initial conditions that the world can take on. The fitness is calculated as the fraction of successful actions (the number of large blocks avoided plus the number of small blocks caught) out of 80 tests (a number between zero and 1). For the purpose of selection, we use an exponential fitness measure that multiplies the score by a factor 1.1 for every successful action, but divides the score by 1.1 for every unsuccessful action, or $w = 1.1^{\,n_+ - n_-}$ for $n_+$ successful and $n_-$ unsuccessful actions. After the fitness assessment, the genotypes are ranked according to $w$ and placed into the next generation with a probability that is proportional to the fitness (roulette wheel selection without elite). After replication, genotypes are mutated. We implemented three different mutational mechanisms that all occur after replication, with different probabilities. A point mutation happens with a fixed probability per locus, and causes the value at that locus to be replaced by a uniform random number drawn from the interval $[0, 255]$. There is a 2% chance that we delete a sequence of adjacent loci ranging from 256 to 512 in size, and a 5% chance that a stretch of 256 to 512 adjacent loci is duplicated (the size of the sequence to be deleted or duplicated is uniformly distributed in the range given). The duplicated stretch is randomly inserted between any two loci in the genome. Duplications and deletions are constrained so that the genome is not allowed to shrink below 1024 sites, and genomes cannot grow beyond 20,000 sites. Because duplications are more likely than deletions, there is a tendency for genomes to grow in size during evolution.
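The selection and mutation steps can be sketched compactly. In the sketch below the per-locus point mutation rate `point_rate` is a placeholder value chosen only so that the code runs, loci are assumed to be integers from 0 to 255, and all function names are hypothetical; the deletion and duplication probabilities, stretch sizes, and genome size limits follow the description above.

```python
import random

def selection_fitness(n_success, n_trials=80, base=1.1):
    """Exponential score: times 1.1 per success, divided by 1.1 per failure."""
    return base ** (n_success - (n_trials - n_success))

def roulette_select(population, fitnesses, rng):
    """Draw the next generation with probability proportional to fitness."""
    return rng.choices(population, weights=fitnesses, k=len(population))

def mutate(genome, rng, point_rate=0.005, p_delete=0.02, p_duplicate=0.05,
           min_len=1024, max_len=20000):
    """Point mutations, then at most one deletion and one duplication."""
    g = [rng.randrange(256) if rng.random() < point_rate else locus
         for locus in genome]
    if rng.random() < p_delete:
        size = rng.randint(256, 512)
        if len(g) - size >= min_len:          # never shrink below min_len
            start = rng.randrange(len(g) - size + 1)
            del g[start:start + size]
    if rng.random() < p_duplicate:
        size = rng.randint(256, 512)
        if len(g) + size <= max_len:          # never grow beyond max_len
            start = rng.randrange(len(g) - size + 1)
            insert_at = rng.randrange(len(g) + 1)
            g[insert_at:insert_at] = g[start:start + size]
    return g

rng = random.Random(1)
genome = [rng.randrange(256) for _ in range(5000)]
lengths = []
for _ in range(300):
    genome = mutate(genome, rng)
    lengths.append(len(genome))
```

Because the selection probabilities are proportional to the exponential score, only ratios of scores matter, so any constant prefactor in the score drops out.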

We evolve networks through 10,000 generations, and run 200 replicates of each experiment. Note that the type of gates is different between ANNs (neuronal) and MBs (logic), so the rate of evolution of the two networks cannot be compared directly, because mutations will have vastly different effects with respect to the function of the gates. Thus, the optimal mutation rate differs among networks (Orr, 2000). At the end of each evolutionary run, we reconstruct the evolutionary line of descent (Lenski et al., 2003) of the experiment, by following the lineage of the most successful agent at the end of 10,000 generations backwards all the way to the random ancestor that was used to seed the experiment. This is possible because we do not use cross-over between genotypes in our GA. This line of descent, given by a temporally ordered sequence of genotypes, recapitulates the unfolding of the evolutionary process, mutation by mutation, from ancestor to the evolved agent with high fitness, and captures the essence of that particular evolutionary history. For each of the organisms on each of the 200 lines of descent of any particular experiment, we calculate a number of information-theoretic quantities, among which is how much of the world the agent represents in its brain, using Equation (5).
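Reconstructing a line of descent amounts to following parent pointers backwards, which is possible here because every genotype has exactly one parent. The sketch below assumes that each genotype was given an identifier and that a mapping `parent_of` from identifier to parent identifier was stored during the run; both are hypothetical bookkeeping details.

```python
def line_of_descent(parent_of, final_id):
    """Follow parent pointers from a final genotype back to the seeding
    ancestor and return the lineage ordered from ancestor to descendant."""
    lineage = [final_id]
    while parent_of.get(lineage[-1]) is not None:
        lineage.append(parent_of[lineage[-1]])
    lineage.reverse()
    return lineage

# Hypothetical genealogy: genotype 0 is the random ancestor.
parent_of = {0: None, 1: 0, 2: 0, 3: 1, 4: 3, 5: 2}
lod = line_of_descent(parent_of, final_id=4)
```

Any information-theoretic quantity can then be evaluated for each genotype along `lod` to trace its history from the ancestor onwards.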

### 2.5 Extracting probabilities from behavior

The inputs to the information-theoretic measure of representation are the probabilities $p(x_i)$ to observe a particular state $x_i$, as well as the joint probabilities $p(x_i, y_j)$ describing the probability to observe a state $x_i$ when at the same time another variable takes on the state $y_j$. For the representation $R$ defined by Eq. (5), sensor, internal, and environment variables are distinguished. For any particular organism (an agent that performs the ACP task with an evolved controller), $R$ is measured at any point during the evolution by placing the organism into the simulated world and concurrently recording time series data of the states of all 16 controller nodes and the states of the environment. The recordings are then used to calculate the frequency of states. Based on the frequencies of states, all probabilities relevant for the information-theoretical quantities can be calculated, including those that take into account the temporal order of events (for example, the probability that variable $X$ takes on state $x_i$ at time $t$ while variable $Y$ takes on the state $y_j$ at time $t+1$). If a particular state (or combination of states) never occurs, a probability of zero is recorded for that entry (even though in principle the state or combination of states could occur). $R$ (and the other information-theoretic quantities introduced in section 3) is calculated for organisms on the evolutionary line of descent, making it possible to follow the evolutionary trajectory of $R$ from random ancestor to adapted agent. For ANNs that have internal nodes with continuous rather than binary states, the continuous values are mapped to the binary states 0 and 1 (as described in section 2.3) before calculation of probabilities.
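The step from recorded time series to a value of the representation measure can be sketched with plain frequency counts. The sketch assumes that a recording is a list of (environment, sensor, internal) state tuples, one per time step, and uses the standard expansion of a conditional shared entropy into joint entropies; the function names and the two toy recordings are illustrative only.

```python
from collections import Counter
from math import log2

def entropy_of(samples):
    """Plug-in entropy in bits from a list of hashable observations.
    States that never occur contribute nothing (probability zero)."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def representation_from_recording(recording):
    """Estimate the shared entropy of E and M given S from (e, s, m) tuples."""
    e_s = [(e, s) for e, s, m in recording]
    m_s = [(m, s) for e, s, m in recording]
    s_only = [s for e, s, m in recording]
    # H(E:M|S) = H(E,S) + H(M,S) - H(S) - H(E,S,M)
    return (entropy_of(e_s) + entropy_of(m_s)
            - entropy_of(s_only) - entropy_of(recording))

# Toy recordings: in the first, the internal state tracks a world bit that
# the (silent) sensor does not show; in the second, it only mirrors the sensor.
remembering = [(e, 0, e) for e in (0, 1)] * 10
reactive = [(e, e, e) for e in (0, 1)] * 10
r_remembering = representation_from_recording(remembering)
r_reactive = representation_from_recording(reactive)
```

Plug-in estimates of this kind are sensitive to the length of the recording, so short recordings with many possible joint states should be interpreted with care.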

## 3 Results

To establish a baseline, 100,000 random controllers for each of the two network types were created and the distributions of $R$ and fitness values were obtained. This baseline served two purposes: it shows how well randomly generated (unevolved) networks perform and how much information about the world they represent by chance, and it provides information about the distribution of these values. Random ANNs and MBs were created in the same way, by drawing the value at each of the genome’s loci from a uniform probability distribution over the allowed integers. Each genome was then sprinkled with 4 start codons at arbitrary positions within the genome.

Fig. 5A shows the distribution of fitness scores for 100,000 random ANNs, and Fig. 5B shows the distribution of their representation $R$. The respective distributions for MBs are shown in Fig. 5C for fitness scores and in Fig. 5D for representation. Although the two systems use different types of gates, ANNs and MBs do not differ with respect to their initial fitness or representation distributions. This shows that random genomes with a high fitness are, as expected, very rare, and that networks need to evolve their functionality in order to perform optimally.

In order to compare the two network architectures, the evolutionary trajectories for fitness and representation were analyzed for the evolutionary line of descent (LOD) as described in section 2.4. The different LODs obtained from following back any other member of the final population quickly coalesce to a single line. Hence, the LOD effectively recapitulates the genetic changes that led from random networks to proficient ones. The development of fitness and representation over evolutionary time in Fig. 6 is averaged over 200 independent replicates. While both ANNs and MBs have on average low fitness at the beginning of an evolutionary run (as seen in Fig. 5), MBs become significantly more fit than ANNs, and after 10,000 generations we find that in 18 out of 200 runs MBs have evolved to perfect fitness, while none of the ANNs reached this level (even the best ANNs make at least one incorrect decision). At the same time, we see that the fitness of the ANNs after 10,000 generations is not stagnating, which suggests that more runtime would allow for further improvement. As previously mentioned, the rate at which fitness is achieved in evolutionary time cannot be compared across architectures because mutations affect the function of the gates differently. While we tentatively explain the difference in performance between ANNs and MBs by their difference in representing the world (see below), we anticipate that the different network architectures also solve the categorization task very differently. To understand the information dynamics and the strategies employed in more detail, we calculated a number of other information-theoretic measures besides $R$ (see sections below).

The evolutionary trajectory for representation (see Fig. 6B) is similar to the evolution of fitness (see Fig. 6A), but MBs evolve to a significantly higher value of $R$. We attribute this difference to the difference in fitness between the two types of networks, as the discretization of the continuous ANN variables cannot introduce a bias in $R$. Thus, it appears that an increased representation of the world within an agent’s network controller correlates with fitness. We can test this correlation between fitness and representation at the end of a run for the 200 replicates of MBs and ANNs, and find that fitness and $R$ are significantly correlated (Spearman’s rank correlation) for MBs, but not for ANNs. We speculate that because ANNs are forced to compute using a sigmoid function only (effectively implementing a multiple-AND gate) while MBs can use arbitrary logic operations to process data, ANNs struggle to internalize (that is, represent) environmental states. In other words, it appears that the ease of memory formation is crucial in forming representations, which are then efficiently transformed into fit decisions in MBs (Edlund et al., 2011).

### 3.1 Analysis of Network Structures and Strategies

In order to be successful at the task described, an agent has to perform active categorical perception followed by prediction. In the implementation of the ACP task by Beer (1996; 2003), prediction can be achieved without memory, because once the network has entered the attractor representing a category, the prediction (to move away or to stay) can be directly coupled to the attractor. The task used here can only be achieved using memory (data not shown) and requires the agent to perform categorical perception by comparing sensory inputs from at least two different time points (which also allows a prediction of where the object is going to land).

In order to analyze how information is processed, we calculated the predictive information (Bialek et al., 2001) of the evolved networks, given by the mutual Shannon information between the network’s inputs at time $t$ and its outputs at time $t+1$. Predictive information, defined this way (Ay et al., 2008), measures how much of the entropy of outputs (the firings of motor neurons that control the agent) can be understood in terms of the signals that have appeared in the agent’s sensors just prior to the action. Indirectly, predictive information therefore also indicates the contribution of the hidden nodes of the network. A high predictive information would show that the hidden nodes do not contribute much and that computations are performed mainly by input and output neurons. Using the variable $S$ for sensor states and $A$ for actuator states, the predictive information can be written in terms of the shared entropy between sensor states at time $t$ and motor states at time $t+1$ as

$$I_{\rm pred} = I(S_t : A_{t+1}) = \sum_{s_t}\sum_{a_{t+1}} p(s_t, a_{t+1}) \log \frac{p(s_t, a_{t+1})}{p(s_t)\,p(a_{t+1})} \tag{15}$$

where $p(s_t)$ is the probability to observe variable $S_t$ in state $s_t$, $p(a_{t+1})$ is the probability to observe variable $A_{t+1}$ in state $a_{t+1}$, etc. Note that $S$ and $A$ are joint random variables created from the variables of each node, implying that $S$ can take on 16 different states while $A$ can take on 4 possible states. The probabilities are extracted from time series data as described in section 2.5. Figure 7A shows that over the course of evolution, the predictive information decreases for MBs after an initial increase, but increases slightly overall for ANNs. The drop in predictive information for the MBs indicates that their actions become less dependent on sensor inputs and are driven more by the hidden neurons, while the ANNs’ actions remain predictable from sensor inputs.
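A minimal sketch of how the predictive information could be estimated from a recorded time series is given below. It assumes one joint sensor state and one joint motor state per update, taken from a single uninterrupted trial (pairs that straddle two trials would have to be excluded); the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual Shannon information (bits) between two aligned sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def predictive_information(sensors, motors):
    """Estimate I(S_t : A_{t+1}) by pairing each sensor state with the next motor state."""
    return mutual_information(sensors[:-1], motors[1:])


# Toy recording: a purely reactive agent whose motor state at t+1
# copies the sensor state at t.
sensors = [0, 1, 0, 1, 0, 1, 0, 1, 0]
motors = [0] + sensors[:-1]
print(predictive_information(sensors, motors))
```

For this reactive toy agent the motors are fully determined by the preceding sensor state, so the predictive information equals the full motor entropy of one bit.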

To test whether it is the internal states that increasingly guide the agent, the predictive information was subtracted from the entropy of the output states (maximally two bits) to calculate the unpredicted entropy of the outputs, i.e., how much of the motor output entropy is uncorrelated with signals from the input:

$$H_{\rm unpred} = H(A_{t+1}) - I(S_t : A_{t+1}) \tag{16}$$

Figure 7B shows that $H_{\rm unpred}$ increases over the course of evolution, suggesting that indeed signals other than the sensor readings are guiding the motors. In principle, this increase could be due to an increase in the motor neuron entropy. However, as the latter stays fairly constant, we can conclude that the more a network adapts to its environment, the less its outputs are determined by its inputs and the more by its internal states. Again, this effect is stronger for MBs than for ANNs, and suggests that it is indeed the internal states that encode representations that drive the network’s behavior. It is also possible that the motors evolve to react to sensor signals further back in time. Because sensor neurons cannot store information, such a delayed response also has to be processed via internal states. While the absolute value of the predictive information and unpredicted entropy can depend on this time delay, we expect the overall trend of a decreasing $I_{\rm pred}$ coupled with an increasing $H_{\rm unpred}$ to be the same as for the one-step predictive information, because the sensorial signal stream itself has temporal correlations, that is, it is non-random.
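The unpredicted entropy can also be read as a conditional entropy, which may make its interpretation more transparent. The short derivation below uses only the standard identity $I(X:Y) = H(Y) - H(Y|X)$; the symbols $S_t$ (joint sensor state), $A_{t+1}$ (joint motor state), and $H_{\rm unpred}$ are notation assumed here for the quantities named in the text.

$$
\begin{aligned}
H_{\rm unpred} &= H(A_{t+1}) - I(S_t : A_{t+1}) \\
&= H(A_{t+1}) - \left[ H(A_{t+1}) - H(A_{t+1} \mid S_t) \right] \\
&= H(A_{t+1} \mid S_t), \qquad 0 \le H_{\rm unpred} \le H(A_{t+1}) \le 2 \text{ bits}.
\end{aligned}
$$

In words, the unpredicted entropy is the uncertainty that remains about the next motor state once the current sensor state is known. It vanishes for a purely reactive agent and reaches the full motor entropy when the motors are statistically independent of the preceding sensor state.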

To quantify the synergy of the network we calculated a measure of information integration called synergistic information. Roughly speaking, synergistic information measures the amount of information that is processed by the network as a whole that cannot be understood in terms of the information processing of each individual node, i.e., it measures the extent to which the whole network is, informationally, more than the sum of its parts (Edlund et al., 2011):

$$SI_{\rm atom} = I(X_t : X_{t+1}) - \sum_{i=1}^{n} I\big(X_t^{(i)} : X_{t+1}^{(i)}\big) \tag{17}$$

In Eq. (17), $I(X_t : X_{t+1})$ measures the amount of information that is processed (across time) by the whole network (the joint random variable $X$ composed of each of the $n$ node variables $X^{(i)}$), whereas $I(X_t^{(i)} : X_{t+1}^{(i)})$ measures how much is processed by node $i$. The negative of Eq. (17) has been used before to quantify the redundancy of information processing in a neural network (Atick, 1992; Nadal and Parga, 1994; Schneidman et al., 2003). $SI_{\rm atom}$ is a special case of the information integration measure $\Phi$ (Tononi, 2008; Balduzzi and Tononi, 2008), which is computationally far more complex than $SI_{\rm atom}$ because it relies on computing information integration across all possible partitions of a network. $SI_{\rm atom}$, instead, calculates information integration across the “atomic” partition only, that is, the partition where each node is its own part. Figure 7C shows that $SI_{\rm atom}$ increases for Markov brains as well as for ANNs, which indicates that both architectures evolve the ability to integrate information to perform the task at hand. We only see a marginal difference between MBs and ANNs in their ability to integrate information, while at the same time MBs are more dependent on internal states and ultimately perform better. This suggests that measuring integrated information in terms of Eq. (17) does not allow inferences about a system’s capacity to memorize. In summary, we observe that MBs evolve to become less dependent on sensorial inputs than ANNs, and in addition, the actions of the MBs become more dependent on internal states than in ANNs. Thus, we conclude that the network properties that $R$ measures are indeed representations that the networks create as an adaptive strategy. But how do the networks represent their environments? Which features of the world are represented and form the successful task-solving strategies?
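The synergistic information across the atomic partition can be sketched as follows, under the assumption that it is the mutual information between successive whole-network states minus the sum of the corresponding single-node terms. The function takes aligned lists of network states before and after one update; all names and the two-node toy network are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual Shannon information (bits) between two aligned sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def synergistic_information(before, after):
    """Whole-network information across one update minus the per-node sum.

    `before[k]` and `after[k]` are tuples holding the state of every node
    immediately before and after the k-th recorded update.
    """
    whole = mutual_information(before, after)
    parts = sum(
        mutual_information([b[i] for b in before], [a[i] for a in after])
        for i in range(len(before[0]))
    )
    return whole - parts


# Toy network: two nodes that swap their values at every update.
before = [(0, 0), (0, 1), (1, 0), (1, 1)]
after = [(b, a) for (a, b) in before]
print(synergistic_information(before, after))
```

In this toy network the whole system transmits two bits across the update, while each node viewed in isolation transmits none, so all processed information is attributed to the network as a whole. For a recorded time series `states`, the two arguments would be `states[:-1]` and `states[1:]`.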

#### 3.1.1 Epistemically Opaque Strategies

In order to analyze a Markov brain’s function, a number of different tools are available. First, a causal diagram can be generated by drawing an edge between any two neurons that are connected via an HMG. The edges are directed, but note that each edge can in principle perform a different computation. When creating the causal diagram, nodes that are never written into by any other nodes are removed, as they are computationally inert (they remain in their default ‘off’ state). Such nodes can also be identified via a “knock-out” procedure, where the input of each node is forced to either the “0” or “1” state individually. If such a procedure has no effect on fitness, the node is inert.
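The knock-out procedure can be sketched in a few lines. The sketch assumes a fitness function that accepts an optional clamp (a node index and the value it is forced to); `knockout_scan` and the three-node toy task `toy_fitness` are hypothetical stand-ins for the evolved controller and the ACP task.

```python
from itertools import product


def knockout_scan(fitness_fn, n_nodes):
    """Return the nodes for which clamping to 0 and to 1 leaves fitness unchanged."""
    baseline = fitness_fn(clamp=None)
    inert = []
    for node in range(n_nodes):
        if all(fitness_fn(clamp=(node, value)) == baseline for value in (0, 1)):
            inert.append(node)
    return inert


def toy_fitness(clamp=None):
    """Toy task: the output must equal node 0 XOR node 1; node 2 is never read."""
    score = 0
    for bits in product((0, 1), repeat=3):
        target = bits[0] ^ bits[1]      # correct answer for the unperturbed input
        seen = list(bits)
        if clamp is not None:
            node, value = clamp
            seen[node] = value          # force this node's input to a fixed state
        score += int((seen[0] ^ seen[1]) == target)
    return score


print(knockout_scan(toy_fitness, n_nodes=3))
```

Only the node that is never read survives both clamps without a change in fitness, so it is flagged as inert.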

Fig. 8 shows the causal diagram of an evolved Markov brain that solves the classification task perfectly. One can see that this network uses inputs from the sensors, motors, and memory simultaneously for decisions, fusing the different modalities intelligently (Murphy, 1996).

The causal diagram by itself, however, does not reveal how function is achieved in this network. As each HMG in the present instantiation represents a deterministic logic gate (generally they are stochastic), it is possible to determine the logical rules by which the network transitions from state to state by feeding the state-to-state transition table into a logical analyzer (Rickmann, 2011). The analyzer converts the state transition table into the minimal description of functions in Boolean logic using only NOT, AND ($\wedge$), and OR ($\vee$). With these functions we can exactly describe each node’s logical influence on other nodes (and possibly itself). For the network depicted in Fig. 8, the logic is given by (here, the numeral represents the node, and its index the state at time $t$ or the subsequent time point $t+1$, while an overbar stands for NOT)

Note that while this logical representation of the network’s dynamics is optimized (and the contribution of inert nodes is removed), it is in general not possible to determine the minimal logic network based on state-to-state transition information only, as finding the minimal logic is believed to be a computationally intractable problem (Kabanets and Cai, 2000). As a consequence, while it is possible to capture the network’s function in terms of a set of logical rules, we should not be surprised that evolution delivers epistemically opaque designs (Humphreys, 2009), that is, designs that we do not understand on a fundamental level. However, the strength of measuring representations with $R$ is that our measure is able to capture representations even if they are highly distributed and take part in complex computations. As such, $R$ provides a valuable tool to analyze evolved neural networks, as can be seen below.

#### 3.1.2 Concepts and Memory

To understand what representations are acquired (representations about which concepts), we calculated $R$ for each property of the environment defined in Eqs. (10)-(13), within each of the key nodes in our example network shown in Fig. 8. For this network, Fig. 9 shows that some nodes predominantly represent a single feature or concept, while others represent several features at the same time. In addition, the degree to which a node represents a certain property changes during the course of evolution. Looking at representation within each individual node, however, only tells part of the story, as it is clear that representations are generally “smeared” over several nodes. If this is the case, a pair of nodes (for example) can represent more about a feature than the sum of the representations in each node, i.e., variables can represent synergistically. In order to discover which combination of nodes represents which feature most accurately, a search over all partitions of the network would have to be performed, much like in the search for the partition with minimum information processing in the calculation of a network’s synergistic information processing (Balduzzi and Tononi, 2008).

One can also ask whether brain states represent the environment as it is at the time it is being represented, or whether they represent the environment in its past state. In other words, we can ask whether representations are about more distant or more proximal events. To answer this, we define temporal representations by including the temporal index of the Markov variables. For example, a representation at the same time point is defined [as implicit in Eq. (5)] as

$$R_t = H(E_t : M_t \mid S_t) \tag{18}$$

while a representation of events one update prior is defined as

$$R_{t-1} = H(E_{t-1} : M_t \mid S_t) \tag{19}$$

i.e., the shared entropy between the internal variables at time $t$ and the environmental states at time $t-1$, given the sensor’s states at time $t$. Naturally, one can define temporal representations about more distant events in the same manner. The representations of the concurrent environment and of the environment one and two updates prior ($R_t$, $R_{t-1}$, and the analogously defined $R_{t-2}$) were calculated and averaged over all 80 experiments in each generation (for both ANNs and MBs) over the course of evolution. Figure 10 shows that in both systems $R_{t-1}$ is larger than $R_t$, and $R_{t-2}$ is larger still (but note that the representation of events three updates prior is smaller, data not shown), and all values increase over evolutionary time similar to $R$ (see Fig. 6B). This suggests that both networks evolve a form of memory that reaches further back than just one update.
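A time-lagged representation can be estimated from the same recordings by shifting the environment sequence relative to the internal states. The sketch below is an illustration under stated assumptions: representation is taken to be $H(E:M|S)$, and the time index of the conditioning sensor states is exposed as a parameter (`sensor_lag`) because the convention should be checked against Eq. (19). All names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution of the samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def lagged_representation(env, mem, sen, lag=0, sensor_lag=0):
    """Estimate H(E_{t-lag} : M_t | S_{t-sensor_lag}) from aligned recordings."""
    start = max(lag, sensor_lag)        # first update for which all indices exist
    ts = range(start, len(mem))
    e = [env[t - lag] for t in ts]
    m = [mem[t] for t in ts]
    s = [sen[t - sensor_lag] for t in ts]
    return (entropy(list(zip(e, s))) + entropy(list(zip(m, s)))
            - entropy(s) - entropy(list(zip(e, m, s))))


# Toy recording: the internal state stores the previous environment state,
# and the sensors are uninformative.
env = [0, 0, 1, 1, 0, 0, 1, 1, 0]
mem = [0] + env[:-1]
sen = [0] * len(env)
now = lagged_representation(env, mem, sen, lag=0)
past = lagged_representation(env, mem, sen, lag=1)
print(now, past)
```

In this toy recording the internal state is a one-step memory, so it represents the environment of one update ago perfectly and the concurrent environment only weakly.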

We suggest that the peak in representation at a time difference of two updates implied by Fig. 10 can be explained by the hierarchical structure of the networks, which have to process the sensorial information through at least two time steps to reach a decision (it takes at least two time steps in order to assess the direction of motion of the block). Decisions have to be made shortly thereafter, however, in order to move the agent to the correct location in time. This further strengthens our view that representations are evolved, and furthermore that they build up during an agent’s lifetime as memories of past events shape the agent’s decisions.

## 4 Conclusions

We defined a quantitative measure of representation in terms of information theory, as the shared entropy between the states of the environment and internal “brain” states, given the states of the sensors. While internal states are necessary in order to encode internal models, not all internal states are representations. Indeed, representational information (which is about the environment) is a subset of the information stored in internal states. Information about the state of other past internal states or sub-states does not count towards $R$, and neither would information about imagined worlds, for example. Testing which nodes of the system contain representations about what aspect of the environment helps us to distinguish between information present in internal states and information that is specifically used as a representation. We applied this measure to two types of networks that were evolved to control a simulated agent in an active categorical perception task. Our experiments showed that the achieved $R$ increases with fitness during evolution independently of the system used. We also showed that while the (algorithmic) function of both artificial neural networks and Markov networks is difficult to understand, deterministic Markov networks can be reduced to sets of Boolean logic functions. This logic, however, may be epistemically opaque. While representation increases in networks over evolutionary time, each neuron can represent parts of individual concepts (features of the environment). However, most often concepts are distributed over several neurons and are represented synergistically. In addition, representations also form over the lifetime of the agent, increasing as the agent integrates information about the different concepts to reach a decision.
Thus, what evolves in Markov and artificial neural networks via Darwinian processes are not the representations themselves but the capacity to represent the environment, while the representations themselves are formed as the agent observes and interacts with the environment. We argued that $R$ can be measured in continuous (artificial neural networks) as well as discrete systems (Markov networks), which suggests that this measure can be used in more complex and more natural systems. We found that in the implementation used here, Markov networks were able to evolve their ability to form representations more easily than artificial neural networks. Future investigations will show which kind of system is more powerful at making intelligent decisions using representations.

Acknowledgements We thank C. Koch and G. Tononi for extensive discussions about representations, information integration, and qualia. This research was supported in part by the German Federal Ministry of Education and Research, by the Paul G. Allen Family Foundation, by the National Science Foundation’s Frontiers in Integrative Biological Research grant FIBR-0527023, NSF’s BEACON Center for the Study of Evolution in Action under contract No. DBI-0939454, as well as by the Agriculture and Food Research Initiative Competitive Grant no. 2010-65205-20361 from the USDA National Institute of Food and Agriculture. We wish to acknowledge the support of the Berlin-Brandenburg Academy of Sciences, the Michigan State University High Performance Computing Center and the Institute for Cyber Enabled Research.

## References

- Atick (1992) Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing. Network, 3(2):213–251.
- Ay et al. (2008) Ay, N., Bertschinger, N., Der, R., Guettler, F., and Olbrich, E. (2008). Predictive information and explorative behavior of autonomous robots. Eur Phys J B, 63(3):329–339.
- Balduzzi and Tononi (2008) Balduzzi, D. and Tononi, G. (2008). Integrated Information in Discrete Dynamical Systems: Motivation and Theoretical Framework. PLoS Comput Biol, 4(6):e1000091.
- Beer (1996) Beer, R. (1996). Toward the evolution of dynamical neural networks for minimally cognitive behavior. In Maes, P., Mataric, M., Meyer, J., Pollack, J., and Wilson, S., editors, From Animals to Animats: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, pages 421–429, Cambridge, MA. MIT Press.
- Beer (2003) Beer, R. (2003). The dynamics of active categorical perception in an evolved model agent. Adaptive Behavior, 11:209–243.
- Bialek et al. (2001) Bialek, W., Nemenman, I., and Tishby, N. (2001). Predictability, complexity and learning. Neural Computation, 13(11):2409–63.
- Bongard et al. (2006) Bongard, J., Zykov, V., and Lipson, H. (2006). Resilient machines through continuous self-modeling. Science, 314(5802):1118–1121.
- Brooks (1991) Brooks, R. A. (1991). Intelligence without representation. Artificial Intelligence, 47:139–159.
- Chomsky (1965) Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, Mass.: The MIT Press.
- Clark (1997) Clark, A. (1997). The dynamical challenge. Cognitive Science, 21:461–481.
- Clark and Toribio (1994) Clark, A. and Toribio, J. (1994). Doing without representing? Synthese, 101:401–431.
- Cohen and Lefebvre (2005) Cohen, H. and Lefebvre, C., editors (2005). Handbook of Categorization in Cognitive Science, Amsterdam, The Netherlands. Elsevier.
- Cover and Thomas (1991) Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. John Wiley, New York, NY.
- Craik (1943) Craik, K. (1943). The Nature of Explanation. Cambridge University Press, Cambridge (UK).
- Edlund et al. (2011) Edlund, J. A., Chaumont, N., Hintze, A., Koch, C., Tononi, G., and Adami, C. (2011). Integrated information increases with fitness in the simulated evolution of autonomous agents. PLoS Comput Biol, 7:e1002236.
- Floreano and Mondada (1996) Floreano, D. and Mondada, F. (1996). Evolution of homing navigation in a real mobile robot. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 26(3):396–407.
- Fodor (1975) Fodor, J. (1975). The Language of Thought. New York: Crowell.
- George and Hawkins (2005) George, D. and Hawkins, J. (2005). A hierarchical Bayesian model of invariant pattern recognition in the visual cortex. In Prokhorov, D., editor, Proceedings of the International Joint Conference on Neural Networks (IJCNN), volume 3, pages 1812–1817. IEEE.
- George and Hawkins (2009) George, D. and Hawkins, J. (2009). Towards a mathematical theory of cortical micro-circuits. PLoS Comput Biol, 5(10):e1000532.
- Harnad (1987) Harnad, S., editor (1987). Categorical Perception: The Groundwork of Cognition, Cambridge, UK. Cambridge University Press.
- Haugeland (1991) Haugeland, J. (1991). Representational genera. In Ramsey, W., Stich, S. P., and Rumelhart, D. E., editors, Philosophy and connectionist theory, pages 61–89, Hillsdale, NJ. Lawrence Erlbaum.
- Hawkins and Blakeslee (2004) Hawkins, J. and Blakeslee, S. (2004). On Intelligence. Henry Holt and Co., New York, NY.
- Humphreys (2009) Humphreys, P. (2009). The philosophical novelty of computer simulation methods. Synthese, 169(3):615–626.
- Johnson-Laird and Wason (1977) Johnson-Laird, P. and Wason, P. (1977). Thinking: Readings in Cognitive Science. Cambridge University Press.
- Kabanets and Cai (2000) Kabanets, V. and Cai, J.-Y. (2000). Circuit minimization problem. In Yao, F. and Luks, E., editors, Proc. 32nd Symposium on Theory of Computing, pages 73–79.
- Kawato (1999) Kawato, M. (1999). Internal models for motor control and trajectory planning. Current Opinion In Neurobiology, 9(6):718–727.
- Kay and Phillips (2011) Kay, J. W. and Phillips, W. A. (2011). Coherent infomax as a computational goal for neural systems. Bull Math Biol, 73(2):344–72.
- Koller and Friedman (2009) Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models. MIT Press, Cambridge, MA.
- Lenski et al. (2003) Lenski, R. E., Ofria, C., Pennock, R. T., and Adami, C. (2003). The evolutionary origin of complex features. Nature, 423(6936):139–144.
- Marr (1982) Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., New York, NY, USA.
- Marstaller et al. (2010) Marstaller, L., Hintze, A., and Adami, C. (2010). Measuring representation. In Christensen, W., Schier, E., and Sutton, J., editors, ASCS09: Proceedings of the 9th Conference of the Australasian Society for Cognitive Science, pages 232–237. Sydney: Macquarie Centre for Cognitive Science.
- McGill (1954) McGill, W. J. (1954). Multivariate information transmission. Psychometrika, 19(2):97–116.
- Michalewicz (1996) Michalewicz, Z. (1996). Genetic Algorithms + Data Structures = Evolution Programs. Springer Verlag, New York.
- Moravec (1984) Moravec, H. P. (1984). Locomotion, vision and intelligence. In Brady, M. and Paul, R., editors, Robotics Research 1, pages 215–224.
- Murphy (1996) Murphy, R. (1996). Biological and cognitive foundations of intelligent sensor fusion. Ieee Transactions On Systems Man and Cybernetics Part A-Systems and Humans, 26(1):42–51.
- Nadal and Parga (1994) Nadal, J. P. and Parga, N. (1994). Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5(4):565–581.
- Newell and Simon (1972) Newell, A. and Simon, H. (1972). Human Problem Solving. Englewood Cliffs, NJ: Prentice Hall.
- Nolfi (2002) Nolfi, S. (2002). Power and the limits of reactive agents. Neurocomputing, 42(1-4):119–145.
- Nolfi and Floreano (2000) Nolfi, S. and Floreano, D. (2000). Evolutionary Robotics. MIT Press, Cambridge, MA.
- Orr (2000) Orr, H. (2000). The rate of adaptation in asexuals. Genetics, 155(2):961–968.
- Phillipps et al. (1994) Phillipps, W. A., Kay, J., and Smyth, D. M. (1994). How local cortical processors that maximize coherent variation could lay foundations for representation proper. In Smith, L. S. and Hancock, P. J. B., editors, Neural Computation and Psychology, pages 117–136, New York. Springer Verlag.
- Phillips and Singer (1997) Phillips, W. and Singer, W. (1997). In search of common foundations for cortical computation. Behavioral and Brain Sciences, 20(4):657–+.
- Pinker (1989) Pinker, S. (1989). Learnability and Cognition. Cambridge, Mass.: The MIT Press.
- Pitt (2008) Pitt, D. (2008). Mental representation. In Zalta, E. N., editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, fall 2008 edition.
- Rickmann (2011) Rickmann, S. (2011). Logic friday (version 1.1.3) [computer software]. Retrieved from: http://www.sontrak.com/downloads.html.
- Riesenhuber and Poggio (1999) Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025.
- Schneidman et al. (2003) Schneidman, E., Still, S., Berry II, M. J., and Bialek, W. (2003). Network information and connected correlations. Phys Rev Lett, 91(23):238701.
- Shannon (1948) Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656.
- Stanley and Miikkulainen (2002) Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evol Comput, 10(2):99–127.
- Tononi (2008) Tononi, G. (2008). Consciousness as Integrated Information: a Provisional Manifesto. Biol Bull, 215(3):216–242.
- van Dartel et al. (2005) van Dartel, M., Sprinkhuizen-Kuyper, I., Postma, E., and van den Herik, J. (2005). Reactive agents and perceptual ambiguity. Adaptive Behavior, 13:227–42.
- van Dartel M.F. (2005) van Dartel M.F. (2005). Situated Representation. PhD thesis, Maastricht University.
- Ward and Ward (2009) Ward, R. and Ward, R. (2009). Representation in dynamical agents. Neural Networks, 22:258–266.
- Watanabe (1960) Watanabe, S. (1960). Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4:66–82.
- Wolpert et al. (1995) Wolpert, D. M., Ghahramani, Z., and Jordan, M. I. (1995). An internal model for sensorimotor integration. Science, 269(5232):1880–1882.