Universal limits to parallel processing capability of network architectures
The ability to learn new tasks and generalize performance to others is one of the most remarkable characteristics of the human brain and of recent AI systems. The ability to perform multiple tasks simultaneously is also a signature characteristic of large-scale parallel architectures, which is evident in the human brain and has been exploited effectively in more traditional, massively parallel computational architectures. Here, we show that these two characteristics are in tension, reflecting a fundamental tradeoff between interactive parallelism, which supports learning and generalization, and independent parallelism, which supports processing efficiency through concurrent multitasking. We formally show that, while the maximum number of tasks that can be performed simultaneously grows linearly with network size, under realistic scenarios (e.g., in an unpredictable environment) the expected number that can be performed concurrently grows radically sub-linearly with network size. Hence, even modest reliance on shared representation strictly constrains the number of tasks that can be performed simultaneously, with profound consequences for the development of artificial intelligence that optimally manages the tradeoff between learning and processing, and for understanding the human brain's remarkably puzzling mix of sequential and parallel capabilities.
There is a fundamental tension between two kinds of use for parallel distributed computing in network architectures.
The first focuses on incorporating a variety of interacting constraints in the learning and processing of complex representations (‘interactive parallelism’).
This has been profitably exploited in theories of human cognitive function McClelland et al. (1986) and most recently in the design of “deep learning” artificial systems Bengio et al. (2013).
In contrast, a second kind of use focuses on the capacity of a network to carry out multiple processes independently (‘independent parallelism’).
This approach has been exploited by massively parallel systems used in most modern computing clusters, and optimized by message-passing systems such as MPI Gropp et al. (1996) that seek to identify and distribute independent components of computation. What has been less well explored is the relationship between these two types of parallelism, and the consequences that this has for the design of adaptive systems. Recent work has begun to suggest that this tradeoff lies at the heart of human cognitive function Musslick et al. (2017).
Human behavior presents an interesting puzzle with regard to the capacity to perform multiple tasks simultaneously. On the one hand, we can effortlessly perform many kinds of tasks at the same time, such as walking, talking, and responding to our surroundings, all of which presumably involve extensive simultaneous computations. On the other hand, we are radically constrained in our ability to perform other kinds of tasks concurrently, such as planning a grocery list while simultaneously carrying out multidigit mental arithmetic. In cognitive psychology, this is attributed to a fundamental distinction between automatic and controlled processing, with the former capable of effortless, simultaneous execution, and the latter subject to cross-task interference and constraints on simultaneous performance Posner and Snyder (1975); Shiffrin and Schneider (1977).
Early theorists proposed that controlled processing relies on a centralized, limited capacity processing mechanism (akin to the core of a traditional computer), thus explaining the dramatic limitation in the human ability to simultaneously perform multiple control-dependent tasks. Early on, however, an alternative interpretation of this limitation was offered: that constraints in parallel processing reflect local competition among the resources required to perform specific combinations of tasks (based on the overlapping, shared use of representations) Wickens (1991); Allport (1980); Meyer and Kieras (1997); Navon and Gopher (1979). While compelling, this proposal was not undergirded by formal analysis on the extent to which process overlap (i.e., shared use of representations) would constrain processing. Moreover, process overlap was thought to diminish as the overall size of the system increases, providing at best a weak account of the constraints observed for a system as large as the human brain. Recently, however, numerical work has suggested instead that constraints imposed by overlap may be more radical and, under some conditions, approach scale-invariance Feng et al. (2014); Musslick et al. (2016a).
Here, we give a general theoretical analysis of the problem and show that even modest degrees of shared representations impose radical constraints on the number of tasks that agents can perform simultaneously without the risk of interference from crosstalk. This defines a fundamental tension in network architectures between the benefits that accrue from shared representations (i.e. flexibility of processing and generalization Bengio et al. (2013); Caruana (1997); Baxter (1995)) and their costs in terms of processing efficiency (i.e., the number of independent tasks that can be performed simultaneously Musslick et al. (2017)).
This explains the constraints in the human controlled processing capacity, suggesting that it reflects the purpose of control – to limit crosstalk among processes that share representations – rather than a limitation intrinsic to control mechanisms themselves Feng et al. (2014); Musslick et al. (2016a, 2017). The balance between controlled and automatic processes in human behavior may, in fact, represent an evolutionary and/or developmental optimization of the human brain between interactive and independent parallelism – an optimization that may be central to our remarkable ability for adaptive behavior. Hence, understanding this trade-off, and how the brain manages it, may be crucial in the design of human-level adaptive artificial systems.
ii.1 Measures of task dependency predict parallel processing capability in a trained neural network
To consider the problem of concurrent parallel processing (multitasking) analytically, we first provide a formal definition of a task. Given an input space of stimuli (e.g., colors) and an output space of responses (e.g., verbal responses), a task represents a mapping between the two (e.g., naming the color of a stimulus), such that the mapping is independent of any other, and such that selection of a feature from its input space can be made independently of any other. Different tasks can share an input space, an output space, or both (e.g., reading a color word such as "red" and naming the color in which it is printed share an output space). When this occurs, there is the potential for the tasks to interfere with one another Stroop (1935); Feng et al. (2014). Such interference can be made explicit by describing the task structure in the form of a (bipartite) task structure graph, whose node sets are respectively the input spaces and output spaces, and whose edges are the tasks; this graph makes the sharing of representations across tasks explicit (Figure 1a). Whenever two tasks share an input node or an output node, we assume that they are at risk of interference due to cross-talk and therefore should not be executed in parallel; we call this dependency structural because of the direct reliance on common resources Musslick et al. (2016a). In Figure 1a this is represented by pairs of tasks that meet at a common node. Importantly, in addition to structural dependence, there can also be a functional dependence between two tasks: this is the case whenever a third task maps the input space (i.e., connects the input node) of the first to the output space (i.e., output node) of the second. In Fig. 1a two tasks are functionally dependent via such a third task, because activating a stimulus in the first task's input space also engages the third task, thus invoking a response that may conflict with the response to the first.
These dependencies can be made explicit in a task dependency graph derived from the bipartite graph, in which nodes represent tasks and edges their (structural or functional) dependence (see Methods and SI).
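The two dependency rules can be made concrete in a few lines of Python (a minimal sketch, not the code used in the paper; tasks are represented as toy (input space, output space) pairs):

```python
from itertools import combinations

def dependency_graph(tasks):
    """Edges of the task dependency graph for a bipartite task set.

    tasks: list of (input_space, output_space) pairs, one per task.
    Returns the set of task-index pairs that are structurally or
    functionally dependent and thus should not run concurrently.
    """
    task_set = set(tasks)
    edges = set()
    for a, b in combinations(range(len(tasks)), 2):
        (ia, oa), (ib, ob) = tasks[a], tasks[b]
        structural = (ia == ib) or (oa == ob)  # shared input or output node
        # functional: a third task maps one task's input to the other's output
        functional = (ia, ob) in task_set or (ib, oa) in task_set
        if structural or functional:
            edges.add(frozenset((a, b)))
    return edges
```

For instance, three tasks (i1, o1), (i2, o2), (i1, o2) come out pairwise dependent: the first and third share an input space, the second and third share an output space, and the third functionally links the first two.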
This, in turn, can be used to determine the maximum parallel processing capacity of the network (i.e., the largest number of tasks it can perform simultaneously) by finding the largest set of independent tasks. In graph-theoretic terms, this corresponds to finding a maximum induced edge matching of the task structure graph: a subset of tasks in which none are either structurally or functionally dependent on one another. In Fig. 1b we show an example of an induced matching (in orange). This is equivalent to finding the maximum independent set (MIS) of the dependency graph Gavril (1973). The size of the MIS is the independence number of the dependency graph, which provides a measure of the maximum parallel processing capacity of a given task structure.
This equivalence is key to our first main contribution: a neural network trained on a task environment characterized by a task structure graph with a given topology exhibits a maximum parallel capacity given by the independence number of the corresponding dependency graph. To assess the correspondence of this theoretical measure of parallel processing capacity with the performance of an actual network, we trained a simple non-linear feed-forward network with four layers (see Figure 1c) that has been used previously to simulate a wide array of empirical findings concerning human cognitive performance (e.g., Cohen et al. (1990, 1992); Botvinick et al. (2001)). The network architecture comprises two input layers, one that encodes the current stimulus (stimulus layer) and another that encodes the task to be performed on the stimulus (task layer). Both input layers project to a hidden layer that computes an internal representation of task-relevant input features of the stimulus. Finally, information encoded at the hidden layer is projected, together with the task layer input, to an output layer at which the response of the network is computed. Note that the projections from the task layer allow it to bias processing towards task-relevant stimulus information represented at the hidden layer, as well as task-relevant responses at the output layer, thereby shaping the representations of the input and output spaces respectively for each task.
We trained 480 networks on a range of different task environments, varying both the total number of tasks and the associated task structure graph (Figure 1c). Weight projections from the task layer to the hidden layer were fixed and impose similarity between tasks with respect to their shared input spaces. For each network, trained on a given task environment, we constructed an empirical task dependency graph by studying the similarity between single-task representations encoded in the hidden and output layers Musslick et al. (2016a). A single-task representation for a given layer corresponds to the average activity vector at that layer for a given task. Two tasks were considered structurally dependent if either their hidden layer representations or their output layer representations exceeded a fixed Pearson correlation threshold, indicating that they shared a common input or output space respectively. A pair of tasks was assessed to be functionally dependent if there was a third task whose hidden layer representation was similar to that of one task in the pair and whose output layer representation was similar to that of the other. The MIS of the resulting empirical dependency graph matches the MIS of the theoretical dependency graph describing the task environment. Critically, it yields an almost perfect agreement with the largest task subset that the network can in fact perform simultaneously (Figure 1d). We refer to this as the empirical maximum parallel processing capability of the network. These results also highlight the crucial possibility of faithfully extracting the dependency graph from network activation data when the original task structure graph is not known.
For example, it would be possible in principle to use the correlation matrix of patterns of neural activity elicited by individual tasks (e.g., the pattern of firing of a neuronal population, or the multivoxel pattern of activity in an fMRI image) to construct a dependency graph analogous to the one constructed for the neural network as described above, without the need to know the task structure graph a priori, suggesting the possibility of observing representation learning and predicting parallel processing capability in vivo.
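Such an extraction could be sketched as follows (illustrative only; the threshold theta is an assumed stand-in for the fixed correlation value used above, and the representation arrays are hypothetical inputs):

```python
import numpy as np

def empirical_dependency_edges(hidden_reps, output_reps, theta=0.5):
    """Infer task dependencies from mean activity patterns.

    hidden_reps, output_reps: arrays of shape (n_tasks, n_units) holding
    each task's mean activation vector at the hidden and output layers.
    Two tasks are structurally dependent if their hidden or output
    representations are highly correlated; functionally dependent if a
    third task links the hidden space of one to the output space of the
    other.
    """
    h = np.corrcoef(hidden_reps)   # task-by-task Pearson similarity (hidden)
    o = np.corrcoef(output_reps)   # task-by-task Pearson similarity (output)
    n = len(hidden_reps)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            structural = h[a, b] > theta or o[a, b] > theta
            functional = any(
                (h[a, c] > theta and o[c, b] > theta) or
                (h[b, c] > theta and o[c, a] > theta)
                for c in range(n) if c not in (a, b)
            )
            if structural or functional:
                edges.add(frozenset((a, b)))
    return edges
```

The same procedure would apply unchanged to mean activity patterns measured in vivo, since it only consumes per-task activation vectors.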
ii.2 Maximum parallel capacity estimation for dependency graphs of arbitrary size
The results above are encouraging. However, it is very hard to scale up the direct computation of the maximum parallel processing capability, as for a network performing N tasks it requires the enumeration of all 2^N task subsets and thus scales exponentially.
Using the MIS size computation affords a significant computational advantage: the algorithmic complexity of computing the MIS explicitly for a graph with n nodes is O(2^{n/3}) Bollobás (1998); Tarjan and Trojanowski (1977), a substantial improvement over exhaustive enumeration, though still exponential in the number of tasks in our case.
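For small dependency graphs, the independence number can still be obtained exactly by exhaustive search over subsets; a brute-force sketch (illustrative only, exponential in the number of tasks):

```python
from itertools import combinations

def independence_number(n_tasks, edges):
    """Size of the maximum independent set of the dependency graph.

    edges: set of frozenset task-index pairs marking dependent tasks.
    Searches subsets from largest to smallest and returns the first
    size at which a fully independent subset exists.
    """
    for size in range(n_tasks, 0, -1):
        for subset in combinations(range(n_tasks), size):
            chosen = set(subset)
            # independent iff no dependency edge lies inside the subset
            if not any(e <= chosen for e in edges):
                return size
    return 0
```

For example, a chain of three pairwise-linked tasks admits two independent tasks (the endpoints), while three mutually dependent tasks admit only one.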
Although efficient algorithms for specific classes of graphs exist Quaddoura (2014); Grotschel et al. (1984); Gavril (1972), measuring directly is impractical for graphs relevant to most real-world applications. More importantly, algorithmic solutions do not provide insights about the features of the task structure that are responsible for limiting parallel processing capacity.
To gain such insight, we provide a graph ensemble formulation of the MIS problem in terms of the degree distribution of the task structure. This allows us to isolate the roles of graph density and heterogeneity, independently of network size, and to make general observations about the relationship between task structure and task encoding policies in determining parallel processing capacity.
The first step is to link the task structure graph with the corresponding task dependency graph by calculating the degree distribution of the latter from that of the former. This can be done in a manner similar to the standard calculation of the number of second neighbours Newman (2010). Consider a task graph with given numbers of input and output nodes, and degree distributions for the input and output node degrees at the endpoints of each task edge. Following the dependency rules (see Fig. 1), the task dependency graph is the square of the line graph of the task structure graph. This allows us to estimate the expected degree of a task in the dependency graph (Eq. 1) in terms of the degrees of its input and output endpoints, their expectation values, and the number of edges in the task graph (we refer the reader to the SI for full details on the approximation and the calculation).
In Figure 2a (and the SI) we show that Eq. 1 gives good results for graphs of various densities and for various degree distributions.
Note that the estimate is written in terms of the first two moments of the task graph's degree distribution, recovering the previously observed connection between the heterogeneity of the task graph and that of the corresponding dependency graph. This will play an important role in the estimation of the density of the MIS of the dependency graph.
The second step is to write the independence number as a function of the degree distribution of the dependency graph. Building on recent work by Lucibello and Ricci-Tersenghi Lucibello and Ricci-Tersenghi (2014), we calculated an estimate of the independence number based on a factor graph description of the maximum set packing problem, of which the independence number problem is a particular instance. Crucially, these expressions depend only on the graph's degree distribution, which allows the effect that different network topologies have on parallel capacity to be studied. Exploiting the properties of the degree and excess degree (the degree of a node reached by following an edge) distributions Newman (2009), the estimate can be rewritten in terms of the generating function of the dependency graph's degree distribution, together with an auxiliary variable that needs to satisfy a self-consistent equation (Eq. 2); in the factor graph description, the factor nodes' degrees and excess degrees are fixed in the case of the MIS (see Methods and SI).
For a generic task structure graph, Eq. 1 can be used to calculate the degree distribution of the corresponding dependency graph, which, substituted in Eq. 2, gives the estimated independence number.
For a generic Gaussian-like degree distribution with mean μ and variance σ², the moment generating function takes the form M(t) = exp(μt + σ²t²/2). Following the expression above, we obtain the numerical solution of the self-consistent equation, which can be used to compute the independence density.
In Figure 2b we show that this expression provides a good approximation of the behaviour of the independence density for increasing network density and for various levels of degree heterogeneity. Importantly, it provides an analytical grounding for the previous empirical observations that increased heterogeneity of task overlap for a given average density results in a higher parallel capacity Feng et al. (2014); Musslick et al. (2016a). Here, we used Gaussian degree distributions to illustrate the impact of the density of the dependency graph, which depends, in turn, on the density of the task structure graph (referred to as task overlap in Feng et al. (2014); Musslick et al. (2016a)) and on its degree heterogeneity: for a fixed size, dense and uniform graphs have a smaller MIS than sparse, heterogeneous ones. In summary, Eq. 1 can be applied to a task structure graph of arbitrary size, providing an estimate of the degree distribution of the corresponding dependency graph, which can be used in turn to estimate the independence number. In Figure 2c we show that our analytical method accurately predicts the result of exact computation of the MIS for a wide range of graphs for which this was carried out (see SI for other topologies). These results justify the use of our – size-independent – predictions of Eq. 2 for graphs of arbitrary size.
ii.3 Effective parallel processing capacity
We showed that, using the degree distribution of a dependency graph, it is possible to approximate well the independence density of the graph and, from that, the corresponding independence number (the density multiplied by the number of nodes) for networks of arbitrary size.
It is important to note here that the independence number is specific to a particular set (or sets) of tasks; that is, the specified level of parallelism can only be achieved for the particular tasks that comprise a maximum independent set. The size of this set tells us the maximum number of tasks that a given neural architecture can possibly perform simultaneously, assuming that there are no constraints on task selection. Critically, however, this does not address a question that is likely to arise in naturalistic settings: how many tasks can the system be expected to perform simultaneously on average, given a probabilistically determined subset of tasks that are currently viable (i.e., desirable and/or possible)? That is, what is the expected number of independent tasks across all task subsets of a given cardinality k? This effective parallel processing capacity is governed by the probability that exactly j out of the k selected nodes are independent of each other in the dependency graph. The latter requires the j nodes to be unlinked with each other, and the remaining k − j nodes to be connected to at least one of the first j tasks. For a graph with a given number of edges, this probability can be written in closed form. For j = 1, the probability of executing the task is always 1, since a single task cannot interfere with itself. For comparison with the independence density, in Figure 2d we plot the results for the mean expected performance (MEP) density, obtained by numerical integration for a range of task structure sizes. As expected, the MEP density decreases with increasing network density, as it becomes progressively less likely to randomly find independent tasks. Critically, the increase of the effective capacity as a function of network size is strongly sub-linear (see SI). As a consequence, increasing network size is associated with a rapid decline in the average expected fraction of simultaneously executable tasks, in contrast to the size-independent independence density. This highlights a fundamental limitation of network architectures: the very network fabric that supports interactive parallelism by sharing representations between tasks (e.g., for learning and/or generalization) induces limits on independent parallelism – that is, on the ability to perform multiple tasks simultaneously Musslick et al. (2017).
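The effective capacity can also be approximated numerically: a Monte Carlo sketch (illustrative, not the paper's integration method) samples random subsets of k viable tasks and brute-forces the largest independent subset within each sample:

```python
import random
from itertools import combinations

def effective_capacity(n_tasks, edges, k, n_samples=2000, seed=0):
    """Monte Carlo estimate of the expected number of independent tasks
    within a random subset of k currently viable tasks.

    edges: set of frozenset task-index pairs marking dependent tasks.
    Brute-forces the largest independent subset per sample, so only
    suitable for small k.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        subset = rng.sample(range(n_tasks), k)
        best = 1  # a single task can always be executed alone
        for size in range(k, 1, -1):
            found = False
            for cand in combinations(subset, size):
                chosen = set(cand)
                if not any(e <= chosen for e in edges):
                    best, found = size, True
                    break
            if found:
                break
        total += best
    return total / n_samples
```

With no dependencies the estimate equals k, and on a fully dependent task set it collapses to 1, bracketing the behaviour shown in Figure 2d.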
The ability of neural network architectures to support parallel distributed processing has been exploited in a variety of domains, ranging from cognitive neuroscience Rumelhart et al. (1986); McClelland et al. (1987); O’Reilly and Munakata (2000) to recent advances in machine learning (e.g., “deep learning” LeCun et al. (2015); Mnih et al. (2015); Gibney (2016)).
Most approaches have focused on the learning and processing of complex representations by simultaneously integrating a large number of constraints through fine-grained interactions among many units processing the same input in parallel – what we call interactive parallelism. In contrast, the traditional computer science approach to parallelism has focused on exploiting large numbers of processors by separating jobs into as many independent components as possible, and then processing them in isolation – what we call here independent (or "embarrassing") parallelism.
Remarkably, the human brain has the ability to integrate the use of both forms of parallelism – a feature that we suspect may be fundamental to its capacity for adaptation.
However, the factors that influence the tradeoff between each, and how the brain decides which form to use, remain a mystery.
Here, we contribute to unraveling this mystery by providing a formal description of the constraints that the degree of interaction among processes imposes on the capacity for independent parallelism in simple networks.
Our findings provide support for the idea that the striking limit in the number of control-dependent processes that humans can execute at one time Posner and Snyder (1975); Shiffrin and Schneider (1977) may in fact reflect a constraint imposed by the shared representation among task processes encoded in the network – a constraint that engages the function of control mechanisms (to mitigate cross talk) rather than reflecting a limitation that is intrinsic to control mechanisms themselves. One might ask why the human brain did not evolve or develop to avoid such crosstalk, for example by encoding tasks independently. In point of fact, it does exhibit the capacity for such parallelism in many domains (such as the realtime regulation of the many homeostatic systems for which it is responsible), and the ability to develop it in others (i.e., the automaticity associated with skill acquisition, such as learning to play a musical instrument, or to drive a vehicle). Still, the question remains: why do so many important behaviors (e.g. planning and language) exhibit strict constraints on simultaneous multitasking indicative of control-dependent processing?
One compelling response to this question – made clear by advances in deep learning Baxter (1995); Caruana (1997); Bengio et al. (2013) – is that shared representations are a powerful means for promoting efficient learning and generalization. Indeed, we have previously shown that there is a direct tradeoff in networks between speed of learning, on the one hand, and concurrent parallel processing capability on the other Musslick et al. (2017). The findings we report here reinforce and further illuminate this point, making it clear that the tradeoff is not relaxed by increasing network size: while the maximum parallel capacity – that is, the theoretical limit in the number of simultaneously executable tasks – grows linearly with the size of a network, the maximum expected parallelism is subject to strict asymptotic limits. The problem is compounded in deeper networks, in which the likelihood of interaction increases across layers. An obvious means of mitigating the limits imposed by the development of shared representations would be the use of learning policies that weigh the advantages of speed of acquisition and/or generalization against the value of processing efficiency. This, however, demands an understanding of how different policies impact task structure and its effects on performance. Our results provide the tools and language to assess the impact of different learning policies, and set a foundation for exploring the relative costs and benefits of independent vs. interactive parallelism and how adaptive learning systems might be regulated to favor one vs. the other Ozcimder et al. (in press). Finally, our findings are relevant to other, formally similar problems, e.g., maximum channel capacity in coding problems Butenko et al. (2002), jobs with interfering schedules Bomze et al. (1999); Wan et al. (2011), and register allocation Hack et al. (2006).
Our present findings are limited by a number of factors, including the use of undirected binary dependency graphs, corresponding to the assumption that interference between processes is symmetric and all or none. In natural systems, of course, interactions can be both asymmetric and graded, a feature that can be captured by the use of directed weighted graphs. The generalization of our methods to such graphs presents considerable challenges, and is an important direction for future research. Nevertheless, the correspondence of our theoretical results with the numerical analyses (implementing asymmetric and graded forms of interference) suggests that our findings may provide useful approximations for current applications, and a valuable foundation for future theoretical work.
- McClelland et al. (1986) J. L. McClelland, D. E. Rumelhart, and G. E. Hinton, The appeal of parallel distributed processing (Cambridge, MA: MIT Press, 1986).
- Bengio et al. (2013) Y. Bengio, A. Courville, and P. Vincent, IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798 (2013).
- Gropp et al. (1996) W. Gropp, E. Lusk, N. Doss, and A. Skjellum, Parallel computing 22, 789 (1996).
- Musslick et al. (2017) S. Musslick, A. Saxe, K. Özcimder, B. Dey, G. Henselman, and J. D. Cohen, in 39th cognitive science society conference, GB (2017).
- Posner and Snyder (1975) M. Posner and C. Snyder, in Information processing and cognition: The Loyola symposium (1975) pp. 55–85.
- Shiffrin and Schneider (1977) R. M. Shiffrin and W. Schneider, Psychological review 84, 127 (1977).
- Wickens (1991) C. D. Wickens, Processing resources and attention, Vol. 1991 (1991) pp. 3–34.
- Allport (1980) D. A. Allport, Cognitive psychology: New directions 1, 12 (1980).
- Meyer and Kieras (1997) D. E. Meyer and D. E. Kieras, Psychological review 104, 3 (1997).
- Navon and Gopher (1979) D. Navon and D. Gopher, Psychological review 86, 214 (1979).
- Feng et al. (2014) S. F. Feng, M. Schwemmer, S. J. Gershman, and J. D. Cohen, Cognitive, Affective, & Behavioral Neuroscience 14, 129 (2014).
- Musslick et al. (2016a) S. Musslick, B. Dey, K. Özcimder, M. M. A. Patwary, T. L. Willke, and J. D. Cohen, in 38th cognitive science society conference, PA (2016).
- Caruana (1997) R. Caruana, Machine learning 28, 41 (1997).
- Baxter (1995) J. Baxter, in Proceedings of the eighth annual conference on Computational learning theory (ACM, 1995) pp. 311–320.
- Stroop (1935) J. R. Stroop, Journal of experimental psychology 18, 643 (1935).
- Gavril (1973) F. Gavril, Networks 3, 261 (1973).
- Cohen et al. (1990) J. D. Cohen, K. Dunbar, and J. L. McClelland, Psychological Review 97, 332 (1990).
- Cohen et al. (1992) J. D. Cohen, D. Servan-Schreiber, and J. L. McClelland, The American journal of psychology , 239 (1992).
- Botvinick et al. (2001) M. M. Botvinick, T. S. Braver, D. M. Barch, C. S. Carter, and J. D. Cohen, Psychological review 108, 624 (2001).
- Bollobás (1998) B. Bollobás, Modern Graph Theory, Vol. 184 (Springer Science & Business Media, 1998).
- Tarjan and Trojanowski (1977) R. E. Tarjan and A. E. Trojanowski, SIAM Journal on Computing 6, 537 (1977).
- Quaddoura (2014) R. Quaddoura, World of Computer Science and Information Technology Journal (WCSIT) 4, 38 (2014).
- Grotschel et al. (1984) M. Grotschel, L. Lovasz, and A. Schrijver, in Topics on Perfect Graphs, North-Holland Mathematics Studies, Vol. 88, edited by C. Berge and V. Chvatal (North-Holland, 1984) pp. 325 – 356.
- Gavril (1972) F. Gavril, SIAM Journal on Computing 1, 180 (1972).
- Newman (2010) M. Newman, Networks: An Introduction (Oxford University Press, Inc., New York, NY, USA, 2010).
- Lucibello and Ricci-Tersenghi (2014) C. Lucibello and F. Ricci-Tersenghi, International Journal of Statistical Mechanics 2014, 1 (2014).
- Newman (2009) M. E. Newman, Physical review letters 103, 058701 (2009).
- Rumelhart et al. (1986) D. E. Rumelhart, J. L. McClelland, P. R. Group, et al., Cambridge, MA (1986).
- McClelland et al. (1987) J. L. McClelland, D. E. Rumelhart, P. R. Group, et al., Parallel distributed processing, Vol. 2 (MIT press Cambridge, MA, 1987).
- O’Reilly and Munakata (2000) R. C. O’Reilly and Y. Munakata, Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain (MIT press, 2000).
- LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Nature 518, 529 (2015).
- Gibney (2016) E. Gibney, Nature 529, 445 (2016).
- Ozcimder et al. (in press) K. Ozcimder, B. Dey, S. Musslick, G. Petri, K. N. Ahmed, T. L. Willke, and J. D. Cohen, in Proceedings of the 38th Annual Conference of the Cognitive Science Society (London, GB, in press).
- Butenko et al. (2002) S. Butenko, P. Pardalos, I. Sergienko, V. Shylo, and P. Stetsyuk, in Proceedings of the 2002 ACM symposium on Applied computing (ACM, 2002) pp. 542–546.
- Bomze et al. (1999) I. M. Bomze, M. Budinich, P. M. Pardalos, and M. Pelillo, in Handbook of combinatorial optimization (Springer, 1999) pp. 1–74.
- Wan et al. (2011) P.-J. Wan, O. Frieder, X. Jia, F. Yao, X. Xu, and S. Tang, Wireless link scheduling under physical interference model (IEEE, 2011).
- Hack et al. (2006) S. Hack, D. Grund, and G. Goos, in International Conference on Compiler Construction (Springer, 2006) pp. 247–262.
- Rogers and McClelland (2004) T. T. Rogers and J. L. McClelland, Semantic cognition: A parallel distributed processing approach (MIT press, 2004).
- David E. Rumelhart and Williams (1986) G. E. H. David E. Rumelhart and R. J. Williams, Nature 323, 533 (1986).
- Hazan et al. (2006) E. Hazan, S. Safra, and O. Schwartz, computational complexity 15, 20 (2006).
- Musslick et al. (2016b) S. Musslick, B. Dey, K. Özcimder, M. A. Patwary, T. L. Willke, and J. D. Cohen, “Controlled vs. automatic processing: A graph-theoretic approach to the analysis of serial vs. parallel processing in neural network architectures,” https://github.com/musslick/Multitasking (2016b).
Appendix A Materials and Methods
a.1 Neural Network Architecture and Processing.
We use a standard non-linear feed-forward network, with four layers, that has been used previously to simulate a wide array of empirical findings concerning human cognitive performance Cohen et al. (1990); Botvinick et al. (2001); Rogers and McClelland (2004). The network consists of two input layers, one of which represents the stimulus presented to the network and another that encodes the task that the network has to perform on this stimulus. Both input layers project to a hidden layer (comprising 100 units). The hidden layer and the task layer further project to an output layer that computes the network's response. The current stimulus is encoded by the real-valued activities of the input units. Activated units in the task layer indicate the tasks that are currently being executed. Performing a single task corresponds to clamping the corresponding task unit to 1 (activated) while all other units are set to 0. Multitasking conditions are represented by activating multiple task units at the same time. Units in the hidden and output layers take values between 0 and 1, as determined by a logistic activation function applied to their net input. Stimulus input units are structured into dimensions (subvectors of the stimulus pattern), each of which is comprised of a set of feature units, with only one feature unit activated per dimension. Similarly, output units are organized into response dimensions, with only one response unit permitted to be active within a response dimension. Each task is represented by a single task input unit that is associated with a set of unique, one-to-one mappings between the input units in one stimulus dimension and the output units in one response dimension, and that is independent of the mappings for all other tasks (see Figure 1c). The number of stimulus input dimensions and response dimensions was varied between 4 and 6 across environments.
The task mappings were generated with the Erdős-Rényi model such that the number of overlapping tasks for a given stimulus input dimension varied between 1 and 5. For each environment a network was initialized with a set of small random weights and then trained using the backpropagation algorithm Rumelhart et al. (1986) to produce the task-specified response for all stimuli in each task until it reached a mean-squared error performance of 0.001. We constrained the learned representations of the network to reflect the task similarity structure of the environment by fixing the weights from the task units to the associative layer: weight vectors for tasks relying on the same stimulus input dimensions were set to yield a fixed Pearson correlation coefficient, whereas weight vectors for tasks of non-overlapping stimulus dimensions were uncorrelated.
a.2 Dependency graph extraction.
We follow the analysis described in Musslick et al. (2016a) and focus on the representations (patterns of activity) over the hidden and output units, insofar as these reflect the computations carried out by the network required to perform each task. In particular, we are interested in the characteristics of these representations for each task, how they compare across tasks, and how these factors correspond to empirical parallel processing performance. The representations associated with each task can be characterized by calculating, for each unit in the hidden and output layers, the mean of its activity over all of the stimuli for a given task; this mean pattern of activity can then be used as a representation of the task.
Correlating these patterns of activity across tasks yields a task similarity matrix that can be examined separately for the hidden and output layers of the network. This can then be used to assess the extent to which different tasks rely on similar or different representations within each layer of the network. Figure 1c provides an example of such similarity matrices (thresholded for similarity correlations above a fixed value). Tasks that have similar representations over the hidden layer can be inferred to rely on the same input dimension (that is, they share an input component in the bipartite graph representation of the network), and tasks that are similar at the output layer can be inferred to share an output component. Accordingly, a task dependency graph (of the type shown in Figure 1b) can be constructed by measuring the patterns of activity observed in the network while it performs each individual task.
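The extraction described above can be sketched in a few lines of numpy. The function name and the similarity threshold below are illustrative assumptions (the threshold actually used for Figure 1c is not restated here):

```python
import numpy as np

def task_dependency_graph(hidden_acts, out_acts, tasks, threshold=0.5):
    """Build a task dependency graph from recorded activity.

    hidden_acts, out_acts: (n_trials, n_units) activity arrays recorded
    while the network performs each trial; tasks: per-trial task labels."""
    labels = np.unique(tasks)
    # mean pattern of activity over all stimuli of a task = task representation
    H = np.array([hidden_acts[tasks == t].mean(axis=0) for t in labels])
    O = np.array([out_acts[tasks == t].mean(axis=0) for t in labels])
    sim_h = np.corrcoef(H)          # task similarity at the hidden layer
    sim_o = np.corrcoef(O)          # task similarity at the output layer
    n = len(labels)
    # two tasks are dependent if their representations are similar at either layer
    adj = ((sim_h > threshold) | (sim_o > threshold)) & ~np.eye(n, dtype=bool)
    return labels, adj

rng = np.random.default_rng(1)
acts = rng.random((60, 8))
outs = rng.random((60, 5))
tasks = np.repeat(np.arange(3), 20)     # 3 tasks, 20 trials each
labels, adj = task_dependency_graph(acts, outs, tasks)
```

The resulting symmetric adjacency matrix is the dependency graph analyzed in the following sections.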
a.3 Computation of the MIS of the dependency graph
The Maximum Independent Set (MIS) problem Tarjan and Trojanowski (1977) is a particular instance of a larger class of optimization problems, the Maximum Packing Set (MPS) problem Hazan et al. (2006), which we introduce below. Given a set $U$ and a collection of its subsets $S_j \subseteq U$ labeled by $j \in J$, a set packing is a collection of pairwise disjoint subsets $S_j$, and its size is the packing number. The problem of finding the maximum packing number can be formulated as an integer programming problem as follows:
$$\max \sum_{j \in J} x_j \quad \text{s.t.} \quad \sum_{j :\, u \in S_j} x_j \le 1 \;\; \forall u \in U, \qquad x_j \in \{0,1\} \;\; \forall j \in J.$$
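For the small dependency graphs considered in our simulations, the MIS can be found by exhaustive search. The sketch below is a brute-force illustration only, since the problem is NP-hard in general and large instances require the analytical machinery developed in this section:

```python
from itertools import combinations

def maximum_independent_set(nodes, edges):
    """Brute-force MIS: feasible only for small graphs."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(nodes), 0, -1):          # try largest sets first
        for subset in combinations(nodes, size):
            if all(frozenset(pair) not in edge_set
                   for pair in combinations(subset, 2)):
                return set(subset)                  # first hit is maximum
    return set()

# 5-cycle: the MIS has size 2
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
mis = maximum_independent_set(nodes, edges)
```

In the multitasking interpretation, the returned set is a largest collection of tasks with no shared components, i.e. tasks that can be executed concurrently without interference.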
We follow here the approach of Lucibello and Ricci-Tersenghi (2014): denote the variable node set by $V$ and to each $i \in V$ associate a variable $x_i$ which takes values in $\{0,1\}$. Denote by $F$ the factor node set, which contains the elements acting as constraints on the variables $x_i$. The edge set is then defined as $E = \{(i,a) : i \in V,\; a \in F,\; x_i \text{ enters constraint } a\}$.
Denoting by $G$ the factor graph, the problem specification can be rewritten as
$$\max \sum_{i \in V} x_i \quad \text{s.t.} \quad \sum_{i \in \partial a} x_i \le 1 \;\; \forall a \in F,$$
where $\partial a$ denotes the set of variable nodes entering constraint $a$.
Summing the values of the $x_i$'s for a configuration satisfying the condition above yields the size of the solution. Working in terms of density, this becomes
$$\rho = \frac{1}{|V|} \sum_{i \in V} x_i.$$
Note that all the results are defined in the thermodynamic limit $|V| \to \infty$.
It is possible to write analytical expressions for the expected MIS density, which are exact in certain regimes (as defined by a ratio between the average factor and variable node degrees, effectively amounting to local tree-likeness of the factor graph) and which, in the Replica Symmetric solution, are given by Lucibello and Ricci-Tersenghi (2014):
where $d$ is the degree of a factor node, $\tilde d$ the excess degree of a factor node, $k$ the degree of a variable node, $\tilde k$ the excess degree of a variable node, and $\langle d \rangle$ and $\langle k \rangle$ the average factor and variable degrees, respectively. Expectations are taken over the corresponding degree distributions, and in standard uncorrelated networks the excess degree distributions take the forms $q_d = (d+1)\,p_{d+1}/\langle d \rangle$ and $q_k = (k+1)\,p_{k+1}/\langle k \rangle$. They can be rewritten in terms of generating functions by exploiting the properties of the degree and excess degree distributions as follows:
with $G_0(x) = \sum_k p_k x^k$ the generating function of the degree distribution and $G_1(x) = G_0'(x)/G_0'(1)$ that of the excess degree distribution, where the derivative of $G_0$ is intended in $x = 1$.
so the final expressions become:
These allow us to directly substitute the moment-generating functions corresponding to the degree distribution of interest.
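The excess-degree and generating-function relations above can be verified numerically. For a Poisson degree distribution the excess degree distribution coincides with the degree distribution itself, so both generating functions equal $e^{\langle k \rangle (x-1)}$; the sketch below checks this with an illustrative mean degree:

```python
import numpy as np
from math import exp, factorial

c = 3.0                         # illustrative mean degree
kmax = 60                       # truncation; the Poisson tail is negligible here
p = np.array([exp(-c) * c**k / factorial(k) for k in range(kmax)])
mean_k = float(sum(k * p[k] for k in range(kmax)))

# excess degree distribution: q_k = (k + 1) p_{k+1} / <k>
q = np.array([(k + 1) * p[k + 1] / mean_k for k in range(kmax - 1)])

def G(dist, x):
    """Generating function sum_k dist[k] * x^k."""
    return float(sum(pk * x**k for k, pk in enumerate(dist)))

x = 0.7
g0 = G(p, x)                    # equals e^{c(x-1)} for a Poisson distribution
g1 = G(q, x)                    # excess distribution coincides with p_k here
```

For other degree distributions $G_1$ differs from $G_0$, which is precisely what the excess-degree construction captures.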
In the main text, we use a slightly different notation.
This is due to the fact that for the case of the MIS $d = 2$ for all factor nodes (each constraint corresponds to an edge between two variable nodes), so we can simplify the expressions.
For a generic Gaussian-like degree distribution with mean $\mu$ and variance $\sigma^2$, the moment generating function takes the form $M(t) = e^{\mu t + \sigma^2 t^2/2}$. Following the expression above, we obtain:
While for a Poisson degree distribution, the generating function is $G(x) = e^{\langle k \rangle (x-1)}$, yielding:
In Figure 3 we show that this expression captures well the behavior of the MIS density for increasing network density and for various levels of degree heterogeneity. Moreover, it gives an analytical grounding to the previously empirical observation that a larger heterogeneity of task overlap at fixed density results in a higher parallel processing capacity Musslick et al. (2016b); Feng et al. (2014).
In the manuscript we focused on unimodal degree distributions for simplicity of explanation and for consistency with previous work by Feng et al. (2014).
However, Eq. A.3 also gives good results with more structured degree distributions.
We give here the solutions for the Gamma and Pareto distributions.
For a Gamma distribution with shape $\alpha$ and rate $\beta$, whose moment generating function is $M(t) = (1 - t/\beta)^{-\alpha}$ for $t < \beta$, we obtain:
While for a Pareto distribution, the density expression becomes:
In Figure 5 we show the comparison between simulated and predicted MIS densities.
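The Gamma moment generating function $M(t) = (1 - t/\beta)^{-\alpha}$ (equivalently $(1 - \theta t)^{-\alpha}$ with scale $\theta = 1/\beta$) can be sanity-checked by Monte Carlo using only the standard library; the parameter values below are illustrative:

```python
import random
from math import exp

rng = random.Random(42)
alpha, scale, t = 2.0, 1.0, 0.2          # shape, scale = 1/rate, evaluation point
n = 200_000

# Monte Carlo estimate of E[e^{tX}] for X ~ Gamma(alpha, scale)
est = sum(exp(t * rng.gammavariate(alpha, scale)) for _ in range(n)) / n

# closed-form MGF, valid for t < 1/scale
true = (1.0 - t * scale) ** (-alpha)
```

The same check fails for a Pareto distribution with a heavy tail, whose moment generating function does not exist for $t > 0$; only expressions built from its finite moments remain usable.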
a.4 Computation of the expected performance
When we calculate the maximum parallel capacity for a certain task graph with task set (i.e., node set) $V$, we are focusing only on one (or, in some cases, a few) maximum independent sets. It is, however, important to ask a different question: given a certain task set with cardinality $N = |V|$, what is the average number of tasks that can be performed in parallel? Since we do not want to specify which specific task subset we are interested in, it is useful to rewrite this as follows:
where $P$ is the probability that $k$ nodes are independent of each other while the remaining $N - k$ nodes are attached to at least one of the first $k$ (otherwise they would be independent too). Again, we would like to estimate this quantity using the degree distribution of the dependency graph, in order to be able to compute the expected performance. This can be easily visualized as in Figure 6: denote the degrees of the $k$ candidate nodes by $d_1, \dots, d_k$; i) we need the $k$ nodes to be independent, which for each pair of nodes happens with a probability fixed by their degrees (we are prohibiting the red edge in the Figure); ii) we then need the remaining $N - k$ nodes to connect to at least one of the stubs belonging to the $k$ independent nodes, and this happens with a fixed probability per node (we impose the existence of at least one yellow edge). Putting these contributions together we finally arrive at the probability of executing successfully $k$ out of $N$ tasks with degrees $d_1, \dots, d_k$:
Naturally, the full probability should include the probability of the degree configuration which, for uncorrelated random graphs, factorises into the product of the individual nodes’ degree distributions. So finally the expected performance becomes:
Specifying to the case of $d$-regular networks for simplicity, the previous equation takes a very simple form:
This form of the equation is interesting because it makes it very easy to see the dependence of the expected performance on both the average degree and the size of the original graph: indeed, in this case the number of tasks $N$, which corresponds to the number of edges of the task structure graph, can grow both by enlarging that graph and by increasing its average degree. For a generic degree distribution with finite first and second moments, if we consider the three main factors in the probability as independent variables (which is a reasonable assumption for sparse graphs, i.e. for average degrees much smaller than the number of nodes), we can approximate the expressions above with the following:
which are the ones we use in the main text.
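The behavior of the expected performance can also be probed by direct simulation. Below is an illustrative pure-Python sketch on an Erdős–Rényi dependency graph; the function names and parameter values are our own, and the greedy count gives a lower bound on the number of concurrently executable tasks in each draw:

```python
import random
from itertools import combinations

def er_graph(n, p, rng):
    """Erdős–Rényi G(n, p) as a set of frozenset edges."""
    return {frozenset(e) for e in combinations(range(n), 2) if rng.random() < p}

def expected_parallelizable(n, edges, k, trials=2000, rng=None):
    """Monte Carlo sketch: draw k tasks (nodes) uniformly at random and count
    how many can run concurrently, via a greedy independent subset."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(trials):
        chosen = rng.sample(range(n), k)
        independent = []
        for v in chosen:
            if all(frozenset((v, u)) not in edges for u in independent):
                independent.append(v)
        total += len(independent)
    return total / trials

rng = random.Random(1)
n, p, k = 100, 0.04, 10          # mean degree ~ 4; illustrative values
edges = er_graph(n, p, rng)
est = expected_parallelizable(n, edges, k, rng=rng)
```

Sweeping the density $p$ in such a simulation reproduces qualitatively the sub-linear growth of the expected parallelism discussed in the main text.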
Appendix B Estimation of dependency graph degree sequence and distribution
Consider an input-output pairing bipartite graph $B$ with the same number of input and output nodes, and denote its input degree distribution by $p^{\text{in}}_k$ and its output degree distribution by $p^{\text{out}}_k$. The line graph $L(B)$ of $B$ has node set equal to the edge set of $B$. Hence to each edge in $B$, or equivalently node in $L(B)$, we can associate its endpoints’ degrees $(k_{\text{in}}, k_{\text{out}})$, giving the degree $k_{\text{in}} + k_{\text{out}} - 2$ of the corresponding node in $L(B)$. If we denote the probability for an input node with degree $k_{\text{in}}$ to be linked to an output node with degree $k_{\text{out}}$ as $P(k_{\text{in}}, k_{\text{out}})$, we can use it to generate the degree distribution for $L(B)$. Writing the generating function for $P(k_{\text{in}}, k_{\text{out}})$ and substituting it in the expression for the line-graph degree yields the generating function of the degree distribution of $L(B)$ Newman (2009). The corresponding excess degree distributions are obtained easily as
which can be then rewritten in the standard generating function formalism.
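The degree relation for the line graph can be illustrated on a toy bipartite graph: the node of the line graph corresponding to an edge $(u, v)$ is adjacent to every other edge incident to $u$ or $v$, hence has degree $\deg(u) + \deg(v) - 2$. A small pure-Python sketch (names are our own):

```python
from collections import Counter

def line_graph_degrees(edges):
    """Degree of each edge (u, v) of a graph in its line graph:
    deg(u) + deg(v) - 2, since the edge meets every other edge
    incident to either endpoint."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {(u, v): deg[u] + deg[v] - 2 for u, v in edges}

# small bipartite example: inputs {0, 1}, outputs {'a', 'b', 'c'}
edges = [(0, 'a'), (0, 'b'), (1, 'b'), (1, 'c')]
degs = line_graph_degrees(edges)
```

In the multitasking setting, each edge of $B$ is a task, and its line-graph degree counts the tasks with which it directly shares an input or output component.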
In order to obtain an estimate of the actual degree in the dependency graph of a node in $L(B)$ characterized by $(k_{\text{in}}, k_{\text{out}})$, we need to calculate the contribution to the degree coming from the closure of open wedges, that is, how many second neighbours the node has. This calculation can be performed in a way similar to the standard calculation of the number of second neighbours Newman (2010). In this case, however, we need to take care of the potential effect of the joint degree distribution $P(k_{\text{in}}, k_{\text{out}})$. The degree of a node in the dependency graph is the sum of its degree in $L(B)$ and the number of second neighbours reached from the input excess edges and from the output excess edges:
This result holds exactly for sparse graphs, but we show that it also gives good results for graphs of intermediate density. In that case, however, we need to keep track of the possibility that input and output edges of a node in $L(B)$ might connect to the same node. For a node characterized by $(k_{\text{in}}, k_{\text{out}})$, the expression for the degree correction takes the form
Collecting all terms together, we arrive at:
Formally, we can write the corresponding generating function and, similarly to the above, generate the degree distribution for the dependency graph. In practice, calculating the solution for this distribution can be cumbersome, especially when $P(k_{\text{in}}, k_{\text{out}})$ does not have a well specified functional form. However, Eq. 30 can directly be used to generate the degree sequence of the dependency graph from that of the task structure graph. Note also that the final expression is written in terms of the first two moments of the task structure graph’s degree distribution, which play a crucial role in the estimation of the degree sequence of the dependency graph. In Figure 7 we show the performance of Eq. 30 for two different network topologies, exponential graphs and scale-free graphs. In the exponential case, by virtue of the finiteness of the moments, the estimation is quite accurate, while for scale-free graphs its accuracy is reduced due to the divergence of the second moment of the degree distribution (for exponents between 2 and 3).
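The origin of this degradation can be illustrated by comparing empirical second moments. In the sketch below (with illustrative parameters), Pareto samples with density exponent between 2 and 3 have a divergent second moment, so its empirical estimate remains large and unstable, unlike the exponential case:

```python
import random

rng = random.Random(7)

def pareto_sample(exponent, xmin, rng):
    """Inverse-transform sample from p(x) ~ x^{-exponent}, x >= xmin."""
    return xmin * rng.random() ** (-1.0 / (exponent - 1.0))

n = 100_000
# exponential degrees (mean 3): empirical second moment converges to 2 * 3^2 = 18
exp_m2 = sum(rng.expovariate(1 / 3) ** 2 for _ in range(n)) / n
# Pareto with exponent 2.5: the second moment diverges, so the estimate
# is dominated by the largest samples and keeps growing with sample size
par_m2 = sum(pareto_sample(2.5, 1.0, rng) ** 2 for _ in range(n)) / n
```

Since the degree-sequence estimate of Eq. 30 depends on these first two moments, its accuracy tracks how well they can be estimated from a finite network.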