How big is too big? Critical Shocks for Systemic Failure Cascades
Abstract
External or internal shocks may lead to the collapse of a system consisting of many agents. If the shock hits only one agent initially and causes it to fail, this can induce a cascade of failures among neighboring agents. Several critical constellations determine whether this cascade remains finite or reaches the size of the system, i.e. leads to systemic risk. We investigate the critical parameters for such cascades in a simple model, where agents are characterized by an individual threshold $\theta_i$ determining their capacity $\alpha\theta_i$ to handle a load, with $1-\alpha$ being their safety margin. If agents fail, they redistribute their load equally to $K$ neighboring agents in a regular network. For three different threshold distributions $P(\theta)$, we derive analytical results for the size of the cascade, $X(t)$, which is regarded as a measure of systemic risk, and the time when it stops. We focus on two different regimes, (i) EEE, an external extreme event where the size of the shock is of the order of the total capacity of the network, and (ii) RIE, a random internal event where the size of the shock is of the order of the capacity of an agent. We find that even for large extreme events that exceed the capacity of the network, finite cascades are still possible if a power-law threshold distribution is assumed. On the other hand, even small random fluctuations may lead to full cascades if critical conditions are met. Most importantly, we demonstrate that the size of the “big” shock is not the problem, as the systemic risk only varies slightly for changes of 10 to 50 percent of the external shock. Systemic risk depends much more on ingredients such as the network topology, the safety margin and the threshold distribution, which gives hints on how to reduce systemic risk.
Chair of Systems Design, ETH Zurich, Weinbergstrasse 58, CH-8092 Zurich, Switzerland
1 Introduction
Current research on systemic risk can be roughly divided into two different strands each one having its own focus: (i) the probability of extreme events which can cause a breakdown of the system, (ii) the mechanisms which can amplify the failure of a few system elements, to cause a failure cascade of the size of the system. The former line of research assumes that systemic risk is caused by external events, e.g. big earthquakes, tsunamis, or meteor impacts. Thus, in addition to the likelihood of extreme events, another interesting question regards the response of the system to such perturbations, i.e. its capability to absorb shocks of a given size. The latter research area, on the other hand, sees systemic failure as an endogenous feature that basically emerges from the nonlinear interaction of the constituents, i.e. how they redistribute, and possibly amplify, load internally.
In both approaches, the likelihood of a systemic breakdown can only be determined by considering the internal dynamics of the system elements, denoted as agents in this paper: their capacity to resist shocks, their time-bound interaction with neighbors, and their dependence on macroscopic feedback mechanisms, such as coupling to the macroscopic state of the system. Only in rare cases can the dynamics of systemic risk be reduced to mere topological aspects, such as the diversity in the number of neighbors, the role of hubs, etc. In this paper, we combine the two research questions mentioned above: on the one hand, we are interested in the critical size of an external shock that may lead to collapse of the system. At the same time, we address how such critical values depend on the safety margins of the system elements and on the details of their interaction when redistributing load internally. We also investigate how these cascading dynamics are affected by the structural features of the network (level of connectivity, topological heterogeneities) and by individual properties of the agents, such as the probability distributions of the failure thresholds. Such insights can directly benefit a robust system design by means of individualization of agents (i.e. designing agents with optimal heterogeneity).
Given the importance of such problems for social, economic and technological systems, the topic is already discussed in a wide range of scientific literature. Several modeling frameworks were recently proposed [1, 2, 3]. The complex network approach was also used to describe cascading processes in power grids and in Internet services [4, 5], and was applied to data storage services [6]. Importantly, similar agent-based approaches were developed to model avalanche defaults among financial institutions [7, 8].
Our paper is organized as follows: in the next section we introduce the agent-based model studied, determining e.g. the agents' fragility and the agents' interaction by means of a load redistribution mechanism. This allows us to define a measure for systemic risk on the macroscopic level. In Section 3 we develop an analytical framework that allows us to unveil the dynamics of systemic risk based on cascading processes. In Section 3.2, we discuss the critical conditions for systemic risk to emerge. Later, in Section 3.3, we study when agents can be considered as systemic by their importance. The paper finishes with some conclusions in Section 4, which also allow for a generalized picture of how to prevent systemic risk.
2 Analytical approach to systemic risk
2.1 Description of the model
Net fragility
In a recent paper [1], a framework to model systemic risk by means of an agent-based approach was developed. In this framework, each agent $i$ is characterized by three individual variables: a discrete variable $s_i(t) \in \{0, 1\}$, which describes its state at a discrete time $t$, i.e. $s_i = 0$ for an operating state and $s_i = 1$ for a failed state, and two continuous variables, the threshold $\theta_i$ and the load $\phi_i(t)$. The threshold $\theta_i$ is assumed to describe the individual ‘capacity’ of an agent: it defines how much load an agent can carry before it fails. On the other hand, the variable $\phi_i(t)$ describes the load which is exerted on an agent.
We note that the load $\phi_i(t)$ can change in time, e.g. through systemic feedback; it further depends on the state of other agents and on the network of interactions, described by the adjacency matrix $A$. The load therefore also depends on how it is exchanged between agents. A special case will be discussed below. We define that agent $i$ fails if its net fragility $z_i(t)$,
(1) $z_i(t) = \phi_i(t) - \theta_i$
is equal to or larger than zero. I.e., in a deterministic model, the dynamics of an agent is given by
(2) $s_i(t+1) = \Theta[z_i(t)]$
where $\Theta[\cdot]$ is the Heaviside function (with $\Theta[0] = 1$). Certainly, the dynamics only depends on the net fragility, i.e. on the relative distance between load and threshold. Nevertheless, a distinction between these two individual variables is very useful, as it allows us to conceptually distinguish between internal and external influences on the failure.
Systemic risk
We now define the central measure of systemic risk, $X(t)$, as the fraction of failed agents at any point in time. For a system composed of $N$ agents, it reads
(3) $X(t) = \frac{1}{N} \sum_{i=1}^{N} s_i(t) = \int_{0}^{\infty} p_{z(t)}(z)\, dz$
Here $s_i(t) = 1$ for a failed agent and $0$ otherwise, and $p_{z(t)}(z)$ represents the density of agents with a net fragility $z$ at time $t$; the integral runs over the agents whose net fragility is non-negative. Failures in a subset of agents will result in cascading processes over the network of interactions, which change the fragility of other agents in the course of time. This can be expressed by the recursive dynamics $z(t+1) = \mathcal{F}[z(t)]$, where $\mathcal{F}$ is some function that describes how the load of failing agents is redistributed depending on the interaction mechanisms. With this, by specifying the initial condition $p_{z(0)}(z)$, it is possible to compute $X(t)$ for a deterministic dynamics.
In Ref. [1], $X(t)$ was calculated by making suitable assumptions about the distribution of the net fragility, $p_{z(t)}(z)$, the initial conditions $p_{z(0)}(z)$, and for the particular case of a fully connected network, i.e. each agent interacts with everyone else. Specifically, an initial condition $z(0) \sim \mathcal{N}(\mu, \sigma)$ was used; i.e. the initial fragility of agents is normally distributed with a mean $\mu$ and standard deviation $\sigma$. This implies that the initial fraction of failed agents at time $t = 0$ is given by $X(0) = 1 - \Phi_{\mu,\sigma}(0)$, where $\Phi_{\mu,\sigma}$ denotes the cumulative distribution function of the normal distribution. I.e. it gives the (normalized) number of agents with an initial net fragility (defined in Eq. (1)) equal to or larger than zero. The authors calculated the size of cascades measured by the final fraction of failed agents for different interaction mechanisms. Remarkably, it was found that systemic risk depends on the variance of the distribution in a non-monotonic manner. This means that systemic risk can decrease if the agents become more heterogeneous, i.e. if their individual thresholds become more different. On the other hand, for homogeneous agents characterized by the same threshold, a first-order phase transition was found between no systemic risk and complete failure.
Initial fragility
We use these previous findings as a reference point, but we will extend our model in different ways. First of all, instead of a normal distribution for the initial net fragility $z_i(0)$, we assume a fixed relation between initial fragility $z_i(0)$ and threshold $\theta_i$:
(4) $z_i(0) = -(1-\alpha)\,\theta_i$, i.e. $\phi_i(0) = \alpha\,\theta_i$
The parameter $\alpha$ is a constant, equal for all agents. Only for one agent $i$, instead of the fixed relation (4), we define $z_i(0) \geq 0$, caused by an initial load $\phi_i(0) = \phi^0 \geq \theta_i$. Thus, we consider that initially only one agent, $i$, is at a critical condition, whereas with $\alpha < 1$ all other agents are initially capable of handling the load assigned to them. I.e., different from the distribution of initial loads in [1], we do not have an initial failure cascade. Instead, the initial condition for the systemic risk is simply $X(0) = 1/N$.
The value $1-\alpha$ can then be regarded as the agents' available capacity (or safety margin) before they fail if their load is increased. A fixed relation between fragility and threshold was first used in [4] to describe cascading processes in power grids and the Internet (see also [5, 9, 10]). It basically reflects the situation of many socio-technical systems, in which the capacity of agents is usually designed ad hoc to handle the load under normal conditions, because it is limited by cost. We will later vary the safety margin $1-\alpha$ to determine how the severity of cascading failures depends on it.
Threshold distribution
With these considerations, only the threshold distribution $P(\theta)$ remains to be specified to complete the initial conditions. It is worth remarking that the capacities of the agents, in contrast with other studies found in the literature so far, are decoupled from the topological artifacts of the network connecting them. Here we will use three different assumptions for both analytical calculations and computer simulations:

a delta distribution $P(\theta) = \delta(\theta - \bar{\theta})$, where $\delta$ is the Dirac delta function, i.e. all agents have the same threshold $\bar{\theta}$,

a uniform distribution with the mean and the range , i.e. all agents have different, but comparable thresholds in the interval . For all further calculations we define .

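a power-law distribution
(5) $P(\theta) = \frac{\gamma - 1}{\theta_{\min}} \left(\frac{\theta}{\theta_{\min}}\right)^{-\gamma}, \quad \theta \geq \theta_{\min}$
i.e. agents have thresholds that can differ by orders of magnitude. As the normalization depends on the value $\theta_{\min}$, we assign its numerical value for our further calculations to be comparable with the minimum value of the uniform distribution.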
Agent interaction
In order to describe the agent interaction, we use the network approach in which agents are represented by nodes and interactions by links between agents. I.e., the network topology specifies which other agents a particular one interacts with. This can be statistically described by the degree distribution, for which we will use in this paper only a delta distribution centered at $K$, i.e. a regular network, in which each agent interacts with $K$ other agents. The fully connected network is a special case with $K = N - 1$.
Secondly, we have to specify how agents interact through these links. Here, we assume a load redistribution mechanism in which the initially failing agent $i$ shares its load $\phi^0$ equally among its $K$ neighboring agents (labeled $j$), see Fig. 1. That means for each of these agents, their own load increases in the next time step by an amount of $\phi^0/K$. If this addition leads to a positive net fragility $z_j$, agents $j$ fail as well and redistribute their load equally to their neighboring agents, and so on.
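The redistribution mechanism described above can be illustrated with a short simulation. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the network is a finite Bethe-like tree with coordination number `K`, every agent initially carries the load `alpha * theta` (so that `1 - alpha` is the safety margin), a failing agent splits its load into `K` equal parts, and parts sent to already failed neighbors are discarded. The names `build_tree` and `run_cascade` are hypothetical helpers introduced for illustration.

```python
def build_tree(K, depth):
    """Adjacency lists of a finite Bethe-like tree.

    The root (agent 0) has K children; every other inner node has K - 1
    children, so that all inner nodes have exactly K neighbors.
    """
    neighbors = [[]]
    frontier = [0]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            n_children = K if node == 0 else K - 1
            for _ in range(n_children):
                child = len(neighbors)
                neighbors.append([node])
                neighbors[node].append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return neighbors


def run_cascade(neighbors, theta, alpha, K, phi0, seed_agent=0):
    """Return the final fraction of failed agents (systemic risk X).

    Assumptions of this sketch: initial load alpha * theta_i for all agents,
    load phi0 for the initially failing agent, equal split into K parts,
    failure as soon as load - threshold >= 0.
    """
    load = [alpha * th for th in theta]
    failed = [False] * len(theta)
    load[seed_agent] = phi0
    failed[seed_agent] = True
    failing = [seed_agent]
    while failing:
        received = {}
        for i in failing:
            share = load[i] / K
            for j in neighbors[i]:
                if not failed[j]:
                    received[j] = received.get(j, 0.0) + share
        failing = []
        for j, extra in received.items():
            load[j] += extra
            if load[j] - theta[j] >= 0.0:  # net fragility z_j >= 0
                failed[j] = True
                failing.append(j)
    return sum(failed) / len(theta)


if __name__ == "__main__":
    tree = build_tree(K=3, depth=4)
    thresholds = [1.0] * len(tree)  # homogeneous thresholds
    print(run_cascade(tree, thresholds, alpha=0.5, K=3, phi0=10.0))
```

Heterogeneous thresholds can be tried by passing any other list for `theta`, for example values drawn from a uniform or power-law distribution.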
Cascade sizes
This way, failure cascades can occur in the course of time, and we are interested in their relative size $X$ and the probability distribution of their occurrence, $P(X)$. For this calculation, which will be done in Sect. 3.2, we define the fraction of failing agents at each time step, $f(t)$, and the number of failing agents during the same time interval as:
(6) 
Here $n(t)$ gives the number of agents that are hit by the cascade at time $t$, i.e. they are located in the $t$-th neighborhood of agent $i$ which failed initially. Hence, with the model of Fig. 1 in mind, $n(t)$ is the number of agents that can potentially fail during time step $t$. Depending on the topology of the regular network, there are two limiting cases to express how $n(t)$ grows in time. $n(t) \propto t$ holds in regular networks, where the interface grows linearly with distance. On the other hand, for Bethe lattices, tree-like structures and random topologies in which loops are neglected [2], the number of nodes at distance $t$ is $n(t) = K (K-1)^{t-1}$.
Using the definition (6), we can calculate the size of the cascade at time $t$, which is equal to the systemic risk $X(t)$, as:
(7) 
In general, Eq. (7) cannot be solved analytically. However, in the following sections we will derive analytical expressions for $X(t)$, assuming different distributions of the agents' thresholds.
Finite versus infinite cascades
According to the definitions above, a total failure occurs if $X(t) = 1$. In a finite system, this will happen at a finite time $t$, while in an infinite system this final state is reached only asymptotically, $t \to \infty$. However, in a finite system a cascade can stop even for $X < 1$ if the potential number of failing agents reaches the system size at a given time $t^{N}$,
(8) $\sum_{\tau=0}^{t^{N}} n(\tau) = N$
Yet, there is a third case to be considered, namely that the cascade stops at a finite time $t^{\theta}$, even if $X < 1$ and $t^{\theta} < t^{N}$, simply because the redistribution of loads to the nearest neighbors does not cause further failure. This is expressed by the condition $f(t) = 0$ for $t > t^{\theta}$.
We will refer to an “infinite” cascade if $X = 1$, which means every agent in the system has failed. On the other hand, a “finite” cascade occurs either if it stops at time $t^{\theta}$, or if the redistribution of load has reached the system size, Eq. (8), without causing all agents to fail. Consequently, finite cascades stop at time $t^{\ast} = \min(t^{\theta}, t^{N})$. We note that, according to our definition, Eq. (7), systemic risk refers to finite cascades as well, not just to $X = 1$. Precisely, we are interested in the distribution $P(X)$, i.e. the density of failed agents at the time by which cascades end, regardless of the cause.
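The time at which the cascade front reaches the system size follows directly from the shell sizes. The sketch below is an illustration under assumptions: for the Bethe lattice it uses the shell size $K(K-1)^{t-1}$ quoted above, and for the regular lattice it assumes a two-dimensional square lattice, where the number of sites at lattice distance $t$ is $4t$ (the prefactor `c = 4` is an assumption; other lattices have other prefactors). All function names are hypothetical.

```python
def shell_size_bethe(t, K):
    """Number of agents at distance t from the initially failing agent."""
    return 1 if t == 0 else K * (K - 1) ** (t - 1)


def shell_size_lattice(t, c=4):
    """Shell size growing linearly with distance (c = 4: square lattice)."""
    return 1 if t == 0 else c * t


def time_to_reach_system_size(N, shell_size):
    """Smallest t such that the cumulative shell size is at least N."""
    t = 0
    reached = shell_size(0)
    while reached < N:
        t += 1
        reached += shell_size(t)
    return t


if __name__ == "__main__":
    N = 10_000
    print(time_to_reach_system_size(N, lambda t: shell_size_bethe(t, K=4)))
    print(time_to_reach_system_size(N, shell_size_lattice))
```

For the same number of agents the tree is exhausted after far fewer steps than the lattice, which is the diameter argument used later when the two topologies are compared.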
Network capacity
In order to put the size of the initial shock into perspective, we refer to the total capacity $Q$ of the network to absorb shocks, which depends on the safety margin $1-\alpha$, the total number of nodes $N$, and the threshold distribution $P(\theta)$. Thus, the capacity that the system could a priori absorb during the cascade is simply given by
(9) $Q = (1-\alpha)\, N \int \theta\, P(\theta)\, d\theta$
If the threshold distribution has a defined mean value, $\langle\theta\rangle$, this expression reduces to $Q = (1-\alpha)\, N \langle\theta\rangle$. On the other hand, for a normalized power-law distribution with a minimum threshold value $\theta_{\min}$, the mean value is only defined for $\gamma > 2$. For $\gamma \leq 2$, a simple argument [11] shows that for a finite system the expected value can still be computed. The result is
(10) 
It is worth noticing that for the delta threshold distribution, the uniform (with ) and the power-law distribution with , the network capacity is of the same order of magnitude. In Fig. 2 we show the network capacity for the power-law distribution as a function of and . Precisely, it takes the same numerical value in all three cases, , if and , as used for the numerical calculations. However, for the power-law distribution with , the network capacity becomes much larger than in the other cases because of the additional dependence on the number of agents, $N$. Choosing for the numerical calculations later implies that, compared to the uniform case, we have .
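The network capacity can also be estimated numerically from a sample of thresholds. The sketch below assumes that every agent contributes its free capacity `(1 - alpha) * theta`, and that power-law thresholds follow a normalized density proportional to `theta ** (-gamma)` above `theta_min`; the sampler uses inverse-transform sampling. Function names and parameter values are illustrative assumptions, not taken from the paper.

```python
import random


def network_capacity(thetas, alpha):
    """Total load the agents could absorb: sum of (1 - alpha) * theta_i."""
    return (1.0 - alpha) * sum(thetas)


def sample_power_law(n, theta_min, gamma, rng):
    """Draw n thresholds with density ~ theta**(-gamma) for theta >= theta_min."""
    exponent = -1.0 / (gamma - 1.0)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and is never zero.
    return [theta_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


def power_law_mean(theta_min, gamma):
    """Mean threshold; only defined for gamma > 2."""
    if gamma <= 2.0:
        raise ValueError("the mean diverges for gamma <= 2")
    return theta_min * (gamma - 1.0) / (gamma - 2.0)


if __name__ == "__main__":
    rng = random.Random(1)
    for gamma in (1.5, 3.0):
        thetas = sample_power_law(10_000, theta_min=0.5, gamma=gamma, rng=rng)
        print(gamma, network_capacity(thetas, alpha=0.5))
```

For exponents below 2 the sampled capacity is dominated by a few very large thresholds and grows faster than linearly with the number of agents, which is the system-size dependence mentioned above.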
In this paper, depending on the magnitude of the initial shock, we distinguish between two different regimes:
(i) EEE – the extreme exogenous event resulting in a very large $\phi^0$ which is of the order of $Q$, i.e. much larger than the capacity of the initially failing agent (or the average capacity of agents): $\phi^0 \gg \theta_i$. In this case, there is no surprise that agents involved in the redistribution of load will fail, at least in an early phase. Hence, we are mostly interested in the conditions under which cascades may stop before they have reached the size of the system.
(ii) RIE – the random internal event, which assumes that initially one randomly chosen agent $i$ faces a load $\phi^0$ that is slightly larger than its own capacity $\theta_i$, drawn from the distribution $P(\theta)$, i.e. $\phi^0 \gtrsim \theta_i$. This is likely to happen by a random fluctuation of the load that exceeds the threshold, rather than a big impact on the system. In this case, we are interested in the conditions under which cascades occur at all.
2.2 Conditions for failing nearest neighbors
We assume that, at $t = 0$, a single randomly chosen agent $i$ fails because of an initial shock, i.e. $\phi_i(0) = \phi^0 \geq \theta_i$. According to the redistribution mechanism described above, this failure will increase the fragility of the nearest neighbors of $i$, labeled $j$ (see Fig. 1). Agent $j$ can fail if its net fragility becomes positive, i.e.:
(11) $z_j(1) = \phi_j(1) - \theta_j = \phi_j(0) + \frac{\phi^0}{K} - \theta_j \geq 0$
which together with Eq. (4), $\phi_j(0) = \alpha\,\theta_j$, leads to the critical condition for the failure of agent $j$,
(12) $\theta_j \leq \theta_c(1) \equiv \frac{\phi^0}{K\,(1-\alpha)}$
Here $\theta_c(1)$ defines the critical threshold for the first-order neighborhood of agent $i$, or the critical threshold at time $t = 1$, respectively, with $1-\alpha$ being the safety margin and $K$ the number of neighbors. Agents with a threshold between $0$ and $\theta_c(1)$ will fail, hence the fraction of failing agents at time $t = 1$ reads:
(13) $f(1) = \int_{0}^{\theta_c(1)} P(\theta)\, d\theta$
This fraction depends on the threshold distribution $P(\theta)$, so explicit calculations will be given in the next Section. At the moment we just assume that at least one agent $j$ has failed, i.e. there will be a cascade to the next neighborhood (cf. Fig. 1).
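For the three threshold distributions considered here, the fraction of nearest neighbors failing in the first step is simply the cumulative distribution evaluated at the critical threshold. The sketch below assumes the critical threshold $\phi^0 / [K(1-\alpha)]$, which follows from an initial load $\alpha\theta_j$ and an equal split of $\phi^0$ among $K$ neighbors; the interval bounds `lo`, `hi` and all function names are illustrative assumptions.

```python
def critical_threshold(phi0, K, alpha):
    """Largest threshold that still fails after receiving phi0 / K."""
    return phi0 / (K * (1.0 - alpha))


def cdf_delta(x, theta_bar):
    """All agents share the threshold theta_bar."""
    return 1.0 if x >= theta_bar else 0.0


def cdf_uniform(x, lo, hi):
    """Thresholds uniformly distributed on [lo, hi]."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def cdf_power_law(x, theta_min, gamma):
    """Density ~ theta**(-gamma) for theta >= theta_min, with gamma > 1."""
    if x <= theta_min:
        return 0.0
    return 1.0 - (x / theta_min) ** (-(gamma - 1.0))


if __name__ == "__main__":
    theta_c = critical_threshold(phi0=3.0, K=4, alpha=0.5)
    print(theta_c)
    print(cdf_delta(theta_c, theta_bar=1.0))
    print(cdf_uniform(theta_c, lo=1.0, hi=2.0))
    print(cdf_power_law(theta_c, theta_min=1.0, gamma=3.0))
```

The same critical threshold thus gives very different first-step failure fractions depending on how the thresholds are distributed around it.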
Let us denote failing agents in the first step by $j$. Their load will be redistributed to their nearest neighbors labeled $k$. Following the reasoning used for Eq. (11), we obtain for the load of agents $k$ at time $t = 2$:
(14) $\phi_k(2) = \phi_k(0) + \frac{1}{K} \sum_{j \in \mathrm{nn}(k)} \phi_j(1)$
The summation is performed over the whole set of failed agents $j$ that belong to the neighborhood of $k$, their load being $\phi_j(1)$.
The exact number of agents $j$ depends on the topological properties of the network considered. For example, in square and hexagonal lattices, some second nearest neighbors of $i$, i.e. agents $k$, have more than one link to agents in the nearest neighborhood. E.g. for the hexagonal lattice, half of the agents at level $k$ have two links to agents $j$, whereas the other half has only one. In general, for regular lattices, this number will be between one and two. In this paper, we will restrict our analysis to the case of a single failing node $j$ in the neighborhood of $k$, which is the case for Bethe lattices or sparse random regular networks [12]. The theory can be extended to other regular geometries in a straightforward manner. With these considerations in mind, Eq. (14) becomes
(15) $\phi_k(2) = \phi_k(0) + \frac{\phi_j(1)}{K} = \alpha\,\theta_k + \frac{1}{K}\left(\alpha\,\theta_j + \frac{\phi^0}{K}\right)$
From Eq. (15) we obtain the critical condition for the failure of agent $k$ if its net fragility becomes positive:
(16) $\theta_k \leq \theta_c(2) \equiv \frac{1}{1-\alpha}\left(\frac{\alpha\,\theta_j}{K} + \frac{\phi^0}{K^2}\right)$
This expression for the critical threshold at $t = 2$, i.e. in the second-order neighborhood of $i$, is comprised of two redistribution processes: on the one hand that from agents $j$ failing at $t = 1$ and, on the other, the redistribution of the initial load from agent $i$ failing at time $t = 0$. The fraction of failed agents at time $t = 2$, $f(2)$, is then given by
(17) 
where $\theta_c(2)$ indicates that the critical threshold at $t = 2$ depends on the load $\phi_j(1)$ of failing agents at time $t = 1$, which does not need to be equal for every $j$, but depends on the topology.
Using the same reasoning for the different time steps of the cascade, we obtain a general expression for the fraction $f(t)$ of agents failing during time step $t$ (which are the neighbors of agents failing at $t-1$):
(18) 
The critical threshold at time $t$ depends on the load $\phi(t-1)$ of the agent that failed at $t-1$ as follows:
(19) $\theta(t) \leq \theta_c(t) \equiv \frac{\phi(t-1)}{K\,(1-\alpha)}, \qquad \phi(t-1) = \alpha\,\theta(t-1) + \frac{\phi(t-2)}{K}$
This is a recursive equation, i.e. $\theta_c(t)$ depends on the load redistributed by all the agents that failed along the path connecting the initially failing agent $i$ with agents failing at time $t$. However, Eq. (18) cannot be computed in general; thus, in the following sections, we will study some cases in which this equation can be reduced and solved.
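The path dependence of the critical threshold can be made concrete by following a single chain of agents $i \to j \to k \to \dots$ away from the initial failure. The sketch below is an illustration under assumptions, not the paper's procedure: each agent on the path initially carries `alpha * theta`, receives a share `1 / K` of the load of its failed predecessor, and has no other failed neighbor (as on a Bethe lattice). The function name is hypothetical.

```python
def critical_thresholds_along_path(path_thetas, alpha, K, phi0):
    """Follow one path away from the initially failing agent.

    path_thetas: thresholds of the agents at distance 1, 2, ... on the path.
    Returns a list of (critical_threshold, failed) tuples, one per agent
    reached; the path ends at the first agent that survives.
    """
    results = []
    load_prev = phi0  # load carried by the initially failing agent
    for theta in path_thetas:
        theta_c = load_prev / (K * (1.0 - alpha))
        failed = theta <= theta_c
        results.append((theta_c, failed))
        if not failed:
            break
        # Load of the newly failed agent: own initial load plus received share.
        load_prev = alpha * theta + load_prev / K
    return results


if __name__ == "__main__":
    for step, (theta_c, failed) in enumerate(
        critical_thresholds_along_path([1.0, 0.8, 1.2, 1.0], 0.5, 3, 10.0), start=1
    ):
        print(step, round(theta_c, 4), failed)
```

Two paths that start from the same shock can therefore stop at different distances, because the critical threshold at each step inherits the thresholds of all agents that failed before it on that path.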
2.3 Threshold approximation in the RIE and EEE regimes
Inequality (19) is an important result for understanding the propagation of cascades. It holds for both regimes introduced above: the EEE regime, where the external shock dominates the dynamics, and the RIE regime, where small random events inside the first failing agents may trigger the cascade. For both, we are able to derive some general results even before specifying the threshold distribution $P(\theta)$.
In the RIE case, the redistribution of loads, that means the network effect, plays the most important role. Note that pairs of agents connected through an edge have in general different capacities. If one of them fails, its neighbor is exposed to failure during the following time step. However, whether it fails or not will depend on: (a) the load redistributed from the failing agent; (b) its own capacity. Let us assume that if an agent with capacity $\theta$ fails, the total load induced on its neighbors is $\theta$. This assumption neglects the contribution of agents that failed before the time step $t-1$, which are terms of order $K^{-m}$, with $m \geq 2$. This implies that $\theta$ is the load distributed by the agent, i.e. it is exactly its capacity and not more. Thus, the largest capacity of the failing agent at time $t$, $\theta(t)$, depends on the capacity of the agent that failed on the previous time step, $\theta(t-1)$, i.e.
(20) $\theta(t) \leq \frac{\theta(t-1)}{K\,(1-\alpha)}$
which is a lower bound for the failure condition, Eq. (19). This means that among all neighbors of the agents at layer $t-1$, those with a capacity lower than $\theta(t-1)/[K(1-\alpha)]$ will fail.
With the assumption (20), the fraction of failing agents at time can be decomposed in terms of failure of two consecutive agents in a pairwise approximation as follows:
(21) 
This approach differs from the previous mean-field approximation in the following way: now, the net effect of the load redistributed by a failing agent is taken into account to determine the fraction of its neighbors that will fail in the next time step. Thus, this approximation entails information about the heterogeneity at the edge level. On the other hand, even in the EEE regime, $\phi^0 \gg \theta_i$, the role of the network still cannot be neglected. We are interested in the limit satisfying: (i) the load redistributed by the previously failed nodes cannot be totally neglected (i.e. ); (ii) the main contribution to the load comes from $\phi^0$ (i.e. from the initial load). We assume that at any point in time, there exists a critical threshold $\theta_c(t)$ above which agents do not fail. In this case Eq. (19) becomes simply . Then, the load of agents at a distance $t$ from agent $i$ is simply given by
(22) 
which results in the critical threshold for the EEE regime:
(23) 
This gives the critical condition for the failing threshold of an agent that is hit by the cascade at time (i.e. it belongs to the th nearest neighborhood of the initially failing agent ). It nicely separates two effects that determine the severity of a cascade: (a) the size of the initial shock , (b) the number of neighbors to share the load and their respective safety margin, i.e. . In the limit of large external shocks, and independent of further assumptions about the threshold distribution, the sequence of the critical thresholds crucially depends on the sign of the factor . If , the sequence will approach zero exponentially, i.e. with increasing distance from the initially failing agent, this condition will be more easily met. Hence, there should be a finite at which all reasonably chosen threshold values are larger than the critical threshold, which implies that the cascade stops. This is shown in Fig. 3 for the case of the uniform threshold distribution for different values of the safety margin . We can verify for the given set of parameters that for , and the cascade stops right after , while for it stops after . On the other hand, for we see that the critical threshold is already at larger than any existing threshold, so the full cascade cannot be prevented.
While this is an intuitive and illustrative example, we will calculate analytically the exact time at which the cascade may stop, in the following section. We note again that due to the finite system size cascades may stop already at time , which gives an additional limit.
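The exponential decay of the critical threshold can be explored numerically. The sketch below rests on an assumption that is one reading of the argument above, not a formula quoted from the paper: an agent failing at step $t-1$ carries at most a load equal to the critical threshold of that step, so the critical threshold shrinks by at most a factor $K(1-\alpha)$ per step, giving the bound $\theta_c(t) \leq \phi^0 / [K(1-\alpha)]^t$. Here $K$ is the number of neighbors, $1-\alpha$ the safety margin, and `theta_low` the smallest threshold present in the system; all names are hypothetical.

```python
def eee_critical_thresholds(phi0, K, alpha, t_max):
    """Upper bound on the critical threshold for t = 1, ..., t_max."""
    ratio = K * (1.0 - alpha)
    return [phi0 / ratio ** t for t in range(1, t_max + 1)]


def eee_stopping_step(phi0, K, alpha, theta_low, t_max=10_000):
    """First step at which the bound drops below the smallest threshold.

    Returns None if K * (1 - alpha) <= 1, where the bound does not decay,
    or if the bound stays above theta_low for t_max steps.
    """
    ratio = K * (1.0 - alpha)
    if ratio <= 1.0:
        return None
    theta_c = phi0
    for t in range(1, t_max + 1):
        theta_c /= ratio
        if theta_c < theta_low:
            return t
    return None


if __name__ == "__main__":
    for alpha in (0.3, 0.5, 0.7):
        print(alpha, eee_stopping_step(phi0=100.0, K=4, alpha=alpha, theta_low=0.5))
```

Under this bound, doubling the initial shock delays the stopping step by only $\ln 2 / \ln[K(1-\alpha)]$ steps, whereas crossing $K(1-\alpha) = 1$ removes the stopping step altogether.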
3 Critical conditions for systemic risk
3.1 Analytical estimations of cascade sizes
Up to this point, we have derived a measure for systemic risk $X(t)$, Eq. (7), that is based on the fraction $f(t)$ of agents failing at a given time $t$. Failure cascades can propagate in the system if the net fragility of agents is positive, i.e. if the load exceeds the capacity. While the load $\phi$ becomes a function of the redistribution of loads in previous time steps, the capacity is determined by a threshold distribution function $P(\theta)$, for which we use three different specifications. Already the general framework outlined above allows us to expect “infinite” ($X = 1$) and “finite” ($X < 1$) cascades, where the latter can encompass the whole system or stop before. In the following, we will specify the conditions for these findings for the different threshold distributions.
Cascade size for homogeneous threshold
Let us start with the simplest case that all agents have the same threshold $\theta$ and the same number of neighbors $K$. As stated above, there is a failing agent $i$ at $t = 0$, for which $\phi_i(0) = \phi^0 \geq \theta$. Because of the homogeneous distribution, it can be noted that if an agent $j$ fails due to the failure of one of its neighbors, $i$, then all the neighbors of $i$ will fail as well. I.e. $f(1) = 1$ if $\phi^0 \geq K(1-\alpha)\,\theta$. Let $\phi(t)$ be the load of agents at a distance $t$ of the initially failing agent and let us assume that agents at a distance lower than $t$ already failed. Then, the load of agents in shell $t$ is
(24) $\phi(t) = \alpha\,\theta + \frac{\phi(t-1)}{K}$
With the initial condition $\phi(0) = \phi^0$, this recursive equation can be easily solved, yielding
(25) $\phi(t) = \frac{\phi^0}{K^{t}} + \alpha\,\theta\,\frac{K}{K-1}\left(1 - K^{-t}\right)$
I.e., agents exposed to the redistribution of load at time $t$ will fail if $\phi(t) \geq \theta$. This equation allows us to gain insight into the cascade mechanism. On the one hand, according to the above discussion of the EEE regime, infinite cascades can only be triggered if $K(1-\alpha) < 1$, irrespective of the threshold distribution. This means a topological effect (the number and safety margin of neighbors among which the load is redistributed) decides about finite and infinite cascades.
On the other hand, when $K(1-\alpha) > 1$, the initial load cannot (by itself) trigger an infinite cascade in the case of homogeneous threshold. I.e. even in the EEE regime, a cascade will only last $t^{\theta}$ time steps, where $t^{\theta}$ results from the failing condition $\phi(t) \geq \theta$. From the condition $\phi(t^{\theta}) = \theta$, we can compute
(26) $t^{\theta} = \frac{1}{\ln K}\, \ln\!\left[\frac{(K-1)\,\phi^0 - \alpha K \theta}{\theta\,[K(1-\alpha) - 1]}\right]$
As discussed before, the actual time at which the cascade stops is $t^{\ast} = \min(t^{\theta}, t^{N})$, where $t^{N}$ denotes the time where the cascade reaches the system size, Eq. (8).
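The homogeneous case can be checked numerically. The sketch below assumes the shell recursion $\phi(t) = \alpha\theta + \phi(t-1)/K$ with $\phi(0) = \phi^0$, i.e. each agent starts with load $\alpha\theta$ and receives a share $1/K$ of the load of its failed predecessor; this is an assumption consistent with the redistribution rule described earlier, and the function names are hypothetical. It compares the iterated load with the closed-form solution of that recursion and determines the last shell whose load still reaches the threshold.

```python
import math


def load_recursive(phi0, theta, alpha, K, t):
    """Load in shell t obtained by iterating phi(t) = alpha*theta + phi(t-1)/K."""
    phi = phi0
    for _ in range(t):
        phi = alpha * theta + phi / K
    return phi


def load_closed_form(phi0, theta, alpha, K, t):
    """Closed-form solution of the same recursion (geometric series)."""
    return phi0 / K**t + alpha * theta * K / (K - 1) * (1.0 - K ** (-t))


def last_failing_shell(phi0, theta, alpha, K, t_max=10_000):
    """Largest t with phi(t) >= theta; None if the load never drops below theta."""
    phi = phi0
    for t in range(1, t_max + 1):
        phi = alpha * theta + phi / K
        if phi < theta:
            return t - 1
    return None


def stopping_time_closed_form(phi0, theta, alpha, K):
    """Real-valued t solving phi(t) = theta; requires K * (1 - alpha) > 1."""
    numerator = (K - 1) * phi0 - alpha * K * theta
    denominator = theta * (K * (1.0 - alpha) - 1.0)
    return math.log(numerator / denominator) / math.log(K)


if __name__ == "__main__":
    args = dict(phi0=10.0, theta=1.0, alpha=0.5, K=3)
    print(last_failing_shell(**args), stopping_time_closed_form(**args))
```

For $K(1-\alpha) \leq 1$ the load converges to $\alpha\theta K/(K-1) \geq \theta$, so `last_failing_shell` returns `None`: within this sketch the cascade is only limited by the system size.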
Knowing $t^{\ast}$, we can further calculate the systemic risk $X$ according to Eq. (7), with $f(t) = 1$ for $t \leq t^{\ast}$ and $f(t) = 0$ otherwise. We find
(27) 
for a Bethe lattice or a tree with coordination number and
(28) 
for a regular lattice, with the exact factor depending on the topology.
Cascade size for uniform threshold distribution
We now turn to the simplest case that allows some heterogeneity in the agent’s threshold, which is the uniform distribution . The failing condition in Eq. (12) for the nearest neighbors still holds, but the question is how often we find thresholds below the critical limit:
(29) 
With given by Eq. (12) it turns out that the fraction of failing agents at the first time step is
(30) 
and otherwise. Regarding , we know from Eq. (18) that the fraction of failing agents at any time step crucially depends on the history of failed agents, i.e. the path connecting the initially failed agent with the currently failing one. Therefore, in general, the process is not solvable. There is the need of a simplifying assumption to break the integral expression in Eq. (18) into solvable parts.
For the EEE regime we use Eq. (23) and the underlying assumptions to obtain the closed equation,
(31) 
With this expression, we find from the time at which the cascade stops for the uniform threshold distribution:
(32) 
Again, the time at which the cascade ends is given by , with given by Eq. (8).
Considering instead the RIE regime, $\phi^0 \gtrsim \theta_i$, where redistribution effects play a major role, we use Eq. (21) and the underlying assumptions. Neglecting capacity-capacity correlations among agents, we find for the case of the uniform threshold distribution:
(33) 
The time at which the cascade stops is, as in the previous cases, given by the condition , where is the first time step that verifies , and is the one defined in Eq. (8).
Cascade size for power-law threshold distribution
Now, we discuss the case where the agents' thresholds follow a power-law distribution, Eq. (5), and can vary by orders of magnitude. With the same procedure as used before, we determine the fraction of failed agents during the initial cascade as
(34) 
So, cascades are obtained if .
Considering first the EEE regime, we use the approximation given by Eq. (22) and find for the fraction of failing agents during time step :
(35) 
From , we calculate the time when the cascade stops as
(36) 
Again, following our previous discussion, the cascade stops at .
3.2 Numerical results for the EEE regime
Size of finite cascades
The analytical results for and allow us now to calculate the systemic risk for the different threshold distributions, by varying system parameters such as the safety margin or the network topology. In this section, we first concentrate on the EEE regime, where the external shock is comparable to the network capacity and much larger than the average threshold of an agent, .
In Fig. 4 we compare, for the uniform threshold distribution, the systemic risk $X$ in Bethe and 2D regular lattices. We recall that the difference is in the number of agents $n(t)$ potentially affected by the cascade at a given time $t$. For Bethe lattices and regular trees, we have $n(t) = K(K-1)^{t-1}$, whereas for regular networks $n(t) \propto t$, i.e. for a given $t$ in Bethe lattices many more agents are affected. Conversely, for a given $N$, regular lattices have a larger diameter. For example, for the 2D regular lattice the diameter grows with system size as $\sqrt{N}$. On the other hand, the diameter in a Bethe lattice grows as $\ln N$.
In both cases, trivially, if the safety margin vanishes, any external shock results in an immediate collapse. This also happens for finite safety margins as long as the global stability condition is not met. On the other hand, for large safety margins, $K(1-\alpha) > 1$, we do expect finite cascades. As the plots show, the parameter region for these is much larger for regular networks, where at each time step a smaller number of agents is affected, than for Bethe lattices. As shown, in the first case an initial shock of almost 30% of the network capacity already leads to a full cascade, whereas in the second case it requires an initial shock of almost 60% of the system's available capacity for a full cascade.
The results are to be compared with Fig. 5, where we plot the systemic risk $X$, for Bethe lattices only, for power-law threshold distributions with two different exponents $\gamma$. We recall that the network capacity is comparable to the case of the uniform threshold distribution only for , Fig. 5 (right), and we also observe a similar dependence of $X$ on the safety margin and on the relative initial load . To be precise, in this case the safety margin plays a less important role, but we find finite cascades also for , which was not the case for the uniform threshold distribution.
The situation differs for , as Fig. 5 (left) shows. Here the system seems to be much more vulnerable, indicated by large values of in all parameter regions. To put this finding into perspective, we remind that for the network capacity is much larger than for , i.e. the initial shock also has much higher values as compared to Fig. 5 (right). This explains the severity of the cascades in this case.
The results obtained for these two particular values of can be generalized to and . As a consequence, we may conclude that with a much broader threshold distribution, the system can absorb higher initial shocks (in absolute values), but shocks of a size comparable to the network capacity most likely result in infinite cascades, i.e. total failure.
To better understand the role of the skewness of the threshold distribution and the topology, we fix the relative initial shock and vary for two different network topologies. Fig. 6 confirms (a) that a Bethe lattice or regular tree structure leads to more severe cascades than a regular network, which is due to the smaller diameter of the network, and (b) that an increasing skewness, i.e. smaller values of , leads to an increasing systemic risk. Remarkably, there is a nonmonotonic dependence of on and , and the cascade size becomes larger around . The reason for this is, on the one hand, the system size dependence of the network capacity for and, on the other hand, the larger fragility resulting from a more skewed distribution (i.e. thresholds are closer to ). The first effect is a global one, i.e. a larger load is added to the system; the second is a local one, i.e. there are fewer agents that can handle large loads.
3.3 Results for the RIE regime
The previous results have focused on the EEE regime of large external shocks, . This means that the initial load is largely responsible for triggering the cascades. Here we focus on the opposite case, the RIE regime, , where small fluctuations of an agent’s load lead to the agent’s failure, provided that the safety margin of that agent is rather small. The question is then under which conditions this failure leads to a cascade of macroscopic size. Applications of this case include cascading failures in power grids [4, 5, 10] and failures of server infrastructure [6].
We now consider for the failing agent, and we assume that the system is in a critical condition. This means that a single failure among the neighbors of the initially failing agent (i.e. ) is enough to trigger a system-wide cascade. We now define as the load such that is exactly equal to . With these assumptions, we compute the ensemble average of the systemic risk
(38) 
The integral runs over the threshold of all possible agents which trigger a full cascade. These agents can be regarded as systemically important because their failure induces a systemic collapse.
The quantity then represents the frequency at which a randomly chosen agent can trigger the complete failure of the system. In the following we compute for the uniform and power-law threshold distributions.
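As a minimal sketch of the quantity just described, assume that every agent whose threshold is at least some critical value triggers a full cascade. The frequency of full cascades is then the tail mass of the threshold distribution above that value. The names `theta_c`, `lo`, `hi`, `theta_min` and `gamma`, as well as the specific parametrization of the two distributions, are illustrative assumptions and are not taken from the equations of this paper.

```python
import random

def uniform_tail(theta_c, lo, hi):
    """Fraction of thresholds >= theta_c for thresholds uniform on [lo, hi].

    lo, hi and theta_c are hypothetical parameters chosen for illustration.
    """
    if theta_c <= lo:
        return 1.0
    if theta_c >= hi:
        return 0.0
    return (hi - theta_c) / (hi - lo)

def powerlaw_tail(theta_c, theta_min, gamma):
    """Fraction of thresholds >= theta_c for a density proportional to
    theta**(-gamma) on theta >= theta_min (assumed form, requires gamma > 1)."""
    if theta_c <= theta_min:
        return 1.0
    return (theta_c / theta_min) ** (1.0 - gamma)

def sample_powerlaw(n, theta_min, gamma, rng):
    """Draw n thresholds from the assumed power law by inverse-transform sampling."""
    # 1 - rng.random() lies in (0, 1], so the power is always well defined.
    return [theta_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0))
            for _ in range(n)]

def empirical_tail(samples, theta_c):
    """Fraction of sampled thresholds that are >= theta_c."""
    return sum(1 for s in samples if s >= theta_c) / len(samples)
```

For a fixed critical value above `theta_min`, a smaller `gamma` gives a heavier tail and hence a larger fraction of agents above the critical value. In the model itself the critical value also depends on the distribution parameters, so this sketch isolates only one of the two effects.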
Uniform threshold distribution
The initial critical load in the RIE regime can be easily computed from Eq. (30), using the condition :
(39) 
Then, the average systemic risk for the uniform distribution is given by
(40) 
Power-law threshold distribution
In an analogous way we obtain from Eq. (34) with the expression for the initial critical load in the case of a power-law threshold distribution:
(41) 
This allows us to calculate the average systemic risk as:
(42) 
In order to study the role of connectivity in the cascade process, we created random regular networks [12] with arbitrary values of the average connectivity . Fig. 7 shows the average cascade size for a rather small safety margin, which is in line with the RIE regime. This means that small fluctuations of the load of a single agent may lead to its failure. Fig. 7 allows us to compare the analytical expressions in Eqs. (40), (42) with numerical simulations of the cascade process. The graphs show a sharp transition from to at . This results directly from the change in global instability , at that particular point. On the other hand, when the system is not globally unstable, the results show that only a subset of agents is able to trigger a cascade in the system; their fraction is indicated by the average systemic risk.
The left panel of Fig. 7 shows the results for the uniform threshold distribution. For identical agents (left panel, ), a sharp transition between complete failure () and no failure () can be observed. This result immediately follows from Eq. (25), in the limiting case . For a fixed value , the graphs show that larger values of exhibit larger frequencies of full cascades, . This results from Eq. (39) which shows that the critical initial load decreases for increasing heterogeneity; thus, for larger values of a larger fraction of agents is likely above such a threshold.
The right panel of Fig. 7 shows the results for the power-law threshold distribution. Again, broader distributions, i.e. lower values of , result in a higher probability of complete failures. This is in line with Eq. (41) where, for a fixed , the critical load increases with . At the same time, the threshold distributions become narrower with larger . Thus, the number of systemically important agents that are able to trigger a full cascade is much lower in distributions with large , and the average systemic risk decreases.
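The kind of numerical experiment behind Fig. 7 can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulation code used for the figures: a ring lattice stands in for the random regular networks, failures are processed one at a time from a queue rather than in synchronous time steps, each agent initially carries the load `(1 - alpha) * threshold`, and a failing agent shares its load equally among its neighbors that have not yet failed. All function and parameter names are hypothetical.

```python
from collections import deque

def ring_lattice(n, k):
    """Regular network: each agent is linked to its k nearest neighbors on a ring.

    Assumes k is even and n > k.
    """
    half = k // 2
    return [[(i + d) % n for d in range(-half, half + 1) if d != 0]
            for i in range(n)]

def run_cascade(neighbors, thresholds, alpha, seed_agent, seed_load):
    """Return the fraction of failed agents after a shock on seed_agent.

    alpha is the safety margin; seed_load is the total load on the seed
    agent after the shock. An agent fails once its load reaches its threshold.
    """
    n = len(thresholds)
    load = [(1.0 - alpha) * th for th in thresholds]
    failed = [False] * n
    load[seed_agent] = seed_load
    queue = deque()
    if load[seed_agent] >= thresholds[seed_agent]:
        failed[seed_agent] = True
        queue.append(seed_agent)
    while queue:
        i = queue.popleft()
        alive = [j for j in neighbors[i] if not failed[j]]
        if not alive:
            continue  # assumption: load with no surviving neighbor is lost
        share = load[i] / len(alive)
        load[i] = 0.0
        for j in alive:
            load[j] += share
            if load[j] >= thresholds[j]:
                failed[j] = True
                queue.append(j)
    return sum(failed) / n

def full_cascade_frequency(neighbors, thresholds, alpha):
    """Fraction of agents whose own failure (load equal to threshold)
    brings down the whole system, trying every agent as the seed."""
    n = len(thresholds)
    full = sum(
        1 for i in range(n)
        if run_cascade(neighbors, thresholds, alpha, i, thresholds[i]) >= 1.0
    )
    return full / n
```

With identical thresholds on a ring with two neighbors per agent, `full_cascade_frequency` switches between 1 and 0 as `alpha` is varied, which mirrors the sharp transition reported for identical agents. The location of the switch in this toy setup follows from the simplified rules above and should not be read as the critical value derived in the text.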
4 Conclusions
The model proposed in this paper is based on very simple ingredients, to allow for analytical treatment: (i) a regular network, i.e. all agents have the same number of neighbors, , (ii) a constant safety margin , the same for all agents, which defines a fixed relation between the load an agent can possibly carry and the threshold at which it fails, (iii) an initial condition that only one randomly chosen agent fails when facing a load , (iv) a redistribution of the load of the failing agent to its neighbors.
The variability of the model comes from two assumptions: (v) the threshold distribution which was chosen as a delta distribution, a uniform distribution, or a power-law distribution, (vi) the severity of the initial shock, which was either of the order of the network capacity , i.e. much larger than the average threshold, or much smaller than the network capacity, i.e. of the order of the average threshold. The latter allowed us to distinguish between two important regimes: (a) EEE regime, where an extreme external event was large enough to cause the failure of many agents, (b) RIE regime, where small random internal events may result in the failure of a single agent. In both cases, this initial failure may trigger a cascade of failures in the neighboring agents.
For both regimes, we are interested in the following question: what is the possible size of a failure cascade, , measured as the total number of failed agents compared to the system size , at a given time ? Will this be an “infinite” () or a “finite” () cascade and, in the latter case, at what time will it stop: at a time where the cascade has already reached all agents but did not cause all of them to fail, or at a time before it has passed through the entire system? We regard as a measure of systemic risk.
For both regimes, we are able to derive analytical results to answer these questions. These results allow us to draw conclusions about the conditions that lead to a smaller systemic risk. In the following we summarize our findings:
(1) We derive a global stability condition that has to be met in order to allow for finite cascades, in principle. The larger the number of neighboring agents, , or the larger the safety margin, , the more likely this condition is met. This allows for an interesting discussion because of the possible tradeoff between the two ingredients. In most cases, the safety margin is given by the technical constitution of an agent, e.g. in power grids or routing servers. , on the other hand, refers to the network topology but not to internal properties of agents. Hence, systemic risk can be reduced by increasing the network density, at least up to a certain point [7]. It should be noted that we have assumed the initial failure of only one agent here. If it is, however, more costly to improve the network connectivity than to increase the safety margin of the agents, the latter can serve the same purpose, namely reducing systemic risk.
(2) In a system of thousands of agents, the network capacity, , which is the total load the system can carry a priori, is also quite big. One would not easily assume that an initial shock is of the same magnitude as the network capacity, as in the EEE regime. Hence such big shocks are extreme external events to the system. The interesting finding in this paper is that even such extreme events may not lead to an “infinite” cascade, i.e. to a total collapse of the system. Instead, provided that the global stability condition is met, we find a broad range of system parameters where such cascades stop at finite time, affecting only part of the system. We have shown that systemic risk resulting from extreme shocks can be reduced (a) by a regular lattice structure (as opposed to e.g. a regular tree structure), (b) by a broad threshold distribution. In the latter case, we found finite cascades, i.e. , even if the initial shock was 24 times larger than the total network capacity, which can be regarded as a sign of real robustness. Comparing the power-law threshold distributions of and , we found that, in absolute measures of the shock, the broader distribution leads to the more robust system. In relative measures, however, this result inverts, simply because a broader distribution also results in a larger network capacity, and hence in larger initial shocks, while the relative measure remains the same.
(3) Investigating the RIE regime where the initial shock was of the order of the threshold of an average agent, i.e. much smaller than the total network capacity, a systemic failure can occur only if (a) the initial shock is larger than or equal to the threshold of the initially failing agent, and (b) the redistribution of load is large enough. Hence, dependent on the threshold distribution, we can calculate this failure probability. Even in the global stability regime, we find “infinite” cascades, but the probability of their occurrence depends on the probability that a randomly chosen agent fails initially. The broader the threshold distribution, the more likely this condition is met, i.e. the frequency of observing “infinite” cascades increases with the heterogeneity.
(4) The initial question, “How big is too big?”, can from this perspective be answered as follows: Initial shocks, even if they exceed the capacity of the whole system (not just the capacity of a single agent), are probably not the problem. Of course, there are parameter regimes that lead to complete collapse (). At the same time, we see that a change of the initial shock by 10 or even 50 percent does not change the systemic risk very much. Of much larger influence are system parameters related to the network topology, the safety margin, and the threshold distribution. As was also found in other papers [1], an optimal heterogeneity in the agents’ thresholds can reduce systemic risk considerably. In addition, we find that a change of the safety margin by 10 or 50 percent has a much larger impact on systemic risk than a comparable change in the external shock. So, when seeking protection against systemic risk, the focus should be (a) on those parameters that influence the global stability, i.e. (see above), and (b) on the optimal heterogeneity in the threshold distribution.
References
 [1] J. Lorenz, Stefano Battiston, and Frank Schweitzer. Systemic risk in a unifying framework for cascading processes on networks. The European Physical Journal B, 71(4):441–460, October 2009.
 [2] D J Watts. A simple model of global cascades on random networks. Proceedings of the National Academy of Sciences, 99(9):5766, 2002.
 [3] P S Dodds and D J Watts. Universal behavior in a generalized model of contagion. Physical Review Letters, 92(21):218701, 2004.
 [4] A E Motter and Y C Lai. Cascade-based attacks on complex networks. Physical Review E, 66(6):65102, 2002.
 [5] P Crucitti, V Latora, and M Marchiori. Model for cascading failures in complex networks. Physical Review E, 69(4):45104, 2004.
 [6] D. Rossi, M. Mellia, and M. Meo. Evidences behind skype outage. In IEEE International Conference on Communications, Dresden, 2009.
 [7] Stefano Battiston, Domenico Delli Gatti, Mauro Gallegati, B C N Greenwald, and Joseph E Stiglitz. Liaisons Dangereuses: Increasing Connectivity, Risk Sharing, and Systemic Risk. January 2009.
 [8] P Gai and S Kapadia. Contagion in financial networks. Bank of England working papers, 2010.
 [9] A E Motter. Cascade control and defense in complex networks. Physical Review Letters, 93(9):98701, 2004.
 [10] R Kinney, P Crucitti, R Albert, and V Latora. Modeling cascading failures in the North American power grid. The European Physical Journal B, 46:101–107, 2005.
 [11] Claudio Juan Tessone, Markus M Geipel, and Frank Schweitzer. Sustainable growth in complex networks. EPL (Europhysics Letters), 96(5):58005, December 2011.
 [12] A. Bekessy, P. Bekessy, and J. Komlos. Stud. Sci. Math. Hungar., 7:343–353, 1972.