Self-healing systems and virtual structures
Modern networks are large, highly complex and dynamic, and many of them are composed of mobile agents. Managing such systems centrally in an efficient manner is difficult or even impossible, so it is imperative that they attain a degree of self-management. Self-healing, i.e. the capability of a system in a good state to recover to another good state in the face of an attack, is desirable for such systems. In this paper, we discuss the self-healing model for dynamic reconfigurable systems. In this model, an omniscient adversary inserts or deletes nodes from a network and the algorithm responds by adding a limited number of edges in order to maintain invariants of the network. We look at some of the results in this model and argue for their applicability and for further extensions of the results and the model. We also look at some of the techniques used in our earlier work; in particular, we look at the idea of maintaining virtual graphs mapped over the existing network and assert that this may be a useful technique in many problem domains.
Modern networks have evolved to become both large and highly complex, with some networks spanning nations and even the globe. Networks provide a multitude of services using a wide variety of protocols and components, to the extent that they have now begun to resemble self-governed living entities. Most modern networks are dynamic, with nodes entering the network or leaving by choice, failure or attack. Some dynamic networks, like social networks, have always been around in some form, but we have only now begun to analyze and, in fact, influence them. The difficulty of maintaining robustness in modern networks is evident from the regular breakdowns in large and important networks, e.g. the crash of the Skype network in 2007 [9, 23, 25, 31, 34], attributed to the failure of its “self-healing” mechanisms. Also, due to the scale and nature of design of such networks, it may simply not be practical to build robustness into the individual nodes or into the structure of the initial network itself. Hence the need for a responsive approach to robustness. Many important networks are also reconfigurable in the sense that they can change their topology, e.g. peer-to-peer, wireless and ad-hoc networks, and friendship networks on social networking sites. We exploit this property of networks to obtain a responsive approach towards robustness. Moreover, our algorithms are scalable, since our repair costs are constant or at most logarithmic in the number of nodes, and they inherently handle the dynamism of the network. We also conjecture that some of the techniques we use, in particular virtual graphs, can be effectively used for a wider range of problems than we discuss.
Informally, self-healing is the maintenance of certain properties within desirable bounds by the nodes in a network suffering from failures or under attack. As the name implies, self-healing has to be initiated and executed by the nodes themselves. As such, the self-healing algorithms we have devised are fully distributed. We can say that a self-healing system, when starting from a correct state, can only be temporarily out of a correct state, i.e. it recovers to a correct state in the presence of attacks.
Our sense of self-healing is more formally captured by the model discussed in Section 1.1. Informally, the model we adopt in this work is as follows. We assume (for simplicity of description) that the network is initially a connected graph over $n$ nodes. An adversary repeatedly attacks the network. This adversary knows the network topology and our algorithm, and it has the ability to delete arbitrary nodes from the network or to insert a new node, which it can connect to any subset of the nodes currently in the system. However, we assume the adversary is constrained in that in any time step it can only delete or insert a single node. Following that, the self-healing algorithm has a short time to reconfigure and heal the network by adding edges between remaining nodes before the next act of the adversary. Our model could, for example, capture what can happen when a worm or software error propagates through the population of nodes. We have developed a series of self-healing algorithms: DASH, ForgivingTree, ForgivingGraph [15, 14], Xheal and Xheal+, which we succinctly compare in Section 2. Though our algorithms are directly applicable to reconfigurable computer networks, the notion of self-healing is important across different domains.
Our Contributions: In this paper, we contend that: a) the self-healing model is a powerful and flexible model for studying and designing reconfigurable dynamic networks, and b) virtual graphs, which we introduce together with a suggested framework, are powerful tools, e.g. for designing self-healing solutions.
Related Work: Self-healing is one of the so-called ‘Self-*’ properties which systems such as autonomic systems may be required to have. In the distributed systems world, perhaps the most well-known self-* property is self-stabilization [5, 6, 7, 35]. Self-stabilization was introduced by Dijkstra in 1974. A self-stabilizing system is one which, starting from an arbitrary state and being affected by adversarial transient failures, can recover to a correct state in finite time. Other self-* properties, often broadly defined, include self-scaling, self-repairing (similar to self-healing), self-adjusting (similar to self-managing), self-aware/self-monitoring, self-immune and self-containing.
Self-healing is a responsive approach to reliable systems. Numerous other papers discuss strategies for adding additional capacity or rerouting in anticipation of failures [8, 10, 19, 28, 37, 38]. Results that are responsive in some sense include the following. Médard, Finn, Barry, and Gallager propose constructing redundant trees to make backup routes possible when an edge or node is deleted. Andersen, Balakrishnan, Kaashoek, and Morris modify some existing nodes to be RON (Resilient Overlay Network) nodes that detect failures and reroute accordingly. Some networks have enough redundancy built in so that separate parts of the network can function on their own in case of an attack. In all these past results, the network topology is fixed. In contrast, our approach adds edges to the network as node failures occur. Further, our approach does not dictate routing paths or specifically require redundant components to be placed in the network initially.
Dynamic network topology and fault tolerance have always been core concerns of distributed computing [3, 22]. There are many models and a large volume of work in this area. The self-healing model is a suitable model for overlay networks in a dynamic setting. Broadly, dynamic models may be classified as node-dynamic or edge-dynamic. Some reconfigurable overlay network based models are node-dynamic in that nodes join and leave continuously [21, 36]. A special class of these is the self-healing model, where the algorithm can add a limited number of edges in response to a deletion [30, 11]. A notable recent edge-dynamic model is the dynamic graph model introduced by Kuhn, Lynch and Oshman. They introduced a stability property called $T$-interval connectivity (for $T \geq 1$), which stipulates the existence of a stable connected spanning subgraph for every $T$ consecutive rounds.
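The $T$-interval connectivity property mentioned above can be made concrete with a small checker. The following is only an illustrative sketch: the function names `is_connected` and `is_t_interval_connected`, and the representation of each round as a set of `frozenset` edges, are assumptions made for this example and are not taken from the cited work.

```python
def is_connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == nodes


def is_t_interval_connected(nodes, rounds, T):
    """Check every window of T consecutive rounds.

    rounds is a list of edge sets (sets of frozenset pairs), one per round.
    A window passes if the edges present in all of its rounds still
    connect every node, i.e. a stable connected spanning subgraph exists.
    If fewer than T rounds are given, the property holds vacuously.
    """
    for start in range(len(rounds) - T + 1):
        stable = set.intersection(*rounds[start:start + T])
        if not is_connected(nodes, stable):
            return False
    return True
```

For instance, two rounds that are each connected but share only a single edge on three nodes satisfy the property for $T = 1$ but not for $T = 2$.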
There has also been research in the physics community on preventing cascading failures. In the model used for these results, each vertex in the network starts with a fixed capacity. When a vertex is deleted, some of its “load” (typically defined as the number of shortest paths that go through the vertex) is diverted to the remaining vertices. The remaining vertices, in turn, can fail if the extra load exceeds their capacities. Motter, Lai, Holme, and Kim have shown empirically that even a single node deletion can cause a constant fraction of the nodes to fail in a power-law network due to cascading failures [17, 27]. Motter and Lai propose a strategy for addressing this problem by intentional removal of certain nodes in the network after a failure begins. Hayashi and Miyazaki propose another strategy, called emergent rewirings, that adds edges to the network after a failure begins in order to prevent the failure from cascading. Both of these approaches are shown to work well empirically on many networks. Unfortunately, they perform very poorly under adversarial attack.
1.1 Model of self-healing
Our general model of self-healing is shown in Figure 1. This model was introduced in . It is generalized from the model in [15, 14]. Somewhat similar models were also used in [29, 33, 16, 32]. The specific models used in most of our algorithms are special cases of this model, differing mainly in the way the success metrics of the graph properties are presented. The model used in Xheal [29, 33] also differs in the synchronicity and message assumptions.
Let $G_0$ be an arbitrary graph on $n$ nodes, which represent processors in a distributed network. In each step, the adversary either deletes or adds a node. After each deletion, the algorithm gets to add some new edges to the graph, as well as delete old ones. At each insertion, the processors follow a protocol to update their information. The algorithm’s goal is to maintain the chosen graph properties within the desired bounds. At the same time, the algorithm wants to minimize the resources spent on this task. Initially, each processor only knows its neighbors in $G_0$, and is unaware of the structure of the rest of $G_0$. After each deletion or insertion, only the neighbors of the deleted or inserted vertex are informed that the deletion or insertion has occurred. After this, processors are allowed to communicate by sending a limited number of messages to their direct neighbors. We assume that these messages are always sent and received successfully. The processors may also request that new edges be added to the graph.
We also allow a certain amount of pre-processing to be done before the first attack occurs. This may, for instance, be used by the processors to gather some topological information about the initial graph $G_0$, or perhaps to coordinate a strategy. Another success metric is the amount of computation and communication needed during this preprocessing round. For our success metrics, we compare the graphs at time $T$: the actual graph $G_T$ to the graph $G'_T$, which is the graph with only the original nodes (those present at time $0$) and insertions, without regard to deletions and healing. This is the graph which would have been present if the adversary was not doing any deletions and (thus) no self-healing algorithm was active. This is the natural graph for comparing results. Figure 2 shows an example of $G'_T$ and a corresponding $G_T$. The figure also shows, in $G'_T$, the nodes and edges inserted and deleted, and in $G_T$, the edges inserted by the healing algorithm, as the network evolved over time.
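The interplay between the adversary, the healing step and the two graphs being compared can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than any of the algorithms discussed in this paper: the class name `SelfHealingSim`, its methods, and in particular the naive repair rule (chaining the orphaned neighbors into a path) are hypothetical and chosen only to keep the example short. It tracks both the healed graph $G_T$ and the insertion-only graph $G'_T$, and reports two typical success metrics: degree increase and stretch.

```python
from collections import deque


def bfs_dist(adj, src):
    """Hop distances from src in an undirected graph {node: set(neighbors)}."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


class SelfHealingSim:
    """Tracks the healed graph G_T and the insertion-only graph G'_T."""

    def __init__(self, edges):
        self.actual = {}    # G_T: deletions and healing edges applied
        self.baseline = {}  # G'_T: original nodes plus insertions only
        for u, v in edges:
            for g in (self.actual, self.baseline):
                g.setdefault(u, set()).add(v)
                g.setdefault(v, set()).add(u)

    def insert(self, node, neighbors):
        """Adversarial insertion: node attaches to a chosen set of live nodes."""
        for g in (self.actual, self.baseline):
            g.setdefault(node, set())
            for w in neighbors:
                g[node].add(w)
                g[w].add(node)

    def delete(self, node):
        """Adversarial deletion followed by a placeholder repair step."""
        nbrs = sorted(self.actual.pop(node))
        for w in nbrs:
            self.actual[w].discard(node)
        # Naive healing rule (an assumption of this sketch, not one of the
        # paper's algorithms): chain the orphaned neighbors into a path.
        for a, b in zip(nbrs, nbrs[1:]):
            self.actual[a].add(b)
            self.actual[b].add(a)

    def max_degree_increase(self):
        """Largest deg_{G_T}(v) - deg_{G'_T}(v) over surviving nodes."""
        return max(len(self.actual[v]) - len(self.baseline[v])
                   for v in self.actual)

    def max_stretch(self):
        """Largest dist_{G_T}(u, v) / dist_{G'_T}(u, v) over surviving pairs."""
        worst = 1.0
        for u in self.actual:
            d_actual = bfs_dist(self.actual, u)
            d_base = bfs_dist(self.baseline, u)
            for v in self.actual:
                if v != u:
                    ratio = d_actual.get(v, float("inf")) / d_base[v]
                    worst = max(worst, ratio)
        return worst
```

Deleting the center of a four-leaf star under this naive rule leaves a path on the leaves: no node gains more than one edge relative to $G'_T$, but the two ends of the path are now at distance 3 instead of 2, a stretch of 1.5. A path repair scales poorly with the degree of the deleted node, which is one reason to look for better-shaped healing structures.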
2 The idea of Reconstruction Structures
Conceptually, our algorithms use the same basic idea: when a node is deleted, replace it by a healing structure formed from its neighbors or nearby nodes, as shown in Figure 3. We can call this structure the Reconstruction Structure. Notice that we have defined Reconstruction Structures like a template; the exact structure is determined according to the desired properties of the algorithm. In most of our algorithms (DASH, ForgivingTree, ForgivingGraph) the healing structure is a tree. In DASH, a balanced binary tree of neighbors, with nodes arranged by previous degree increases, is used. In the ForgivingTree, another kind of balanced binary tree is used. In the ForgivingGraph, our reconstruction tree is a haft (or half-full tree). In Xheal, the structure used is an expander.
It turns out that in most of these, trees are a natural choice for the graph properties we have tried to maintain. A balanced tree is a structure which has low distance between nodes ($O(\log n)$ for a balanced binary tree on $n$ nodes) while each node has a small degree (at most 3 for a binary tree). At the same time, coming up with suitable Reconstruction Structures and maintaining them over the run of the algorithm is quite a significant challenge. For Xheal, however, trees are not the right structure, since the main properties we are trying to heal there are edge expansion and related spectral properties, and trees do not have good edge expansion.
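To illustrate why a balanced binary tree is an attractive healing structure, the sketch below replaces a deleted node by a balanced binary tree over its former neighbors. It is a simplified illustration only: the function name `heal_with_balanced_tree` and the heap-style layout are assumptions of this example, and the ordering of neighbors is arbitrary (sorted by name) rather than by degree increase as in DASH.

```python
def heal_with_balanced_tree(adj, deleted):
    """Delete `deleted` and reconnect its neighbors as a balanced binary tree.

    adj maps each node to the set of its neighbors (undirected graph) and
    is modified in place. Returns the list of edges added by the repair.
    """
    nbrs = sorted(adj.pop(deleted))
    for w in nbrs:
        adj[w].discard(deleted)
    added = []
    # Heap layout: the neighbor at position i becomes the parent of the
    # neighbors at positions 2i + 1 and 2i + 2, so the tree is balanced.
    for i, parent in enumerate(nbrs):
        for j in (2 * i + 1, 2 * i + 2):
            if j < len(nbrs):
                child = nbrs[j]
                if child not in adj[parent]:
                    adj[parent].add(child)
                    adj[child].add(parent)
                    added.append((parent, child))
    return added
```

Each former neighbor loses one edge (to the deleted node) and gains at most three tree edges, so its degree grows by at most 2 per deletion in this sketch, while any two former neighbors end up within roughly $2 \log_2 d$ hops of each other, where $d$ is the degree of the deleted node.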
3 The idea of Virtual Graphs
An idea that we have sometimes found very useful is that of using virtual nodes. A virtual node can be thought of as a marker or a placeholder in a reconstruction structure. A virtual node is simulated by a real node (we call the existing non-virtual nodes real nodes). Informally, simulating simply means that the simulating node takes responsibility for the connections attributed to the virtual node (this is discussed more formally later). In our algorithms, a virtual node is simulated by exactly one real node, though one can imagine algorithms where one virtual node is simulated by multiple real nodes. Of course, a single real node may simulate multiple virtual nodes. In our algorithms, the resulting graph that we maintain is a mixture of real and virtual nodes. We call this a virtual graph. This is opposed to the real graph, which is the usual bijective mapping of the network to the graph, with a processor mapping to a node and a connection mapping to an edge. Section 3.2 discusses a simple mapping used in the ForgivingTree and the ForgivingGraph that gives the real graph from the virtual graph and also shows some simple properties helpful for bounding certain quantities.
More formally, consider the actual graph $G = (V, E)$ corresponding to the network, and a virtual graph $H = (V_H, E_H)$ with nodes $V_H$ and edges $E_H$. Consider a partition of the set $V_H$ into two sets $R$ and $U$ (possibly empty), with a surjective mapping $f_R : R \to V$ and a mapping $f_U : U \to R$. The edge sets $E_H$ and $E$ are related by a mapping, a natural one being the homomorphism given in Section 3.2; other mappings may possibly be imagined. Then, we have:
real node: A node $r \in R$. By definition, we have a node $v \in V$ such that $f_R(r) = v$.
real edge: An edge $(x, y) \in E_H$ such that both $x$ and $y$ are real nodes.
virtual node: A node $u \in U$. By definition, we have a node $r \in R$, where $f_U(u) = r$. We say the real node $r$ simulates virtual node $u$.
virtual edge: An edge $(x, y) \in E_H$ such that not both $x$ and $y$ are real nodes.
Virtual graphs are useful for a few conceptual reasons, some of which are:
- Virtual graphs may be easier to analyze and are good accountability structures (e.g. for bounding node degrees or distances). For example, if our Reconstruction Structure is a tree, and the virtual node is one of the internal nodes, we can claim that the node simulating it has increased its degree by at most 3 due to that virtual node.
- A well-designed virtual graph scheme may give a clearer insight into the structure and workings of the algorithm, i.e. virtual graphs may be easier to visualize. Sometimes, the real graph and its connections may look messy and the underlying pattern, if any, may be obscured. The virtual graph may alleviate this. For example, the ForgivingTree data structure is a virtual graph that is a tree (and the algorithm is a tree maintenance algorithm), whereas the real graph corresponding to the ForgivingTree may not be a tree at all.
- Virtual structures are easier to manipulate. After all, virtual nodes and edges are not really there, so algorithmically it could be easy to drop or add them. Also, they act like placeholders for the real node simulating them, and it is easy to imagine preserving the virtual structure while changing the ownership of the node, thus manipulating the real structure. This is especially useful when reasoning about dynamic structures.
The Forgiving Tree and Forgiving Graph have reconstruction structures which use virtual nodes, and these structures are also trees. We call these reconstruction structures Reconstruction Trees and define them as follows:
Reconstruction Tree: A tree-like structure (Figure 5) added by the healing algorithm on adversarial deletion of a single node and its edges. The reconstruction tree uses existing real nodes from the network and may also have virtual nodes simulated by the real nodes.
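A reconstruction tree with virtual nodes can be sketched as follows. This is a minimal illustration under stated assumptions, not the ForgivingTree or ForgivingGraph construction itself: the function name `build_reconstruction_tree`, the node labels `("real", x)` and `("virtual", k)`, and the rule for choosing simulators (any neighbor not yet simulating a virtual node) are all hypothetical. The neighbors of the deleted node become the leaves, and every internal node is a virtual node simulated by a distinct leaf.

```python
def build_reconstruction_tree(neighbors):
    """Return (edges, simulator) for a binary tree whose leaves are `neighbors`.

    Leaves are real nodes ("real", name); internal nodes are virtual nodes
    ("virtual", k). simulator maps each virtual node to the real node that
    simulates it; every real node simulates at most one virtual node here.
    """
    level = [("real", x) for x in neighbors]
    free = list(neighbors)        # real nodes not yet simulating anything
    edges, simulator = [], {}
    counter = 0
    while len(level) > 1:
        nxt = []
        # Pair up adjacent nodes under a fresh virtual parent.
        for i in range(0, len(level) - 1, 2):
            parent = ("virtual", counter)
            counter += 1
            simulator[parent] = free.pop()
            edges.append((parent, level[i]))
            edges.append((parent, level[i + 1]))
            nxt.append(parent)
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd node out moves up unchanged
        level = nxt
    return edges, simulator
```

With $d$ leaves the tree has $d - 1$ virtual nodes, so there are always enough real nodes to simulate them one-to-one, and each virtual node has degree at most 3 (two children and a parent), which is what makes the degree accounting described above straightforward.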
3.1 A framework for using virtual graphs
How to use virtual graphs for self-healing? As before, let $G$ be the real graph. Let $A$ be the self-healing algorithm and $I$ be the desired set of invariants on $G$. Let $\mathrm{Success}(A, G, I)$ be the condition that $A$ successfully maintains $I$ on graph $G$, i.e. the self-healing algorithm is successful. A method for successfully using a virtual graph for self-healing is to come up with a virtual graph and the necessary mappings such that if the algorithm successfully maintains some invariants on the virtual graph, this implies success of the algorithm on the real graph. More formally, we need to develop an algorithm $A_H$, a virtual graph $H$ and a set of invariants $I_H$ on $H$ such that $\mathrm{Success}(A_H, H, I_H)$ implies $\mathrm{Success}(A, G, I)$.
When to use virtual graphs? It would be useful to have an idea of what problems and structures could lend themselves to (hopefully elegant) solutions using virtual graphs. This would be somewhat akin to using transformations in algebra to solve a problem in a different basis. For example, the approach highlighted previously for self-healing could be extended to any algorithmic problem on the right structures. Even among our own self-healing algorithms, however, we have found some problems which seem to be amenable to solution using virtual graphs (e.g. ForgivingTree, ForgivingGraph) and some which do not (DASH, Xheal).
Virtual graphs beyond self-healing? There may be many problems and problem domains where it is useful to use a virtual graph framework. A general framework could follow the lines above: create an appropriate virtual graph with real and virtual nodes and suitable mappings or reductions. Reduce the problem on the real graph to the virtual graph and solve it on the virtual graph so that the results hold for the real graph. For example, in certain problems involving mobile computing and mobile networks, it may be useful to have virtual nodes as placeholders (e.g. fixed positions where an agent is always present).
3.2 De-simulation: Real Graph from a Virtual Graph
In the ForgivingTree/ForgivingGraph, a virtual graph maps to a real graph in a straightforward way: map all the virtual nodes to the real nodes simulating them. Figure 6 shows an example. More formally, the real graph is a homomorphic image of the virtual graph. Consider two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$. In this context, a homomorphism may be defined as follows: a homomorphism is a function $h : V_1 \to V_2$ such that if the undirected edge $\{u, v\}$ is in $E_1$ (the edge set of $G_1$), then the edge $\{h(u), h(v)\}$ is in $E_2$. Moreover, we say that $G_2$ is the homomorphic image of $G_1$ under $h$ if the edges of $G_2$ are exactly the images of the edges of $G_1$ under the homomorphism. There can be multiple real and virtual nodes corresponding to a processor in the network, and the processor performs all the functions required of those nodes. Each node can be identified by its processor and some additional information. For a node $v$ in the virtual graph, let $P(v)$ be the name of that processor. In the real graph, there is only one node per processor; consider this node to be labelled with the name of that processor. Then, our homomorphism is simply $v \mapsto P(v)$.
As mentioned earlier, we need to show that success of the algorithm on the virtual graph implies its success on the real graph. Since the ForgivingGraph/ForgivingTree give bounds on distances and degree increase, the observations below suffice to show the required bounds in the papers (we refer the reader to the papers for details, but intuitively, bounding the number of virtual nodes simulated in the virtual graph by a real node bounds the real node’s degree, if the virtual nodes are of constant degree).
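For any graph homomorphism $h : G_1 \to G_2$, for all nodes $u, v$ in $G_1$, $d_{G_2}(h(u), h(v)) \leq d_{G_1}(u, v)$, where $d_G(x, y)$ is the distance between two nodes $x$ and $y$ in a graph $G$.
If the graph $G_2$ is the homomorphic image of graph $G_1$ under a graph homomorphism $h$, then for all nodes $v$ in $G_2$, $\deg_{G_2}(v) \leq \sum_{u \in h^{-1}(v)} \deg_{G_1}(u)$, where $\deg_G(x)$ is the degree of the node $x$ in a graph $G$.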
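The de-simulation map and the two observations above can be checked mechanically on small examples. The sketch below is an illustration under stated assumptions: the names `desimulate`, `adjacency`, `distance` and `check_observations` are hypothetical, `owner` plays the role of the map from each real or virtual node to its processor, and edges whose endpoints collapse onto the same processor are simply dropped, which is a simplification made here rather than something specified in the text.

```python
from collections import deque


def desimulate(virtual_edges, owner):
    """Homomorphic image: relabel every node by the processor that owns it."""
    real_edges = set()
    for u, v in virtual_edges:
        pu, pv = owner[u], owner[v]
        if pu != pv:  # drop collapsed edges (an assumption of this sketch)
            real_edges.add(frozenset((pu, pv)))
    return real_edges


def adjacency(edges):
    """Build {node: set(neighbors)} from an iterable of 2-element edges."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def distance(adj, src, dst):
    """Hop distance between src and dst, or infinity if disconnected."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        if x == dst:
            return dist[x]
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return float("inf")


def check_observations(virtual_edges, owner):
    """True if the distance and degree observations hold for this instance."""
    v_adj = adjacency(virtual_edges)
    r_adj = adjacency(desimulate(virtual_edges, owner))
    nodes = list(v_adj)
    # Distances never grow when passing to the homomorphic image.
    distances_ok = all(
        distance(r_adj, owner[u], owner[v]) <= distance(v_adj, u, v)
        for u in nodes for v in nodes
    )
    # A processor's degree is at most the total degree of the nodes it owns.
    degrees_ok = all(
        len(r_adj.get(p, ())) <= sum(len(v_adj[u]) for u in nodes if owner[u] == p)
        for p in set(owner.values())
    )
    return distances_ok and degrees_ok
```

As a usage example, take real nodes `a`, `b`, `c` and two virtual nodes `h1` (simulated by `a`) and `h2` (simulated by `b`) with virtual edges `h1`-`a`, `h1`-`b`, `h2`-`h1` and `h2`-`c`. De-simulation yields the real path `a`-`b`-`c`: the virtual distance of 3 between `a` and `c` shrinks to 2, and no processor's real degree exceeds the combined degree of the nodes it owns.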
4 Directions and Conclusions
This paper discussed self-healing in dynamic networks and introduced a responsive and scalable approach towards self-healing in reconfigurable networks. A general model of self-healing and recent algorithms using this and similar models were succinctly compared. A general idea in these algorithms is to replace a deleted node by a Reconstruction Structure. We also introduced a virtual graph framework and a generic idea of using virtual structures, and we contend that this approach may be useful for a wide range of problems besides self-healing.
References
- David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris. Resilient overlay networks. SIGOPS Oper. Syst. Rev., 35(5):131–145, 2001.
- Villu Arak. What happened on August 16, August 2007. http://heartbeat.skype.com/2007/08/what-happened-on-august-16.html.
- Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations and Advanced Topics. John Wiley & Sons, 2004.
- Andrew Berns and Sukumar Ghosh. Dissecting self-* properties. Self-Adaptive and Self-Organizing Systems, International Conference on, 0:10–19, 2009.
- Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Commun. ACM, 17(11):643–644, November 1974.
- Shlomi Dolev. Self-stabilization. MIT Press, Cambridge, MA, USA, 2000.
- Shlomi Dolev and Nir Tzachar. Empire of colonies: Self-stabilizing and self-organizing distributed algorithm. Theor. Comput. Sci., 410(6-7):514–532, 2009.
- Robert D. Doverspike and Brian Wilson. Comparison of capacity efficiency of dcs network restoration routing techniques. J. Network Syst. Manage., 2(2), 1994.
- Ken Fisher. Skype talks of “perfect storm” that caused outage, clarifies blame, August 2007. http://arstechnica.com/news.ars/post/20070821-skype-talks-of-perfect-storm.html.
- T. Frisanco. Optimal spare capacity design for various protection switching methods in ATM networks. In Communications, 1997. ICC 97 Montreal, ’Towards the Knowledge Millennium’. 1997 IEEE International Conference on, volume 1, pages 293–298, 1997.
- Debanjan Ghosh, Raj Sharman, H. Raghav Rao, and Shambhu Upadhyaya. Self-healing systems - survey and synthesis. Decis. Support Syst., 42(4):2164–2185, 2007.
- Sanjay Goel, Salvatore Belardo, and Laura Iwan. A resilient network that can operate under duress: To support communication between government agencies during crisis situations. Proceedings of the 37th Hawaii International Conference on System Sciences, 0-7695-2056-1/04:1–11, 2004.
- Yukio Hayashi and Toshiyuki Miyazaki. Emergent rewirings for cascades on correlated networks. cond-mat/0503615, 2005.
- Thomas P. Hayes, Jared Saia, and Amitabh Trehan. The forgiving graph: a distributed data structure for low stretch under adversarial attack. In PODC ’09: Proceedings of the 28th ACM symposium on Principles of distributed computing, pages 121–130, New York, NY, USA, 2009. ACM.
- Thomas P. Hayes, Jared Saia, and Amitabh Trehan. The forgiving graph: a distributed data structure for low stretch under adversarial attack. Distributed Computing, pages 1–18, 2012.
- Tom Hayes, Navin Rustagi, Jared Saia, and Amitabh Trehan. The forgiving tree: a self-healing distributed data structure. In PODC ’08: Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing, pages 203–212, New York, NY, USA, 2008. ACM.
- Petter Holme and Beom Jun Kim. Vertex overload breakdown in evolving networks. Physical Review E, 65:066109, 2002.
- Rainer R. Iraschko, M. H. MacGregor, and Wayne D. Grover. Optimal capacity placement for path restoration in STM or ATM mesh-survivable networks. IEEE/ACM Trans. Netw., 6(3):325–336, 1998.
- Fabian Kuhn, Nancy Lynch, and Rotem Oshman. Distributed computation in dynamic networks. In Proceedings of the 42nd ACM symposium on Theory of computing, STOC ’10, pages 513–522, New York, NY, USA, 2010. ACM.
- Fabian Kuhn, Stefan Schmid, and Roger Wattenhofer. A Self-Repairing Peer-to-Peer System Resilient to Dynamic Adversarial Churn. In 4th International Workshop on Peer-To-Peer Systems (IPTPS), Cornell University, Ithaca, New York, USA, Springer LNCS 3640, February 2005.
- N. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, San Mateo, CA, 1996.
- Om Malik. Does Skype Outage Expose P2Ps Limitations?, August 2007. http://gigaom.com/2007/08/16/skype-outage.
- Muriel Medard, Steven G. Finn, and Richard A. Barry. Redundant trees for preplanned recovery in arbitrary vertex-redundant or edge-redundant graphs. IEEE/ACM Transactions on Networking, 7(5):641–652, 1999.
- Matt Moore. Skype’s outage not a hang-up for user base, August 2007. http://www.usatoday.com/tech/wireless/phones/2007-08-24-skype-outage-effects-N.htm.
- Adilson E Motter. Cascade control and defense in complex networks. Physical Review Letters, 93:098701, 2004.
- Adilson E Motter and Ying-Cheng Lai. Cascade-based attacks on complex networks. Physical Review E, 66:065102, 2002.
- Kazutaka Murakami and Hyong S. Kim. Comparative study on restoration schemes of survivable ATM networks. In INFOCOM (1), pages 345–352, 1997.
- Gopal Pandurangan and Amitabh Trehan. Xheal: localized self-healing using expanders. In Proceedings of the 30th annual ACM SIGACT-SIGOPS symposium on Principles of distributed computing, PODC ’11, pages 301–310, New York, NY, USA, 2011. ACM.
- Robert Poor, Cliff Bowman, and Charlotte Burgess Auburn. Self-healing networks. Queue, 1:52–59, May 2003.
- Bill Ray. Skype hangs up on users, August 2007. http://www.theregister.co.uk/2007/08/16/skype_down/.
- Jared Saia and Amitabh Trehan. Picking up the pieces: Self-healing in reconfigurable networks. In IPDPS. 22nd IEEE International Symposium on Parallel and Distributed Processing., pages 1–12. IEEE, April 2008.
- Atish Das Sarma and Amitabh Trehan. Edge-preserving self-healing: keeping network backbones densely connected. In Workshop on Network Science for Communication Networks (NetSciCom 2012), IEEE InfoComm, 2012. IEEE Xplore.
- Brad Stone. Skype: Microsoft Update Took Us Down, August 2007. http://bits.blogs.nytimes.com/2007/08/20/skype-microsoft-update-took-us-down.
- Gerard Tel. Introduction to distributed algorithms. Cambridge University Press, New York, NY, USA, 1994.
- Amitabh Trehan. Algorithms for self-healing networks. Dissertation, University of New Mexico, 2010.
- B. van Caenegem, N. Wauters, and P. Demeester. Spare capacity assignment for different restoration strategies in mesh survivable networks. In Communications, 1997. ICC 97 Montreal, ’Towards the Knowledge Millennium’. 1997 IEEE International Conference on, volume 1, pages 288–292, 1997.
- Yijun Xiong and Lorne G. Mason. Restoration strategies and spare capacity requirements in self-healing ATM networks. IEEE/ACM Trans. Netw., 7(1):98–110, 1999.