Deep Reinforcement Learning for Task-driven Discovery of Incomplete Networks
Abstract
Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low-quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem in an incomplete network setting as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called Network Actor Critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. A quantitative study is presented on several synthetic and real benchmarks. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms.
Peter Morales, Rajmonda Sulo Caceres, and Tina Eliassi-Rad
1 Introduction
Complex networks are critical to many applications such as those in the social, cyber, and bio domains. In practice, we commonly have access only to partially observed data. The challenge is to discover enough of the complex network so that we can perform a learning task well. The network discovery step is especially critical when the learning task has the characteristics of a “needle in a haystack” problem. If the discovery process is not carefully tuned, the noise introduced almost always overwhelms the signal. This presents an optimization problem: how should we grow the incomplete network to achieve a learning objective on the network, while at the same time minimizing the cost of observing new data?
In this work we view the network discovery problem through a decision-theoretic lens, where notions of utility and resource cost are naturally defined and jointly leveraged in a sequential, closed-loop manner. In particular, we leverage Reinforcement Learning (RL) and its mathematical formalism, Markov Decision Processes (MDP). MDP approaches have been successfully used in many other application settings [1, 2, 3]. However, the use of decision-theoretic approaches in the context of discovery of complex networks is novel and presents very interesting research opportunities. In particular, it requires learning effective models of reward that can capture properties of network structure at various topological scales and learning contexts. The network science community has defined many such topological and task quality metrics, but, to date, they have not been leveraged in the context of guiding the process of discovery of a partially observed, incomplete network. We consider the task of selective harvesting on graphs [9], where the learning objective is to maximize the collection of nodes of a particular type, under budget constraints. We make the following contributions:

We introduce a deep RL framework for task-driven discovery of incomplete networks. This formulation allows us to learn offline-trained models of environment dynamics and reward.

We show that, for a variety of complex learning scenarios, the added feature of learning from closely related scenarios leads to substantial performance improvements relative to existing online discovery methods.

We present an efficient way of organizing the state of possible discovered networks based on personalized PageRank. Our approach achieves substantial reductions in training and convergence time.

Our approach is model-free, yet is able to generalize well to unseen real network topologies and tasks.
2 Related Work
Our learning task falls under the category of finding the largest number of a particular type of node under budget constraints. The node type can be specified by the node attributes (for example, follower nodes on a Twitter network), or it can be determined by a node’s participation in a particular class of behavior (for example, membership in anomalous activity). Unlike the problem setting in [4], we do not assume access to the full topology of the network and therefore have to perform the learning task with partial information.
Discovering incomplete networks with limited resources has received a lot of attention in recent literature. The primary learning objective in these works is to increase the visibility of the network topology by either increasing the number of newly discovered nodes [5, 6, 7], or by increasing network coverage [8]. Our problem setting is the most similar to selective harvesting [9]. Our approach differs from [9] by leveraging the Reinforcement Learning paradigm to estimate offline models of network discovery strategies (policy) and node utility (reward) that are state-aware. More specifically, our approach explicitly connects the utility of a discovery choice to the network state when that choice was made.
Reinforcement learning for tasks on complex networks is a relatively new perspective. Work in [15, 16] leverages Reinforcement Learning to engineer diffusion processes in networks assumed to be fully observed, while the authors of [12] focus on the problem of graph partitioning. You et al. [11] leverage Reinforcement Learning to generate novel molecular graphs with desired domain-specified properties. There are connections to our problem setting. The graph generation is approached in a similar fashion to the network discovery problem, by iteratively expanding a seed graph via defined actions. There are, however, some important differences with our work. Since the application in [11] is molecular design, the size of the graphs they consider is very small. Their definition of reward and environment dynamics is tailored to the biochemical domain. Our approach is more general and can support discovery of different types of networks and different network sizes. Our notion of reward is also more general in that we do not utilize domain-specific properties to guide the learning process. De Cao and Kipf [13], similarly to [11], focus on small molecular graph generation, and furthermore, they do not consider the generation process as a sequence of actions. Finally, [14, 17] leverage deep Reinforcement Learning techniques to learn a class of graph greedy optimization heuristics on fully observed networks.
3 Problem Definition
We start with the assumption that a network $G = (V, E)$ contains a target subnetwork $G^{T} = (V^{T}, E^{T})$ representing a set of relevant vertices. The objective is to strategically explore and expand the network so that we optimize discovery of these relevant vertices. The decision-making agent is initially given partial information about the network, $G_0 = (V_0, E_0)$. A subset of those vertices have their relevance status revealed as well, with $0$ representing non-target vertices and $1$ representing target vertices. We assume our exploration starts from a seed vertex belonging to the partial target subnetwork. At each step $t$, the agent can choose from a set of vertices that are observed, but whose label is unknown. We refer to this set of vertices as the boundary set $B_t$. After selecting a vertex, the agent gains knowledge of the vertex label, as well as of the identity of all its neighbors. An immediate reward is given if the selected vertex belongs to the target subnetwork.
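The probing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `probe`, the dictionary-based graph representation, and the 0/1 reward convention are assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def probe(
    full_graph: Dict[int, Set[int]],
    labels: Dict[int, int],
    observed_edges: Set[Tuple[int, int]],
    boundary: Set[int],
    queried: Set[int],
    vertex: int,
) -> int:
    """Query one boundary vertex, reveal its label and neighbors, return the reward."""
    if vertex not in boundary:
        raise ValueError("only boundary vertices can be selected")
    boundary.remove(vertex)
    queried.add(vertex)
    for neighbor in full_graph[vertex]:
        # Every edge incident to the probed vertex becomes visible.
        observed_edges.add((min(vertex, neighbor), max(vertex, neighbor)))
        # Neighbors whose label is still unknown join the boundary set.
        if neighbor not in queried:
            boundary.add(neighbor)
    # Assumed convention: reward 1 for a target vertex, 0 otherwise.
    return 1 if labels[vertex] == 1 else 0


# Toy hidden network: vertices 0, 1 and 3 are targets, vertex 0 is the seed.
full_graph = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}}
labels = {0: 1, 1: 1, 2: 0, 3: 1}
queried = {0}
observed_edges = {(0, 1), (0, 2)}
boundary = {1, 2}

reward = probe(full_graph, labels, observed_edges, boundary, queried, 1)
```

In this toy run the agent probes vertex 1, which reveals the edge to vertex 3 and moves vertex 3 into the boundary set.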
This problem may be stated as a Markov Decision Process (MDP). An MDP is defined by the tuple $(S, A, T, R)$:

The state space, $S$, is the set of intermediate discovered networks.

The action space, $A_t = B_t$, at each step $t$, where $B_t$ is the set of boundary vertices at step $t$.

The transition model, $T(s_t, a_t, s_{t+1}) = P(s_{t+1} \mid s_t, a_t)$, encodes how the network state changes by specifying the probability of state $s_t$ transitioning to $s_{t+1}$ given action $a_t$. We do not model this transition function explicitly and take the model-free approach, where we iteratively define and approximate reward without having to directly specify the network state transition probabilities. We make this more precise in Section 4.

The local reward function, $R(s_t, a_t)$, returns the reward gained by executing action $a_t$ in state $s_t$ and is defined as: $R(s_t, a_t) = 1$ if the selected vertex $a_t$ belongs to the target subnetwork. The total cumulative, action-specific reward, also referenced as the value function $Q(s_t, a_t)$, is defined as:
$$Q(s_t, a_t) = \sum_{k=0}^{h} \gamma^{k} R(s_{t+k}, a_{t+k}) \tag{1}$$
with $\gamma \in [0, 1]$ representing a discount factor that captures the utility of exploring future graph states and $h$ the number of future steps considered. In the next section, we describe in detail our deep reinforcement learning algorithm.
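As a rough illustration of the cumulative reward in equation (1), the sketch below computes a discounted reward-to-go for every step of a single discovery path. The function name `discounted_returns` and the finite-horizon, reward-to-go form are assumptions for illustration; the argument `gamma` plays the role of the discount factor.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return, for each step t, the sum over k of gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each step reuses the return of its successor.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Rewards along one path: target, non-target, non-target, target.
path_rewards = [1.0, 0.0, 0.0, 1.0]
returns = discounted_returns(path_rewards, gamma=0.5)
```

A smaller `gamma` makes the agent value immediate hits more than delayed ones, which matters when target regions are separated by stretches of non-target vertices.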
4 Network Actor Critic (NAC) Algorithm
4.1 Offline Learning and Policy Optimization
In our setting, learning happens offline over a training set of possible discovery paths. We use simulated instances of both background networks and target subnetworks to generate paths or trajectories over the network state space.
Each path represents an alternating sequence of discovered graphs and actions, $(s_0, a_0, s_1, a_1, \dots)$, taken over $h$ steps. Since in this setting we have access to the ground-truth vertex labels, we can map each discovery path to the corresponding cumulative reward value using equation (1). An illustration is given in Figure 1.
Given the sampled trajectories, one of our learning objectives becomes to approximate the value function $V_\phi$ by minimizing the loss $L(\phi)$,
$$L(\phi) = \frac{1}{N} \sum_{t=1}^{N} \left( V_\phi(s_t) - Q_t \right)^2 \tag{2}$$
We formulate this objective by taking the input tuples of discovered graphs $s_t$, boundary nodes $B_t$, and corresponding cumulative reward values $Q_t$, such that $Q_t$ is the cumulative reward of equation (1) observed from step $t$ onward. The approximated function $V_\phi$ can then be utilized to estimate the policy function $\pi_\theta$, which defines the action probability distribution at each state. In particular, we estimate the advantage of choosing one node versus another at state $s_t$,
$$\hat{A}_t = Q_t - V_\phi(s_t) \tag{3}$$
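A common way to realize these two quantities in actor-critic training is a squared-error value loss and an advantage computed as the observed cumulative reward minus the value estimate. The sketch below assumes that form; the helper names `value_loss` and `advantages` are hypothetical, and the exact definitions used by NAC may differ.

```python
from typing import List, Sequence


def value_loss(predicted: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between value estimates and cumulative rewards."""
    return sum((v - q) ** 2 for v, q in zip(predicted, targets)) / len(targets)


def advantages(predicted: Sequence[float], targets: Sequence[float]) -> List[float]:
    """Observed cumulative reward minus the value estimate, per step."""
    return [q - v for v, q in zip(predicted, targets)]


# Value estimates from a (hypothetical) critic and the rewards actually observed.
predicted_values = [1.0, 0.5, 0.0]
observed_returns = [1.125, 0.25, 0.5]

loss = value_loss(predicted_values, observed_returns)
adv = advantages(predicted_values, observed_returns)
```

A positive advantage marks a selection that did better than the critic expected, so the policy update raises its probability; a negative one lowers it.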
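This advantage is used to scale the policy gradient estimator, typically defined as $\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right]$, where $\hat{A}_t$ is the advantage estimate at step $t$. We utilize a proximal policy optimization (PPO) method [23] in order to compute this gradient. PPO methods are widely utilized for policy network optimization and have been demonstrated to achieve state-of-the-art performance on graph tasks [11]. The objective function utilized is defined in equation (4),
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( \rho_t(\theta) \hat{A}_t, \; \mathrm{clip}\left( \rho_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \tag{4}$$
Here, $\epsilon$ is used to bound the loss function and help with convergence. During offline training, we modify this objective to encourage exploration and reduce the number of required training epochs to converge to a solution. For equation (5), $H(\pi_\theta(\cdot \mid s_t))$ denotes the entropy of policy $\pi_\theta$ in state $s_t$ and $\beta$ is used to balance exploitation vs. exploration,
$$L(\theta) = L^{CLIP}(\theta) + \beta \, \hat{\mathbb{E}}_t \left[ H(\pi_\theta(\cdot \mid s_t)) \right] \tag{5}$$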
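To make the clipped objective and its entropy bonus concrete, the following sketch evaluates both on a tiny batch using plain Python. It follows the clipped surrogate form from the PPO paper [23]; the function names, the argument `beta` for the entropy weight, and the batch layout are assumptions for illustration rather than the NAC implementation.

```python
import math
from typing import Sequence


def clipped_surrogate(
    new_probs: Sequence[float],
    old_probs: Sequence[float],
    advantages: Sequence[float],
    epsilon: float,
) -> float:
    """Batch mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


def entropy(action_probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


def ppo_objective(new_probs, old_probs, advantages, policies, epsilon, beta):
    """Clipped surrogate plus beta times the mean policy entropy."""
    mean_entropy = sum(entropy(p) for p in policies) / len(policies)
    surrogate_term = clipped_surrogate(new_probs, old_probs, advantages, epsilon)
    return surrogate_term + beta * mean_entropy


# Two sampled steps: probabilities of the chosen action under new and old policy.
new_p, old_p, adv = [0.6, 0.2], [0.4, 0.4], [1.0, -1.0]
# Full action distributions over two boundary nodes at each step.
dists = [[0.5, 0.5], [0.5, 0.5]]

surrogate = clipped_surrogate(new_p, old_p, adv, epsilon=0.2)
total = ppo_objective(new_p, old_p, adv, dists, epsilon=0.2, beta=0.01)
```

In the first step the probability ratio exceeds the upper clipping bound, so the update is capped; the entropy term rewards policies that keep several boundary nodes in play.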
Both learning objectives (2) and (5) are jointly optimized via an actor-critic training framework. This framework is detailed further below in the description of the Network Actor Critic (NAC) algorithm. To reduce training time, multiple agent instances are run simultaneously. Values collected by each agent are stored in a buffer, which is used to compute the losses for the value function and policy networks after a fixed time window of steps.
4.1.1 Training and Network Details
The NAC algorithm is updated differently during offline training versus online evaluation. During offline training, the ADAM optimizer [25] is used to update network parameters $\theta$ and $\phi$ for the policy and value function networks. In offline training, eight agents simultaneously carry out the anomaly discovery task on a unique network realization generated using the random graph models outlined in Table 2. During offline training, the hyperparameters used are: , , , , , and learning rate . For online evaluation, a single agent is used with , , , , , and . The policy and value function networks each comprise 3 convolutional layers with 64 hidden channels and a final fully connected layer.
4.2 Truncated Node Rank Embedding
One challenge that many reinforcement learning algorithms have to address is exploration of large state spaces. We consider the transformation of personalized PageRank (PPR) [18], which produces a ranking of vertices and allows for more effective detection of invariant structures among the potential network states [19, 20]. Furthermore, PPR fits naturally into our sequential network discovery setting and has been shown to effectively highlight other target nodes related to the initial seed network. We use the PPR ranking to reorder the rows of the original adjacency matrix. We further truncate this adjacency matrix for additional efficiency gains and only retain the adjacency matrix defined by the top $k$ vertices. $k$ is a parameter we select, and it defines the supporting network for computing potential discovery trajectories and long-term reward.
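The embedding can be illustrated with a small, dependency-free sketch: compute personalized PageRank by power iteration, sort vertices by score, and keep the adjacency matrix of the top `k`. The function names, the damping value `alpha=0.85`, the fixed iteration count, and the choice to reorder both rows and columns are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def personalized_pagerank(
    adj: Dict[int, Set[int]], seeds: Set[int], alpha: float = 0.85, iterations: int = 100
) -> Dict[int, float]:
    """Power iteration for PageRank with restarts concentrated on the seed set."""
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {v: (1.0 - alpha) * restart[v] for v in adj}
        for v, neighbors in adj.items():
            if neighbors:
                share = alpha * rank[v] / len(neighbors)
                for u in neighbors:
                    nxt[u] += share
            else:
                # A vertex without observed edges returns its mass to the seeds.
                for s in seeds:
                    nxt[s] += alpha * rank[v] / len(seeds)
        rank = nxt
    return rank


def ppr_truncated_adjacency(
    adj: Dict[int, Set[int]], seeds: Set[int], k: int
) -> Tuple[List[int], List[List[int]]]:
    """Order vertices by descending PPR score and keep the top-k adjacency block."""
    rank = personalized_pagerank(adj, seeds)
    order = sorted(adj, key=lambda v: (-rank[v], v))[:k]
    matrix = [[1 if u in adj[v] else 0 for u in order] for v in order]
    return order, matrix


# Observed network: a path 0-1-2-3-4 with vertex 0 as the seed.
observed = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
order, matrix = ppr_truncated_adjacency(observed, seeds={0}, k=3)
```

Because the ordering depends only on proximity to the seed, two discovered networks that differ by a relabeling of vertices map to the same truncated matrix, which is what shrinks the effective state space.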
5 Experiments
We evaluate our algorithm against several learning scenarios for both synthetic and realistic datasets. Next we describe our datasets and baselines used for comparison.
5.1 Datasets
Synthetic Datasets: We approach synthetic graph generation by individually modeling a background network (i.e., the network that does not contain any of the target nodes), and the foreground network (i.e., the network that only contains the target nodes and the interactions among them). We use two models to generate samples of background networks. The Stochastic Block Model (SBM) [26] is a common generative graph model that allows us to model community structure as dense subgraphs sparsely connected with the rest of the network. The Lancichinetti–Fortunato–Radicchi (LFR) model [21] is another frequently used generative model that, in contrast to SBM, allows us to simulate network samples with skewed degree distributions and skewed community sizes, and therefore is able to capture more realistic and complex properties of real networks. Finally, we use the Erdős–Rényi (ER) model [27] to simulate the foreground network. ER is a simple generative model where each pair of vertices is connected with equal probability, which controls the density of the foreground network. Parameter choices for all the models above are detailed in Table 2.
In order to create a background plus foreground network sample, we select a subset of the nodes from the background network that will represent the identity of the target nodes. We then simulate an ER subnetwork on these nodes and replace their background induced subnetwork with the ER subnetwork. We reference this process in the rest of the paper as embedding the foreground subnetwork.
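The embedding procedure just described can be sketched as follows. The function name `embed_foreground`, the adjacency-set representation, and the toy ring background are assumptions for illustration; the target nodes are passed in explicitly rather than sampled so that the example stays deterministic.

```python
import random
from itertools import combinations
from typing import Dict, Iterable, Set


def embed_foreground(
    background: Dict[int, Set[int]],
    target_nodes: Iterable[int],
    density: float,
    rng: random.Random,
) -> Dict[int, Set[int]]:
    """Replace the subgraph induced by the targets with an Erdős–Rényi sample."""
    graph = {v: set(neighbors) for v, neighbors in background.items()}
    targets = set(target_nodes)
    # Drop the background edges that run between two target nodes.
    for v in targets:
        graph[v] -= targets
    # Connect each target pair independently with probability `density`.
    for u, v in combinations(sorted(targets), 2):
        if rng.random() < density:
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Toy background: a ring on six vertices; vertices 0, 1, 2 become the targets.
background = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
combined = embed_foreground(background, [0, 1, 2], density=1.0, rng=random.Random(0))
```

Edges between a target and a non-target vertex are left untouched, so the foreground stays attached to the background exactly where the chosen nodes already were.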
Real Datasets: We analyzed two Facebook datasets [22] representing pages of different categories as nodes and mutual likes as edges. For both cases, we study the discovery of a target set of vertices, where we control how we generate and embed them in the background network. In particular, we embed a synthetic foreground subnetwork consisting of a denser (anomalous) ER graph with size and density . We also consider the Livejournal dataset [9]. This dataset represents an online social network with users representing the nodes, and their self-declared friendships representing the edges. For each user, there is also information on the groups they have joined. Similarly to [9], we use one of the listed groups as the target class. The Livejournal dataset represents a departure from the two Facebook datasets, both in its much larger size and in that the target class does not represent an anomaly. A few topological characteristics of the real networks described here, as well as details on their target class, are listed in Table 2.
5.2 Baselines
We evaluate the NAC algorithm by comparing performance with two top-performing online network discovery approaches. The Network Online Learning (NOL) [5] algorithm learns an online regression function that maximizes discovery of previously unobserved nodes for a given number of queries. We modify the objective of NOL to match our problem setting by requiring the discovery of previously unobserved nodes of a particular type. A second baseline we consider is the Directed Diversity Dynamic Thompson Sampling ($D^3TS$) [9] approach. $D^3TS$ is a stochastic multi-armed bandit approach that leverages different node classifiers and Thompson sampling to diversify the selection of a boundary node. Finally, we compare to a simple fixed node selection heuristic referenced in [9] called Maximum Observed Degree (MOD). At every decision step, MOD selects the node with the highest number of observed neighbors that have the desired label.
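Of the baselines, MOD is simple enough to state directly in code. The sketch below is an illustrative reading of the rule as described here, with a hypothetical function name `mod_select` and an assumed tie-break toward the smallest vertex id.

```python
from typing import Dict, Set


def mod_select(
    observed_adj: Dict[int, Set[int]], known_labels: Dict[int, int], boundary: Set[int]
) -> int:
    """Pick the boundary vertex with the most observed neighbors labeled as targets."""

    def target_neighbors(v: int) -> int:
        return sum(1 for u in observed_adj.get(v, ()) if known_labels.get(u) == 1)

    # max() keeps the first maximum, so sorting breaks ties by smallest id.
    return max(sorted(boundary), key=target_neighbors)


# Vertices 0 and 1 were queried and are targets; 2 and 3 are on the boundary.
observed_adj = {0: {1, 2, 3}, 1: {0, 3}, 2: {0}, 3: {0, 1}}
known_labels = {0: 1, 1: 1}
choice = mod_select(observed_adj, known_labels, boundary={2, 3})
```

The rule is purely greedy: it has no notion of delayed reward, which is the behavior NAC is meant to improve on when target regions are separated.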
Model  Type  Parameters
SBM  Background
LFR  Background
ER  Foreground

Name  # Nodes  # Edges  Target Type  Target Size
Facebook Politician  5,908  41,729  Synthetic  80
Facebook TV Shows  3,892  17,262  Synthetic  80
Livejournal  4,000k  35,000k  Real  1,400
5.3 Learning Scenarios
In the first learning scenario, the goal is to detect a set of distributed anomalous vertices. They are represented by two cliques, each containing 40 vertices, that are embedded 2 to 3 hops away from each other. The training instances are networks generated by the SBM model, while the test cases are network instances generated by the LFR model. In this scenario, the discovery agent has to figure out 1) how to value longer exploration paths over the cost of including nodes not in the target set, and 2) how to adjust to topological differences between training and testing instances. In Figure 2(a), we consider a test case where detectability of the two cliques with complete network information is relatively easy (the average background density where the cliques are embedded is comparatively low). We observe that all the methods are able to find the first clique, yet all the baselines struggle once they enter the region where no clique nodes are present. The baselines eventually find some clique nodes, but, even then, they are unable to fully retrieve the second clique. NAC is able to leverage estimation of long-term reward and access to the offline policy to fully recover both cliques, and furthermore, is able to generalize to the more complex LFR topology.
In Figure 2(b), we consider a much harder case: embedding two disjoint dense subgraphs, each with density 0.2, in a background of density 0.05. These parameters are close to the detectability bound [24] for the complete network case. In this case, none of the baselines learns how to recover the second subgraph. NAC goes through a longer exploration phase, but eventually learns how to grow the network to identify the second subgraph. In Figure 3(a) and (b), we illustrate how our model trained on synthetic background networks generalizes to realistic background topologies. For this scenario, we trained with instances from both the LFR and SBM models. We observe that NAC generalizes very well to the Facebook network topologies and is able to fully discover the target nodes.
In our last learning scenario (Figure 3(c)), we illustrate how our model generalizes to a test case where both the background network and the target set are from real data. Our model has only seen target class examples represented by a dense ER model, yet is able to discover an online Livejournal group with 1400 users. We note the initial exploration cost, as NAC learns to adapt to the new target topology. Eventually, by query 850, NAC is able to more efficiently discover the group members, and by query 1400 it fully recovers the whole group. In Figure 4(a) and (b), we demonstrate how reordering the adjacency matrix of the observed network by the PPR score supports faster model convergence during training. We illustrate this by analyzing the convergence behavior on the test case described in Figure 2(a), but the behavior is consistent for all the different test cases considered. Finally, in Figure 4(c), we illustrate that NAC has learned strategies beyond picking a vertex with a high PPR score. In particular, NAC has learned how to explore regions where delayed reward is critical (in this example, the region between the two disjoint cliques).
6 Conclusions and Future Work
We introduced NAC, a deep RL framework for task-driven discovery of incomplete networks. NAC learns offline models of reward and network discovery policies based on a synthetically generated training set. NAC is able to learn effective strategies for the task of selective harvesting, especially for learning scenarios where the target class is relatively small and difficult to discriminate. We show that NAC strategies transfer well to unseen and more complex network topologies including real networks.
Our approach has opened up many interesting avenues for future research. The effectiveness and convergence of our algorithm rely on being trained on a sufficiently representative training set. It is valuable to further explore and quantify the limits of transferability of synthetically generated training sets. Interestingly, our current framework is flexible enough to incorporate additional discovery strategies generated from other methods, as part of the offline training process. This feature can lead to more efficient discovery strategies, but we leave the careful analysis for future work. Selecting an effective approximation strategy is another topic for future research. NAC leverages PageRank to quickly identify regions of relevance, but it is of great interest to identify other graph space embeddings that can support fast navigation through the network state space. Finally, the framework is general enough to support discovery for other network learning tasks. It is valuable to explore how a different learning objective changes the training, convergence, and generalizability requirements.
References
 [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis: Human-level control through deep reinforcement learning. Nature (2015).
 [2] Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin A. Riedmiller, David Silver: Emergence of locomotion behaviours in rich environments. CoRR, abs/1707.02286 (2017).
 [3] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis: Mastering the game of Go without human knowledge. Nature (2017).
 [4] Xuezhi Wang, Roman Garnett, Jeff Schneider: Active search on graphs. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2013).
 [5] Timothy LaRock, Timothy Sakharov, Saheli Bhadra, Tina Eliassi-Rad: Reducing network incompleteness through online learning: a feasibility study. The 14th International Workshop on Mining and Learning with Graphs (2018).
 [6] Sucheta Soundarajan, Tina Eliassi-Rad, Brian Gallagher, and Ali Pinar: MaxOutProbe: an algorithm for increasing the size of partially observed networks. CoRR, abs/1511.06463 (2015).
 [7] Sucheta Soundarajan, Tina Eliassi-Rad, Brian Gallagher, and Ali Pinar: MaxReach: reducing network incompleteness through node probes. In ASONAM, pp 152–157 (2016).
 [8] Konstantin Avrachenkov, Prithwish Basu, Giovanni Neglia, Bernardete Ribeiro, Don Towsley: Pay few, influence most: online myopic network covering. IEEE Conference on Computer Communications Workshops, pp 813–818 (2014).
 [9] Fabricio Murai, Diogo Rennó, Bruno Ribeiro, Gisele L. Pappa, Donald F. Towsley, Krista Gile: Selective harvesting over networks. Data Mining and Knowledge Discovery, Volume 32, Issue 1, pp 187–217 (2017).
 [10] Ziwei Zhang, Peng Cui, Wenwu Zhu: Deep learning on graphs: a survey. CoRR, abs/1812.04202 (2018).
 [11] Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, Jure Leskovec: Graph convolutional policy network for goal-directed molecular graph generation. Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 6412–6422 (2018).
 [12] Mohammad H. Mofrad, Rami Melhem, Mohammad Hammoud: Partitioning graphs for the cloud using reinforcement learning. CoRR, abs/1907.06768 (2019).
 [13] Nicola De Cao, Thomas Kipf: MolGAN: An implicit generative model for small molecular graphs, CoRR, abs/1805.11973 (2018).
 [14] Hanjun Dai, Elias B. Khalil, Yuyu Zhang, Bistra Dilkina, Le Song: Learning combinatorial optimization algorithms over graphs. Proceedings of the 31st International Conference on Neural Information Processing Systems, pp 6351–6361 (2017).
 [15] Christopher Ho, Mykel J. Kochenderfer, Vineet Mehta, Rajmonda S. Caceres: Control of epidemics on graphs. 54th IEEE Conference on Decision and Control, pp 4202–4207 (2015).
 [16] Mahak Goindani, Jennifer Neville: Social reinforcement learning to combat fake news spread. Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (2019).
 [17] Akash Mittal, Anuj Dhawan, Sourav Medya, Sayan Ranu, Ambuj K. Singh: Learning heuristics over large graphs via deep reinforcement learning. CoRR, abs/1903.03332 (2019).
 [18] Taher H. Haveliwala: Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering 15(4), pp 784–796 (2003).
 [19] Isabel M. Kloumann, Johan Ugander, Jon Kleinberg: Block models and personalized PageRank. Proceedings of the National Academy of Sciences, 114 (1), pp 33–38 (2017).
 [20] David Gleich: PageRank beyond the web. SIAM Review. 57. 10.1137/140976649 (2014).
 [21] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi: Benchmark graphs for testing community detection algorithms. Physical Review E, Volume 78 (2008).
 [22] http://snap.stanford.edu/data/gemsec-Facebook.html
 [23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347 (2017).
 [24] Raj Rao Nadakuditi, M. E. J. Newman: Graph spectra and the detectability of community structure in networks. CoRR, abs/1205.1813 (2012).
 [25] Diederik P. Kingma, Jimmy Ba: Adam: a method for stochastic optimization. CoRR, abs/1412.6980 (2014).
 [26] Paul W. Holland, Kathryn Blackmond Laskey, Samuel Leinhardt: Stochastic blockmodels: first steps. Social Networks 5 (2), 109–137 (1983).
 [27] Paul Erdős, Alfréd Rényi: On random graphs. Publicationes Mathematicae, Volume 6, 290–297 (1959).