Informative Path Planning for Mobile Sensing with Reinforcement Learning
Abstract
Large-scale spatial data such as air quality, thermal conditions and location signatures play a vital role in a variety of applications. Collecting such data manually can be tedious and labour intensive. With the advancement of robotic technologies, it is feasible to automate such tasks using mobile robots with sensing and navigation capabilities. However, due to limited battery lifetime and the scarcity of charging stations, it is important to plan paths for the robots that maximize the utility of data collection, a problem known as informative path planning (IPP). In this paper, we propose a novel IPP algorithm using reinforcement learning (RL). A constrained exploration and exploitation strategy is designed to address the unique challenges of IPP, and is shown to converge faster and to reach better solutions than a classical reinforcement learning approach. Extensive experiments using real-world measurement data demonstrate that the proposed algorithm outperforms state-of-the-art algorithms in most test cases. Interestingly, unlike existing solutions that have to be re-executed when any input parameter changes, our RL-based solution allows a degree of transferability across different problem instances.
I Introduction
A wide range of applications rely on the availability of large-scale spatial data, such as water and air quality monitoring, precision agriculture, and WiFi fingerprint-based indoor localization. One common characteristic of these applications is that the data to be collected are location dependent and time consuming to obtain manually. Over the last two decades, wireless sensor networks (WSNs) [2] have been extensively investigated as a means of continuous environment monitoring. To exploit mobility, WSNs with mobile elements [11] have also been considered. While individual sensor devices are typically low cost, deploying and maintaining a large-scale WSN incurs high capital and operational expenses.
For one-time or infrequent spatial data collection, robotic technologies offer a viable alternative to fixed deployments [28]. A robot equipped with sensing devices can be controlled to traverse a target area and collect environmental data along its path. Although utilizing robots for spatial information gathering can significantly reduce human effort, robots are battery powered and have a limited lifetime. Given a budget constraint (e.g., maximum travel distance or time), it is important to plan motion trajectories for the robots such that the state of the environment can be accurately estimated from the sensor measurements.
In this work, we model the distribution of spatial data in a target area as a Gaussian Process (GP) [31]. GPs are versatile in that, by choosing appropriate kernel functions, they can be used to model processes with different degrees of smoothness. In prediction, uncertainties (variances) are provided alongside the predicted values. Based on GPs, mutual information (MI) is proposed in [13] as a criterion to measure the informativeness of a sensor placement. In [23, 6, 20], MI is used to measure the informativeness of a path when data are collected by a robot following the path. The problem of finding the most informative path from a predefined start location to a terminal location subject to a budget constraint is called informative path planning (IPP).
In general, IPP problems are formulated on graphs [20, 6, 7], with vertices representing waypoints and edges representing path segments. The utility
In this paper, a novel reinforcement learning (RL) algorithm is proposed to solve the IPP problem. Specifically, we model IPP as a sequential decision process. Given the start vertex on the IPP graph, a path is constructed sequentially by appending the next waypoint vertex. With reinforcement learning, the total rewards of the generated paths are expected to improve gradually.
Compared with conventional RL tasks, IPP poses non-trivial challenges. The available actions depend on the current position of the agent on the graph, since it can only choose among adjacent vertices as the next step. Furthermore, the reward of an action depends on past actions. For instance, revisiting a vertex can lead to a smaller but nonzero reward. Lastly, eligible paths (states) are constrained by the budget and the predefined terminal vertex. As a result, RL needs to be tailored to the problem setting. We adopt a recurrent neural network (RNN) based Q-learning structure, and select feasible actions using a mask mechanism. To improve learning efficiency, a constrained exploration and exploitation strategy is devised. This strategy looks ahead and restricts the agent to valid paths that can terminate at the specified vertex within the budget constraint.
To evaluate the proposed approach, we consider the task of WiFi Radio Signal Strength (RSS) collection in indoor environments. WiFi RSS measurements are commonly used in fingerprint-based indoor localization solutions [32, 33, 17]. Real data have been collected from two areas. In total, 20 different configurations (different start/terminal vertices, or budget constraints) have been evaluated. In 17 of them, the RL-based IPP algorithm achieves higher informativeness than state-of-the-art methods. Furthermore, we find that when the change in configuration is small, transfer learning from a pre-trained model can greatly improve the convergence speed on a new problem instance.
The rest of this paper is organized as follows. In Section II, related work on IPP and a brief background of RL are introduced. The IPP problem is formally defined in Section III. We present the proposed solution in Section IV. Experimental results are shown in Section V, and we conclude our work in Section VI.
II Related Work and Background
In this section, representative solutions to IPP are reviewed first. A brief background of RL is then presented, with a focus on the Q-learning approach. We also review two recent works that attempt to solve combinatorial optimization problems with RL.
II-A Existing solutions to IPP
In [30], IPP has been shown to be NP-hard. A greedy algorithm and a genetic algorithm (GA) are investigated. Experiments show that GA achieves a good trade-off between computation complexity and optimality. In [27], another evolution-based path planning approach is proposed with ant colony optimization. In [19], the path planning process is modeled as a control policy and a heuristic algorithm is proposed that incrementally constructs the policy tree.
Several algorithms decompose the optimization problem into subset selection and path construction. The main intuition is that once the subset of vertices is determined, a TSP solver can be used to construct a path with the minimum cost. In [4], vertices are randomly added or removed, and a TSP solver is used to maintain the path. Similarly, in [18], waypoints are added incrementally and a TSP solver is used to determine the traversing order. Such approaches usually assume that each selected vertex can only be visited once (due to TSP) and that the reward is accumulated only from the selected vertices. In IPP applications, such assumptions do not generally hold since robots can continue sensing the environment while travelling along the path. Furthermore, a vertex can be visited multiple times and rewards can still be obtained, particularly when MI is used as the criterion of informativeness.
Another line of IPP algorithms is based on the recursive greedy (RG) algorithm proposed for the orienteering problem (OP) [10]. RG is an approximation algorithm. The basic idea is to consider all possible combinations of intermediate vertices and budget splits, and then to apply the algorithm recursively on the smaller subproblems. IPP with RG can be found in [23, 6]. To reduce computation complexity, the authors of [23] propose spatial decomposition to create a coarse graph by grouping the vertices. Unfortunately, doing so can compromise the approximation guarantee of the original algorithm.
Most of the above-mentioned algorithms suffer from limited performance in terms of optimality. On the other hand, although RG has an approximation guarantee, its complexity makes it impractical on large graphs or when the budgets are large.
II-B Reinforcement Learning
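Under the framework of RL [15, 24], an agent interacts with the environment through a sequential decision process, which can be described by a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where

$\mathcal{S}$ is a finite set of states;

$\mathcal{A}$ is a finite set of actions;

$\mathcal{T}$ is a state transition function^{2} defined as $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$;

$\mathcal{R}$ is a reward function defined as $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, where $r \in \mathbb{R}$ is a real-valued reward signal.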
To solve the MDP with RL, a policy $\pi$ is required to guide the agent's decision making. The policy can be deterministic or stochastic. A deterministic policy is defined as $\pi: \mathcal{S} \rightarrow \mathcal{A}$, i.e., given the state, the policy outputs the action to take in the following step.
At each time step $t$, the environment is at a state $s_t \in \mathcal{S}$. The agent makes a decision by taking an action $a_t \in \mathcal{A}$. It then receives an immediate reward signal $r_t$ and the state moves to $s_{t+1}$. The goal of RL is to find a policy $\pi$ such that the total future reward
$$R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k \tag{1}$$
is maximized, where $\gamma \in [0, 1]$ is a discount factor that controls the priority of step rewards and $T$ is the last action time step.
There are two main approaches to find the desired policy $\pi$, namely the policy-based and the value-based approaches. The policy-based approach (e.g., policy gradient) aims to directly optimize the policy and output the action (or an action distribution for a non-deterministic policy) given an input state, while the value-based approach (e.g., Q-learning) is indirect. The insight of the latter is to predict the total future reward given an input state or a state-action pair, so that the agent can make decisions based on the predicted reward.
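We consider the Q-learning approach in this work. Specifically, Q-learning aims to learn a function $Q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, with $Q(s, a)$ representing the total future reward of taking the action $a$ from state $s$. The policy given $Q$ can then be formulated as
$$\pi(s) = \arg\max_{a \in \mathcal{A}} Q(s, a). \tag{2}$$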
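In practice, the Q-function is usually approximated with a neural network $Q(s, a; \theta)$ with parameters $\theta$, which is known as DQN [21]. The network is optimized iteratively by minimizing the temporal difference with a loss function defined as
$$L(\theta) = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta)\right)^2\right]. \tag{3}$$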
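As a minimal illustration of the quantities introduced in this subsection, the following Python sketch computes a discounted return, a greedy action, and the squared temporal-difference error for a single transition. It is a sketch under stated assumptions rather than the implementation used in this work: Q-values are plain lists indexed by action, and the function names `discounted_return`, `greedy_action` and `td_loss` are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Total future reward: sum of gamma**k * r_k over the remaining steps."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def greedy_action(q_values):
    """Greedy policy: index of the action with the largest predicted Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_loss(q_values, next_q_values, action, reward, gamma, done):
    """Squared temporal-difference error for one transition (s, a, r, s')."""
    # Bootstrapped target: r + gamma * max_a' Q(s', a'); only r if the episode ended.
    target = reward if done else reward + gamma * max(next_q_values)
    return (target - q_values[action]) ** 2
```

In a DQN, this per-transition error would be averaged over a sampled batch and minimized with respect to the network parameters.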
There are many variants and techniques for Q-learning models and training methodology [25, 29, 22]. Due to space limitations, and because Q-learning itself is not part of our contribution, we only cover the basic background here. Most of these techniques can be applied directly in our proposed method.
In recent years, RL with neural networks has been applied to solve combinatorial optimization problems. In [5], the authors consider TSP and utilize a pointer network to predict the distribution of vertex permutations. Negative tour lengths are used as reward signals, and parameters of the neural network are optimized using the policy gradient method. Experiments show that neural combinatorial optimization achieves close to optimal results on 2D Euclidean graphs.
In [16], a Q-learning approach is presented to solve combinatorial optimization problems on graphs. A graph embedding technique is designed for graph representation, and solutions are greedily constructed with Q-learning. The effectiveness of the approach is evaluated on Minimum Vertex Cover, Maximum Cut and TSP.
Both [16] and [5] assume complete graphs. In contrast, the presence of obstacles in spatial areas implies that the resulting graphs have limited connectivity. Furthermore, as discussed previously, IPP is fundamentally a harder problem than TSP, and TSP appears as a subprocess in some IPP solutions. In this paper, we show how RL can be applied in the IPP context.
III Problem Formulation
Since IPP is defined on graphs, the target area first needs to be converted to a graph. Points of Interest (PoIs) in the area can be seen as vertices, and an edge exists between two vertices if one is directly reachable from the other.
III-A General Path Planning with Limited Budget
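We define the graph-based general path planning problem using a five-tuple $\langle G, v_s, v_t, f(\mathcal{P}), B \rangle$. Specifically,

$G = (V, E)$ is the graph. Each $v \in V$ is associated with a physical location $x_v$. For each $e \in E$, there is a corresponding cost $c_e$ (e.g., the length of the edge) for travelling along the edge.

$v_s, v_t \in V$ are the specified start and terminal vertices, respectively.

A valid path^{3} is denoted by $\mathcal{P}$, and its reward is denoted by $f(\mathcal{P})$.

$B$ is the total budget available for the path.

The cost of $\mathcal{P}$ is the sum of edge costs along the path,
$$C(\mathcal{P}) = \sum_{i=1}^{|\mathcal{P}| - 1} c_{(v_i, v_{i+1})}, \tag{4}$$
where $v_i$ is the $i$th vertex in $\mathcal{P}$ and $(v_i, v_{i+1})$ represents the corresponding edge. The objective is to find the optimal path $\mathcal{P}^*$ that satisfies
$$\mathcal{P}^* = \arg\max_{\mathcal{P} \in \Psi} f(\mathcal{P}) \quad \text{s.t.} \quad C(\mathcal{P}) \le B, \tag{5}$$
where $\Psi$ is the set of all paths in $G$ from $v_s$ to $v_t$.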
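The cost and feasibility conditions above can be illustrated with a short Python sketch. It assumes that a path is a list of vertex identifiers and that edge costs are stored in a dictionary keyed by ordered vertex pairs. Both representations and the names `path_cost` and `is_feasible` are hypothetical choices made for illustration.

```python
def path_cost(path, edge_cost):
    """Sum of edge costs along a path given as a list of vertices."""
    return sum(edge_cost[(u, v)] for u, v in zip(path, path[1:]))


def is_feasible(path, edge_cost, start, terminal, budget):
    """A path is valid if it runs from start to terminal within the budget."""
    return (
        len(path) > 0
        and path[0] == start
        and path[-1] == terminal
        and path_cost(path, edge_cost) <= budget
    )
```

Because repeated vertices are allowed on a path, `path_cost` deliberately makes no uniqueness check on the vertices.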
One classic variant of the general path planning formulation is the orienteering problem (OP) [12, 26, 14]. In OP, each vertex is associated with a reward and the goal is to find a subset of vertices to visit so as to maximize the collected reward within a budget constraint. When the reward function $f$ is submodular or correlated, the problem is also known as the submodular orienteering problem (SOP) [10] or the correlated orienteering problem (COP) [34], respectively.
III-B Informative Path Planning
IPP is a specific case of the general path planning problem where the reward of a path is defined by the informativeness of the data collected along the path. In information theory, informativeness can be measured through MI [23, 6, 9, 18]. Next, we present the calculation of $f(\mathcal{P})$ for IPP based on GPs and MI. A detailed mathematical background of GPs can be found in [31].
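Assume the data to be collected can be modeled by a GP. Thus, for each $v \in V$ at a physical location $x_v$, the corresponding data $y_v$ (e.g., temperature, humidity, etc.) is a Gaussian distributed random variable, and the variables at all the locations of $V$ follow a joint multivariate Gaussian distribution,
$$\begin{bmatrix} y_{v_1} \\ \vdots \\ y_{v_n} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu(x_{v_1}) \\ \vdots \\ \mu(x_{v_n}) \end{bmatrix}, \begin{bmatrix} k(x_{v_1}, x_{v_1}) & \cdots & k(x_{v_1}, x_{v_n}) \\ \vdots & \ddots & \vdots \\ k(x_{v_n}, x_{v_1}) & \cdots & k(x_{v_n}, x_{v_n}) \end{bmatrix} \right),$$
where $\mu(\cdot)$ is the mean function and $k(\cdot, \cdot)$ is the kernel, and $n = |V|$ is the total number of vertices. For simplicity, we denote the multivariate Gaussian distribution by $Y_V \sim \mathcal{N}(\mu(X_V), K_{VV})$, where $X_V$ is a matrix for the locations of $V$ and $K_{VV}$ is the covariance matrix as defined by the above kernel function $k(\cdot, \cdot)$.
The differential entropy (also referred to as continuous entropy) [1] of $Y_V$ is
$$H(Y_V) = \frac{1}{2} \ln\left( (2\pi e)^{n} \det(K_{VV}) \right). \tag{6}$$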
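Given $\mathcal{P}$, suppose data are going to be collected by an agent along the path at every $d$ meter interval (which depends on the traveling speed and sampling frequency). The sample locations can be easily calculated from the positions of the vertices. We denote all the sample locations as $X_{\mathcal{P}}$ and the corresponding measurements as $Y_{\mathcal{P}}$. The posterior distribution of $Y_V$ given $Y_{\mathcal{P}}$ is $\mathcal{N}(\mu_{V|\mathcal{P}}, \Sigma_{V|\mathcal{P}})$, where
$$\mu_{V|\mathcal{P}} = \mu(X_V) + K_{V\mathcal{P}} \left( K_{\mathcal{P}\mathcal{P}} + \sigma_n^2 I \right)^{-1} \left( Y_{\mathcal{P}} - \mu(X_{\mathcal{P}}) \right), \tag{7}$$
$$\Sigma_{V|\mathcal{P}} = K_{VV} - K_{V\mathcal{P}} \left( K_{\mathcal{P}\mathcal{P}} + \sigma_n^2 I \right)^{-1} K_{\mathcal{P}V}. \tag{8}$$
Here $\sigma_n^2$ represents the noise variance of the underlying GP, and $K_{V\mathcal{P}}$ is the kernel matrix generated by $k(\cdot, \cdot)$ with pairwise entries in $X_V$ and $X_{\mathcal{P}}$ (the matrices $K_{VV}$, $K_{\mathcal{P}\mathcal{P}}$ and $K_{\mathcal{P}V}$ are defined analogously). The conditional differential entropy is then given by
$$H(Y_V \mid Y_{\mathcal{P}}) = \frac{1}{2} \ln\left( (2\pi e)^{n} \det(\Sigma_{V|\mathcal{P}}) \right). \tag{9}$$
The MI based reward can be calculated with
$$f(\mathcal{P}) = MI(Y_V; Y_{\mathcal{P}}) = H(Y_V) - H(Y_V \mid Y_{\mathcal{P}}). \tag{10}$$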
Note that since the differential entropy only depends on the kernel matrix (i.e., the kernel function and the locations), the reward can be calculated analytically without travelling along the actual path and taking real measurements. This is what makes offline path planning possible.
However, the kernel function usually has some hyperparameters which may not be known in advance. Thus, pilot data are needed to learn these hyperparameters [6, 13, 10]. Given a small set of pilot data collected in advance at locations $X_D$ with measurements $Y_D$, the reward can be calculated with
$$f(\mathcal{P}) = MI(Y_V; Y_{\mathcal{P} \cup D}) = H(Y_V) - H(Y_V \mid Y_{\mathcal{P}}, Y_D). \tag{11}$$
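To make the MI reward concrete, the following Python sketch evaluates the difference between the prior and posterior differential entropies using NumPy. It is only an illustration under stated assumptions: a squared-exponential (RBF) kernel with fixed hyperparameters is assumed, since the kernel used in this work is not specified here, and the names `rbf_kernel` and `mutual_information` are hypothetical. The entropy constants involving $2\pi e$ cancel in the difference, so only log-determinants are needed. Pilot locations can be handled by appending them to the sample locations.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between two sets of 2D locations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale ** 2)


def mutual_information(x_v, x_a, noise_var=0.1, length_scale=1.0, variance=1.0):
    """MI between the data at vertices x_v and noisy samples taken at x_a."""
    k_vv = rbf_kernel(x_v, x_v, length_scale, variance)
    k_va = rbf_kernel(x_v, x_a, length_scale, variance)
    k_aa = rbf_kernel(x_a, x_a, length_scale, variance)
    k_aa = k_aa + noise_var * np.eye(len(x_a))
    # Posterior covariance of the vertex data given the samples.
    posterior = k_vv - k_va @ np.linalg.solve(k_aa, k_va.T)
    # A small jitter keeps the log-determinants numerically stable (an assumption).
    jitter = 1e-9 * np.eye(len(x_v))
    _, logdet_prior = np.linalg.slogdet(k_vv + jitter)
    _, logdet_post = np.linalg.slogdet(posterior + jitter)
    # H(prior) - H(posterior); the (2*pi*e)^n factors cancel.
    return 0.5 * (logdet_prior - logdet_post)
```

Since the result depends only on locations and kernel hyperparameters, the reward of a candidate path can be evaluated before any measurement is taken.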
Given the input $\langle G, v_s, v_t, f(\mathcal{P}), B \rangle$, one naive approach is to enumerate all the valid paths from $v_s$ to $v_t$ and choose the path with the highest $f(\mathcal{P})$. However, since the problem is NP-hard, brute force search is not computationally feasible in practice.
IV Proposed Solution
In this section, we present the proposed solution with a Q-learning approach. Related concepts are defined first. Then we present the overview and details of each component.
It is straightforward to view IPP as a sequential decision problem. Specifically, suppose an agent is exploring solutions in $G$ from $v_s$ to $v_t$, with a budget $B$. As shown in Fig. 1, we denote the vertices traversed by the agent as the partial path $\mathcal{P}_p$. Initially, $\mathcal{P}_p = [v_s]$. In subsequent steps, the available actions for the agent are the adjacent vertices of the last vertex in $\mathcal{P}_p$, i.e., the current position of the agent. Once the agent decides which action to take according to some policy $\pi$, the action (vertex) is appended to the partial path, and a corresponding immediate reward is sent to the agent. The process repeats until the budget is exhausted or the agent successfully reaches $v_t$. We summarize the corresponding RL concepts in the context of IPP as follows,

Agent and Environment: The agent is a robot that is located at a vertex of $G$ and moves along the edges. The environment is a simulator based on the input graph.

State: Many RL solutions such as [21] encode the states with pixel-level images and use a convolutional neural network (CNN) for end-to-end learning. Since IPP is defined on a graph, it is not necessary to use a CNN. Instead, we define the state as $\mathcal{P}_p$, and a state transition means appending a vertex to $\mathcal{P}_p$.

Action: An action selects which available vertex to go to in the next step. The available actions (the next waypoints to visit) vary significantly when the agent moves to a new vertex, depending on the connectivity of the graph $G$.

Reward: The reward is a numerical value given to the agent by the environment after it takes an action. The rewards are expected to link to the optimization goal, i.e., maximizing the informativeness of the path.

Episode: Each episode represents the process of constructing a trial path starting from $v_s$ until the budget is used up or the agent reaches $v_t$. The agent is expected to reach the terminal vertex within the budget.
IV-A Solution Overview
Fig. 2 shows the overall architecture of the solution. The input is a target area with a small amount of pilot data. The area is discretized into a graph. As mentioned previously, the data to be collected are spatially correlated. A GP regression model is fitted and optimized with the pilot data to estimate the hyperparameters. Once the hyperparameters are estimated, the reward function defined in (11) is determined, which can be used to calculate the step reward for the agent.
We utilize a Recurrent Neural Network (RNN) as the Q-value approximator, since future rewards (Q-values) depend on all the visited vertices. Meanwhile, for each input state, we output a Q-value for every vertex in the graph, even if it is not a direct neighbor of the last vertex of $\mathcal{P}_p$. The Q-values are then masked with the connectivity of the graph to filter out non-reachable vertices. In each episode, the agent starts from $v_s$ and selects actions with the $\epsilon$-greedy policy based on the output Q-values. Rewards are calculated from the reward function $f$, and state transition tuples are added to the experience buffer. At each step, a batch of transition tuples is sampled from the buffer and used to update the model's parameters by minimizing the temporal difference in (3).
Next, we give a detailed description of some of the key components.
IV-B Constrained Exploration and Exploitation
One major obstacle in applying RL to IPP is the constrained terminal state. Any valid path should start from $v_s$ and terminate at $v_t$, and also satisfy the budget constraint. We design a novel constrained exploration and exploitation strategy that can reduce the computation complexity of finding valid and high-reward paths.
Exploration
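Let $\mathcal{P}_p = [v_s, \ldots, v_c]$, where $v_c$ is the agent's current location. The set of available actions to the agent is given by the neighborhood of $v_c$, i.e.,
$$\mathcal{A}(v_c) = \{ v \in V \mid (v_c, v) \in E \}. \tag{12}$$
Among $\mathcal{A}(v_c)$, we denote the valid actions that still allow the agent to reach $v_t$ within the budget as $\mathcal{A}_{\text{valid}}(v_c)$,
$$\mathcal{A}_{\text{valid}}(v_c) = \{ v \in \mathcal{A}(v_c) \mid c_{(v_c, v)} + C(\text{ShortestPath}(v, v_t)) \le B_r \}, \tag{13}$$
where $B_r$ is the remaining budget, which can be calculated by subtracting the length of $\mathcal{P}_p$ from $B$, and ShortestPath$(v, v_t)$ denotes the least cost path from vertex $v$ to $v_t$.
If the agent randomly chooses an action from $\mathcal{A}_{\text{valid}}(v_c)$ at each step, it can be guaranteed to reach $v_t$ within the budget. Furthermore, among $\mathcal{A}_{\text{valid}}(v_c)$, it is possible that some vertices have been visited previously. For IPP, our exploration strategy is to randomly select a vertex that has not been visited if one exists, and otherwise to randomly select a vertex from $\mathcal{A}_{\text{valid}}(v_c)$. Note that $\mathcal{A}_{\text{valid}}(v_c)$ changes with the states. In other words, the valid actions are updated stepwise.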
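The constrained exploration step described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the graph is an adjacency dictionary `graph[u] = {v: cost}` with positive, symmetric costs, the shortest-path costs to the terminal vertex are precomputed once with Dijkstra's algorithm, and the names `shortest_costs_to`, `valid_actions` and `explore` are hypothetical.

```python
import heapq
import random


def shortest_costs_to(graph, terminal):
    """Dijkstra from the terminal vertex; graph[u] = {v: cost}, undirected."""
    dist = {terminal: 0.0}
    heap = [(0.0, terminal)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, c in graph[u].items():
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def valid_actions(graph, current, remaining_budget, dist_to_terminal):
    """Neighbors from which the terminal is still reachable within the budget."""
    return [
        v for v, c in graph[current].items()
        if c + dist_to_terminal.get(v, float("inf")) <= remaining_budget
    ]


def explore(graph, path, remaining_budget, dist_to_terminal, rng=random):
    """Pick a random valid neighbor, preferring vertices not yet on the path."""
    candidates = valid_actions(graph, path[-1], remaining_budget, dist_to_terminal)
    if not candidates:
        return None
    unvisited = [v for v in candidates if v not in path]
    return rng.choice(unvisited or candidates)
```

Precomputing the distances once per terminal vertex keeps the per-step validity check proportional to the degree of the current vertex.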
Exploitation
By controlling the exploration actions, the agent is guaranteed to reach $v_t$. However, when actions are generated through exploitation with the maximum predicted Q-value, they may be invalid. This is particularly the case in the initial stage, when the predicted Q-values are not accurate. Again, the shortest path is utilized to identify such actions. If the remaining budget is not sufficient to cover the selected action and the shortest path thereafter, the episode is terminated immediately and a penalty reward is triggered.
IV-C Reward Mechanism
For each action, the environment provides an immediate reward signal and transits to the next state. A simulator is created based on the input graph.
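The reward of taking action $a$ from the partial path $\mathcal{P}_p$ is defined as
$$r(\mathcal{P}_p, a) = f(\mathcal{P}_p \oplus a) - f(\mathcal{P}_p), \tag{14}$$
where $\mathcal{P}_p \oplus a$ denotes $\mathcal{P}_p$ with vertex $a$ appended. In such a way, the rewards of all steps add up to the reward of the constructed path at the last step.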
When action $a$ violates the budget constraint, we signal a penalty reward to the agent to discourage such an action. Specifically, a variable $R_c$ is used to track the cumulative reward obtained. Once the budget constraint is violated, the reward perceived becomes the negative of $R_c$. Therefore, any invalid path eventually leads to a zero reward (except the initial reward from the pilot data). The state transition and reward mechanism in one step are outlined in Algorithm 1. The procedure returns a transition tuple $(s, a, r, s', \text{IsDone})$, namely, upon taking action $a$ from state $s$ the agent gets a reward $r$, the state transits to $s'$, and IsDone indicates whether the action terminates the episode. The transition tuple is stored in an experience buffer, which is the input for Q-network training.
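The one-step transition and reward mechanism described in this subsection can be sketched in Python as below. The sketch is reconstructed from the prose description only and should not be read as Algorithm 1 itself. The class name `IPPEnv`, its attributes, the precomputed `dist_to_terminal` dictionary and the `path_reward` callable (standing in for the MI reward of a path) are all hypothetical.

```python
class IPPEnv:
    """Toy simulator: states are partial paths, actions are next vertices."""

    def __init__(self, graph, start, terminal, budget, dist_to_terminal, path_reward):
        self.graph = graph                        # graph[u] = {v: edge cost}
        self.start = start
        self.terminal = terminal
        self.budget = budget
        self.dist_to_terminal = dist_to_terminal  # least cost from each vertex to terminal
        self.path_reward = path_reward            # assumed stand-in for the path reward f
        self.reset()

    def reset(self):
        self.path = [self.start]
        self.remaining = self.budget
        self.cumulative = 0.0                     # reward collected so far in the episode
        return tuple(self.path)

    def step(self, action):
        """Return the transition tuple (state, action, reward, next_state, is_done)."""
        state = tuple(self.path)
        cost = self.graph[self.path[-1]][action]
        if cost + self.dist_to_terminal[action] > self.remaining:
            # Budget violated: cancel the collected reward and end the episode.
            return state, action, -self.cumulative, state, True
        # Step reward is the gain in path reward from appending the vertex.
        reward = self.path_reward(self.path + [action]) - self.path_reward(self.path)
        self.path.append(action)
        self.remaining -= cost
        self.cumulative += reward
        return state, action, reward, tuple(self.path), action == self.terminal
```

Each returned tuple can be appended to an experience buffer and later sampled in batches for training.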
IV-D Q-learning Network
The Q-learning network is used to predict the Q-values, so that the agent can make better decisions. We adopt an RNN-based neural network since the input is a sequence. Given $\mathcal{P}_p$, the input to the RNN is the corresponding 2D location coordinates of each vertex in $\mathcal{P}_p$. The output of the last cell is a Q-value vector $Q$ of length $|V|$.
On the other hand, since the graph may not be fully connected and the predicted Q-values are only valid for the adjacent vertices, we define a mask vector $m$ of length $|V|$ as
$$m_i = \begin{cases} 0, & \text{if } (v_c, v_i) \in E, \\ M, & \text{otherwise}, \end{cases} \tag{15}$$
where $M$ is a predefined large negative number acting as a penalty reward and $v_c$ is the current position. Therefore, the final masked output of the Q-values is $Q + m$.
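The masking step can be illustrated with a few lines of Python. This is a minimal sketch that assumes the mask is additive, with Q-values held in a plain list indexed by vertex. The function name `mask_q_values` and the default penalty value are hypothetical.

```python
def mask_q_values(q_values, neighbors, penalty=-1e9):
    """Add a large negative penalty to the Q-value of every non-adjacent vertex."""
    return [
        q if i in neighbors else q + penalty
        for i, q in enumerate(q_values)
    ]
```

After masking, taking the maximum over the vector can only select a vertex adjacent to the current position, provided the penalty dominates the range of the predicted Q-values.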
IV-E Learning and Searching Algorithm
Based on the Q-network, the agent uses an $\epsilon$-greedy policy to explore the solution space, with the proposed constrained exploration and exploitation strategy in Section IV-B. The state transition tuples from Algorithm 1 are cached in an experience buffer, and the network parameters are trained based on this memory buffer. However, when a neural network is used as the function approximator, there is no theoretical convergence guarantee for Q-learning [21]. Further, with gradient descent based optimization, the final model may get stuck at a local optimum. Thus, it is possible that paths sampled during the training stage have a larger reward than the paths generated by the final Q-network. We utilize a learning and searching strategy similar to the "Active Search" in [5]. At regular intervals of training iterations, a path is constructed with greedy search according to the Q-network, and we keep track of the best path ever seen as the final solution.
V Performance Evaluation
In this section, the performance of the proposed Q-learning approach for IPP is evaluated. In particular, we compare it with a naive exploration approach in terms of learning efficiency, and also compare its performance with other IPP algorithms. Finally, we show that the knowledge of the Q-network is transferable when the constraints change, especially when the changes are moderate.
V-A Graph Setting and RL Implementation
In the experiments, we consider WiFi signal strengths as the environmental data, which have been extensively used for fingerprint-based indoor localization [32, 33, 17]. Two real-world indoor areas are selected and discretized into grid graphs. The first area is an open area and the second area is a corridor. A small amount of pilot WiFi signals are collected to estimate the hyperparameters of the underlying GP for each area. The two areas are illustrated in Fig. 3 and Fig. 4, respectively.
A simulator representing the interactions between the agent and the graph is implemented in Python. The APIs are similar to those of OpenAI Gym [8], which is a reinforcement learning platform. For IPP, the main logic of the simulator is the state transition and reward mechanism as outlined in Algorithm 1.
V-B Comparison with Naive Exploration
We first compare the performance with a naive exploration approach, which simply extends the partial path through neighboring vertices until the budget is used up, while all the other settings are kept the same as in the constrained exploration and exploitation strategy. Fig. 5 and Fig. 6 show the average episode reward during the learning process in Area One and Area Two, respectively. Similar to [21], each epoch is defined as 50 episodes of learning, and 100 epochs are run for each configuration. It can be seen clearly that the constrained exploration and exploitation strategy achieves higher reward (MI) and higher efficiency. During the initial episodes of the naive exploration, the rewards are low since most generated paths are invalid, i.e., they do not terminate at $v_t$ (such paths are penalized with the negative cumulative reward, but the result is not 0 because the MI from the pilot data is considered). Furthermore, the difference is more significant in Area Two since its graph is larger than that of Area One. In a larger graph, the probability that blind searches can construct a valid path is smaller. As can be seen from Fig. 6, for some budget settings (e.g., 100, 110, 140) the naive approach fails to improve in terms of average reward. In comparison, with the constrained exploration and exploitation strategy the average reward improves gradually until convergence under all the budget settings.
V-C Comparison with Other IPP algorithms
The ultimate goal of IPP is to plan a path that can reduce prediction uncertainty with GP regression using the collected data. Unlike existing heuristic or evolution-based approaches, the Q-learning solution learns from trial paths and improves gradually. For comparison, we have also implemented the following algorithms:
Brute Force Tree Search The brute force tree search tries to enumerate all the paths from $v_s$ to $v_t$ and records the path with the highest reward. A stack is utilized to store the partial paths, and branches are searched in a manner similar to a depth-first search traversal. Here $v_s$ can be seen as the root of the search tree. A search branch is terminated when $v_t$ is encountered or the budget is exhausted.
Recursive Greedy Algorithm The Recursive Greedy (RG) algorithm is adapted from [10]. Originally, RG was designed for the orienteering problem. For IPP, the reward function is adapted to consider samples along edges.
Genetic Algorithm The Genetic Algorithm is implemented according to [30]. Each valid path represents a chromosome, and a set of individuals (paths) is initialized. For each generation, a crossover and a mutation process are applied. After a number of generations, the path with the maximum MI is taken as the final solution.
Both the brute force and RG approaches suffer from high computation complexity. The brute force approach is only applied to the graph of Area One, and it returns a result within 72 hours only when the budget is below 40 meters.
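The stack-based enumeration used by the brute force baseline can be sketched in Python as follows. This is a sketch under stated assumptions: the graph is an adjacency dictionary `graph[u] = {v: cost}` with strictly positive costs (so the search terminates), `path_reward` is a stand-in callable for the path reward, and the function name `brute_force_search` is hypothetical. A branch ends when the terminal vertex is reached or when the next edge would exceed the budget.

```python
def brute_force_search(graph, start, terminal, budget, path_reward):
    """Enumerate budget-feasible paths from start to terminal; keep the best one."""
    best_path, best_reward = None, float("-inf")
    stack = [([start], 0.0)]  # (partial path, cost so far)
    while stack:
        path, cost = stack.pop()
        for v, c in graph[path[-1]].items():
            new_cost = cost + c
            if new_cost > budget:
                continue  # budget exhausted on this branch
            new_path = path + [v]
            if v == terminal:
                reward = path_reward(new_path)
                if reward > best_reward:
                    best_path, best_reward = new_path, reward
                continue  # branch ends once the terminal is encountered
            stack.append((new_path, new_cost))
    return best_path, best_reward
```

Because vertices may be revisited, the number of enumerated paths grows quickly with the budget, which is consistent with the long running times reported for this baseline.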
Both RL and GA are able to improve from trial paths, but do so differently. RL is a learning-based algorithm, while GA is an evolutionary algorithm. In RL, each trial path is an episode, and the agent learns to make decisions for path construction. In contrast, in GA, each individual represents a trial path and information is inherited through genetic operators such as crossover and mutation. For a fair comparison, we learn for 5000 episodes with RL in Area One. With GA, the population size is set to 100, and we run 50 generations. Thus, the total number of individuals (paths) involved is $100 \times 50 = 5000$. Due to randomness, we run five rounds of experiments independently and take the average for each budget setting. Similarly, for Area Two, the number of episodes for RL is set to 10000, and 100 generations are run for GA accordingly.
Meanwhile, for each area, we consider both a tour case and a non-tour case. The tour case means the agent must return to the start vertex, i.e., $v_s = v_t$. For the non-tour case, the terminal vertex is selected to be different from the start vertex.
It can be seen from Fig. 7 that RL achieves the best performance among all the algorithms. When the budget is under 40 meters, RL finds the optimal solution, since its results coincide with those from the brute force search. The rewards obtained by GA and RG increase monotonically with the budget, while the rewards from the greedy algorithm sometimes remain unchanged even when the budget increases.
The graph of Area Two contains 61 vertices, and the budgets are larger than those in Area One. Fig. 8 shows the results from RL, RG, GA and the greedy approach. RL outperforms the other algorithms for most of the budget settings. However, in the non-tour case in Fig. 8(b), the greedy approach achieves the best results for two budget settings (110, 120).
V-D Transfer Learning
In practice, the budget constraint depends on battery capacity. Meanwhile, the start and terminal vertices could change if the locations of the charging stations change. One natural question is whether it is possible to adapt the trained models when these constraints change. Specifically, the parameters of the Q-network can be initialized either randomly or from pre-trained models; the latter is known as transfer learning. In this section, experiments are carried out to demonstrate that the learned models are transferable when one of the constraints changes.
Different Budget Fig. 9 shows the effect of transfer learning when the budget changes. For each area, a base model is learned first. Then we change the budget and compare the learning curves of random initialization and of fine-tuning from the base model. It can be seen that in both areas, when the budget is close to that of the base model, the effect of transfer learning is clear since the model converges faster. When the budget is far from that of the base model, the advantage is less significant.
Varying Terminal Vertices Fig. 10 shows the transfer learning effect when the terminal vertex changes while the start vertex and budget stay the same. In both areas, transfer learning converges earlier than random initialization.
Varying Start Vertices Fig. 11 shows the result when the start vertex is changed. In both areas, we observe that when the new start vertex is close to the start vertex of the base model, transfer learning is advantageous. However, when the start vertex is far from that of the pre-trained model, random initialization performs better.
From the above comparison we can conclude that the learned models are transferable, particularly when the changes (in $B$, $v_s$ or $v_t$) are moderate. This can be attributed to the fact that the Q-network is learned from the transition tuples stored in the experience buffer. When the constraints are similar or close, the experience buffers tend to share many identical transition tuples. Thus, the model parameters are expected to be adapted using fewer transition tuples.
V-E Computation Complexity
The RG algorithm suffers from a high computation complexity [10], which grows with the number of vertices, the maximum time to evaluate the reward function on a given set of vertices, and the recursion depth. The Greedy algorithm relies on the TSP solver to generate paths, so its complexity depends on the complexity of the adopted TSP solver.
GA is an evolutionary algorithm, and its complexity is dominated by the chosen number of generations and population size. Similarly, RL is a learning-based algorithm, and its complexity depends on the number of training episodes and on the budget, since a larger budget allows more steps within each episode.
The execution time on an iMac desktop computer (4 GHz Intel Core i7, 16 GB RAM, without GPU) is shown in Fig. 12. In general, GA and the Greedy algorithm are fast and finish execution within a few minutes. Due to the training and optimization of the neural network, RL takes longer than RG on the small graph (Area One). However, the execution time of RG increases exponentially as the number of vertices and the budget increase. In contrast, the execution time of RL grows linearly with the budget and the number of iterations, and thus RL takes less time than RG in Area Two.
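The difference in growth rates can be made concrete with a rough operation-count model. This is a sketch under stated assumptions, not a measurement: it assumes that each recursion level of a recursive greedy search tries every middle vertex and every budget split and makes two recursive calls for each combination, and that an RL episode takes at most one step per unit of budget. The function names and the treatment of every sub-problem as receiving the full budget (an upper bound) are illustrative choices.

```python
def rg_reward_evaluations(n_vertices, budget, depth):
    """Upper bound on reward evaluations in a recursive-greedy style search.

    Assumed cost model: each level enumerates n_vertices * budget
    (middle vertex, budget split) pairs and recurses twice per pair.
    """
    if depth == 0:
        return 1  # base case: evaluate a single candidate segment
    return 2 * n_vertices * budget * rg_reward_evaluations(
        n_vertices, budget, depth - 1
    )


def rl_environment_steps(n_episodes, budget, min_edge_cost=1):
    """Upper bound on environment steps taken during RL training.

    Each episode can take at most budget // min_edge_cost moves.
    """
    return n_episodes * (budget // min_edge_cost)


if __name__ == "__main__":
    for budget in (10, 20, 40):
        print(
            budget,
            rg_reward_evaluations(n_vertices=30, budget=budget, depth=3),
            rl_environment_steps(n_episodes=5000, budget=budget),
        )
```

In this model, doubling the budget doubles the RL step count but multiplies the recursive count by two raised to the recursion depth, which matches the qualitative trend described above.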
VI Conclusion
In this paper, a Q-learning-based solution to IPP was presented. We proposed a novel exploration and exploitation strategy that is guided by the shortest path. Compared with the naive exploration strategy, it is more efficient and yields better solutions. Furthermore, the results are promising compared with state-of-the-art algorithms. We also demonstrated that the Q-network is transferable in the presence of moderate changes in the input parameters. Our future research direction is to investigate the IPP problem for multiple robots.
Footnotes
 In this work, we use the terms utility, reward and informativeness interchangeably.
 In this work we consider deterministic transitions.
 In graph theory, a path is defined as a sequence of vertices and edges without repeated vertices or edges. To be consistent with existing IPP literature, we allow repetition of vertices on a path, the equivalent of a walk in graph theory.
References
 (1989) Entropy expressions and their estimators for multivariate distributions. IEEE Transactions on Information Theory 35 (3), pp. 688–692. Cited by: §IIIB.
 (2002) Wireless sensor networks: a survey. Computer networks 38 (4), pp. 393–422. Cited by: §I.
 (2006) Concorde TSP solver. Cited by: §VC.
 (2017) Randomized algorithm for informative path planning with budget constraints. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 4997–5004. Cited by: §IIA.
 (2016) Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940. Cited by: §IIB, §IIB, §IVE.
 (2010) Informative path planning for an autonomous underwater vehicle. In 2010 IEEE International Conference on Robotics and Automation, pp. 4791–4796. Cited by: §I, §I, §IIA, §IIIB, §IIIB.
 (2012) Branch and bound for informative path planning. In 2012 IEEE International Conference on Robotics and Automation, pp. 2147–2154. Cited by: §I.
 (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §VA.
 (2013) Multi-robot informative path planning for active sensing of environmental phenomena: a tale of two algorithms. In Proceedings of the 2013 international conference on Autonomous agents and multiagent systems, pp. 7–14. Cited by: §IIIB.
 (2005) A recursive greedy algorithm for walks in directed graphs. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pp. 245–253. Cited by: §IIA, §IIIA, §IIIB, §VC, §VE.
 (2011) Data collection in wireless sensor networks with mobile elements: a survey. ACM Transactions on Sensor Networks (TOSN) 8 (1), pp. 7. Cited by: §I.
 (1987) The orienteering problem. Naval Research Logistics (NRL) 34 (3), pp. 307–318. Cited by: §IIIA.
 (2005) Near-optimal sensor placements in Gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pp. 265–272. Cited by: §I, §IIIB.
 (2016) Orienteering problem: a survey of recent variants, solution approaches and applications. European Journal of Operational Research 255 (2), pp. 315–332. Cited by: §IIIA.
 (1996) Reinforcement learning: a survey. Journal of artificial intelligence research 4, pp. 237–285. Cited by: §IIB.
 (2017) Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348–6358. Cited by: §IIB, §IIB.
 (2017) TuRF: fast data collection for fingerprint-based indoor localization. In 2017 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pp. 1–8. Cited by: §I, §VA.
 (2017) Informative planning and online learning with sparse Gaussian processes. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4292–4298. Cited by: §IIA, §IIIB.
 (2019) Active sensing for motion planning in uncertain environments via mutual information policies. The International Journal of Robotics Research 38 (2-3), pp. 146–161. Cited by: §IIA.
 (2007) Nonmyopic informative path planning in spatiotemporal models. In AAAI, Vol. 10, pp. 16–7. Cited by: §I, §I.
 (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §IIB, 2nd item, §IVE, §IVE, §VB.
 (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §IIB, §VA.
 (2007) Efficient planning of informative paths for multiple robots. In IJCAI, Vol. 7, pp. 2204–2211. Cited by: §I, §IIA, §IIIB.
 (2018) Reinforcement learning: an introduction. MIT press. Cited by: §IIB.
 (2016) Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §IIB, §VA.
 (2011) The orienteering problem: a survey. European Journal of Operational Research 209 (1), pp. 1–10. Cited by: §I, §IIIA.
 (2016) Planning with ants: efficient path planning with rapidly exploring random trees and ant colony optimization. International Journal of Advanced Robotic Systems 13 (5), pp. 1729881416664078. Cited by: §I, §IIA.
 (2007) Robot-assisted sensor network deployment and data collection. In 2007 International Symposium on Computational Intelligence in Robotics and Automation, pp. 467–472. Cited by: §I.
 (2015) Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581. Cited by: §IIB.
 (2019) Informative path planning for location fingerprint collection. IEEE Transactions on Network Science and Engineering. Cited by: §I, §IIA, §VC, §VC.
 (2006) Gaussian processes for machine learning. Vol. 2, MIT Press Cambridge, MA. Cited by: §I, §IIIB.
 (2012) WILL: wireless indoor localization without site survey. IEEE Transactions on Parallel and Distributed Systems 24 (4), pp. 839–848. Cited by: §I, §VA.
 (2012) Locating in fingerprint space: wireless indoor localization with little human intervention. In Proceedings of the 18th annual international conference on Mobile computing and networking, pp. 269–280. Cited by: §I, §VA.
 (2014) Correlated orienteering problem and its application to informative path planning for persistent monitoring tasks. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 342–349. Cited by: §IIIA.
 (2014) Informative mobility scheduling for mobile data collector in wireless sensor networks. In 2014 IEEE Global Communications Conference, pp. 5002–5007. Cited by: §I.