Finding the Bandit in a Graph: Sequential Search-and-Stop


Pierre Perrault
SequeL team, INRIA Lille — CMLA, ENS Paris-Saclay

Vianney Perchet
CMLA, ENS Paris-Saclay — Criteo Research

Michal Valko
SequeL team, INRIA Lille

We consider the problem where an agent wants to find a hidden object that is randomly located in some vertex of a directed acyclic graph (DAG) according to a fixed but possibly unknown distribution. The agent can only examine vertices whose in-neighbors have already been examined. In scheduling theory, this problem is denoted by $1|\text{prec}|\sum w_j C_j$. However, in this paper, we address the learning setting, where we allow the agent to stop before having found the object and restart searching on a new independent instance of the same problem. The goal is to maximize the total number of hidden objects found under a time constraint. The agent can thus skip an instance after realizing that it would spend too much time on it. Our contributions are both to search theory and multi-armed bandits. If the distribution is known, we provide a quasi-optimal greedy strategy with the help of known computationally efficient algorithms for solving $1|\text{prec}|\sum w_j C_j$ under some assumption on the DAG. If the distribution is unknown, we show how to sequentially learn it and, at the same time, act near-optimally in order to collect as many hidden objects as possible. We provide an algorithm, prove theoretical guarantees, and empirically show that it outperforms the naïve baseline.




Preprint. Work in progress.

1 Introduction

We study the problem where an object, called the hider, is randomly located in one vertex of a directed acyclic graph (DAG), and where an agent wants to find it by sequentially selecting vertices one by one and examining them at a (possibly random) cost. The agent has a strong constraint: its search must respect the precedence constraints imposed by the DAG, i.e., a vertex can be examined only if all its in-neighbors have already been examined. The goal of the agent is to minimize the expected total search cost incurred before finding the hider. This problem is a single-machine scheduling problem [Lín, 2015], where a set of jobs has to be processed on a single machine that can process at most one job at a time. Once the processing of a job is started, it must continue without interruption until it is complete. Each job $i$ has a cost $c_i$, representing its processing time, and a weight $w_i$, representing its importance (here, $w_i$ is the probability that vertex $i$ contains the hider). The goal is to find a schedule that minimizes $\sum_i w_i C_i$ (representing the expected cost suffered by the agent for finding the hider), where $C_i$ is the completion time of job $i$.
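As a minimal sketch of the scheduling objective above (not the paper's code; the instance data and function name are illustrative), the expected cost of a processing order is the sum of each job's weight times its completion time:

```python
# Sketch: evaluating the scheduling objective sum_i w_i * C_i for a
# given processing order, where C_i is the completion time of job i.
# The jobs, costs, and weights below are illustrative.

def weighted_completion_cost(order, costs, weights):
    """Expected cost of finding the hider when jobs are processed in `order`."""
    elapsed = 0.0   # running completion time
    total = 0.0
    for i in order:
        elapsed += costs[i]            # C_i: time when job i finishes
        total += weights[i] * elapsed  # job i contributes w_i * C_i
    return total

costs = {"a": 1.0, "b": 2.0, "c": 1.0}
weights = {"a": 0.2, "b": 0.5, "c": 0.3}  # hider location probabilities

# Processing high weight/cost ratio jobs early lowers the expected cost.
print(weighted_completion_cost(["c", "b", "a"], costs, weights))  # 2.6
print(weighted_completion_cost(["a", "b", "c"], costs, weights))  # 2.9
```

The second order delays the likely hider locations and therefore pays more in expectation.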

The standard scheduling notation [Graham et al., 1979] denotes this problem as $1|\text{prec}|\sum w_j C_j$, and it was already shown to be NP-hard [Lawler, 1978, Lenstra and Rinnooy Kan, 1978]. On the positive side, several polynomial-time 2-approximations exist, depending on the assumption we take on the DAG (see, e.g., the recent survey of Prot and Bellenguez-Morineau, 2018). For instance, a 2-approximation can be obtained without any additional assumption. On the other hand, there is an exact polynomial-time algorithm when the partially ordered set (poset) defined by the DAG is a series-parallel order [Lawler, 1978]. More generally, when the poset has fractional dimension at most $f$, there is a polynomial-time approximation with ratio $2 - 2/f$ [Ambühl et al., 2011]. In this work, we assume the DAG is such that an exact polynomial-time algorithm is available; for example, we can take a two-dimensional poset [Ambühl and Mastrolilli, 2009].

The problem is also well known in search theory [Fokkink et al., 2016], one of the original disciplines of Operations Research, and has been extensively studied following the initial results of Stone [1976]. Here, the search space is a DAG. We thus fall within the network search setting [Kikuta and Ruckle, 1994, Gal, 2001, Evans and Bishop, 2013]. When the DAG is an out-tree, the problem reduces to the expanding search problem introduced by Alpern and Lidbetter [2013].

The case of an unknown hider distribution is usually studied within the field of search games, i.e., a zero-sum game where the agent picks the search and plays against the hider, with the search cost as payoff [Alpern and Gal, 2006, Alpern et al., 2013, Hohzaki, 2016]. In our work, we deal with an unknown hider distribution by extending the stochastic setting to the sequential case, where at each round $t$, the agent faces a new, independent instance of the problem. The challenge is the need to learn the distribution through repeated interactions with the environment. At each instance, the agent has to perform a search based on the instances observed during the previous rounds. Furthermore, contrary to the typical search setting, the agent can additionally decide whether it wishes to abandon the search on the current instance and start a new one in the next round, even if the hider was not found. The goal of the agent is to collect as many hiders as possible using a fixed budget $B$. This may be particularly useful when the remaining vertices have large costs and it would not be cost-effective to examine them.

As a result, the hider may not be found in each round and the agent has to make a trade-off between exhaustive searches, which lead to a good estimation (exploration), and efficient searches, which lead to a good benefit/cost ratio (exploitation). The sequential exploration-exploitation trade-off is well studied in multi-armed bandits [Cesa-Bianchi and Lugosi, 2006] and has been applied to many fields, including mechanism design [Mohri and Munoz, 2014], search advertising [Tran-Thanh et al., 2014], and personalized recommendation [Li et al., 2010]. Our setting can thus be seen as an instance of stochastic combinatorial semi-bandits [Cesa-Bianchi and Lugosi, 2006, 2012, Gai et al., 2012, Gopalan et al., 2014, Kveton et al., 2015, Combes et al., 2015, Wang and Chen, 2017, Valko, 2016]. For this reason, we refer to a vertex as an arm. We shall see, however, that this particular semi-bandit problem is challenging, first because the offline objective is not easy to optimize, and second because it does not directly satisfy the standard key assumptions, such as monotonicity and non-negativity, when using optimistic estimates.

There are several motivations behind this setting. One example is the decision-theoretic troubleshooting problem of giving a diagnosis for several devices that have a malfunctioning component and come sequentially to the agent. Precedence constraints arise naturally in many troubleshooting applications, where there are restrictions imposed on the agent on the order of component tests, see, e.g., Jensen et al., 2001. Moreover, allowing the agent to stop gives a new alternative to the so-called service call [Heckerman et al., 1995, Jensen et al., 2001] for dealing with non-cost-effective vertices: instead of giving a high cost to an extra action that automatically finds the fault in the device, we give it a zero cost, but do not reward such a diagnostic failure. This way, we do not need to estimate any service-call cost, which could be practical, for example, when a new device is sent to the user if the diagnosis fails, with a cost that depends on a disutility for the user: loss of personal data, device reconfiguration, etc. Maximizing the number of hiders found is then analogous to maximizing the number of successful diagnoses.

Another motivation comes from online advertisement. There are several different actions that might generate a conversion from a user, such as sending one or several emails, displaying one or several ads on a website, buying keywords on search engines, etc. We assume that some precedence constraints are imposed between actions and that a conversion will occur if some sequence of actions is performed, for instance, first displaying an ad, then sending a first email, and finally a second one. As a consequence, the conversion is “hidden” in the precedence constraints and the agent aims at finding it. However, for some users, finding the correct sequence might be too expensive, and it might be more interesting to abandon that specific user to focus on more promising ones.

Finally, we can mention several related settings, such as stochastic probing [Gupta and Nagarajan, 2013], which differs in that each arm can contain a hider independently of the others, and the framework of optimal discovery [Bubeck et al., 2013], widely studied in machine learning.

Our contributions

In this paper, we first introduce the sequential search-and-stop problem in detail. One of our main contributions is a stationary offline policy (i.e., an algorithm that solves the problem when the distribution is known), for which we prove approximation guarantees and which we adapt in order to fit the online problem. In particular, we prove it is quasi-optimal, and use the exact algorithm for $1|\text{prec}|\sum w_j C_j$ to prove its computational efficiency. Finally, we provide a solution when the distribution is unknown to the agent, based on the combinatorial upper confidence bounds (CUCB) algorithm [Chen et al., 2016] and the UCB-variance (UCB-V) algorithm of Audibert et al. [2009]. Dealing with variance estimates allows us to sharpen the bound on the expected cumulative regret, compared to the simple use of CUCB.

In the following, we typeset vectors in bold and indicate components with indices, i.e., $v_i$ is the $i$-th component of $\mathbf{v}$. Furthermore, we indicate randomness by underlining the relevant symbols [Hemelrijk, 1966].

2 Background

In this section, we formalize the setting we consider. We denote a finite DAG by $G = (V, E)$, where $V$ is its set of vertices, or arms, and $E$ is its set of arcs. For more generality, we assume arm costs are random and mutually independent. We denote by $c_i \ge 0$ the cost of arm $i$, with expectation $c_i^*$. We also assume that one specific vertex, called the hider, is chosen at random, independently from the costs, according to some fixed categorical (or multivariate Bernoulli) distribution parametrized by a vector $\mathbf{w}^*$ lying in the simplex of $[0,1]^V$, i.e., with $\sum_{i \in V} w_i^* = 1$. We recall that $\mathbf{X} \sim \text{Categorical}(\mathbf{w}^*)$ if $\mathbf{X}$ is a one-hot vector, with $X_i = 1$ with probability $w_i^*$ and $X_j = 0$ for all $j \ne i$. Let $\mathcal{D}$ denote the joint distribution of the costs and of $\mathbf{X}$.

For an (ordered) subset $S$ of $V$, we denote by $\bar{S}$ the complement of $S$ in $V$, and by $|S|$ its cardinality. Moreover, if $S = (s_1, \dots, s_{|S|})$ is ordered, we let $S_j \triangleq (s_1, \dots, s_j)$ for $j \le |S|$. Let $G_S$ be the sub-DAG of $G$ induced by $S$, i.e., the DAG with $S$ as vertex set, and where $(i, j)$ is an arc if and only if $(i, j) \in E$ with $i, j \in S$. We call the support of an ordered arm set the corresponding non-ordered set. If $i, j \in S$, we write $i \prec_S j$ (resp., $i \preceq_S j$) if $i$ appears strictly before $j$ in $S$ (resp., before $j$ or at the same position). In addition, we let $F(S, \mathbf{X}) \triangleq 1$ if there is $i \in S$ such that $X_i = 1$, and $F(S, \mathbf{X}) \triangleq 0$ otherwise. For two disjoint ordered arm sets $S$ and $S'$, we let $SS'$ be the concatenation of $S$ and $S'$.

We assume that $G$ allows a polynomial-time algorithm (w.r.t. $|V|$), denoted Schedul, for the problem $1|\text{prec}|\sum w_j C_j$ with precedence constraints given by $G$, i.e., for minimizing $u(S) \triangleq \sum_{j=1}^{|V|} w_{s_j}^* \sum_{k=1}^{j} c_{s_k}^*$ over the linear extensions $S = (s_1, \dots, s_{|V|})$ of the poset defined by $G$ (that we call $G$-linear extensions). $u(S)$ represents the expected cost to pay for finding the hider: searching arm $s_1$ first and paying $c_{s_1}^*$, then paying $c_{s_2}^*$ if $X_{s_1} = 0$, and so on until the hider is found (i.e., the last arm searched is the arm $s_j$ such that $X_{s_j} = 1$).

We define a search in $G$ as an ordering $S = (s_1, \dots, s_k)$ of $k$ different arms such that for all $j \le k$, the predecessors of $s_j$ in $G$ are included in $\{s_1, \dots, s_{j-1}\}$ (a search is a prefix of a $G$-linear extension). We denote by $\mathcal{S}(G)$ (or simply $\mathcal{S}$) the set of searches in $G$. The support of a search is called an initial set.
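The prefix-closure property of a search can be checked mechanically. The sketch below (illustrative names, not from the paper) validates that every arm in an ordered sequence appears only after all its in-neighbors:

```python
# Sketch: checking that an ordered arm sequence is a valid search,
# i.e., every arm appears only after all of its in-neighbors in the DAG.

def is_search(order, in_neighbors):
    """`in_neighbors[v]` is the set of predecessors of v in the DAG."""
    seen = set()
    for v in order:
        if not in_neighbors.get(v, set()) <= seen:
            return False  # some predecessor of v has not been examined yet
        seen.add(v)
    return True

# DAG with arcs a -> b, a -> c, b -> d
preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b"}}

print(is_search(["a", "b", "d"], preds))  # True: a valid (partial) search
print(is_search(["b", "a"], preds))       # False: b precedes its predecessor a
```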

2.1 Protocol

The search problem we focus on is sequential. We consider an agent, also called a learning algorithm or a policy, that knows $G$ but does not know $\mathcal{D}$. At each round $t$, an independent sample $(\mathbf{c}_t, \mathbf{X}_t)$ is drawn from $\mathcal{D}$. The aim of the agent is to search for the hider (i.e., the arm $i$ such that $X_{t,i} = 1$) by constructing a search on $G$. Since the hider may be located at some arm that does not belong to the search, it is not necessarily found in each round.

The search to be used by the agent can be computed based on all its previous observations, which we refer to as its history, i.e., all the costs of explored vertices (and only those) and all the locations where the hider has been found or not. Obviously, the search cannot depend on non-observed quantities. For example, the agent may estimate $\mathbf{w}^*$ and $\mathbf{c}^*$ in order to choose the search accordingly. Each time an arm $i$ is searched, the feedback $c_{t,i}$ and $X_{t,i}$ is given to the agent. Since several arms can be searched over one round, this problem falls into the family of stochastic combinatorial semi-bandits. The agent can keep searching until its budget $B$ runs out; $B$ is a positive number and does not need to be known to the agent in advance. The goal of the agent is to maximize the overall number of hiders found under the budget constraint.

The setting described above allows the agent to modify its behavior depending on the feedback it receives during the current round. However, by the independence assumption between the random variables, the only feedback susceptible to modify the search the agent chose at the beginning of a round $t$ is the observation of $X_{t,i} = 1$ for some arm $i$. Even if nothing prevents the agent from continuing to “search” some arms after having seen such an event, it would not increase the number of hiders found (there is no more hider to find), while it would still decrease the remaining budget, so it would serve a purely exploratory purpose. Knowing this, an oracle policy that knows $\mathcal{D}$ exactly thus selects a search $S_t$ at the beginning of round $t$, and then follows it until either $X_{t,i} = 1$ is observed or $S_t$ is exhausted (i.e., no arms are left in $S_t$). Therefore, the performed search is in fact a (random) prefix of $S_t$, truncated at the hider. We thus restrict ourselves to an agent that selects a search $S_t$ at the beginning of each round and then performs this truncated version of $S_t$ over the round.
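The truncation described above can be sketched as follows (a toy stand-in with illustrative costs; the hider is represented by the name of its arm rather than a one-hot vector):

```python
# Sketch: performing a selected search during one round. The search
# stops as soon as the hider is found, so the cost actually paid is the
# cost of the prefix up to and including the hider's position.

def perform_search(search, hider, costs):
    """Return (cost paid, hider found?) when following `search`."""
    paid = 0.0
    for arm in search:
        paid += costs[arm]
        if arm == hider:
            return paid, True  # stop: nothing left to gain this round
    return paid, False         # search exhausted, hider located elsewhere

costs = {"a": 1.0, "b": 2.0, "c": 4.0}
print(perform_search(["a", "b", "c"], "b", costs))  # (3.0, True)
print(perform_search(["a", "b"], "c", costs))       # (3.0, False)
```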

Following Stone [1976], we refer to our problem as sequential search-and-stop. We now detail the overall objective for this problem: the agent wants to follow a policy $\pi$ that selects a search $S_t$ at round $t$ (this choice can be random, as it may depend on the past observations as well as on possible randomness from the algorithm), while maximizing the expected overall reward, i.e., the expected number of hiders found before round $\tau$, where $\tau$ is the random round at which the remaining budget becomes negative. We evaluate the performance of a policy using the expected cumulative budgeted regret with respect to $F_B^*$, the maximum value of the expected overall reward (among all possible oracle policies that know $\mathbf{w}^*$ and $\mathbf{c}^*$): the regret of $\pi$ is defined as the difference between $F_B^*$ and the expected overall reward of $\pi$.

Example 1

One may wonder if there exist cases where it is interesting for the agent to stop the search early. Consider, for instance, the simplest non-trivial case with two arms and no precedence constraint. Say the costs are deterministically $1$ and $c > 2$, and the location of the hider is chosen uniformly at random. An optimal search always samples the arm with cost $1$ first. If it also samples the other arm, then the hider is found with an expected cost of $1 + c/2$. However, if the agent always stops the search after the first arm and reinitializes on a new instance by doing the same, the number of rounds needed to find a hider is geometric with parameter $1/2$, so the overall expected cost to find one hider is $2$.

Therefore, stopping searches, even when the distribution of the hider is known, can be better than always trying to find it (here, as soon as $c > 2$).
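This computation can be checked numerically. The sketch below instantiates the example with illustrative values (costs $1$ and $6$, hider uniform on two arms), under the reading of the example given above:

```python
# Numeric check of the stopping phenomenon with illustrative values:
# always stopping after the cheap arm and restarting beats completing
# the search whenever the second cost exceeds 2.

def full_search_cost(c_cheap, c_dear, p=0.5):
    # pay c_cheap; with prob 1-p also pay c_dear before finding the hider
    return p * c_cheap + (1 - p) * (c_cheap + c_dear)

def stop_and_restart_cost(c_cheap, p=0.5):
    # geometric number of rounds with success prob p, paying c_cheap each
    return c_cheap / p

print(full_search_cost(1.0, 6.0))   # 4.0
print(stop_and_restart_cost(1.0))   # 2.0: stopping early wins here
```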

3 Offline oracle

In this section, we provide an algorithm for sequential search-and-stop when the parameters $\mathbf{w}^*$ and $\mathbf{c}^*$ are given to the agent. We show that a simple stationary policy (i.e., one that selects the same search at each round) can obtain almost the same expected overall reward as $F_B^*$. We will denote by Oracle an algorithm that takes $G$, $\mathbf{w}$, and $\mathbf{c}$ as input and outputs a search. This offline oracle will eventually be used by the agent for the online problem, i.e., when the parameters are unknown. Indeed, at round $t$, the agent can approximate the optimal search by the output of Oracle$(G, \mathbf{w}, \mathbf{c})$, where $(\mathbf{w}, \mathbf{c})$ can be any guesses/estimates of the true parameters used by the agent. Importantly, depending on the policy followed by the agent, $\mathbf{w}$ may not stay in the simplex anymore. We will thus build Oracle such that an “acceptable” output is given for any input $(\mathbf{w}, \mathbf{c})$.

3.1 Objective design

A standard paradigm for designing a stationary approximation of the offline problem in budgeted multi-armed bandits is the following: the selected search has to minimize the ratio between the expected cost paid and the expected reward gained over a single round. We thus define, for a search $S = (s_1, \dots, s_k)$,

$$\rho(S) \triangleq \frac{\sum_{j=1}^{k} c_{s_j}^* \big(1 - \sum_{l < j} w_{s_l}^*\big)}{\sum_{j=1}^{k} w_{s_j}^*},$$

that is equal to the expected cost paid while performing $S$ (each arm's cost is paid only if the hider was not found earlier) divided by the probability that $S$ finds the hider. Notice that we allow $\rho(S)$ to be equal to $+\infty$ (when the denominator is zero). We use the convention $\rho(\emptyset) = +\infty$, because there is no interest in choosing an empty search for a round. We define the optimal value of $\rho$ on $\mathcal{S}$ as $\rho^* \triangleq \min_{S \in \mathcal{S}} \rho(S)$. In Proposition 1, we provide guarantees on the stationary policy: if $\pi^*$ is the offline policy selecting a minimizer of $\rho$ at each round, then its expected overall reward is within an additive constant (independent of $B$) of $F_B^*$.

Proposition 1

A proof is given in Appendix B and follows the one provided for Lemma 1 of Xia et al. [2016]. Intuitively, Proposition 1 states that the optimal overall expected reward that can be gained (i.e., the maximum expected number of hiders found) is approximately $B/\rho^*$. This is quite intuitive, since this quantity is actually the ratio between the overall budget and the minimum expected cost paid to find a single hider. Indeed, one can consider the related problem of minimizing the overall expected cost paid, over several rounds, to find a single hider. It can be expressed as an infinite-time-horizon Markov decision process (MDP) with action space $\mathcal{S}$ and two states: whether the hider is found (which is the terminal state) or not. The goal is to choose a strategy minimizing the expected cost paid until the first round at which the hider is found. The Bellman equation is

$$V = \min_{S \in \mathcal{S}} \big\{ c(S) + \big(1 - W(S)\big) V \big\},$$

where $c(S)$ denotes the expected cost paid during a round in which $S$ is performed and $W(S) \triangleq \sum_{i \in S} w_i^*$, from which we can deduce that there exists an optimal stationary strategy [Sutton and Barto, 1998] selecting the same search at every round. Therefore, minimizing $\rho(S) = c(S)/W(S)$ gives the optimal value $V = \rho^*$ of the MDP.
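The fixed point of this Bellman recursion can be checked numerically on a toy instance. The sketch below (illustrative candidate searches with assumed costs $c(S)$ and success probabilities $W(S)$) iterates $V \leftarrow \min_S \{c(S) + (1 - W(S))V\}$ and compares the limit to the best cost/probability ratio:

```python
# Sketch: value iteration on the two-state MDP. The fixed point of
# V = min_S { c(S) + (1 - W(S)) V } equals min_S c(S) / W(S).
# Candidate searches, costs, and probabilities are illustrative.

candidates = [
    {"cost": 1.0, "prob": 0.5},  # short search: cheap, often fails
    {"cost": 3.5, "prob": 1.0},  # full search: always succeeds
]

V = 0.0
for _ in range(200):  # value iteration until convergence
    V = min(s["cost"] + (1.0 - s["prob"]) * V for s in candidates)

best_ratio = min(s["cost"] / s["prob"] for s in candidates)
print(round(V, 6), best_ratio)  # both 2.0: stopping early is optimal here
```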

As we already mentioned, Oracle aims at taking arbitrary inputs $(\mathbf{w}, \mathbf{c})$. The first straightforward way to do so is to consider the ratio $\rho$ defined above with $(\mathbf{w}, \mathbf{c})$ plugged in place of $(\mathbf{w}^*, \mathbf{c}^*)$. However, notice that with this definition, the numerator could take negative values (when $\mathbf{w}$ is not in the simplex, some factors $1 - \sum_{l < j} w_{s_l}$ may be negative), which is not desired, because the agent would then be enticed to search arms with a high cost. We thus need to design a non-negative extension of $\rho$. One way is to replace the whole numerator by its positive part; another is to truncate each factor $1 - \sum_{l < j} w_{s_l}$ at zero. There is a significant advantage to the second way, even if it is less natural than the first one: the resulting surrogate $\tilde\rho$ satisfies $\tilde\rho(S) \le \rho(S)$ for every search $S$ whenever $\mathbf{w} \ge \mathbf{w}^*$ and $\mathbf{c} \le \mathbf{c}^*$ componentwise (notice this is not exactly a monotonicity property, because we compare to the single point $(\mathbf{w}^*, \mathbf{c}^*)$). This property is known to be useful for the analysis of many stochastic combinatorial semi-bandit algorithms (see, e.g., Chen et al., 2016). Thus, we choose for Oracle the minimization of the surrogate $\tilde\rho$.

3.2 Algorithm and theoretical guarantees

We now provide Oracle in Algorithm 1 and claim in Theorem 1 that it minimizes $\tilde\rho$ over $\mathcal{S}$. We provide a proof in Appendix A. Notice that Oracle needs to call the polynomial-time algorithm Schedul, which minimizes the scheduling objective over $G$-linear extensions. Then, Algorithm 1 only computes the best-value index of a list whose size is the number of arms, which takes linear time. To give an intuition, the output follows the ordering given by Schedul and stops at the point where it becomes more interesting to start a fresh new instance.

  Input: $\mathbf{w}$ and $\mathbf{c}$.
  $S = (s_1, \dots, s_{|V|}) \leftarrow$ Schedul$(G, \mathbf{w}, \mathbf{c})$.
  $k \leftarrow \arg\min_{j \le |V|} \tilde\rho\big((s_1, \dots, s_j)\big)$ (ties broken arbitrarily).
  Output: the search $(s_1, \dots, s_k)$.
Algorithm 1 Oracle
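The best-prefix step can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it assumes weights already in the simplex (so no truncation of the surrogate is needed), takes the scheduling order as given, and evaluates the cost/probability ratio of each prefix:

```python
# Sketch of the Oracle idea: given the ordering returned by a scheduling
# routine, evaluate the expected-cost / success-probability ratio of
# every prefix and keep the best one. `order`, `w`, `c` are illustrative.

def best_prefix(order, w, c):
    """Return the prefix of `order` minimizing expected cost / success prob."""
    best, best_ratio = [], float("inf")
    exp_cost, prob = 0.0, 0.0
    for j, arm in enumerate(order):
        exp_cost += c[arm] * (1.0 - prob)  # paid only if hider not yet found
        prob += w[arm]
        ratio = exp_cost / prob if prob > 0 else float("inf")
        if ratio < best_ratio:
            best, best_ratio = order[: j + 1], ratio
    return best, best_ratio

w = {"a": 0.5, "b": 0.1, "c": 0.4}
c = {"a": 1.0, "b": 1.0, "c": 8.0}
print(best_prefix(["a", "b", "c"], w, c))  # (['a'], 2.0)
```

Here the expensive arm "c" is dropped: the prefix consisting of "a" alone has the best ratio, so the agent stops after it and restarts on a fresh instance.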

For every input $(\mathbf{w}, \mathbf{c})$, Algorithm 1 outputs a search minimizing $\tilde\rho$ over $\mathcal{S}$.

Theorem 1

4 Online search-and-stop

We consider in this section the additional challenge where the distribution $\mathcal{D}$ is unknown and the agent must learn it while minimizing the regret over sampling policies $\pi$, where $B$ is a fixed budget. Recall that a policy selects a search $S_t$ at the beginning of round $t$, using the previous observations, and then performs the search over the round. We will consider the problem as a variant of stochastic combinatorial bandits [Gai et al., 2012]. The feedback received by an agent at round $t$ is random, as usual in bandits, because it depends on $\mathbf{c}_t$. However, in our case, it also depends on $\mathbf{X}_t$, and thus it is not entirely determined by the selected search $S_t$. More precisely, $c_{t,i}$ is observed only for the arms $i$ actually examined during round $t$. Notice that since $\mathbf{X}_t$ is a one-hot vector, the agent can always deduce the value of $X_{t,i}$ for all $i \in S_t$ (once the hider is found, the remaining components of $S_t$ are necessarily zero). As a consequence, we will maintain two types of counters for all arms $i$:

$$N_{t,i}^c \triangleq \sum_{t'=1}^{t} \mathbb{1}\{c_{t',i} \text{ is observed}\} \quad \text{and} \quad N_{t,i}^w \triangleq \sum_{t'=1}^{t} \mathbb{1}\{X_{t',i} \text{ is observed}\}. \tag{1}$$

The corresponding empirical averages using these counters are then defined as

$$\bar c_{t,i} \triangleq \frac{1}{N_{t,i}^c} \sum_{t' \le t:\, c_{t',i} \text{ observed}} c_{t',i} \quad \text{and} \quad \bar w_{t,i} \triangleq \frac{1}{N_{t,i}^w} \sum_{t' \le t:\, X_{t',i} \text{ observed}} X_{t',i}. \tag{2}$$
We propose an approach, similar to UCB-V of Audibert et al. [2009] and based on CUCB by Chen et al. [2016], called CUCBV, that uses a variance estimate of each $X_i$ in addition to its empirical average. Notice that the variance of $X_i$ for an arm $i$ is $w_i^*(1 - w_i^*)$. In addition, since $X_i$ is binary, the empirical variance of arm $i$ after $t$ rounds is $\bar w_{t,i}(1 - \bar w_{t,i})$. For every round $t$ and every arm $i$, we define the optimistic estimate

$$w_{t,i}^{\text{UCB}} \triangleq \bar w_{t-1,i} + \sqrt{\frac{2\,\bar w_{t-1,i}(1 - \bar w_{t-1,i})\,\log t}{N_{t-1,i}^w}} + \frac{3 \log t}{N_{t-1,i}^w},$$

where we choose the exploration factor to be $\log t$ (notice that we could take any $\zeta \log t$ with $\zeta > 1$, as shown by Audibert et al., 2009). We provide the policy that we consider in Algorithm 2.
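A Bernstein-style index of this shape can be sketched as follows (constants follow the UCB-V form recalled above, but the function name and instance values are illustrative):

```python
import math

# Sketch of a Bernstein-style (UCB-V) index: empirical mean, plus a
# variance-scaled slow term, plus a fast O(log t / N) term.

def ucbv_index(mean, var, n_obs, t):
    if n_obs == 0:
        return float("inf")  # unobserved arms are explored first
    explo = math.log(t)      # exploration factor
    return mean + math.sqrt(2.0 * var * explo / n_obs) + 3.0 * explo / n_obs

# With w of order 1/n, the variance w(1 - w) is small, so the bonus
# shrinks much faster than a Hoeffding bonus of width sqrt(explo / n_obs).
m = 0.1
print(ucbv_index(m, m * (1 - m), n_obs=100, t=1000))
```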

  Input: $G$ and the budget $B$.
  Initialization: Set all counters $N_{0,i}^c$ and $N_{0,i}^w$ to $0$ and empirical averages $\bar c_{0,i}$ and $\bar w_{0,i}$ to $0$, for all $i \in V$.
  for $t = 1, 2, \dots$ do
     select $S_t$ given by Oracle applied to the optimistic estimates (upper confidence bounds $w_{t,i}^{\text{UCB}}$ on the weights, lower confidence bounds on the costs).
     perform $S_t$ until the hider is found or $S_t$ is exhausted.
     collect the feedback and update counters and empirical averages according to Eq. 1 and 2.
  end for
Algorithm 2 Combinatorial upper confidence bounds with variance estimates
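An end-to-end toy run of this protocol can be sketched as follows. Everything here is a simplified stand-in for the paper's components: there are no precedence constraints, costs are known and deterministic, the order is a greedy ratio sort rather than the full Oracle, and the stopping rule is omitted (the whole order is always followed):

```python
import math, random

# Toy CUCBV-style loop: maintain counts and hider-found totals, build
# Bernstein-style optimistic weights, order arms greedily by optimistic
# weight / cost, and perform the search until the hider is found.
random.seed(0)
ARMS = ["a", "b", "c"]
TRUE_W = {"a": 0.6, "b": 0.3, "c": 0.1}  # hidden hider distribution
COST = {"a": 1.0, "b": 1.0, "c": 5.0}    # deterministic, known costs

n = {a: 0 for a in ARMS}   # observation counts for X_i
s = {a: 0 for a in ARMS}   # number of times the hider was at arm i
found, budget, t = 0, 200.0, 0
while budget > 0:
    t += 1
    ucb = {}
    for a in ARMS:
        if n[a] == 0:
            ucb[a] = 1.0   # force initial exploration
        else:
            m = s[a] / n[a]
            bonus = (math.sqrt(2 * m * (1 - m) * math.log(t) / n[a])
                     + 3 * math.log(t) / n[a])
            ucb[a] = min(1.0, m + bonus)
    order = sorted(ARMS, key=lambda a: ucb[a] / COST[a], reverse=True)
    hider = random.choices(ARMS, weights=[TRUE_W[a] for a in ARMS])[0]
    for a in order:        # perform the search, stopping at the hider
        budget -= COST[a]
        n[a] += 1
        if a == hider:
            s[a] += 1
            found += 1
            break
print("hiders found:", found)
```

Once the estimates concentrate, the cheap high-probability arms are searched first, so most of the budget is spent productively.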

4.1 Analysis

Notice that since an arm is pulled (and thus its cost is revealed to the agent) with some probability depending on $S_t$ and $\mathbf{X}_t$ over round $t$, we fall into the setting of probabilistically triggered arms w.r.t. costs, described by Chen et al. [2016] and Wang and Chen [2017]. Thus, we can use these prior results for our purpose. However, the main difficulty in our setting is that one also needs to deal with the probabilities $w_i^*$, which the agent actually observes for every arm in $S_t$ (either because it actually pulls arm $i$, or because it deduces the value from the other pulls of round $t$). In particular, if we follow a prior analysis of Chen et al. [2016] and Wang and Chen [2017], the double sum in the definition of the expected cost leads to a regret bound that is quite large. Indeed, assuming that all costs are deterministically equal to $1$, if one suffers an error of $\varepsilon$ when approximating each $w_i^*$, then the global error can be as large as $n^2 \varepsilon$, with $n$ the number of arms, contrary to just $n \varepsilon$ for the approximation error w.r.t. costs, which is more common in classical combinatorial semi-bandits. Thus, here we rather combine this work with variance estimates of the hider locations. Often, this does not provide a significant improvement over UCB in terms of a regret bound, but since in our case the variance of $X_i$ is of order $w_i^*$, and the $w_i^*$ sum to one, the gain is non-negligible (the error is then scaled by the standard deviation, of order $\sqrt{w_i^*}$, giving a smaller global error; we therefore recover the corresponding factor in Theorem 2). For any search $S$, we let $\Delta(S)$ denote the local regret of selecting the sub-optimal search $S$ at a round, measuring the excess of its ratio over the optimal one. We can first provide the following bound, showing that the expected regret is of the same order as the sum of all local regrets: for any policy that selects search $S_t$ at round $t$, the expected regret is bounded, up to constant terms, by the expected sum of the local regrets $\Delta(S_t)$.

Proposition 2

The proof is given in Appendix C. The analysis is concluded for the expected regret of CUCBV by bounding the leading term of the previous proposition, i.e., the expected sum of local regrets, giving the following Theorem 2. The first bound is distribution-dependent and is characterized by the gaps $\Delta$ and the number of arms; its main term scales logarithmically w.r.t. $B$. The second bound is distribution-free, holding for any value of the parameters, with a main term scaling as the square root of $B$. We only provide the leading terms, neglecting the other ones and removing the universal multiplicative constants. Explicit statements can be found in Appendix D.

Theorem 2

The proof is given in Appendix D. We recall that the main difficulty comes from the estimation of the hider distribution, not of the costs. In particular, the proof uses triggering probability groups and the reverse amortization trick of Wang and Chen [2017] for dealing with costs. For the hider probabilities, only the second trick is necessary (when a search is selected, the feedback $X_{t,i}$ for every arm in it is received with probability 1, so triggering probability groups are not useful). We use it not only to deal with the slowly concentrating confidence term for the estimates of each arm, as Wang and Chen [2017], but also to completely amortize the additional fast-rate confidence term due to variance estimation coming from the use of Bernstein’s inequality.

5 Experiments

In this section, we present experiments for sequential search-and-stop. We compare our CUCBV with two baselines. The first one is CUCB, i.e., the selected search is given by Oracle applied to Hoeffding-type indices, where the variance-dependent term of the confidence bonus is replaced by its worst-case value, so the bonus scales as $\sqrt{\log t / N_{t-1,i}^w}$.

The second is known as $\varepsilon$-greedy: with probability $1 - \varepsilon$, it acts greedily (exploitation), meaning that the agent selects the search given by Oracle applied to the empirical averages, and with probability $\varepsilon$ (exploration), the agent selects a random complete search (we uniformly select the next arm among the available arms and continue the search until the hider is found). Notice that in the exploration step, we could potentially choose a non-complete search by, for example, also taking a random stopping time. We chose not to do that, since $\varepsilon$-greedy would explore less this way in its exploration step. In the experiments, we vary the exploration probability over three values: a base value $\varepsilon$, its half, and its double. We run simulations for all the algorithms without precedence constraints, with parameters $\mathbf{w}^*$ and $\mathbf{c}^*$ chosen such that the optimal search is not necessarily complete. We plot the regret curves for CUCB, CUCBV, and $\varepsilon$-greedy for the different values of $\varepsilon$, all averaged over the simulations. In Figure 1, the instance on the right has twice as many arms as the one on the left. In the first case (Figure 1, left), we notice that CUCB and CUCBV give similar results, both better than the $\varepsilon$-greedy approach, and this difference appears already for small budget values $B$. In the second case (Figure 1, right), since the number of arms considered doubles, we need to consider a larger budget to notice the difference between the algorithms. We note that since the optimal search is not complete, the difference between CUCB and CUCBV is significant: the former explores too much and is even worse than $\varepsilon$-greedy, while the latter makes use of the low variance of each $X_i$ (of order $1/n$ on average) to constrain its exploration and reach a better regret rate. In both cases, we found that $\varepsilon$-greedy with the base value of $\varepsilon$ gives a lower regret than with either twice or half that value. Additional experiments are given in Appendix F.













Figure 1: Regret for learning expanding search-and-stop. Left: the smaller instance. Right: the instance with twice as many arms.

6 Conclusion and future work

We presented the sequential search-and-stop problem and provided a stationary offline solution. We gave theoretical guarantees on its optimality and proved that it is computationally efficient. We also considered the learning extension of the problem, where the distributions of the hider and the costs are not known. We gave CUCBV, an upper-confidence-bound approach tailored to our case, and provided regret guarantees with respect to the optimal policy. There are several possible extensions. We could consider several hiders rather than just one. Another would be to explore Thompson sampling [Chapelle and Li, 2011, Agrawal and Goyal, 2012, Komiyama et al., 2015] in the learning case. However, the choice of the prior to use is not straightforward: one can assign a Beta prior to each arm, or consider a Dirichlet prior on the whole arm set. The latter is interesting because a draw from this prior is in the simplex. The main drawback is the difficulty of updating such a prior to get the posterior, because sometimes the hider is not found and the one-hot vector is not observed entirely.


V. Perchet has benefited from the support of the ANR (grant n.ANR-13-JS01-0004-01), of the FMJH Program Gaspard Monge in optimization and operations research (supported in part by EDF) and from the Labex LMH. The research presented was also supported by European CHIST-ERA project DELTA, French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, Inria and Otto-von-Guericke-Universität Magdeburg associated-team north-european project Allocate, and French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003).


Appendix A Proof of Theorem 1

Theorem 1 (restated)  For every input $(\mathbf{w}, \mathbf{c})$, Algorithm 1 outputs a search minimizing $\tilde\rho$ over $\mathcal{S}$.

Here, we may lighten the notation, keeping in mind that our results are valid for all inputs $(\mathbf{w}, \mathbf{c})$. To prove Theorem 1, we first define the concept of density, well known in scheduling and search theory.

Definition 1 (Density)

The density is the function $\sigma$ defined on nonempty arm sets by $\sigma(S) \triangleq \big(\sum_{i \in S} w_i\big) / \big(\sum_{i \in S} c_i\big)$, and $\sigma(\emptyset) \triangleq 0$.

The density of $S$ can be understood as the quality/price ratio of that set of arms: the quality is the overall probability of finding the hider in it, while the price is the total cost to fully explore it. Without precedence constraints, the so-called Smith’s rule [Smith, 1956] states that a linear order minimizes the scheduling objective over linear orders (i.e., permutations of $V$) if and only if it sorts the arms by non-increasing ratio $w_i/c_i$ (one can see that swapping two adjacent arms changes the objective by a quantity whose sign is determined by comparing their ratios). Sidney [1975] generalized this principle to any precedence constraint with the concept of Sidney decomposition. Recall that an initial set is the support of a search.
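Smith's rule can be verified by brute force on a small instance. The sketch below (illustrative data; the helper names are not from the paper) compares the ratio-sorted order against all permutations:

```python
import itertools

# Sketch: without precedence constraints, sorting arms by non-increasing
# w_i / c_i minimizes sum_i w_i * C_i. Verified by brute force here.

def objective(order, w, c):
    elapsed, total = 0.0, 0.0
    for i in order:
        elapsed += c[i]          # completion time of arm i
        total += w[i] * elapsed
    return total

w = {"a": 0.2, "b": 0.5, "c": 0.3}
c = {"a": 1.0, "b": 2.0, "c": 1.0}

smith = sorted(w, key=lambda i: w[i] / c[i], reverse=True)  # Smith's rule
best = min(itertools.permutations(w), key=lambda o: objective(o, w, c))
print(objective(smith, w, c) == objective(best, w, c))  # True
```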

Definition 2 (Sidney decomposition)

A Sidney decomposition is an ordered partition $(V_1, \dots, V_m)$ of $V$ such that for all $j \le m$, $V_j$ is a maximum-density initial set of the sub-DAG induced by $V_j \cup \dots \cup V_m$.

Notice that a Sidney decomposition defines a more refined poset on $V$, with the extra constraint that an element of $V_j$ must be processed before those of $V_{j'}$ for $j < j'$. Any $G$-linear extension that is also a linear extension of this refined poset is said to be consistent with the Sidney decomposition. The following theorem was proved by Sidney [1975]:

Theorem 3 (Sidney, 1975)

Every minimizer of the scheduling objective over $G$-linear extensions is consistent with some Sidney decomposition. Moreover, for every Sidney decomposition $(V_1, \dots, V_m)$, there is a minimizer of the scheduling objective over $G$-linear extensions consistent with $(V_1, \dots, V_m)$.

Notice that Theorem 3 does not provide a full characterization of the minimizers of the scheduling objective over $G$-linear extensions, but only a necessary condition. Nothing is stated about how to choose the ordering inside each $V_j$, and this highly depends on the structure of $G$ [Lawler, 1978, Ambühl and Mastrolilli, 2009, Ambühl et al., 2011]. We are now ready to prove Theorem 1, thanks to Lemma 1, whose proof is given in Appendix A.1.


For any Sidney decomposition $(V_1, \dots, V_m)$, there exists $j \le m$ and a search with support $V_1 \cup \dots \cup V_j$ that minimizes $\tilde\rho$.

Lemma 1

Proof of Theorem 1  We know from the first statement of Theorem 3 that the $G$-linear extension computed in Algorithm 1 is consistent with some Sidney decomposition $(V_1, \dots, V_m)$. By Lemma 1, there exist $j \le m$ and a minimizer $S$ of $\tilde\rho$ with support $V_1 \cup \dots \cup V_j$. Let $S'$ be the prefix of the computed linear extension with the same support. Let us prove that $S'$ is also a minimizer of $\tilde\rho$ by showing $\tilde\rho(S') \le \tilde\rho(S)$, thereby concluding the proof, since Algorithm 1 outputs the best prefix. Since $S$ and $S'$ have the same support and the computed extension minimizes the scheduling objective among orderings consistent with the decomposition, the expected cost of $S'$ is at most that of $S$, and because $\tilde\rho$ is increasing in the expected cost for a fixed support, we have $\tilde\rho(S') \le \tilde\rho(S)$.  

The proof of Lemma 1 also uses Sidney’s Theorem 3, but this time its second statement. However, although it provides a crucial analysis, for a fixed support, concerning the order to choose for minimizing the scheduling objective and therefore $\tilde\rho$, nothing is said about which support to choose. Thus, to prove Lemma 1, we also need the following Proposition 3, which gives the key support property satisfied by the minimizers of $\tilde\rho$.

Proposition 3 (Support property)

If with , then

Proof  If , We thus suppose . Since


Suppose that . If , then


Thus, we have,

Finally, if , by (3), we have that