A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption

\name Peter L. Bartlett \email peter@berkeley.edu
\addr University of California, Berkeley, USA
\AND
\name Victor Gabillon \email victor.gabillon@qut.edu.au
\addr Queensland University of Technology - ACEMS, Australia
\AND
\name Michal Valko \email michal.valko@inria.fr
\addr SequeL team, INRIA Lille - Nord Europe, France
Abstract

We study the problem of optimizing a function under a budgeted number of evaluations. We only assume that the function is locally smooth around one of its global optima. The difficulty of optimization is measured in terms of 1) the amount of noise $b$ of the function evaluation and 2) the local smoothness, $d$, of the function. A smaller $d$ results in a smaller optimization error. We propose a new, simple, and parameter-free approach. First, for all values of $b$ and $d$, this approach recovers at least the state-of-the-art regret guarantees. Second, our approach additionally obtains these results while being agnostic to the values of both $b$ and $d$. This leads to the first algorithm that naturally adapts to an unknown range of noise $b$ and leads to significant improvements in a moderate and low-noise regime. Third, our approach also obtains a remarkable improvement over the state-of-the-art SOO algorithm when the noise is very low, which includes the case of optimization under deterministic feedback ($b = 0$). There, under our minimal local smoothness assumption, this improvement is of exponential magnitude and holds for a class of functions that covers the vast majority of functions that practitioners optimize ($d = 0$). We show that our algorithmic improvement is also borne out in the numerical experiments, where we empirically show faster convergence on common benchmark functions.


Editors: Satyen Kale and Aurélien Garivier

Keywords: optimization, tree search, deterministic feedback, stochastic feedback

1 Introduction

In budgeted function optimization, a learner optimizes a function $f : \mathcal{X} \to \mathbb{R}$ having access to a number of evaluations limited by $n$. At each of the $n$ evaluations (or rounds) $t$, the learner picks an element $x_t \in \mathcal{X}$ and observes a real number $r_t = f(x_t) + \varepsilon_t$, where $\varepsilon_t$ is the noise. Based on $\varepsilon_t$, we distinguish two feedback cases:

Deterministic feedback

The evaluations are noiseless, that is, $\varepsilon_t = 0$ and $r_t = f(x_t)$. Please refer to the work by de Freitas et al. (2012) for a motivation, many applications, and references on the importance of this deterministic setting.

Stochastic feedback

The evaluations are perturbed by a noise of range $b$:\footnote{Alternatively, we can turn the boundedness assumption into a sub-Gaussianity assumption equipped with a variance parameter equivalent to our range $b$.} At any round $t$, $\varepsilon_t$ is a random variable, assumed to be independent of the noise at previous rounds, such that

\begin{equation}
r_t = f(x_t) + \varepsilon_t, \quad \text{with} \quad \mathbb{E}\left[\varepsilon_t \,\middle|\, x_t\right] = 0 \quad \text{and} \quad |\varepsilon_t| \le b. \tag{1}
\end{equation}

The objective of the learner is to return an element $x(n)$ with the largest possible value after the $n$ evaluations; $x(n)$ can be different from the last evaluated element $x_n$. More precisely, the performance of the algorithm is measured by the loss (or simple regret)

\[ r_n \triangleq \sup_{x \in \mathcal{X}} f(x) - f(x(n)). \]

We consider the case where each evaluation is costly. Therefore, we minimize $r_n$ as a function of $n$. We assume that there exists at least one point $x^\star \in \mathcal{X}$ such that $f(x^\star) = \sup_{x \in \mathcal{X}} f(x)$.
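To make the setting concrete, here is a minimal sketch of the evaluation protocol and the simple regret in Python. The learner interface, the random-search baseline, and the test function are our own illustrative choices, not part of the paper.

```python
import random

def run_protocol(f, learner, n, b=0.0, seed=0):
    """Budgeted optimization: n noisy evaluations, then one recommendation.

    b is the noise range; b = 0 gives deterministic feedback.
    """
    rng = random.Random(seed)
    history = []
    for t in range(n):
        x_t = learner.pick(history)            # learner chooses a point
        eps = rng.uniform(-b, b)               # zero-mean noise of range b
        history.append((x_t, f(x_t) + eps))    # learner only sees f(x_t) + eps
    return learner.recommend(history)          # x(n): may differ from x_n

def simple_regret(f, x_n, f_star):
    # loss after the budget is spent: sup f - f(x(n))
    return f_star - f(x_n)

class RandomSearch:
    """Trivial baseline learner over X = [0, 1]."""
    def __init__(self, seed=1):
        self.rng = random.Random(seed)
    def pick(self, history):
        return self.rng.random()
    def recommend(self, history):
        return max(history, key=lambda xr: xr[1])[0]

f = lambda x: 1.0 - (x - 0.3) ** 2              # maximum 1 at x* = 0.3
x_hat = run_protocol(f, RandomSearch(), n=200)
assert simple_regret(f, x_hat, 1.0) >= 0.0
```

The tree-based algorithms discussed below replace `RandomSearch` with hierarchical exploration strategies.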

Prior work

Among the large body of work on optimization, we focus on algorithms that perform well under minimal assumptions as well as minimal knowledge about the function. Relying on minimal assumptions means that we target functions that are particularly hard to optimize. For instance, we may not have access to the gradients of the function, gradients might not be well defined, or the function may not be continuous. While some prior works assume a global smoothness of the function (Pintér, 2013; Strongin and Sergeyev, 2013; Hansen and Walster, 2003; Kearfott, 2013), another line of research assumes only a weak/local smoothness around one global maximum (Kleinberg et al., 2008; Bubeck et al., 2011a). However, within this latter group, some algorithms require the knowledge of the local smoothness, such as HOO (Bubeck et al., 2011a), Zooming (Kleinberg et al., 2008), or DOO (Munos, 2011). Among the works relying on an unknown local smoothness, SOO (Munos, 2011; Kawaguchi et al., 2016) represents the state-of-the-art for the deterministic feedback. For the stochastic feedback, StoSOO (Valko et al., 2013) extends SOO for a limited class of functions. POO (Grill et al., 2015) provides more general results. We classify the most related algorithms in the following table.

\begin{tabular}{l|l|l}
smoothness & deterministic feedback & stochastic feedback \\
\hline
known & DOO & Zooming, HOO \\
unknown & DiRect, SOO, SequOOL & StoSOO, POO, StroquOOL \\
\end{tabular}

Note that, for more specific assumptions on the smoothness, some works study optimization without the knowledge of smoothness: DiRect (Jones et al., 1993) and others (Slivkins, 2011; Bubeck et al., 2011b; Malherbe and Vayatis, 2017) tackle Lipschitz optimization.

Finally, there are algorithms that instead of simple regret, optimize cumulative regret, for example, HOO (Bubeck et al., 2011a) or HCT (Azar et al., 2014). However, none of them adapt to the unknown smoothness and compared to them, the algorithms for simple regret that are able to do that, such as POO or our StroquOOL, need to explore significantly more, which negatively impacts their cumulative regret (Grill et al., 2015; Locatelli and Carpentier, 2018).

Existing tools

Partitioning and near-optimality dimension: As in most of the previously mentioned work, the search domain $\mathcal{X}$ is partitioned into cells at different scales (depths), i.e., at a deeper depth, the cells are smaller but still cover all of $\mathcal{X}$. The objective of many algorithms is to explore the value of $f$ in the cells of the partition and determine, at the deepest depth possible, in which cell a global maximum of the function lies. The notion of near-optimality dimension $d$ characterizes the complexity of the optimization task. We adopt the definition of near-optimality dimension given recently by Grill et al. (2015) that, unlike Bubeck et al. (2011a), Valko et al. (2013), Munos (2011), and Azar et al. (2014), avoids topological notions and does not artificially attempt to separate the difficulty of the optimization from the partitioning. For each depth $h$, it simply counts the number of near-optimal cells $\mathcal{N}_h$, cells whose value is close to $\sup_{x \in \mathcal{X}} f(x)$, and determines how this number evolves with the depth $h$. The smaller $d$, the more accurate the optimization should be.

New challenges

Adaptations to different data complexities: As did Bubeck and Slivkins (2012), Seldin and Slivkins (2014), and De Rooij et al. (2014) in other contexts, we design algorithms that demonstrate near-optimal behavior under data-generating processes of different nature, obtaining the best of all these possible worlds. In this paper, we consider the two following data complexities for which we bring new improved adaptation.

  • [leftmargin=.5cm]

  • near-optimality dimension $d = 0$: In this case, the number of near-optimal cells is simply bounded by a constant $C$ that does not depend on the depth $h$. As shown by Valko et al. (2013), if the function is lower- and upper-bounded by two polynomial envelopes of the same order around a global optimum, then $d = 0$. As discussed in the book by Munos (2014, Section 4.2.2), $d = 0$ covers the vast majority of functions that practitioners optimize, and the functions with $d > 0$ given as examples in prior work (Bubeck et al., 2011b; Grill et al., 2015; Valko et al., 2013; Munos, 2011) are carefully engineered. Therefore, the case of $d = 0$ is of practical importance. However, even with deterministic feedback, the case of $d = 0$ with unknown smoothness has not been known to have a learner with a near-optimal guarantee. In this paper, we also provide that. Our approach not only adapts very well to the case $d = 0$, it also provides an exponential improvement over the state of the art for the simple-regret rate.

  • low or moderate noise regime: When facing a noisy feedback, most algorithms assume that the noise is of a known predefined range, often hard-coded in their use of upper confidence bounds. Therefore, they cannot take advantage of low-noise scenarios. Our algorithms have a regret that scales with the range of the noise $b$, without a prior knowledge of $b$. Furthermore, our algorithms ultimately recover the new improved rate of the deterministic feedback suggested in the preceding case ($b \to 0$).

Main results

Improved theoretical results and empirical performance: We consider the optimization under an unknown local smoothness. We design two algorithms, SequOOL for the deterministic case in Section 3, and StroquOOL for the stochastic one in Section 4.

  • [leftmargin=.5cm]

  • SequOOL is the first algorithm to obtain a loss decreasing as $\rho^{n/\overline{\log}\, n}$ when $d = 0$, under such minimal assumptions, with deterministic feedback. The previously known SOO (Munos, 2011) is only proved to achieve a loss of order $\rho^{\sqrt{n}}$. Therefore, SequOOL achieves, up to log factors, the result of DOO that knows the smoothness. Note that Kawaguchi et al. (2016) designed a new version of SOO, called LOGO, that gives more flexibility in exploring more local scales, but it was still only shown to achieve a loss of order $\rho^{\sqrt{n}}$, despite the introduction of a new parameter. Achieving an exponentially decreasing regret had previously only been achieved in settings with more assumptions (de Freitas et al., 2012; Malherbe and Vayatis, 2017; Kawaguchi et al., 2015). For example, de Freitas et al. (2012) achieve an exponentially decreasing regret under several assumptions, for example that the function is sampled from a Gaussian process with a kernel that is four times differentiable along the diagonal. The consequence of our results is that to achieve an exponentially decreasing regret, none of these strong assumptions is necessary.

  • StroquOOL recovers, in the stochastic feedback, up to log factors, the results of POO, for the same assumption. However, as discussed later, StroquOOL is a simpler approach than POO with also an associated simpler analysis.

  • StroquOOL adapts naturally to different noise ranges, i.e., to the various values of $b$.

  • StroquOOL obtains the best of both worlds in the sense that StroquOOL also obtains, up to log factors, the new optimal rates reached by SequOOL in the deterministic case. StroquOOL obtains this result without being aware a priori of the nature of the data, at the price of only an additional log factor. Therefore, if we neglect the additional log factor, we can just have a single algorithm, StroquOOL, that performs well in both the deterministic and the stochastic case, without the knowledge of the smoothness in either one of them.

  • In the numerical experiments, StroquOOL naturally adapts to lower noise. SequOOL obtains an exponential regret decay when $d = 0$ on common benchmark functions.

Algorithmic contributions and originality of the proofs

Why does it work? Both SequOOL and StroquOOL are simple and parameter-free algorithms. The analysis is also simple and self-contained and does not need to rely on results of other algorithms knowing the smoothness. We now explain the reason behind this combined simplicity and efficiency.

Both SequOOL and StroquOOL are based on a new core idea: the search for the optimum should progress strictly sequentially, from an exploration of shallow depths (with large cells) to deeper depths (small and localized cells). This is different from the standard approach in SOO, StoSOO, and the numerous extensions that SOO has inspired (Busoniu et al., 2013; Wang et al., 2014; Al-Dujaili and Suresh, 2018; Qian and Yu, 2016; Kasim and Norreys, 2016; Derbel and Preux, 2015; Preux et al., 2014; Buşoniu and Morărescu, 2014; Kawaguchi et al., 2016). We have identified a bottleneck in SOO (Munos, 2011) and its extensions, which open all depths simultaneously (their key lemma). In general, we show that an improved exploration of the shallow depths is beneficial for the deeper depths and therefore, we always complete the exploration of depth $h$ before going to depth $h+1$. As a result, we design a more sequential approach that simplifies our key lemma to the point of being natural and straightforward.

This desired simplicity is also achieved by being the first to adequately leverage the reduced and natural set of assumptions introduced in the POO paper (Grill et al., 2015). This adequate and simple leverage should not conceal the fact that our local smoothness assumption is minimal and already far weaker than global Lipschitzness. Moreover, this leveraging was absent in the analysis of POO, which additionally relies on the 40-page proof of HOO (see Shang et al., 2018, for a detailed discussion). Our proofs are succinct,\footnote{For completeness, the proof is even redundantly written twice, for StroquOOL and for SequOOL.} while obtaining a performance improvement (the case $d = 0$) and a new adaptation (to the noise range $b$). To obtain these, in an original way, our theorems are now based on solving a transcendental equation with the Lambert $W$ function. For StroquOOL, a careful discrimination of the parameters of the equation leads to optimal rates both in the deterministic and the stochastic case.

Intriguingly, the amount of evaluations allocated to each depth follows a Zipf law (Powers, 1998), that is, each depth level $h$ is simply pulled a number of times inversely proportional to its depth index $h$. This is a simple but not a straightforward idea. It provides a parameter-free method to explore the depths without knowing the bound on the number of near-optimal cells per depth ($C$ when $d = 0$) and to obtain a maximal optimal depth of order $n/\log n$. A Zipf law has been used by Audibert et al. (2010) and Abbasi-Yadkori et al. (2018) in pure-exploration bandit problems, but without any notion of depth in the search. In this paper, we introduce the Zipf law to tree-search algorithms.
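The Zipf-law schedule can be sketched in a few lines; the function name and the budget value are illustrative, but the allocation rule (depth $h$ receives about $h_{\max}/h$ openings, with $h_{\max}$ of order $n/\overline{\log}\, n$) follows the description in the text.

```python
def zipf_allocation(n):
    """Zipf-law opening schedule: depth h gets ~ h_max / h openings.

    h_max is chosen as n / H(n), with H the harmonic number, so that the
    total number of openings stays within the budget n (a sketch of the
    allocation described in the text; variable names are ours).
    """
    harmonic = lambda m: sum(1.0 / t for t in range(1, m + 1))
    h_max = int(n / harmonic(n))
    per_depth = {h: h_max // h for h in range(1, h_max + 1)}
    return h_max, per_depth

h_max, per_depth = zipf_allocation(1000)
assert sum(per_depth.values()) <= 1000        # budget respected
assert per_depth[1] == h_max                  # shallow depths get the most
```

Note how the harmonic-number normalization is exactly what keeps the total $\sum_h \lfloor h_{\max}/h \rfloor$ within the budget.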

Another novelty is that of not using upper bounds in StroquOOL (unlike StoSOO, HCT, HOO, POO), which results in the contribution of removing the need to know the noise amplitude.

2 Partition, tree, assumption, and near-optimality dimension

Partitioning

The hierarchical partitioning $\mathcal{P} = \{\mathcal{P}_{h,i}\}$ we consider is similar to the ones introduced in prior work (Munos, 2011; Valko et al., 2013): For any depth $h \ge 0$ in the tree representation, the set of cells (or nodes) $\{\mathcal{P}_{h,i}\}_{1 \le i \le I_h}$ forms a partition of $\mathcal{X}$, where $I_h$ is the number of cells at depth $h$. At depth $0$, the root of the tree, there is a single cell $\mathcal{P}_{0,1} = \mathcal{X}$. A cell of depth $h$ is split into children subcells of depth $h+1$. As Grill et al. (2015), our work focuses on a notion of near-optimality dimension that does not directly relate the smoothness property of $f$ to a specific metric but directly to the hierarchical partitioning $\mathcal{P}$. Indeed, an interesting fundamental question is to determine a good characterization of the difficulty of the optimization for an algorithm that uses a given hierarchical partitioning of the space as its input (see Grill et al., 2015, for a detailed discussion). Given a global maximum $x^\star$ of $f$, $i^\star_h$ denotes the index of the unique cell of depth $h$ containing $x^\star$, i.e., such that $x^\star \in \mathcal{P}_{h,i^\star_h}$. We follow the work by Grill et al. (2015) and state a single assumption on both the partitioning $\mathcal{P}$ and the function $f$.
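As an illustration, a standard binary partition of $\mathcal{X} = [0, 1]$ with these properties can be written as follows; the representative-point and indexing choices are our own.

```python
def cell(h, i):
    """Cell P[h, i] of the standard binary partition of X = [0, 1].

    Depth h has 2**h cells; cell (h, i) is split into children
    (h+1, 2i) and (h+1, 2i+1).  (Illustrative construction only.)
    """
    width = 0.5 ** h
    return (i * width, (i + 1) * width)

def representative(h, i):
    # a predefined representative element: the cell's midpoint
    lo, hi = cell(h, i)
    return (lo + hi) / 2.0

def cell_index_of(x, h):
    # index i*_h of the unique depth-h cell containing x
    return min(int(x * 2 ** h), 2 ** h - 1)

assert cell(0, 0) == (0.0, 1.0)               # the root covers all of X
assert cell_index_of(0.3, 2) == 1             # 0.3 lies in [0.25, 0.5)
```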

Assumption 1

For any global optimum $x^\star$, there exist $\nu > 0$ and $\rho \in (0, 1)$ such that, for all $h \ge 0$,

\[ \forall x \in \mathcal{P}_{h,i^\star_h}, \quad f(x) \ge f(x^\star) - \nu\rho^{h}. \]

{definition}

For any $\nu > 0$ and $\rho \in (0, 1)$, the near-optimality dimension\footnote{Grill et al. (2015) define $d$ with the constant 2 instead of 3. Using 3 eases the exposition of our results.} $d(\nu, C, \rho)$ of $f$ with respect to the partitioning $\mathcal{P}$ and with associated constant $C$, is

\[ d(\nu, C, \rho) \triangleq \inf\left\{ d' \in \mathbb{R}^{+} : \forall h \ge 0, \ \mathcal{N}_h\!\left(3\nu\rho^{h}\right) \le C\rho^{-d'h} \right\}\!, \]

where $\mathcal{N}_h(\varepsilon)$ is the number of cells $\mathcal{P}_{h,i}$ of depth $h$ such that $\sup_{x \in \mathcal{P}_{h,i}} f(x) \ge f(x^\star) - \varepsilon$.

Tree-based learner

Tree-based exploration, or tree search, is a classical approach that has been widely applied to optimization as well as bandits or planning (Kocsis and Szepesvári, 2006; Coquelin and Munos, 2007; Hren and Munos, 2008); see Munos (2014) for a survey. At each round, the learner selects a cell $\mathcal{P}_{h,i}$ containing a predefined representative element $x_{h,i} \in \mathcal{P}_{h,i}$ and asks for its evaluation. We denote its value $f_{h,i} \triangleq f(x_{h,i})$. $T_{h,i}$ denotes the total number of evaluations allocated by the learner to the cell $\mathcal{P}_{h,i}$. Our learners collect the evaluations of $f$ and organize them in a tree structure $\mathcal{T}$ that is simply a subset of $\mathcal{P}$. We define, specially for the noisy case, the estimated value of the cell $\mathcal{P}_{h,i}$ as $\widehat{f}_{h,i}$: given the $T_{h,i}$ evaluations of the cell, $\widehat{f}_{h,i}$ is the empirical average of the rewards obtained at this cell. We say that the learner opens a cell $\mathcal{P}_{h,i}$ with $m$ evaluations if it asks for $m$ evaluations from each of the children cells of cell $\mathcal{P}_{h,i}$. In the deterministic feedback, $\widehat{f}_{h,i} = f_{h,i}$. For the sake of simplicity, the bounds reported in this paper are in terms of the total number of openings $n$, instead of evaluations. The number of function evaluations is upper bounded by $Kn$, where $K$ is the maximum number of children cells of any cell in $\mathcal{P}$.

The Lambert $W$ function: Our results use the Lambert $W$ function. Solving, for the variable $y$, the equation $x = y e^{y}$ gives $y = W(x)$. $W$ is multivalued for $x < 0$. However, in this paper, we consider $x \ge 0$ and $W(x) \ge 0$, referred to as the standard $W$, which cannot be expressed in terms of elementary functions. Yet, we have $\log x - \log\log x \le W(x) \le \log x$ for $x \ge e$ (Hoorfar and Hassani, 2008). $W$ has applications in physics and applied mathematics (Corless et al., 1996).
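A minimal numerical sketch of the standard branch of $W$, solved by Newton's method on $we^{w} = x$; the starting point and tolerance are our own choices.

```python
import math

def lambert_w(x, tol=1e-12):
    """Standard (principal) branch of Lambert W for x >= 0.

    Solves w * exp(w) = x by Newton's method (illustrative sketch).
    """
    w = math.log(1.0 + x)                     # a rough starting point
    for _ in range(100):
        e = math.exp(w)
        step = (w * e - x) / (e * (1.0 + w))  # Newton step for w*e^w - x
        w -= step
        if abs(step) < tol:
            break
    return w

x = 50.0
w = lambert_w(x)
assert abs(w * math.exp(w) - x) < 1e-9        # defining equation holds
assert math.log(x) - math.log(math.log(x)) <= w <= math.log(x)
```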

Finally, for any positive integer $m$, let $\overline{\log}\, m \triangleq \sum_{t=1}^{m} \frac{1}{t}$ denote the $m$-th harmonic number, for which $\log m \le \overline{\log}\, m \le 1 + \log m$. $\log_a$ denotes the logarithm in base $a$. Without a subscript, $\log$ is the natural logarithm, in base $e$.

3 Adaptive deterministic optimization and improved rate

3.1 The SequOOL algorithm

  Parameters: $n$.
  Initialization: Open $\mathcal{P}_{0,1}$. Set $h_{\max} \triangleq \lfloor n / \overline{\log}\, n \rfloor$.
  For $h = 1$ to $h_{\max}$: Open the $\lfloor h_{\max}/h \rfloor$ cells $\mathcal{P}_{h,i}$ of depth $h$ with largest values $\widehat{f}_{h,i}$.
  Output $x(n) \triangleq \arg\max_{x_{h,i} : \mathcal{P}_{h,i} \in \mathcal{T}} \widehat{f}_{h,i}$.

Figure 1: The SequOOL Algorithm

The Sequential Optimistic Optimization aLgorithm, SequOOL, is described in Figure 1. SequOOL explores the depths sequentially, one by one, going deeper and deeper with a decreasing number of cells opened per depth: $\lfloor h_{\max}/h \rfloor$ openings at depth $h$. $h_{\max}$ is the maximal depth that is opened. The analysis of SequOOL shows that it is relevant to set $h_{\max} \triangleq \lfloor n / \overline{\log}\, n \rfloor$, where $\overline{\log}\, m$ is the $m$-th harmonic number, $\overline{\log}\, m \triangleq \sum_{t=1}^{m} \frac{1}{t}$, with $\log m \le \overline{\log}\, m \le 1 + \log m$ for any positive integer $m$. SequOOL returns the representative element of the evaluated cell with the highest value, $x(n) = \arg\max_{x_{h,i} : \mathcal{P}_{h,i} \in \mathcal{T}} \widehat{f}_{h,i}$. The budget is stated in terms of openings to preserve the simplicity of the bounds. SequOOL uses no more openings than $n$ as

\[ \sum_{h=1}^{h_{\max}} \left\lfloor \frac{h_{\max}}{h} \right\rfloor \le h_{\max}\, \overline{\log}\, h_{\max} \le \left\lfloor \frac{n}{\overline{\log}\, n} \right\rfloor \overline{\log}\, n \le n. \]
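The loop above can be sketched as follows on $\mathcal{X} = [0, 1]$ with the binary partition. This is our own illustrative re-implementation, not the authors' code; it counts openings per depth as described but does not enforce the exact evaluation budget.

```python
def sequool(f, n):
    """A sketch of SequOOL on X = [0, 1] with the standard binary partition.

    Cells are (depth, index) pairs with midpoint representatives; the
    schedule opens floor(h_max / h) cells at each depth h.
    """
    harmonic = lambda m: sum(1.0 / t for t in range(1, m + 1))
    h_max = int(n / harmonic(n))

    mid = lambda h, i: (i + 0.5) / 2 ** h
    values = {(0, 0): f(mid(0, 0))}             # evaluated cells, by (h, i)

    def open_cell(h, i):
        for child in (2 * i, 2 * i + 1):        # evaluate both children
            values[(h + 1, child)] = f(mid(h + 1, child))

    open_cell(0, 0)                             # initialization: open the root
    for h in range(1, h_max + 1):
        at_depth = [(v, i) for (d, i), v in values.items() if d == h]
        at_depth.sort(reverse=True)             # best observed values first
        for _, i in at_depth[: h_max // h]:
            open_cell(h, i)

    (h, i), _ = max(values.items(), key=lambda kv: kv[1])
    return mid(h, i)

x_hat = sequool(lambda x: 1.0 - abs(x - 1 / 3), n=200)
assert abs(x_hat - 1 / 3) < 1e-3
```

On this unimodal test function the cell containing the optimum is always among the top-valued cells, so the sequential dive reaches depth $h_{\max}$ and the error shrinks geometrically with the budget.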

3.2 Analysis of SequOOL

For any global optimum $x^\star$, let $\perp_h$ be the depth of the deepest opened node containing $x^\star$ at the end of the opening of depth $h$ by SequOOL (an iteration of the for cycle). Note that $h \mapsto \perp_h$ is increasing. The proofs of the following statements are given in Appendix A. {lemma}[] For any global optimum $x^\star$ with associated $(\nu, \rho)$ as defined in Assumption 1, for any depth $h \ge 1$, if $\lfloor h_{\max}/h \rfloor \ge C\rho^{-dh}$, we have $\perp_h = h$, while $\perp_0 = 0$. Lemma 3.2 states that as long as SequOOL opens more cells at depth $h$ than the number of near-optimal cells at depth $h$, the cell containing $x^\star$ is opened at depth $h$. {theorem}[] Let $W$ be the standard Lambert function (see Section 2). For any function $f$ and one of its global optima $x^\star$ with associated $(\nu, \rho)$, and near-optimality dimension $d = d(\nu, C, \rho)$, we have, after $n$ rounds, the simple regret of SequOOL bounded by

\[ r_n \le 3\nu\rho^{\frac{1}{d\log(1/\rho)} W\left(\frac{h_{\max}\, d\log(1/\rho)}{C}\right)} \ \text{if } d > 0, \qquad r_n \le 3\nu\rho^{h_{\max}/C} \ \text{if } d = 0. \]

For more readability, Corollary 3.2 uses a lower bound on $W$, $W(x) \ge \log x - \log\log x$ for $x \ge e$ (Hoorfar and Hassani, 2008). {corollary} If $d > 0$, the assumptions in Theorem 3.2 hold, and $\frac{h_{\max}\, d\log(1/\rho)}{C} \ge e$, then

\[ r_n \le 3\nu \left( \frac{C}{h_{\max}} \cdot \frac{\log\!\left( h_{\max}\, d\log(1/\rho)/C \right)}{d\log(1/\rho)} \right)^{1/d}\!\!. \]

3.3 Discussion for the deterministic feedback

Comparison with SOO

SOO and SequOOL both address deterministic optimization without knowledge of the smoothness. The regret guarantees of SequOOL are an improvement over SOO. While for $d > 0$ both algorithms achieve a regret of order $\widetilde{\mathcal{O}}(n^{-1/d})$, when $d = 0$, the regret of SOO is of order $\rho^{\sqrt{n}}$ while the regret of SequOOL is of order $\rho^{n/\overline{\log}\, n}$, which is a significant improvement. As discussed in the introduction and by Valko et al. (2013, Section 5), the case $d = 0$ is very common. As pointed out by Munos (2011, Corollary 2), SOO has to actually know whether $d = 0$ or not to set the maximum depth of the tree as a parameter for SOO. SequOOL is fully adaptive, does not need to know any of this, and actually gets a better rate.\footnote{A similar behavior is also achieved by combining two SOO algorithms, running half of the samples for $d = 0$ and half for $d > 0$. However, SequOOL does this naturally and gets a better rate when $d = 0$.} The conceptual difference with SOO is that SequOOL is sequential: for a given depth $h$, SequOOL first opens cells at depth $h$ and then at depth $h+1$ and so on, without coming back to lower depths. Indeed, an opening at depth $h+1$ is based on the values observed while opening at depth $h$. Therefore, it is natural and less wasteful to do the openings in a sequential order. Moreover, SequOOL is more conservative as it opens the lower depths more, while SOO opens every depth equally. However, from the depth perspective, SequOOL is more aggressive as it opens cells at depths up to $h_{\max}$, of order $n/\log n$, while SOO stops at depth $\sqrt{n}$.

Comparison with DOO

Contrarily to SequOOL, DOO knows the smoothness of the function. However, this knowledge only improves the logarithmic factor in the current upper bound. When $d > 0$, DOO achieves a regret of order $n^{-1/d}$; when $d = 0$, the loss is of order $\rho^{n/C}$.

Lower bounds

As discussed by Munos (2014), for $d = 0$, DOO matches the lower bound and it is even comparable to the lower bound for concave functions. While SOO was not matching the bound of DOO, with our result, we now know that, up to a log factor, it is possible to achieve the same performance as DOO without the knowledge of the smoothness.

4 Noisy optimization with adaptation to low noise

4.1 The StroquOOL algorithm

  Parameters: $n$.
  Initialization: Open $\mathcal{P}_{0,1}$ with $2^{p_{\max}}$ evaluations.
  For $h = 1$ to $h_{\max}$: [Exploration]
    For $p = p_{\max}$ down to $0$:
      Open with $2^{p}$ evaluations the $\lfloor h_{\max}/(h 2^{p}) \rfloor$ non-opened cells $\mathcal{P}_{h,i}$ with highest values $\widehat{f}_{h,i}$ and such that $T_{h,i} \ge 2^{p}$.
  For $p = 0$ to $p_{\max}$: [Cross-validation]
    Evaluate $h_{\max}$ times the candidates $x(n, p) \triangleq \arg\max_{x_{h,i} : T_{h,i} \ge 2^{p}} \widehat{f}_{h,i}$.
  Output $x(n) \triangleq \arg\max_{p \in [0 : p_{\max}]} \widehat{f}(x(n, p))$.

Figure 2: The StroquOOL Algorithm

In the presence of noise, it is natural to evaluate the cells multiple times, not just one time as in the deterministic case. The number of times a cell should be evaluated to differentiate its value from the optimal value of the function depends on the gap between these two values as well as on the range of the noise. As we do not want to make any assumptions on knowing these quantities, our algorithm tries to be robust to any potential values by not making a fixed choice on the number of evaluations. Intuitively, StroquOOL implicitly uses modified versions of SequOOL, denoted SequOOL($p$),\footnote{Again, this is only for the intuition; the algorithm is not a meta-algorithm over SequOOL's.} where each cell is evaluated $2^{p}$ times, $0 \le p \le p_{\max}$, while in SequOOL $p = 0$. On one side, given one instance of SequOOL($p$), evaluating each cell more ($p$ large) leads to a better quality of the mean estimates in each cell. On the other side, as a tradeoff, it implies that SequOOL($p$) uses more evaluations per depth and therefore is not able to explore deep depths of the partition: the largest depth explored scales as $h_{\max}/2^{p}$. StroquOOL then implicitly performs the same amount of evaluations as would be performed by the instances SequOOL($0$), SequOOL($1$), \ldots, SequOOL($p_{\max}$).

The St(r)ochastic sequential Optimization aLgorithm StroquOOL is described in Figure 2. Remember that 'opening' a cell means 'evaluating' its children. The algorithm opens cells by sequentially diving them deeper and deeper from the root node to a maximal depth of $h_{\max}$. At depth $h$, we allocate, in a decreasing fashion, different numbers of evaluations to the cells with the highest values of that depth, with $2^{p}$ evaluations, $p$ starting at $p_{\max}$ and going down to $0$. The best cells that have been evaluated at least $2^{p_{\max}}$ times are opened with $2^{p_{\max}}$ evaluations, twice as many next best cells that have been evaluated at least $2^{p_{\max}-1}$ times are opened with $2^{p_{\max}-1}$ evaluations, again twice as many next best cells are opened with $2^{p_{\max}-2}$ evaluations, and so on, until some next best cells that have been evaluated at least once are opened with one evaluation. More precisely, given $h$ and $p$, we open, with $2^{p}$ evaluations, the $\lfloor h_{\max}/(h 2^{p}) \rfloor$ non-previously-opened cells with highest values $\widehat{f}_{h,i}$ and such that $T_{h,i} \ge 2^{p}$. The maximum number of evaluations of any cell is $2^{p_{\max}}$. For each $p \in [0 : p_{\max}]$, the candidate output $x(n, p)$ is the representative of the cell with the highest estimated value that has been evaluated at least $2^{p}$ times. We set $h_{\max}$ and $p_{\max}$ so that the total budget is respected; in Appendix B, we prove that StroquOOL uses less than $n$ openings.
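The per-depth allocation just described can be sketched as follows. The schedule below only illustrates the claimed structure, a Zipf law across depths combined with a doubling trick across evaluation counts; the constants and names are illustrative, not the paper's exact setting of $h_{\max}$ and $p_{\max}$.

```python
def stroquool_schedule(h, h_max, p_max):
    """Per-depth opening schedule sketched from StroquOOL's description.

    At depth h, for p = p_max down to 0, about h_max / (h * 2**p) cells
    are opened with 2**p evaluations each, so each (p, depth) slice costs
    roughly h_max / h evaluations: few cells get many evaluations, many
    cells get few.  (Our reading of the scheme; illustrative only.)
    """
    return [(2 ** p, h_max // (h * 2 ** p)) for p in range(p_max, -1, -1)]

h_max, p_max = 64, 4
sched = stroquool_schedule(1, h_max, p_max)
assert sched[0] == (16, 4)                    # few cells, many evaluations
assert sched[-1] == (1, 64)                   # many cells, one evaluation
assert all(e * c == h_max for e, c in sched)  # balanced cost across p
```

The balance `evaluations * cells ≈ h_max / h` is what lets the algorithm hedge over all noise levels at once instead of committing to one number of repetitions.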

4.2 Analysis of StroquOOL

The proofs of this section use a structure similar to the ones for the deterministic feedback. Additionally, they take into account the uncertainty created by the noise. The proofs of the following statements are given in Appendices D and E. For any $p \in [0 : p_{\max}]$, $\perp_h^{p}$ is the depth of the deepest opened node with at least $2^{p}$ evaluations containing $x^\star$ at the end of the opening of depth $h$ of StroquOOL.

{lemma}

[] For any global optimum $x^\star$ with associated $(\nu, \rho)$ (see Assumption 1), with probability at least $1 - \delta$, for all depths $h$, for all $p \in [0 : p_{\max}]$, if $\lfloor h_{\max}/(h 2^{p}) \rfloor \ge C\rho^{-dh}$ and if $2^{p} \gtrsim \left( \frac{b}{\nu\rho^{h}} \right)^{2} \log\frac{n}{\delta}$, we have $\perp_h^{p} = h$, while $\perp_0^{p} = 0$. Lemma 4.2 gives two conditions so that the cell containing $x^\star$ is opened at depth $h$. This holds if (1) StroquOOL opens, with $2^{p}$ evaluations, more cells at depth $h$ than the number of near-optimal cells at depth $h$ ($\lfloor h_{\max}/(h 2^{p}) \rfloor \ge C\rho^{-dh}$) and (2) the $2^{p}$ evaluations are sufficient to discriminate the empirical average of near-optimal cells from the empirical average of sub-optimal cells ($2^{p} \gtrsim (b/(\nu\rho^{h}))^{2}$, up to logarithmic factors).

To state the next theorems, we introduce a positive real number $\tilde{h}$ that balances the two conditions of Lemma 4.2: $\tilde{h}$ is the largest depth at which, given the budget $n$, the cell containing $x^\star$ is opened with a number of evaluations sufficient to beat the noise. The quantity $\tilde{h}$ gives the depth of the deepest cell opened by StroquOOL that contains $x^\star$ with high probability. Consequently, $\tilde{h}$ also lets us characterize for which regime of the noise range $b$ we recover results similar to the loss of the deterministic case. Discriminating on the noise regime, we now state our results, Theorem 4.2 for a high noise and Theorem 4.2 for a low one. {theorem}[] High-noise regime: After $n$ rounds, for any function $f$ and one of its global optima $x^\star$ with associated $(\nu, \rho)$, and near-optimality dimension denoted for simplicity $d = d(\nu, C, \rho)$, if the noise range is large, $b \ge \nu\rho^{\tilde{h}}$, the simple regret of StroquOOL obeys

\[ r_n = \widetilde{\mathcal{O}}\!\left( \left( \frac{b^{2}}{n} \right)^{\frac{1}{d+2}} \right)\!, \]

where the exact constant involves the standard Lambert function $W$ (see Section 2) and $\widetilde{\mathcal{O}}$ hides polylogarithmic factors in $n$. {corollary} With the assumptions of Theorem 4.2 and $b \ge \nu\rho^{\tilde{h}}$,

\[ r_n = \widetilde{\mathcal{O}}\!\left( b^{\frac{2}{d+2}} \left( \frac{\overline{\log}^{\,2} n}{n} \right)^{\frac{1}{d+2}} \right)\!. \]

{theorem}

[] Low-noise regime: After $n$ rounds, for any function $f$ and one of its global optima $x^\star$ with associated $(\nu, \rho)$, and near-optimality dimension denoted for simplicity $d = d(\nu, C, \rho)$, if the noise range is small, $b \le \nu\rho^{\tilde{h}}$, the simple regret of StroquOOL is bounded, up to logarithmic factors, as the regret of SequOOL under deterministic feedback in Theorem 3.2.

{corollary}

With the assumptions of Theorem 4.2, if $d = 0$ and $b \le \nu\rho^{\tilde{h}}$, then the regret of StroquOOL decreases exponentially fast, $r_n \le \nu\rho^{\Omega\left(n/\overline{\log}^{\,2} n\right)}$.

4.3 Discussion for the stochastic feedback

Worst-case comparison with POO and StoSOO

When $b$ is large and known: StroquOOL is an algorithm designed for the noisy feedback while adapting to the smoothness of the function. Therefore, it can be directly compared to POO and StoSOO, which both tackle the same problem. The results for StroquOOL, like the ones for POO, hold for any $d \ge 0$, while the theoretical guarantees of StoSOO are only for the case $d = 0$. The general rate of StroquOOL in Corollary 4.2\footnote{Note that the second term in our bound has at most the same rate as the first one.} is similar to the ones of POO (for $d \ge 0$) and StoSOO (for $d = 0$), as their loss is $\widetilde{\mathcal{O}}\big((1/n)^{1/(d+2)}\big)$. More precisely, looking at the log factors, we can first notice an improvement over StoSOO when $d = 0$. Comparing with POO, we obtain a slightly worse logarithmic factor. Despite having this (theoretically) slightly worse logarithmic factor compared to POO, StroquOOL has two nice new features. First, our algorithm is conceptually simple, parameter-free, and does not need to call a sub-algorithm: POO repetitively calls different instances of HOO, which makes it a heavy meta-algorithm. Second, our algorithm, as we detail in the next paragraphs, naturally adapts to low noise and, even more, recovers the rates of SequOOL in the deterministic case, leading to an exponentially decreasing loss when $d = 0$. We do not know if this deterioration of the logarithmic factor from POO to StroquOOL is the unavoidable price to pay to obtain an adaptation to the deterministic feedback case.

Comparison with oracle HOO

HOO is also designed for the noisy optimization setting. However, HOO knows the smoothness of $f$, i.e., $(\nu, \rho)$ are input parameters of HOO. Using this extra knowledge, HOO is only able to improve the logarithmic factor of the regret.

Adaptation to the range of the noise without a prior knowledge

A favorable feature of our bound in Corollary 4.2 is that it characterizes how the range of the noise affects the rate of the regret for all $b \ge 0$. Considering the common case of $d = 0$, the regret in Corollary 4.2 scales linearly with the range of the noise, leading to a potentially large improvement for small $b$. Note that $b$ is any non-negative real number and it is unknown to StroquOOL. HOO, POO, and StoSOO, on the other hand, would only obtain a regret scaling with $b$ when $b$ is known to them, as they directly encode a confidence bound, which must include $b$, in the definition of their code. To achieve this result, and contrarily to HOO, StoSOO, or POO, we designed StroquOOL without using upper-confidence bounds (UCBs). Indeed, UCB approaches are overly conservative as they use a hard-coded (and often overestimated) upper bound on $b$. Finally, note that using UCB approaches with an empirical estimation of the variance would not achieve the best of both worlds, a result that is discussed in the next paragraph. Indeed, an assumption on the noise is still used in these approaches. This prevents recovering an exponentially decreasing regret when $b = 0$ and $d = 0$.

Adaptation to the deterministic case and $d = 0$

When the noise is very low, i.e., when $b$ is of the order of the deterministic regret or smaller, which includes the deterministic feedback ($b = 0$), in Theorem 4.2 and Corollary 4.2, StroquOOL recovers the same rate as DOO and SequOOL up to logarithmic factors. Remarkably, StroquOOL obtains an exponentially decreasing regret when $d = 0$, while POO, StoSOO, or HOO only guarantee a regret of order $\widetilde{\mathcal{O}}(1/\sqrt{n})$ when unaware of the range $b$. Therefore, up to log factors, StroquOOL achieves naturally the best of both worlds without being aware of the nature of the feedback (either stochastic or deterministic). Again, this is a behavior that one cannot expect from HOO, POO, and StoSOO, as they explicitly use confidence intervals in their algorithms, assuming a fixed range of noise, which limits the maximum depth that can be explored.

5 Experiments

Figure 3: Bottom right: the wrapped-sine function. The remaining panels (top: left, middle, right; bottom: left, middle) report results for different values of the true noise range $b$ and of the noise range used by HOO and POO.
Figure 4: Left & center: deterministic feedback. Right: the garland function, for which $d = 0$.

We empirically demonstrate how SequOOL and StroquOOL adapt to the complexity of the data and compare them to SOO, POO, and HOO. We use two functions from prior work as testbeds for the optimization of difficult functions without the knowledge of smoothness. The first one is the wrapped-sine function (Grill et al., 2015; Figure 3, bottom right), which has $d > 0$ for the standard partitioning (Grill et al., 2015). The second is the garland function (Valko et al., 2013; Figure 4, right), which has $d = 0$ for the standard partitioning (Valko et al., 2013). Both functions are in one dimension, $\mathcal{X} = [0, 1]$. We remark that our algorithms work in any dimension, but with the current computational power they would not scale beyond a thousand dimensions.
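For reference, a sketch of a garland-style benchmark: a smooth envelope multiplied by a non-smooth oscillation. The exact expression below follows the form commonly attributed to Valko et al. (2013) and should be treated as an assumption of this illustration.

```python
import math

def garland(x):
    """Garland-style test function on [0, 1].

    The form x(1-x)(4 - sqrt(|sin(60x)|)) is an assumption of this
    sketch: a smooth parabola modulated by a |sin|-based oscillation
    that is not differentiable at its zeros.
    """
    return x * (1.0 - x) * (4.0 - math.sqrt(abs(math.sin(60.0 * x))))

# sanity checks on a grid: bounded in [0, 1], maximum away from the borders
grid = [t / 10000.0 for t in range(10001)]
vals = [garland(x) for x in grid]
assert all(0.0 <= v <= 1.0 for v in vals)
assert 0.0 < grid[vals.index(max(vals))] < 1.0
```

The square-root kink at the zeros of the sine is what defeats methods relying on global smoothness while leaving the local polynomial envelope ($d = 0$) intact.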

StroquOOL outperforms POO and HOO and adapts to lower noise.

In Figure 3, we report the results of StroquOOL, POO, and HOO for different values of $b$. As detailed in the caption, we vary the true range of the noise $b$ and the range of the noise used by HOO and POO. In all our experiments, StroquOOL outperforms POO and HOO. StroquOOL adapts to low noise: its performance improves when $b$ diminishes. To see that, compare the top-left, top-middle, and top-right subfigures. On the other hand, POO and HOO do not naturally adapt to the range of the noise: for a given noise parameter, their performance is unchanged when the range of the real noise varies, as seen by comparing again the top-left, top-middle, and top-right subfigures. However, note that POO and HOO can adapt to noise and perform empirically well if they have a good estimate of the range, as in the bottom-left subfigure, or if they underestimate the range of the noise, as in the bottom-middle subfigure. In Appendix F, we report similar results on the garland function. Finally, StroquOOL demonstrates its adaptation to both worlds in Figure 4 (left), where it achieves an exponentially decreasing loss in the case of $d = 0$ and deterministic feedback.

Regrets of SequOOL and StroquOOL have exponential decay when $d=0$.

In Figure 4, we test the deterministic-feedback case with SequOOL, StroquOOL, SOO, and the uniform strategy on the garland function (left) and the wrapped-sine function (middle). Interestingly, on the garland function, where $d=0$, SequOOL outperforms SOO and displays a truly exponential regret decay (the y-axis is in log scale), while SOO appears to achieve a slower decay. StroquOOL, which is expected to have a larger regret in this regime, lags behind SOO; the comparison only reverses at budgets for which the result is beyond the numerical precision. In Figure 4 (middle), we use the wrapped-sine function. While all algorithms have similar theoretical guarantees here, since $d>0$, SOO outperforms the other algorithms.
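To make the deterministic experiment concrete, the following is a simplified SequOOL-style procedure under stated assumptions: a binary partition of $[0,1]$ with midpoint representatives, a depth budget $h_{\max} = \lfloor n / H(n) \rfloor$ with $H$ the harmonic number, and the schedule that opens the $\lfloor h_{\max}/h \rfloor$ best cells at each depth $h$. The partitioning, tie-breaking, and the fact that one opening costs two evaluations here are simplifications, not the paper's exact protocol.

```python
import math

def sequool(f, n):
    # SequOOL-style depth budget: h_max = floor(n / H(n)), H(n) the harmonic number;
    # the schedule sum_h floor(h_max / h) then keeps total openings of order n
    harmonic = sum(1.0 / h for h in range(1, n + 1))
    h_max = int(n // harmonic)

    def midpoint(h, i):
        # representative point of cell (h, i) = [i 2^-h, (i+1) 2^-h]
        return (i + 0.5) / (2 ** h)

    evaluated = {}  # (depth, index) -> f at the cell midpoint

    def open_cell(h, i):
        # opening a cell = evaluating its two children's midpoints (binary partition)
        for j in (2 * i, 2 * i + 1):
            if (h + 1, j) not in evaluated:
                evaluated[(h + 1, j)] = f(midpoint(h + 1, j))

    evaluated[(0, 0)] = f(midpoint(0, 0))
    open_cell(0, 0)
    for h in range(1, h_max + 1):
        # open the floor(h_max / h) evaluated cells of depth h with highest values
        depth_cells = sorted(
            ((v, i) for (d, i), v in evaluated.items() if d == h), reverse=True
        )
        for _, i in depth_cells[: h_max // h]:
            open_cell(h, i)
    # recommend the best evaluated point
    (h, i), v = max(evaluated.items(), key=lambda kv: kv[1])
    return midpoint(h, i), v
```

On a smooth unimodal function ($d=0$), the cell containing the optimum always carries the best midpoint value at its depth, so it is opened at every round and the final error shrinks like $2^{-h_{\max}}$, i.e., exponentially in $n/\log n$, which is the behavior the left panel of Figure 4 illustrates.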

Acknowledgements

We would like to thank Jean-Bastien Grill for sharing his code. We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). The research presented was also supported by European CHIST-ERA project DELTA, French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, Inria and Otto-von-Guericke-Universität Magdeburg associated-team north-european project Allocate, and French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003).

References

Appendix A Regret analysis of SequOOL for deterministic feedback

See 3.2 {proof} We prove Lemma 3.2 by induction over the depth $h$.
For the base case, the claim trivially holds.
Now consider the inductive step and assume that the claim holds at the previous depth. We want to show that it also holds at the current depth. Knowing the claim at the previous depth, we have that, for all the cells in question,

which means, assuming that the proposition of the lemma is true at the previous depth, that the above holds. Therefore, at the end of the processing of the previous depth, during which we were opening its cells, we opened the cell containing the optimal node of that depth. During this phase, the cells with the highest values are opened. For the purpose of contradiction, let us assume that the optimal cell is not one of them. This would mean that there exist at least as many other cells, distinct from the optimal one, with values at least as high. By Assumption 1, all of these cells are near-optimal, which lower-bounds the number of near-optimal cells at this depth (with one additional cell accounting for the optimal cell itself). However, the assumption of the lemma upper-bounds this number. This contradicts $f$ being of near-optimality dimension with the associated constant as defined in Definition 2. Indeed, the condition in Definition 2 is equivalent to the corresponding integer condition, as the number of cells is an integer.
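The counting argument at the heart of this contradiction can be summarized as follows. This is a hedged restatement: the notation ($x_{h,i}$ for the representative point of cell $(h,i)$, $f^\star$ for the optimum, and near-optimality dimension $d$ with constant $C$) is assumed from the paper's definitions.

```latex
% If the optimal cell of depth h were not among the \lfloor h_{\max}/h \rfloor opened
% cells, at least \lfloor h_{\max}/h \rfloor + 1 cells of depth h would be near-optimal:
\bigl|\{ i : f(x_{h,i}) \ge f^\star - \nu\rho^{h} \}\bigr|
  \;\ge\; \lfloor h_{\max}/h \rfloor + 1.
% The near-optimality dimension d with constant C bounds the same count from above:
\bigl|\{ i : f(x_{h,i}) \ge f^\star - \nu\rho^{h} \}\bigr|
  \;\le\; C\rho^{-dh}.
% Under the lemma's assumption \lfloor h_{\max}/h \rfloor \ge C\rho^{-dh},
% the two displays are incompatible, which yields the contradiction.
```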

See 3.2

{proof}

Let be a global optimum with associated . For simplicity, let . We have

where (a) is because and . Note that the tree has depth in the end. From the previous inequality we have . For the rest of the proof, we want a lower bound. Lemma 3.2 provides a sufficient condition for such lower bounds. This condition is an inequality which, as the depth gets larger, is more and more likely not to hold. For our bound on the regret of SequOOL to be small, we want the inequality to hold at a depth that is as large as possible. So it makes sense to determine when the inequality flips sign, which is when it turns into an equality. This is what we solve next. We solve Equation 2 and then verify that it gives a valid indication of the behavior of our algorithm in terms of its optimal depth. We denote by the positive real number satisfying

(2)

where . As , and we have . This gives . Finally as , we have .

If we have . If we have where is the standard Lambert $W$ function. Using standard properties of the $W$ function, we have

(3)

We always have . If , as discussed above , therefore as is increasing. Moreover, by Lemma 3.2, whose assumptions are verified because of Equation 3 and . So in general we have . If we have,

If satisfies for , [Hoorfar and Hassani, 2008]. Therefore, if we have, denoting ,
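The Lambert $W$ manipulations above can be checked numerically. The sketch below computes the principal branch by Newton iteration and verifies the Hoorfar-Hassani lower bound $W(x) \ge \ln(x/\ln x)$ for $x \ge e$; the function names are ours, not from any library.

```python
import math

def lambert_w(x, tol=1e-12):
    # principal branch W(x) for x > 0, via Newton iteration on w * e^w = x
    w = math.log(1.0 + x)  # rough initial guess
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def hoorfar_hassani_lower(x):
    # W(x) >= ln(x / ln x) for x >= e  [Hoorfar and Hassani, 2008]
    return math.log(x / math.log(x))
```

Since $W$ grows like $\ln x - \ln\ln x$, plugging this lower bound into the depth equation is what produces the $n/\log n$-type exponents in the regret bounds.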

Appendix B StroquOOL does not use a budget larger than

Notice that, for any given depth , StroquOOL never uses more openings than as

Summing over the depths, StroquOOL never uses more openings than the budget during its depth exploration as

We need to add the additional openings for the evaluation at the end,

Therefore, in total, the budget is not more than . Again, notice that we use this budget only for notational convenience; we could also account for the final evaluations so as to fit under the stated budget (what is important is that the number of openings is linear in the budget).
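The summation argument can be checked numerically. The schedule below is an assumed StroquOOL-style one (the exact definitions of $h_{\max}$ and $p_{\max}$ should be read from the paper; here $h_{\max}$ is chosen so that the harmonic-sum bound goes through): at depth $h$, $\lfloor h_{\max}/((h+1)2^p) \rfloor$ cells are opened with $2^p$ evaluations each.

```python
import math

def total_openings(n):
    # assumed schedule: p_max ~ log2(n), and h_max scaled down by (p_max + 1) times
    # a harmonic factor so that the bound below keeps the total within the budget n
    p_max = int(math.log2(n))
    harmonic = sum(1.0 / h for h in range(1, n + 1))
    h_max = int(n // ((p_max + 1) * math.ceil(harmonic)))
    total = 0
    for h in range(1, h_max + 1):
        for p in range(p_max + 1):
            total += 2 ** p * (h_max // ((h + 1) * 2 ** p))
    return total, h_max

# each term satisfies 2^p * floor(h_max / ((h+1) 2^p)) <= h_max / (h+1), so the
# total is at most (p_max + 1) * h_max * (H(h_max + 1) - 1), which stays below n
```

The point of the per-depth floor is exactly the one made above: multiplying the number of opened cells by the number of evaluations per cell cancels the $2^p$, so the geometric grid of evaluation counts costs only a logarithmic factor.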

Appendix C Lower bound on the probability of event

In this section, we define and consider the event and prove that it holds with high probability. {lemma} Let be the set of cells evaluated by StroquOOL during one of its runs; this is a random quantity. Let be the event under which all average estimates in the cells receiving at least one evaluation from StroquOOL are within their classical confidence intervals; then , where

{proof}

The idea of the proof of this lemma follows similar lines as the proof of the equivalent statement for StoSOO [Valko et al., 2013]. The crucial point is that, while there are potentially exponentially many combinations of cells that can be evaluated, for any particular execution we only need to consider a polynomial number of estimators, to which we can apply the Chernoff-Hoeffding concentration inequality.
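The concentration step can be illustrated numerically: for bounded noise, the Hoeffding interval around the empirical mean of $m$ evaluations holds with probability at least $1-\delta$ per estimator, and a union bound over the polynomially many estimators of one execution then gives the event. The sketch below uses uniform noise on $[-b,b]$ as a stand-in for the bounded-noise assumption and estimates the coverage of one such interval by simulation.

```python
import math
import random

def hoeffding_covers(m, b, delta, rng):
    # does the Hoeffding interval around the mean of m noisy evaluations of a
    # constant function (true value 0) contain the truth?
    mean = sum(rng.uniform(-b, b) for _ in range(m)) / m
    halfwidth = b * math.sqrt(2.0 * math.log(2.0 / delta) / m)
    return abs(mean) <= halfwidth

rng = random.Random(0)
trials = 2000
covered = sum(hoeffding_covers(50, 1.0, 0.05, rng) for _ in range(trials))
coverage = covered / trials  # should be at least 1 - delta (Hoeffding is conservative)
```

In the proof, the same per-estimator guarantee is applied to every cell that could receive evaluations in a given run, and the finiteness of the set of possible runs is what makes the union bound valid.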

The identity of the set of cells evaluated by StroquOOL, , is random and can change at every run of StroquOOL. However, no cells with a depth larger than are evaluated. Therefore, given , the number of possible sets of cells associated with any run of StroquOOL is finite. Let us denote the set of all such possible sets of cells by . Given any set of cells that StroquOOL could open, we denote by the event that StroquOOL opens exactly the cells in , and we define the related event