Lower Bounds for Finding Stationary Points I
Abstract
We prove lower bounds on the complexity of finding \epsilon-stationary points (points x such that \|\nabla f(x)\|\leq\epsilon) of smooth, high-dimensional, and potentially nonconvex functions f. We consider oracle-based complexity measures, where an algorithm is given access to the value and all derivatives of f at a query point x. We show that for any (potentially randomized) algorithm \mathsf{A}, there exists a function f with Lipschitz pth order derivatives such that \mathsf{A} requires at least \epsilon^{-(p+1)/p} queries to find an \epsilon-stationary point. Our lower bounds are sharp to within constants, and they show that gradient descent, cubic-regularized Newton’s method, and generalized pth order regularization are worst-case optimal within their natural function classes.
1 Introduction
Consider the optimization problem
\mathop{\rm minimize}_{x\in\mathbb{R}^{d}}~{}f(x) 
where f:\mathbb{R}^{d}\rightarrow\mathbb{R} is smooth, but possibly nonconvex. In general, it is intractable to even approximately minimize such f [32, 30], so—following an established line of research—we consider the problem of finding an \epsilon-stationary point of f, meaning some x\in\mathbb{R}^{d} such that
\|\nabla f(x)\|\leq\epsilon.  (1)
We prove lower bounds on the number of function and derivative evaluations required for algorithms to find a point x satisfying inequality (1). While for arbitrary smooth f, a near-stationary point (1) is certainly insufficient for any type of optimality, there are a number of reasons to study algorithms and complexity for finding stationary points. In several statistical and engineering problems, including regression models with nonconvex penalties and objectives [27, 28], phase retrieval [12, 39], and nonconvex (low-rank) reformulations of semidefinite programs and matrix completion [11, 24, 8], it is possible to show that all first- or second-order stationary points are (near) global minima. The strong empirical success of local search strategies for such problems, as well as for neural networks [25], motivates a growing body of work on algorithms with strong complexity guarantees for finding stationary points [37, 7, 13, 2, 14]. In contrast to this algorithmic progress, algorithm-independent lower bounds (fundamental limits on complexity) for finding stationary points are largely unexplored.
Even for nonconvex functions f, it is possible to find \epsilon-stationary points in time polynomial in 1/\epsilon. Of particular interest are methods for which the number of function and derivative evaluations does not depend on d, the dimension of \mathrm{dom}f, but instead depends on measures of f’s regularity. The best-known method with such a dimension-free convergence guarantee is classical gradient descent: for every (nonconvex) function f with L_{1}-Lipschitz gradient satisfying f(x^{(0)})-\inf_{x}f(x)\leq\Delta at the initial point x^{(0)}, gradient descent finds an \epsilon-stationary point in at most 2L_{1}\Delta\epsilon^{-2} iterations [34]. Under the additional assumption that f has Lipschitz continuous Hessian, our work [13] and Agarwal et al. [2] exhibit randomized first-order methods that find an \epsilon-stationary point in time scaling as \epsilon^{-7/4}\log\frac{d}{\epsilon} (ignoring other problem-dependent constants). In subsequent work [14], we show a different deterministic accelerated gradient method that achieves dimension-free complexity \epsilon^{-7/4}\log\frac{1}{\epsilon}, and if f additionally has Lipschitz third derivatives, then \epsilon^{-5/3}\log\frac{1}{\epsilon} iterations suffice to find an \epsilon-stationary point.
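As a concrete illustration of the gradient descent guarantee above, the following sketch (the helper name and test function are ours, purely for illustration) runs the classical step x \leftarrow x-\nabla f(x)/L_{1} and counts oracle queries until the gradient norm drops below \epsilon; the cosine objective is an arbitrary nonconvex example with 1-Lipschitz gradient.

```python
import numpy as np

def gradient_descent_until_stationary(grad, x0, L1, eps, max_iter=100000):
    """Run gradient descent with step size 1/L1, stopping at the first
    iterate whose gradient norm is at most eps; returns (point, queries)."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, max_iter + 1):
        g = grad(x)                      # one oracle query
        if np.linalg.norm(g) <= eps:
            return x, t
        x = x - g / L1                   # classical step for L1-Lipschitz gradients
    return x, max_iter

# f(x) = cos(x_1) is nonconvex with 1-Lipschitz gradient -sin(x_1)
x, queries = gradient_descent_until_stationary(
    lambda z: -np.sin(z), np.array([1.0]), L1=1.0, eps=1e-4)
```

On this toy example the method converges quickly; the 2L_{1}\Delta\epsilon^{-2} bound above is a worst-case guarantee, not a typical-case count.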
By evaluating higher-order derivatives, such as the Hessian, it is possible to achieve better \epsilon dependence. Nesterov and Polyak’s cubic regularization of Newton’s method [37, 16] guarantees \epsilon-stationarity (1) in \epsilon^{-3/2} iterations, but each iteration may be expensive when the dimension d is large. More generally, pth-order regularization methods iterate by sequentially minimizing models of f based on order-p Taylor approximations, and Birgin et al. [7] show that these methods converge in \epsilon^{-(p+1)/p} iterations. Each iteration requires approximately minimizing a high-dimensional, potentially nonconvex, degree p+1 polynomial, which suggests that the methods will be practically challenging for p>2. The methods nonetheless provide fundamental upper complexity bounds.
In this paper and its companion [15], we focus on the converse problem: providing dimension-free complexity lower bounds for finding \epsilon-stationary points. We show fundamental limits on the best achievable \epsilon dependence, as well as dependence on other problem parameters. Together with known upper bounds, our results shed light on the optimal rates of convergence for finding stationary points.
1.1 Related lower bounds
In the case of convex optimization, we have a deep understanding of the complexity of finding \epsilon-suboptimal points, that is, x satisfying f(x)\leq f(x^{\star})+\epsilon for some \epsilon>0, where x^{\star}\in\mathop{\rm arg\,min}_{x}f(x). Here we review only the dimension-free optimal rates, as those are most relevant for our results. Given a point x^{(0)} satisfying \|x^{(0)}-x^{\star}\|\leq D<\infty, if f is convex with L_{1}-Lipschitz gradient, Nesterov’s accelerated gradient method finds an \epsilon-suboptimal point in O(\sqrt{L_{1}}D\epsilon^{-1/2}) gradient evaluations, which is optimal even among randomized, higher-order algorithms [33, 32, 34, 41]. (Higher-order methods can yield improvements under additional smoothness: if in addition f has L_{2}-Lipschitz Hessian and \epsilon\leq L_{1}^{7/3}L_{2}^{-4/3}D^{2/3}, an accelerated Newton method achieves the optimal rate \Theta((L_{2}D^{3}/\epsilon)^{2/7}) [4, 29].) For nonsmooth problems, that is, when f is L_{0}-Lipschitz, subgradient methods achieve the optimal rate of O(L_{0}^{2}D^{2}/\epsilon^{2}) subgradient evaluations (cf. [10, 32, 34]). In Part II of this paper [15], we consider the impact of convexity on the difficulty of finding stationary points using first-order methods.
Globally optimizing smooth nonconvex functions is of course intractable: Nemirovski and Yudin [32, §1.6] show that for functions f:\mathbb{R}^{d}\to\mathbb{R} with Lipschitz first through pth derivatives, and algorithms receiving all derivatives of f at the query point x, the worst-case complexity of finding \epsilon-suboptimal points scales at least as (1/\epsilon)^{d/p}. This exponential scaling in d shows that dimension-free guarantees for achieving near-optimality in smooth nonconvex functions are impossible to obtain.
Less is known about lower bounds for finding stationary points in \mathbb{R}^{d}. Nesterov [36] proposes lower bounds for finding stationary points under a box constraint, but his construction does not extend to the unconstrained case when f(x^{(0)})-\inf_{x}f(x) is bounded. Cartis et al. [16, 17] show that for important but specific algorithms, namely gradient descent and cubic regularization of Newton’s method, the respective performance guarantees are tight in the worst case. They also extend these results to certain structured classes of methods [18, 19]. We provide the first algorithm-independent lower bounds for finding stationary points in the unconstrained setting.
1.2 Our contributions
In this paper, we consider the class of all randomized algorithms that access the function f through an information oracle that returns the function value, gradient, Hessian, and all higher-order derivatives of f at a queried point x. Our main result (Theorem 2 in Section 5) is as follows. Let p\in\mathbb{N} and \Delta,L_{p},\epsilon>0. Then, for any randomized algorithm \mathsf{A} based on the oracle described above, there exists a function f that has L_{p}-Lipschitz pth derivative and satisfies f(x^{(0)})-f(x^{\star})\leq\Delta, such that, with high probability, \mathsf{A} requires at least
(p+1)^{-O(1)}\cdot\Delta L_{p}^{1/p}\epsilon^{-(p+1)/p}
oracle queries to find an \epsilonstationary point of f. The constructed function f has domain of dimension polynomial in 1/\epsilon.
For every p, our lower bound matches (up to a constant) known upper bounds, thereby characterizing the optimal complexity of finding stationary points. For p=1, our results imply that gradient descent [34, 36] is optimal among all methods (even randomized, higher-order methods) operating on functions with Lipschitz continuous gradient and bounded initial suboptimality. Therefore, to strengthen the guarantees of gradient descent one must introduce additional assumptions, such as convexity of f or Lipschitz continuity of \nabla^{{2}}f. Similarly, in the case p=2 we establish that cubic regularization of Newton’s method [37, 16] achieves the optimal rate \epsilon^{-3/2}, and for general p we show that pth-order Taylor-approximation methods [7] are optimal.
These results say little about the potential of first-order methods on functions with higher-order Lipschitz derivatives, where first-order methods attain rates better than \epsilon^{-2} [14]. In Part II of this series [15], we address this issue and show lower bounds for deterministic algorithms using only first-order information. The lower bounds exhibit a fundamental gap between first- and second-order methods, and nearly match the known upper bounds [14].
1.3 Our approach and paper organization
In Section 2 we introduce the classes of functions and algorithms we consider, as well as our notion of complexity. Then, in Section 3, we present the generic technique we use to prove lower bounds for deterministic algorithms in both this paper and Part II [15]. While essentially present in previous work, our technique abstracts away and generalizes the central arguments in many lower bounds [32, 31, 41, 4]. The technique applies to higher-order methods and provides lower bounds for general optimization goals, including finding stationary points (our main focus), approximate minimizers, and second-order stationary points. It is also independent of whether the functions under consideration are convex, applying to any function class with appropriate rotational invariance [32]. The key building blocks of the technique are Nesterov’s notion of a “chain-like” function [34], which is difficult for a certain subclass of algorithms, and a “resisting oracle” [32, 34] reduction that turns a lower bound for this subclass into a lower bound for all deterministic algorithms.
In Section 4 we apply this generic method to produce lower bounds for deterministic methods (Theorem 1). The deterministic results underpin our analysis for randomized algorithms, which culminates in Theorem 2 in Section 5. Following Woodworth and Srebro [41], we consider random rotations of our deterministic construction, and show that for any algorithm such a randomly rotated function is, with high probability, difficult. For completeness, in Section 6 we provide lower bounds on finding stationary points of functions where \|x^{(0)}-x^{\star}\| is bounded, rather than the function value gap f(x^{(0)})-f(x^{\star}); these bounds have the same \epsilon dependence as their bounded function value counterparts.
Notation
Before continuing, we provide the conventions we adopt throughout the paper. For a sequence of vectors, subscripts denote coordinate index, while parenthesized superscripts denote element index, e.g. x^{(i)}_{j} is the jth coordinate of the ith entry in the sequence x^{(1)},x^{(2)},\ldots. For any p\geq 1 and p times continuously differentiable f:\mathbb{R}^{d}\to\mathbb{R}, we let \nabla^{{p}}f(x) denote the tensor of pth order partial derivatives of f at point x, so \nabla^{{p}}f(x) is an order p symmetric tensor with entries
\left[\nabla^{{p}}f(x)\right]_{i_{1},\ldots,i_{p}}=\nabla^{{p}}_{i_{1},\ldots,i_{p}}f(x)=\frac{\partial^{p}f}{\partial x_{i_{1}}\cdots\partial x_{i_{p}}}(x)~~\mbox{for }i_{j}\in\{1,\ldots,d\}.
Equivalently, we may write \nabla^{{p}}f(x) as a multilinear operator \nabla^{{p}}f(x):(\mathbb{R}^{d})^{p}\to\mathbb{R},
\nabla^{{p}}f(x)\left[v^{(1)},\ldots,v^{(p)}\right]=\sum_{i_{1}=1}^{d}\cdots\sum_{i_{p}=1}^{d}v_{i_{1}}^{(1)}\cdots v_{i_{p}}^{(p)}\frac{\partial^{p}f}{\partial x_{i_{1}}\cdots\partial x_{i_{p}}}(x)=\left\langle\nabla^{{p}}f(x),v^{(1)}\otimes\cdots\otimes v^{(p)}\right\rangle,
where \langle\cdot,\cdot\rangle is the Euclidean inner product on tensors, defined for order k tensors T and M by \langle T,M\rangle=\sum_{i_{1},\ldots,i_{k}}T_{i_{1},\ldots,i_{k}}M_{i_{1},\ldots,i_{k}}, and \otimes denotes the Kronecker product. We let \otimes^{k}{d} denote d\times\cdots\times d, k times, so that T\in\mathbb{R}^{\otimes^{k}{d}} denotes an order k tensor.
For a vector v\in\mathbb{R}^{d} we let \|v\|:=\sqrt{\langle v,v\rangle} denote the Euclidean norm of v. For a tensor T\in\mathbb{R}^{\otimes^{k}{d}}, the Euclidean operator norm of T is
\|T\|_{\rm op}:=\sup_{v^{(1)},\ldots,v^{(k)}}\Big\{\langle T,v^{(1)}\otimes\cdots\otimes v^{(k)}\rangle=\sum_{i_{1},\ldots,i_{k}}T_{i_{1},\ldots,i_{k}}v_{i_{1}}^{(1)}\cdots v_{i_{k}}^{(k)}\mid\|v^{(i)}\|\leq 1,\,i=1,\ldots,k\Big\}.
If T is a symmetric order k tensor, meaning that T_{i_{1},\ldots,i_{k}} is invariant to permutations of the indices (for example, \nabla^{{k}}f(x) is always symmetric), then Zhang et al. [43, Thm. 2.1] show that
\|T\|_{\rm op}=\sup_{\|v\|=1}\big|\langle T,v^{\otimes k}\rangle\big|,~~~\mbox{where}~~~v^{\otimes k}=\underbrace{v\otimes v\otimes\cdots\otimes v}_{k~{\rm times}}.  (2)
For vectors, the Euclidean and operator norms are identical.
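To make the tensor notation concrete, here is a small numpy check (our own example, not from the paper): for f(x)=\langle a,x\rangle^{3}, the third derivative is the symmetric rank-one tensor 6\,a\otimes a\otimes a, and by identity (2) its operator norm 6\|a\|^{3} is attained at v=a/\|a\|.

```python
import numpy as np

# f(x) = <a, x>^3 has third-derivative tensor 6 * (a outer a outer a),
# a symmetric order-3 tensor; its operator norm 6 * ||a||^3 is attained
# at the unit vector v = a / ||a||.
a = np.array([1.0, 2.0, 2.0])                 # ||a|| = 3
T = 6.0 * np.einsum("i,j,k->ijk", a, a, a)

def apply_tensor(T, vs):
    """Evaluate the multilinear form <T, v1 (x) v2 (x) v3> by contraction."""
    return float(np.einsum("ijk,i,j,k->", T, *vs))

v = a / np.linalg.norm(a)
value_at_maximizer = apply_tensor(T, [v, v, v])   # 6 * 3^3 = 162
```

Evaluating the form at other unit vectors (e.g. a standard basis vector) gives strictly smaller values, consistent with the supremum in (2).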
For any n\in\mathbb{N}, we let [n]:=\{1,\ldots,n\} denote the set of positive integers less than or equal to n. We let \mathcal{C}^{\infty} denote the set of infinitely differentiable functions. We denote the ith standard basis vector by e^{(i)}, and let I_{d}\in\mathbb{R}^{d\times d} denote the d\times d identity matrix; we drop the subscript d when it is clear from context. For any set \mathcal{S} and functions g,h:\mathcal{S}\to\left[{0},{\infty}\right) we write g\lesssim h or g=O(h) if there exists c>0 such that g(s)\leq c\cdot h(s) for every s\in\mathcal{S}. We write g=\widetilde{O}\left(h\right) if g\lesssim h\log(h+2).
2 Preliminaries
We begin our development with definitions of the classes of functions (§ 2.1), classes of algorithms (§ 2.2), and notions of complexity (§ 2.3) that we study.
2.1 Function classes
Measures of function regularity are crucial for the design and analysis of optimization algorithms [34, 9, 32]. We focus on two types of regularity conditions: Lipschitzian properties of derivatives and bounds on function value.
We first list a few equivalent definitions of Lipschitz continuity. A function f:\mathbb{R}^{d}\to\mathbb{R} has L_{p}-Lipschitz pth order derivatives if it is p times continuously differentiable, and for every x\in\mathbb{R}^{d} and direction v\in\mathbb{R}^{d}, \|v\|\leq 1, the directional projection f_{x,v}(t):=f(x+t\cdot v) of f, defined for t\in\mathbb{R}, satisfies
\left|f_{x,v}^{(p)}(t)-f_{x,v}^{(p)}(t^{\prime})\right|\leq L_{p}\left|t-t^{\prime}\right|
for all t,t^{\prime}\in\mathbb{R}, where f_{x,v}^{(p)}(\cdot) is the pth derivative of t\mapsto f_{x,v}(t). If f is p+1 times continuously differentiable, this is equivalent to requiring
\left|f_{x,v}^{(p+1)}(0)\right|\leq L_{p}~~~\mbox{or}~~~\|\nabla^{{p+1}}f(x)\|_{\rm op}\leq L_{p}
for all x,v\in\mathbb{R}^{d}, \|v\|\leq 1. We occasionally refer to a function with Lipschitz pth order derivatives as pth-order smooth.
Complexity guarantees for finding stationary points of nonconvex functions f typically depend on the function value bound f(x^{(0)})-\inf_{x}f(x), where x^{(0)} is a prespecified point. Without loss of generality, we take the prespecified point to be 0 for the remainder of the paper. With that in mind, we define the following classes of functions.
Definition 1.
Let p\geq 1, \Delta>0 and L_{p}>0. Then the set
\mathcal{F}_{p}(\Delta,L_{p}) 
denotes the union, over d\in\mathbb{N}, of the collection of \mathcal{C}^{\infty} functions f:\mathbb{R}^{d}\to\mathbb{R} with L_{p}-Lipschitz pth derivative and f(0)-\inf_{x}f(x)\leq\Delta.
The function classes \mathcal{F}_{p}(\Delta,L_{p}) include functions on \mathbb{R}^{d} for all d\in\mathbb{N}, following the established study of “dimension free” problems [32, 34]. This definition allows clean presentation of our results: we construct explicit functions f:\mathbb{R}^{d}\to\mathbb{R} that are difficult to optimize, where the dimension d is finite, but our choice of d grows inversely in the desired accuracy of the solution. That \mathcal{F}_{p}(\Delta,L_{p}) contains only \mathcal{C}^{\infty} functions is no real restriction, because our lower bounds become only stronger if we additionally allow less smooth functions.
For our results, we also require the following important invariance notion, proposed (in the context of optimization) by Nemirovski and Yudin [32, Ch. 7.2].
Definition 2 (Orthogonal invariance).
A class of functions \mathcal{F} is orthogonally invariant if for every f\in\mathcal{F}, f:\mathbb{R}^{d}\to\mathbb{R}, and every matrix U\in\mathbb{R}^{d^{\prime}\times d} such that U^{\top}U=I_{d}, the function f_{U}:\mathbb{R}^{d^{\prime}}\to\mathbb{R} defined by f_{U}(x)=f(U^{\top}x) belongs to \mathcal{F}.
Every function class we consider is orthogonally invariant, as f(0)-\inf_{x}f(x)=f_{U}(0)-\inf_{x}f_{U}(x) and f_{U} has the same Lipschitz constants to all orders as f, since their collections of associated directional projections are identical.
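A quick numerical sanity check of Definition 2 (our own sketch, with an arbitrary quadratic f for illustration): for U with orthonormal columns, the chain rule gives \nabla f_{U}(x)=U\nabla f(U^{\top}x), and since U preserves Euclidean norms, gradient norms (and hence Lipschitz constants) carry over from f to f_{U}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime = 3, 5

# U has orthonormal columns, so U^T U = I_d (as in Definition 2)
U, _ = np.linalg.qr(rng.standard_normal((d_prime, d)))

def grad_f(y):
    return y - 1.0                    # gradient of f(y) = ||y - 1||^2 / 2

def grad_fU(x):
    return U @ grad_f(U.T @ x)        # chain rule for f_U(x) = f(U^T x)

x = rng.standard_normal(d_prime)
# U preserves Euclidean norms on its column space, so the gradient norm
# of f_U at x equals that of f at U^T x
same = np.isclose(np.linalg.norm(grad_fU(x)), np.linalg.norm(grad_f(U.T @ x)))
```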
2.2 Algorithm classes
We also require careful definition of the classes of optimization procedures we consider. For any dimension d\in\mathbb{N}, an algorithm \mathsf{A} (also referred to as method or procedure) maps functions f:\mathbb{R}^{d}\to\mathbb{R} to a sequence of iterates in \mathbb{R}^{d}; that is, \mathsf{A} is defined separately for every finite d. We let
\mathsf{A}[f]=\{x^{(t)}\}_{t=1}^{\infty} 
denote the sequence x^{(t)}\in\mathbb{R}^{d} of iterates that \mathsf{A} generates when operating on f.
To model the computational cost of an algorithm, we adopt the information-based complexity framework, which Nemirovski and Yudin [32] develop (see also [40, 1, 10]), and view every iterate x^{(t)} as a query to an information oracle. Typically, one places restrictions on the information the oracle returns (e.g. only the function value and gradient at the query point) and makes certain assumptions on how the algorithm uses this information (e.g. deterministically). Our approach is syntactically different but semantically identical: we build the oracle restriction, along with any other assumption, directly into the structure of the algorithm. To formalize this, we define
\nabla^{{(0,\ldots,p)}}f(x):=\{f(x),{\nabla}f(x),\nabla^{2}f(x),\ldots,\nabla^{{p}}f(x)\}
as shorthand for the response of a pth order oracle to a query at point x. When p=\infty this corresponds to an oracle that reveals all derivatives at x. Our algorithm classes follow.
Deterministic algorithms
For any p\geq 0, a pth-order deterministic algorithm \mathsf{A} operating on f:\mathbb{R}^{d}\to\mathbb{R} is one producing iterates of the form
x^{(i)}=\mathsf{A}^{(i)}\left(\nabla^{{(0,\ldots,p)}}f(x^{(1)}),\ldots,\nabla^{{(0,\ldots,p)}}f(x^{(i-1)})\right)~~\text{for }i\in\mathbb{N},
where \mathsf{A}^{(i)} is a measurable mapping to \mathbb{R}^{d} (the dependence on dimension d is implicit). We denote the class of pth-order deterministic algorithms by \mathcal{A}_{\textnormal{{det}}}^{(p)} and let \mathcal{A}_{\textnormal{{det}}}:=\mathcal{A}_{\textnormal{{det}}}^{(\infty)} denote the class of all deterministic algorithms based on derivative information.
As a concrete example, for any p\geq 1 and L>0 consider the algorithm {\mathsf{REG}_{p,L}}\in\mathcal{A}_{\textnormal{{det}}}^{(p)} that produces iterates by minimizing the sum of a pth order Taylor expansion and an order p+1 proximal term:
x^{(k+1)}:=\mathop{\rm arg\,min}_{x}\bigg\{f(x^{(k)})+\sum_{q=1}^{p}\frac{1}{q!}\langle\nabla^{{q}}f(x^{(k)}),(x-x^{(k)})^{\otimes q}\rangle+\frac{L}{(p+1)!}\|x-x^{(k)}\|^{p+1}\bigg\}.  (3)
For p=1, {\mathsf{REG}_{p,L}} is gradient descent with stepsize 1/L, for p=2 it is cubic-regularized Newton’s method [37], and for general p it is a simplified form of the scheme that Birgin et al. [7] propose.
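A one-dimensional illustration of the model minimization (3) may be helpful (our own sketch: a brute-force grid search stands in for the subproblem solver, and the dictionary of derivatives is a hypothetical interface, not the paper's oracle). For p=1 the minimizer reduces to the gradient step -g/L, matching the identification of {\mathsf{REG}_{1,L}} with gradient descent.

```python
import math
import numpy as np

def reg_step_1d(derivs, L, p, radius=10.0, num=200001):
    """One iterate of the regularized Taylor scheme (3) in one dimension:
    minimize the order-p Taylor model plus (L/(p+1)!)|delta|^{p+1} over
    the step delta, here by brute-force grid search (illustration only).
    derivs[q] holds the qth derivative at the current iterate."""
    delta = np.linspace(-radius, radius, num)
    model = sum(derivs[q] / math.factorial(q) * delta**q for q in range(p + 1))
    model = model + L / math.factorial(p + 1) * np.abs(delta)**(p + 1)
    return float(delta[np.argmin(model)])

# p = 1: the minimizer of g*delta + (L/2)*delta^2 is the gradient step -g/L
step = reg_step_1d({0: 0.0, 1: 3.0}, L=2.0, p=1)   # close to -3/2
```

For p=2 the same routine minimizes the cubic-regularized quadratic model, i.e. a one-dimensional analogue of cubic-regularized Newton.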
Randomized algorithms (and function-informed processes)
A pth-order randomized algorithm \mathsf{A} is a distribution on pth-order deterministic algorithms. We can write any such algorithm as a deterministic algorithm given access to a uniform random variable on [0,1] (i.e. infinitely many random bits). Thus the algorithm operates on f by drawing \xi\sim\mathsf{Uni}[0,1] (independently of f), then producing iterates of the form
x^{(i)}=\mathsf{A}^{(i)}\left(\xi,\nabla^{{(0,\ldots,p)}}f(x^{(1)}),\ldots,\nabla^{{(0,\ldots,p)}}f(x^{(i-1)})\right)~~\text{for }i\in\mathbb{N},  (4)
where \mathsf{A}^{(i)} are measurable mappings into \mathbb{R}^{d}. In this case, \mathsf{A}[f] is a random sequence, and we call a random process \{x^{(t)}\}_{t\in\mathbb{N}} informed by f if it has the same law as \mathsf{A}[f] for some randomized algorithm \mathsf{A}. We let \mathcal{A}_{\textnormal{{rand}}}^{(p)} denote the class of pth-order randomized algorithms and \mathcal{A}_{\textnormal{{rand}}}:=\mathcal{A}_{\textnormal{{rand}}}^{(\infty)} denote the class of randomized algorithms that use derivative-based information.
Zero-respecting sequences and algorithms
While deterministic and randomized algorithms are the natural collections for which we prove lower bounds, it is useful to define an additional, structurally restricted class. This class forms the backbone of our lower bound strategy (Sec. 3), as it is both ‘small’ enough to uniformly underperform on a single function, and ‘large’ enough to imply lower bounds on the natural algorithm classes.
For v\in\mathbb{R}^{d} we let \mathop{\mathrm{supp}}\left\{v\right\}:=\{i\in[d]\mid v_{i}\neq 0\} denote the support (nonzero indices) of v. We extend this to tensors as follows. Let T\in\mathbb{R}^{\otimes^{k}{d}} be an order k tensor, and for i\in\{1,\ldots,d\} let T_{i}\in\mathbb{R}^{\otimes^{k-1}{d}} be the order (k-1) tensor defined by [T_{i}]_{j_{1},\ldots,j_{k-1}}=T_{i,j_{1},\ldots,j_{k-1}}. With this notation, we define
\mathop{\mathrm{supp}}\left\{T\right\}:=\{i\in\{1,\dots,d\}\mid T_{i}\neq 0\}. 
Then for p\in\mathbb{N} and any f:\mathbb{R}^{d}\to\mathbb{R}, we say that the sequence x^{(1)},x^{(2)},\ldots is pth-order zero-respecting with respect to f if
\mathop{\mathrm{supp}}\left\{x^{(t)}\right\}\subseteq\bigcup_{q\in[p]}\bigcup_{s<t}\mathop{\mathrm{supp}}\left\{\nabla^{{q}}f(x^{(s)})\right\}~~\mbox{for each }t\in\mathbb{N}.  (5)
The definition (5) says that x^{(t)}_{i}=0 if all partial derivatives involving the ith coordinate of f (up to the pth order) are zero. For p=1, this definition is equivalent to the requirement that for every t and j\in[d], if {\nabla}_{j}f(x^{(s)})=0 for s<t, then x^{(t)}_{j}=0. The requirement (5) implies that x^{(1)}=0.
An algorithm \mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}} is pth-order zero-respecting if for any f:\mathbb{R}^{d}\to\mathbb{R}, the (potentially random) iterate sequence \mathsf{A}[f] is pth-order zero-respecting with respect to f. Informally, an algorithm is zero-respecting if it never explores coordinates which appear not to affect the function. When initialized at the origin, most common first- and second-order optimization methods are zero-respecting, including gradient descent (with and without Nesterov acceleration), conjugate gradient [22], BFGS and L-BFGS [26, 38] (provided the initial Hessian approximation is a diagonal matrix, as is typical), Newton’s method (with and without cubic regularization [37]), and trust-region methods [21]. We denote the class of pth-order zero-respecting algorithms by \mathcal{A}_{\textnormal{{zr}}}^{(p)}, and let \mathcal{A}_{\textnormal{{zr}}}:=\mathcal{A}_{\textnormal{{zr}}}^{(\infty)}.
In the literature on lower bounds for first-order convex optimization, it is common to assume that methods only query points in the span of the gradients they observe [34, 3]. Our notion of zero-respecting algorithms generalizes this assumption to higher-order methods, but even first-order zero-respecting algorithms are slightly more general. For example, coordinate descent methods [35] are zero-respecting, but they generally do not remain in the span of the gradients.
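The p=1 case of condition (5) is easy to check mechanically. The following sketch (a hypothetical helper of our own, for illustration) flags an iterate sequence that touches a coordinate before any observed gradient was nonzero there.

```python
import numpy as np

def is_first_order_zero_respecting(grad, iterates):
    """Check the p = 1 case of condition (5): coordinate j of x^(t) may
    be nonzero only if grad_j f(x^(s)) != 0 for some earlier query s < t."""
    revealed = set()                  # coordinates seen in earlier gradients
    for x in iterates:
        x = np.asarray(x, dtype=float)
        if any(x[j] != 0 for j in range(len(x)) if j not in revealed):
            return False
        g = np.asarray(grad(x))       # one oracle query
        revealed.update(j for j in range(len(g)) if g[j] != 0)
    return True

# gradient of f(x) = ||x - e1||^2 / 2; only the first coordinate is revealed
grad = lambda x: x - np.array([1.0, 0.0])
ok = is_first_order_zero_respecting(grad, [np.zeros(2), np.array([1.0, 0.0])])
bad = is_first_order_zero_respecting(grad, [np.zeros(2), np.array([0.0, 1.0])])
```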
2.3 Complexity measures
With the definitions of function and algorithm class in hand, we turn to formalizing our notion of complexity: what is the best performance an algorithm in class \mathcal{A} can achieve for all functions in class \mathcal{F}? As we consider finding stationary points of f, the natural performance measure is the number of iterations (oracle queries) required to find a point x such that \|\nabla f(x)\|\leq\epsilon. Thus for a deterministic sequence \{x^{(t)}\}_{t\in\mathbb{N}} we define
\mathsf{T}_{\epsilon}\big{(}\{x^{(t)}\}_{t\in\mathbb{N}},f\big{)}:=\inf\left\{t\in\mathbb{N}\mid\big\|{\nabla}f(x^{(t)})\big\|\leq\epsilon\right\},
and refer to it as the complexity of \{x^{(t)}\}_{t\in\mathbb{N}} on f. As we consider randomized algorithms as well, for a random process \{x^{(t)}\}_{t\in\mathbb{N}} with distribution {\mathbb{P}}, we define
\mathsf{T}_{\epsilon}\big{(}{\mathbb{P}},f\big{)}:=\inf\left\{t\in\mathbb{N}\mid{\mathbb{P}}\left(\big\|{\nabla}f(x^{(s)})\big\|>\epsilon~\mbox{for all }s\leq t\right)\leq\frac{1}{2}\right\},  (6)
where the randomness is over x^{(t)}, according to {\mathbb{P}}. The complexity \mathsf{T}_{\epsilon}\big{(}{\mathbb{P}},f\big{)} is also the median of the random variable \mathsf{T}_{\epsilon}\big{(}\{x^{(t)}\}_{t\in\mathbb{N}},f\big{)}. By Markov’s inequality, definition (6) lower bounds expectation-based alternatives, as
\inf\left\{t\in\mathbb{N}\mid{\mathbb{E}}\,\big\|{\nabla}f(x^{(t)})\big\|\leq\epsilon\right\}\geq\mathsf{T}_{2\epsilon}\big{(}{\mathbb{P}},f\big{)}~~\mbox{and}~~{\mathbb{E}}\,\mathsf{T}_{\epsilon}\big{(}\{x^{(t)}\}_{t\in\mathbb{N}},f\big{)}\geq\frac{1}{2}\mathsf{T}_{\epsilon}\big{(}{\mathbb{P}},f\big{)}.
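The first of these inequalities follows from Markov’s inequality: if {\mathbb{E}}\,\|{\nabla}f(x^{(t)})\|\leq\epsilon for some t, then

```latex
{\mathbb{P}}\big(\|\nabla f(x^{(s)})\|>2\epsilon~\text{for all }s\leq t\big)
\leq{\mathbb{P}}\big(\|\nabla f(x^{(t)})\|>2\epsilon\big)
\leq\frac{{\mathbb{E}}\,\|\nabla f(x^{(t)})\|}{2\epsilon}
\leq\frac{1}{2},
```

so that \mathsf{T}_{2\epsilon}\big{(}{\mathbb{P}},f\big{)}\leq t by definition (6).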
To measure the performance of algorithm \mathsf{A} on function f, we evaluate the iterates it produces from f, and with mild abuse of notation, we define
\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}:=\mathsf{T}_{\epsilon}\big{(}\mathsf{A}[f],f\big{)}
as the complexity of \mathsf{A} on f. With this setup, we define the complexity of algorithm class \mathcal{A} on function class \mathcal{F} as
\mathcal{T}_{\epsilon}\big{(}\mathcal{A},\mathcal{F}\big{)}:=\inf_{\mathsf{A}\in\mathcal{A}}\sup_{f\in\mathcal{F}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}.  (7)
Many algorithms guarantee “dimension-independent” convergence [34] and thus provide upper bounds for the quantity (7). A careful tracing of constants in the analysis of Birgin et al. [7] implies that the generalized regularization scheme {\mathsf{REG}_{p,L}} defined by the recursion (3) guarantees
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)}\cap\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}_{p}(\Delta,L_{p})\big{)}\leq\sup_{f\in\mathcal{F}_{p}(\Delta,L_{p})}\mathsf{T}_{\epsilon}\big{(}{\mathsf{REG}_{p,L_{p}}},f\big{)}\lesssim\Delta L_{p}^{1/p}\epsilon^{-(1+p)/p}  (8)
for all p\in\mathbb{N}. In this paper we prove these rates are sharp to within (p-dependent) constant factors.
While definition (7) is our primary notion of complexity, our proofs provide bounds on smaller quantities than (7) that also carry meaning. For zero-respecting algorithms, we exhibit a single function f and bound \inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)} from below, in effect interchanging the \inf and \sup in (7). This implies that all zero-respecting algorithms share a common vulnerability. For randomized algorithms, we exhibit a distribution P supported on functions of a fixed dimension d, and we lower bound the average \inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}}\int\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}dP(f), bounding the distributional complexity [32, 10], which is never greater than worst-case complexity (and is equal for randomized and deterministic algorithms). Even randomized algorithms share a common vulnerability: functions drawn from P.
3 Anatomy of a lower bound
In this section we present a generic approach to proving lower bounds for optimization algorithms. The basic techniques we use are well-known and applied extensively in the literature on lower bounds for convex optimization [32, 34, 41, 4]. However, here we generalize and abstract away these techniques, showing how they apply to higher-order methods, nonconvex functions, and various optimization goals (e.g. \epsilon-stationarity, \epsilon-optimality).
We begin by defining zero-chain functions, which generalize a construction due to Nesterov [34]. Zero-chains limit the rate at which zero-respecting algorithms (Sec. 2.2) gather information (Observation 1). This immediately suggests a strategy for proving lower bounds on the complexity of zero-respecting algorithms. By using the orthogonal invariance of our function classes, we construct a resisting oracle [32, Ch. 7.2] (Proposition 1) that reduces any deterministic algorithm to a zero-respecting one. While this approach fails for randomized algorithms, we discuss how to make the basic ideas “robust” so that they apply to any randomized algorithm and local information oracle.
3.1 Zero-chains
Nesterov [34, Chapter 2.1.2] proves lower bounds for smooth convex optimization problems using the “chain-like” quadratic function
f(x):=\frac{1}{2}(x_{1}-1)^{2}+\frac{1}{2}\sum_{i=1}^{d-1}(x_{i}-x_{i+1})^{2},  (9)
which he calls the “worst function in the world.” The important property of f is that for every i\in[d], {\nabla}_{i}f(x)=0 whenever x_{i-1}=x_{i}=x_{i+1}=0 (with x_{0}:=1 and x_{d+1}:=0). Thus, if we “know” only the first t-1 coordinates of f, i.e. are able to query only vectors x such that x_{t}=x_{t+1}=\cdots=x_{d}=0, then any x we query satisfies {\nabla}_{s}f(x)=0 for s>t; we only “discover” a single new coordinate t. We generalize this chain structure to higher-order derivatives as follows.
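The single-coordinate-discovery property is easy to verify numerically. A minimal sketch (our own code, first-order case only) implements the gradient of (9) and shows that iterates built from observed gradients extend their support by at most one coordinate per query.

```python
import numpy as np

def chain_grad(x):
    """Gradient of Nesterov's chain function (9),
    f(x) = (x_1 - 1)^2 / 2 + sum_{i < d} (x_i - x_{i+1})^2 / 2."""
    g = np.zeros(len(x))
    g[0] += x[0] - 1.0            # from the (x_1 - 1)^2 / 2 term
    diff = x[:-1] - x[1:]         # x_i - x_{i+1}
    g[:-1] += diff                # each squared difference contributes
    g[1:] -= diff                 # to two adjacent coordinates
    return g

# Starting from 0 and stepping only along observed gradients, the support
# of the iterates grows by at most one coordinate per query:
x = np.zeros(5)
for t in range(3):
    x = x - 0.5 * chain_grad(x)   # after 3 queries, x_4 and x_5 are still 0
```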
Definition 3.
For p\in\mathbb{N}, a function f:\mathbb{R}^{d}\rightarrow\mathbb{R} is a pth-order zero-chain if for every x\in\mathbb{R}^{d},
\mathop{\mathrm{supp}}\left\{x\right\}\subseteq\{1,\dots,i-1\}~~\mbox{implies}~~\bigcup_{q\in[p]}{\mathop{\mathrm{supp}}\left\{\nabla^{{q}}f(x)\right\}}\subseteq\{1,\dots,i\}.
We say f is a zero-chain if it is a pth-order zero-chain for every p\in\mathbb{N}.
In our terminology, Nesterov’s function (9) is a first-order zero-chain but not a second-order zero-chain, as \mathop{\mathrm{supp}}\left\{\nabla^{{2}}f(0)\right\}=[d]. Informally, at a point for which x_{i-1}=x_{i}=\cdots=x_{d}=0, a zero-chain appears constant in x_{i},x_{i+1},\ldots,x_{d}. Zero-chains structurally limit the rate at which zero-respecting algorithms acquire information from derivatives. We formalize this in the following observation, whose proof is a straightforward induction.
Observation 1.
Let f:\mathbb{R}^{d}\rightarrow\mathbb{R} be a pth-order zero-chain and let x^{(1)}=0,x^{(2)},\ldots be a pth-order zero-respecting sequence with respect to f. Then x^{(t)}_{j}=0 for j\geq t and all t\leq d.
Proof.
We show by induction on k that \mathop{\mathrm{supp}}\left\{x^{(t)}\right\}\subseteq[t-1] for every t\leq k; the case k=d is the required result. The case k=1 holds since x^{(1)}=0. If the hypothesis holds for some k<d, then by Definition 3 we have \cup_{q\in[p]}{\mathop{\mathrm{supp}}\left\{\nabla^{{q}}f(x^{(t)})\right\}}\subseteq\{1,\dots,t\} for every t\leq k. Therefore, by the zero-respecting property (5), we have \mathop{\mathrm{supp}}\left\{x^{(k+1)}\right\}\subseteq\cup_{q\in[p]}\cup_{t<k+1}\mathop{\mathrm{supp}}\left\{\nabla^{{q}}f(x^{(t)})\right\}\subseteq[k], completing the induction. ∎
3.2 A lower bound strategy
The preceding discussion shows that zero-respecting algorithms take many iterations to “discover” all the coordinates of a zero-chain. Therefore, for any function class \mathcal{F} and p,T\in\mathbb{N}, to show that \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}\big{)}>T it suffices to find f_{\epsilon}:\mathbb{R}^{T}\rightarrow\mathbb{R} such that

(i) f_{\epsilon} is a pth-order zero-chain,

(ii) f_{\epsilon} belongs to the function class, i.e. f_{\epsilon}\in\mathcal{F}, and

(iii) \left\|{\nabla}f_{\epsilon}(x)\right\|>\epsilon for every x such that x_{T}=0.^{3}

^{3} We can readily adapt this property for lower bounds on other termination criteria, e.g. require f(x)-\inf_{y}f(y)>\epsilon for every x such that x_{T}=0.
For \mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)} and \{x^{(t)}\}_{t\in\mathbb{N}}=\mathsf{A}[f_{\epsilon}] we have by Observation 1 that x^{(t)}_{T}=0 for all t\leq T, and the large gradient property (iii) then implies \left\|{\nabla}f_{\epsilon}(x^{(t)})\right\|>\epsilon for all t\leq T. Therefore \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{\epsilon}\big{)}>T, and since this holds for any \mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)} we have
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}% \big{)}=\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)}}\sup_{f\in% \mathcal{F}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}\geq\sup_{f\in% \mathcal{F}}\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)}}\mathsf{T% }_{\epsilon}\big{(}\mathsf{A},f\big{)}\geq\inf_{\mathsf{A}\in\mathcal{A}_{% \textnormal{{zr}}}^{(p)}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{\epsilon}% \big{)}>T. 
If f is a zero-chain, then so is the function x\mapsto\mu f(x/\sigma) for any multiplier \mu and scale parameter \sigma. This is useful for our development, as we construct zero-chains \{g_{T}\}_{T\in\mathbb{N}} such that \left\|{\nabla}g_{T}(x)\right\|>c for every x with x_{T}=0 and some constant c>0. By setting f_{\epsilon}(x)=\mu g_{T}(x/\sigma), then choosing T, \mu, and \sigma to satisfy conditions (ii) and (iii), we obtain a lower bound. As our choice of T is also the final lower bound, it must grow to infinity as \epsilon tends to zero. Thus, the hard functions we construct are fundamentally high-dimensional, making this strategy suitable only for dimension-free lower bounds.
3.3 From deterministic to zero-respecting algorithms
Zero-chains allow us to generate strong lower bounds for zero-respecting algorithms. The following reduction shows that these lower bounds are valid for deterministic algorithms as well.
Proposition 1.
Let p\geq 1 and \mathcal{F} be an orthogonally invariant function class. Then
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F% }\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)}% ,\mathcal{F}\big{)} 
and
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}},\mathcal{F}\big{% )}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}},\mathcal{F}% \big{)}. 
The proof of Proposition 1, which we detail in Appendix A, builds on the classical notion of a “resisting oracle” [32, 34], which we briefly sketch here. Let \mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}, and let f\in\mathcal{F}, f:\mathbb{R}^{d}\to\mathbb{R}. We may sequentially construct an orthogonal matrix U\in\mathbb{R}^{d^{\prime}\times d} (for some finite d^{\prime}>d) such that, for the function f_{U}(z):=f(U^{\top}z)\in\mathcal{F}, the sequence U^{\top}\mathsf{A}[f_{U}]\subset\mathbb{R}^{d} is zero-respecting with respect to f. We do this by choosing the columns of U to be orthogonal to components in \mathsf{A}[f_{U}] that would otherwise violate the zero-respecting property of U^{\top}\mathsf{A}[f_{U}]. Thus, there exists an algorithm \mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{det}}}\cap\mathcal{A}_{\textnormal{{zr}}} such that \mathsf{Z}_{\mathsf{A}}[f]=U^{\top}\mathsf{A}[f_{U}], implying \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}=\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{A}},f\big{)}. Therefore,
\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}}\sup_{f\in\mathcal{F}}{% \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}}\geq\inf_{\mathsf{A}\in% \mathcal{A}_{\textnormal{{det}}}}\sup_{f\in\mathcal{F},U}{\mathsf{T}_{\epsilon% }\big{(}\mathsf{A},f_{U}\big{)}}=\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{% det}}}}\sup_{f\in\mathcal{F}}{\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{% A}},f\big{)}}\geq\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}}\sup_{f\in% \mathcal{F}}{\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}}, 
giving Proposition 1.
The adversarial rotation argument that yields Proposition 1 is more or less apparent in the proofs of previous lower bounds in convex optimization [32, 41, 4] for deterministic algorithms. We believe it is instructive to separate the proof of lower bounds on \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}},\mathcal{F}\big{)} and the reduction from \mathcal{A}_{\textnormal{{det}}} to \mathcal{A}_{\textnormal{{zr}}}, as the latter holds in great generality. Indeed, Proposition 1 holds for any complexity measure \mathsf{T}_{\epsilon}\big{(}\cdot,\cdot\big{)} that satisfies

Orthogonal invariance: for every f:\mathbb{R}^{d}\to\mathbb{R}, every U\in\mathbb{R}^{d^{\prime}\times d} such that U^{\top}U=I_{d} and every sequence \{z^{(t)}\}_{t\in\mathbb{N}}\subset\mathbb{R}^{d^{\prime}}, we have
\mathsf{T}_{\epsilon}\big{(}\{z^{(t)}\}_{t\in\mathbb{N}},f(U^{\top}\cdot)\big{% )}=\mathsf{T}_{\epsilon}\big{(}\{U^{\top}z^{(t)}\}_{t\in\mathbb{N}},f\big{)}. 
“Stopping time” invariance: for any T_{0}\in\mathbb{N}, if \mathsf{T}_{\epsilon}\big{(}\{x^{(t)}\}_{t\in\mathbb{N}},f\big{)}\leq T_{0} then \mathsf{T}_{\epsilon}\big{(}\{x^{(t)}\}_{t\in\mathbb{N}},f\big{)}=\mathsf{T}_{% \epsilon}\big{(}\{\hat{x}^{(t)}\}_{t\in\mathbb{N}},f\big{)} for any sequence \{\hat{x}^{(t)}\}_{t\in\mathbb{N}} such that \hat{x}^{(t)}=x^{(t)} for t\leq T_{0}.
These properties hold for the typical performance measures used in optimization. Examples include time to \epsilon-optimality, in which case \mathsf{T}_{\epsilon}\big{(}\{x^{(t)}\}_{t\in\mathbb{N}},f\big{)}=\inf\{t\in\mathbb{N}\mid f(x^{(t)})-\inf_{x}f(x)\leq\epsilon\}, and the second-order stationarity desired in many nonconvex optimization problems [13, 23], where for \epsilon_{1},\epsilon_{2}>0 we define \mathsf{T}_{\epsilon}\big{(}\{x^{(t)}\}_{t\in\mathbb{N}},f\big{)}=\inf\{t\in\mathbb{N}\mid\|{\nabla}f(x^{(t)})\|\leq\epsilon_{1}\text{ and }\nabla^{2}f(x^{(t)})\succeq-\epsilon_{2}I\}.
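Both examples fit the same “first time the criterion holds” template. The sketch below (function names are ours, not from the paper) implements the time-to-\epsilon-stationarity measure for a finite iterate sequence; orthogonal invariance and stopping-time invariance are immediate for it.

```python
import numpy as np

def stopping_time(iterates, grad, eps):
    # T_eps({x^(t)}, f) = inf{ t : ||grad f(x^(t))|| <= eps },
    # with 1-indexed iterations; returns None if no iterate qualifies.
    for t, x in enumerate(iterates, start=1):
        if np.linalg.norm(grad(x)) <= eps:
            return t
    return None

# Example: f(x) = ||x||^2 / 2, so grad f(x) = x, on a shrinking sequence.
iters = [np.ones(3) / 2 ** k for k in range(10)]
t = stopping_time(iters, lambda x: x, eps=0.1)
# ||x^(t)|| = sqrt(3)/2^(t-1), which first drops below 0.1 at t = 6.
```

Changing any iterate after index t leaves the value unchanged (stopping-time invariance), and replacing the iterates by U^{\top}z^{(t)} while rotating f correspondingly also leaves it unchanged (orthogonal invariance).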
3.4 Randomized algorithms
Proposition 1 does not apply to randomized algorithms, as it requires the adversary (the maximizing choice of f) to simulate the action of \mathsf{A} on f. To handle randomized algorithms, we strengthen the notion of a zero-chain as follows.
Definition 4.
A function f:\mathbb{R}^{d}\rightarrow\mathbb{R} is a robust zero-chain if for every x\in\mathbb{R}^{d},
|x_{j}|<1/2,\ \forall j\geq i~{}~{}\mbox{implies}~{}~{}f(y)=f(y_{1},\ldots,y_{i},0,\ldots,0)~{}~{}\mbox{for all $y$ in a neighborhood of }x. 
A robust zero-chain is also an “ordinary” zero-chain. In Section 5 we replace the adversarial rotation U of § 3.3 with an orthogonal matrix drawn uniformly at random, and consider the random function f_{U}(x)=f(U^{\top}x), where f is a robust zero-chain. Adapting a lemma of Woodworth and Srebro [41] to our setting, we show that for every \mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}, \mathsf{A}[f_{U}] satisfies an approximate form of Observation 1 (w.h.p.) whenever the iterates \mathsf{A}[f_{U}] have bounded norm. With further modification of f_{U} to handle unbounded iterates, our zero-chain strategy yields a strong distributional complexity lower bound on \mathcal{A}_{\textnormal{{rand}}}.
4 Lower bounds for zero-respecting and deterministic algorithms
For our first main results, we provide lower bounds on the complexity of all deterministic algorithms for finding stationary points of smooth, potentially nonconvex functions. By § 3.2 and Proposition 1, to prove a lower bound on deterministic algorithms it suffices to construct a function that is difficult for zero-respecting algorithms. For fixed T\in\mathbb{N}, we define the (unscaled) hard instance \bar{f}_{T}:\mathbb{R}^{d}\to\mathbb{R} as
\bar{f}_{T}(x)=-\Psi\left(1\right)\Phi\left(x_{1}\right)+\sum_{i=2}^{T}\left[\Psi\left(-x_{i-1}\right)\Phi\left(-x_{i}\right)-\Psi\left(x_{i-1}\right)\Phi\left(x_{i}\right)\right],  (10) 
where the component functions are
\Psi(x):=\begin{cases}0&x\leq 1/2\\ \exp\left(1-\frac{1}{\left(2x-1\right)^{2}}\right)&x>1/2\end{cases}~{}~{}\mbox{and}~{}~{}\Phi(x)=\sqrt{e}\int_{-\infty}^{x}e^{-\frac{1}{2}t^{2}}dt. 
We illustrate the construction in Figure 1. The choices of \Psi and \Phi are not particularly special; any pair of increasing, bounded, \mathcal{C}^{\infty} functions for which \Psi(x)=0 for x below some constant yields similar results.
Our construction has two key properties. The first is that \bar{f}_{T} is a zero-chain (Observation 2 in the sequel). The second, as we show in Lemma 2, is that \left\|{\nabla}\bar{f}_{T}(x)\right\| is large unless |x_{i}|\geq 1 for every i\in[T]. These properties make it hard for any zero-respecting method to find a stationary point of scaled versions of \bar{f}_{T}, and coupled with Proposition 1, this gives a lower bound for deterministic algorithms.
4.1 Properties of the hard instance
Before turning to the main theorem of this section, we catalogue the important properties of the functions \Psi, \Phi and \bar{f}_{T}.
Lemma 1.
The functions \Psi and \Phi satisfy the following.

(i) For all x\leq\frac{1}{2} and all k\in\mathbb{N}, \Psi^{(k)}(x)=0.

(ii) For all x\geq 1 and |y|<1, \Psi(x)\Phi^{\prime}(y)>1.

(iii) Both \Psi and \Phi are infinitely differentiable, and for all k\in\mathbb{N} we have
\sup_{x}|\Psi^{(k)}(x)|\leq\exp\left(\frac{5k}{2}\log(4k)\right)~{}~{}\mbox{and}~{}~{}\sup_{x}|\Phi^{(k)}(x)|\leq\exp\left(\frac{3k}{2}\log\frac{3k}{2}\right). 
(iv) The functions and derivatives \Psi,\Psi^{\prime},\Phi and \Phi^{\prime} are non-negative and bounded, with
0\leq\Psi<e,~{}~{}0\leq\Psi^{\prime}\leq\sqrt{54/e},~{}~{}0<\Phi<\sqrt{2\pi e},~{}~{}\mbox{and}~{}~{}0<\Phi^{\prime}\leq\sqrt{e}.
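These claims are straightforward to spot-check numerically. The sketch below (a sanity check of ours, not a proof) codes \Psi directly, evaluates \Phi via the error function, uses the closed form \Phi'(x)=\sqrt{e}\,e^{-x^{2}/2}, and verifies parts i, ii, and iv on a grid.

```python
import math

def Psi(x):
    # Psi from Eq. (10): identically 0 on (-inf, 1/2], smooth everywhere.
    return 0.0 if x <= 0.5 else math.exp(1.0 - 1.0 / (2.0 * x - 1.0) ** 2)

def Phi(x):
    # Phi(x) = sqrt(e) * integral_{-inf}^x exp(-t^2/2) dt, via erf.
    return math.sqrt(math.e) * math.sqrt(math.pi / 2.0) * (1.0 + math.erf(x / math.sqrt(2.0)))

def dPhi(x):
    # Phi'(x) = sqrt(e) * exp(-x^2/2).
    return math.sqrt(math.e) * math.exp(-x * x / 2.0)

# (i) Psi vanishes at and below 1/2.
assert Psi(0.5) == 0.0 and Psi(-10.0) == 0.0
# (ii) Psi(x) * Phi'(y) > 1 for x >= 1 and |y| < 1, checked on a grid.
grid = [k / 50.0 for k in range(-49, 50)]            # y values in (-1, 1)
assert min(Psi(x) * dPhi(y) for x in (1.0, 2.0, 5.0) for y in grid) > 1.0
# (iv) boundedness: 0 <= Psi < e and 0 < Phi' <= sqrt(e).
assert all(0.0 <= Psi(x) < math.e for x in grid + [10.0, 100.0])
assert all(0.0 < dPhi(x) <= math.sqrt(math.e) for x in grid)
```

The tightest case of part ii is x=1 and |y| near 1, where \Psi(1)\Phi'(y)=e^{(1-y^{2})/2} approaches 1 from above.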
The key consequence of Lemma 1.i is that the function \bar{f}_{T} is a robust zero-chain (see Definition 4) and consequently also a zero-chain (Definition 3):
Observation 2.
For any j>1, if |x_{j-1}|,|x_{j}|<1/2 then \bar{f}_{T}(y)=\bar{f}_{T}(y_{1},\ldots,y_{j-1},0,y_{j+1},\ldots,y_{T}) for all y in a neighborhood of x.
Applying Observation 2 for j=i+1,\ldots,T shows that \bar{f}_{T} is a robust zero-chain in the sense of Definition 4. Taking derivatives of \bar{f}_{T}(x_{1},\ldots,x_{i},0,\ldots,0) with respect to x_{j}, j>i, shows that \bar{f}_{T} is also a zero-chain in the sense of Definition 3. Observation 1 then shows that any zero-respecting algorithm operating on \bar{f}_{T} requires at least T+1 iterations to find a point where x_{T}\neq 0.
Next, we establish the “large gradient property” that \nabla\bar{f}_{T}(x) must be large if any coordinate of x is near zero.
Lemma 2.
If |x_{i}|<1 for any i\leq T, then there exists j\leq i such that |x_{j}|<1 and
\left\|{\nabla}\bar{f}_{T}(x)\right\|\geq\left|\frac{{\partial}}{{\partial}x_{j}}\bar{f}_{T}(x)\right|>1. 
Proof.
We take j\leq i to be the smallest j for which |x_{j}|<1, so that |x_{j-1}|\geq 1 (where we use the shorthand x_{0}\equiv 1). Therefore, we have
\displaystyle\frac{{\partial}\bar{f}_{T}}{{\partial}x_{j}}(x)  \displaystyle=-\Psi\left(-x_{j-1}\right)\Phi^{\prime}\left(-x_{j}\right)-\Psi\left(x_{j-1}\right)\Phi^{\prime}\left(x_{j}\right)-\Psi^{\prime}\left(-x_{j}\right)\Phi\left(-x_{j+1}\right)-\Psi^{\prime}\left(x_{j}\right)\Phi\left(x_{j+1}\right)  (11)  
\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}-\Psi\left(-x_{j-1}\right)\Phi^{\prime}\left(-x_{j}\right)-\Psi\left(x_{j-1}\right)\Phi^{\prime}\left(x_{j}\right)\stackrel{{\scriptstyle(ii)}}{{=}}-\Psi(|x_{j-1}|)\Phi^{\prime}\left(x_{j}\operatorname{sign}(x_{j-1})\right)\stackrel{{\scriptstyle(iii)}}{{<}}-1. 
In the chain of inequalities, inequality (i) follows because \Psi^{\prime}(x)\Phi(y)\geq 0 for every x,y; equality (ii) follows because \Psi(x)=0 for x\leq 1/2, so that at most one of the two remaining terms is nonzero; and inequality (iii) follows from Lemma 1.ii, applied with |x_{j-1}|\geq 1 and |x_{j}\operatorname{sign}(x_{j-1})|=|x_{j}|<1. ∎
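The computation above can be checked numerically. The sketch below (ours; \Psi, \Phi and their derivatives are coded directly from their definitions, so the block is self-contained) evaluates the gradient of \bar f_T coordinate by coordinate and confirms on random points that whenever some |x_i|<1, the gradient has a coordinate of magnitude greater than 1.

```python
import math, random

def Psi(x):
    return 0.0 if x <= 0.5 else math.exp(1.0 - 1.0 / (2.0 * x - 1.0) ** 2)

def dPsi(x):
    # Psi'(x) = Psi(x) * 4 / (2x - 1)^3 for x > 1/2, else 0.
    return 0.0 if x <= 0.5 else Psi(x) * 4.0 / (2.0 * x - 1.0) ** 3

def Phi(x):
    return math.sqrt(math.e) * math.sqrt(math.pi / 2.0) * (1.0 + math.erf(x / math.sqrt(2.0)))

def dPhi(x):
    return math.sqrt(math.e) * math.exp(-x * x / 2.0)

def grad_fbar(x):
    # Coordinate j (1-indexed) of grad f_T as in Eq. (11), with x_0 = 1;
    # for j = T the Psi'(.)Phi(x_{j+1}) terms are absent.
    T = len(x)
    g = []
    for j in range(1, T + 1):
        xm = x[j - 2] if j >= 2 else 1.0        # x_{j-1}
        xj = x[j - 1]
        gj = -Psi(-xm) * dPhi(-xj) - Psi(xm) * dPhi(xj)
        if j < T:
            xp = x[j]                           # x_{j+1}
            gj += -dPsi(-xj) * Phi(-xp) - dPsi(xj) * Phi(xp)
        g.append(gj)
    return g

random.seed(0)
T = 6
for _ in range(100):
    x = [random.uniform(-3, 3) for _ in range(T)]
    if any(abs(xi) < 1 for xi in x):            # hypothesis of Lemma 2
        assert max(abs(gj) for gj in grad_fbar(x)) > 1.0
```

At x=0 only the first coordinate of the gradient is nonzero (it equals -\sqrt{e}), illustrating the zero-chain property alongside the large-gradient property.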
Finally, we verify that \bar{f}_{T} meets the smoothness and boundedness requirements of the function classes we consider.
Lemma 3.
The function \bar{f}_{T} satisfies the following.

(i) We have \bar{f}_{T}(0)-\inf_{x}\bar{f}_{T}(x)\leq 12T.

(ii) For all x\in\mathbb{R}^{d}, \left\|{\nabla}\bar{f}_{T}(x)\right\|\leq 23\sqrt{T}.

(iii) For every p\geq 1, the pth order derivatives of \bar{f}_{T} are \ell_{p}-Lipschitz continuous, where \ell_{p}\leq\exp(\frac{5}{2}p\log p+cp) for a numerical constant c<\infty.
4.2 Lower bounds for zero-respecting and deterministic algorithms
We can now state and prove a lower bound for finding stationary points of pth order smooth functions using full derivative information and zero-respecting algorithms (the class \mathcal{A}_{\textnormal{{zr}}}). Proposition 1 transforms this bound into one on all deterministic algorithms (the class \mathcal{A}_{\textnormal{{det}}}).
Theorem 1.
There exist numerical constants 0<c_{0},c_{1}<\infty such that the following lower bound holds. Let p\geq 1, p\in\mathbb{N}, and let \Delta, L_{p}, and \epsilon be positive. Then
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}},\mathcal{F}_{p}(\Delta,L_{p})\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}},\mathcal{F}_{p}(\Delta,L_{p})\big{)}\geq c_{0}\cdot\Delta\left(\frac{L_{p}}{\ell_{p}}\right)^{1/p}\epsilon^{-\frac{1+p}{p}} 
where \ell_{p}\leq e^{\frac{5}{2}p\log p+c_{1}p}.
Before we prove the theorem, a few remarks are in order. First, our lower bound matches the upper bound (8) that pth-order regularization schemes achieve [7], up to a constant depending polynomially on p. Thus, although our lower bound applies to algorithms given access to \nabla^{{q}}f(x) for all q\in\mathbb{N}, only the first p derivatives are necessary to achieve minimax optimal scaling in \Delta, L_{p}, and \epsilon.
Second, inspection of the proof shows that we actually bound smaller quantities than the complexity defined in Eq. (7). Indeed, we show that taking T\gtrsim\Delta(L_{p}/\ell_{p})^{1/p}\epsilon^{-\frac{1+p}{p}} in the construction (10) and appropriately scaling \bar{f}_{T} yields a function f:\mathbb{R}^{T}\to\mathbb{R} that has L_{p}-Lipschitz continuous pth derivative, and for which any zero-respecting algorithm generates iterates such that \|{\nabla}f(x^{(t)})\|>\epsilon for every t\leq T. That is,
\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}>T\gtrsim\Delta L_{p}^{1/p}\epsilon^{-\frac{1+p}{p}}, 
which is stronger than a lower bound on \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}},\mathcal{F}_{p}(% \Delta,L_{p})\big{)}. Combined with the reduction in Proposition 1, this implies that for any deterministic algorithm \mathsf{A}\in\mathcal{A}_{\textnormal{{det}}} there exists an orthogonal U\in\mathbb{R}^{2T\times T} for which f_{U}(x)=f(U^{\top}x) is difficult, i.e. \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f(U^{\top}\cdot)\big{)}>T.
Finally, the scaling of \ell_{p} with p may appear strange, or perhaps extraneous. We provide two viewpoints on this. First, one expects the smoothness constants L_{p} to grow quickly as p grows; for \mathcal{C}^{\infty} functions such as \phi(t)=e^{t^{2}} or \phi(t)=\log(1+e^{t}), \sup_{t}|\phi^{(p)}(t)| grows super-exponentially in p. Indeed, \ell_{p} is the Lipschitz constant of the pth derivative of \bar{f}_{T}. Second, it seems that attaining the optimal rate of convergence for a given p requires evaluation of the full pth order derivative; in the companion [15] we prove that the rate \epsilon^{-3/2}=\epsilon^{-(p+1)/p}, for p=2, is unachievable with first-order methods, and we believe that similarly \epsilon^{-(p+1)/p} is unachievable with only the first p-1 derivatives. The cases of practical interest are thus p\in\{1,2\}, in which case \ell_{p}^{1/p}\lesssim p^{\frac{5}{2}} is a numerical constant.
4.3 Proof of Theorem 1
To prove Theorem 1, we set up the hard instance f:\mathbb{R}^{T}\to\mathbb{R} for some integer T by appropriately scaling \bar{f}_{T} defined in Eq. (10),
f(x):=\frac{L_{p}\sigma^{p+1}}{\ell_{p}}\bar{f}_{T}(x/\sigma)\,, 
for some scale parameter \sigma>0 to be determined, where \ell_{p}\leq e^{\frac{5}{2}p\log p+c_{1}p} is as in Lemma 3.iii. Fix \mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}} and let x^{(1)}=0,x^{(2)},\ldots,x^{(T)} be the iterates produced by \mathsf{A} applied to f. Since f is also a zero-chain, Observation 1 gives x^{(t)}_{T}=0 for all t\leq T. Applying Lemma 2 with i=T guarantees that \left\|{\nabla}\bar{f}_{T}(x^{(t)}/\sigma)\right\|>1, and therefore
\left\|{\nabla}f(x^{(t)})\right\|=\frac{L_{p}\sigma^{p}}{\ell_{p}}\left\|{\nabla}\bar{f}_{T}(x^{(t)}/\sigma)\right\|>\frac{L_{p}\sigma^{p}}{\ell_{p}}.  (12) 
It remains to choose T and \sigma based on \epsilon such that \left\|{\nabla}f(x^{(t)})\right\|>\epsilon and f\in\mathcal{F}_{p}(\Delta,L_{p}). By the lower bound (12), the choice \sigma=(\ell_{p}\epsilon/L_{p})^{1/p} guarantees \left\|{\nabla}f(x^{(t)})\right\|>\epsilon and hence \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}\geq T+1. We note that \nabla^{{p+1}}f(x)=(L_{p}/\ell_{p})\nabla^{{p+1}}\bar{f}_{T}(x/\sigma), and therefore by Lemma 3.iii the pth order derivatives of f are L_{p}-Lipschitz continuous. Thus, to ensure f\in\mathcal{F}_{p}(\Delta,L_{p}) it suffices to show that f(0)-\inf_{x}f(x)\leq\Delta. By the first part of Lemma 3 we have
f(0)-\inf_{x}f(x)=\frac{L_{p}\sigma^{p+1}}{\ell_{p}}(\bar{f}_{T}(0)-\inf_{x}\bar{f}_{T}(x))\leq\frac{12L_{p}\sigma^{p+1}}{\ell_{p}}T=\frac{12\ell_{p}^{1/p}\epsilon^{\frac{1+p}{p}}}{L_{p}^{1/p}}T, 
where in the last transition we substituted \sigma=(\ell_{p}\epsilon/L_{p})^{1/p}. We conclude that f\in\mathcal{F}_{p}(\Delta,L_{p}) for
T=\left\lfloor\frac{\Delta L_{p}^{1/p}}{12\ell_{p}^{1/p}}\epsilon^{-\frac{1+p}{p}}\right\rfloor~{}~{}\mbox{so}~{}~{}\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}},\mathcal{F}_{p}(\Delta,L_{p})\big{)}\geq\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}\geq 1+T\geq\frac{\Delta L_{p}^{1/p}}{12\ell_{p}^{1/p}}\epsilon^{-\frac{1+p}{p}}. 
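To get a feel for the scaling, the short sketch below (with illustrative parameter values of our own choosing; Lemma 3 only upper-bounds \ell_p, so the value used here is a placeholder) computes the scale \sigma=(\ell_{p}\epsilon/L_{p})^{1/p} and the iteration count T from the proof.

```python
import math

def hard_instance_params(Delta, L_p, eps, p, ell_p):
    # sigma makes the scaled gradient bound (12) exceed eps;
    # T keeps f(0) - inf f <= Delta, as in the proof of Theorem 1.
    sigma = (ell_p * eps / L_p) ** (1.0 / p)
    T = math.floor(Delta * L_p ** (1.0 / p)
                   / (12.0 * ell_p ** (1.0 / p)) * eps ** (-(1.0 + p) / p))
    return sigma, T

# Illustrative numbers (not from the paper): first-order smooth case p = 1,
# with a placeholder value for ell_1.
sigma, T = hard_instance_params(Delta=1.0, L_p=1.0, eps=1e-3, p=1, ell_p=152.0)
# For p = 1 the bound scales as eps^{-2}, the familiar gradient-descent rate;
# for p = 2 it scales as eps^{-3/2}.
```

Shrinking \epsilon by a factor of 10 multiplies T by 10^{(1+p)/p}, which is the sense in which the hard instances are fundamentally high-dimensional.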
5 Lower bounds for randomized algorithms
With our lower bounds on the complexity of deterministic algorithms established, we turn to the class of all randomized algorithms. We provide strong distributional complexity lower bounds by exhibiting a distribution on functions such that a function drawn from it is “difficult” for any randomized algorithm, with high probability. We do this via the composition of a random orthogonal transformation with the function \bar{f}_{T} defined in (10).
The key steps in our deterministic bounds are (a) to show that any algorithm can “discover” at most one coordinate per iteration and (b) to show that finding an approximate stationary point requires “discovering” T coordinates. In the context of randomized algorithms, we must elaborate this development in two ways. First, in Section 5.1 we provide a “robust” analogue of Observation 1 (step (a) above): we show that for a random orthogonal matrix U, any sequence of bounded iterates \{x^{(t)}\}_{t\in\mathbb{N}} based on derivatives of \bar{f}_{T}(U^{\top}\cdot) must (with high probability) satisfy |\langle u^{(j)},x^{(t)}\rangle|<\frac{1}{2} for all t and j\geq t, so that by Lemma 2, \left\|{\nabla}\bar{f}_{T}(U^{\top}x^{(t)})\right\| must be large (step (b)). Second, in Section 5.2 we further augment our construction to force boundedness of the iterates by composing \bar{f}_{T}(U^{\top}\cdot) with a soft projection, so that an algorithm cannot “cheat” with unbounded iterates. Finally, we present our general lower bounds in Section 5.3.
5.1 Random rotations and bounded iterates
To transform our hard instance (10) into a hard instance distribution, we introduce an orthogonal matrix U\in\mathbb{R}^{d\times T} (with columns u^{(1)},\ldots,u^{(T)}), and define
\tilde{f}_{T;U}(x):=\bar{f}_{T}(U^{\top}x)=\bar{f}_{T}(\langle u^{(1)},x% \rangle,\ldots,\langle u^{(T)},x\rangle),  (13) 
We assume throughout that U is chosen uniformly at random from the space of orthogonal matrices \mathsf{O}(d,T)=\{V\in\mathbb{R}^{d\times T}\mid V^{\top}V=I_{T}\}; unless otherwise stated, the probabilistic statements we give are with respect to this uniform U, in addition to any randomness in the algorithm that produces the iterates. With this definition, we have the following extension of Observation 1 to randomized iterates, which we prove for \bar{f}_{T} but which is valid for any robust zero-chain (Definition 4). Recall that a sequence is informed by f if it has the same distribution as \mathsf{A}[f] for some randomized algorithm \mathsf{A} (with iteration (4)).
Lemma 4.
Let \delta>0 and R\geq\sqrt{T}, and let x^{(1)},\ldots,x^{(T)} be informed by \tilde{f}_{T;U} and bounded, so that \|x^{(t)}\|\leq R for each t\leq T. If d\geq 52TR^{2}\log\frac{2T^{2}}{\delta} then with probability at least 1-\delta, for all t\leq T and each j\in\{t,\ldots,T\}, we have
\left|\langle u^{(j)},x^{(t)}\rangle\right|<1/2. 
The result of Lemma 4 is identical (up to constant factors) to an important result of Woodworth and Srebro [41, Lemma 7], but we must be careful with the sequential conditioning of randomness between the iterates x^{(t)}, the random orthogonal U, and how much information the sequentially computed derivatives may leak. Because of this additional care, we require a modification of their original proof,^{4} which we provide in Section B.3, giving a rough outline here. (^{4} In a recent note, Woodworth and Srebro [42] independently provide a revision of their proof that is similar, but not identical, to the one we propose here.) For a fixed t<T, assume that |\langle u^{(j)},x^{(s)}\rangle|<1/2 holds for every pair s\leq t and j\in\{s,\ldots,T\}; we argue that this (roughly) implies that |\langle u^{(j)},x^{(t+1)}\rangle|<1/2 for every j\in\{t+1,\ldots,T\} with high probability, completing the induction. When the assumption that |\langle u^{(j)},x^{(s)}\rangle|<1/2 holds, the robust zero-chain property of \bar{f}_{T} (Definition 4 and Observation 2) implies that for every s\leq t we have
\tilde{f}_{T;U}(y)=\bar{f}_{T}(\langle u^{(1)},y\rangle,\ldots,\langle u^{(s)}% ,y\rangle,0,\ldots,0) 
for all y in a neighborhood of x^{(s)}. That is, we can compute all the derivatives of \tilde{f}_{T;U} at x^{(s)} from x^{(s)} and u^{(1)},\ldots,u^{(s)}, as \bar{f}_{T} is known. Therefore, given u^{(1)},x^{(1)},\ldots,u^{(t)},x^{(t)} it is possible to reconstruct all the information the algorithm has collected up to iteration t. This means that beyond possibly revealing u^{(1)},\ldots,u^{(t)}, these derivatives contain no additional information on u^{(t+1)},\ldots,u^{(T)}. Consequently, any component of x^{(t+1)} outside the span of u^{(1)},x^{(1)},\ldots,u^{(t)},x^{(t)} is a complete “shot in the dark.”
To give “shot in the dark” a more precise meaning, let \hat{u}^{(j)} be the projection of u^{(j)} onto the orthogonal complement of \mathrm{span}\{u^{(1)},x^{(1)},\ldots,u^{(t)},x^{(t)}\}. We show that, conditioned on u^{(1)},x^{(1)},\ldots,u^{(t)},x^{(t)} and the induction hypothesis, \hat{u}^{(j)} has a rotationally symmetric distribution in that subspace, and that it is independent of x^{(t+1)}. Therefore, by concentration of measure arguments on the sphere [5], we have |\langle\hat{u}^{(j)},x^{(t+1)}\rangle|\lesssim\|x^{(t+1)}\|/\sqrt{d}\leq R/\sqrt{d} for any individual j\geq t+1, with high probability. Using an appropriate induction hypothesis, this is sufficient to guarantee that for every t+1\leq j\leq T, |\langle u^{(j)},x^{(t+1)}\rangle|\lesssim R\sqrt{(T\log T)/d}, which is bounded by 1/2 for sufficiently large d.
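The concentration step can be observed directly. In the sketch below (a numerical illustration with dimensions of our choosing) we sample an orthonormal T-frame by applying QR to a Gaussian matrix, one standard way to draw U uniformly from \mathsf{O}(d,T), and verify that its inner products with a fixed query of norm R=\sqrt{T} are all far below the 1/2 threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4000, 10
# Columns of U form an orthonormal T-frame; rotational invariance of the
# Gaussian ensemble followed by QR makes the frame uniformly distributed.
G = rng.standard_normal((d, T))
U, _ = np.linalg.qr(G)

R = np.sqrt(T)
x = R * np.ones(d) / np.sqrt(d)       # a fixed query with ||x|| = R
overlaps = np.abs(U.T @ x)            # |<u^(j), x>| for each column j
# Each overlap concentrates at scale R / sqrt(d) ~ 0.05 here,
# far below the 1/2 "dead zone" threshold of Psi.
assert overlaps.max() < 0.5
```

The requirement d \gtrsim T R^{2}\log T in Lemma 4 is what keeps all TR overlap events simultaneously below 1/2 with high probability.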
5.2 Handling unbounded iterates
In the deterministic case, the adversary (choosing the hard function f) can choose the rotation matrix U to be exactly orthogonal to all past iterates; this is impossible for randomized algorithms. The construction (13) thus fails for unbounded random iterates: as long as x^{(t)} and u^{(j)} are not exactly orthogonal, their inner product will exceed 1/2 for sufficiently large \|x^{(t)}\|, thus breaching the “dead zone” of \Psi and providing the algorithm with information on u^{(j)}. To prevent this, we force the algorithm to access \tilde{f}_{T;U} only at points with bounded norm, by first passing the iterates through a smooth mapping from \mathbb{R}^{d} to a ball around the origin. We denote our final hard instance construction by \hat{f}_{T;U}:\mathbb{R}^{d}\to\mathbb{R}, and define it as
\hat{f}_{T;U}(x)=\tilde{f}_{T;U}(\rho(x))+\frac{1}{10}\left\|x\right\|^{2},~{}\mbox{where}~{}\rho(x)=\frac{x}{\sqrt{1+\left\|x\right\|^{2}/R^{2}}}~{}\mbox{and}~{}R=230\sqrt{T}\,.  (14) 
The quadratic term in \hat{f}_{T;U} guarantees that all points beyond a certain norm have a large gradient, which prevents the algorithm from trivially making the gradient small by increasing the norm of the iterates. The following lemma captures the hardness of \hat{f}_{T;U} for randomized algorithms.
Lemma 5.
Let \delta>0, and let x^{(1)},\ldots,x^{(T)} be informed by \hat{f}_{T;U}. If d\geq 52\cdot 230^{2}\cdot T^{2}\log\frac{2T^{2}}{\delta} then, with probability at least 1-\delta,
\big\|{\nabla}\hat{f}_{T;U}(x^{(t)})\big\|>1/2~{}~{}\mbox{for all}~{}t\leq T. 
Proof.
For t\leq T, set y^{(t)}:=\rho(x^{(t)}). For every p\geq 0 and t\in\mathbb{N}, the quantity \nabla^{{p}}\hat{f}_{T;U}(x^{(t)}) is measurable with respect to x^{(t)} and \{\nabla^{{i}}\tilde{f}_{T;U}(y^{(t)})\}_{i=0}^{p} (the chain rule shows it can be computed from these variables without additional dependence on U, as \rho is fixed). Therefore, the process y^{(1)},\ldots,y^{(T)} is informed by \tilde{f}_{T;U} (recall the defining iteration (4)). Since \|y^{(t)}\|=\|\rho(x^{(t)})\|\leq R for every t, we may apply Lemma 4 with R=230\sqrt{T} to obtain that, with probability at least 1-\delta,
\left|\langle u^{(T)},y^{(t)}\rangle\right|<1/2~{}~{}\mbox{for every }t\leq T. 
Therefore, by Lemma 2 with i=T, for each t there exists j\leq T such that
\left|\left\langle u^{(j)},y^{(t)}\right\rangle\right|<1~{}\mbox{and}~{}\left|\left\langle u^{(j)},{\nabla}\tilde{f}_{T;U}(y^{(t)})\right\rangle\right|>1.  (15) 
To show that \|{\nabla}\hat{f}_{T;U}(x^{(t)})\| is also large, we consider separately the cases \|x^{(t)}\|\leq R/2 and \|x^{(t)}\|\geq R/2. For the first case, we use \frac{{\partial}\rho}{{\partial}x}(x)=\frac{I-\rho(x)\rho(x)^{\top}/R^{2}}{\sqrt{1+\left\|x\right\|^{2}/R^{2}}} to write
\displaystyle\left\langle u^{(j)},{\nabla}\hat{f}_{T;U}(x^{(t)})\right\rangle=\left\langle u^{(j)},\frac{{\partial}\rho}{{\partial}x}(x^{(t)}){\nabla}\tilde{f}_{T;U}(y^{(t)})\right\rangle+\frac{1}{5}\left\langle u^{(j)},x^{(t)}\right\rangle  
\displaystyle\qquad=\frac{\langle u^{(j)},{\nabla}\tilde{f}_{T;U}(y^{(t)})\rangle-\langle u^{(j)},y^{(t)}\rangle\langle y^{(t)},{\nabla}\tilde{f}_{T;U}(y^{(t)})\rangle/R^{2}}{\sqrt{1+\|x^{(t)}\|^{2}/R^{2}}}+\frac{1}{5}\langle u^{(j)},y^{(t)}\rangle\sqrt{1+\|x^{(t)}\|^{2}/R^{2}}. 
Therefore, for \|y^{(t)}\|\leq\|x^{(t)}\|\leq R/2 we have
\left|\left\langle u^{(j)},{\nabla}\hat{f}_{T;U}(x^{(t)})\right\rangle\right|\geq\frac{2}{\sqrt{5}}\left|\left\langle u^{(j)},{\nabla}\tilde{f}_{T;U}(y^{(t)})\right\rangle\right|-\left|\left\langle u^{(j)},y^{(t)}\right\rangle\right|\left(\frac{\|{\nabla}\tilde{f}_{T;U}(y^{(t)})\|}{2R}+\frac{1}{2\sqrt{5}}\right). 
By Lemma 3.ii we have \|{\nabla}\tilde{f}_{T;U}(y^{(t)})\|\leq 23\sqrt{T}=R/10, which combined with (15) and the above display yields \|{\nabla}\hat{f}_{T;U}(x^{(t)})\|\geq|\langle u^{(j)},{\nabla}\hat{f}_{T;U}(x^{(t)})\rangle|\geq\frac{2}{\sqrt{5}}-\frac{1}{20}-\frac{1}{2\sqrt{5}}>\frac{1}{2}.
In the second case, \left\|x^{(t)}\right\|\geq R/2, we have for any x satisfying \left\|x\right\|\geq R/2 and y=\rho(x) that
\left\|{\nabla}\hat{f}_{T;U}(x)\right\|\geq\frac{1}{5}\left\|x\right\|-\left\|\frac{{\partial}\rho}{{\partial}x}(x)\right\|_{\rm op}\left\|{\nabla}\tilde{f}_{T;U}(y)\right\|\geq\frac{R}{10}-\frac{2}{\sqrt{5}}\cdot\frac{R}{10}>\sqrt{T}\geq 1,  (16) 
where we used \|\frac{{\partial}\rho}{{\partial}x}(x)\|_{\rm op}\leq\frac{1}{\sqrt{1+\left\|x\right\|^{2}/R^{2}}}\leq 2/\sqrt{5} and that \|{\nabla}\tilde{f}_{T;U}(y)\|\leq 23\sqrt{T}=R/10. ∎
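The two properties of \rho used in the proof, that it maps into the ball of radius R and that its Jacobian has operator norm at most (1+\|x\|^{2}/R^{2})^{-1/2}, can be verified numerically; the following sketch (ours, with an arbitrary choice of R) does so at random points.

```python
import numpy as np

def rho(x, R):
    # Soft projection into the ball of radius R, as in Eq. (14).
    return x / np.sqrt(1.0 + np.dot(x, x) / R ** 2)

def rho_jacobian(x, R):
    # d rho / d x = (I - rho rho^T / R^2) / sqrt(1 + ||x||^2 / R^2).
    y = rho(x, R)
    s = np.sqrt(1.0 + np.dot(x, x) / R ** 2)
    return (np.eye(len(x)) - np.outer(y, y) / R ** 2) / s

rng = np.random.default_rng(1)
R = 10.0
for _ in range(20):
    x = rng.standard_normal(5) * 50.0      # points possibly far outside the ball
    assert np.linalg.norm(rho(x, R)) < R
    # The Jacobian is PSD with eigenvalues 1/s and (1 - ||rho(x)||^2/R^2)/s,
    # so its operator norm equals 1/s = 1/sqrt(1 + ||x||^2/R^2) <= 1.
    s = np.sqrt(1.0 + np.dot(x, x) / R ** 2)
    assert np.linalg.norm(rho_jacobian(x, R), ord=2) <= 1.0 / s + 1e-12
```

Far from the origin the Jacobian norm decays like R/\|x\|, which is why the composed gradient in (16) is dominated by the quadratic term there.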
As our lower bounds repose on appropriately scaling the function \hat{f}_{T;U}, it remains to verify that \hat{f}_{T;U} satisfies the few boundedness properties we require. We do so in the following lemma.
Lemma 6.
The function \hat{f}_{T;U} satisfies the following.

(i) We have \hat{f}_{T;U}(0)-\inf_{x}\hat{f}_{T;U}(x)\leq 12T.

(ii) For every p\geq 1, the pth order derivatives of \hat{f}_{T;U} are \hat{\ell}_{p}-Lipschitz continuous, where \hat{\ell}_{p}\leq\exp(cp\log p+c) for a numerical constant c<\infty.
We defer the (computationally involved) proof of this lemma to Section B.5.
5.3 Final lower bounds
With Lemmas 5 and 6 in hand, we can state our lower bound for all algorithms, randomized or otherwise, given access to all derivatives of a \mathcal{C}^{\infty} function. Note that our construction also implies an identical lower bound for (slightly) more general algorithms that use any local oracle [32, 10], meaning that the information the oracle returns about a function f queried at a point x is identical to the information it returns about a function g queried at x whenever f(z)=g(z) for all z in a neighborhood of x.
Theorem 2.
There exist numerical constants 0<c_{0},c_{1}<\infty such that the following lower bound holds. Let p\geq 1,p\in\mathbb{N}, and let \Delta, L_{p}, and \epsilon be positive. Then
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{rand}}},\mathcal{F}_{p}(\Delta,L_{p})\big{)}\geq c_{0}\cdot\Delta\left(\frac{L_{p}}{\hat{\ell}_{p}}\right)^{1/p}\epsilon^{-\frac{1+p}{p}}, 
where \hat{\ell}_{p}\leq e^{c_{1}p\log p+c_{1}}.
We return to the proof of Theorem 2 in Sec. 5.4, following the same outline as that of Theorem 1, and provide some commentary here. An inspection of the proof to come shows that we actually demonstrate a stronger result than that claimed in the theorem. For any \delta\in(0,1), let d\geq\left\lceil 52\cdot(230)^{2}\cdot T^{2}\log(2T^{2}/\delta)\right\rceil where T=\lfloor{c_{0}\Delta({L_{p}}/{\hat{\ell}_{p}})^{1/p}\epsilon^{-\frac{1+p}{p}}}\rfloor as in the theorem statement. In the proof we construct a probability measure \mu on functions in \mathcal{F}_{p}(\Delta,L_{p}), of fixed dimension d, such that
\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}}\int{\mathbb{P}}_{\mathsf{A}}\left(\big\|{\nabla f(x^{(t)})}\big\|>\epsilon~{}\mbox{for~{}all~{}}t\leq T\mid f\right)d\mu(f)>1-\delta,  (17) 
where the randomness in {\mathbb{P}}_{\mathsf{A}} depends only on \mathsf{A}. Therefore, by definition (6), for any \mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}} a function f drawn from \mu satisfies
\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}>T~{}\mbox{with probability greater than }1-2\delta,  (18) 
implying Theorem 2 for any \delta\leq 1/2. Thus, we exhibit a randomized procedure for finding hard instances for any randomized algorithm, one that requires no knowledge of the algorithm itself.
Theorem 2 is stronger than Theorem 1, as it applies to a broader class of algorithms. However, our probabilistic analysis requires that the functions constructed to prove Theorem 2 be of dimension at least of order T^{2}. In contrast to Theorem 1, which requires dimension only 2T, this is a drawback; we do not know whether such an increase in dimension is necessary.
5.4 Proof of Theorem 2
We set up our hard instance distribution f_{U}:\mathbb{R}^{d}\to\mathbb{R}, indexed by a uniformly distributed orthogonal matrix U\in\mathsf{O}(d,T), by appropriately scaling \hat{f}_{T;U} defined in (14),
f_{U}(x):=\frac{L_{p}\sigma^{p+1}}{\hat{\ell}_{p}}\hat{f}_{T;U}(x/\sigma), 
where the integer T and scale parameter \sigma>0 are to be determined, d=\lceil{52\cdot(230)^{2}T^{2}\log(4T^{2})}\rceil, and the quantity \hat{\ell}_{p}\leq\exp(c_{1}p\log p+c_{1}) for a numerical constant c_{1} is defined in Lemma 6.ii.
Fix \mathsf{A}\in\mathcal{A}_{\textnormal{rand}} and let x^{(1)},x^{(2)},\ldots,x^{(T)} be the iterates produced by \mathsf{A} applied to f_{U}. Since f_{U} and \hat{f}_{T;U} differ only by scaling, the iterates x^{(1)}/\sigma,x^{(2)}/\sigma,\ldots,x^{(T)}/\sigma are informed by \hat{f}_{T;U} (recall Sec. 2.2), and therefore we may apply Lemma 5 with \delta=1/2 and our large enough choice of dimension d to conclude that
{\mathbb{P}}_{\mathsf{A},U}\left(\big\|{\nabla}\hat{f}_{T;U}\left(x^{(t)}/\sigma\right)\big\|>\frac{1}{2}~\mbox{for all}~t\leq T\right)>\frac{1}{2},
where the probability is taken over both the random orthogonal U and any randomness in \mathsf{A}. As \mathsf{A} is arbitrary, taking \sigma=(2\hat{\ell}_{p}\epsilon/L_{p})^{1/p}, this inequality becomes the desired strong inequality (17) with \delta=1/2 and \mu induced by the distribution of U. Thus, by (18), for every \mathsf{A}\in\mathcal{A}_{\textnormal{rand}} there exists U_{\mathsf{A}}\in\mathsf{O}(d,T) such that \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U_{\mathsf{A}}}\big{)}\geq 1+T, so
\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{rand}}}\sup_{U\in\mathsf{O}(d,T)}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\geq 1+T.
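For concreteness, the rescaling step above amounts to the following chain-rule computation, spelled out here for the reader (it is implicit in the argument):

```latex
\nabla f_U(x) \;=\; \frac{L_p \sigma^{p}}{\hat{\ell}_p}\,
  \nabla \hat{f}_{T;U}(x/\sigma),
\qquad\text{so that with }\ \sigma = (2\hat{\ell}_p\epsilon/L_p)^{1/p},
\quad
\big\|\nabla f_U(x)\big\| \;=\; 2\epsilon\,
  \big\|\nabla \hat{f}_{T;U}(x/\sigma)\big\|,
```

and hence \|\nabla\hat{f}_{T;U}(x^{(t)}/\sigma)\|>1/2 exactly when \|\nabla f_{U}(x^{(t)})\|>\epsilon.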
It remains to choose T to guarantee that f_{U} belongs to the relevant function class (bounded and smooth) for every orthogonal U. By Lemma 6.ii, f_{U} has L_{p}-Lipschitz continuous pth order derivatives. By Lemma 6.i, we have
f_{U}(0)-\inf_{x}f_{U}(x)\leq\frac{L_{p}\sigma^{p+1}}{\hat{\ell}_{p}}\left(\bar{f}_{T}(0)-\inf_{x}\bar{f}_{T}(x)\right)\leq\frac{12L_{p}\sigma^{p+1}}{\hat{\ell}_{p}}T=\frac{24(2\hat{\ell}_{p})^{1/p}\epsilon^{\frac{p+1}{p}}}{L_{p}^{1/p}}T,
where in the last transition we have substituted \sigma=(2\hat{\ell}_{p}\epsilon/L_{p})^{1/p}. Setting T=\lfloor{\frac{\Delta}{48}({L_{p}}/{\hat{\ell}_{p}})^{1/p}\epsilon^{-\frac{1+p}{p}}}\rfloor gives f_{U}(0)-\inf_{x}f_{U}(x)\leq\Delta, so that f_{U}\in\mathcal{F}_{p}(\Delta,L_{p}), yielding the theorem.
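The choices of \sigma and T above can be packaged as a small numerical sanity check. The sketch below is illustrative only: the value ell_hat = 10 is a placeholder for the quantity of Lemma 6.ii (the text states only an upper bound on it), and the constants 48 and 24 are those of the proof.

```python
import math

def hard_instance_params(Delta, L_p, eps, p, ell_hat):
    """Scale sigma and chain length T as chosen in the proof above.

    `ell_hat` stands in for the quantity of Lemma 6.ii; the value used
    below is a placeholder, since only an upper bound on it is stated.
    """
    sigma = (2.0 * ell_hat * eps / L_p) ** (1.0 / p)
    T = math.floor((Delta / 48.0) * (L_p / ell_hat) ** (1.0 / p)
                   * eps ** (-(1.0 + p) / p))
    # suboptimality budget from the last display of the proof:
    # f_U(0) - inf f_U <= 24 (2 ell_hat)^{1/p} eps^{(p+1)/p} T / L_p^{1/p}
    gap = 24.0 * (2.0 * ell_hat) ** (1.0 / p) * eps ** ((p + 1.0) / p) \
        * T / L_p ** (1.0 / p)
    return sigma, T, gap

sigma, T, gap = hard_instance_params(Delta=1.0, L_p=1.0, eps=1e-2, p=1,
                                     ell_hat=10.0)
assert gap <= 1.0  # the chosen T keeps f_U inside F_p(Delta, L_p)
```

The floor in the definition of T is exactly what absorbs the factor 24\cdot 2^{1/p}\leq 48 and guarantees the suboptimality budget \Delta is respected.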
6 Distance-based lower bounds
We have so far considered finding approximate stationary points of smooth functions with bounded suboptimality at the origin, i.e. f(0)-\inf_{x}f(x)\leq\Delta. In convex optimization, it is common to consider instead functions with bounded distance between the origin and a global minimum. We may consider a similar restriction for nonconvex functions; for p\geq 1 and positive L_{p},D, let
\mathcal{F}^{\rm dist}_{p}(D,L_{p}) 
be the class of \mathcal{C}^{\infty} functions with L_{p}-Lipschitz pth order derivatives satisfying
\sup\left\{\left\|x\right\|\mid x\in\mathop{\rm arg\,min}f\right\}\leq D,  (19)
that is, all global minima have bounded distance to the origin.
In this section we give a lower bound on the complexity of this function class with the same \epsilon dependence as our bound for the class \mathcal{F}_{p}(\Delta,L_{p}). This is in sharp contrast to convex optimization, where distance-bounded functions enjoy significantly better \epsilon dependence than their value-bounded counterparts (see the companion paper [15]). Qualitatively, the reason for this difference is that the lack of convexity allows us to “hide” global minima close to the origin that are difficult to find for any algorithm with local function access [32].
We postpone the construction and proof to Appendix C, and move directly to the final bound.
Theorem 3.
There exist numerical constants 0<c_{0},c_{1}<\infty such that the following lower bound holds. For any p\geq 1, let D, L_{p}, and \epsilon be positive. Then
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{rand}},\mathcal{F}^{\rm dist}_{p}(D,L_{p})\big{)}\geq c_{0}\cdot D^{1+p}\left(\frac{L_{p}}{\ell_{p}^{\prime}}\right)^{\frac{1+p}{p}}\epsilon^{-\frac{1+p}{p}},
where \ell_{p}^{\prime}\leq e^{c_{1}p\log p+c_{1}}.
While we do not have a matching upper bound for Theorem 3, we can match its \epsilon dependence in the smaller function class
\mathcal{F}^{\rm dist}_{1,p}(D,L_{1},L_{p})=\mathcal{F}^{\rm dist}_{1}(D,L_{1}% )\cap\mathcal{F}^{\rm dist}_{p}(D,L_{p}), 
due to the fact that for any f:\mathbb{R}^{d}\to\mathbb{R} with L_{1}-Lipschitz continuous gradient and global minimizer x^{\star}, we have f(x)-f(x^{\star})\leq\frac{1}{2}L_{1}\left\|x-x^{\star}\right\|^{2} for all x\in\mathbb{R}^{d} [cf. 9, Eq. (9.13)]. Hence \mathcal{F}^{\rm dist}_{1,p}(D,L_{1},L_{p})\subset\mathcal{F}_{p}(\Delta,L_{p}) with \Delta:=\frac{1}{2}L_{1}D^{2}, and consequently by the bound (8) we have
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{det}}^{(p)}\cap\mathcal{A}_{\textnormal{zr}}^{(p)},\mathcal{F}^{\rm dist}_{1,p}(D,L_{1},L_{p})\big{)}\lesssim D^{2}L_{1}L_{p}^{1/p}\epsilon^{-\frac{p+1}{p}}.
Acknowledgments
OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research and NSF CAREER award 1553086. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship.
References
 Agarwal et al. [2012] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.
 Agarwal et al. [2017] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the Forty-Ninth Annual ACM Symposium on the Theory of Computing, 2017. URL https://arxiv.org/abs/1611.01146.
 Arjevani et al. [2016] Y. Arjevani, S. Shalev-Shwartz, and O. Shamir. On lower and upper bounds in smooth and strongly convex optimization. Journal of Machine Learning Research, 17(126):1–51, 2016. URL http://jmlr.org/papers/v17/15-106.html.
 Arjevani et al. [2017] Y. Arjevani, O. Shamir, and R. Shiff. Oracle complexity of secondorder methods for smooth convex optimization. arXiv:1705.07260 [math.OC], 2017.
 Ball [1997] K. Ball. An elementary introduction to modern convex geometry. In S. Levy, editor, Flavors of Geometry, pages 1–58. MSRI Publications, 1997.
 Berend and Tassa [2010] D. Berend and T. Tassa. Improved bounds on Bell numbers and on moments of sums of random variables. Probability and Mathematical Statistics, 30(2):185–205, 2010.
 Birgin et al. [2017] E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and P. L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1–2):359–368, 2017.
 Boumal et al. [2016] N. Boumal, V. Voroninski, and A. Bandeira. The non-convex Burer–Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems 29, pages 2757–2765, 2016.
 Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 Braun et al. [2017] G. Braun, C. Guzmán, and S. Pokutta. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory, 63(7), 2017.
 Burer and Monteiro [2003] S. Burer and R. D. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
 Candès et al. [2015] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
 Carmon et al. [2016] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. arXiv:1611.00756 [math.OC], 2016.
 Carmon et al. [2017a] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In Proceedings of the 34th International Conference on Machine Learning, 2017a.
 Carmon et al. [2017b] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: First-order methods. arXiv: [math.OC], 2017b.
 Cartis et al. [2010] C. Cartis, N. I. Gould, and P. L. Toint. On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
 Cartis et al. [2012a] C. Cartis, N. I. Gould, and P. L. Toint. Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108, 2012a.
 Cartis et al. [2012b] C. Cartis, N. I. M. Gould, and P. L. Toint. How much patience do you have? A worstcase perspective on smooth nonconvex optimization. Optima, 88, 2012b.
 Cartis et al. [2017] C. Cartis, N. I. M. Gould, and P. L. Toint. Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. arXiv:1709.07180 [math.OC], 2017.
 Chowla et al. [1951] S. Chowla, I. N. Herstein, and W. K. Moore. On recursions connected with symmetric groups I. Canadian Journal of Mathematics, 3:328–334, 1951.
 Conn et al. [2000] A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust Region Methods. MPS-SIAM Series on Optimization. SIAM, 2000.
 Hager and Zhang [2006] W. W. Hager and H. Zhang. A survey of nonlinear conjugate gradient methods. Pacific Journal of Optimization, 2(1):35–58, 2006.
 Jin et al. [2017] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 Keshavan et al. [2010] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057–2078, 2010.
 LeCun et al. [2015] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Liu and Nocedal [1989] D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989.
 Loh and Wainwright [2012] P.-L. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: provable guarantees with non-convexity. Annals of Statistics, 40(3):1637–1664, 2012.
 Loh and Wainwright [2013] P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616, 2013.
 Monteiro and Svaiter [2013] R. D. Monteiro and B. F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to secondorder methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.
 Murty and Kabadi [1987] K. Murty and S. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39:117–129, 1987.
 Nemirovski [1994] A. Nemirovski. Efficient methods in convex programming. Technion: The Israel Institute of Technology, 1994.
 Nemirovski and Yudin [1983] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
 Nesterov [1983] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^{2}). Soviet Mathematics Doklady, 27(2):372–376, 1983.
 Nesterov [2004] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.
 Nesterov [2012a] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012a.
 Nesterov [2012b] Y. Nesterov. How to make the gradients small. Optima, 88, 2012b.
 Nesterov and Polyak [2006] Y. Nesterov and B. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, Series A, 108:177–205, 2006.
 Nocedal and Wright [2006] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
 Sun et al. [2017] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, To appear, 2017.
 Traub et al. [1988] J. Traub, H. Wasilkowski, and H. Wozniakowski. Information-Based Complexity. Academic Press, 1988.
 Woodworth and Srebro [2016] B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems 29, pages 3639–3647, 2016.
 Woodworth and Srebro [2017] B. E. Woodworth and N. Srebro. Lower bound for randomized first order convex optimization. arXiv:1709.03594 [math.OC], 2017. URL https://arxiv.org/pdf/1709.03594.
 Zhang et al. [2012] X. Zhang, C. Ling, and L. Qi. The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis and Applications, 33(3):806–821, 2012.
Appendix A Proof of Proposition 1
See 1
Proof.
We may assume that \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{det}}^{(p)},\mathcal{F}\big{)}<T_{0} for some integer T_{0}<\infty, as otherwise we have \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{det}}^{(p)},\mathcal{F}\big{)}=\infty and the result holds trivially. It is therefore sufficient to consider only algorithms with worst-case complexity better than T_{0}, i.e. \mathsf{A}\in\mathcal{A}_{\textnormal{det}}^{(p)} such that \sup_{f\in\mathcal{F}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}<T_{0}<\infty. Fixing such \mathsf{A}, we construct an algorithm \mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{zr}}^{(p)} with the following property: for every f:\mathbb{R}^{d}\to\mathbb{R} in \mathcal{F}, there exists an orthogonal U\in\mathbb{R}^{(d+T_{0})\times d}, U^{\top}U=I_{d}, such that f_{U}(x):=f(U^{\top}x) satisfies that the first T_{0} iterates in the sequences \mathsf{Z}_{\mathsf{A}}[f] and U^{\top}\mathsf{A}[f_{U}] are identical. (Recall the notation \mathsf{A}[f]=\{a^{(t)}\}_{t\in\mathbb{N}}, where a^{(t)} are the iterates of \mathsf{A} on f, and we use the obvious shorthand U^{\top}\{a^{(t)}\}_{t\in\mathbb{N}}=\{U^{\top}a^{(t)}\}_{t\in\mathbb{N}}.) Before explaining the construction of the zero-respecting algorithm \mathsf{Z}_{\mathsf{A}}, let us see how its defining property implies Proposition 1.
First, note that if \mathsf{Z}_{\mathsf{A}}[f] and U^{\top}\mathsf{A}[f_{U}] are identical over their first T_{0} iterates,
\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}:=\mathsf{T}_{\epsilon}\big% {(}\mathsf{A}[f_{U}],f_{U}\big{)}\stackrel{{\scriptstyle(i)}}{{=}}\mathsf{T}_{% \epsilon}\big{(}U^{\top}\mathsf{A}[f_{U}],f\big{)}\stackrel{{\scriptstyle(ii)}% }{{=}}\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{A}},f\big{)}.  (20) 
The equality (i) follows because \left\|Ug\right\|=\left\|g\right\| for all orthogonal U, so for any sequence \{a^{(t)}\}_{t\in\mathbb{N}}
\displaystyle\mathsf{T}_{\epsilon}\big{(}\{a^{(t)}\}_{t\in\mathbb{N}},f_{U}\big{)}  \displaystyle=\inf\left\{t\in\mathbb{N}\mid\left\|{\nabla}f_{U}(a^{(t)})\right\|\leq\epsilon\right\}
\displaystyle=\inf\left\{t\in\mathbb{N}\mid\left\|{\nabla}f(U^{\top}a^{(t)})\right\|\leq\epsilon\right\}=\mathsf{T}_{\epsilon}\big{(}\{U^{\top}a^{(t)}\}_{t\in\mathbb{N}},f\big{)}
and in equality (i) we let \{a^{(t)}\}_{t\in\mathbb{N}}=\mathsf{A}[f_{U}]. The equality (ii) holds because \mathsf{T}_{\epsilon}\big{(}\cdot,\cdot\big{)} is a “stopping time”: if \mathsf{T}_{\epsilon}\big{(}U^{\top}\mathsf{A}[f_{U}],f\big{)}\leq T_{0} then the first T_{0} iterates of U^{\top}\mathsf{A}[f_{U}] determine \mathsf{T}_{\epsilon}\big{(}U^{\top}\mathsf{A}[f_{U}],f\big{)}, and these T_{0} iterates are identical to the first T_{0} iterates of \mathsf{Z}_{\mathsf{A}}[f] by assumption. With the identity (20) in hand, we obtain our result by straightforward manipulation as
\displaystyle\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p% )},\mathcal{F}\big{)}  \displaystyle=\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}^{(p)}}\sup_{% f\in\mathcal{F}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}\stackrel{{% \scriptstyle(i)}}{{\geq}}\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}^{% (p)}}\sup_{f\in\mathcal{F}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}% \stackrel{{\scriptstyle(ii)}}{{=}}\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{% {det}}}^{(p)}}\sup_{f\in\mathcal{F}}\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{% \mathsf{A}},f\big{)}  
\displaystyle\stackrel{{\scriptstyle(iii)}}{{\geq}}\inf_{\mathsf{B}\in\mathcal% {A}_{\textnormal{{zr}}}^{(p)}}\sup_{f\in\mathcal{F}}\mathsf{T}_{\epsilon}\big{% (}\mathsf{B},f\big{)}=\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{% zr}}}^{(p)},\mathcal{F}\big{)}, 
where inequality (i) uses that f_{U}\in\mathcal{F} because \mathcal{F} is orthogonally invariant, step (ii) uses equality (20) and step (iii) is due to \mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)} by construction.
It remains to construct the zerorespecting algorithm \mathsf{Z}_{\mathsf{A}} with iterates matching those of \mathsf{A} under appropriate rotation. We do this by describing its operation inductively on any given f:\mathbb{R}^{d}\to\mathbb{R}, which we denote \{z^{(t)}\}_{t\in\mathbb{N}}=\mathsf{Z}_{\mathsf{A}}[f]. Letting d^{\prime}=d+T_{0}, the state of the algorithm \mathsf{Z}_{\mathsf{A}} at iteration t is determined by a support S_{t}\subseteq[d] and orthonormal vectors \{u^{(i)}\}_{i\in S_{t}}\subset\mathbb{R}^{d^{\prime}} identified with this support. The support condition (5) defines the set S_{t},
S_{t}=\bigcup_{q\in[p]}\bigcup_{s<t}\mathop{\mathrm{supp}}\left\{\nabla^{{q}}f% (z^{(s)})\right\}, 
so that \emptyset=S_{1}\subseteq S_{2}\subseteq\cdots and the collection \{u^{(i)}\}_{i\in S_{t}} grows with t. We let U\in\mathbb{R}^{d^{\prime}\times d} be the orthogonal matrix whose ith column is u^{(i)}—even though U may not be completely determined throughout the runtime of \mathsf{Z}_{\mathsf{A}}, our partial knowledge of it will suffice to simulate the operation of \mathsf{A} on f_{U}(a)=f(U^{\top}a). Letting \{a^{(t)}\}_{t\in\mathbb{N}}=\mathsf{A}[f_{U}], our requirements \mathsf{Z}_{\mathsf{A}}[f]=U^{\top}\mathsf{A}[f_{U}] and \mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{zr}}} are equivalent to
z^{(t)}=U^{\top}a^{(t)}\mbox{ and }\mathop{\mathrm{supp}}\{z^{(t)}\}\subseteq S% _{t}  (21) 
for every t\leq T_{0} (we set z^{(i)}=0 for every i>T_{0} without loss of generality).
Let us proceed with the inductive argument. The iterate a^{(1)}\in\mathbb{R}^{d^{\prime}} is an arbitrary (but deterministic) vector in \mathbb{R}^{d^{\prime}}. We thus satisfy (21) at t=1 by requiring that \langle u^{(j)},a^{(1)}\rangle=0 for every j\in[d], whence the first iterate of \mathsf{Z}_{\mathsf{A}} satisfies z^{(1)}=0\in\mathbb{R}^{d}. Assume now that the equality and containment (21) hold for every s<t, where t\leq T_{0} (implying that \mathsf{Z}_{\mathsf{A}} has emulated the iterates a^{(2)},\ldots,a^{(t-1)} of \mathsf{A}); we show how \mathsf{Z}_{\mathsf{A}} can emulate a^{(t)}, the tth iterate of \mathsf{A}, and from it construct z^{(t)} satisfying (21). To obtain a^{(t)}, note that for every q\leq p and every s<t, the derivatives \nabla^{{q}}f_{U}(a^{(s)}) are a function of \nabla^{{q}}f(z^{(s)}) and the orthonormal vectors \{u^{(i)}\}_{i\in S_{s+1}}, because \mathop{\mathrm{supp}}\{\nabla^{{q}}f(z^{(s)})\}\subseteq S_{s+1} and therefore the chain rule implies
\left[\nabla^{{q}}f_{U}(a^{(s)})\right]_{j_{1},...,j_{q}}=\sum_{i_{1},\ldots,i% _{q}\in S_{s+1}}\left[\nabla^{{q}}f(z^{(s)})\right]_{i_{1},...,i_{q}}u^{(i_{1}% )}_{j_{1}}\cdots u^{(i_{q})}_{j_{q}}. 
Since \mathsf{A}\in\mathcal{A}_{\textnormal{det}}^{(p)} is deterministic, a^{(t)} is a function of \nabla^{{q}}f(z^{(s)}) for q\in[p] and s\in[t-1], and thus \mathsf{Z}_{\mathsf{A}} can simulate and compute it. To satisfy the support condition \mathop{\mathrm{supp}}\{z^{(t)}\}\subseteq S_{t} we require that \langle u^{(j)},a^{(t)}\rangle=0 for every j\not\in S_{t}. This also means that to compute z^{(t)}=U^{\top}a^{(t)} we require only the columns of U indexed by the support S_{t}.
Finally, we need to show that after computing S_{t+1} we can find vectors \{u^{(i)}\}_{i\in S_{t+1}\setminus S_{t}} satisfying \langle u^{(j)},a^{(s)}\rangle=0 for every s\leq t and j\in S_{t+1}\setminus S_{t}, and additionally that U be orthogonal. Thus, we need to choose \{u^{(i)}\}_{i\in S_{t+1}\setminus S_{t}} in the orthogonal complement of \mathrm{span}\left\{a^{(1)},\ldots,a^{(t)},\{u^{(i)}\}_{i\in S_{t}}\right\}. This orthogonal complement has dimension at least d^{\prime}-t-|S_{t}|=|S_{t}^{c}|+T_{0}-t\geq|S_{t}^{c}|. Since |S_{t+1}\setminus S_{t}|\leq|S_{t}^{c}|, there exist orthonormal vectors \{u^{(i)}\}_{i\in S_{t+1}\setminus S_{t}} that meet the requirements. This completes the induction.
Note that the arguments above hold unchanged for p=\infty, so we have the second result \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}},\mathcal{F}\big{% )}=\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(\infty)},% \mathcal{F}\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{% zr}}}^{(\infty)},\mathcal{F}\big{)}=\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{% \textnormal{{zr}}},\mathcal{F}\big{)}. ∎
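The inductive step repeatedly chooses new orthonormal vectors in the orthogonal complement of \mathrm{span}\{a^{(1)},\ldots,a^{(t)},\{u^{(i)}\}_{i\in S_{t}}\}. A minimal numerical sketch of this linear-algebraic step (pure Python, with hypothetical 4-dimensional data; it illustrates the complement construction only, not the algorithm \mathsf{Z}_{\mathsf{A}} itself):

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors):
    """Gram-Schmidt: an orthonormal basis of span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = math.sqrt(dot(w, w))
        if n > 1e-10:
            basis.append([wi / n for wi in w])
    return basis

def extend_orthonormal(span_vectors, dim, rng):
    """A unit vector in the orthogonal complement of span(span_vectors),
    mirroring the choice of a new column u^{(i)} orthogonal to the past
    iterates a^{(1)},...,a^{(t)} and the previously fixed columns of U."""
    basis = orthonormalize(span_vectors)
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        n = math.sqrt(dot(v, v))
        if n > 1e-8:  # retry in the (unlikely) degenerate case
            return [vi / n for vi in v]

rng = random.Random(0)
a1 = [1.0, 2.0, 0.0, 0.0]   # a past iterate (hypothetical data)
u1 = [0.0, 0.0, 1.0, 0.0]   # a previously fixed column of U
u2 = extend_orthonormal([a1, u1], 4, rng)
assert abs(dot(u2, a1)) < 1e-9 and abs(dot(u2, u1)) < 1e-9
```

The dimension count in the proof is exactly what guarantees that this complement is large enough to supply all of \{u^{(i)}\}_{i\in S_{t+1}\setminus S_{t}}.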
Appendix B Technical Results
B.1 Proof of Lemma 1
See 1
Each of the statements in the lemma is immediate except for part iii. To see this part, we require a few further calculations. We begin by providing bounds on the derivatives of \Phi(x)=e^{\frac{1}{2}}\int_{-\infty}^{x}e^{-\frac{1}{2}t^{2}}dt. To avoid annoyances with scaling factors, we define \phi(t)=e^{-\frac{1}{2}t^{2}}. We have the following lemma, whose proof we defer temporarily.
Lemma 7.
For all k\in\mathbb{N}, there exist constants c_{i}^{(k)} satisfying |c_{i}^{(k)}|\leq(2\max\{i,1\})^{k} such that
\phi^{(k)}(t)=\bigg{(}\sum_{i=0}^{k}c_{i}^{(k)}t^{i}\bigg{)}\phi(t).
With this result, we find that for any k\geq 1,
\Phi^{(k)}(x)=\sqrt{e}\bigg{(}\sum_{i=0}^{k-1}c_{i}^{(k-1)}x^{i}\bigg{)}\phi(x).
The function \log(|x|^{i}\phi(x))=i\log|x|-\frac{1}{2}x^{2} is maximized at |x|=\sqrt{i}, so that |x|^{i}\phi(x)\leq\exp(\frac{i}{2}\log\frac{i}{e}). We thus obtain the numerically verifiable upper bound
\displaystyle|\Phi^{(k)}(x)|  \displaystyle\leq\sqrt{e}\sum_{i=0}^{k-1}\left(2\max\{i,1\}\right)^{k-1}\exp\left(\frac{i}{2}\log\frac{i}{e}\right)\leq\exp\left(1.5k\log(1.5k)\right).
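The single-variable maximization used above is easy to sanity-check numerically; an illustrative grid-search sketch (not part of the proof):

```python
import math

def phi(x):
    """phi(t) = exp(-t^2/2), the unnormalized Gaussian from the text."""
    return math.exp(-0.5 * x * x)

grid = [k / 100.0 for k in range(-1000, 1001)]
for i in range(1, 8):
    # claimed maximum of |x|^i phi(x), attained at |x| = sqrt(i)
    peak = math.exp((i / 2.0) * math.log(i / math.e))
    assert max(abs(x) ** i * phi(x) for x in grid) <= peak + 1e-9
```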
Now, we turn to considering the function \Psi(x). We assume w.l.o.g. that x>\frac{1}{2}, as otherwise \Psi^{(k)}(x)=0 for all k. Recall that \Psi(x)=\exp\left(1-\frac{1}{(2x-1)^{2}}\right) for x>\frac{1}{2}. We have the following lemma regarding its derivatives.
Lemma 8.
For all k\in\mathbb{N}, there exist constants c_{i}^{(k)} satisfying |c_{i}^{(k)}|\leq 6^{k}(2i+k)^{k} such that
\Psi^{(k)}(x)=\bigg{(}\sum_{i=1}^{k}\frac{c_{i}^{(k)}}{(2x-1)^{k+2i}}\bigg{)}\Psi(x).
As in the derivation immediately following Lemma 7, by replacing t=\frac{1}{2x-1}, we have that t^{k+2i}e^{-t^{2}} is maximized at t=\sqrt{(k+2i)/2}, so that
\frac{1}{(2x-1)^{k+2i}}\Psi(x)\leq\exp\left(1+\frac{k+2i}{2}\log\frac{k+2i}{2e}\right),
which yields the numerically verifiable upper bound
|\Psi^{(k)}(x)|\leq\sum_{i=1}^{k}\exp\left(1+k\log(6k+12i)+\frac{k+2i}{2}\log\frac{k+2i}{2e}\right)\leq\exp\left(2.5k\log(4k)\right).
Proof of Lemma 7. We prove the result by induction. We have \phi^{\prime}(t)=-te^{-\frac{1}{2}t^{2}}, so that the base case of the induction is satisfied. Now, assume for our induction that
\phi^{(k)}(t)=\sum_{i=0}^{k}c_{i}^{(k)}t^{i}e^{-\frac{1}{2}t^{2}}=\sum_{i=0}^{k}c_{i}^{(k)}t^{i}\phi(t),
where |c_{i}^{(k)}|\leq 2^{k}(\max\{i,1\})^{k}. Then taking derivatives, we have
\phi^{(k+1)}(t)=\sum_{i=1}^{k}\left[i\cdot c_{i}^{(k)}t^{i-1}-c_{i}^{(k)}t^{i+1}\right]\phi(t)-c_{0}^{(k)}t\phi(t)=\sum_{i=0}^{k+1}c_{i}^{(k+1)}t^{i}\phi(t),
where c_{i}^{(k+1)}=(i+1)c_{i+1}^{(k)}-c_{i-1}^{(k)} (and we treat c_{k+1}^{(k)}=c_{k+2}^{(k)}=0 and c_{-1}^{(k)}=0). With the induction hypothesis that |c_{i}^{(k)}|\leq(2\max\{i,1\})^{k}, we obtain
|c_{i}^{(k+1)}|\leq 2^{k}(i+1)(i+1)^{k}+2^{k}(\max\{i,1\})^{k}\leq 2^{k+1}(i+1)^{k+1}.
This gives the result.
∎
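The recursion c_{i}^{(k+1)}=(i+1)c_{i+1}^{(k)}-c_{i-1}^{(k)} from this proof is mechanical enough to verify in code against the classical low-order derivatives of the Gaussian (\phi^{\prime\prime}=(t^{2}-1)\phi, \phi^{\prime\prime\prime}=(3t-t^{3})\phi, \phi^{(4)}=(t^{4}-6t^{2}+3)\phi); an illustrative sketch:

```python
def phi_coeffs(k):
    """Coefficients c_i^{(k)} with phi^{(k)}(t) = (sum_i c_i t^i) phi(t),
    built from c_i^{(k+1)} = (i+1) c_{i+1}^{(k)} - c_{i-1}^{(k)},
    starting from phi'(t) = -t phi(t)."""
    c = [0.0, -1.0]                       # k = 1: c_0 = 0, c_1 = -1
    for step in range(1, k):
        prev = c + [0.0, 0.0]             # out-of-range coefficients are zero
        c = [(i + 1) * prev[i + 1] - (prev[i - 1] if i >= 1 else 0.0)
             for i in range(step + 2)]
    return c

# classical low-order derivatives of the Gaussian:
assert phi_coeffs(2) == [-1.0, 0.0, 1.0]        # phi''  = (t^2 - 1) phi
assert phi_coeffs(3) == [0.0, 3.0, 0.0, -1.0]   # phi''' = (3t - t^3) phi
assert phi_coeffs(4) == [3.0, 0.0, -6.0, 0.0, 1.0]
```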
Proof of Lemma 8. We provide the proof by induction over k. For k=1, we have that
\Psi^{\prime}(x)=\frac{4}{(2x-1)^{3}}\exp\left(1-\frac{1}{(2x-1)^{2}}\right)=\frac{4}{(2x-1)^{3}}\Psi(x),
which yields the base case of the induction. Now, assume that for some k, we have
\Psi^{(k)}(x)=\left(\sum_{i=1}^{k}\frac{c_{i}^{(k)}}{(2x-1)^{k+2i}}\right)\Psi(x).
Then
\displaystyle\Psi^{(k+1)}(x)  \displaystyle=\left(-\sum_{i=1}^{k}\frac{2(k+2i)c_{i}^{(k)}}{(2x-1)^{k+1+2i}}+\sum_{i=1}^{k}\frac{4c_{i}^{(k)}}{(2x-1)^{k+3+2i}}\right)\Psi(x)
\displaystyle=\left(\sum_{i=1}^{k+1}\frac{4c_{i-1}^{(k)}-2(k+2i)c_{i}^{(k)}}{(2x-1)^{k+1+2i}}\right)\Psi(x),
where c_{k+1}^{(k)}=0 and c_{0}^{(k)}=0. Defining c_{1}^{(1)}=4 and c_{i}^{(k+1)}=4c_{i-1}^{(k)}-2(k+2i)c_{i}^{(k)} for i\geq 1, then, under the inductive hypothesis that |c_{i}^{(k)}|\leq 6^{k}(2i+k)^{k}, we have
|c_{i}^{(k+1)}|\leq 4\cdot 6^{k}(k-2+2i)^{k}+2\cdot 6^{k}(k+2i)(k+2i)^{k}\leq 6^{k+1}(k+2i)^{k+1}\leq 6^{k+1}(k+1+2i)^{k+1},
which gives the result.
∎
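Likewise, the recursion c_{i}^{(k+1)}=4c_{i-1}^{(k)}-2(k+2i)c_{i}^{(k)} can be checked against a finite-difference second derivative of \Psi at a point away from x=1/2; an illustrative sketch:

```python
import math

def Psi(x):
    """Psi(x) = exp(1 - 1/(2x-1)^2) for x > 1/2 and 0 otherwise."""
    return math.exp(1.0 - 1.0 / (2.0 * x - 1.0) ** 2) if x > 0.5 else 0.0

def psi_coeffs(k):
    """c_i^{(k)}, i = 1..k, with Psi^{(k)}(x) = (sum_i c_i/(2x-1)^{k+2i}) Psi(x)."""
    c = {1: 4.0}                          # Psi'(x) = 4 (2x-1)^{-3} Psi(x)
    for step in range(1, k):
        c = {i: 4.0 * c.get(i - 1, 0.0) - 2.0 * (step + 2 * i) * c.get(i, 0.0)
             for i in range(1, step + 2)}
    return c

def psi_deriv(x, k):
    return Psi(x) * sum(ci / (2.0 * x - 1.0) ** (k + 2 * i)
                        for i, ci in psi_coeffs(k).items())

x, h = 1.5, 1e-5
fd2 = (Psi(x + h) - 2.0 * Psi(x) + Psi(x - h)) / h ** 2  # central difference
assert abs(fd2 - psi_deriv(x, 2)) < 1e-3
```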
B.2 Proof of Lemma 3
See 3
Proof.
Part i follows because \bar{f}_{T}(0)<0 and, since 0\leq\Psi(x)\leq e and 0\leq\Phi(x)\leq\sqrt{2\pi e},
\bar{f}_{T}(x)\geq-\Psi\left(1\right)\Phi\left(x_{1}\right)-\sum_{i=2}^{T}\Psi\left(x_{i-1}\right)\Phi\left(x_{i}\right)>-T\cdot e\cdot\sqrt{2\pi e}\geq-12T.
Part ii follows additionally from \Psi(x)=0 for x<1/2, 0\leq\Psi^{\prime}(x)\leq\sqrt{54e^{-1}}, and 0\leq\Phi^{\prime}(x)\leq\sqrt{e}, which when substituted into equality (11) yields
\left|\frac{{\partial}\bar{f}_{T}}{{\partial}x_{j}}(x)\right|\leq e\cdot\sqrt{e}+\sqrt{54e^{-1}}\cdot\sqrt{2\pi e}\leq 23
for every x and j. Consequently, \left\|{\nabla}\bar{f}_{T}(x)\right\|\leq 23\sqrt{T}.
To establish part iii, fix a point x\in\mathbb{R}^{T} and a unit vector v\in\mathbb{R}^{T}. Define the real function h_{x,v}:\mathbb{R}\to\mathbb{R} by the directional projection of \bar{f}_{T}, h_{x,v}(\theta):=\bar{f}_{T}(x+\theta v). The function \theta\mapsto h_{x,v}(\theta) is infinitely differentiable for every x and v. Therefore, \bar{f}_{T} has \ell_{p}-Lipschitz pth order derivatives if and only if |h_{x,v}^{(p+1)}(0)|\leq\ell_{p} for every x, v. Using the shorthand notation {\partial}_{i_{1}}\cdots{\partial}_{i_{k}} for \frac{{\partial}^{k}}{{\partial}x_{i_{1}}\cdots{\partial}x_{i_{k}}}, we have
h_{x,v}^{\left(p+1\right)}\left(0\right)=\sum_{i_{1},\ldots,i_{p+1}=1}^{T}{% \partial}_{i_{1}}\cdots{\partial}_{i_{p+1}}\bar{f}_{T}\left(x\right)v_{i_{1}}% \cdots v_{i_{p+1}}\,. 
Examining \bar{f}_{T}, we see that {\partial}_{i_{1}}\cdots{\partial}_{i_{p+1}}\bar{f}_{T} is nonzero if and only if |i_{j}-i_{k}|\leq 1 for every j,k\in\left[p+1\right]. Consequently, we can rearrange the above summation as
h_{x,v}^{\left(p+1\right)}\left(0\right)=\sum_{\delta=(\delta_{1},\ldots,\delta_{p})\in\left\{0,1\right\}^{p}\cup\left\{0,-1\right\}^{p}}\sum_{i=1}^{T}{\partial}_{i+\delta_{1}}\cdots{\partial}_{i+\delta_{p}}{\partial}_{i}\bar{f}_{T}\left(x\right)v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i},
where we take v_{0}:=0 and v_{T+1}:=0. A brief calculation shows that
\displaystyle\sup_{x\in\mathbb{R}^{T}}  \displaystyle\max_{i\in[T]}\max_{\delta\in\{0,1\}^{p}\cup\{0,-1\}^{p}}\left|{\partial}_{i+\delta_{1}}\cdots{\partial}_{i+\delta_{p}}{\partial}_{i}\bar{f}_{T}(x)\right|\leq\max_{k\in[p+1]}\left\{2\sup_{x\in\mathbb{R}}\left|\Psi^{(k)}(x)\right|\sup_{x^{\prime}\in\mathbb{R}}\left|\Phi^{(p+1-k)}(x^{\prime})\right|\right\}
\displaystyle\leq 2\sqrt{2\pi e}\cdot e^{2.5(p+1)\log(4(p+1))}\leq\exp\left(2.5p\log p+4p+9\right),
where the second inequality uses Lemma 1.iii, and \Phi(x^{\prime})\leq\sqrt{2\pi e} for the case k=p+1. Defining \ell_{p}=2^{p+1}e^{2.5p\log p+4p+9}\leq e^{2.5p\log p+5p+10}, we thus have
\left|h_{x,v}^{\left(p+1\right)}\left(0\right)\right|\leq\sum_{\delta\in\left\{0,1\right\}^{p}\cup\left\{0,-1\right\}^{p}}2^{-\left(p+1\right)}\ell_{p}\left|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}\right|\leq\left(2^{p+1}-1\right)2^{-\left(p+1\right)}\ell_{p}\leq\ell_{p},
where we have used \left|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}\right|\leq 1 for every \delta\in\{0,1\}^{p}\cup\{0,-1\}^{p}. To see that this last claim is true, recall that v is a unit vector and note that
\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}=\sum_{i=1}^{T}v_{i}^{p+1-\sum_{j=1}^{p}|\delta_{j}|}v_{i\pm 1}^{\sum_{j=1}^{p}|\delta_{j}|}.
If \delta=0 then \left|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}\right|\leq\sum_{i=1}^{T}|v_{i}|^{p+1}\leq\sum_{i=1}^{T}v_{i}^{2}=1. Otherwise, letting 1\leq\sum_{j=1}^{p}|\delta_{j}|=n\leq p, the Cauchy–Schwarz inequality implies
\left|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}\right|=\left|\sum_{i=1}^{T}v_{i}^{p+1-n}v_{i+s}^{n}\right|\leq\sqrt{\sum_{i=1}^{T}v_{i}^{2\left(p+1-n\right)}}\sqrt{\sum_{i=1}^{T}v_{i+s}^{2n}}\leq\sum_{i=1}^{T}v_{i}^{2}=1,
where s=1 or s=-1. This gives the result. ∎
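The last claim, that |\sum_{i}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}|\leq 1 for any unit vector v, is easy to probe empirically. A randomized sketch (the values of p and T below are arbitrary illustrations, not taken from the text):

```python
import math
import random

rng = random.Random(1)
p, T = 3, 10
for _ in range(200):
    v = [rng.gauss(0.0, 1.0) for _ in range(T)]
    norm = math.sqrt(sum(x * x for x in v))
    # w[i] = v_i for i = 1..T, with the boundary convention v_0 = v_{T+1} = 0
    w = [0.0] + [x / norm for x in v] + [0.0]
    sign = rng.choice([1, -1])              # delta in {0,1}^p or {0,-1}^p
    delta = [sign * rng.randint(0, 1) for _ in range(p)]
    s = sum(math.prod(w[i + d] for d in delta) * w[i] for i in range(1, T + 1))
    assert abs(s) <= 1.0 + 1e-9             # the Cauchy-Schwarz bound above
```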
B.3 Proof of Lemma 4: main argument
The proof of Lemma 4 contains multiple sublemmas, with some proofs in Section B.4. We use notation defined in the proof of Lemma 4 within these sublemmas.
See 4
Proof.
For t\in\mathbb{N}, let P_{t}\in\mathbb{R}^{d\times d} denote the projection operator onto the span of x^{(1)},u^{(1)},\ldots,x^{(t)},u^{(t)}, and let P_{t}^{\perp}=I-P_{t} denote its orthogonal complement. We define the event G_{t} as
G_{t}=\left\{\max_{j\in\{t,\ldots,T\}}\left|\left\langle u^{(j)},P_{t-1}^{\perp}x^{(t)}\right\rangle\right|\leq\alpha\left\|P_{t-1}^{\perp}x^{(t)}\right\|\right\}\text{ where }\alpha=\frac{1}{5R\sqrt{T}}.  (22)
For every t, define
G_{\leq t}=\cap_{i\leq t}G_{i}\text{ and }G_{<t}=\cap_{i<t}G_{i}\,. 
The following linearalgebraic result justifies the definition (22) of G_{t}; we defer the proof to Section B.4.1.
Lemma 9.
For all t\leq T, the event G_{\leq t} implies |\langle u^{(j)},x^{(s)}\rangle|<1/2 for every s\in\{1,\ldots,t\} and every j\in\{s,\ldots,T\}.
By Lemma 9 the event G_{\leq T} implies our result, so it suffices to show that
{\mathbb{P}}\left(G_{\leq T}^{c}\right)\leq\sum_{t=1}^{T}{\mathbb{P}}(G_{t}^{c}\mid G_{<t})\leq\delta.  (23)
Let us therefore consider {\mathbb{P}}\left(G_{t}^{c}\mid G_{<t}\right). By the union bound and the fact that \left\|P_{t-1}^{\perp}u^{(j)}\right\|\leq 1 for every t and j,
\displaystyle{\mathbb{P}}(G_{t}^{c}\mid G_{<t})  \displaystyle\leq\sum_{j\in\{t,\ldots,T\}}{\mathbb{P}}\left(\left|\left\langle u^{(j)},\frac{P_{t-1}^{\perp}x^{(t)}}{\left\|P_{t-1}^{\perp}x^{(t)}\right\|}\right\rangle\right|>\alpha\mid G_{<t}\right)
\displaystyle=\sum_{j\in\{t,\ldots,T\}}{\mathbb{E}}_{\xi,U_{(<t)}}{\mathbb{P}}\left(\left|\left\langle u^{(j)},\frac{P_{t-1}^{\perp}x^{(t)}}{\left\|P_{t-1}^{\perp}x^{(t)}\right\|}\right\rangle\right|>\alpha\mid\xi,U_{(<t)},G_{<t}\right)
\displaystyle\leq\sum_{j\in\{t,\ldots,T\}}{\mathbb{E}}_{\xi,U_{(<t)}}{\mathbb{P}}\left(\left|\left\langle\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|P_{t-1}^{\perp}u^{(j)}\right\|},\frac{P_{t-1}^{\perp}x^{(t)}}{\left\|P_{t-1}^{\perp}x^{(t)}\right\|}\right\rangle\right|>\alpha\mid\xi,U_{(<t)},G_{<t}\right),  (24)
where U_{(<t)} is shorthand for u^{(1)},\ldots,u^{(t-1)} and \xi is the random variable generating x^{(1)},\ldots,x^{(T)}.
In the following lemma, we state formally that conditioned on G_{<i}, the iterate x^{(i)} depends on U only through its first i-1 columns.
Lemma 10.
For every i\leq T, there exist measurable functions \mathsf{A}^{(i)}_{+} and \mathsf{A}^{(i)}_{-} such that
x^{(i)}=\mathsf{A}^{(i)}_{+}\left(\xi,U_{(<i)}\right)1_{\left({G_{<i}}\right)}+\mathsf{A}^{(i)}_{-}\left(\xi,U\right)1_{\left({G_{<i}^{c}}\right)}.  (25)
Proof.
Since the iterates are informed by \tilde{f}_{T;U}, we may write each one as (recall definition (4))
x^{(i)}=\mathsf{A}^{(i)}\left(\xi,\nabla^{{(0,\ldots,p)}}\tilde{f}_{T;U}(x^{(1)}),\ldots,\nabla^{{(0,\ldots,p)}}\tilde{f}_{T;U}(x^{(i-1)})\right)=\mathsf{A}^{(i)}_{-}\left(\xi,U\right),
for measurable functions \mathsf{A}^{(i)},\mathsf{A}^{(i)}_{-}, where we recall the shorthand \nabla^{{(0,\ldots,p)}}h(x) for the derivatives of h at x up to order p. Crucially, by Lemma 9, G_{<i} implies |\langle u^{(j)},x^{(s)}\rangle|<\frac{1}{2} for every s<i and every j\geq s. As \bar{f}_{T} is a fixed robust zero-chain (Definition 4), for any s<i, the derivatives of \tilde{f}_{T;U} at x^{(s)} can therefore be expressed as functions of x^{(s)} and u^{(1)},\ldots,u^{(s-1)}, and—applying this argument recursively—we see that x^{(i)} is of the form (25) for every i\leq T. ∎
Consequently (as G_{<t} implies G_{<i} for every i\leq t), conditioned on \xi,U_{(<t)} and G_{<t}, the iterates x^{(1)},\ldots,x^{(t)} are deterministic, and so is P_{t-1}^{\perp}x^{(t)}. If P_{t-1}^{\perp}x^{(t)}=0 then G_{t} holds and {\mathbb{P}}(G_{t}^{c}\mid G_{<t})=0, so we may assume without loss of generality that P_{t-1}^{\perp}x^{(t)}\neq 0. We may therefore regard {P_{t-1}^{\perp}x^{(t)}}/{\left\|P_{t-1}^{\perp}x^{(t)}\right\|} in (24) as a deterministic unit vector in the subspace onto which P_{t-1}^{\perp} projects. We now characterize the conditional distribution of {P_{t-1}^{\perp}u^{(j)}}/{\left\|P_{t-1}^{\perp}u^{(j)}\right\|}.
Lemma 11.
Let t\leq T and j\in\{t,\ldots,T\}. Then, conditioned on \xi,U_{(<t)} and G_{<t}, the vector \frac{P_{t-1}^{\perp}u^{(j)}}{\left\|P_{t-1}^{\perp}u^{(j)}\right\|} is uniformly distributed on the unit sphere in the subspace onto which P_{t-1}^{\perp} projects.
Lemma 11 is subtle. The vectors u^{(j)}, j\geq t, conditioned on U_{(<t)}, are certainly uniformly distributed on the unit sphere in the subspace orthogonal to U_{(<t)}. However, the additional conditioning on G_{<t} requires careful handling; we prove the lemma in Section B.4.2.
Summarizing the discussion above, the conditional probability in (24) measures the inner product of two unit vectors in a subspace of \mathbb{R}^{d} of dimension d^{\prime}={\mathrm{tr}}\left(P_{t-1}^{\perp}\right)\geq d-2\left(t-1\right), with one of the vectors deterministic and the other uniformly distributed. We may write this as
{\mathbb{P}}\left(\left|\left\langle\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|P_{t-1}^{\perp}u^{(j)}\right\|},\frac{P_{t-1}^{\perp}x^{(t)}}{\left\|P_{t-1}^{\perp}x^{(t)}\right\|}\right\rangle\right|>\alpha\mid\xi,U_{(<t)},G_{<t}\right)={\mathbb{P}}(|v_{1}|>\alpha),
where v is uniformly distributed on the unit sphere in \mathbb{R}^{d^{\prime}}. By a standard concentration of measure bound on the sphere [5, Lecture 8],
{\mathbb{P}}(|v_{1}|>\alpha)\leq 2e^{-d^{\prime}\alpha^{2}/2}\leq 2e^{-\frac{\alpha^{2}}{2}\left(d-2t\right)}.
Substituting this bound back into the probability (24) gives
{\mathbb{P}}\left(G_{t}^{c}\mid G_{<t}\right)\leq 2\left(T-t+1\right)e^{-\frac{\alpha^{2}}{2}\left(d-2t\right)}\leq 2Te^{-\frac{\alpha^{2}}{2}\left(d-2T\right)}.
Substituting this in turn into the bound (23), we have {\mathbb{P}}(G_{\leq T}^{c})\leq\sum_{t=1}^{T}{\mathbb{P}}(G_{t}^{c}\mid G_{<t})\leq 2T^{2}e^{-\frac{\alpha^{2}}{2}(d-2T)}. Setting d\geq 52TR^{2}\log\frac{2T^{2}}{\delta}\geq\frac{2}{\alpha^{2}}\log\frac{2T^{2}}{\delta}+2T establishes {\mathbb{P}}(G_{\leq T}^{c})\leq\delta, concluding the proof of Lemma 4. ∎
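As an aside, the sphere concentration inequality used in the last step is easy to sanity-check numerically. The sketch below (not part of the proof; the dimension, threshold \alpha, and sample size are arbitrary illustrative choices) estimates {\mathbb{P}}(|v_{1}|>\alpha) for v uniform on the unit sphere and compares it with the bound 2e^{-d\alpha^{2}/2}:

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, n = 400, 0.2, 20000  # illustrative choices only

# Uniform points on the unit sphere in R^d: normalize standard Gaussians.
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)

empirical = np.mean(np.abs(v[:, 0]) > alpha)
bound = 2 * np.exp(-d * alpha**2 / 2)
assert empirical <= bound  # the sub-Gaussian tail bound holds comfortably
```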
B.4 Proof of Lemma 4: auxiliary arguments
B.4.1 Proof of Lemma 9
See Lemma 9.
Proof.
First, notice that since G_{\leq t} implies G_{\leq s} for every s\leq t, it suffices to show that G_{\leq t} implies \left|\langle u^{(j)},x^{(t)}\rangle\right|<1/2 for every j\in\{t,\ldots,T\}. We will in fact prove a stronger statement:
For every t, G_{<t} implies \left\|P_{t-1}u^{(j)}\right\|^{2}\leq 2\alpha^{2}\left(t-1\right) for every j\in\{t,\ldots,T\},  (26)
where we recall that P_{t}\in\mathbb{R}^{d\times d} is the projection operator onto the span of x^{(1)},u^{(1)},\ldots,x^{(t)},u^{(t)}, P_{t}^{\perp}=I_{d}-P_{t} and \alpha=1/(5R\sqrt{T}). Before proving (26), let us show that it implies our result. Fixing j\in\{t,\ldots,T\}, we have
\left|\left\langle u^{(j)},x^{(t)}\right\rangle\right|\leq\left|\left\langle u^{(j)},P_{t-1}^{\perp}x^{(t)}\right\rangle\right|+\left|\left\langle u^{(j)},P_{t-1}x^{(t)}\right\rangle\right|.
Since G_{t} holds, its definition (22) implies \left|\langle u^{(j)},P_{t-1}^{\perp}x^{(t)}\rangle\right|\leq\alpha\left\|P_{t-1}^{\perp}x^{(t)}\right\|\leq\alpha\left\|x^{(t)}\right\|. Moreover, by Cauchy-Schwarz and the implication (26), we have \left|\langle u^{(j)},P_{t-1}x^{(t)}\rangle\right|\leq\left\|P_{t-1}u^{(j)}\right\|\left\|x^{(t)}\right\|\leq\sqrt{2\alpha^{2}(t-1)}\left\|x^{(t)}\right\|. Combining the two bounds, we obtain the result of the lemma,
\left|\left\langle u^{(j)},x^{(t)}\right\rangle\right|\leq\left\|x^{(t)}\right\|\left(\alpha+\sqrt{2\alpha^{2}(t-1)}\right)<\frac{5}{2}\sqrt{t}R\alpha\leq\frac{1}{2},
where we have used \left\|x^{(t)}\right\|\leq R and \alpha=1/(5R\sqrt{T}).
We prove the bound (26) by induction. The base case, t=1, is trivial, as P_{0}=0. We assume that (26) holds for s\in\{1,\ldots,t-1\} and show that it consequently holds for s=t as well. Applying the Gram-Schmidt procedure to the sequence x^{(1)},u^{(1)},\ldots,x^{(t-1)},u^{(t-1)}, we may write
\left\|P_{t-1}u^{(j)}\right\|^{2}=\sum_{i=1}^{t-1}\left|\left\langle\frac{P_{i-1}^{\perp}x^{(i)}}{\left\|P_{i-1}^{\perp}x^{(i)}\right\|},u^{(j)}\right\rangle\right|^{2}+\sum_{i=1}^{t-1}\left|\left\langle\frac{\hat{P}_{i-1}^{\perp}u^{(i)}}{\left\|\hat{P}_{i-1}^{\perp}u^{(i)}\right\|},u^{(j)}\right\rangle\right|^{2},  (27)
where \hat{P}_{k} is the projection to the span of \{x^{(1)},u^{(1)},\ldots,x^{(k)},u^{(k)},x^{(k+1)}\},
\hat{P}_{k}=P_{k}+\frac{1}{\left\|P_{k}^{\perp}x^{(k+1)}\right\|^{2}}\left(P_{k}^{\perp}x^{(k+1)}\right)\left(P_{k}^{\perp}x^{(k+1)}\right)^{\top}.
Then for every j>i we have
\left\langle\hat{P}_{i-1}^{\perp}u^{(i)},u^{(j)}\right\rangle=-\left\langle\hat{P}_{i-1}u^{(i)},u^{(j)}\right\rangle=-\left\langle{P}_{i-1}u^{(i)},u^{(j)}\right\rangle-\frac{\left\langle u^{(i)},P_{i-1}^{\perp}x^{(i)}\right\rangle\left\langle u^{(j)},P_{i-1}^{\perp}x^{(i)}\right\rangle}{\left\|P_{i-1}^{\perp}x^{(i)}\right\|^{2}},
where the equalities hold by \left\langle u^{(i)},u^{(j)}\right\rangle=0, \hat{P}_{i-1}^{\perp}=I-\hat{P}_{i-1}, and the definition of \hat{P}_{i-1}.
The P_{i} matrices are projections, so {P}_{i-1}^{2}={P}_{i-1}, and Cauchy-Schwarz and the induction hypothesis imply
\left|\left\langle{P}_{i-1}u^{(i)},u^{(j)}\right\rangle\right|=\left|\left\langle{P}_{i-1}u^{(i)},{P}_{i-1}u^{(j)}\right\rangle\right|\leq\left\|P_{i-1}u^{(i)}\right\|\left\|P_{i-1}u^{(j)}\right\|\leq 2\alpha^{2}\left(i-1\right).
Moreover, the event G_{i} implies \left|\left\langle u^{(i)},P_{i-1}^{\perp}x^{(i)}\right\rangle\left\langle u^{(j)},P_{i-1}^{\perp}x^{(i)}\right\rangle\right|\leq\alpha^{2}\left\|P_{i-1}^{\perp}x^{(i)}\right\|^{2}, so
\left|\left\langle\hat{P}_{i-1}^{\perp}u^{(i)},u^{(j)}\right\rangle\right|\leq\left|\left\langle{P}_{i-1}u^{(i)},u^{(j)}\right\rangle\right|+\left|\frac{\left\langle u^{(i)},P_{i-1}^{\perp}x^{(i)}\right\rangle\left\langle u^{(j)},P_{i-1}^{\perp}x^{(i)}\right\rangle}{\left\|P_{i-1}^{\perp}x^{(i)}\right\|^{2}}\right|\leq\alpha^{2}\left(2i-1\right)\leq\frac{\alpha}{2},  (28a)
where the last transition uses \alpha=\frac{1}{5R\sqrt{T}}\leq\frac{1}{4i}, which holds because R\geq\sqrt{T} and i\leq T. We also have the lower bound
\left\|\hat{P}_{i-1}^{\perp}u^{(i)}\right\|^{2}=\left|\left\langle\hat{P}_{i-1}^{\perp}u^{(i)},u^{(i)}\right\rangle\right|=1-\left\|P_{i-1}u^{(i)}\right\|^{2}-\frac{\left(\left\langle u^{(i)},P_{i-1}^{\perp}x^{(i)}\right\rangle\right)^{2}}{\left\|P_{i-1}^{\perp}x^{(i)}\right\|^{2}}\geq 1-\alpha^{2}\left(2i-1\right)\geq\frac{1}{2},  (28b)
where the first equality uses (\hat{P}_{i-1}^{\perp})^{2}=\hat{P}_{i-1}^{\perp}, the second the definition of \hat{P}_{i-1}, and the inequality uses \left|\langle u^{(i)},P_{i-1}^{\perp}x^{(i)}\rangle\right|\leq\alpha\left\|P_{i-1}^{\perp}x^{(i)}\right\| and \left\|P_{i-1}u^{(i)}\right\|^{2}\leq 2\alpha^{2}\left(i-1\right).
Combining the observations (28a) and (28b), we can bound each summand in the second summation in (27). Since the summands in the first summation are bounded by \alpha^{2} by the definition (22) of G_{i}, we obtain
\left\|P_{t-1}u^{(j)}\right\|^{2}\leq\sum_{i=1}^{t-1}\alpha^{2}+\sum_{i=1}^{t-1}\frac{\left(\alpha/2\right)^{2}}{1/2}=\alpha^{2}\left(t-1+\frac{t-1}{2}\right)\leq 2\alpha^{2}\left(t-1\right),
which completes the induction. ∎
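The decomposition (27) at the heart of this induction is a standard Gram-Schmidt identity: the squared norm of a projection equals the sum of squared inner products with an orthonormal basis of the subspace projected onto. A quick numerical sketch with random stand-in vectors (illustrative only, not the construction from the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d, t = 12, 4  # illustrative dimensions

# Random stand-ins for x^(1), u^(1), ..., x^(t-1), u^(t-1).
vecs = [rng.standard_normal(d) for _ in range(2 * (t - 1))]
u_j = rng.standard_normal(d)

# Gram-Schmidt: orthonormal basis of span(vecs).
basis = []
for w in vecs:
    for b in basis:
        w = w - (b @ w) * b
    basis.append(w / np.linalg.norm(w))
B = np.stack(basis)

P = B.T @ B                      # the projection onto span(vecs)
lhs = np.linalg.norm(P @ u_j) ** 2
rhs = np.sum((B @ u_j) ** 2)     # sum of squared coefficients, as in (27)
assert abs(lhs - rhs) < 1e-10
```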
B.4.2 Proof of Lemma 11
See Lemma 11.
Proof.
Throughout the proof we fix t\leq T and j\in\{t,\ldots,T\}. We begin by noting that by (26), G_{<t} implies
\left\|P_{t-1}^{\perp}u^{(j)}\right\|^{2}=1-\left\|P_{t-1}u^{(j)}\right\|^{2}\geq 1-2\alpha^{2}(t-1)>0.
Therefore, when G_{<t} holds we have P_{t-1}^{\perp}u^{(j)}\neq 0, so {P_{t-1}^{\perp}u^{(j)}}/{\left\|P_{t-1}^{\perp}u^{(j)}\right\|} is well-defined.
To establish our result, we will show that the density of U_{(\geq t)}=[u^{(t)},\ldots,u^{(T)}] conditioned on \xi,U_{(<t)},G_{<t} is invariant to rotations that preserve the span of x^{(1)},u^{(1)},\ldots,x^{(t-1)},u^{(t-1)}. More formally, let p_{\geq t} denote the density of U_{(\geq t)} conditional on \xi,U_{(<t)} and G_{<t}. We wish to show that
p_{\geq t}\left(U_{(\geq t)}\mid\xi,U_{(<t)},G_{<t}\right)=p_{\geq t}\left(ZU_{(\geq t)}\mid\xi,U_{(<t)},G_{<t}\right)  (29)
for every rotation Z\in\mathbb{R}^{d\times d}, Z^{\top}Z=I_{d}, satisfying
Zv=v=Z^{\top}v~~\mbox{for all}~~v\in\left\{x^{(1)},u^{(1)},\ldots,x^{(t-1)},u^{(t-1)}\right\}.
Throughout, we let Z denote such a rotation. Letting p_{\xi,U} and p_{U} denote the densities of (\xi,U) and U, respectively, we have
p_{\geq t}\left(U_{(\geq t)}\mid\xi,U_{(<t)},G_{<t}\right)=\frac{{\mathbb{P}}\left(G_{<t}\mid\xi,U\right)p_{\xi,U}\left(\xi,U\right)}{{\mathbb{P}}\left(G_{<t}\mid\xi,U_{(<t)}\right)p_{\xi,U_{(<t)}}\left(\xi,U_{(<t)}\right)}=\frac{{\mathbb{P}}\left(G_{<t}\mid\xi,U\right)p_{U}\left(U\right)}{{\mathbb{P}}\left(G_{<t}\mid\xi,U_{(<t)}\right)p_{U_{(<t)}}\left(U_{(<t)}\right)},
where the first equality holds by the definition of conditional probability and the second by the independence of \xi and U. We have ZU_{(<t)}=U_{(<t)} and therefore, by the invariance of U to rotations, p_{U}([U_{(<t)},ZU_{(\geq t)}])=p_{U}(ZU)=p_{U}(U). Hence, replacing U with ZU in the display above yields
p_{\geq t}\left(ZU_{(\geq t)}\mid\xi,U_{(<t)},G_{<t}\right)=\frac{{\mathbb{P}}\left(G_{<t}\mid\xi,ZU\right)p_{U}\left(U\right)}{{\mathbb{P}}\left(G_{<t}\mid\xi,U_{(<t)}\right)p_{U_{(<t)}}\left(U_{(<t)}\right)}.
Therefore if we prove {\mathbb{P}}(G_{<t}\mid\xi,U)={\mathbb{P}}(G_{<t}\mid\xi,ZU)—as we proceed to do—then we can conclude the equality (29) holds.
First, note that {\mathbb{P}}\left(G_{<t}\mid\xi,U\right) is supported on \{0,1\} for every \xi,U, as they completely determine x^{(1)},\ldots,x^{(T)}. It therefore suffices to show that {\mathbb{P}}(G_{<t}\mid\xi,U)=1 if and only if {\mathbb{P}}\left(G_{<t}\mid\xi,ZU\right)=1. Set U^{\prime}=ZU, observing that {u^{\prime}}^{(i)}=Zu^{(i)}=u^{(i)} for any i<t, and let {x^{\prime}}^{(1)},\ldots,{x^{\prime}}^{(T)} be the sequence generated from \xi and U^{\prime}. We will prove by induction on i that {\mathbb{P}}(G_{<t}\mid\xi,U)=1 implies {\mathbb{P}}(G_{<i}\mid\xi,U^{\prime})=1 for every i\leq t. The base case of the induction is trivial, as G_{<1} always holds. Suppose now that {\mathbb{P}}(G_{<i}\mid\xi,U^{\prime})=1 for i<t; then {x^{\prime}}^{(1)},\ldots,{x^{\prime}}^{(i)} can be written as functions of \xi and {u^{\prime}}^{(1)},\ldots,{u^{\prime}}^{(i-1)}=u^{(1)},\ldots,u^{(i-1)} by Lemma 10. Consequently, {x^{\prime}}^{(l)}=x^{(l)} for any l\leq i, and also P_{i-1}^{\prime\perp}{x^{\prime}}^{(i)}=P_{i-1}^{\perp}x^{(i)}. Therefore, for any l\geq i,
\left|\left\langle{u^{\prime}}^{(l)},\frac{P_{i-1}^{\prime\perp}{x^{\prime}}^{(i)}}{\left\|P_{i-1}^{\prime\perp}{x^{\prime}}^{(i)}\right\|}\right\rangle\right|\stackrel{{\scriptstyle(i)}}{{=}}\left|\left\langle u^{(l)},Z^{\top}\frac{P_{i-1}^{\perp}{x}^{(i)}}{\left\|P_{i-1}^{\perp}{x}^{(i)}\right\|}\right\rangle\right|\stackrel{{\scriptstyle(ii)}}{{=}}\left|\left\langle u^{(l)},\frac{P_{i-1}^{\perp}{x}^{(i)}}{\left\|P_{i-1}^{\perp}{x}^{(i)}\right\|}\right\rangle\right|\stackrel{{\scriptstyle(iii)}}{{\leq}}\alpha,
where in (i) we substituted {u^{\prime}}^{(l)}=Zu^{(l)} and P_{i-1}^{\prime\perp}{x^{\prime}}^{(i)}=P_{i-1}^{\perp}x^{(i)}, (ii) holds because P_{i-1}^{\perp}x^{(i)}=x^{(i)}-P_{i-1}x^{(i)} is in the span of \left\{x^{(1)},u^{(1)},\ldots,x^{(i-1)},u^{(i-1)},x^{(i)}\right\} and is therefore not modified by Z^{\top}, and (iii) holds by our assumption that G_{<t} holds, and so G_{i} holds. Therefore {\mathbb{P}}\left(G_{i}\mid\xi,U^{\prime}\right)=1 and {\mathbb{P}}\left(G_{<i+1}\mid\xi,U^{\prime}\right)=1, concluding the induction. An analogous argument shows that {\mathbb{P}}\left(G_{<t}\mid\xi,U^{\prime}\right)=1 implies {\mathbb{P}}\left(G_{<t}\mid\xi,U\right)={\mathbb{P}}\left(G_{<t}\mid\xi,Z^{\top}U^{\prime}\right)=1, and thus {\mathbb{P}}\left(G_{<t}\mid\xi,U\right)={\mathbb{P}}\left(G_{<t}\mid\xi,ZU\right) as required.
Marginalizing the density (29) to obtain a density for u^{(j)}, and recalling that P_{t-1}^{\perp} is a measurable function of \xi,U_{(<t)} on the event G_{<t}, we conclude that, conditioned on \xi,U_{(<t)},G_{<t}, the random variable \frac{P_{t-1}^{\perp}u^{(j)}}{\left\|P_{t-1}^{\perp}u^{(j)}\right\|} has the same density as \frac{P_{t-1}^{\perp}Zu^{(j)}}{\left\|P_{t-1}^{\perp}Zu^{(j)}\right\|}. However, P_{t-1}^{\perp}Z=ZP_{t-1}^{\perp} by our assumption on Z, and therefore
\frac{P_{t-1}^{\perp}Zu^{(j)}}{\left\|P_{t-1}^{\perp}Zu^{(j)}\right\|}=Z\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|P_{t-1}^{\perp}u^{(j)}\right\|}.
We conclude that the conditional distribution of the unit vector \frac{P_{t-1}^{\perp}u^{(j)}}{\left\|P_{t-1}^{\perp}u^{(j)}\right\|} is invariant to rotations of the subspace onto which P_{t-1}^{\perp} projects. ∎
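The commutation P_{t-1}^{\perp}Z=ZP_{t-1}^{\perp} used in the final step holds for any orthogonal Z that fixes the subspace onto which P_{t-1} projects. A numerical sketch with a random subspace and rotation (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 10, 3  # ambient dimension and dimension of the fixed subspace

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, W = Q[:, :k], Q[:, k:]        # orthonormal bases: fixed subspace / complement

P_perp = W @ W.T                 # projection onto the complement
R, _ = np.linalg.qr(rng.standard_normal((d - k, d - k)))
Z = V @ V.T + W @ R @ W.T        # orthogonal Z with Zv = v on span(V)

assert np.allclose(Z.T @ Z, np.eye(d))       # Z is orthogonal
assert np.allclose(Z @ V, V)                 # Z fixes the subspace
assert np.allclose(P_perp @ Z, Z @ P_perp)   # the commutation identity
```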
B.5 Proof of Lemma 6
See Lemma 6.
Proof.
Part i holds because \hat{f}_{T;U}\left(0\right)=\bar{f}_{T}\left(0\right) and \hat{f}_{T;U}\left(x\right)\geq\tilde{f}_{T;U}\left(\rho(x)\right) for every x, so
\inf_{x\in\mathbb{R}^{d}}\hat{f}_{T;U}\left(x\right)\geq\inf_{x\in\mathbb{R}^{d}}\tilde{f}_{T;U}\left(\rho(x)\right)=\inf_{\left\|x\right\|\leq R}\bar{f}_{T}\left(x\right)\geq\inf_{x\in\mathbb{R}^{d}}\bar{f}_{T}\left(x\right),
and therefore, by Lemma 3.i, we have \hat{f}_{T;U}(0)-\inf_{x}\hat{f}_{T;U}(x)\leq\bar{f}_{T}(0)-\inf_{x}\bar{f}_{T}(x)\leq 12T.
Establishing part ii requires substantially more work. Since smoothness with respect to Euclidean distances is invariant under orthogonal transformations, we take U to be the first T columns of the d-dimensional identity matrix, denoted U=I_{d,T}. Recall the scaling \rho(x)=Rx/\sqrt{R^{2}+\left\|x\right\|^{2}} with “radius” R=230\sqrt{T} and the definition \hat{f}_{T;U}(x)=\bar{f}_{T}(U^{\top}\rho(x))+\frac{1}{10}\left\|x\right\|^{2}. The quadratic term \frac{1}{10}\left\|x\right\|^{2} in \hat{f}_{T;U} has \frac{1}{5}-Lipschitz first derivative and 0-Lipschitz higher-order derivatives (as they are all constant or zero), and we take U=I_{d,T} without loss of generality, so we consider the function
\hat{f}_{T;I}(x):=\bar{f}_{T}(\rho(x))=\bar{f}_{T}\left(\left[\rho\left(x\right)\right]_{1},\ldots,\left[\rho\left(x\right)\right]_{T}\right).
We now compute the partial derivatives of \hat{f}_{T;I}. Defining y=\rho(x), let \widetilde{\nabla}^{(k)}_{j_{1},\ldots,j_{k}}:=\frac{\partial^{k}}{\partial y_{j_{1}}\cdots\partial y_{j_{k}}} denote derivatives with respect to y. In addition, define \mathcal{P}_{k} to be the set of all partitions of [k]=\{1,\ldots,k\}, i.e., (S_{1},\ldots,S_{L})\in\mathcal{P}_{k} if and only if the S_{i} are disjoint and \cup_{l}S_{l}=[k]. Using the chain rule, we have for any k and set of indices i_{1},\ldots,i_{k}\leq T that
\nabla^{(k)}_{i_{1},\ldots,i_{k}}\hat{f}_{T;I}(x)=\sum_{(S_{1},\ldots,S_{L})\in\mathcal{P}_{k}}\sum_{j_{1},\ldots,j_{L}=1}^{T}\bigg{(}\prod_{l=1}^{L}\nabla^{(S_{l})}_{i_{S_{l}}}\rho_{j_{l}}(x)\bigg{)}\widetilde{\nabla}^{(L)}_{j_{1},\ldots,j_{L}}\bar{f}_{T}(y),~~y=\rho(x),  (30)
where we have used the shorthand \nabla^{{S}}_{i_{S}} to denote the partial derivatives with respect to each of x_{i_{j}} for j\in S. We use the equality (30) to argue that (recall the identity (2))
\left\|\nabla^{(p+1)}\hat{f}_{T;I}(x)\right\|_{\rm op}=\sup_{\left\|v\right\|=1}\langle\nabla^{(p+1)}\hat{f}_{T;I}(x),v^{\otimes(p+1)}\rangle:=\hat{\ell}_{p}-\frac{1}{5}{\mathbb{I}}_{p=1}\leq e^{cp\log p+c},
for some numerical constant 0<c<\infty and every p\geq 1 (to simplify notation, we allow c to change from equation to equation throughout the proof, always representing a finite numerical constant independent of d, T, k, or p). As explained in Section 2.1, this implies \hat{f}_{T;U} has e^{cp\log p+c}-Lipschitz pth order derivatives, giving part ii of the lemma.
To do this, we begin by considering the partitioned sum (30). Let v\in\mathbb{R}^{d} be an arbitrary direction with \left\{v}\right\=1. Then for j\in[d] and k\in\mathbb{N} we define the quantity
\widetilde{v}_{j}^{k}=\widetilde{v}_{j}^{k}(x):=\langle\nabla^{(k)}\rho_{j}(x),v^{\otimes k}\rangle.
Algebraic manipulation and rearrangement of the sum (30) then yield
\displaystyle\langle\nabla^{(k)}\hat{f}_{T;I}(x),v^{\otimes k}\rangle  \displaystyle=\sum_{(S_{1},\ldots,S_{L})\in\mathcal{P}_{k}}\sum_{i_{1},\ldots,i_{k}=1}^{d}v_{i_{1}}v_{i_{2}}\cdots v_{i_{k}}\sum_{j_{1},\ldots,j_{L}=1}^{T}\bigg{(}\prod_{l=1}^{L}\nabla^{(S_{l})}_{i_{S_{l}}}\rho_{j_{l}}(x)\bigg{)}\widetilde{\nabla}^{(L)}_{j_{1},\ldots,j_{L}}\bar{f}_{T}(y)
\displaystyle=\sum_{(S_{1},\ldots,S_{L})\in\mathcal{P}_{k}}\sum_{j_{1},\ldots,j_{L}=1}^{T}\widetilde{v}_{j_{1}}^{|S_{1}|}\cdots\widetilde{v}_{j_{L}}^{|S_{L}|}\widetilde{\nabla}^{(L)}_{j_{1},\ldots,j_{L}}\bar{f}_{T}(y)
\displaystyle=\sum_{(S_{1},\ldots,S_{L})\in\mathcal{P}_{k}}\left\langle\widetilde{\nabla}^{(L)}\bar{f}_{T}(y),\widetilde{v}^{|S_{1}|}\otimes\cdots\otimes\widetilde{v}^{|S_{L}|}\right\rangle.
We claim that there exists a numerical constant c<\infty such that for all k\in\mathbb{N},
\sup_{x}\left\|\widetilde{v}^{k}(x)\right\|\leq\exp(ck\log k+c)R^{1-k}.  (31)
Before proving inequality (31), we show how it implies the desired lemma. By the preceding display, we have
\langle\nabla^{(p+1)}\hat{f}_{T;I}(x),v^{\otimes(p+1)}\rangle\leq\sum_{(S_{1},\ldots,S_{L})\in\mathcal{P}_{p+1}}\left\|\widetilde{\nabla}^{(L)}\bar{f}_{T}(y)\right\|_{\rm op}\prod_{l=1}^{L}\left\|\widetilde{v}^{|S_{l}|}\right\|.
Lemma 3 shows that there exists a numerical constant c<\infty such that
\left\|\nabla^{(L)}\bar{f}_{T}(y)\right\|_{\rm op}\leq\ell_{L-1}\leq\exp(cL\log L+c)~\mbox{for all }L\geq 2.
When the partition consists of a single set (L=1), we have |S_{1}|=p+1\geq 2, and so Lemma 3.ii yields
\left\|{\nabla}\bar{f}_{T}(y)\right\|_{\rm op}\left\|\widetilde{v}^{|S_{1}|}\right\|=\left\|{\nabla}\bar{f}_{T}(y)\right\|\left\|\widetilde{v}^{|S_{1}|}\right\|\leq 23\sqrt{T}\cdot R^{-p}\exp(cp\log p+c)\leq\exp(cp\log p+c),
where we have used R=230\sqrt{T}. Using |S_{1}|+\cdots+|S_{L}|=p+1 and the fact that q(x)=(x+1)\log(x+1) satisfies q(x)+q(y)\leq q(x+y) for every x,y>0, we have
\left\|\widetilde{\nabla}^{(L)}\bar{f}_{T}(y)\right\|_{\rm op}\prod_{l=1}^{L}\left\|\widetilde{v}^{|S_{l}|}\right\|\leq\exp(cp\log p+c)
for some c<\infty and every (S_{1},\ldots,S_{L})\in\mathcal{P}_{p+1}. Bounds on Bell numbers [6, Thm. 2.1] show that there are at most \exp(k\log k) partitions in \mathcal{P}_{k}, which combined with the bound above gives the desired result.
Let us return to the derivation of inequality (31). We begin by recalling Faà di Bruno’s formula for the chain rule. Let f,g:\mathbb{R}\to\mathbb{R} be appropriately smooth functions. Then
\frac{d^{k}}{dt^{k}}f(g(t))=\sum_{P\in\mathcal{P}_{k}}f^{(|P|)}(g(t))\cdot\prod_{S\in P}g^{(|S|)}(t),  (32)
where |P| denotes the number of disjoint sets in the partition P\in\mathcal{P}_{k}. Define the function \overline{\rho}(\xi)=\xi/\sqrt{1+\left\|\xi\right\|^{2}}, and let \lambda(\xi)=\sqrt{1+\left\|\xi\right\|^{2}}, so that \overline{\rho}(\xi)=\nabla\lambda(\xi) and \rho(\xi)=R\overline{\rho}(\xi/R). Let \overline{v}_{j}^{k}(\xi)=\langle\nabla^{(k)}\overline{\rho}_{j}(\xi),v^{\otimes k}\rangle, so that
\overline{v}^{k}(\xi)=\nabla\langle\nabla^{(k)}\lambda(\xi),v^{\otimes k}\rangle~~\mbox{and}~~\widetilde{v}^{k}=R^{1-k}\overline{v}^{k}(x/R).  (33)
With this in mind, we consider the quantity \langle\nabla^{(k)}\lambda(\xi),v^{\otimes k}\rangle. Defining temporarily the functions \alpha(r)=\sqrt{1+2r} and \beta(t)=\frac{1}{2}\left\|\xi+tv\right\|^{2}, and their composition h(t)=\alpha(\beta(t)), we evidently have
h^{(k)}(0)=\langle\nabla^{(k)}\lambda(\xi),v^{\otimes k}\rangle=\sum_{P\in\mathcal{P}_{k}}\alpha^{(|P|)}(\beta(0))\cdot\prod_{S\in P}\beta^{(|S|)}(0),
where the second equality used Faà di Bruno’s formula (32). Now, we note the following immediate facts:
\alpha^{(l)}(r)=(-1)^{l}\frac{(2l-1)!!}{(1+2r)^{l-1/2}}~~\mbox{and}~~\beta^{(l)}(t)=\begin{cases}\langle v,\xi\rangle+t\left\|v\right\|^{2}&l=1\\ \left\|v\right\|^{2}&l=2\\ 0&l>2.\end{cases}
Thus, if we let \mathcal{P}_{k,2} denote the partitions of [k] consisting only of subsets with one or two elements, we have
h^{(k)}(0)=\sum_{P\in\mathcal{P}_{k,2}}(-1)^{|P|}\frac{(2|P|-1)!!}{(1+\left\|\xi\right\|^{2})^{|P|-1/2}}\langle\xi,v\rangle^{\mathsf{C}_{1}(P)}\left\|v\right\|^{2\mathsf{C}_{2}(P)},
where \mathsf{C}_{i}(P) denotes the number of sets in P with precisely i elements. Noting that \left\|v\right\|=1, we may rewrite this as
\langle\nabla^{(k)}\lambda(\xi),v^{\otimes k}\rangle=\sum_{l=1}^{k}\sum_{P\in\mathcal{P}_{k,2},\mathsf{C}_{1}(P)=l}(-1)^{|P|}\frac{(2|P|-1)!!}{(1+\left\|\xi\right\|^{2})^{|P|-1/2}}\langle\xi,v\rangle^{l}.
Taking derivatives we obtain
\overline{v}^{k}=\nabla\langle\nabla^{(k)}\lambda(\xi),v^{\otimes k}\rangle=\bigg{(}\sum_{l=1}^{k}a_{l}(\xi)\langle\xi,v\rangle^{l-1}\bigg{)}v+\bigg{(}\sum_{l=1}^{k}b_{l}(\xi)\langle\xi,v\rangle^{l}\bigg{)}\xi,
where
a_{l}(\xi)=l\cdot\sum_{P\in\mathcal{P}_{k,2},\mathsf{C}_{1}(P)=l}\frac{(-1)^{|P|}(2|P|-1)!!}{(1+\left\|\xi\right\|^{2})^{|P|-1/2}}~~\mbox{and}~~b_{l}(\xi)=\sum_{P\in\mathcal{P}_{k,2},\mathsf{C}_{1}(P)=l}\frac{(-1)^{|P|+1}(2|P|+1)!!}{(1+\left\|\xi\right\|^{2})^{|P|+1/2}}.
We would like to bound \left|a_{l}(\xi)\langle\xi,v\rangle^{l-1}\right| and \left|b_{l}(\xi)\langle\xi,v\rangle^{l}\right|\left\|\xi\right\|. Note that |P|\geq\mathsf{C}_{1}(P) for every P\in\mathcal{P}_{k}, so |P|\geq l in the sums above. Moreover, bounds for Bell numbers [6, Thm. 2.1] show that there are at most \exp(k\log k) partitions of [k], and (2k-1)!!\leq\exp(k\log k) as well. As a consequence, we obtain
\sup_{\xi}\left|a_{l}(\xi)\langle\xi,v\rangle^{l-1}\right|\leq\exp(cl\log l)\sup_{\xi}\frac{\left|\langle\xi,v\rangle\right|^{l-1}}{(1+\left\|\xi\right\|^{2})^{(l-1)/2}}<\exp(cl\log l),
where we have used \left|\langle\xi,v\rangle\right|\leq\left\|\xi\right\| due to \left\|v\right\|=1. We similarly bound \sup_{\xi}\left|b_{l}(\xi)\langle\xi,v\rangle^{l}\right|\left\|\xi\right\|. Returning to expression (33), we have
\sup_{x}\left\|\widetilde{v}^{k}(x)\right\|\leq\exp\left(ck\log k+c\right)R^{1-k},
for a numerical constant c<\infty. This is the desired bound (31), completing the proof. ∎
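As an aside, Faà di Bruno’s formula (32), used twice in the proof above, can be checked symbolically for small k. A sketch for k=3 with arbitrary smooth choices of f and g (the five partitions of \{1,2,3\} are listed by their block sizes):

```python
import sympy as sp

t, x = sp.symbols('t x')
g = sp.sqrt(1 + t**2)       # inner function, an arbitrary smooth choice
f = sp.exp(x) * sp.cos(x)   # outer function, an arbitrary smooth choice

direct = sp.diff(f.subs(x, g), t, 3)  # d^3/dt^3 of the composition

# Partitions of {1,2,3}: {123}; {1}{23}, {2}{13}, {3}{12}; {1}{2}{3}.
partitions = [[3], [1, 2], [1, 2], [1, 2], [1, 1, 1]]
faa = 0
for P in partitions:
    term = sp.diff(f, x, len(P)).subs(x, g)  # f^(|P|) evaluated at g(t)
    for s in P:
        term = term * sp.diff(g, t, s)       # product of g^(|S|) over blocks
    faa = faa + term

# The two expressions agree identically; verify at an arbitrary point.
assert float(sp.Abs((direct - faa).subs(t, sp.Float(0.7)))) < 1e-9
```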
Appendix C Proof of Theorem 3
See 3
We divide the proof of the theorem into two parts, as in our previous results, first providing a few building blocks, then giving the theorem. The basic idea is to introduce a negative “bump” that is challenging to find, but which is close to the origin.
To make this precise, let e^{(j)} denote the jth standard basis vector. Then we define the bump function \bar{h}_{T}:\mathbb{R}^{T}\to\mathbb{R} by
\displaystyle\bar{h}_{T}(x)  \displaystyle=\Psi\left(1-\frac{25}{2}\left\|x-\frac{4}{5}e^{(T)}\right\|^{2}\right)=\begin{cases}0&\left\|x-\frac{4}{5}e^{(T)}\right\|\geq\frac{1}{5}\\ \exp\left(1-\frac{1}{\left(1-25\left\|x-\frac{4}{5}e^{(T)}\right\|^{2}\right)^{2}}\right)&\mbox{otherwise.}\end{cases}  (34)
As Figure 2 shows, \bar{h}_{T} features a unit-height peak centered at \frac{4}{5}e^{(T)}, and it is identically zero whenever the distance from that peak exceeds \frac{1}{5}. The volume of the peak vanishes exponentially with T, making it hard to find by querying \bar{h}_{T} locally. We list the properties of \bar{h}_{T} necessary for our analysis.
Lemma 12.
The function \bar{h}_{T} satisfies the following.

\bar{h}_{T}\left(0.8e^{(T)}\right)=1 and \bar{h}_{T}(x)\in[0,1] for all x\in\mathbb{R}^{T}.

\bar{h}_{T}(x)=0 on the set \{x\in\mathbb{R}^{T}\mid x_{T}\leq\frac{3}{5}\textnormal{ or }\left\|x\right\|\geq 1\}.

For p\geq 1, the pth order derivatives of \bar{h}_{T} are \tilde{\ell}_{p}-Lipschitz continuous, where \tilde{\ell}_{p}<e^{3p\log p+cp} for some numerical constant c<\infty.
We prove the lemma in Section C.2; the proof is similar to that of Lemma 6. With these properties in hand, we can prove Theorem 3.
C.1 Proof of Theorem 3
For some T\in\mathbb{N} and \sigma>0 to be specified, and d=\left\lceil 52\cdot 230^{2}\cdot T^{2}\log(4T^{2})\right\rceil, consider the function f_{U}:\mathbb{R}^{d}\to\mathbb{R} indexed by orthogonal matrix U\in\mathbb{R}^{d\times T} and defined as
f_{U}(x)=\frac{L_{p}\sigma^{p+1}}{\ell_{p}^{\prime}}\hat{f}_{T;U}(x/\sigma)-\frac{L_{p}D^{p+1}}{\ell_{p}^{\prime}}\bar{h}_{T}(U^{\top}x/D),
where \hat{f}_{T;U}(x)=\tilde{f}_{T;U}(\rho(x))+\frac{1}{10}\left\|x\right\|^{2} is the randomized hard instance construction (14) with \rho(x)=x/\sqrt{1+\left\|x/R\right\|^{2}}, \bar{h}_{T} is the bump function (34), and \ell_{p}^{\prime}=\hat{\ell}_{p}+\tilde{\ell}_{p}, for \hat{\ell}_{p} and \tilde{\ell}_{p} as in Lemmas 6.ii and 12.iii, respectively. By the lemmas, f_{U} has L_{p}-Lipschitz pth order derivatives and \ell_{p}^{\prime}\leq e^{c_{1}p\log p+c_{1}} for some c_{1}<\infty. We assume that \sigma\leq D; our subsequent choice of \sigma will obey this constraint.
Following our general proof strategy, we first demonstrate that f_{U}\in\mathcal{F}^{\rm dist}_{p}(D,L_{p}), for which all we need do is guarantee that the global minimizers of f_{U} have norm at most D. By the constructions (14) and (10) of \hat{f}_{T;U} and \tilde{f}_{T;U}, Lemma 12.i implies
\displaystyle f_{U}\left((0.8D)u^{(T)}\right)  \displaystyle=\frac{L_{p}\sigma^{p+1}}{\ell_{p}^{\prime}}\bar{f}_{T}\left(\rho\left(\frac{4D}{5\sigma}e^{(T)}\right)\right)+\frac{L_{p}\sigma^{p+1}}{10\ell_{p}^{\prime}}\left\|\frac{4Du^{(T)}}{5\sigma}\right\|^{2}-\frac{L_{p}D^{p+1}}{\ell_{p}^{\prime}}\bar{h}_{T}(0.8e^{(T)})
\displaystyle=\frac{L_{p}\sigma^{p+1}}{\ell_{p}^{\prime}}\bar{f}_{T}(0)+\frac{8L_{p}\sigma^{p-1}D^{2}}{125\ell_{p}^{\prime}}-\frac{L_{p}D^{p+1}}{\ell_{p}^{\prime}}<-\frac{117}{125}\frac{L_{p}D^{p+1}}{\ell_{p}^{\prime}}+\frac{L_{p}\sigma^{p+1}}{\ell_{p}^{\prime}}\bar{f}_{T}(0),
where the second equality uses \bar{f}_{T}(\gamma e^{(T)})=\bar{f}_{T}(0) for any \gamma (a consequence of the zero-chain structure of \bar{f}_{T}) and the final inequality uses our assumption \sigma\leq D. On the other hand, for any x such that \bar{h}_{T}(U^{\top}x/D)=0, we have by Lemma 6.i (along with \hat{f}_{T;U}(0)=\bar{f}_{T}(0)) that
f_{U}(x)\geq\frac{L_{p}\sigma^{p+1}}{\ell_{p}^{\prime}}\inf_{x}\hat{f}_{T;U}(x)\geq-12\frac{L_{p}\sigma^{p+1}}{\ell_{p}^{\prime}}T+\frac{L_{p}\sigma^{p+1}}{\ell_{p}^{\prime}}\bar{f}_{T}(0).
Combining the two displays above, we conclude that if
12\frac{L_{p}\sigma^{p+1}}{\ell_{p}^{\prime}}T\leq\frac{117}{125}\frac{L_{p}D^{p+1}}{\ell_{p}^{\prime}},
then all global minima x^{\star} of f_{U} must satisfy \bar{h}_{T}(U^{\top}x^{\star}/D)>0. Inspecting the definition (34) of \bar{h}_{T}, this implies \left\|x^{\star}/D-0.8u^{(T)}\right\|<\frac{1}{5}, and therefore \left\|x^{\star}\right\|\leq D. Thus, by setting
T=\left\lfloor\frac{D^{p+1}}{13\sigma^{p+1}}\right\rfloor,  (35) 
we guarantee that f_{U}\in\mathcal{F}^{\rm dist}_{p}(D,L_{p}) as long as \sigma\leq D.
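To see that the choice (35) satisfies the sufficient condition above: T\leq D^{p+1}/(13\sigma^{p+1}) gives 12T\sigma^{p+1}\leq\frac{12}{13}D^{p+1}<\frac{117}{125}D^{p+1}. A quick numerical check over an arbitrary illustrative grid of (D,\sigma,p):

```python
import math

ok = True
for D in (1.0, 2.0, 5.0):          # illustrative values, with sigma <= D
    for sigma in (0.1, 0.5, 1.0):
        for p in (1, 2, 3):
            T = math.floor(D ** (p + 1) / (13 * sigma ** (p + 1)))
            # the condition ensuring global minima land inside the bump region
            ok = ok and (12 * T * sigma ** (p + 1) <= (117 / 125) * D ** (p + 1))
assert ok
```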
It remains to show that, for an appropriately chosen \sigma, any randomized algorithm requires (with high probability) more than T iterations to find x such that \left\|{\nabla}f_{U}(x)\right\|<\epsilon. We claim that when \sigma\leq D, for any x\in\mathbb{R}^{d},
\langle u^{(T)},\rho(x/\sigma)\rangle<\frac{1}{2}~~\mbox{implies}~~\bar{h}_{T}(U^{\top}y/D)=0~~\mbox{for }y\mbox{ in a neighborhood of }x.  (36)
We defer the proof of claim (36) to the end of this section.
Now, let U\in\mathbb{R}^{d\times T} be an orthogonal matrix chosen uniformly at random from \mathsf{O}(d,T). Let x^{(1)},\ldots,x^{(t)} be a sequence of iterates generated by an algorithm \mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}} applied to f_{U}. We argue that \langle u^{(T)},\rho(x^{(t)}/\sigma)\rangle<1/2 for all t\leq T, with high probability. To do so, we briefly revisit the proof of Lemma 4 (Sec. B.3), replacing \tilde{f}_{T;U} with f_{U} and x^{(t)} with \rho(x^{(t)}/\sigma). By Lemma 9 we have that for every t\leq T the event G_{\leq t} implies \langle u^{(T)},\rho(x^{(s)}/\sigma)\rangle<1/2 for all s\leq t, and therefore by the claim (36) we have that Lemma 10 holds (as we may replace the terms \bar{h}_{T}(U^{\top}x^{(s)}/D), s<t, with 0 whenever G_{<t} holds). The rest of the proof of Lemma 4 proceeds unchanged and gives us that, with probability greater than 1/2 (over any randomness in \mathsf{A} and the uniform choice of U),
\langle u^{(T)},\rho(x^{(t)}/\sigma)\rangle<\frac{1}{2}~~\mbox{for all}~~t\leq T.
By claim (36), this implies {\nabla}\bar{h}_{T}(U^{\top}x^{(t)}/D)=0, and by Lemma 5, \left\|{\nabla}\hat{f}_{T;U}(x^{(t)}/\sigma)\right\|>1/2. Thus, after scaling,
\left\|{\nabla}f_{U}(x^{(t)})\right\|>\frac{L_{p}\sigma^{p}}{2\ell_{p}^{\prime}}
for all t\leq T, with probability greater than 1/2. As in the proof of Theorem 2, by taking \sigma=(2\ell_{p}^{\prime}\epsilon/L_{p})^{1/p} we guarantee
\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}}\sup_{U\in\mathsf{O}(d,T)}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\geq 1+T,
where T=\left\lfloor D^{p+1}/(13\sigma^{p+1})\right\rfloor is defined in Eq. (35). Thus, as f_{U}\in\mathcal{F}^{\rm dist}_{p}(D,L_{p}) for our choice of T, we immediately obtain
\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{rand}}},\mathcal{F}^{\rm dist}_{p}(D,L_{p})\big{)}\geq T+1\geq\frac{D^{1+p}}{52}\left(\frac{L_{p}}{\ell_{p}^{\prime}}\right)^{\frac{1+p}{p}}\epsilon^{-\frac{1+p}{p}},
as long as our initial assumption \sigma\leq D holds. When \sigma>D, we have \frac{2\ell_{p}^{\prime}}{L_{p}}\epsilon>D^{p}, or 1>D^{p+1}\left(\frac{L_{p}}{2\ell_{p}^{\prime}}\right)^{\frac{1+p}{p}}\epsilon^{-\frac{1+p}{p}}, so the bound is vacuous in this case regardless: every method must take at least 1 step.
Finally, we return to demonstrate claim (36). Note that \langle u^{(T)},\rho(x/\sigma)\rangle<1/2 is equivalent to \langle u^{(T)},x\rangle<\frac{\sigma}{2}\sqrt{1+\left\|\frac{x}{\sigma R}\right\|^{2}}, and consider separately the cases \left\|x/\sigma\right\|\leq R/2 and \left\|x/\sigma\right\|>R/2=115\sqrt{T}. In the first case, we have \langle u^{(T)},x\rangle<(\sqrt{5}/4)\sigma<(3/5)D, by our assumption \sigma\leq D. Therefore, by Lemma 12.ii we have \bar{h}_{T}(U^{\top}y/D)=0 for y near x. In the second case, we have \left\|x\right\|>(4R/\sqrt{5})\langle u^{(T)},x\rangle>230\langle u^{(T)},x\rangle. If in addition \langle u^{(T)},x\rangle<(3/5)D then our conclusion follows as before. Otherwise, \left\|x\right\|/D>230\cdot(3/5)>1, so again the conclusion follows by Lemma 12.ii.
C.2 Proof of Lemma 12
Properties i and ii are evident from the definition (34) of \bar{h}_{T}. To show property iii, consider {h}(x)=\bar{h}_{T}\left(\frac{x}{5}+0.8e^{(T)}\right)=\Psi\left(1-\frac{1}{2}\left\|x\right\|^{2}\right), which is a translation and scaling of \bar{h}_{T}, so if we show that {h} has (\tilde{\ell}_{p}/5^{p+1})-Lipschitz pth order derivatives, for every p\geq 1, we obtain the required result. For any x,v\in\mathbb{R}^{T} with \left\|v\right\|\leq 1 we define the directional projection {h}_{x,v}(t)={h}(x+t\cdot v). The required smoothness bound is equivalent to
\left|{h}_{x,v}^{(p+1)}(0)\right|\leq\tilde{\ell}_{p}/5^{p+1}\leq e^{cp\log p+c}
for every x,v\in\mathbb{R}^{T} with \left\|v\right\|\leq 1, every p\geq 1, and some numerical constant c<\infty (which we allow to change from equation to equation, caring only that it is finite and independent of T and p).
As in the proof of Lemma 6, we write {h}_{x,v}(t)=\Psi(\beta(t)) where \beta(t)=1-\frac{1}{2}\left\|x+tv\right\|^{2}, and use Faà di Bruno’s formula (32) to write, for any k\geq 1,
{h}_{x,v}^{(k)}(0)=\sum_{P\in\mathcal{P}_{k}}\Psi^{(|P|)}(\beta(0))\cdot\prod_{S\in P}\beta^{(|S|)}(0),
where \mathcal{P}_{k} is the set of partitions of [k] and |P| denotes the number of sets in the partition P. Noting that \beta^{\prime}(0)=-\langle x,v\rangle, \beta^{\prime\prime}(0)=-\left\|v\right\|^{2} and \beta^{(n)}(0)=0 for any n>2, we have
{h}_{x,v}^{(k)}(0)=\sum_{P\in\mathcal{P}_{k,2}}(-1)^{|P|}\Psi^{(|P|)}\left(1-\frac{1}{2}\left\|x\right\|^{2}\right)\langle x,v\rangle^{\mathsf{C}_{1}(P)}\left\|v\right\|^{2\mathsf{C}_{2}(P)},
where \mathcal{P}_{k,2} denotes the partitions of [k] consisting only of subsets with one or two elements and \mathsf{C}_{i}(P) denotes the number of sets in P with precisely i elements.
Noting that \Psi^{(k)}\left(1-\frac{1}{2}\left\|x\right\|^{2}\right)=0 for any k\geq 0 whenever \left\|x\right\|>1, we may assume \left\|x\right\|\leq 1. Since \left\|v\right\|\leq 1, we may bound \left|{h}_{x,v}^{(p+1)}(0)\right| by
\left|{h}_{x,v}^{(p+1)}(0)\right|\leq\left|\mathcal{P}_{p+1,2}\right|\cdot\max_{k\in[p+1]}\sup_{x\in\mathbb{R}}\left|\Psi^{(k)}(x)\right|\stackrel{{\scriptstyle(i)}}{{\leq}}e^{\frac{p+1}{2}\log(p+1)}\cdot e^{\frac{5(p+1)}{2}\log(\frac{5}{2}(p+1))}\leq e^{3p\log p+cp}
for some absolute constant c<\infty, where inequality (i) follows from Lemma 1.iv and the fact that the number of matchings in the complete graph (the kth telephone number [20, Lem. 2]) satisfies \left|\mathcal{P}_{k,2}\right|\leq e^{\frac{k}{2}\log k}. This gives the result. ∎
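Properties i and ii of Lemma 12, and the matching-count bound \left|\mathcal{P}_{k,2}\right|\leq e^{\frac{k}{2}\log k} used in the final display, can both be checked numerically. The sketch below implements the closed form (34) directly and computes the telephone numbers via the textbook recurrence a(k)=a(k-1)+(k-1)a(k-2) (purely an illustrative sanity check, not part of the proof):

```python
import math
import numpy as np

def h_bar(x):
    """The bump function (34): peak of height 1 at 0.8*e^(T), support radius 1/5."""
    c = np.zeros_like(x)
    c[-1] = 0.8
    r2 = float(np.sum((x - c) ** 2))
    if r2 >= 1 / 25:
        return 0.0
    return math.exp(1 - 1 / (1 - 25 * r2) ** 2)

T = 5
peak = np.zeros(T); peak[-1] = 0.8
assert h_bar(peak) == 1.0            # property i: unit height at the peak
low = np.zeros(T); low[-1] = 0.6
assert h_bar(low) == 0.0             # property ii: zero when x_T <= 3/5
far = np.ones(T) / math.sqrt(T)      # ||far|| = 1
assert h_bar(far) == 0.0             # property ii: zero when ||x|| >= 1

# Telephone numbers a(k) = |P_{k,2}|: matchings of the complete graph K_k.
a = [1, 1]
for k in range(2, 12):
    a.append(a[-1] + (k - 1) * a[-2])
assert all(a[k] <= k ** (k / 2) for k in range(1, 12))  # |P_{k,2}| <= e^{(k/2) log k}
```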