Provably Correct Automatic Subdifferentiationfor Qualified Programs

# Provably Correct Automatic Subdifferentiation for Qualified Programs

Sham Kakade 111University of Washington, Seattle, e-mail:  sham@cs.washington.edu    Jason D. Lee 222University of Southern California, e-mail:  jasonlee@marshall.usc.edu
###### Abstract

The Cheap Gradient Principle (Griewank and Walther, 2008) — the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of ) as that of simply computing the function itself — is of central importance in optimization; it allows us to quickly obtain (high dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures. The current state of affairs is markedly different with regards to computing subderivatives: widely used ML libraries, including TensorFlow and PyTorch, do not correctly compute (generalized) subderivatives even on simple examples. This work considers the question: is there a Cheap Subgradient Principle? Our main result shows that, under certain restrictions on our library of nonsmooth functions (standard in nonlinear programming), provably correct generalized subderivatives can be computed at a computational cost that is within a (dimension-free) factor of of the cost of computing the scalar function itself.

## 1 Introduction

The widespread implementation of Automatic Differentiation (AD) methods (Baydin et al., 2015) has had a transformative effect on applied machine learning; these methods have eased the difficulty for practitioners, across a range of disciplines, to learn sophisticated machine learning models (including deep neural architectures and richer inferential models). The paradigm is: one simply writes a program to compute the function of interest, say a scalar (loss) function , and then a correctly implemented AD method will return both and all of its partial derivatives when provided with as an input. These partial derivatives are often used in conjunction with some (stochastic) gradient-based optimization approach.

Underlying the effectiveness of this general black-box approach is the Cheap Gradient Principle (Griewank and Walther, 2008): the computational cost of computing the vector of partial derivatives is often nearly the same as that of simply computing the scalar function itself. In fact, for all rational functions, the striking Baur-Strassen theorem (Baur and Strassen, 1983; Griewank, 1989) shows that this increase in computational complexity is a (dimension free) factor of .

In many settings, our underlying function is a nonsmooth function, and we resort to subgradient methods. This work considers the question: is there a Cheap Subgradient Principle? Specifically, given a program that computes a (locally Lipschitz) function and given a point , can we automatically compute an element of the (Clarke) subdifferential (Clarke, 1975), and can we do this at a cost which is comparable to computing the function itself? Informally, the set is the convex hull of limits of gradients at nearby differentiable points. It can be thought of as generalizing the gradient (for smooth functions) and the subgradient (for convex functions).

Let us briefly consider how current approaches handle nonsmooth functions, which are available to the user as functions in some library. Consider the following three equivalent ways to write the identity function, where ,

 f1(x)=x,f2(x)=ReLU(x)−ReLU(−x),f3(x)=10f1(x)−9f2(x),

where , and so . As these functions are differentiable at , the unique derivative is . However, both TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2017), claim that , , . This particular answer is due to using a subgradient of at . One may ask if a more judicious choice fixes such issues; unfortunately, it is not difficult to see that no such universal choice exists333By defining , the reader may note we obtain the correct derivative on ; however, consider , which also equals . Here, we would need to obtain the correct answer..

This example should be concerning for a number of reasons. The use of nonsmooth functions in AD go well beyond simple one dimensional nonsmooth functions (such as or the ); current methods permit utilizing eigenvalues, SVDs, QR decompositions (there are AD procedures on these nonsmooth linear algebra functions (Maclaurin et al., 2015; Seeger et al., 2017)).

##### Is correctness important?

One option is to disregard these issues — which is the current state of affairs — based on the observation that in most cases these issues are unlikely to harm our optimization method. In numerical linear algebra, one could make the same argument: we never truly encounter degenerate linear systems (or degenerate eigenspaces); nonetheless, in retrospect, numerical issues have made evident the importance of carefully addressing these “corner cases”. The situation may be analogous here: numerical issues in these approaches can easily lead to unstable outputs. Note that some numerical instability is certainly to be expected due to nonsmoothness (a point we return to in the Discussion under the notion of mixed stability); yet we would still hope to have nontrivial stability guarantees in our widely used AD libraries, much in the manner we have for our established numerical linear algebra libraries (Trefethen and Bau III, 1997; Demmel, 1997).

Ultimately, the importance of correctness in these methods is a decision that must be made by the broader ML community. Here, it is worthwhile to consider that AD software has a range of applications: from physical simulators to health care/social science applications to deployed online learning systems to differentiable programming. For example, when using physical simulators (say in robotics or in the sciences), a strong notion of stability may be critical when doing AD through nonsmooth system dynamics. In safety-critical settings, we may seek to have deployed online learning methods which are not susceptible to errors due to misspecified input-output behavior in our programs. Perhaps the most compelling reason for provably correct software implementations is to avoid costly failure modes due to the utilization of the methods in novel and unforeseen manners.

##### Related Work:

These issues are in fact known in the mathematical AD literature (see Griewank and Walther (2008, Chapter 14)). Once we include either nonsmooth primitive functions or permit branching in a program, the usual chain rule fails to hold and incorrect input-out behavior is easy to observe. Due to that established calculus properties of nonsmooth functions (Klatte and Kummer, 2002; Mordukhovich, 2006) do not seem amenable to AD approaches, the current provable methods do not have general purpose, computationally efficient AD methods for subdifferentials.

One influential and powerful idea is that of lexicographic differentiation (Nesterov, 2005); it is a property of a subclass of nonsmooth functions which allow these function to inherit a generalized notion of a chain rule. This idea has been utilized for obtaining correct generalized derivatives in  Khan and Barton (2013); Griewank (2013). The difficulty is that lexicographic approach often is expensive in that it involves a dimensional factor in the computational cost increase.

The other relatively few works that do focus on automatic generalized differentiation go through some notion of algorithmic linearization (A.Griewank, 1995; Nesterov, 2005; Khan and Barton, 2013, 2015; Fiege et al., 2017), where often piecewise smooth functions are considered, and the approach attempts at correct AD through probing the pieces through some linearization (see Griewank (2014) for review). The difficulties are due to understanding what information we can extract through linear “probes” into the function.

One of the first ideas along this line of thought is due to (A.Griewank, 1995), which shows how to compute directional derivatives of nonsmooth functions through following a “branch” the program would take on an input (where the branch corresponds to the approach direction in the directional derivative). In fact, our work uses this basic idea, as does the “branch locking” approach in Khan (2017); Griewank (2013). The difficulty in these approaches is in finding a means to relate this linearization to properties of the (nonsmooth) functions, which will allow the algorithm to succeed; naively, we can tell when a method might have failed though it is difficult to guarantee if it will succeed.

As such, the extant body of work does not contain methods which contain only a constant factor blow up in the computational cost. Notable differences in this work is that our assumptions make strong connections to nonlinear programming (Abadie, 1967; Peterson, 1973; Gould and Tolle, 1971), which help in characterizing when the linearization approach is informative, and we provide a key technical result showing a certain chain rule holds for randomized algorithms. Furthermore, our focus is on generalizing the reverse mode for scalar functions (as opposed to focusing on multivariate functions where there is no known Cheap Gradient Principle).

##### Our contributions:

Our main result provides — under a natural set of assumptions widely used in nonlinear programming — a provably correct Automatic Subdifferentiation procedure, which given some , computes both the functional value and a dimensional subdifferential , with a computational cost that is a factor of at most times that of computing the scalar function itself. Our assumption is that our library of functions be implemented in a manner consistent with the standard constraint qualification assumptions in nonlinear programming (Abadie, 1967). In short, this work shows that in fact there is a Cheap Subgradient Principle.

## 2 Preliminaries

Assume is a locally Lipschitz function, and recall, that by Rademacher’s theorem, this implies that is differentiable almost everywhere. The Clarke subdifferential of at any point is the set (Clarke et al., 2008, Theorem 8.1)

 ∂f(x):=conv{limi→∞∇f(xi):xiΩ→x}, (1)

where is any full-measure subset of such that is differentiable at each of its points. Here, the limit is taken to be the set of all limit points. In classical circumstances, the subdifferential reduces to more familiar objects. Namely, when is -smooth at , the subdifferential consists only of the gradient , while for convex functions, it reduces to the subdifferential in the sense of convex analysis.

### 2.1 AD Review and The Baur-Strassen Theorem

A straight line program for computing is specified by a program of the form shown in Algorithm 1. Here the functions are assumed to be some function from a library of functions. In the algebraic circuit complexity model, these functions are either monomials or affine functions of its inputs.

More generally, we will be interested in utilizing a richer class of functions where , a library of functions, e.g. we may desire functions like the , , or ever richer nonsmooth functions like eigenvalues.

Define to be the time it takes to compute under a given program for .

###### Theorem 2.1.

(Baur and Strassen, 1983; Griewank, 1989) Assume all multiplications and additions have unit runtime cost. If we restrict to the algebraic circuit complexity model (where the functions are either monomials or affine functions), then it is possible to compute both and all its partial derivatives in time that is at most .

An algorithm achieving this guarantee is to first compute and then use the reverse mode of AD, in Algorithm 2. To see the specific counting argument, see (Morgenstern, 1985). This theorem is often more general: the reverse mode also correctly returns the derivatives even with a richer family of smooth functions in our library , often with a constant factor cost increase as well (Griewank, 1989). The reverse mode itself has been rediscovered many times (Griewank, 2012); the well known back-propagation algorithm (Rumelhart et al., 1986) is one example of the reverse mode of AD. The reverse mode (and the back-propagation algorithm) is not a direct application of the chain rule; the direct application of the chain rule is referred to as the forward mode of AD (see  Griewank and Walther (2008)), which is times more expensive procedure to compute the gradient. The reverse mode can be viewed as a form of dynamic programming. To compare the two, in the reverse mode of AD, we compute the derivatives , referred to as the adjoints444For a variable , the notation refers to the derivative with respect to , but holding all parent variables of as fixed. If is an input variable, then this is the usual partial derivative., while in the forward mode of AD we would compute (-dimensional) derivatives of the form (referred to as dual numbers).

### 2.2 Nonsmooth functions and our computational model

To specify how our nonsmooth functions are implemented, we extend the computational model to allow for branching, using (a restricted version555We avoid halting concerns by assuming our programs halt in a bounded amount of time. We also explicitly avoid discussing tapes and registers in our computational cost model. of) the Blum-Shub-Smale model of computation (Blum et al., 1988).

###### Definition 2.1 (Computation Model).

The computational model for computing any in our library ( may be different for each function) is specified by a program of the form shown in Algorithm 3. We assume that the function is either a monomial or an affine function of its inputs. Furthermore, for every , we assume that there exists a time , where the program terminates in at most this amount of time.

Throughout, we make the following assumption:

###### Assumption 2.1.

(Computational Cost) Assume all multiplications and additions have unit runtime cost and that an execution of an “If” statement is also unit cost. For example, the cost of computing a monomial is the number of multiplications.

The program implicitly encodes a function that has the following representation:

 f(x)=∑z∈{−1,1}TISz(x)pz(x), (2)

where each is a polynomial; is the indicator function on the set ; and consists of all where the program executes branch when given as input. The set can be explicitly defined as follows: for steps where the programs branches on , define ; on non-branching , define ; define the vector valued function ;

 Sz:={x|I(sign(hz(x))=z)} (3)

where the is the usual sign function (applied componentwise) taking values in (where we take ). Note that is specified by a set of polynomial inequalities as defined by the functions .

## 3 Provable Automatic Subdifferentiation

In the algebraic circuit complexity model, where AD is provably correct, branching is not permitted. The inclusion of branching into our program leads to a number of subtle issues. Branching allows us to implement the same nonsmooth function in different manners, which have important consequences in linearization approaches. Consider two different programs (with the same input-output behavior) for the function in Figure 1. The left program returns on the constraint set that is encoded as , while the right program returns on the constraint set that is encoded as . In nonlinear programming, the importance of avoiding encoding constraints in the latter manner is well-known (Abadie, 1967; Peterson, 1973; Gould and Tolle, 1971).

This example motivates our restriction to only consider library functions that are encoded like the former set. We will make the standard constraint qualification assumption666The standard constraint qualification assumption on a constraint set is that the tangent cone of the constraint set equals the linearized cone (of the functions which define the constraints).. Roughly speaking, the assumption states that first order information characterizes the set of feasible perturbations. We state this assumption in a manner more directly applicable to our setting (see (Abadie, 1967; Peterson, 1973; Gould and Tolle, 1971)).

###### Assumption 3.1.

(Constraint Qualification on our Library) Assume for all that is locally Lipschitz and our program for (in our computational model) satisfies the constraint qualification condition on all sets in the following sense: suppose (for binary ) are the corresponding constraint functions in our program. For any (of the same input dimensionality of ), assume that for all :

 limδ↓0(sign(hz(x+δv)))=limδ↓0(sign(hz(x)+δ∇hz(x)⋅v)).

Roughly, this states that the set approached along the limiting direction , when , can be determined with first order information.

Before we state our main theorem, one more definition is in order, due to that may not be continuous. Define the limiting runtime of at as the (supremum) runtime to compute , as is approached from nearby points. Precisely,

 Runtime∗(f;x):=sup{limi→∞Runtime(f;xi):xi→x},

(where the limit is taken to be the set of all limit points).

###### Theorem 3.1.

(A Cheap Subgradient Principle) Assume that our program for , in Algorithm 1, is allowed to use nonsmooth functions from our library (in addition to affine functions and monomials). Suppose assumptions 2.1 and  3.1 hold. There exists a (randomized) algorithm, which upon input , terminates in time that is at most , and, almost surely, returns both and an element .

The following example shows one subtle issue with regards to constraint qualification.

###### Example 3.1.

(Constraint qualification on programs do not compose) Consider the function (which is equivalent to the smooth function ). It is straight forward to see that the induced program for (when we unravel it) does not satisfy the constraint qualification assumption, even if we do use an implementation of that does satisfy this assumption.

Before we present the construction, we first provide a chain rule for nonsmooth functions.

### 3.1 A Chain Rule for Nonsmooth Functions

Let denote the one-sided (Dini) directional derivative:

 D[g;v](x):=limδ↓0g(x+δv)−g(x)δ.

(note that we are not assuming that is a unit vector). This derivative exists for locally Lipschitz functions (Shapiro, 1990)).

###### Assumption 3.2.

(Overloading the library with ASD subroutines) Assume we have a library of (locally Lipschitz) functions computable in our computational model. For any , with the representation , assume we have the following associated automatic subdifferentiation subroutine with the following behavior: upon input , the output satisfies

 a=g(x),d=D[g;v](x),u=∇pz(x)

where is such that:

 limδ↓0(ISz(x+δv))=1.

Roughly speaking, is the derivative determined by the set which is approached along the limiting direction , when .

For any locally Lipschitz function , define the limiting total derivate as:

 ∂[h;v](x):=limδ↓0∇h(x+δv)

if the limit exists. For almost all , the limit exists, and is a subdifferential of .

###### Theorem 3.2.

(A Chain Rule for Nonsmooth Functions) Assume and , … (where ) are locally Lipschitz functions computable in our computational model and that the function is overloaded with an ASD subroutine as specified in Assumption 3.2. Define:

 f(x):=h(g1(x),…gm(x))=h(g(x)),

where is the vector valued function . Denote the vector of (one-sided) directional derivatives as . If it exists, let denote limiting Jacobian matrix (whose rows are given by the vectors ’s). Set:

 [a,d,u]=ASD[h](g(x);D[g;v](x))

For all but a measure set of , we have that and exist and that:

 ∂[f;v](x)=∂[g;v](x)⊤u. (4)
###### Example 3.2.

Consider the example . We define , , and , so that . By applying the ASD subroutine to , starting at with which leads to running (where it is straightforward to verify that ), we obtain

 ∂[f2;v](0) =∂[g;v](0)Tu =[10]⊤[1−1] =1,

which is correct. Furthermore, note a correct answer is obtained for any .

###### Example 3.3.

We return to from Example 3.1. Define , , and so . By applying the chain rule lemma at with ,

 ∂[f;v](0)=∂[g;v](0)u=0⋅u=0

Subtly, note that , so we are feeding a degenerate direction into our subroutine. Regardless, the chain rule lemma still applies (for any in this case).

### 3.2 The algorithm

We first present the algorithm that utilizes an overloaded library. We then provide a provably correct construction of this overloaded library. All proofs are provided in the appendix.

#### Subdifferentiation with the overloaded library

Algorithm 6 is the Automatic Subdifferentiation procedure. Correctness follows from Lemma 3.1.

###### Lemma 3.1.

Suppose Assumptions 2.1 and 3.2 hold. Upon input of an arbitrary , and if is sampled uniformly at random from the unit sphere, then, almost surely, Algorithm 6 returns both and an element .

###### Proof.

Fix . Every parent variable can be expressed as , where is a piecewise polynomial on the dimensional input . Thus

 xk=gk(~g1(x),…,~gk−1(x)).

Now the usual chain rule holds for directional derivatives (Shapiro, 1990). As the forward mode of AD implements the usual chain rule of directional derivatives, then we have .

By Assumption 3.2 and Theorem 3.2, returns and this limiting total derivate satisfies the chain rule . Since the limiting total derivates satisfies the chain rule and the validity of reverse mode AD algorithm relies only on the chain rule, Algorithm 6 correctly computes .

By Rademacher’s theorem and the definition of Clarke subgradient in Equation (1), , for almost all . ∎

The following lemma shows that we can provide a method to correctly overload the library, which we use in Algorithm 6.

###### Lemma 3.2.

(Correct Library Overloading) Assume satisfies the constraint qualification conditions in Assumption 3.1, Suppose the corresponding representation is , On an arbitrary input and , Algorithm 7 returns , , and an element where is such that:

###### Example 3.4.

We again return to from Example 3.1. Here we examine how is overloaded based on the implementation in Algorithm 7. When , we are running and this may not follow the same branch had we run on the (infinitesimal) input which leads to running . However, the gradient is correctly computed, , regardless of the branch taken.

## 4 Discussion and Open Questions

Overloading the Library Functions: It is not difficult to see that piecewise univariate functions can be implemented in our library.

###### Example 4.1.

Univariate Piecewise Polynomial (Algorithm 8). Let be a univariate piecewise polynomial, meaning that the domain is partitioned into a set of intervals . On each interval, the function is equal to a polynomial .

Algorithm 8 provides a constraint qualified program for the function , which can be used as a library function.

An important step would be in extending our computational model to allow the incorporation of provably correct automatic subdifferentiation libraries for linear algebra libraries. AutoGrad (Maclaurin et al., 2015) does do AD through linear algebra methods though it can not be used to obtain correct subdifferentials in programs (at nondifferentiable points); obtaining correct generalized derivatives may be particularly important in cases where we deal with low rank methods. We conjecture our results can be extended, by extending the computational model, to handle these cases (there is already much known about the first order structure of these methods (Seeger et al., 2017)); technically, SVDs are not exactly computable in either the algebraic circuit complexity model or the Blum-Shub-Smale model.

Numerical Analysis: The most important open question is how to obtain numerically stable and accurate solutions (Trefethen and Bau III, 1997; Demmel, 1997). We conjecture the techniques developed here will help in characterizing these issues. In particular, the most natural question is how to develop algorithms that satisfy the mixed stability criterion: the algorithm should give “nearly the right answer to nearly the right problem” (as in  (Trefethen and Bau III, 1997)). For example, for the function, it should be acceptable for an AD method to provide a subgradient near to for a small input due to roundoff error; however, it would undesirable for numerical error to lead vastly different gradients than those that arise from any nearby problem. This may be particularly important when doing AD in physical simulators.

##### Acknowledgments:

We thank Dima Drusvyatskiy for many helpful discussions. Sham Kakade acknowledges funding from Washington Research Foundation Fund for Innovation in Data-Intensive Discovery, the NSF through award CCF-1740551, and ONR award N00014-18-1-2247. Jason D. Lee acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative.

## References

• Abadi et al. (2015) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
• Abadie (1967) J. Abadie. On the kuhn tucker theorem. Nonlinear Programming, pages 19–36, 1967.
• A.Griewank (1995) A.Griewank. Automatic directional differentiation of nonsmooth composite functions. In R.Durier, editor, Recent developments in Optimization / Seventh French-German Conference on Optimization, Dijon 1994, pages 155–169. Springer Verlag, 1995.
• Baur and Strassen (1983) Walter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22:317–330, 1983.
• Baydin et al. (2015) Atilim Gunes Baydin, Barak A. Pearlmutter, and Alexey Radul. Automatic differentiation in machine learning: a survey. CoRR, abs/1502.05767, 2015.
• Blum et al. (1988) Lenore Blum, Mike Shub, and Steve Smale. On a theory of computation over the real numbers; NP completeness, recursive functions and universal machines (extended abstract). In FOCS, pages 387–397. IEEE Computer Society, 1988.
• Clarke (1975) F.H. Clarke. Generalized gradients and applications. Trans. Amer. Math. Soc., 205:247–262, Apr. 1975.
• Clarke et al. (2008) F.H. Clarke, Y.S. Ledyaev, R.J. Stern, and P.R. Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008.
• Demmel (1997) J. W. Demmel. Applied numerical linear algebra. Society for Industrial Mathematics, 1997.
• Fiege et al. (2017) Sabrina Fiege, Andrea Walther, Kshitij Kulshreshtha, and Andreas Griewank. Algorithmic differentiation for piecewise smooth functions: a case study for robust optimization. pages 1–16, 06 2017.
• Gould and Tolle (1971) F.J. Gould and J.W. Tolle. A necessary and sufficient qualification for constrained optimization. 20, 03 1971.
• Griewank (1989) Andreas Griewank. On automatic differentiation. In Mathematical Programming: Recent Developments and Applications, pages 83–108. Kluwer Academic Publishers, 1989.
• Griewank (2012) Andreas Griewank. Who invented the reverse mode of differentiation? Optimization Stories, Documenta Matematica, Extra Volume ISMP (2012):389–400, 2012.
• Griewank (2013) Andreas Griewank. On stable picewise linearization and generalized differentiation. Optimization Methods and Software, 28(6):1139–1178, 2013.
• Griewank (2014) Andreas Griewank. On Automatic Differentiation and Algorithmic Linearization. Pesquisa Operacional, 34:621 – 645, 12 2014. ISSN 0101-7438.
• Griewank and Walther (2008) Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2008.
• Khan and Barton (2013) K. A. Khan and P.I. Barton. Evaluating an element of the clarke generalized jacobian of a composite piecewise differentiable function. ACM Trans. Math. Softw., 39:23:1, 2013.
• Khan and Barton (2015) K. A. Khan and P.I. Barton. A vector forward mode of automatic differentiation for generalized derivative evaluation. Optim. Method Softw., 30:1185, 2015.
• Khan (2017) Kamil A. Khan. Branch-locking ad techniques for nonsmooth composite functions and nonsmooth implicit functions. Optimization Methods and Software, 0(0):1–29, 2017.
• Klatte and Kummer (2002) D. Klatte and B. Kummer. Nonsmooth equations in optimization, volume 60 of Nonconvex Optimization and its Applications. Kluwer Academic Publishers, Dordrecht, 2002. ISBN 1-4020-0550-4. Regularity, calculus, methods and applications.
• Maclaurin et al. (2015) Dougal Maclaurin, David Duvenaud, Matthew Johnson, and Ryan P. Adams. Autograd: Reverse-mode differentiation of native Python, 2015.
• Mordukhovich (2006) B. S. Mordukhovich. Variational Analysis and Generalized Differentiation II: Applications. Springer Berlin Heidelberg, 2006. ISBN 9783540312468.
• Morgenstern (1985) Jacques Morgenstern. How to compute fast a function and all its derivatives: a variation on the theorem of baur-strassen. In SIGA, 1985.
• Nesterov (2005) Yurii Nesterov. Lexicographic differentiation of nonsmooth functions. Math. Program., 104(2-3):669–700, 2005.
• Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
• Peterson (1973) David W. Peterson. A review of constraint qualifications in finite-dimensional spaces. SIAM Review, 15(3):639–654, 1973.
• Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.
• Seeger et al. (2017) Matthias W. Seeger, Asmus Hetzel, Zhenwen Dai, and Neil D. Lawrence. Auto-differentiating linear algebra. CoRR, abs/1710.08717, 2017.
• Shapiro (1990) A. Shapiro. On concepts of directional differentiability. Journal of Optimization Theory and Applications, 66(3):477–487, Sep 1990.
• Trefethen and Bau III (1997) Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. SIAM, 1997.

## Appendix A Proofs

First, we provide a helpful technical lemma.

###### Lemma A.1.

Let , and suppose the functions are analytic functions from . Let be a set defined as follows:

 Sz={x|sign(p1(x))=z1,…sign(pT(x))=zT}.

Fix any . For all and for almost all , there is an open neighborhood such that for all ,

 limδ↓0ISz(x+δu)=limδ↓0ISz(x+δu′).
###### Proof.

We need only consider one such for the proof, as the set of possible is finite. Without loss of generality, we can say is a set of the form:

 S={p1(x)≥0,…,pk(x)≥0,pk+1(x)<0,…,pT(x)<0}.

For the proof, it suffices to show that for all and almost all , there is an open neighborhood such that for all ,

 limδ↓0sign(pj(x+δu))=limδ↓0sign(pj(x+δu′)).

Let us split the constraints into a set of active constraints for and inactive constraints for . For inactive constraints, the above holds due to continuity. It remains to show the above for active constraints. Let us also assume that the functions in the active set are not identically equal to (the claim is true for any zero function).

For every active constraint , testing can be done by the Taylor expansion,

 pj(x+δu)=pj(x)+∞∑i=1δi∥u∥ii!∇ipj(x)[u⊗i],

which exists since is analytic. For any , note that is a polynomial function in . As it is a polynomial, this function will either be equal to (for all ) or it will be nonzero for almost all 777This can proven by induction on the dimension of . . Let be the first index that the function is nonzero for almost all ( is finite as is not the zero function). Then for almost every direction ,

 limδ↓0sign(pj(x+δu))=sign(∇ijpj(x)[u⊗ij]).

For almost all , . Since is strictly non-zero, then, by continuity,

 sign(∇ijpj(x)[u⊗ij])=sign(∇ijpj(x)[u′⊗ij])

for (for sufficiently small ). This implies that for

 limδ↓0sign(pj(x+δu))=limδ↓0sign(pj(x+δu′)),

which completes the proof. ∎

### a.1 Proof of Theorem 3.2

###### Proof.

Suppose that a program for has the representation:

 h(y)=∑z∈{−1,1}TISz(y)pz(y),

where is specified as in Equation 2. Since is the composition of two programs, it also has a representation of the form:

 f(x)=∑z∈{−1,1}T′I~Sz(x)~pz(x),

for different polynomials and sets .

By Rademacher’s theorem, exists for almost all . Since itself is a piecewise polynomial, where is such that . Using Lemma A.1, for all and almost all , there is a (full dimensional) neighborhood around , in which , which implies (using our choice of ). Hence,

 ∂[f;v](x)=∂[f;v′](x)

for all .

Furthermore, using this and the definition of the directional derivative, we have for all ,

 D[f;v′](x)=∂[f;v′](x)⊤v′=∂[f;v](x)⊤v′. (5)

The remainder of the proof seeks to show this directional derivative can be written as:

 D[f;v′](x)=u⊤∂[g;v](x)v′

(for in a sufficiently small full dimensional ball). Note this would imply that for all in a full dimensional ball, , which would complete the proof by choosing linearly independent ’s.

Define be a matrix whose rows are . Observe . Now let us show that:

 D[g;v′](x)=Mv′

for all , where is a sufficiently small (full dimensional) neighborhood around . To see this, note the function itself is a piecewise (vector) valued function, and, by Lemma A.1, for almost all , there is a neighborhood around , in which for all , selects the same piece as , which implies as claimed.

Now due to the assumed properties of , we know

 u=∇pz(y)|y=g(x).

for some which satisfies:

 limδ↓0ISz(g(x)+δD[g;v](x))=1.

Since , we have:

 limδ↓0ISz(g(x)+δMv)=1.

Also, note

 D[h;Mv](y)=∇pz(y)⊤Mv

by the definition of the directional derivative and where .

Using Lemma A.1 again, we will show that

 D[h;Mv′](y)=∇pz(y)⊤Mv′ (6)

for all (for a sufficiently small ball ). Note that the above holds if:

 limδ↓0ISz(g(x)+δMv′)=1, (7)

for all . Now consider the set defined by

 ¯Sz={~v|sign(p1(g(x)+M~v))=z1,…sign(pT(g(x)+M~v))=zT}.

Lemma A.1 guarantees that there exists a ball (centered on ) such that:

 limδ↓0I¯Sz(δv′)=1,

for all , which is equivalent to Equation (7). Thus we have proven the Equation (6), for all .

As the chain rule holds for directional derivatives (of locally Lipschitz functions) [Shapiro, 1990], we have that:

 D[f;v′](x)=D[h;Mv′](y)=∇pz(y)⊤Mv′=u⊤∂[g;v](x)v′

for that is in some sufficiently small full dimensional neighborhood (that is contained in and ).

Now define the full dimensional neighborhood so that it is contained in both and . We have shown, for ,

 u⊤∂[g;v](x)v′=∂[f;v](x)⊤v′,

using Equation 5. The proof is completed by choosing linearly independent ’s, which implies that . ∎

### a.2 Proof of Lemma 3.2

###### Proof of Lemma 3.2.

Let be polynomials that represent by expressing in terms of the input variables only. By the Constraint Qualification Assumption 3.1, the branch taken by Algorithm 7 is the same branch that would be taken by running on the perturbed input for all .

To finish the proof, we simply note that the reverse mode is performed on a straight-line program that computes , where is such that . ∎

### a.3 Completing the proof of Theorem 3.1

Lemma 3.2 shows that every library function that satisfies Assumption 3.1 can be implemented via Algorithm 7 to satisfy Assumption 3.2. Then Lemma 3.1 shows that we compute an element .

The second part of the proof counts the number of unit operations. See Morgenstern [1985], Griewank [1989] for the counting argument for a factor of , where the reverse mode (after we have computed the function and stored the intermediate variables) is a factor of more than computing the function itself.

To obtain a factor of in the existence claim in Theorem 3.1, the construction uses a minor modification of the algorithm presented, where the algorithm is modified to only run one (global) reverse mode (as opposed to using a reverse mode computation associated with each library function). Algorithm 6, the one presented, has a runtime that is within a factor of of the cost of computing just , due to that it runs the reverse mode when calling every library function (which costs a factor of more than just computing ). This algorithm presented is what we actually suggest to use, as the algorithm that achieves a factor of may require more memory in order to build a larger computational graph (and the factor of is rather pessimistic). To see the factor of , the reverse mode (associated with each library function) gives an additional factor of . The directional derivative costs one additional factor (this is referred to as the forward mode of AD [Griewank, 1989]). After all (overloaded) library functions are completed, one additional reverse mode is called at the end of Algorithm 6, which incurs (at most) an additional factor of .

To obtain the factor of , it is straightforward to modify the algorithm, as the following proof shows.

###### Proof of Theorem 3.1.

First, observe that the algorithm correctly computes the function value on some (limiting branch), so the time it takes to compute on this branch is some where . Now the directional derivative computation can be computed when doing the forward pass of the algorithm, which incurs at most one additional factor of (this is also referred to as the forward mode of AD). Instead of computing the intermediate derivatives (as in Algorithm 7), we will unravel the entire computational graph of , where we replace each function , by inserting the corresponding computational graph of into a global computational graph. We then run the reverse mode of AD on this global computational graph, giving only one additional factor of . ∎

You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters