An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks

# An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks

Qianxiao Li liqix@ihpc.a-star.edu.sg Institute of High Performance Computing, A*STAR, Singapore Shuji Hao Institute of High Performance Computing, A*STAR, Singapore
###### Abstract

Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on the Pontryagin’s maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights that are constrained to take values in a discrete set. We obtain competitive performance and interestingly, very sparse weights in the case of ternary networks, which may be useful in model deployment in low-memory devices.

## 1 Introduction

The problem of training deep feed-forward neural networks is often studied as a nonlinear programming problem [Bazaraa et al., 2013, Bertsekas, 1999, Kuhn and Tucker, 2014]

 minθJ(θ)

where represents the set of trainable parameters and is the empirical loss function. In the general unconstrained case, necessary optimality conditions are given by the condition for an optimal set of training parameters . This is largely the basis for (stochastic) gradient-descent based optimization algorithms in deep learning [Robbins and Monro, 1951, Duchi et al., 2011, Zeiler, 2012, Kingma and Ba, 2014]. When there are additional constraints, e.g. on the trainable parameters, one can instead employ projected versions of the above algorithms. More broadly, necessary conditions for optimality can be derived in the form of the Karush-Kuhn-Tucker conditions [Kuhn and Tucker, 2014]. Such approaches are quite general and typically do not rely on the structures of the objectives encountered in deep learning. However, in deep learning, the objective function often has a specific structure; It is derived from feeding a batch of inputs recursively through a sequence of trainable transformations, which can be adjusted so that the final outputs are close to some fixed target set. This process resembles an optimal control problem [Bryson, 1975, Bertsekas, 1995, Athans and Falb, 2013] that originates from the study of the calculus of variations.

In this paper, we exploit this optimal control viewpoint of deep learning. First, we introduce the discrete-time Pontryagin’s maximum principle (PMP) [Halkin, 1966], which is an extension the central result in optimal control due to Pontryagin and coworkers [Boltyanskii et al., 1960, Pontryagin, 1987]. This is an alternative set of necessary conditions characterizing optimality, and we discuss the extent of its validity in the context of deep learning. Next, we introduce the discrete method of successive approximations (MSA) based on the PMP to optimize deep neural networks. A rigorous error estimate is proved that elucidates the dynamics of the MSA, and aids us in designing optimization algorithms under rather general conditions. We apply our method to train a class of unconventional networks, i.e. those with discrete-valued weights, to illustrate the usefulness of this approach. In the process, we discover that in the case of ternary networks, our training algorithm obtains trained models that are very sparse, which is an attractive feature in practice.

The rest of the paper is organized as follows: In Sec. 2, we introduce the optimal control viewpoint and the discrete-time Pontryagin’s maximum principle. We then introduce the method of successive approximation in Sec. 3 and prove our main estimate, Theorem 2. In Sec. 4, we derive algorithms based on the developed theory to train binary and ternary neural networks. Finally, we end with a discussion on related work and a conclusion in Sec. 5 and 6 respectively. Various details on proofs and algorithms are provided in Appendix A-D.

Hereafter, we denote the usual Euclidean norm by and the corresponding induced matrix norm by . The Frobenius norm is written as . Throughout this work, we use a bold-faced version of a variable to represent a collection of the same variable, but indexed additionally by , e.g. .

## 2 The Optimal Control Viewpoint

In this section, we formalize the problem of training a deep neural network as an optimal control problem. Let denote the number of layers and represent a collection of fixed inputs (images, time-series). Here, is the sample size. Consider the dynamical system

 xs,t+1=ft(xs,t,θt),t=0,1,…,T−1, (1)

where for each , is a transformation on the state. For example, in typical neural networks, it can represent a trainable affine transformation or a non-linear activation (in which case it is not trainable and does not depend on ). We assume that each trainable parameter set is a subset of an Euclidean space. The goal of training a neural network is to adjust the weights so as to minimize some loss function that measures the difference between the final network output and some true labels of , which are fixed. Thus, we may define a family of real-valued functions acting on ( are absorbed into the definition of ) and the average loss function is . In addition, we may consider some regularization terms for each layer that has to be simultaneously minimized. In typical applications, regularization is only performed for the trainable parameters so that , but here we will consider the slightly more general case where it is also possible to regularize the state at each layer. In summary, we wish to solve the following problem

 minθ∈ΘJ(θ):=1SS∑s=1Φs(xs,T)+1SS∑s=1T−1∑t=0Lt(xs,t,θt) subject to: xs,t+1=ft(xs,t,θt),t=0,…,T−1,s∈[S] (2)

where we have defined for shorthand and . One may recognize problem (2) as a classical fixed-time, variable-terminal-state optimal control problem in discrete time [Ogata, 1995], in fact a special one with almost decoupled dynamics across samples in .

### 2.1 The Pontryagin’s Maximum Principle

Maximum principles of the Pontryagin type [Boltyanskii et al., 1960, Pontryagin, 1987] usually consist of necessary conditions for optimality in the form of the maximization of a certain Hamiltonian function. The distinguishing feature is that it does not assume differentiability (or even continuity) of with respect to . Consequently the optimality condition and the algorithms based on it need not rely on gradient-descent type updates. This is an attractive feature for certain classes of applications.

Let be a solution of (2). We now outline informally the Pontryagin’s maximum principle (PMP) that characterizes . First, for each we define the Hamiltonian function by

 Ht(x,p,θ):=p⋅ft(x,θ)−1SLt(x,θ). (3)

One can show the following necessary conditions.

###### Theorem 1 (Discrete PMP, Informal Statement).

Let and , be sufficiently smooth in . Assume further that for each and , the sets and are convex. Then, there exists co-state processes , such that following holds for and :

 x∗s,t+1=∇pHt(x∗s,t,p∗s,t+1,θ∗t),x∗s,0=xs,0 (4) p∗s,t=∇xHt(x∗s,t,p∗s,t+1,θ∗t),p∗s,T=−1S∇Φs(x∗s,T) (5) S∑s=1Ht(x∗s,t,p∗s,t+1,θ∗t)≥S∑s=1Ht(x∗s,t,p∗s,t+1,θ),∀θ∈Θt (6)

The full statement of Theorem 1 involve explicit smoothness assumptions and additional technicalities (such as the inclusion of an abnormal multiplier). In Appendix A, we state these assumptions and give a sketch of its proof based on [Halkin, 1966].

Let us discuss the PMP in detail. The state equation (4) is simply the forward propagation equation (1) under the optimal parameters . Eq. (5) defines the evolution of the co-state . To draw an analogy with nonlinear programming, the co-state can be interpreted as a set of Lagrange multipliers that enforces the constraint (1) when the optimization problem (2) is regarded as a joint optimization problem in and , . In the optimal control and PMP viewpoint, it is perhaps more appropriate to think of the dynamics (5) as the evolution of the normal vector of a separating hyper-plane, which separates the set of reachable states and the set of states where the objective function takes on values smaller than the optimum (see Appendix A).

The Hamiltonian maximization condition (6) is the centerpiece of the PMP. It says that an optimal solution must globally maximize the (summed) Hamiltonian for each layer . Let us contrast this statement with usual first-order optimality conditions of the form . A key observation is that in Theorem 1, we did not make reference to the derivative of any quantity with respect . In fact, the PMP holds even if is not differentiable, or even continuous, with respect to , as long as the convexity assumptions are satisfied. On the other hand, if we assume for each : 1) is differentiable with respect to ; 2) is concave in ; and 3) lies in the interior of , then the Hamiltonian maximization condition (6) is equivalent to the condition for all , which one can then show is equivalent to (See Appendix C, proof of Prop. C.1). In other words, the PMP can be viewed as a stronger set of necessary conditions (at optimality, is not just stationary, but globally maximized) and has meaning in more general scenarios, e.g. when stationarity with respect to is not achievable due to constraints, or not defined due to non-differentiability.

###### Remark 1.

It may occur that is constant for all , in which case the problem is singular [Athans and Falb, 2013]. In such cases, the PMP is trivially satisfied by any and so the PMP does not tell us anything useful. This may arise especially in the case where there are no regularization terms.

### 2.2 The Convexity Assumption

The most crucial assumption in Theorem 1 is the convexity of the sets and for each fixed  111Note that this is in general unrelated to the convexity, in the sense of functions, of with respect to either or . For example, the scalar function is evidently non-convex in both arguments, but is convex for each . On the other hand is non-convex because of a non-convex admissable set.. We now discuss how restrictive these assumptions are with regard to deep neural networks. Let us first assume that the admissable sets are convex. Then, the assumption with respect to is not restrictive since most regularizers (e.g. ) satisfy it. Let us consider the convexity of . In classical feed-forward neural networks, there are two types of layers: trainable ones and non-trainable ones. Suppose layer is non-trainable (e.g.  where is a non-linear activation function), then for each the set is a singleton, and hence trivially convex. On the other hand, in trainable layers is usually affine in . This includes fully connected layers, convolution layers and batch normalization layers [Ioffe and Szegedy, 2015]. In these cases, as long as the admissable set is convex, we again satisfy the convexity assumption. Residual networks also satisfy the convexity constraint if one introduces auxillary variables (see Appendix A.1). When the set is not convex, then it is in general not true that the PMP constitute necessary conditions.

Finally, we remark that in the original derivation of the PMP for continuous-time control systems [Boltyanskii et al., 1960] (i.e. in place of Eq. (1)), the convexity condition can be removed due to the “convexifying” effect of integration with respect to time [Halkin, 1966, Warga, 1962]. Hence, the convexity condition is purely an artifact of discrete-time dynamical systems.

## 3 The Method of Successive Approximations

The PMP (Eq. (4)-(6)) gives us a set of necessary conditions an optimal solution to (2) must satisfy. However, it does not tell us how to find one such solution. The goal of this section is to discuss algorithms for solving (2) based on the maximum principle.

On closer inspection of Eq. (4)-(6), one can see that they each represent a manifold in solution space consisting of all possible , and , and the intersection of these three manifolds must contain an optimal solution, if one exists. Consequently, an iterative projection method that successively projects a guessed solution onto each of the manifolds is natural. This is the method of successive approximations (MSA), which was first introduced to solve continuous-time optimal control problems [Krylov and Chernousko, 1962, Chernousko and Lyubushin, 1982]. Let us now outline a discrete-time version.

Start from an initial guess . For each sample , we define by the dynamics

 xθ0s,t+1=ft(xθ0s,t,θ0t),xθ0s,0=xs,0, (7)

for . Intuitively, this is a projection onto the manifold defined by Eq. (4). Next, we perform the projection onto the manifold defined by Eq. (5), i.e. we define by the backward dynamics

 pθ0s,t=∇xH(xθ0s,t,pθ0s,t+1,θ0t),pθ0s,T=−1S∇Φs(xθ0s,T), (8)

for . Finally, we project onto manifold defined by Eq. (6) by performing Hamiltonian maximization to obtain with

 θ1t=argmaxθ∈ΘtS∑s=1Ht(xθ0s,t,pθ0s,t+1,θ).t=0,…,T−1. (9)

The steps (7)-(9) are then repeated until convergence. We summarize the basic MSA algorithm in Alg. 1.

Let us contrast the MSA with gradient-descent based methods. Similar to the formulation of the PMP, at no point did we take the derivative of any quantity with respect to . Hence, we can in principle apply this to problems that are not differentiable with respect to . However, the catch is that the Hamiltonian maximization step (9) may not be trivial to evaluate. Nevertheless, observe that the maximization step is decoupled across different layers of the neural network, and hence it is a much smaller problem than the original optimization problem, and its solution method can be parallelized. Alternatively, as seen in Sec. 4, one can exploit cases where the maximization step has explicit solutions.

The basic MSA (Alg. 1 can be shown to converge for problems where is linear and the costs are quadratic [Aleksandrov, 1968]. In general, however, unless a good initial condition is given, the MSA may diverge. Let us understand the nature of such phenomena by obtaining rigorous error estimates per-iteration of Eq. (7)-(9).

### 3.1 An Error Estimate for the MSA

In this section, we derive a rigorous error estimate for the MSA, which can help us understand its dynamics. Let us define , where is defined according to Eq. (7). This is the convex hull of all states reachable at layer by some initial sample and some choice of the values for the trainable parameters. Let us now make the following assumptions:

1. is twice continuously differentiable, with and satisfying a Lipschitz condition uniformly in , i.e. there exists such that

 |Φs(x)−Φs(x′)|+∥∇Φs(x)−∇Φs(x′)∥≤K∥x−x′∥

for all and .

2. and are twice continuously differentiable in , with satisfying Lipschitz conditions in uniformly in and , i.e. there exists such that

 ∥ft(x,θ)−ft(x′,θ)∥ +∥∇xft(x,θ)−∇xft(x′,θ)∥2≤K∥x−x′∥ |Lt(x,θ)−Lt(x′,θ)| +∥∇xLt(x,θ)−∇xLt(x′,θ)∥≤K∥x−x′∥

for all , and .

Again, let us discuss these assumptions with respect to neural networks. Note that both assumptions are more easily satisfied if each is bounded, which is usually implied by the boundedness of . Although this is not typically true in principle, we can safely assume this in practice by truncating weights that are too large in magnitude. Consequently, (A1) is not very restrictive, since many commonly employed loss functions (mean-square, soft-max with cross-entropy) satisfy these assumptions. In (A2), the regularity assumption on is again not an issue, because we mostly take to be independent of . On the other hand, the regularity of with respect to is sometimes restrictive. For example, ReLU activations does not satisfy (A2) due to non-differentiability. Nevertheless, any suitably mollified version (like Soft-plus) does satisfy it. Moreover, tanh and sigmoid activations also satisfy (A2). Finally, unlike in Theorem 1, we do not assume the convexity of the sets and , and hence the results in this section applies to discrete-weight neural networks considered in Sec. 4. With the above assumptions, we prove the following estimate.

###### Theorem 2 (Error Estimate for Discrete MSA).

Let assumptions (A1) and (A2) be satisfied. Then, there exists a constant , independent of , and , such that for any , we have

 J(ϕ)−J(θ) ≤ −T−1∑t=0S∑s=1Ht(xθs,t,pθs,t+1,ϕt)−Ht(xθs,t,pθs,t+1,θt) (10) +CST−1∑t=0S∑s=1∥ft(xθs,t,ϕt)−ft(xθs,t,θt)∥2 (11) +CST−1∑t=0S∑s=1∥∇xft(xθs,t,ϕt)−∇xft(xθs,t,θt)∥22, (12) +CST−1∑t=0S∑s=1∥∇xLt(xθs,t,ϕt)−∇xLt(xθs,t,θt)∥2, (13)

where , are defined by Eq. (7) and (8).

###### Proof.

The proof follows from elementary estimates and a discrete Gronwall’s lemma. See Appendix B. ∎

Theorem 2 relates the decrement of the total objective function with respect to the iterative projection steps of the MSA. Intuitively, Theorem 2 says that the Hamiltonian maximization step (9) is generally the right direction, because a large magnitude of (10) results in higher loss improvement. However, whenever we change the parameters from to (e.g. during the maximization step (9)), we incur non-negative penalty terms (11)-(13). Observe that these penalty terms vanish if , or more generally, when the state and co-state equations (Eq. (7), (8)) are still satisfied when is replaced by . In other words, these terms measure the distance from manifolds defined by the state and co-state equations when the parameter changes. Alg. 1 diverges when these penalty terms dominate the gains from (10). This insight can point us in the right direction of developing convergent modifications of the basic MSA. We shall now discuss this in the context of some specific applications.

## 4 Neural Networks with Discrete Weights

We now turn to the application of the theory developed in the previous section on the MSA, which is a PMP-based numerical method for training deep neural networks. As discussed previously, the main strength of the PMP and MSA formalism is that we do not rely on gradient-descent type updates. This is particularly useful when one considers neural networks with (some) trainable parameters that can only take values in a discrete set. Then, any small gradient update to the parameters will almost always be infeasible. In this section, we will consider two such cases: binary networks, where weights are restricted to ; and ternary networks, where weights are selected from . These networks are potentially useful for low-memory devices as storing the trained weights requires less memory. In this section, we will modify the MSA so that we can train these networks in a principled way.

### 4.1 Binary Networks

Binary neural networks are those with binary trainable layers, e.g. in the fully connected case,

 ft(x,θ)=θx (14)

where is a binary matrix. A similar form of holds for convolution neural networks after reshaping, except that is now the set of Toeplitz binary matrices. Hereafter, we will consider the fully connected case for simplicity of exposition. It is also natural to set the regularization to since there is in general no preference between or . Thus, the Hamiltonian has the form

 Ht(x,p,θ)=p⋅θx.

Consequently, the Hamiltonian maximization step (9) has explicit solution, given by

 argmaxθ∈ΘtS∑s=1Ht(xθks,t,pθks,t+1,θ)=sign(Mθkt)

where . Note that the sign function is applied element-wise. If , then the arg-max is arbitrary. Using Theorem 2 with the form of given by (14) and the fact that , we get

 J(ϕ)−J(θ)≤ −T−1∑t=0S∑s=1Ht(xθks,t,pθks,t+1,θ) +CST−1∑t=0(1+S∑s=1∥xθs,t∥2)∥ϕt−θt∥2F,

Note that we have used the inequality . Assuming that is , we may then decrease by not only maximizing the Hamiltonian, but also penalizing the difference , i.e. for each and we set

 θk+1t= argmaxθ∈Θt[S∑s=1Ht(xθks,t,pθks,t+1,θ)−ρk,t∥θ−θk∥2F] (15)

for some penalization parameters . This again has the explicit solution

 [θk+1t]ij= {sign([Mθkt]ij)|[Mθkt]ij|≥2ρk,tijotherwise (16)

Therefore, we simply replace the parameter update step in Alg. 1 with (16). Furthermore, to deal with mini-batches, we keep a moving average of across different mini-batches and use the averaged value to update our parameters. It is found empirically that this further stabilizes the algorithm. Note that the assumption is can be achieved by normalization, e.g. batch-normalization [Ioffe and Szegedy, 2015]. We summarize the algorithm in Alg. 2. Further algorithmic details are found in Appendix D, where we also discuss the choice of hyper-parameters and the convergence of the algorithm for a simple binary regression problem. A rigorous proof of convergence in the general case is beyond the scope of this work, but we demonstrate via experiments below that the algorithm performs well on the tested benchmarks.

We apply Alg. 2 to train binary neural networks on various benchmark datasets and compare the results from previous work on training binary-weight neural networks [Courbariaux et al., 2015]. We consider a fully-connected neural network on MNIST [LeCun, 1998], as well as (shallow) convolutional networks on CIFAR-10 [Krizhevsky and Hinton, 2009] and SVHN [Netzer et al., 2011] datasets. The network structures are mostly identical to those considered in [Courbariaux et al., 2015] for ease of comparison. Complete implementation and model details are found in Appendix D. The graphs of training/testing loss and error rates are are shown in Fig. 1. We observe that our algorithm performs well in terms of an optimization algorithm, as measured by the training loss and error rates. For the harder datasets (CIFAR-10 and SVHN), we have rapid convergence but worse test loss and error rates at the end, possibly due to overfitting. We note that in [Courbariaux et al., 2015], many regularization strategies are performed. We expect that similar techniques must be employed to improve the testing performance. However, these issues are out of the scope of the optimization framework of this paper. Note that we also compared the results of BinaryConnect without regularization strategies such as stochastic binarization, but the results are similar in that our algorithm converges very fast with very low training losses, but sometimes overfits.

### 4.2 Ternary Networks

We shall consider another case where the network weights are allowed to take on values in . In this case, our goal is to explore the sparsification of the network. To this end, we shall take for some parameter . Note that since the weights are restricted to the ternary set, any component-wise regularization for are identical. The higher the values, the more sparse the solution will be.

As in Sec. 4.1, we can write down the Hamiltonian for a fully connected ternary layer as

 Ht(x,p,θ)=p⋅θx−1Sλt∥θ∥2F.

The derivation of the ternary algorithm then follows directly from those in Sec. 4.1, but with the new form of Hamiltonian above and that . Maximizing the augmented Hamiltonian (15) with as defined above, we obtain the ternary update rule

 [θk+1t]ij=⎧⎪⎨⎪⎩+1[Mθkt]ij≥ρk,t(1−2[θkt]ij)+λt−1[Mθkt]ij≤−ρk,t(1+2[θkt]ij)−λt0otherwise. (17)

We replace the parameter update step in Alg. 2 by (17) to obtain the MSA algorithm for ternary networks. For completeness, we give the full ternary algorithm in Alg. 3. We now test the ternary algorithm on the same benchmarks used in Sec. 4.1 and the results are shown in Fig. 2. Observe that the performance on training and testing datasets is similar to the binary case (Fig. 1), but the ternary networks achieve high degrees of sparsity in the weights, with only 0.5-2.5% of the trained weights being non-zero, depending on the dataset. This potentially offers significant memory savings compared to its binary or full floating precision counterparts.

## 5 Discussion and Related Work

We begin with a discussion of the results presented thus far. We first introduced the viewpoint that deep learning can be regarded as a discrete-time optimal control problem. Consequently, an important result in optimal control theory, the Pontryagin’s maximum principle, can be applied to give a set of necessary conditions for optimality. These are in general stronger conditions than the usual optimality conditions based on the vanishing of first-order partial derivatives. Moreover, they apply to broader contexts such as problems with constraints on the trainable parameters or problems that are non-differentiable in the trainable parameters. However, we note that specific assumptions regarding the convexity of some sets must be satisfied. We showed that they are justified for conventional neural networks, but not necessarily so for all neural networks (e.g. binary, ternary networks).

Next, based on the PMP, we introduced an iterative projection technique, the discrete method of successive approximations (MSA), to find an optimal solution of the learning problem. A rigorous error estimate (Theorem 2) is derived for the discrete MSA, which can be used to both understand its dynamics and to derive useful algorithms. This should be viewed as the main theoretical result of the present paper. Note that the usual back-propagation with gradient descent can be regarded as a simple modification of the MSA, if differentiability conditions are assumed (see Appendix C). Nevertheless, we note that Theorem 2 itself does not assume any regularity conditions with respect to the trainable parameters. Moreover,neither does it require the convexity conditions in Theorem 1, and hence applies to a wider range of neural networks, including those in Sec. 4. All results up to this point apply to general neural networks (assuming that the respective conditions are satisfied), and are not specific to the applications presented subsequently.

In the last part of this work, we apply our results to devise training algorithms for discrete-weight neural networks, i.e. those with trainable parameters that can only take values in a discrete set. Besides potential applications in model deployment in low-memory devices, the main reasons for choosing such applications are two-fold. First, gradient-descent updates are not applicable by itself because small updates to parameters are prohibited by the discrete equality constraint on the trainable parameters. However, our method based on the MSA is applicable since it does not perform gradient-descent updates. Second, in such applications the potentially expensive Hamiltonian maximization steps in the MSA have explicit solutions. This makes MSA an attractive optimization method for problems of this nature. In Sec 4, we demonstrate the effectiveness of our methods on various benchmark datasets. Interestingly, the ternary network exhibits extremely sparse weights that perform almost as well as its binary counter-part (see Fig. 2). Also, the phenomena of overfitting in Fig. 1 and 2 are interesting as overfitting is generally less common in stochastic gradient based optimization approaches. This seems to suggest that the MSA based methods optimize neural networks in a rather different way.

Let us now put our work in the context of the existing literature. First, the optimal control approach we adopt is quite different from the prevailing viewpoint of nonlinear programming [Bertsekas, 1999, Bazaraa et al., 2013, Kuhn and Tucker, 2014] and the analysis of the derived gradient-based algorithms [Moulines, 2011, Shamir and Zhang, 2013, Bach and Moulines, 2013, Xiao and Zhang, 2014, Shalev-Shwartz and Zhang, 2014] for the training of deep neural networks. In particular, the PMP (Thm. 1) and the MSA error estimate (Thm. 2) do not assume differentiability and do not characterize optimality via gradients (or sub-gradients) with respect to trainable parameters. In this sense, it is a stronger and more robust condition, albeit sometimes requiring different assumptions. The optimal control and dynamical systems viewpoint has been discussed in the context of deep learning in [E, 2017, Li et al., 2017b] and dynamical systems based discretization schemes has been introduced in [Haber and Ruthotto, 2017, Chang et al., 2017]. Most of these works have theoretical basis in continuous-time dynamical systems. For example, [Li et al., 2017b] analyzed continuous-time analogues of neural networks in the optimal control framework and derived MSA-based algorithms in continuous time. In contrast, the present work starts with a discrete-time formulation, which is natural in the usual context of deep learning. The present method for stabilizing the MSA is also different from that in [Li et al., 2017b], where augmented Lagrangian type of modifications are employed. The latter would not be effective here because weights cannot be updated infinitesimally without violating the binary/ternary constraint.

In the deep learning literature, the connection between optimal control and deep learning has been qualitative mentioned in [LeCun, 1988] and applied to the development of automatic differentiation and back-propagation [Bryson, 1975, Baydin et al., 2015]. However, there are relatively fewer works relating optimal control algorithms to training neural networks beyond the classical gradient-descent with back-propagation. Optimal control based strategies in hyper-parameter tuning has been discussed in [Li et al., 2017c].

In the continuous-time setting, the Pontryagin’s maximum principle and the method of successive approximations have a long history, with a large body of relevant literature including, but not limited to [Boltyanskii et al., 1960, Pontryagin, 1987, Bryson, 1975, Bertsekas, 1995, Athans and Falb, 2013, Krylov and Chernousko, 1962, Aleksandrov, 1968, Krylov and Chernousko, 1972, Chernousko and Lyubushin, 1982, Lyubushin, 1982]. The discrete-time PMP have been studied in [Halkin, 1966, Holtzman, 1966a, Holtzman and Halkin, 1966, Holtzman, 1966b, Canon et al., 1970], where Theorem 1 and its extensions are proved. To the best of our knowledge, the discrete-time MSA and its quantitative analysis have not been performed in either the deep learning or the optimal control literature.

Sec. 4 concerns the application of the MSA, in particular Thm. 2, to develop training algorithms for binary and ternary neural networks. There are a number of prior work exploring the training of similar neural networks, such as [Courbariaux et al., 2015, Hubara et al., 2016, Rastegari et al., 2016, Tang et al., 2017, Li et al., 2016, Zhu et al., 2016]. Theoretical analysis for the case of convex loss functions is carried out in [Li et al., 2017a]. Our point of numerical comparison for the binary MSA algorithm is [Courbariaux et al., 2015], where optimization of binary networks is based on shadow variables with full floating-point precision that is iteratively truncated to obtain gradients. We showed in Sec. 4.1 that the binary MSA is competitive as a training algorithm, but is in need of modifications to reduce overfitting for certain datasets. Training ternary networks has been discussed in [Hwang and Fan, 1967, Kim et al., 2014, Li et al., 2016, Zhu et al., 2016]. The difference in our ternary formulation is that we explore the sparsification of networks using a regularization parameter. In this sense it is related to compression of neural networks (e.g. [Han et al., 2015]), but our approach trains a network that is naturally ternary, and compression is achieved during training by a regularization term. Generally, a contrasting aspect of our approach from the aforementioned literature is that the theory of optimal control, together with Theorem. 2, provide a theoretical basis for the development of our algorithms. Nevertheless, further work is required to rigorously establish the convergence of these algorithms. We also mention a recent work [Yin et al., 2018] which analyzes quantized networks and develops algorithms based on relaxing the discrete-weight constraints into continuous regularizers. This is again a different approach from the current framework based on optimal control, and it will be interesting to compare results in future work.

## 6 Conclusion and Outlook

In this paper, we have introduced the discrete-time optimal control viewpoint of deep learning. In particular, the PMP and the MSA form an alternative theoretical and algorithmic basis for deep learning that may apply to broader contexts. As an application of our framework, we considered the training of binary and ternary neural networks, in which we develop effective algorithms based on the optimal control framework.

There are certainly many avenues of future work. An interesting mathematical question is the applicability of the PMP for discrete-weight neural networks, which does not satisfy the convexity assumptions in Theorem 1. It will be a challenge to find the condition under which rigorous statements can be made. Another question is to establish the convergence of the algorithms presented.

## Appendix A Full Statement and Sketch of the Proof of Theorem 1

In this section, we give the full statement of Theorem 1 and a sketch of its proof as presented in [Halkin, 1966]. We note that in [Halkin, 1966], more general initial and final conditions are considered. For simplicity, we shall stick to the current formulation in the main text. We note also that the result presented here has been extended (in the sense that the convexity condition has been relaxed to directional convexity) [Holtzman, 1966a, Holtzman and Halkin, 1966] and proven in different ways subsequently [Canon et al., 1970].

Before we begin, we simplify the notation by concatenating all the samples into a large vector . The functions are then redefined accordingly in the natural way. Moreover, we define the total loss function and the total regularization . Consequently, we have the reformulated problem

 minθ∈ΘJ(θ):=Φ(xT)+T−1∑t=0Lt(xt,θt) subject to: xt+1=ft(xt,θt),t=0,…,T−1. (18)

We now make the following assumptions:

• is twice continuous differentiable.

• are twice continuously differentiable with respect to , and together with their partial derivatives are uniformly bounded in and .

• The sets and are convex for every and .

The full statement of Theorem 1 is as follows:

###### Theorem 3 (Discrete PMP, Full Statement).

Let (B1)-(B3) be satisfied. Suppose that is an optimal solution of (18) and is the corresponding state process with . Then, there exists a co-state (or adjoint) process and a real number (abnormal multiplier) such that are not all zero, and the following holds:

 x∗t+1=∇pHt(x∗t,p∗t+1,θ∗t) x∗0=x0 (19) p∗t=∇xHt(x∗t,p∗t+1,θ∗t) p∗T=−β∇Φ(x∗T) (20) Ht(x∗t,p∗t,θ∗t)≥Ht(x∗t,p∗t,θ) for all θ∈Θt (21)

for , where the Hamiltonian function is defined as

 Ht(x,p,θ):=p⋅ft(x,θ)−βLt(x,θ).
###### Remark 2.

Compared with the informal statement, the full statement involves an abnormal multiplier . It exists to cover degenerate cases. This is related to “normality” in the calculus of variations [Bliss, 1938], or constraint qualification in the language of nonlinear programming [Kuhn and Tucker, 2014]. When it equals , the problem is degenerate. In applications we often focus on non-degenerate cases where is positive, in which case we can normalize accordingly so that . We then obtain the informal statement in the main text.

###### Sketch of the proof of Theorem 3.

To begin with, we may assume without loss of generality that . To see why this is so, we define an extra scalar variable with

 wt+1=wt+Lt(xt,θt),w0=0.

We then append to to form the new -dimensional state vector . Accordingly, we modify to and to . It is clear that all assumptions (B1)-(B3) are preserved.

As in the main text, we define the set of reachable states by the original dynamical system

 Wt:={x∈Rdt:∃θ s.t. xθt=x} (22)

where is the evolution of the dynamical system for under . This is basically the set of all states that the system can reach under “some” control at time . Let be a pair of optimal solutions of (18). Let us define the set of all final states with lower loss value than the optimum as

 S:={x∈RdT:Φ(x)<Φ(x∗T)}. (23)

Then, it is clear that and are disjoint. Otherwise, would not have been optimal. Now, if and are convex, then one can then use separation properties of convex sets to prove the theorem. However, in general they are non-convex (even if (B3) is satisfied). The idea is to consider the following linearized problem

 ψt+1=ft(x∗t,θt)+∇xft(x∗t,θ∗t)(ψt−x∗t),t=0,1,…,T−1 ψ0=x0 (24)

Then, we can similarly define the counter-parts to and as

 W+t:={x∈Rdt:∃θ s.t. ψθt=x} (25)

and

 S+:={x∈RdT:(x−x∗T)⋅∇Φ(x∗T)<0}. (26)

It is clear that the sets and are both convex. In [Halkin, 1966], author proves an important linearization lemma that says: if and are disjoint, then and are separated, i.e. there exists a non-zero vector such that

 (x−x∗T)⋅π≤0 x∈W+T (27) (x−x∗T)⋅π≥0 x∈S+ (28)

Here, is the normal of a separating hyper-plane of the convex sets and . In fact, one can show that for some . We note here that the linearization lemma, i.e. the separation of and , forms the bulk of the proof of the theorem in [Halkin, 1966]. The proof relies on topological properties of non-separated convex sets. We shall omit its proof here and refer the reader to [Halkin, 1966].

Now, we may define , and for , set

 p∗t=∇xHt(x∗t,p∗t+1,θ∗t)=∇xf(x∗t,θ∗t)Tp∗t+1. (29)

In other words, evolves the normal of the separating hyper-plane of and backwards in time. An important property one can check is that and (defined by Eq. (29) and (24)) are adjoint of each other at the optimum, i.e. if , then we have

 (ψt+1−x∗t+1)⋅p∗t+1=(ψt−x∗t)⋅p∗t. (30)

This fact allows one to prove the Hamiltonian maximization condition (21). Indeed, suppose that for some the condition is violated, i.e. there exists such that

 Ht(x∗t,p∗t+1,~θ)=Ht(x∗t,p∗t+1,θ∗t)+ϵ

for some . This means

 p∗t+1⋅ft(x∗t,~θ)=p∗t+1⋅ft(x∗t,θ∗t)+ϵ

i.e.,

 p∗t+1⋅(ft(x∗t,~θ)−x∗t+1)=ϵ

Now, we simply evolve , with but the initial condition . Then, Eq. (30) implies that , but this contradicts (27). ∎

###### Remark 3.

Note that in the original proof [Halkin, 1966], it is also assumed that is non-singular, which also forces to be constant for all . This is obviously not satisfied naturally by most neural networks that have changing dimensions. However, one can check that this condition only serves to ensure that if , then for all . Hence, without this assumption, we can only be sure that not all are 0.

### a.1 The Convexity Condition for Neural Networks

As also discussed in the main text, the most stringent condition in Theorem 3 is the convexity condition for , i.e. the set must be convex. It is easy to see that for the usual feed-forward neural networks, one can decompose it in such a way that the convexity constraint is satisfied as long as the parameter sets are convex. Indeed,we have

 xt+1=σ(gt(xt,θt))

where is some non-trainable nonlinear activation function and is affine in . We can simply decompose this into two steps

 x′t+1=gt(xt,θt), x′t+2=σ(x′t+1).

Then, but each of these two steps now satisfy the convexity constraint.

Similarly, in residual networks, we can usually write the layer transformation as

 xt+1=xt+ht(σ(gt(xt,θt)),ϕt)

where are maps affine in and respectively, and is a non-trainable non-linearity. The above cannot be straightforwardly split into two layers as there is a shortcut connection from . However, we can introduce auxillary variables and consider the 3-step decomposition

 x<