A Mean-Field Optimal Control Formulation of Deep Learning
Recent work linking deep neural networks and dynamical systems opened up new avenues to analyze deep learning. In particular, it is observed that new insights can be obtained by recasting deep learning as an optimal control problem on difference or differential equations. However, the mathematical aspects of such a formulation have not been systematically explored. This paper introduces the mathematical formulation of the population risk minimization problem in deep learning as a mean-field optimal control problem. Mirroring the development of classical optimal control, we state and prove optimality conditions of both the Hamilton-Jacobi-Bellman type and the Pontryagin type. These mean-field results reflect the probabilistic nature of the learning problem. In addition, by appealing to the mean-field Pontryagin’s maximum principle, we establish some quantitative relationships between population and empirical learning problems. This serves to establish a mathematical foundation for investigating the algorithmic and theoretical connections between optimal control and deep learning.
Deep learning bengio2009learning (); lecun2015deep (); goodfellow2016deep () has become a primary tool in many modern machine learning tasks, such as image classification and segmentation. Consequently, there is a pressing need to provide a solid mathematical framework to analyze various aspects of deep neural networks. The recent line of work linking dynamical systems, optimal control and deep learning has suggested such a candidate e2017proposal (); li2017maximum (); li2018optimal (); haber2017stable (); chang2017multi (); chang2017reversible (); lu2017beyond (); sonoda2017double (); li2017deep (); chen2018neural (). In this view, ResNet he2016deep () can be regarded as a time-discretization of a continuous-time dynamical system. Learning (usually in the empirical risk minimization form) is then recast as an optimal control problem, from which novel algorithms li2017maximum (); li2018optimal () and network structures haber2017stable (); chang2017multi (); chang2017reversible (); lu2017beyond () can be designed. An attractive feature of this approach is that the compositional structure, which is widely considered the essence of deep neural networks, is explicitly taken into account in the time evolution of the dynamical system.
While most prior work on the dynamical systems viewpoint of deep learning has focused on algorithms and network structures, this paper aims to study the fundamental mathematical aspects of the formulation. Indeed, we show that the most general formulation of the population risk minimization problem can be regarded as a mean-field optimal control problem, in the sense that the optimal control parameters (or equivalently, the trainable weights) depend on the population distribution of input-target pairs. Our task is then to analyze the mathematical properties of this mean-field control problem. Mirroring the development of classical optimal control, we will proceed in two parallel but interconnected ways, namely the dynamic programming formalism and the maximum principle formalism.
The paper is organized as follows. We discuss related work in Sec. 2 and introduce the basic mean-field optimal control formulation of deep learning in Sec. 3. In Sec. 4, following the classical dynamic programming approach bellman2013dynamic (), we introduce and study the properties of a value function for the mean-field control problem whose state space is an appropriate Wasserstein space of probability measures. By defining an appropriate notion of derivative with respect to probability measures, we show that the value function is related to solutions of an infinite-dimensional Hamilton-Jacobi-Bellman (HJB) partial differential equation. With the concept of viscosity solutions crandall1983viscosity (), we show in Sec. 5 that the HJB equation admits a unique viscosity solution, which completely characterizes the optimal loss function and the optimal control policy of the mean-field control problem. This establishes a concrete link between the learning problem viewed as a variational problem and the Hamilton-Jacobi-Bellman equation that is associated with the variational problem. It should be noted that the essential ideas in the proofs in Sec. 4 and 5 are not new, but we present a simplified treatment for this particular setting.
Next, in Sec. 6, we develop the more local theory based on Pontryagin’s maximum principle (PMP) pontryagin1987mathematical (). We state and prove a mean-field version of the classical PMP that provides necessary conditions for optimal controls. Further, we study situations in which the mean-field PMP admits a unique solution, which then implies that it is also sufficient for optimality, provided an optimal solution exists. We will see in Sec. 7 that, compared with the HJB approach, this further requires that the time horizon of the learning problem be small enough. Finally, in Sec. 8 we study the relationship between the population risk minimization problem (cast as a mean-field control problem and characterized by a mean-field PMP) and its empirical risk minimization counterpart (cast as a classical control problem and characterized by a classical, sampled PMP). We prove that, under appropriate conditions, for every stable solution of the mean-field PMP, with high probability there exist close-by solutions of the sampled PMP, and the latter converge in probability to the former, with explicit error estimates on both the distance between the solutions and the distance between their loss function values. This provides a type of a priori error estimate that has implications for the generalization ability of neural networks, which is an important and active area of machine learning research.
Note that it is not the purpose of this paper to prove the sharpest estimates under the most general conditions. We have therefore taken the most convenient but reasonable assumptions, and the results presented could be sharpened with more technical details. In each section from Sec. 4 to Sec. 8, we first present the mathematical results, and then discuss the related implications in deep learning. Furthermore, in this work we shall focus our analysis on the continuous idealization of deep residual networks, but we believe that much of the analysis presented also carries over to the discrete domain (i.e. discrete layers).
2 Related work
The connection between back-propagation and optimal control of dynamical systems has been known since the early works on control and deep learning bryson1975applied (); athans2013optimal (); le1988theoretical (). Recently, the dynamical systems approach to deep learning was proposed in e2017proposal () and explored in the direction of training algorithms based on the PMP and the method of successive approximations li2017maximum (); li2018optimal (). In another vein, there are also studies on the continuum limit of neural networks sonoda2017double (); li2017deep () and on designing network architectures for deep learning haber2017stable (); chang2017multi (); chang2017reversible (); lu2017beyond () based on dynamical systems and differential equations. Instead of analyzing algorithms or architectures, the present paper focuses on the mathematical aspects of the control formulation itself, and develops a mean-field theory that characterizes the optimality conditions and value functions using both PDE (HJB) and ODE (PMP) approaches. The over-arching goal is to develop the mathematical foundations of the optimal control formulation of deep learning.
In the control theory literature, mean-field optimal control is an active area of research. Many works on mean-field games lasry2007mean (); huang2006large (); gueant2011mean (); bensoussan2013mean (), the control of McKean-Vlasov systems lauriere2014dynamic (); pham2017dynamic (); pham2018bellman (), and the control of Cucker-Smale systems caponigro2015sparse (); fornasier2014mean (); bongini2017mean () focus on deriving the limiting partial differential equations that characterize the optimal control as the number of agents goes to infinity. This is akin to the theory of the propagation of chaos sznitman1991topics (). Meanwhile, there are also works discussing the stochastic maximum principle for stochastic differential equations of mean-field type andersson2011maximum (); buckdahn2011general (); carmona2015forward (). The present paper differs from all previous works in two aspects. First, in the context of continuous-time deep learning, the problem differs from these previous control formulations in that the source of randomness is the coupled input-target pairs (the latter determines the terminal loss function, which can now be regarded as a random function). On the other hand, a simplifying feature in our case is that the dynamics, given the input-target pair, are otherwise deterministic. Second, the dynamics of each random realization are independent of the distribution law of the population, and are coupled only through the shared control parameters. This is to be contrasted with optimal control of McKean-Vlasov dynamics carmona2015forward (); pham2017dynamic (); pham2018bellman () or mean-field games lasry2007mean (); huang2006large (); gueant2011mean (); bensoussan2013mean (), where the population law directly enters the dynamical equations (and not just through the shared control). Thus, in this sense our dynamical equations are much simpler to analyze.
Consequently, although some of our results can be deduced from more general mean-field analysis in the control literature, here we will present simplified derivations tailored to our setting. Note also that there are neural network structures (e.g. batch-normalization) that can be considered to have explicit mean-field dynamics, and we defer this discussion to Sec. 9.
3 From ResNets to mean-field optimal control
Let us now present the optimal control formulation of deep learning as introduced in e2017proposal (); li2017maximum (); li2018optimal (). In the simplest form, the feed-forward propagation in a $T$-layer residual network can be represented by the difference equations

$$x_{t+1} = x_t + f(x_t, \theta_t), \qquad t = 0, \dots, T-1, \tag{1}$$

where $x_0$ is the input (image, time-series, etc.) and $x_T$ is the final output. The final output is then compared with some target $y_0$ corresponding to $x_0$ via some loss function. The goal of learning is to tune the trainable parameters $\theta_0, \dots, \theta_{T-1}$ such that $x_T$ is close to $y_0$. The only change in the continuous-time idealization of deep residual learning, which we will subsequently focus on, is that instead of the difference equation (1), the forward dynamics is now a differential equation. Let us now introduce this formulation more precisely.
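The link between the residual update and a differential equation can be sketched numerically. The snippet below is a minimal illustration, not the construction used in this paper: it assumes a hypothetical tanh layer for the feed-forward map and introduces a step size `h` (not present in the difference equation above, which corresponds to `h = 1`) so that the layer-by-layer update reads as a forward Euler step.

```python
import numpy as np

def f(x, theta):
    """Hypothetical layer map (an assumption): tanh unit with weights W and bias b."""
    W, b = theta
    return np.tanh(W @ x + b)

def resnet_forward(x0, thetas, h=1.0):
    """Apply the residual update x <- x + h * f(x, theta) once per layer."""
    x = np.asarray(x0, dtype=float)
    for theta in thetas:
        x = x + h * f(x, theta)
    return x

def euler_flow_linear(a, x0, T, num_layers):
    """Scalar check: residual layers x <- x + h*a*x approximate x' = a*x on [0, T]."""
    h = T / num_layers
    x = float(x0)
    for _ in range(num_layers):
        x = x + h * a * x
    return x

if __name__ == "__main__":
    # With many thin layers the residual network tracks the ODE solution exp(a*T).
    print(euler_flow_linear(-1.0, 1.0, 1.0, 1000), np.exp(-1.0))
```

In this reading, the number of layers times the step size plays the role of the time horizon, and refining the discretization at fixed horizon recovers the continuous-time dynamics.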
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a fixed and sufficiently rich probability space so that all subsequently required random variables can be constructed. Suppose $x_0 \in \mathbb{R}^d$ and $y_0 \in \mathbb{R}^l$ are random variables jointly distributed according to $\mu_0 := \mathbb{P}_{(x_0, y_0)}$ (hereafter, for each random variable $X$ we denote its distribution or law by $\mathbb{P}_X$). This represents the distribution of the input-target pairs, which we assume can be embedded in Euclidean spaces. Consider a set of admissible controls or training weights $\Theta \subseteq \mathbb{R}^m$. In typical deep learning, $\Theta$ is taken as the whole space $\mathbb{R}^m$, but here we consider the more general case where $\Theta$ can be constrained. Fix $T > 0$ (network “depth”) and let $f$ (feed-forward dynamics), $\Phi$ (terminal loss function) and $L$ (regularizer) be functions

$$f : \mathbb{R}^d \times \Theta \to \mathbb{R}^d, \qquad \Phi : \mathbb{R}^d \times \mathbb{R}^l \to \mathbb{R}, \qquad L : \mathbb{R}^d \times \Theta \to \mathbb{R}.$$

We define the state dynamics as the ordinary differential equation (ODE)

$$\dot{x}_t = f(x_t, \theta_t), \tag{2}$$

with initial condition equal to the random variable $x_0$. Thus, this is a stochastic ODE whose only source of randomness is the initial condition. Consider the set of essentially bounded measurable controls $L^\infty([0,T], \Theta)$. To improve clarity, we will reserve bold-faced letters for path-space quantities. For example, $\boldsymbol{\theta} \equiv \{\theta_t : 0 \le t \le T\}$. In contrast, variables/functions taking values in finite-dimensional Euclidean spaces are not bold-faced.
The population risk minimization problem in deep learning can hence be posed as the following mean-field optimal control problem

$$\inf_{\boldsymbol{\theta} \in L^\infty([0,T],\Theta)} J(\boldsymbol{\theta}) := \mathbb{E}_{\mu_0}\left[ \Phi(x_T, y_0) + \int_0^T L(x_t, \theta_t)\, dt \right], \qquad \text{subject to (2)}. \tag{3}$$

The term “mean-field” highlights the fact that $\boldsymbol{\theta}$ is shared by a whole population of input-target pairs, and the optimal control must depend on the law of the input-target random variables. Strictly speaking, the law of the state $x_t$ does not enter the forward equations explicitly (unlike e.g., McKean-Vlasov control carmona2015forward ()), and hence our forward dynamics are not explicitly in mean-field form. Nevertheless, we will use the term “mean-field” to emphasize the dependence of the control on the population distribution.
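In contrast, if we were to perform empirical risk minimization, as is often the case in practice (and is the case analyzed by previous work on algorithms li2017maximum (); li2018optimal ()), we would first draw i.i.d. samples $\{x_0^i, y_0^i\}_{i=1}^N \sim \mu_0$ and pose the sampled optimal control problem

$$\inf_{\boldsymbol{\theta} \in L^\infty([0,T],\Theta)} J_N(\boldsymbol{\theta}) := \frac{1}{N} \sum_{i=1}^N \left[ \Phi(x_T^i, y_0^i) + \int_0^T L(x_t^i, \theta_t)\, dt \right], \qquad \text{subject to } \dot{x}_t^i = f(x_t^i, \theta_t), \quad i = 1, \dots, N. \tag{4}$$

Thus, the solutions of sampled optimal control problems are typically random variables. We now focus our analysis on the mean-field problem (3) and only later in Sec. 8 relate it with the sampled problem (4).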
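To make the sampled problem concrete, the following sketch evaluates an empirical objective for a given control by averaging the terminal loss and the accumulated regularization cost over a batch of input-target pairs. It is only an illustration under stated assumptions: the dynamics, terminal loss and regularizer (`f`, `Phi`, `L`) are hypothetical choices, the ODE is integrated with forward Euler, and the time integral is replaced by a left Riemann sum. The key structural point is that a single sequence of controls is shared by every sample.

```python
import numpy as np

def sampled_risk(x0, y0, thetas, T, f, Phi, L):
    """Monte Carlo estimate of the objective for one shared control sequence.

    x0, y0 : arrays of shape (N, d) holding N input-target pairs
    thetas : sequence of K controls, one per time step, shared by all samples
    """
    h = T / len(thetas)
    x = np.array(x0, dtype=float)
    running = np.zeros(x.shape[0])
    for theta in thetas:
        running += h * L(x, theta)   # left Riemann sum of the running cost
        x = x + h * f(x, theta)      # forward Euler step, same theta for every sample
    return float(np.mean(Phi(x, y0) + running))

# Hypothetical model choices (assumptions for illustration only).
def f(x, theta):
    return np.tanh(x * theta)                      # theta has shape (d,)

def Phi(x, y):
    return np.sum((x - y) ** 2, axis=1)            # squared-error terminal loss

def L(x, theta):
    return 0.5 * np.sum(theta ** 2) * np.ones(x.shape[0])  # weight-decay regularizer

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0, y0 = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
    thetas = [0.1 * np.ones(3)] * 5
    print(sampled_risk(x0, y0, thetas, 1.0, f, Phi, L))
```

Because the samples are random, the value returned (and hence any minimizer) changes from one draw of the batch to the next, which is the sense in which solutions of the sampled problem are random variables.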
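Throughout this paper, we always use $w$ to denote the concatenated $(d+l)$-dimensional variable $(x, y)$ where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^l$. Correspondingly $\bar{f}(w, \theta) := (f(x, \theta), 0)$ is the extended $(d+l)$-dimensional feed-forward function, $\bar{L}(w, \theta) := L(x, \theta)$ is the extended $(d+l)$-dimensional regularization loss, and $\bar{\Phi}(w) := \Phi(x, y)$ still denotes the terminal loss function. We denote by $x \cdot y$ the inner product of two Euclidean vectors $x$ and $y$ with the same dimension. The Euclidean norm is denoted by $\|\cdot\|$ and the absolute value is denoted by $|\cdot|$. Gradient operators on Euclidean spaces are denoted by $\nabla$ with subscripts indicating the variable with respect to which the derivative is taken. In contrast, we use $D$ to represent the Fréchet derivative on Banach spaces. Namely, if $x \in U$ and $F : U \to V$ is a mapping between two Banach spaces $(U, \|\cdot\|_U)$ and $(V, \|\cdot\|_V)$, then $DF(x)$ is defined by the linear operator $DF(x) : U \to V$ s.t.

$$r(x, y) := \frac{\| F(x+y) - F(x) - DF(x)\, y \|_V}{\| y \|_U} \to 0, \qquad \text{as } \|y\|_U \to 0.$$

For a matrix $A$, we use the symbol $A \preceq 0$ to mean that $A$ is negative semi-definite.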
Let the Banach space be the set of essentially bounded measurable functions from to , where is a subset of a Euclidean space with the usual Lebesgue measure. The norm is , and we shall write for brevity in place of . In this paper, is often either or , and the path-space variables we consider in this paper, such as the controls , will mostly be defined in this space.
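As this paper introduces a mean-field optimal control approach, we also need some notation for the random variables and their distributions. We use the shorthand $L^2(\mathcal{F}; \mathbb{R}^{d+l})$ for $L^2((\Omega, \mathcal{F}, \mathbb{P}); \mathbb{R}^{d+l})$, the set of $\mathbb{R}^{d+l}$-valued square integrable random variables. We equip this Hilbert space with the norm $\|X\|_{L^2} := (\mathbb{E}\|X\|^2)^{1/2}$ for $X \in L^2(\mathcal{F}; \mathbb{R}^{d+l})$. We denote by $\mathcal{P}_2(\mathbb{R}^{d+l})$ the set of square integrable probability measures on the Euclidean space $\mathbb{R}^{d+l}$. Note that $X \in L^2(\mathcal{F}; \mathbb{R}^{d+l})$ if and only if $\mathbb{P}_X \in \mathcal{P}_2(\mathbb{R}^{d+l})$. The space $\mathcal{P}_2(\mathbb{R}^{d+l})$ is regarded as a metric space equipped with the 2-Wasserstein distance

$$W_2(\mu, \nu) := \inf\left\{ \left( \int \|w - z\|^2\, \pi(dw, dz) \right)^{1/2} : \pi \text{ has marginals } \mu \text{ and } \nu \right\} = \inf\left\{ \|X - Y\|_{L^2} : X, Y \in L^2(\mathcal{F}; \mathbb{R}^{d+l}),\ \mathbb{P}_X = \mu,\ \mathbb{P}_Y = \nu \right\}.$$

For $\mu \in \mathcal{P}_2(\mathbb{R}^{d+l})$, we also define $\|\mu\|_{L^2} := \left( \int \|w\|^2\, \mu(dw) \right)^{1/2}$.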
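As a small numerical aside on the 2-Wasserstein distance, the sketch below computes it between two empirical measures on the real line with the same number of equally weighted atoms. This is a deliberately restricted special case (one dimension, equal sample sizes), chosen because there the optimal coupling is known to match sorted samples; it is not a general-purpose solver for measures on higher-dimensional Euclidean spaces. The function name is hypothetical.

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """2-Wasserstein distance between two 1-D empirical measures of equal size.

    In one dimension with equally weighted atoms, pairing the sorted samples
    gives an optimal coupling, so the distance is the root mean squared gap
    between order statistics.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.ndim != 1 or xs.shape != ys.shape:
        raise ValueError("expected two 1-D arrays of the same length")
    return float(np.sqrt(np.mean((xs - ys) ** 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(size=100)
    # Translating a measure by a constant c moves it by |c| in this distance.
    print(w2_empirical_1d(samples, samples + 3.0))
```

The translation example mirrors the random-variable form of the definition: the coupling that pairs each point with its shifted copy has mean squared displacement equal to the square of the shift.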
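Given a measurable function $\psi : \mathbb{R}^{d+l} \to \mathbb{R}^q$ that is square integrable with respect to $\mu$, we use the notation

$$\langle \psi(\cdot), \mu \rangle := \int_{\mathbb{R}^{d+l}} \psi(w)\, \mu(dw).$$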
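Now, we introduce some notation for the dynamical evolution of probabilities. Given $\xi \in L^2(\mathcal{F}; \mathbb{R}^{d+l})$ and a control process $\boldsymbol{\theta} \in L^\infty([0,T], \Theta)$, we consider the following dynamical system for $t \le s \le T$:

$$W_s^{t,\xi,\boldsymbol{\theta}} = \xi + \int_t^s \bar{f}(W_r^{t,\xi,\boldsymbol{\theta}}, \theta_r)\, dr.$$

Note that $W_s^{t,\xi,\boldsymbol{\theta}}$ is always square integrable given that $\bar{f}(w, \theta)$ is Lipschitz continuous with respect to $w$. Let $\mu = \mathbb{P}_\xi \in \mathcal{P}_2(\mathbb{R}^{d+l})$; we denote the law of $W_s^{t,\xi,\boldsymbol{\theta}}$ for simplicity by

$$\mathbb{P}_s^{t,\mu,\boldsymbol{\theta}} := \mathbb{P}_{W_s^{t,\xi,\boldsymbol{\theta}}}.$$

This is valid since the law of $W_s^{t,\xi,\boldsymbol{\theta}}$ should only depend on the law of $\xi$ and not on the random variable itself. This notation also allows us to write down the flow or semi-group property of the dynamical system as

$$\mathbb{P}_s^{t,\mu,\boldsymbol{\theta}} = \mathbb{P}_s^{\hat{t},\, \mathbb{P}_{\hat{t}}^{t,\mu,\boldsymbol{\theta}},\, \boldsymbol{\theta}} \tag{6}$$

for all $0 \le t \le \hat{t} \le s \le T$.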
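The flow property can be checked on a particle approximation: pushing a cloud of samples from the start time to the end time in one pass should agree with stopping at an intermediate time and restarting from the intermediate cloud. The sketch below does this with forward Euler steps on a time grid that contains the intermediate time. The extended dynamics `f_bar` is a hypothetical example in which the input component moves while the target component is carried along unchanged.

```python
import numpy as np

def f_bar(w, theta):
    """Hypothetical extended dynamics on w = (x, y): x evolves, y stays fixed."""
    x, y = w[:, :1], w[:, 1:]
    return np.hstack([np.tanh(theta * x), np.zeros_like(y)])

def flow(w, thetas, h, dynamics):
    """Push particles w of shape (N, dim) through one Euler step per control."""
    w = np.array(w, dtype=float)
    for theta in thetas:
        w = w + h * dynamics(w, theta)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    particles = rng.normal(size=(5, 2))          # samples standing in for the initial law
    thetas = np.linspace(-1.0, 1.0, 10)          # one control value per step
    h = 0.1
    direct = flow(particles, thetas, h, f_bar)
    staged = flow(flow(particles, thetas[:4], h, f_bar), thetas[4:], h, f_bar)
    print(np.allclose(direct, staged))
```

The empirical measure of the particles plays the role of the law being transported; the second component of each particle never changes, reflecting that the target is only carried along by the extended dynamics.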
Finally, throughout the results and proofs, we will use or with subscripts as names for generic constants, whose values may change from line to line when there is no need for them to be distinct. In general, these constants may implicitly depend on and the ambient dimensions , but for brevity we omit them in the rest of the paper.
4 Mean-field dynamic programming principle and HJB equation
We begin our analysis of (3) by employing the dynamic programming principle and the Hamilton-Jacobi-Bellman formalism. In this approach, the key idea is to define a value function that corresponds to the optimal loss of the control problem (3), but under a general starting time and starting state. One can then derive a partial differential equation (Hamilton-Jacobi-Bellman equation, or HJB equation) to be satisfied by such a value function, which characterizes both the optimal loss function value and the optimal control policy of the original control problem. Compared to the classical optimal control case corresponding to empirical risk minimization in learning, here the value function’s state argument is no longer a finite-dimensional vector, but an infinite-dimensional object corresponding to the joint distribution of the input-target pair. We shall interpret it as an element of a suitable Wasserstein space. The detailed mathematical definition of this value function and its basic properties are discussed in Subsec. 4.1.
In the finite-dimensional case, the HJB equation is a classical partial differential equation. In contrast, since the state variables we are dealing with are probability measures rather than Euclidean vectors, we need a concept of derivative with respect to a probability measure, as introduced by Lions in his course at Collège de France lions2012cours (). We give a brief introduction of this concept in Subsec. 4.2 and refer readers to the lecture notes cardaliaguet2010notes () for more details. We then present the resulting infinite-dimensional HJB equation in Subsec. 4.3.
Throughout this section and the next section (Sec. 5), we assume that

$f$, $L$, $\Phi$ are bounded; $f$, $L$, $\Phi$ are Lipschitz continuous with respect to $x$, and the Lipschitz constants of $f$ and $L$ are independent of $\theta$.
4.1 Value function and its properties
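Adopting the viewpoint of taking probability measures $\mu \in \mathcal{P}_2(\mathbb{R}^{d+l})$ as state variables, we can define a time-dependent objective functional

$$\begin{aligned} J(t, \mu, \boldsymbol{\theta}) &:= \mathbb{E}_{(x_t, y_0) \sim \mu}\left[ \Phi(x_T, y_0) + \int_t^T L(x_s, \theta_s)\, ds \right] \quad (\text{subject to (2)}) \\ &= \langle \bar{\Phi}(\cdot), \mathbb{P}_T^{t,\mu,\boldsymbol{\theta}} \rangle + \int_t^T \langle \bar{L}(\cdot, \theta_s), \mathbb{P}_s^{t,\mu,\boldsymbol{\theta}} \rangle\, ds. \end{aligned} \tag{7}$$

The second line in the above is just a rewriting of the first line based on the notation introduced earlier. Here, we abuse the notation $J$ in (3) for the new objective functional, which now has additional arguments $t, \mu$. Of course, $J(\boldsymbol{\theta})$ in (3) corresponds to $J(0, \mu_0, \boldsymbol{\theta})$ in (7).
The value function $v^*(t, \mu)$ is defined as a real-valued function on $[0,T] \times \mathcal{P}_2(\mathbb{R}^{d+l})$ through

$$v^*(t, \mu) = \inf_{\boldsymbol{\theta} \in L^\infty([0,T],\Theta)} J(t, \mu, \boldsymbol{\theta}). \tag{8}$$

If we assume $\boldsymbol{\theta}^*$ attains the infimum in (3), then by definition

$$J(\boldsymbol{\theta}^*) = v^*(0, \mu_0).$$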
The following proposition shows the continuity of the value function.
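The function $(t, \mu) \mapsto J(t, \mu, \boldsymbol{\theta})$ is Lipschitz continuous on $[0,T] \times \mathcal{P}_2(\mathbb{R}^{d+l})$, uniformly with respect to $\boldsymbol{\theta} \in L^\infty([0,T], \Theta)$, and the value function $v^*(t, \mu)$ is Lipschitz continuous on $[0,T] \times \mathcal{P}_2(\mathbb{R}^{d+l})$.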
We first establish some elementary estimates based on the assumptions. We suppose
Let such that , the Lipschitz continuity of gives us
Note that in the preceding inequality the left side does not depend on the choice of while the right side does. Hence we can take the infimum over all the joint choices of to get
The same argument applied to gives us
For the deterministic ODE
define the induced flow map as
Using Gronwall’s inequality with the boundedness and Lipschitz continuity of , we know
Therefore we use the definition of Wasserstein distance to obtain
which gives us the desired Lipschitz continuity property.
Finally, combining the fact that
and is Lipschitz continuous at , uniformly with respect to , we deduce that the value function is Lipschitz continuous on . ∎
The important observation we now make is that the value function satisfies a recursive relation. This is known as the dynamic programming principle, which forms the basis of deriving the Hamilton-Jacobi-Bellman equation. Intuitively, the dynamic programming principle states that for any optimal trajectory, starting from any intermediate state in the trajectory, the remaining trajectory must again be optimal, starting from that time and state. We now state and prove this intuitive statement precisely.
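(Dynamic programming principle) For all $0 \le t \le \hat{t} \le T$, $\mu \in \mathcal{P}_2(\mathbb{R}^{d+l})$, we have

$$v^*(t, \mu) = \inf_{\boldsymbol{\theta} \in L^\infty([0,T],\Theta)} \left[ \int_t^{\hat{t}} \langle \bar{L}(\cdot, \theta_s), \mathbb{P}_s^{t,\mu,\boldsymbol{\theta}} \rangle\, ds + v^*(\hat{t}, \mathbb{P}_{\hat{t}}^{t,\mu,\boldsymbol{\theta}}) \right]. \tag{15}$$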
The proof is elementary, as in the case of deterministic control problems. We provide it below for completeness.
1). Given fixed and any , we consider the probability measure . Fix and by definition of value function (8) we can pick satisfying
Now consider the control process defined as
As and are both arbitrary, we have
2). Fix again and we choose by definition such that
Using the flow property (6) and the definition of the value function again gives us the estimate
Hence we deduce
Combining the inequalities in the two parts completes the proof. ∎
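The recursive structure behind the dynamic programming principle is easiest to see in a finite, discrete-time analogue. The toy problem below is an assumption made purely for illustration (integer states, three admissible controls, quadratic costs) and is not the mean-field problem studied here. It computes the optimal cost in two ways, by backward recursion on the cost-to-go and by brute-force enumeration of all control sequences, and the two agree.

```python
import itertools

CONTROLS = (-1, 0, 1)   # hypothetical finite control set

def step(x, u, lo=-3, hi=3):
    """Deterministic dynamics on a bounded integer grid."""
    return max(lo, min(hi, x + u))

def running_cost(x, u):
    return 0.5 * u * u

def terminal_cost(x, target=2):
    return (x - target) ** 2

def value_by_recursion(x, k, K, cache=None):
    """Cost-to-go via v_k(x) = min_u [running cost + v_{k+1}(next state)], v_K = terminal cost."""
    if cache is None:
        cache = {}
    if k == K:
        return terminal_cost(x)
    if (x, k) not in cache:
        cache[(x, k)] = min(
            running_cost(x, u) + value_by_recursion(step(x, u), k + 1, K, cache)
            for u in CONTROLS
        )
    return cache[(x, k)]

def value_by_enumeration(x, k, K):
    """Cost-to-go by minimizing the total cost over every control sequence."""
    best = float("inf")
    for controls in itertools.product(CONTROLS, repeat=K - k):
        cost, state = 0.0, x
        for u in controls:
            cost += running_cost(state, u)
            state = step(state, u)
        best = min(best, cost + terminal_cost(state))
    return best

if __name__ == "__main__":
    for x in range(-3, 4):
        print(x, value_by_recursion(x, 0, 5), value_by_enumeration(x, 0, 5))
```

The recursion only ever looks one step ahead and reuses the cost-to-go from the next time level, which is the discrete counterpart of splitting the horizon at an intermediate time.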
4.2 Derivative and Chain Rule in Wasserstein Space
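In classical finite-dimensional optimal control, the HJB equation can be formally derived from the dynamic programming principle by a Taylor expansion of the value function with respect to the state vector. However, in the current formulation, the state is now a probability measure. To derive the corresponding HJB equation in this setting, it is essential to define a notion of derivative of the value function with respect to a probability measure. The basic idea to achieve this is to take probability measures on $\mathbb{R}^{d+l}$ as laws of $\mathbb{R}^{d+l}$-valued random variables on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and then use the corresponding Banach space of random variables to define derivatives. This approach is more extensively outlined in cardaliaguet2010notes ().
Concretely, let us take any function $u : \mathcal{P}_2(\mathbb{R}^{d+l}) \to \mathbb{R}$. We now lift it into its “extension” $U$, a function defined on $L^2(\mathcal{F}; \mathbb{R}^{d+l})$ by

$$U[X] = u(\mathbb{P}_X), \qquad \forall X \in L^2(\mathcal{F}; \mathbb{R}^{d+l}).$$

We say $u$ is $C^1(\mathcal{P}_2(\mathbb{R}^{d+l}))$ if the lifted function $U$ is Fréchet differentiable with continuous derivatives. Since we can identify $L^2(\mathcal{F}; \mathbb{R}^{d+l})$ with its dual space, if the Fréchet derivative $DU(X)$ exists, by Riesz’ theorem one can view it as an element of $L^2(\mathcal{F}; \mathbb{R}^{d+l})$:

$$DU(X)(Y) = \mathbb{E}[DU(X) \cdot Y], \qquad \forall Y \in L^2(\mathcal{F}; \mathbb{R}^{d+l}).$$

The important result one can prove is that the law of $DU(X)$ does not depend on $X$ but only on the law of $X$. Accordingly we have the representation

$$DU(X) = \partial_\mu u(\mathbb{P}_X)(X),$$

for some function $\partial_\mu u(\mathbb{P}_X) : \mathbb{R}^{d+l} \to \mathbb{R}^{d+l}$, which is called the derivative of $u$ at $\mu = \mathbb{P}_X$. Moreover, we know $\partial_\mu u(\mu)$ is square integrable with respect to $\mu$.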
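As a worked illustration of this definition (a standard example rather than a result of this paper), consider a linear functional of the measure. The symbols below are assumptions introduced for the example: $\psi$ is a continuously differentiable real-valued function on the underlying Euclidean space whose gradient is Lipschitz with constant $K$, $u$ denotes the functional on measures, $U$ its lift to square integrable random variables, and $\partial_\mu u$ the derivative with respect to the measure.

```latex
\begin{aligned}
u(\mu) &= \int \psi(w)\,\mu(dw),
\qquad
U[X] = u(\mathbb{P}_X) = \mathbb{E}\,[\psi(X)], \\[4pt]
\bigl|\,U[X+Y] - U[X] - \mathbb{E}\,[\nabla\psi(X)\cdot Y]\,\bigr|
&\le \mathbb{E}\,\bigl|\psi(X+Y) - \psi(X) - \nabla\psi(X)\cdot Y\bigr|
\le \tfrac{K}{2}\,\mathbb{E}\,\|Y\|^2
= \tfrac{K}{2}\,\|Y\|_{L^2}^2, \\[4pt]
\text{hence}\quad DU(X) &= \nabla\psi(X),
\qquad
\partial_\mu u(\mu)(w) = \nabla\psi(w).
\end{aligned}
```

In this case the derivative with respect to the measure does not depend on the measure at all, and it reduces to the ordinary gradient of the integrand. The remainder is quadratic in the norm of the perturbation, which is what Fréchet differentiability of the lift requires.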
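We next need a chain rule defined on $\mathcal{P}_2(\mathbb{R}^{d+l})$. Consider the dynamical system

$$W_t = \xi + \int_0^t \bar{f}(W_s, \theta_s)\, ds, \qquad \xi \in L^2(\mathcal{F}; \mathbb{R}^{d+l}),$$

and $u \in C^1(\mathcal{P}_2(\mathbb{R}^{d+l}))$. Then, for all $t \in [0,T]$, we have

$$u(\mathbb{P}_{W_t}) = u(\mathbb{P}_{W_0}) + \int_0^t \langle \partial_\mu u(\mathbb{P}_{W_s})(\cdot) \cdot \bar{f}(\cdot, \theta_s), \mathbb{P}_{W_s} \rangle\, ds, \tag{18}$$

or equivalently its lifted version

$$U[W_t] = U[W_0] + \int_0^t \mathbb{E}\left[ DU(W_s) \cdot \bar{f}(W_s, \theta_s) \right] ds.$$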
4.3 HJB equation in Wasserstein Space
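Guided by the dynamic programming principle (15) and formula (18), we are ready to formally derive the associated HJB equation as follows. Let $\hat{t} = t + \delta t$ with $\delta t$ being small. By performing a formal Taylor series expansion of (15), we have

$$\begin{aligned} 0 &= \inf_{\boldsymbol{\theta} \in L^\infty([0,T],\Theta)} \left[ v^*(t + \delta t, \mathbb{P}_{t+\delta t}^{t,\mu,\boldsymbol{\theta}}) - v^*(t, \mu) + \int_t^{t+\delta t} \langle \bar{L}(\cdot, \theta_s), \mathbb{P}_s^{t,\mu,\boldsymbol{\theta}} \rangle\, ds \right] \\ &\approx \delta t \inf_{\theta \in \Theta} \left[ \partial_t v^*(t, \mu) + \langle \partial_\mu v^*(t, \mu)(\cdot) \cdot \bar{f}(\cdot, \theta) + \bar{L}(\cdot, \theta), \mu \rangle \right]. \end{aligned}$$

Passing to the limit $\delta t \to 0$, we obtain the following HJB equation

$$\begin{cases} \dfrac{\partial v}{\partial t} + \inf_{\theta \in \Theta} \langle \partial_\mu v(t, \mu)(\cdot) \cdot \bar{f}(\cdot, \theta) + \bar{L}(\cdot, \theta), \mu \rangle = 0, & \text{on } [0,T) \times \mathcal{P}_2(\mathbb{R}^{d+l}), \\[6pt] v(T, \mu) = \langle \bar{\Phi}(\cdot), \mu \rangle, & \text{on } \mathcal{P}_2(\mathbb{R}^{d+l}), \end{cases} \tag{20}$$

which the value function should satisfy. The rest of this section and the next are devoted to establishing the precise link between equation (20) and the value function (8). We now prove a verification result, which essentially says that if we have a smooth enough solution of the HJB equation (20), then this solution must be the value function. Moreover, the HJB equation allows us to identify the optimal control policy.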
Given any control process , one can apply formula (18) between and with explicit dependence and obtain
Equivalently, we have
where the first inequality comes from the infimum condition in (20). Since the control process is arbitrary, we have
Replacing the arbitrary control process with where is given by the optimal feedback control and repeating the above argument, noting that the inequality becomes equality since the infimum is attained, we have
Therefore we obtain and defines an optimal feedback control policy. ∎
Prop. 3 is an important statement that links smooth solutions of the HJB equation with solutions of the mean-field optimal control problem, and hence the population minimization problem in deep learning. Furthermore, by taking the infimum in (20), it allows us to identify an optimal control policy . This is in general a stronger characterization of the solution of the learning problem. In particular, it is of feedback, or closed-loop form. On the other hand, an open-loop solution can be obtained from the closed-loop control policy by sequentially setting , where is the solution of the feed-forward ODE with