Hopfield Neural Network Flow: A Geometric Viewpoint
We provide gradient flow interpretations for the continuous-time continuous-state Hopfield neural network (HNN). The ordinary and stochastic differential equations associated with the HNN were introduced in the literature as analog optimizers, and were reported to exhibit good performance in numerical experiments. In this work, we point out that the deterministic HNN can be transcribed into Amari’s natural gradient descent, and thereby uncover the explicit relation between the underlying Riemannian metric and the activation functions. By exploiting an equivalence between the natural gradient descent and the mirror descent, we show how the choice of activation function governs the geometry of the HNN dynamics.
For the stochastic HNN, we show that the so-called “diffusion machine”, while not a gradient flow itself, induces a gradient flow when lifted in the space of probability measures. We characterize this infinite dimensional flow as the gradient descent of certain free energy with respect to a Wasserstein metric that depends on the geodesic distance on the ground manifold. Furthermore, we demonstrate how this gradient flow interpretation can be used for fast computation via recently developed proximal algorithms.
Keywords: Natural gradient descent, mirror descent, proximal operator, stochastic neural network, optimal mass transport, Wasserstein metric.
Computational models such as the Hopfield neural networks (HNN) [1, 2] have long been motivated as analog machines to solve optimization problems. Both deterministic and stochastic differential equations for the HNN have appeared in the literature, and the numerical experiments using these models have reported appealing performance [3, 4, 5, 6, 7]. In , introducing the stochastic HNN as the “diffusion machine”, Wong compared the same with other stochastic optimization algorithms, and commented: “As such it is closely related to both the Boltzmann machine and the Langevin algorithm, but superior to both in terms of possible integrated circuit realization”, and that “A substantial speed advantage for diffusion machines is likely”. Despite these promising features often backed up via numerical simulations, studies seeking basic understanding of the evolution equations remain sparse [9, 10]. The purpose of this paper is to revisit the differential equations for the HNN from a geometric standpoint.
Our main contribution is to clarify the interplay between the Riemannian geometry induced by the choice of HNN activation functions, and the associated variational interpretations for the HNN flow. Specifically, in the deterministic case, we argue that the HNN flow is natural gradient descent (Section II) of a function over a manifold with respect to (w.r.t.) a distance induced by certain metric tensor that depends on the activation functions of the HNN. This identification leads to an equivalent mirror descent interpretation (Section III). We work out an example and a case study (Section IV) to highlight these ideas.
In the stochastic HNN, the random fluctuations resulting from the noise do not allow the sample paths of the stochastic differential equation to be interpreted as gradient flow on . However, we argue (Section V) that the stochastic sample path evolution engenders a (deterministic) flow of probability measures supported on which does have an infinite dimensional gradient descent interpretation. In other words, the gradient flow interpretation for the stochastic HNN holds in the macroscopic or ensemble sense. This viewpoint seems novel in the context of stochastic HNN, and indeed leads to proximal algorithms enabling fast computation (Section V-C). A comparative summary of the gradient flow interpretations for the deterministic and the stochastic HNN is provided in Table I (after Section V-C).
The notation stands for the Euclidean gradient operator. We sometimes put a subscript as in to denote the gradient w.r.t. vector , and omit the subscript when its meaning is obvious from the context. As usual, acting on vector-valued function returns the Jacobian. The symbol is used for the Hessian. We use the superscript to denote matrix transposition, and to denote the expectation operator w.r.t. the probability density function (PDF) , i.e., . In the development that follows, we assume that all probability measures are absolutely continuous, i.e., the corresponding PDFs exist. The notation stands for the set of all joint PDFs supported on with finite second raw moments, i.e., . The inequality means that the matrix is symmetric positive definite. The symbol denotes an identity matrix of appropriate dimension; denotes a column vector of ones. For , the symbol denotes the tangent space of at ; we use to denote the tangent space at a generic point in . The set of natural numbers is denoted as . We use the symbols and for denoting element-wise (i.e., Hadamard) product and division, respectively.
Ii HNN Flow as Natural Gradient Descent
We consider the deterministic HNN dynamics given by
with initial condition . Here, the function is continuously differentiable, and therefore admits a minimum on . Without loss of generality, we assume to be non-negative on . The vectors are referred to as the state and hidden state, respectively. The so called “activation function” is assumed to satisfy the following properties:
differentiable almost everywhere,
i.e., its -th component depends only on the -th component of the argument,
for all are strictly increasing.
Examples of activation function include the logistic function, hyperbolic tangent function, among others.
To view the continuous-time dynamics (1) from a geometric standpoint, we start by rewriting it as
In words, the flow of is generated by a vector field that is negative of the Jacobian of evaluated at , times the gradient of w.r.t. . Due to (2), the Jacobian in (3) is a diagonal matrix. Furthermore, since each is strictly increasing and differentiable almost everywhere, the diagonal terms are positive. Therefore, the right-hand-side (RHS) of (3) has the following form: negative of a positive definite matrix times the gradient of . This leads to the result below.
Amari’s natural gradient descent  describes the steepest descent of on a Riemannian manifold with metric tensor , and is given by the recursion
where the discrete iteration index , and the step-size is small. For , we get the natural gradient dynamics
Notice that due to the assumptions listed on the activation function , its Jacobian appearing in (3) is a diagonal matrix with positive entries. Therefore, (3) is of the form (6) with , . In particular, for all . This completes the proof.
i.e., an orthogonal coordinate system can be associated with . Further insights can be obtained by interpreting as a Hessian manifold . To this end, we would like to view the positive definite metric tensor as the Hessian of a (twice differentiable) strictly convex function. Such an identification will allow us to associate a mirror descent [13, 14] in the dual manifold  corresponding to the natural gradient descent in manifold . Before delving into mirror descent, we next point out that (7) allows us to compute geodesics on the manifold , which completes the natural gradient descent interpretation.
The (minimal) geodesic curve parameterized via that connects , solves the well-known Euler-Lagrange equation (see e.g., [16, Ch. 3.2])
subject to the boundary conditions , . Here denote the Cristoffel symbols of the second kind for , and are given by
wherein denotes the -th element of the inverse metric tensor . For , and diagonal , we get
For our diagonal metric tensor (4), since the -th diagonal element of only depends on , the terms (10a)–(10c) are all zero, i.e., the only non-zero Cristoffel symbols in (10) are (10d). This simplifies (8) as a set of decoupled ODEs:
Once the geodesic is obtained from (12), the geodesic distance induced by the metric tensor , can be computed as
In view of the notation , we can define the Riemannian gradient , and rewrite the dynamics (6) as .
Iii HNN Flow as Mirror Descent
We now provide an alternative way to interpret the flow generated by (3) as mirror descent on a Hessian manifold that is dual to . For and , consider the mapping given by the map for some strictly convex and differentiable map , i.e.,
The map is referred to as the “mirror map”, formally defined below. In the following, the notation stands for the domain of .
(Mirror map) A differentiable, strictly convex function , on an open convex set , satisfying as the argument of approaches the boundary of the closure of , is called a mirror map.
Given a mirror map, one can introduce a Bregman divergence  associated with it as follows.
(Bregman divergence associated with a mirror map) For a mirror map , the associated Bregman divergence , is given by
Geometrically, is the error at due to first order Taylor approximation of about . It is easy to verify that iff . However, the Bregman divergence is not symmetric in general, i.e., .
In the sequel, we assume that the mirror map is twice differentiable. Then, being strictly convex, its Hessian is positive definite. This allows one to think of as a Hessian manifold with metric tensor . The mirror descent for function on a convex set , is a recursion of the form
where is the step-size, and the Bregman projection
Raskutti and Mukherjee  pointed out that if is twice differentiable, then the map given by , establishes a one-to-one correspondence between the natural gradient descent (5) and the mirror descent (15). Specifically, the mirror descent (15) on associated with the mirror map , is equivalent to the natural gradient descent (5) on with metric tensor , where
See for example, [22, Section 2.2].
for , where the notation means that the function does not depend on the -th component of the vector . Therefore, the map associated with the HNN flow is of the form (19).
where is defined in (19), and are constants. The constants define a parametrized family of maps . Combining (14) and (20) yields , and hence (due to (18b)). In Section IV, we illustrate these ideas on a concrete example.
for some to-be-determined mirror map , thereby interpreting (1) as mirror descent in itself, with as the dual variable, and as the primal variable. This interpretation, however, is contingent on the assumption that the monotone vector function is expressible as the gradient of a convex function (here, ), for which the necessary and sufficient condition is that the map be maximally cyclically monotone [21, Section 2, Corollary 1 and 2]. This indeed holds under the structural assumption (2).
In this Section, we first give a simple example (Section IV-A) to elucidate the ideas presented in Sections II and III. Then, we provide a case study (Section IV-B) to exemplify their application in the context of an engineering problem.
Iv-a An Illustrative Example
Consider the activation function given by the soft-projection operator
Iv-A1 Natural Gradient Descent Interpretation
Direct calculations yield
for all . Notice that implies , and hence the RHS of (23) is positive. We have
In this case, (12) reduces to the nonlinear ODE
which solved with the boundary conditions , , , determines the geodesic curve connecting . Introducing the change-of-variable , noting that , and enforcing , we can transcribe (25) into the separable ODE
where are constants of integration. Using the boundary conditions , in (28) yields the geodesic curve as
where , and . From (29), it is easy to verify that is component-wise in , and satisfies the boundary conditions , . From (13b) and (29), the geodesic distance associated with the activation function (22) is
where all vector operands such as square-root, arcsin, division (denoted by ), are element-wise. The contour plots in Fig. 2 show the geodesic balls of given by (30), centered at . We summarize: the HNN flow generated by (3) with activation function (22) is natural gradient descent of measured w.r.t. the geodesic distance (30).
Iv-A2 Mirror Descent Interpretation
To derive a mirror descent interpretation, integrating (24) we obtain
where , and as before, means that the function does not depend on the -th component of the vector . Equation (31) implies that
where are constants. Therefore, (31) can be re-written as
Let us consider a special case of (32), namely , that is particularly insightful. In this case,
can be interpreted as the weighted sum of logistic loss functions associated with the Bernoulli random variables. Furthermore, yields
which is the dual Logistic loss [22, Table 1], and the corresponding mirror map can be identified as
Thus, an alternative interpretation of the HNN dynamics with activation function (22) is as follows: it can be seen as mirror descent of the form (15) on in variables given by (33); a candidate mirror map is given by (38) with associated Bregman divergence (37).
Iv-B Case Study: Economic Load Dispatch Problem
We now apply the HNN gradient descent to solve an economic load dispatch problem in power system – a context in which HNN methods have been used before [7, 24, 25, 26]. Our objective, however, is to illustrate the geometric ideas presented herein.
Specifically, we consider generators which at time are either ON or OFF, denoted via state vector . For , if the th generator is ON (resp. OFF) at , then the th component of equals 1 (resp. 0). Let be the vector of the generators’ power output magnitudes at . A dispatcher simultaneously decides for the next time step , which of the generators should be ON or OFF () and the corresponding power output () in order to satisfy a known power demand while minimizing a convex cost of the form
Here, denotes the (known) marginal production cost of the generators; the second summand in (39) models the symmetric cost for switching the generator states from ON to OFF or from OFF to ON; the last summand in (39) models the generator ramp-up or ramp-down cost. The weights are known to the dispatcher. The optimization problem the dispatcher needs to solve (over single time step) reads
To solve the Mixed Integer Quadratically Constrained Quadratic Programming (MIQCQP) problem (40), we construct the augmented Lagrangian
where the parameter , and the dual variables . Then, the augmented Lagrange dual function
where is the step size for the dual updates. Specifically, for the iteration index , we perform the recursions
where (44a) is solved via Hopfield sub-iterations with step size . In , this algorithm was referred to as the dual Hopfield method, and was shown to have superior numerical performance compared to other standard approaches such as semidefinite relaxation.
For numerical simulation, we implement (44) with generators, uniform random vector , the pair uniform random in , , , and . The vectors and in (39) are chosen as follows: is chosen to be uniform random in and then element-wise rounded to the nearest integers; is chosen uniform random in and then element-wise multiplied with . For chosen uniform random in , we set . The activation functions are as in (22) with . The Hopfield sub-iterations were run until error tolerance was achieved with maximum number of sub-iterations being .
We fix the problem data generated as above, and perform 100 Monte Carlo simulations, each solving an instance of the dual Hopfield method for the same problem data with randomly chosen . Fig. 3 shows the convergence trends for the joint vector associated with the Monte Carlo simulations by plotting the distance between the current iterate and the converged vector over iteration index . In particular, Fig. 3 highlights how the geometry induced by the HNN governs the convergence trend: the geodesic distance to minimum (blue curves in Fig. 3), where is given by (30), decays monotonically, as expected in gradient descent. However, (red curves in Fig. 3) does not decay monotonically since the Euclidean distance does not encode the Riemannian geometry of the HNN recursion (5).
V Stochastic HNN Flow As Gradient Descent In the Space of Probability Measures
We next consider the stochastic HNN flow, also known as the “diffusion machine” [8, Section 4], given by the Itô stochastic differential equation (SDE)
where the state***We make the standard assumption that at . If the activation functions do not satisfy this, then a reflecting boundary is needed at each , to keep the sample paths within the unit hypercube; see e.g., . , the standard Wiener process , the notation stands for the vector of ones, and the parameter denotes the thermodynamic/annealing temperature. Recall that the metric tensor given by (4) is diagonal, and hence its inverse and square roots are diagonal too with respective diagonal elements being the inverse and square roots of the original diagonal elements.
The diffusion machine (45) was proposed in [8, 28] to solve global optimization problems on the unit hypercube, and is a stochastic version of the deterministic HNN flow discussed in Sections II–IV. This can be seen as a generalization of the Langevin model [29, 30] for the Boltzmann machine . Alternatively, one can think of the diffusion machine as the continuous-state continuous-time version of the HNN in which Gaussian white noise is injected at each node.
where the given initial joint PDF . The drift and diffusion coefficients in (45) are tailored so that admits the stationary joint PDF
i.e., a Gibbs PDF where is a normalizing constant (known as the “partition function”) to ensure . This follows from noting that for (45), we have†††We remind the readers that the raised indices denote the elements of the inverse of the metric tensor, i.e., .
and that for each . From (47), the critical points of coincides with that of , which is what allows to interpret (45) as an analog machine for globally optimizing the function . In fact, for the FPK operator (48), the solution of (46) enjoys an exponential rate of convergence [33, p. 1358-1359] to (47). We remark here that the idea of using SDEs for solving global optimization problems have also appeared in [34, 35].
In the context of diffusion machine, the key observation that part of the drift term can be canceled by part of the diffusion term in the FPK operator (48a), thereby guaranteeing that the stationary PDF associated with (48) is (47), was first pointed out by Wong . While this renders the stationary solution of the FPK operator meaningful for globally minimizing , it is not known if the transient PDFs generated by (46) admit a natural interpretation. We next argue that (46) is indeed a gradient flow in the space of probability measures supported on the manifold .
V-a Wasserstein Gradient Flow
For , consider the free energy functional given by
along any solution generated by (49), thanks to for all . In particular, the RHS of (52) equals zero iff , i.e., at the stationary PDF (47). For , the RHS of (52) is for any transient PDF solving (49).
To view (49) as gradient flow over , we will need the notion of (quadratic) Wasserstein metric between two probability measures and on , defined as
where denotes a joint probability measure supported on , and the geodesic distance is given by (13). The infimum in (53) is taken over the set , which we define as the set of all joint probability measures having finite second moment that are supported on , with prescribed -marginal , and prescribed -marginal . If and have respective PDFs and , then we can use the notation in lieu of . The subscript in (53) is indicative of its dependence on the ground Riemannian metric , and generalizes the Euclidean notion of Wasserstein metric , i.e., case. See also [37, Section 3] and . Following [36, Ch. 7] and using the positive definiteness of , it is easy to verify that (53) indeed defines a metric on , i.e., is non-negative, zero iff , symmetric in its arguments, and satisfies the triangle inequality.
The square of (53) is referred to as the “optimal transport cost” (see  for the Euclidean case) that quantifies the minimum amount of work needed to reshape the PDF to (or equivalently, to ). This can be formalized via the following dynamic variational formula [40, Corollary 2.5] for (53):
The infimum above is taken over the PDF-vector field pairs , where . Recognizing that (54a) is the continuity/Liouville equation for the single integrator dynamics , one can interpret (54) as a fixed horizon “minimum weighted energy” stochastic control problem subject to endpoint PDF constraints (54b)-(54c).