Hopfield Neural Network Flow: A Geometric Viewpoint
Abstract
We provide gradient flow interpretations for the continuoustime continuousstate Hopfield neural network (HNN). The ordinary and stochastic differential equations associated with the HNN were introduced in the literature as analog optimizers, and were reported to exhibit good performance in numerical experiments. In this work, we point out that the deterministic HNN can be transcribed into Amari’s natural gradient descent, and thereby uncover the explicit relation between the underlying Riemannian metric and the activation functions. By exploiting an equivalence between the natural gradient descent and the mirror descent, we show how the choice of activation function governs the geometry of the HNN dynamics.
For the stochastic HNN, we show that the socalled “diffusion machine”, while not a gradient flow itself, induces a gradient flow when lifted in the space of probability measures. We characterize this infinite dimensional flow as the gradient descent of certain free energy with respect to a Wasserstein metric that depends on the geodesic distance on the ground manifold. Furthermore, we demonstrate how this gradient flow interpretation can be used for fast computation via recently developed proximal algorithms.
Keywords: Natural gradient descent, mirror descent, proximal operator, stochastic neural network, optimal mass transport, Wasserstein metric.
I Introduction
Computational models such as the Hopfield neural networks (HNN) [1, 2] have long been motivated as analog machines to solve optimization problems. Both deterministic and stochastic differential equations for the HNN have appeared in the literature, and the numerical experiments using these models have reported appealing performance [3, 4, 5, 6, 7]. In [8], introducing the stochastic HNN as the “diffusion machine”, Wong compared the same with other stochastic optimization algorithms, and commented: “As such it is closely related to both the Boltzmann machine and the Langevin algorithm, but superior to both in terms of possible integrated circuit realization”, and that “A substantial speed advantage for diffusion machines is likely”. Despite these promising features often backed up via numerical simulations, studies seeking basic understanding of the evolution equations remain sparse [9, 10]. The purpose of this paper is to revisit the differential equations for the HNN from a geometric standpoint.
Our main contribution is to clarify the interplay between the Riemannian geometry induced by the choice of HNN activation functions, and the associated variational interpretations for the HNN flow. Specifically, in the deterministic case, we argue that the HNN flow is natural gradient descent (Section II) of a function over a manifold with respect to (w.r.t.) a distance induced by certain metric tensor that depends on the activation functions of the HNN. This identification leads to an equivalent mirror descent interpretation (Section III). We work out an example and a case study (Section IV) to highlight these ideas.
In the stochastic HNN, the random fluctuations resulting from the noise do not allow the sample paths of the stochastic differential equation to be interpreted as gradient flow on . However, we argue (Section V) that the stochastic sample path evolution engenders a (deterministic) flow of probability measures supported on which does have an infinite dimensional gradient descent interpretation. In other words, the gradient flow interpretation for the stochastic HNN holds in the macroscopic or ensemble sense. This viewpoint seems novel in the context of stochastic HNN, and indeed leads to proximal algorithms enabling fast computation (Section VC). A comparative summary of the gradient flow interpretations for the deterministic and the stochastic HNN is provided in Table I (after Section VC).
Notations
The notation stands for the Euclidean gradient operator. We sometimes put a subscript as in to denote the gradient w.r.t. vector , and omit the subscript when its meaning is obvious from the context. As usual, acting on vectorvalued function returns the Jacobian. The symbol is used for the Hessian. We use the superscript to denote matrix transposition, and to denote the expectation operator w.r.t. the probability density function (PDF) , i.e., . In the development that follows, we assume that all probability measures are absolutely continuous, i.e., the corresponding PDFs exist. The notation stands for the set of all joint PDFs supported on with finite second raw moments, i.e., . The inequality means that the matrix is symmetric positive definite. The symbol denotes an identity matrix of appropriate dimension; denotes a column vector of ones. For , the symbol denotes the tangent space of at ; we use to denote the tangent space at a generic point in . The set of natural numbers is denoted as . We use the symbols and for denoting elementwise (i.e., Hadamard) product and division, respectively.
Ii HNN Flow as Natural Gradient Descent
We consider the deterministic HNN dynamics given by
(1a)  
(1b) 
with initial condition . Here, the function is continuously differentiable, and therefore admits a minimum on . Without loss of generality, we assume to be nonnegative on . The vectors are referred to as the state and hidden state, respectively. The so called “activation function” is assumed to satisfy the following properties:

homeomorphism,

differentiable almost everywhere,

has structure
(2) i.e., its th component depends only on the th component of the argument,

for all are strictly increasing.
Examples of activation function include the logistic function, hyperbolic tangent function, among others.
To view the continuoustime dynamics (1) from a geometric standpoint, we start by rewriting it as
(3) 
In words, the flow of is generated by a vector field that is negative of the Jacobian of evaluated at , times the gradient of w.r.t. . Due to (2), the Jacobian in (3) is a diagonal matrix. Furthermore, since each is strictly increasing and differentiable almost everywhere, the diagonal terms are positive. Therefore, the righthandside (RHS) of (3) has the following form: negative of a positive definite matrix times the gradient of . This leads to the result below.
Theorem 1.
Proof.
Amari’s natural gradient descent [11] describes the steepest descent of on a Riemannian manifold with metric tensor , and is given by the recursion
(5) 
where the discrete iteration index , and the stepsize is small. For , we get the natural gradient dynamics
(6) 
Notice that due to the assumptions listed on the activation function , its Jacobian appearing in (3) is a diagonal matrix with positive entries. Therefore, (3) is of the form (6) with , . In particular, for all . This completes the proof.
Theorem 1 allows us to interpret the flow generated by (3) as the steepest descent of in a Riemannian manifold with length element given by
(7) 
i.e., an orthogonal coordinate system can be associated with . Further insights can be obtained by interpreting as a Hessian manifold [12]. To this end, we would like to view the positive definite metric tensor as the Hessian of a (twice differentiable) strictly convex function. Such an identification will allow us to associate a mirror descent [13, 14] in the dual manifold [15] corresponding to the natural gradient descent in manifold . Before delving into mirror descent, we next point out that (7) allows us to compute geodesics on the manifold , which completes the natural gradient descent interpretation.
Iia Geodesics
The (minimal) geodesic curve parameterized via that connects , solves the wellknown EulerLagrange equation (see e.g., [16, Ch. 3.2])
(8) 
subject to the boundary conditions , . Here denote the Cristoffel symbols of the second kind for , and are given by
(9) 
wherein denotes the th element of the inverse metric tensor . For , and diagonal , we get
(10a)  
(10b)  
(10c)  
(10d) 
For our diagonal metric tensor (4), since the th diagonal element of only depends on , the terms (10a)–(10c) are all zero, i.e., the only nonzero Cristoffel symbols in (10) are (10d). This simplifies (8) as a set of decoupled ODEs:
(11) 
Using (4) and (10d), for , we can rewrite (11) as
(12) 
which solved together with the endpoint conditions , , yields the geodesic curve . In Section IV, we will see an explicitly solvable example for the system of nonlinear ODEs (12).
Once the geodesic is obtained from (12), the geodesic distance induced by the metric tensor , can be computed as
(13a)  
(13b) 
This allows us to formally conclude that the flow generated by (3), or equivalently by (6), can be seen as the gradient descent of on the manifold , measured w.r.t. the distance given by (13b).
Remark 1.
In view of the notation , we can define the Riemannian gradient , and rewrite the dynamics (6) as .
Iii HNN Flow as Mirror Descent
We now provide an alternative way to interpret the flow generated by (3) as mirror descent on a Hessian manifold that is dual to . For and , consider the mapping given by the map for some strictly convex and differentiable map , i.e.,
The map is referred to as the “mirror map”, formally defined below. In the following, the notation stands for the domain of .
Definition 1.
(Mirror map) A differentiable, strictly convex function , on an open convex set , satisfying as the argument of approaches the boundary of the closure of , is called a mirror map.
Given a mirror map, one can introduce a Bregman divergence [17] associated with it as follows.
Definition 2.
(Bregman divergence associated with a mirror map) For a mirror map , the associated Bregman divergence , is given by
(14) 
Geometrically, is the error at due to first order Taylor approximation of about . It is easy to verify that iff . However, the Bregman divergence is not symmetric in general, i.e., .
In the sequel, we assume that the mirror map is twice differentiable. Then, being strictly convex, its Hessian is positive definite. This allows one to think of as a Hessian manifold with metric tensor . The mirror descent for function on a convex set , is a recursion of the form
(15a)  
(15b) 
where is the stepsize, and the Bregman projection
(16) 
For background on mirror descent, we refer the readers to [13, 14]. For a recent reference, see [18, Ch. 4].
Raskutti and Mukherjee [15] pointed out that if is twice differentiable, then the map given by , establishes a onetoone correspondence between the natural gradient descent (5) and the mirror descent (15). Specifically, the mirror descent (15) on associated with the mirror map , is equivalent to the natural gradient descent (5) on with metric tensor , where
(17) 
is the LegendreFenchel conjugate [19, Section 12] of . In our context, the Hessian is given by (4). To proceed further, the following Lemma will be useful.
Lemma 1.
Proof.
See for example, [22, Section 2.2].
Combining (4) with (18a), we get
(19) 
for , where the notation means that the function does not depend on the th component of the vector . Therefore, the map associated with the HNN flow is of the form (19).
To derive the mirror descent (15) associated with the HNN flow, it remains to determine . Thanks to (18b), we can do so without explicitly computing . Specifically, integrating (19) yields , i.e.,
(20) 
where is defined in (19), and are constants. The constants define a parametrized family of maps . Combining (14) and (20) yields , and hence (due to (18b)). In Section IV, we illustrate these ideas on a concrete example.
Remark 2.
We note that one may rewrite (1) in the form of a mirror descent ODE [20, Section 2.1], given by
(21) 
for some tobedetermined mirror map , thereby interpreting (1) as mirror descent in itself, with as the dual variable, and as the primal variable. This interpretation, however, is contingent on the assumption that the monotone vector function is expressible as the gradient of a convex function (here, ), for which the necessary and sufficient condition is that the map be maximally cyclically monotone [21, Section 2, Corollary 1 and 2]. This indeed holds under the structural assumption (2).
Iv Examples
In this Section, we first give a simple example (Section IVA) to elucidate the ideas presented in Sections II and III. Then, we provide a case study (Section IVB) to exemplify their application in the context of an engineering problem.
Iva An Illustrative Example
Consider the activation function given by the softprojection operator
(22) 
see Fig. 1. We next illustrate the natural gradient descent and the mirror descent interpretations of (3) with activation function (22).
IvA1 Natural Gradient Descent Interpretation
Direct calculations yield
(23) 
for all . Notice that implies , and hence the RHS of (23) is positive. We have
(24) 
which allows us to deduce the following: the HNN flow with activation function (22) is the natural gradient descent (5) with diagonal metric tensor given by (24), i.e., , .
In this case, (12) reduces to the nonlinear ODE
(25) 
which solved with the boundary conditions , , , determines the geodesic curve connecting . Introducing the changeofvariable , noting that , and enforcing , we can transcribe (25) into the separable ODE
(26) 
which gives
(27) 
and consequently,
(28) 
where are constants of integration. Using the boundary conditions , in (28) yields the geodesic curve as
(29) 
where , and . From (29), it is easy to verify that is componentwise in , and satisfies the boundary conditions , . From (13b) and (29), the geodesic distance associated with the activation function (22) is
(30) 
where all vector operands such as squareroot, arcsin, division (denoted by ), are elementwise. The contour plots in Fig. 2 show the geodesic balls of given by (30), centered at . We summarize: the HNN flow generated by (3) with activation function (22) is natural gradient descent of measured w.r.t. the geodesic distance (30).
IvA2 Mirror Descent Interpretation
To derive a mirror descent interpretation, integrating (24) we obtain
(31) 
where , and as before, means that the function does not depend on the th component of the vector . Equation (31) implies that
(32) 
where are constants. Therefore, (31) can be rewritten as
(33) 
Comparing the RHS of (33) with (19), for this specific example, we have that
(34) 
Let us consider a special case of (32), namely , that is particularly insightful. In this case,
i.e., a separable (weighted) sum of the bit entropy (see [22, Table 1] and [23, Table 1]). The associated Bregman divergence
(35) 
can be interpreted as the weighted sum of logistic loss functions associated with the Bernoulli random variables. Furthermore, yields
(36) 
Consequently, using (18b) and (36), we get
(37) 
which is the dual Logistic loss [22, Table 1], and the corresponding mirror map can be identified as
(38) 
Thus, an alternative interpretation of the HNN dynamics with activation function (22) is as follows: it can be seen as mirror descent of the form (15) on in variables given by (33); a candidate mirror map is given by (38) with associated Bregman divergence (37).
IvB Case Study: Economic Load Dispatch Problem
We now apply the HNN gradient descent to solve an economic load dispatch problem in power system – a context in which HNN methods have been used before [7, 24, 25, 26]. Our objective, however, is to illustrate the geometric ideas presented herein.
Specifically, we consider generators which at time are either ON or OFF, denoted via state vector . For , if the ^{th} generator is ON (resp. OFF) at , then the ^{th} component of equals 1 (resp. 0). Let be the vector of the generators’ power output magnitudes at . A dispatcher simultaneously decides for the next time step , which of the generators should be ON or OFF () and the corresponding power output () in order to satisfy a known power demand while minimizing a convex cost of the form
(39) 
Here, denotes the (known) marginal production cost of the generators; the second summand in (39) models the symmetric cost for switching the generator states from ON to OFF or from OFF to ON; the last summand in (39) models the generator rampup or rampdown cost. The weights are known to the dispatcher. The optimization problem the dispatcher needs to solve (over single time step) reads
(40a)  
(40b)  
(40c) 
The constraint (40c) enforces that the supply equals the demand; the constraint (40b) ensures that only those generators which are ON, can produce electricity.
To solve the Mixed Integer Quadratically Constrained Quadratic Programming (MIQCQP) problem (40), we construct the augmented Lagrangian
(41) 
where the parameter , and the dual variables . Then, the augmented Lagrange dual function
(42) 
For some initial guess of the pair , we apply the HNN natural gradient descent (5) to perform the minimization (42), and then do the dual ascent updates
(43a)  
(43b) 
where is the step size for the dual updates. Specifically, for the iteration index , we perform the recursions
(44a)  
(44b)  
(44c) 
where (44a) is solved via Hopfield subiterations with step size . In [7], this algorithm was referred to as the dual Hopfield method, and was shown to have superior numerical performance compared to other standard approaches such as semidefinite relaxation.
For numerical simulation, we implement (44) with generators, uniform random vector , the pair uniform random in , , , and . The vectors and in (39) are chosen as follows: is chosen to be uniform random in and then elementwise rounded to the nearest integers; is chosen uniform random in and then elementwise multiplied with . For chosen uniform random in , we set . The activation functions are as in (22) with . The Hopfield subiterations were run until error tolerance was achieved with maximum number of subiterations being .
We fix the problem data generated as above, and perform 100 Monte Carlo simulations, each solving an instance of the dual Hopfield method for the same problem data with randomly chosen . Fig. 3 shows the convergence trends for the joint vector associated with the Monte Carlo simulations by plotting the distance between the current iterate and the converged vector over iteration index . In particular, Fig. 3 highlights how the geometry induced by the HNN governs the convergence trend: the geodesic distance to minimum (blue curves in Fig. 3), where is given by (30), decays monotonically, as expected in gradient descent. However, (red curves in Fig. 3) does not decay monotonically since the Euclidean distance does not encode the Riemannian geometry of the HNN recursion (5).
V Stochastic HNN Flow As Gradient Descent In the Space of Probability Measures
We next consider the stochastic HNN flow, also known as the “diffusion machine” [8, Section 4], given by the Itô stochastic differential equation (SDE)
(45) 
where the state^{*}^{*}*We make the standard assumption that at . If the activation functions do not satisfy this, then a reflecting boundary is needed at each , to keep the sample paths within the unit hypercube; see e.g., [27]. , the standard Wiener process , the notation stands for the vector of ones, and the parameter denotes the thermodynamic/annealing temperature. Recall that the metric tensor given by (4) is diagonal, and hence its inverse and square roots are diagonal too with respective diagonal elements being the inverse and square roots of the original diagonal elements.
The diffusion machine (45) was proposed in [8, 28] to solve global optimization problems on the unit hypercube, and is a stochastic version of the deterministic HNN flow discussed in Sections II–IV. This can be seen as a generalization of the Langevin model [29, 30] for the Boltzmann machine [31]. Alternatively, one can think of the diffusion machine as the continuousstate continuoustime version of the HNN in which Gaussian white noise is injected at each node.
Let be the FokkerPlanckKolmogorov (FPK) operator [32] associated with (45) governing the evolution of the joint PDFs , i.e.,
(46) 
where the given initial joint PDF . The drift and diffusion coefficients in (45) are tailored so that admits the stationary joint PDF
(47) 
i.e., a Gibbs PDF where is a normalizing constant (known as the “partition function”) to ensure . This follows from noting that for (45), we have^{†}^{†}†We remind the readers that the raised indices denote the elements of the inverse of the metric tensor, i.e., .
(48a)  
(48b) 
and that for each . From (47), the critical points of coincides with that of , which is what allows to interpret (45) as an analog machine for globally optimizing the function . In fact, for the FPK operator (48), the solution of (46) enjoys an exponential rate of convergence [33, p. 13581359] to (47). We remark here that the idea of using SDEs for solving global optimization problems have also appeared in [34, 35].
In the context of diffusion machine, the key observation that part of the drift term can be canceled by part of the diffusion term in the FPK operator (48a), thereby guaranteeing that the stationary PDF associated with (48) is (47), was first pointed out by Wong [8]. While this renders the stationary solution of the FPK operator meaningful for globally minimizing , it is not known if the transient PDFs generated by (46) admit a natural interpretation. We next argue that (46) is indeed a gradient flow in the space of probability measures supported on the manifold .
Va Wasserstein Gradient Flow
Using (48b), we rewrite the FPK PDE (46) as
(49) 
where
(50) 
For , consider the free energy functional given by
(51) 
which is a sum of the potential energy , and the internal energy . The free energy (51) serves as the Lyapunov functional for (49) since direct calculation yields (Appendix A)
(52) 
along any solution generated by (49), thanks to for all . In particular, the RHS of (52) equals zero iff , i.e., at the stationary PDF (47). For , the RHS of (52) is for any transient PDF solving (49).
To view (49) as gradient flow over , we will need the notion of (quadratic) Wasserstein metric between two probability measures and on , defined as
(53) 
where denotes a joint probability measure supported on , and the geodesic distance is given by (13). The infimum in (53) is taken over the set , which we define as the set of all joint probability measures having finite second moment that are supported on , with prescribed marginal , and prescribed marginal . If and have respective PDFs and , then we can use the notation in lieu of . The subscript in (53) is indicative of its dependence on the ground Riemannian metric , and generalizes the Euclidean notion of Wasserstein metric [36], i.e., case. See also [37, Section 3] and [38]. Following [36, Ch. 7] and using the positive definiteness of , it is easy to verify that (53) indeed defines a metric on , i.e., is nonnegative, zero iff , symmetric in its arguments, and satisfies the triangle inequality.
The square of (53) is referred to as the “optimal transport cost” (see [39] for the Euclidean case) that quantifies the minimum amount of work needed to reshape the PDF to (or equivalently, to ). This can be formalized via the following dynamic variational formula [40, Corollary 2.5] for (53):
(54a)  
(54b)  
(54c) 
The infimum above is taken over the PDFvector field pairs , where . Recognizing that (54a) is the continuity/Liouville equation for the single integrator dynamics , one can interpret (54) as a fixed horizon “minimum weighted energy” stochastic control problem subject to endpoint PDF constraints (54b)(54c).