# Spectral Inference Networks:

Unifying Spectral Methods With Deep Learning

###### Abstract

We present Spectral Inference Networks, a framework for learning eigenfunctions of linear operators by stochastic optimization. Spectral Inference Networks generalize Slow Feature Analysis to generic symmetric operators, and are closely related to Variational Monte Carlo methods from computational physics. As such, they can be a powerful tool for unsupervised representation learning from video or pairs of data. We derive a training algorithm for Spectral Inference Networks that addresses the bias in the gradients due to finite batch size and allows for online learning of multiple eigenfunctions. We show results of training Spectral Inference Networks on problems in quantum mechanics and feature learning for videos on synthetic datasets as well as the Arcade Learning Environment. Our results demonstrate that Spectral Inference Networks accurately recover eigenfunctions of linear operators, can discover interpretable representations from video and find meaningful subgoals in reinforcement learning environments.

Spectral Inference Networks:

Unifying Spectral Methods With Deep Learning

David Pfau, Stig Petersen, Ashish Agarwal, David Barrett and Kim Stachenfeld DeepMind Google Brain London, UK Mountain View, CA, USA {pfau, svp, agarwal, barrettdavid, stachenfeld}@google.com

noticebox[b]Preprint. Work in progress.\end@float

## 1 Introduction

Spectral algorithms are central to machine learning and scientific computing. In machine learning, eigendecomposition and singular value decomposition are foundational tools, used for PCA as well as a wide variety of other models. In scientific applications, solving for the eigenfunction of a given linear operator is central to the study of PDEs, and gives the time-independent behavior of classical and quantum systems. For systems where the linear operator of interest can be represented as a reasonably-sized matrix, full eigendecomposition can be achieved in time (pan1998complexity, ), and in cases where the matrix is too large to diagonalize completely (or even store in memory), iterative algorithms based on Krylov subspace methods can efficiently compute a fixed number of eigenvectors by repeated application of matrix-vector products (golub2012matrix, ).

At a larger scale, the eigenvectors themselves cannot be represented explicitly in memory. This is the case in many applications in quantum physics and machine learning, where the state space of interest may be combinatorially large or even continuous and high dimensional. Typically, the eigenfunctions of interest are approximated from a fixed number of points small enough to be stored in memory, and then the value of the eigenfunction at other points is approximated by use of the Nyström method (bengio2004out, ). As this depends on evaluating a kernel between a new point and every point in the training set, this may not be practical for large datasets, and some form of function approximation may work better. By choosing a class of function approximator known to work well in a certain domain, such as convolutional neural networks for vision, we may be able to bias the learned representation towards reasonable solutions in a way that is difficult to encode by choice of kernel.

In this paper, we propose a way to approximate eigenfunctions of linear operators on high-dimensional function spaces, which we call Spectral Inference Networks (SpIN) and train these networks via stochastic optimization. We present an algorithm to compute nearly unbiased gradients of the spectral inference network objective, and show our method finds correct eigenfunctions of problems in quantum physics and discovers interpretable representations from video.

## 2 Method

### 2.1 Spectral Decomposition as Optimization

Eigenvectors of a matrix are defined as those vectors such that for some scalar , the eigenvalue. It is also possible to define eigenvectors as the solution to an optimization problem. If is a symmetric matrix, then the largest eigenvector of is the solution of:

(1) |

or equivalently (up to a scaling factor in )

(2) |

This is the Rayleigh quotient, and it is easily seen by setting derivatives equal to zero that this is equivalent to finding such that , where is equal to the value of the Rayleigh quotient. We can equivalently find the lowest eigenvector of by minimizing the Rayleigh quotient instead. Amazingly, despite being a nonconvex problem, algorithms such as power iteration converge to the global solution of this problem (daskalakis2017converse, , Sec. 4).

To compute the top eigenvectors , we can solve a sequence of maximization problems:

(3) |

If we only care about finding a subspace that spans the top eigenvectors, we can divide out the requirement that the eigenvectors are orthogonal to one another, and reframe the problem as a single optimization problem (edelman1998geometry, , Sec. 4.4):

(4) |

or, if denotes row of :

(5) |

Note that this objective is invariant to right-multiplication of by an arbitrary matrix, and thus we do not expect the columns of to be the separate eigenvectors. We will discuss how to break this symmetry in Section 2.4.

### 2.2 From Eigenvectors to Eigenfunctions

We are interested in the case where both and are too large to represent in memory. Suppose that instead of a matrix we have a symmetric (not necessarily positive definite) kernel where and are in some space . Let the inner product on be defined with respect to a probability distribution with density , so that . In theory this could be an improper density, such as the uniform distribution over , but in practice we will have to choose some distribution over points in to sample from during training (or it will be the data distribution, over which we have no control). We can construct a symmetric linear operator from as . To compute a function that spans the top eigenfunctions of this linear operator, we need to solve the equivalent of Equation 5 for function spaces. Replacing rows and with points and and sums with expectations, this becomes:

(6) |

where the optimization is over all functions such that each element of is an integrable function under the metric above. Also note that as is a row vector while is a column vector, the transposes are switched. This is equivalent to solving the constrained optimization problem

(7) |

For clarity, we will use to denote the covariance^{1}^{1}1Technically, this is the second moment, as is not necessarily zero-mean, but we will refer to it as the covariance for convenience. of features and to denote the kernel-weighted covariance throughout the paper, so the objective in Equation 6 becomes . The empirical estimate of these quantities will be denoted as and .

The form of the kernel often allows for simplification to Equation 6. If is a graph, and if and are neighbors and 0 otherwise, and is equal to the total number of neighbors of , this is the graph Laplacian, and can equivalently be written as:

(8) |

for neighboring points. It’s clear that this kernel penalizes the difference between neighbors, and in the case where the neighbors are adjacent video frames this is Slow Feature Analysis (SFA) (wiskott2002slow, ). Thus SFA is a special case of SpIN, and the algorithm for learning in SpIN here allows for end-to-end online learning of SFA with arbitrary function approximators. The equivalent kernel to the graph Laplacian for is

(9) |

where is the unit vector along the axis . This converges to the differential Laplacian, and the linear operator induced by this kernel is , which appears frequently in physics applications. The generalization to generic manifolds is the Laplace-Beltrami operator. Since these are purely local operators, we can replace the double expectation over and with a single expectation, simplifying computation.

There are many possible ways of solving the optimization problems in Equations 6 and 7. In principle, we could use a constrained optimization approach such as the augmented Lagrangian method (bertsekas2014constrained, ), which has been successfully combined with deep learning for approximating maximum entropy distributions (loaiza2017maximum, ). In our experience, such an approach was difficult to stabilize. We could also construct an orthonormal function basis and then learn some flow that preserves orthonormality. This approach has been suggested for quantum mechanics problems by cranmer2018quantum (). But, if the distribution is unknown, then constructing an explicitly orthonormal function basis may not be possible, and if the state space is large but discrete, defining a flow that preserves orthogonality may not be possible. Instead, we take the approach of directly optimizing the quotient in Equation 6.

### 2.3 Reducing the Bias in the Gradients

Unlike most machine learning problems, the objective in Equation 6 is a combination of two expectations, which is a linear function of but a nonlinear function of . This means estimating gradients directly from minibatches will be biased, making this more challenging than most stochastic optimization problems in machine learning. Let be the parameters of the function class to optimize, so the eigenfunctions become , though we will leave out the explicit dependence on for clarity. Given samples , , let be the empirical estimate of the kernel-weighted covariance from one minibatch. Then is an unbiased estimate of the objective due to the linearity of expectations, and its gradient

(10) |

is an unbiased estimate of the gradient of the objective. Here is a shorthand for for matrix-valued and . If we had access to the true covariance of the features , then the first term in the gradient could be computed efficiently by reverse-mode automatic differentiation. The second term, however, requires both knowledge of the covariance and the full Jacobian of the covariance.

In practice we substitute both the covariance and Jacobian of the covariance in the above expression with a moving average. Provided the moving average decays slowly enough, this is a good approximation to the true covariance and its Jacobian. However, updating this moving average still requires gradient computations at each time step.

### 2.4 Breaking the Symmetry Between Eigenfunctions

Since Equation 6 is invariant to linear transformation of the features , optimizing it will only give a function that spans the top eigenfunctions of . To find separated eigenfunctions, we really need to solve a sequence of optimization problems as in Equation 3, each of which depends on the solutions before it but not after it. Consider a normalized form of the kernel-weighted covariance , where is the Cholesky decomposition of . Then the lower triangular structure of the Cholesky decomposition means that maximizing the diagonal element with respect to the th feature is the same as maximizing the Rayleigh quotient under the constraint that is orthogonal to all , . If we just optimize with respect to sequentially, we are done, so long as we use a different network for every eigenfunction. But if we try to optimize with respect to all , then all features before will change as well.

Our idea is to use a modified form of the gradients with respect to the features:

(11) |

and to then update the features with the total modified gradient over all features . A closed form expression for this modified gradient exists (derived in the supplementary material in Section A) and plugging this modified gradient into Equation 10 gives:

(12) |

where and give the upper triangular and diagonal of a matrix, respectively. Note here that we use , the minibatch estimate of , in computing the normalized kernel-weighted covariance . This gives an unbiased gradient estimate which optimizes the sequential eigenfunction problem, making it possible to solve for multiple eigenfunctions simultaneously in an online setting. The full training algorithm for SpIN is given in Algorithm 1.

## 3 Related Work

Spectral methods are mathematically ubiquitous, arising in a number of diverse settings. Spectral clustering (ng2002spectral, ), normalized cuts (shi2000normalized, ) and Laplacian eigenmaps (belkin2002laplacian, ) are all machine learning applications of spectral decompositions applied to graph Laplacians. Related manifold learning algorithms like LLE (Tenenbaum2000, ) and IsoMap (Roweis2000, ) also rely on eigendecomposition, with a different kernel. Spectral algorithms can also be used for asymptotically exact estimation of parametric models like hidden Markov models and latent Dirichlet allocation by computing the SVD of moment statistics (hsu2008spectral, ; anandkumar2012spectral, ).

In the context of reinforcement learning, spectral decomposition of predictive state representations has been proposed as a method for learning a coordinate system of environments for planning and control (boots2011closing, ), and when the transition function is symmetric its eigenfunctions are also known as proto-value functions (PVFs) (mahadevan2007proto, ). The use of PVFs for discovering subgoals in reinforcement learning has been investigated in (Machado2017a, ) and combined with function approximation in (machado2018eigenoption, ), an approach which we compare against in the experimental section. Spectral decomposition of an enviroment’s transition function or successor function (which has the same eigenfunctions) has also been proposed by neuroscientists as a model for the emergence of grid cells in the entorhinal cortex (stachenfeld17, ).

Spectral learning with stochastic approximation has a long history as well. Probably the earliest work on stochastic PCA is that of “Oja’s rule" (Oja1982, ), which is a Hebbian learning rule that converges to the first principal component, and a wide variety of online SVD algorithms have appeared since. Most of these stochastic spectral algorithms are concerned with learning fixed-size eigenvectors from online data, while we are concerned with cases where the eigenfunctions are over a space too large to be represented efficiently with a fixed-size vector.

The closest related work in machine learning on finding eigenfunctions by stochastic approximation is Slow Feature Analysis (SFA) (wiskott2002slow, ), which is a special case of SpIN. SFA is equivalent to function approximation for Laplacian eigenmaps (sprekeler2011relation, ), and it has been shown that optimizing for the slowness of features in navigation can also lead to the emergence of grid cells (wyss2006model, ; franzius2007slowness, ). SFA has primarily been applied to train shallow models, and when trained on deep models is typically trained in a layer-wise fashion, rather than end-to-end (kompella2012incremental, ; sun2014dl, ). The features in SFA are typically learned sequentially, from slowest to fastest, while SpIN allows for simultaneous training of all eigenfunctions.

Spectral methods and deep learning have also been combined in other ways. The spectral networks of bruna2014spectral () are a generalization of convolutional neural networks to graph and manifold structured data based on the idea that the convolution operator is diagonal in a basis defined by eigenvectors of the Laplacian. In (ionescu2015matrix, ) spectral decompositions were incorporated as differentiable layers in deep network architectures. While these use spectral methods to design neural network architectures, our work uses neural networks to solve large-scale spectral decompositions.

In computational physics, the problem of finding eigenfunctions of a Hamiltonian operator by optimization is known as Variational Monte Carlo (VMC) (toulouse2016introduction, ). VMC methods are usually applied to finding the ground state (lowest eigenvalue) of electronic systems, but extensions to excited states (higher eigenvalues) have been proposed (blunt2015krylov, ). Typically the class of function approximator is tailored to the system, but neural networks have been used for ground state calculations (carleo2017solving, ). Most of these methods select points on a fixed grid or use batches large enough that the gradients are nearly unbiased. To the best of our knowledge ours is the first work to address the issue of biased gradients due to finite batch sizes, and the first to connect SFA with work in the physics community.

## 4 Experiments

### 4.1 Solving the Schrödinger Equation

As a first experiment to demonstrate the correctness of the method on a problem with a known solution, we investigated the use of SpIN for solving the Schrödinger equation for a two-dimensional hydrogen atom. The time-independent Schrödinger equation for a single particle with mass in a potential field is a partial differential equation of the form:

(13) |

whose solutions describe the wavefunctions with unique energy . The probability of a particle being at position then has the density . The solutions are eigenfunctions of the linear operator - known as the Hamiltonian operator. We set to 1 and choose , which corresponds to the potential from a charged particle. In 2 or 3 dimensions this can be solved exactly, and in 2 dimensions it can be shown that there are eigenfunctions with energy for all (yang1991analytic, ).

Details of the training network and experimental setup are given in the supplementary material in Section B.1. We found it critical to set the decay rate for RMSProp to be slower than the decay used for the moving average of the covariance in SpIN, and expect the same would be true for other adaptive gradient methods. To investigate the effect of biased gradients and demonstrate how SpIN can correct it, we specifically chose a small batch size for our experiments. As an additional baseline over the known closed-form solution, we computed eigenvectors of a discrete approximation to on a grid.

Training results are shown in Figure 1. In Figure 0(a), we see the circular harmonics that make up the electron orbitals of hydrogen in two dimensions. With a small batch size and no bias correction, the eigenfunctions (Figure 0(b)) are incorrect and the eigenvalues (Figure 0(d), ground truth in black) are nowhere near the true minimum. With the bias correction term in SpIN, we are able to both accurately estimate the shape of the eigenfunctions (Figure 0(c)) and converge to the true eigenvalues of the system (Figure 0(e)). Note that, as eigenfunctions 2-4 and 5-9 are nearly degenerate, any linear combination of them is also an eigenfunction, and we do not expect Fig. 0(a) and Fig. 0(c) to be identical. We believe that the high accuracy of the learned eigenvalues shows the correctness of our method, and to the best of our knowledge this is the first demonstration of a VMC method which can work from small batches.

### 4.2 Deep Slow Feature Analysis

Having demonstrated the effectiveness of SpIN on a problem with a known closed-form solution, we now turn our attention to problems relevant to representation learning in vision. We trained a convolutional neural network to extract features from videos, using the Slow Feature Analysis kernel of Equation 8. The first video we looked at was a simple example with three bouncing balls. The velocities of the balls are constant until they collide with each other or the walls, meaning the time dynamics are reversible, and hence the transition function is a symmetric operator. We trained a convolutional neural network with 12 output eigenfunctions using similar decay rates to the experiments in Section 4.1. Full details of the training setup are given in Section B.2, including training curves in Figure 4. During the course of training, the order of the different eigenfunctions often switched, as lower eigenfunctions sometimes took longer to fit than higher eigenfunctions.

Analysis of the learned solution is shown in Figure 2. Fig. 1(a) is a heatmap showing whether the feature is likely to be positively activated (red) or negatively activated (blue) when a ball is in a given position. Since each eigenfunction is invariant to change of sign, the choice of color is arbitrary. Most of the eigenfunctions are encoding for the position of balls independently, with the first two eigenfunctions discovering the separation between up/down and left/right, and higher eigenfunctions encoding higher frequency combinations of the same thing. However, some eigenfunctions are encoding more complex joint statistics of position. For instance, one eigenfunction (outlined in green in Fig. 1(a)) has no clear relationship with the marginal position of a ball. But when we plot the frames that most positively or negatively activate that feature (Fig. 1(b)) we see that the feature is encoding whether all the balls are crowded in the lower right corner, or one is there while the other two are far away. Even higher eigenfunctions would likely encode for even more complex joint relationships. None of the eigenfunctions we investigated seemed to encode anything meaningful about velocity, likely because collisions cause the velocity to change rapidly, and thus optimizing for slowness of features is unlikely to discover this. A different choice of kernel may lead to different results.

### 4.3 Successor Features and the Arcade Learning Environment

Lastly, we compare the performance of SpIN with the same SFA objective against the successor feature approach for learning eigenpurposes machado2018eigenoption () on the Arcade Learning Environment (bellemare2013arcade, ). As in machado2018eigenoption (), we trained a network to perform next-frame prediction on 500k frames of a random agent playing one game. We simultaneously trained another network to compute the successor features barreto2017successor () of the latent code of the next-frame predictor, and computed the “eigenpurposes" by applying PCA to the successor features on 64k held-out frames of gameplay. We used the same convolutional network architecture as machado2018eigenoption (), a batch size of 32 and RMSProp with a learning rate of 1e-4 for 300k iterations, and updated the target network every 10k iterations. While the original paper did not mean-center the successor features when computing eigenpurposes, we found that the results were significantly improved by doing so. Thus the baseline presented here is actually stronger than in the original publication.

On the same data, we trained a spectral inference network with the same architecture as the encoder of the successor feature network, except for the fully connected layers, which had 128 hidden units and 5 non-constant eigenfunctions. We tested SpIN on the same 64k held-out frames as those used to estimate the eigenpurposes. We used the same training parameters and kernel as in Section 4.2. As SpIN is not a generative model, we must find another way to compare the features learned by each method. We averaged together the 100 frames from the test set that have the largest magnitude positive or negative activation for each eigenfunction/eigenpurpose. Results are shown in Figure 3, with more examples and comparison against PCA on pixels in the supplementary material in Section C.

By comparing the top row to the bottom row in each image, we can judge whether that feature is encoding anything nontrivial. It can be seen that for many games, successor features may find a few eigenpurposes that encode interesting features, but many eigenpurposes do not seem to encode anything that can be distinguished from the mean image. Whereas for SpIN, nearly all eigenfunctions are encoding features such as the presence/absence of a sprite, or different arrangements of sprites, that lead to a clear distinction between the top and bottom row. Moreover, SpIN is able to learn to encode these features in a fully end-to-end fashion, without any pixel reconstruction loss, whereas the successor features must be trained from two distinct losses, followed by a third step of computing eigenpurposes.

## 5 Discussion

We have shown that a single unified framework is able to compute spectral decompositions by stochastic gradient descent on domains relevant to physics and reinforcement learning. This generalizes existing work on using slowness as a criterion for unsupervised learning, and addresses a previously unresolved issue with biased gradients due to finite batch size. One of the shortcomings of the proposed solution is the requirement of computing full Jacobians at every time step. Improving the scaling of training, either through approximations to the Jacobian, or imposing additional structure such as hierarchy or factorization on the learned eigenfunctions, is a promising direction for future research. The physics application presented here is on a fairly simple system, and we hope that Spectral Inference Nets can be fruitfully applied to more complex physical systems for which computational solutions are not yet available. The representations learned on video data show nontrivial structure and sensitivity to meaningful properties of the scene. These representations could be used for many downstream tasks, such as object tracking, gesture recognition, or faster exploration and subgoal discovery in reinforcement learning. Finally, while the framework presented here is quite general, the examples shown investigated only a small number of linear operators. Now that the basic framework has been laid out, there is a rich space of possible kernels and architectures to combine and explore.

#### Acknowledgments

Thanks to Danilo Rezende, James Kirkpatrick, Marlos Machado, Shakir Mohamed, Greg Wayne, Aron Cohen and Suvrit Sra for helpful discussions and pointers.

## References

- (1) Anima Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-Kai Liu. A Spectral Algorithm for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.
- (2) André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor Features for Transfer in Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 4055–4065, 2017.
- (3) Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
- (4) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- (5) Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L Roux, and Marie Ouimet. Out-of-sample Extensions for LLE, IsoMap, MDS, Eigenmaps, and Spectral Clustering. In Advances in Neural Information Processing Systems, pages 177–184, 2004.
- (6) Dimitri P Bertsekas. Constrained Optimization and Lagrange Multiplier methods. Academic press, 2014.
- (7) NS Blunt, Ali Alavi, and George H Booth. Krylov-Projected Quantum Monte Carlo Method. Physical Review Letters, 115(5):050603, 2015.
- (8) Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. Closing the Learning-Planning Loop with Predictive State Representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
- (9) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral Networks and Locally Connected Networks on Graphs. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
- (10) Giuseppe Carleo and Matthias Troyer. Solving the Quantum Many-Body Problem with Artificial Neural Networks. Science, 355(6325):602–606, 2017.
- (11) Kyle Cranmer, Duccio Pappadopulo, and Siavash Golkar. Quantum Inference and Quantum Flows. doi:10.6084/m9.figshare.6197069.v1, 2018.
- (12) Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. A Converse to Banach’s Fixed Point Theorem and its CLS Completeness. In 50th Annual ACM Symposium on Theory of Computing, 2018.
- (13) Alan Edelman, Tomás A Arias, and Steven T Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
- (14) Mathias Franzius, Henning Sprekeler, and Laurenz Wiskott. Slowness and Sparseness Lead to Place, Head-Direction, and Spatial-View Cells. PLoS Computational Biology, 3(8):e166, 2007.
- (15) Mike Giles. An Extended Collection of Matrix Derivative Results for Forward and Reverse Mode Automatic Differentiation. Oxford University Computing Laboratory, Numerical Analysis Report, 08(01), 2008.
- (16) Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
- (17) Daniel Hsu, Sham M Kakade, and Tong Zhang. A Spectral Algorithm for Learning Hidden Markov Models. Journal of Computer and System Sciences, 78:1460–1480, 2012.
- (18) Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix Backpropagation for Deep Networks with Structured Layers. In Proceedings of the IEEE International Conference on Computer Vision, pages 2965–2973, 2015.
- (19) Varun Raj Kompella, Matthew Luciw, and Jürgen Schmidhuber. Incremental Slow Feature Analysis: Adaptive Low-Complexity Slow Feature Updating from High-Dimensional Input Streams. Neural Computation, 24(11):2994–3024, 2012.
- (20) Gabriel Loaiza-Ganem, Yuanjun Gao, and John P Cunningham. Maximum Entropy Flow Networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
- (21) Marlos C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian Framework for Option Discovery in Reinforcement Learning. Proceedings of the 34th International Conference on Machine Learning, 2017.
- (22) Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption Discovery through the Deep Successor Representation. In Proceedings of the 6th International Conference on Learning Representations, 2018.
- (23) Sridhar Mahadevan and Mauro Maggioni. Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes. Journal of Machine Learning Research, 8:2169–2231, 2007.
- (24) Iain Murray. Differentiation of the Cholesky decomposition. arXiv preprint arXiv:1602.07527, 2016.
- (25) Andrew Y Ng, Michael I Jordan, and Yair Weiss. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
- (26) Erkki Oja. Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology, 15(3):267–273, Nov 1982.
- (27) Victor Y Pan, Z Chen, Ailong Zheng, et al. The Complexity of the Algebraic Eigenproblem. Mathematical Sciences Research Institute, Berkeley, pages 1998–71, 1998.
- (28) S. T. Roweis and L K Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, Dec 2000.
- (29) Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
- (30) Henning Sprekeler. On the Relation of Slow Feature Analysis and Laplacian Eigenmaps. Neural Computation, 23(12):3287–3302, 2011.
- (31) Kimberly L Stachenfeld, Matthew M Botvinick, and Samuel J Gershman. The Hippocampus as a Predictive Map. Nature Neuroscience, 20:1643–1653, 2017.
- (32) Lin Sun, Kui Jia, Tsung-Han Chan, Yuqiang Fang, Gang Wang, and Shuicheng Yan. DL-SFA: Deeply-Learned Slow Feature Analysis for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2632, 2014.
- (33) J. B. Tenenbaum, V de Silva, and J C Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319–2323, Dec 2000.
- (34) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the Gradient by a Running Average of its Recent Magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
- (35) Julien Toulouse, Roland Assaraf, and Cyrus J Umrigar. Introduction to the Variational and Diffusion Monte Carlo Methods. In Advances in Quantum Chemistry, volume 73, pages 285–314. Elsevier, 2016.
- (36) Laurenz Wiskott and Terrence J Sejnowski. Slow Feature Analysis: Unsupervised Learning of Invariances. Neural Computation, 14(4):715–770, 2002.
- (37) Reto Wyss, Peter König, and Paul FM J Verschure. A Model of the Ventral Visual System Based on Temporal Stability and Local Memory. PLoS Biology, 4(5):e120, 2006.
- (38) XL Yang, SH Guo, FT Chan, KW Wong, and WY Ching. Analytic Solution of a Two-Dimensional Hydrogen Atom. I. Nonrelativistic Theory. Physical Review A, 43(3):1186, 1991.

## Appendix A Derivation of Modified Gradients

The derivative of the normalized features with respect to parameters can be expressed as

(14) |

if we flatten out the matrix-valued , and .

The reverse-mode sensitivities for the matrix inverse and Cholesky decomposition are given by where and where is the Cholesky decomposition of and is the operator that replaces the upper triangular of a matrix with its lower triangular transposed (giles2008extended, ; murray2016differentiation, ). Using this, we can compute the gradients in closed form by application of the chain rule.

To simplify notation slightly, let and be matrices defined as:

(15) |

Then the unmodified gradient has the form:

(16) | ||||

(17) | ||||

(18) |

where is simply from Equation 14 in a form that preserves the shape of the matrix-valued variables.

To zero out the relevant elements of the gradient as described in Equation 11, we can right-multiply by . The modified gradients can be expressed in closed form as:

(19) | ||||

(20) |

where and give the upper triangular and diagonal of a matrix, respectively. By plugging in these modified gradients into the expression for the gradients with respect to in Equation 10, we get:

(21) |

## Appendix B Experimental Details

### b.1 Solving the Schrödinger Equation

To solve for the eigenfunctions with lowest eigenvalues, we used a neural network with 2 inputs (for the position of the particle), 4 hidden layers each with 128 units, and 9 outputs, corresponding to the first 9 eigenfunctions. We used a batch size of 128 - much smaller than the 16,384 nodes in the 2D grid used for the exact eigensolver solution. We chose a softplus nonlinearity rather than the more common ReLU, as the Laplacian operator would be zero almost everywhere for a ReLU network. We used RMSProp (tieleman2012lecture, ) with a decay rate of 0.999 and learning rate of 1e-5 for all experiments. We sampled points uniformly at random from the box during training, and to prevent degenerate solutions due to the boundary condition, we multiplied the output of the network by , which forces the network output to be zero at the boundary without the derivative of the output blowing up. We chose for the experiments shown here. We use the finite difference approximation of the differential Laplacian given in Section 2.2 with some small number (around 0.1), which takes the form:

(22) |

when applied to . Because the Hamiltonian operator is a purely local function of , we don’t need to sample pairs of points for each minibatch, which simplifies calculations.

We made one additional modification to the neural network architecture to help separation of different eigenfunctions. Each layer had a block-sparse structure that became progressively more separated the deeper into the network it was. For layer out of with inputs and outputs, the weight was only nonzero if there exists such that and . This split the weight matrices into overlapping blocks, one for each eigenfunction, allowing features to be shared between eigenfunctions in lower layers of the network while separating out features which were distinct between eigenfunctions higher in the network.

### b.2 Deep Slow Feature Analysis

We trained on 200,000 6464 pixel frames, and used a network with 3 convolutional layers, each with 32 channels, 55 kernels and stride 2, and a single fully-connected layer with 128 units before outputting 12 eigenfunctions. We also added a constant first eigenfunction, since the first eigenfunction of the Laplacian operator is always constant with eigenvalue zero. This is equivalent to forcing the features to be zero-mean. We used the same block-sparse structure for the weights that was used in the Schrödinger equation experiments, with sparsity in weights between units extended to sparsity in weights between entire feature maps for the convolutional layers. We trained with RMSProp with learning rate 1e-6 and decay 0.999 and covariance decay rate for 1,000,000 iterations. Each batch contained 24 clips of 10 consecutive frames. So that the true state was fully observable, we used two consecutive frames as the input and trained the network so that the difference from that and the features for the frames were as small as possible.