Multilevel Artificial Neural Network Training for Spatially Correlated LearningSubmitted to the editors 06/18/2018.      Funding: This work was supported by U.S. National Science Foundation NRT Award number 1633631, Human Frontiers Science Program grant HFSP - RGP0023/2018, U.S. National Institute of Aging grant AG059602, U.S. National Institutes for Health grant R01HD073179, USAF/DARPA FA8750-14-C-0011, and by the Leverhulme Trust and and the hospitality of the Sainsbury Laboratory Cambridge University.

Multilevel Artificial Neural Network Training for Spatially Correlated Learningthanks: Submitted to the editors 06/18/2018.
     Funding: This work was supported by U.S. National Science Foundation NRT Award number 1633631, Human Frontiers Science Program grant HFSP - RGP0023/2018, U.S. National Institute of Aging grant AG059602, U.S. National Institutes for Health grant R01HD073179, USAF/DARPA FA8750-14-C-0011, and by the Leverhulme Trust and and the hospitality of the Sainsbury Laboratory Cambridge University.

C.B. Scott Department of Computer Science, University of California, Irvine. (scottcb@uci.edu).    Eric Mjolsness Departments of Computer Science and Mathematics, University of California, Irvine. (emj@uci.edu).
Abstract

Multigrid modeling algorithms are a technique used to accelerate iterative method models running on a hierarchy of similar graphlike structures. We introduce and demonstrate a new method for training neural networks which uses multilevel methods. Using an objective function derived from a graph-distance metric, we perform orthogonally-constrained optimization to find optimal prolongation and restriction maps between graphs. We compare and contrast several methods for performing this numerical optimization, and additionally present some new theoretical results on upper bounds of this type of objective function. Once calculated, these optimal maps between graphs form the core of Multiscale Artificial Neural Network (MsANN) training, a new procedure we present which simultaneously trains a hierarchy of neural network models of varying spatial resolution. Parameter information is passed between members of this hierarchy according to standard coarsening and refinement schedules from the multiscale modelling literature. In our machine learning experiments, these models are able to learn faster than training at the fine scale alone, achieving a comparable level of error with fewer weight updates (by an order of magnitude).

Key words. Multigrid Methods, Neural Networks, Classification, Image Analysis

AMS subject classifications. 46N10, 47N10, 65M55, 68T05, 82C32

1 Motivation

Multigrid methods (or multilevel methods when the underlying graph is not a grid) comprise a modeling framework that seeks to ameliorate a core problem in iterative method models with local update rules: namely, that these models have differing rates of convergence for fine-scale and coarse-scale modes [39]. Because iteration of these models involves making updates of a given characteristic length, they are maximally efficient for propagating modes of approximately this wavelength, but ignore finer modes and are inefficient on coarser modes. Multigrid approaches gain computational benefit by addressing these multiple length-scales of behavior using multiple model resolutions, rather than attempting to address them all at the finest scale (in which the coarse modes converge slowly). These methods make use of “prolongation” and “restriction” operators to move between models in a hierarchy of scales. At each level, a “smoothing” step is performed - usually, for multilevel methods, a smoothing step consists of one pass of some iterative method for improving the model at that scale.

In this paper, we describe a novel general algorithm for applying this approach to the training of Artificial Neural Networks (ANNs). In particular we demonstrate the efficiency of this new method, which combines ideas from machine learning and multilevel modelling, by training several hierarchies of Autoencoder networks (ANNs which learn a mapping from data to a lower-dimensional latent space) [16] [6]. By applying multilevel modeling methods we learn this latent representation with an order of magnitude less cost. We will make our notion of ‘cost’ more precise in the experiments section, Section LABEL:sec:netsec.

2 Background

2.1 Prior Work

In this section, we discuss prior attempts to apply ideas from multigrid methods to neural network models. Broadly speaking, prior approaches to neural net multigrid can be categorized into two classes: (1) Neural network models which are “structurally multigrid”, i.e. are typical neural network models which make use of multiple scales of resolution; and (2) Neural network training processes which are hierarchical in some way, or use a coarsening-refinement procedure as part of the training process.

In the first class are approaches [14, 20, 34]. Reference [20] implements a convolutional network in which convolutions make use of a multigrid-like structure similar to a Gaussian pyramid, with the motivation that the network will learn features at multiple scales of resolution. Reference [14] defines a convolution operation, inspired by multigrid methods, that convolves at multiple levels of resolution simultaneously. Reference [34] demonstrates a recurrent neural network model which similarly operates in multiple levels of some scale space; but in this paper the scale space is a space of aggregated language models (specifically, the differing scales are different levels of generality in language models - for example, topic models are coarsest, word models are finest, with document models somewhere in between). Common to all three of these approaches is that they make use of a modified neural net structure while leaving the training process unchanged, except that the network accepts multiresolution inputs.

In contrast, multilevel neural network models [3, 33] in the second category present modified learning procedures which also use methodology similar to multilevel modeling. Reference [3] introduces a network which learns at coarse scales, and then gradually refines its decision making by increasing the resolution of the input space and learning “corrections” at each scale. However, that paper focuses on the capability of a particular family of basis functions for neural networks, and not on the capabilities of the multigrid approach. Reference [33] presents a reframing of the neural network training process as an evolution equation in time, and then applies a method called MGRIT (Multigrid Reduction in Time [11]) to achieve the same results as parallelizing over many runs of training.

Our approach is fundamentally different: we use coarsened versions of the network model to make coarse updates to the weight variables of our model, followed by ‘smoothing steps’ in which the fine-scale weights are refined. This approach is more general than any of [14, 20, 34], since it can be applied to any feed-forward network and is not tied to a particular network structure. The approach in [33] is to parallelize the training process by reframing it as a continuous-in-time evolution equation, but it still uses the same base model and therefore only learns at one spatial scale.

Our method is both structurally multilevel and learns using a multilevel training procedure. Our hierarchical neural network architecture is the first to learn at all spatial scales simultaneously over the course of training, transitioning between neural networks of varying input resolution according to standard multigrid method schedules of coarsening and refinement. To our knowledge, this represents a fully novel approach to combining the powerful data analysis of neural networks with the model acceleration of multiscale modeling.

2.2 Outline

Section LABEL:sec:theory covers the mathematical theory underlying our method. We first introduce the necessary definitions, which are then used in Subsection LABEL:subsub:objective to define an objective function which evaluates a map between two graphs in terms of how well it preserves the behavior of some local process operating on those graphs (interpreting the smaller of the two graphs as a coarsened version of the larger). In Subsection LABEL:subsub:precomputeP we examine some properties of this objective function, including presenting some projection matrices which are local optima for particular choices of graph structure and process. In Subsections LABEL:subsec:msann_defn and LABEL:subsub:msann_training, we define the Multiscale Artificial Neural Network (MsANN), a hierarchically-structured neural network model which uses these optimized projection matrices to project network parameters between levels of the hierarchy, resulting in more efficient training. In Section LABEL:sec:netsec, we demonstrate this efficiency by training a simple neural network model on a variety of datasets, comparing the cost of our approach to that of training only the finest network in the hierarchy. Finally, we conclude the paper by proving two novel properties of our objective function in Section LABEL:sec:theoryshortsec.

3 Theory

In this section, we first define basic terms used throughout the paper, and explain the core theory of our paper: that of optimal prolongation maps between computational processes running on graph-based data structures, and hence between graphs. In this paper we use a specific example of such a process, single-particle diffusion on graphs, to examine the behavior of these prolongation maps. Finally, we discuss numerical methods for finding (given two input graphs and , and a process) prolongation and restriction maps which minimize the error of using as a surrogate structure for simulating the behavior of that process on . We will define more rigorously what we mean by “process”, “error”, and “prolongation” in Section LABEL:subsub:objective.

3.1 Definitions

In order to describe our objective function, we must first introduce some core concepts related to minimal mappings between graphs.

  • Graph lineage: A graph lineage is a sequence of graphs, indexed by , satisfying the following:

    • is the graph with one vertex and one self-loop, and;

    • Successive members of the lineage grow roughly exponentially - that is, the growth rate is for some , , and .

    We introduce this term to differentiate this definition from that of a graph family, which is a sequence of graphs without the growth condition. Most of the graph lineages we examine in this work are structurally similar - for example, the lineage of path graphs of length . However, we do not define this similarity in a rigorous sense, and we do not require it in the definition of a lineage.

  • Graph Laplacian: We define the Laplacian matrix of a graph as , where and are the adjacency matrix and diagonal degree matrix of the graph, respectively. The eigenvalues of this matrix are referred to as the spectrum of . See [4, 10] for more details on graph Laplacians and spectral graph theory. Our sign convention for agrees with the standard continuum Laplacian operator, , of a multivariate function : .

  • Kronecker Product and Sum of matrices: Given a matrix , and some other matrix , the Kronecker product is the block matrix

    If and are square, their Kronecker Sum is defined, and is given by

    where we write to denote an identity matrix of the same size as .

  • Box Product () of graphs: For with vertex set and with vertex set , is the graph with vertex set and an edge between and when either of the following is true:

    • and and are adjacent in , or

    • and and are adjacent in .

    This may be rephrased in terms of the Kronecker Sum of the two matrices:

    (3.1)
  • Cross Product () of graphs: For with vertex set and with vertex set , is the graph with vertex set and an edge between and when both of the following are true:

    • and are adjacent in , and

    • and are adjacent in .

    We include the standard pictorial illustration of the difference between these two graph products in Figure LABEL:fig:graph_prod.

    Figure 1: Two types of graph product: the Cross product (, left) and Box product (, right). For two edges and , we illustrate the resultant edges in the set of vertices in the graph product.
  • Grid Graph: a grid graph (called a lattice graph or Hamming Graph in some texts [7]) is the distance-regular graph given by the box product of path graphs (yielding a grid with aperiodic boundary conditions) or by a similar list of cycle graphs (yielding a grid with periodic boundary conditions).

  • Prolongation map: A prolongation map between two graphs and of sizes and , with , is an matrix of real numbers which is an optimum of the objective function of equation LABEL:eqn:objfunction below (possibly subject to some set of constraints ).

  • Eigenvalue matching: Given two matrices and , and lists of their eigenvalues and , with , we define the minimal eigenvalue matching as the matrix which is the solution of the following constrained optimization problem:

    (3.2)
    subject to

    In the case of eigenvalues with multiplicity , there may not be one unique such matrix, in which case we distinguish matrices with identical cost by the lexicographical ordering of their occupied indices and take as the first of those with minimal cost. This matching problem is well-studied and efficient algorithms for solving it exist; we use a Python language implementation [8] of a 1957 algorithm due to Munkres [28]. Additionally, given a way to enumerate the minimal-cost matchings found as solutions to this eigenvalue matching problem, we can perform combinatorial optimization with respect to some other objective function , in order to find optima of subject to the constraint that is a minimal matching.

3.2 Optimal Prolongation Maps Between Graphs

3.2.1 Our objective function

Given two graphs and , we find the optimal prolongation map between them as follows: We first calculate the graph Laplacians and , as well as pairwise vertex Manhattan distance matrices (i.e. the matrix with the minimal number of graph edges between vertices and in the graph), and , of each graph. Calculating these matrices may not be trivial for arbitrary dense graphs; for example, calculating the pairwise Manhattan distance of a graph with edges on vertices can be accomplished in by the Fibonacci heap version of Dijkstra’s algorithm [13]. Additionally, in Section LABEL:sec:optresults we discuss an optimization procedure which requires computing the eigenvalues of (which are referred to as the spectrum of ). Computing graph spectra is a well studied problem; we direct the reader to [9, 31]. In practice, all of the graph spectra computed for experiments in this paper took a negligible amount of time ( 1s) on a modern consumer-grade laptop using the scipy.linalg package [19], which in turn uses LAPACK routines for Schur decomposition of the matrix [2]. The optimal map is defined as which minimizes the matrix function

(3.3)
“Diffusion Term”
“Locality Term”111By this we mean the notion that neighborhoods of should be mapped to neighborhoods of and vice versa.

where is the Frobenius norm, and is a set of constraints on (in particular, we require , but could also impose other restrictions such as sparsity, regularity, and/or bandedness). The manifold of real-valued orthogonal matrices with is known as the Stiefel manifold; minimization constrained to this manifold is a well-studied problem [32, 36]. This optimization problem can be thought of as measuring the agreement between processes on each graph, as mapped through . The expression compares the end result of

  1. Advancing process forward in time on and then using to interpolate vertex states to the smaller graph, to:

  2. Interpolating the initial state (the all-ones vector) using and then advancing process on .

Strictly speaking the above interpretation of our objective function does not apply to the Manhattan distance matrix T of a graph, since is not a valid time evolution operator and thus is not a valid choice for . However, the objective function term containing may still be interpreted as comparing travel distance in one graph to travel distance in the other. That is, we are implicitly comparing the similarity of two ways of measuring the distance of two nodes and in :

  1. The Manhattan distance, as defined above, and;

  2. , a sum of path distances in weighted by how strongly and are connected, through , to the endpoints of those paths, and .

Parameters and are rescaling parameters to compensate for different graph sizes; in other words, must only ensure that processes 1 and 2 above agree up to some multiplicative constant. In operator theory terminology, the Laplacian is a time evolution operator for the single particle diffusion equation: . This operator evolves the probability distribution of states of a single-particle diffusion process on a graph (but other processes could be used - for example, a chemical reaction network or multiple-particle diffusion). The process defines a probability-conserving Master Equation of nonequilibrium statistical mechanics which has formal solution . Pre-multiplication by the prolongation matrix is clearly a linear operator i.e. linear transformation from to . Thus, we are requiring which minimizes the degree to which the operator diagram

{diagram}

L_1 &\rTo^Δt & L_1’
\dTo_P & &\dTo_P &&&& (Diagram 1)
L_2 &\rTo^Δt & L_2’
fails to commute. of course refers to advancement in time. See [18], Figure 1, for a more complete version of this commutative diagram for model reduction.

We thus include in our objective function terms with 1) graph diffusion and 2) graph locality as the underlying process matrices (, the Manhattan distance matrix, cannot be considered a time evolution operator because it is not probability-preserving). Parameter adjusts the relative strength of these terms to each other; so we may find “fully diffuse” when and “fully local” when . Figure LABEL:fig:localityfig illustrates this tradeoff for an example prolongation problem on a pair of grid graphs, including the transition from a global optimum of the diffusion term to a global optimum of the locality term. In each case, we only require to map these processes into one another up to a multiplicative constant: for the diffusion term and for the locality term. Exhaustive grid search over and for a variety of prolongations between (a) path graphs and (b) 2D grid graphs of varying sizes has suggested that for prolongation problems where the are both paths or both grids, the best values (up to the resolution of our search, ) for these parameters are and . However, we do not expect this scaling law to hold for general graphs.

Figure 2: Several solutions of our objective function found by PyManOpt as , the relative weight of the two terms of our objective function, is tuned from 0 (fully diffuse, top left) to 1 (fully local, bottom right). Within each subplot, grayscale indicates the magnitude of matrix entries. Note that the matrices found with do not appear to be structured in a way which respects the locality of the original graphs, whereas the matrices with do.

3.3 Numerical Optimization of P Matrices

3.3.1 Minimization method

We tried various publicly available optimization codes to find optima of our objective function. Unless otherwise noted, all matrices found via optimization were found using PyManOpt, a Python language package for manifold-constrained optimization. In our experience, this package outperformed other numerical methods codes such as constrained Nelder-Mead (as implemented in Mathematica or SciPy), gradient descent with projection back to the constraint manifold, or the orthogonally-constrained optimization code of [38]. More details on our comparison of these software packages may be found in the section “Comparison of Numerical Methods” of the Supplementary Material accompanying this paper.

3.3.2 Initialization

We initialize our minimization with an upper-bound solution given by the Munkres minimum-cost matching algorithm; the initial is as defined in equation LABEL:eqn:matchingconstraints, i.e. the binary matrix where an entry is 1 if the pair is one of the minimal-cost pairs selected by the minimum-cost assignment algorithm, and 0 otherwise. While this solution is, strictly speaking, minimizing the error associated with mapping the spectrum of one graph into the spectrum of the other (rather than actually mapping a process running on one graph into a process on the other) we found it to be a reasonable initialization, outperforming both random restarts and initialization with the appropriately sized block matrix . As detailed further in Section LABEL:sec:theoryshortsec, the found as a solution to this matching problem provides an upper bound for the full orthogonality-constrained optimization problem.

3.3.3 Precomputing matrices

For some structured graph lineages it may be possible to derive formulaic expressions for optimal and , as a function of the lineage index. For example, during our experiments we discovered species of which are local minima of prolongation between path graphs, cycle graphs, and grid graphs. A set of these outputs is shown in Figure LABEL:fig:localityfig. They feature various diagonal patterns as naturally idealized in Figure LABEL:fig:pspeciesfig. These idealized versions of these patterns all are also empirical local minima of our optimization procedure, for or , as indicated. Each column of Figure LABEL:fig:pspeciesfig provides a regular family of structures for use in our subsequent experiments in Section LABEL:sec:netsec. We have additionally derived closed-form expressions for global minima of the diffusion term of our objective function for some graph families (cycle graphs and grid graphs with periodic boundary conditions). Proof of the optimality of these solutions may be found in the supplementary materials which accompany this paper. However, in practice these global minima are nonlocal (in the sense that they are not close to optimizing the locality term) and thus may not preserve learned spatial rules between weights in levels of our hierarchy.

Examples of these formulaic matrices can be seen in Figure LABEL:fig:pspeciesfig. Each column of that figure shows increasing sizes of generated by closed-form solutions which were initially found by solving smaller prolongation problems (for various graph pairs and choices of ) and generalizing the solution to higher . Many of these examples are similar to what a human being would design as interpolation matrices between cycles and periodic grids. However, (a) they are valid local optima found by our optimization code and (b) our approach generalizes to processes running on more complicated or non-regular graphs, for which there may not be an obvious a priori choice of prolongation operator.

We highlight the best of these multiple species of closed-form solution, for both cycle graphs and grid graphs. The interpolation matrix-like seen in the third column of the “Cycle Graphs” section, or the sixth column of the “Grid Graphs” section of Figure LABEL:fig:pspeciesfig, were the local optima with lowest objective function value (with , i.e. they are fully local). As the best optima found by our method(s), these matrices were our choice for line graph and grid graph prolongation operators in our neural network experiments, detailed in Section LABEL:sec:netsec. We reiterate that in those experiments we do not find the matrices via any optimization method - since the neural networks in question have layer sizes of order , finding the prolongation matrices from scratch may be computationally difficult. Instead, we use the solutions found on smaller problems as a recipe for generating prolongation matrices of the proper size.

Furthermore, given two graph lineages and , and sequences of optimal matrices and mapping between successive members of each, we can construct which are related to the optima for prolonging between members of a new graph lineage which is comprised of the levelwise graph box product of the two sequences. We show in (Section LABEL:subsec:deompose, Corollary LABEL:thm:decompcorollary) conditions under which the value of the objective function at is an upper bound of the optimal value for prolongations between members of the lineage . We leave open the question of whether such formulaic exist for other families of structured graphs (complete graphs, -partite graphs, etc.). Even in cases where formulaic are not known, the computational cost of numerically optimizing over may be amortized, in the sense that once a -map is calculated, it may be used in many different hierarchical neural networks or indeed many different multiscale models.

Figure 3: Examples of matrices for cycle graph (left) and grid graph (right) prolongation problems of various sizes, which can be generated by closed-form representations dependent on problem size. Within each of the top and bottom plots, columns represent a series of matrices each generated by a particular numerical recipe, with rows representing increasing sizes of prolongation problem. Each matrix plot is a plot of the absolute value of matrix cell values. These closed-form representations were initially found as local minima of our objective function on small problems and then generalized to closed-form representations. For the “Cycle Graphs” section, the prolongation problems were between cycle graphs of sizes and . Columns 1-3 were solutions found with (fully local), and the rest were found with (fully diffuse). For the “Grid Graphs” section, the prolongation problems were between grids of size to grids of size for in . Columns 1-6 are fully local and columns 7-10 are fully diffuse, respectively. As in Figure LABEL:fig:localityfig, grayscale values indicate the magnitude of each matrix entry.

3.4 Multiscale Artificial Neural Network Algorithm

In this section we describe the Multiscale Artificial Neural Network (MsANN) training procedure, both in prose and in pseudocode (Algorithm LABEL:alg:mgannalg). Let be a sequence of neural network models with identical “aspect ratios” (meaning the sizes of each layer relative to other layers in the same model) but differing input resolution, so that operates at the finest scale and at the coarsest. For each model , let be a list of the network parameters (each in matrix or vector form) in some canonical order which is maintained across all scales. Let the symbol represent either:

  • If the network parameters at levels are weight matrices between layers and of each hierarchy, then represents a pair of matrices , such that:

    • prolongs or restricts between possible values of nodes in layer of model , and values of nodes in layer of model .

    • does the same for possible values of nodes in layer of each model.

  • If the network parameters at levels are bias vectors which are added to layer of each hierarchy, then represents a single which prolongs or restricts between possible values of nodes in layer of model , and values of nodes in layer of model .

As a concrete example, for a hierarchy of single-layer networks , each with one weight matrix and one bias vector , we could have for each . would represent a pair of matrices which map between the space of possible values of and the space of possible values of in a manner detailed in the next section. On the other hand, would represent a single matrix which maps between and . Similarly, would map between and , and between and . In Section LABEL:subsub:msann_training, we describe a general procedure for training such a hierarchy according to standard multilevel modeling schedules of refinement and coarsening, with the result that the finest network, informed by the weights of all coarser networks, requires fewer training examples.

3.4.1 Weight Prolongation and Restriction Operators

In this section we introduce the prolongation and restriction operators for neural network weight and bias optimization variables in matrix or vector form respectively.

For a 2D matrix of weights , define

(3.4)

where and are each prolongation maps between graphs which respect the structure of the spaces of inputs and outputs of , i.e. whose structure is similar to the structure of correlations in that space. Further research is necessary to make this notion more precise. In our experiments on autoencoder networks in Section LABEL:sec:netsec, we use example problems with an obvious choice of graph to use. In these 1D and 2D machine vision tasks, where we expect each pixel to be highly correlated with the activity of its immediate neighbors in the grid, 1D and 2D grids are clear choices of graphs for our prolongtion matrix calculation. Other choices may lead to similar results; for instance, we speculate that since neural network weight matrices may be interpreted as the weights of a multipartite graph of connected neurons in the network, these graphs could be an alternate choice of structure to prolong/restrict between. We leave for future work the development of automatic methods for determining these structures.

Note that the Pro and Res linear operators satisfy , the identity operator, so is a projection operator.

For a 1D matrix of biases , define

(3.5)

where, as before, we require that be a prolongation matrix between graphs which are appropriate for the dynamics of the network layer where is applied. Again .

Given such a hierarchy of models , and appropriate Pro and Res operators as defined above, we define a Multiscale Artificial Neural Network (MsANN) to be a neural network model with the same layer and parameter dimensions as the largest model in the hierarchy, where each layer parameter is given by a sum of prolonged weight matrices from level of each of the models defined above:

(3.6)
Here we are using as a shorthand to indicate composed prolongation from model to model , so if are weight variables we have (by Equation LABEL:eqn:weightpro)
(3.7)
and if are bias variables we have (by Equation LABEL:eqn:biaspro)
(3.8)

We note that matrix products such as need only be computed once, during model construction.

3.4.2 Multiscale Artificial Neural Network Training

The Multiscale Artificial Neural Network algorithm is defined in terms of a recursive ‘cycle’ that is analogous to one epoch of default neural network training. Starting with (i.e. the finest model in the hierarchy), we call the routine , which is defined recursively. At any level , MsANNCycle trains the network at level for batches of training examples, recurses by calling , and then returns to train for further batches at level . The number of calls to inside each call to is given by a parameter .

This is followed by additional training at the refined scale; this process is normally [37] referred to by the multigrid methods community as ‘restriction’ and ‘prolongation’ followed by ‘smoothing’. The multigrid methods community additionally has special names for this type of recursive refining procedure with (“V-Cycles”) and (“W-Cycles”). See Figure LABEL:fig:gammafig for an illustration of these contraction and refinement schedules. In our numerical experiments below, we examine the effect of this parameter on multigrid network training.

Neural network training with gradient descent requires computing the gradient of the error between the network output and target with regard to the network parameters. This gradient is computed by taking a vector of error for the nodes in the output layer, and backpropagating that error backward through the network layer by layer to compute the individual weight matrix and bias vector gradients. An individual network weight or bias term is then adjusted using gradient descent, i.e. the new value is given by , where is a learning rate or step size. Several techniques can be used to dynamically change learning rate during model training - we refer the reader to [5] for a description of these techniques and backpropagation in general.

Our construction of the MsANN model above did not make use of the Res (restriction) operator - we show here how this operator is used to compute the gradient of the coarsened variables in the hierarchy. This can be thought of as continuing the process of backpropagation through the Pro operator. For these calculations we assume is a weight matrix, and derive the gradient for a particular . For notational simplicity we rename these matrices and , respectively. We also collapse the matrix products

(3.9)
(3.10)

Let be a matrix where , calculated via backpropagation as described above. Then, for some :

(3.11)

Then,

(3.12)
and so
and therefore finally
(3.13)

where Res is as in LABEL:eqn:weightpro.

Procedure MsANNCycle():
       Train model for batches, where each consists of:
  1. Feed examples through the network in feed-forward mode;

  2. Compute error between network output and target;

  3. Use the classical backpropagation algorithm to compute the gradient of top-level parameter w.r.t. this error;

  4. Use the appropriate Res operations to compute the gradient of w.r.t. the parameters in , as described in Equation LABEL:eqn:msann_res_equation.

if max_depth has not been reached then
             for  do
                   MsANNCycle();
                   Train model for batches, as above
             end for
            
       end if
      ;
Algorithm 1 One ‘cycle’ of the MsANN procedure.

We also note here that our code implementation of this procedure does not make explicit use of the Res operator; instead, we use the automatic differentiation capability of Tensorflow [1] to compute this restricted gradient. This is necessary because data is supplied to the model, and error is calculated, at the finest scale only. Hence we calculate the gradient at this scale and restrict it to the coarser layers of the model. It may be possible to feed coarsened data through only the coarser layers of the model, eliminating the need for computing the gradient at the finest scale, but we do not explore this method in this paper.

Figure 4: Visits to models in a hierarchy of neural networks realized by several values of the recursion frequency parameter . The case and the case are referred to as “V-cycles” and “W-cycles”, respectively. Each time the multilevel training procedure visits a level, it performs some number, , of smoothing steps (i.e. gradient descent at that resolution) at that model.

4 Machine Learning Experiments

4.1 Preliminaries

We present four experiments using this Multiscale Neural Network method. All of the experiments below demonstrate that our multigrid method outperforms default training (i.e. training only the finest-scale network), in terms of the number of training examples (summed over all scales) needed to reach a particular mean-squared error (MSE) value. We perform two experiments with synthetic machine vision tasks, as well as two experiments with benchmark image datasets for machine learning. While all of the examples presented here are autoencoder networks (networks whose training task is to reproduce their input at the output layer, while passing through a bottleneck layer or layers), we do not mean to imply that MsANN techniques are constrained to autoencoder networks. All network training uses the standard backpropagation algorithm to compute training gradients, and this is the expected application domain of our method. Autoencoding image data is a good choice of machine learning task for our experiments for two main reasons. First, autoencoders are symmetric and learn to reproduce their input at their output. Other ML models (for instance, neural networks for classification) have output whose nodes are not spatially correlated, and it is not yet clear if our approach will generalize to this type of model. Secondly, since the single and double-object machine vision tasks operate on synthetic data, we can easily generate an arbitrary number of samples from the data distribution, which was useful in the early development of this procedure. Our initial successes on this synthetic data led us to try the same task with a standard benchmark real-world dataset. For each experiment, we use the following measure of computational cost to compare relative performance. Let be the number of trainable parameters in model . We compute the cost of a training step of the weights in model using a batch of size as . The total cost of training at step is the sum of this cost over all training steps thus far at all scales. This cost is motivated by the fact that the number of multiply operations for backpropagation is in the total number of network parameters and training examples , so we are adding up the relative cost of using a batch of size to adjust the weights in model , as compared to the cost of using that same batch to adjust the weights in .

4.2 Simple Machine Vision Task

As an initial experiment in the capabilities of hierarchical neural networks, we first try two simple examples: finding lower-dimensional representation of two artificial datasets. In both cases, we generate synthetic data by uniformly sampling from

  1. the set of binary-valued vectors with one “object” comprising a contiguous set of pixels one-eighth as long as the entire vector set to 1, and the rest zero; and

  2. the set of vectors with two such non-overlapping objects.

In each case, the number of possible unique data vectors is quite low: for inputs of size 1024, we have 1024 - 128 = 896 such vectors. Thus, for both of the synthetic datasets we add binary noise to each vector, where each “pixel” of the input has an independent chance of firing spuriously with . This noise in included only in the input vector, making these networks Denoising Autoencoders: models whose task is to remove noise from an input image.

4.2.1 Single-Object Autoencoder

We first test the performance of this procedure on a simple machine vision task. The neural networks in our hierarchy of models each have layer size specification (in number of units) for in , with a bias term at each layer and sigmoid logistic activation. We present the network with binary vectors which are 0 everywhere except for a contiguous segment of indices of length which are set to 1, with added binary noise as described above. The objective function to minimize is the mean-squared error (MSE) between the input and output layers. Each model in the hierarchy is trained using RMSPropOptimizer in Tensorflow, with learning rate .

The results of this experiment are plotted in Figure LABEL:fig:autofig and summarized in Table LABEL:tbl:oneobj. We perform multiple runs of the entire training procedure with differing values of (the number of smoothing steps), (the multigrid cycle parameter), and (depth of hierarchy). Notably, nearly all multigrid schedules demonstrate performance gains over the default network (i.e. the network which trains only at the scale), with more improvement for higher values of , , and . The hierarchy which learned most rapidly was the deepest model with and . Those multigrid models which did not improve over the default network were only slightly more computationally expensive per unit of accuracy than their default counterparts, and the multigrid models which did improve, improved significantly.


Figure 5: Log-log plots of accuracy as a function of training cost attained by a variety of hierarchical neural networks training on a simple machine vision task, demonstrating that deeper hierarchies with more mutligrid behavior learn more rapidly. Plots are ordered from top to bottom in increasing depth of recursion parameter ; left plots are the single-object experiments and right plots are the double-object experiments. Within each plot, different curves represent different values of the depth of hierarchy, from (lightest) to (darkest). Each line is the best run for that pair over all choices of (number of smoothing steps at each level) in .
Best MsANN Worst MsANN Default Best MsANN params
Final MSE
Cost to MSE
Table 1: Best performance (on validation dataset for the one-object autoencoding task) by any combination of parameters in our sweep over values for (recursion constant), (depth of network), and (number of batches processed at each visit to each level). We report the final Mean-Squared Error for both the best and worst combination of these parameters, as well as for default training. We also report the best combination of parameters. The second row indicates the cost necessary to train each model to of the error at which it began. The best MsANN network reaches this threshhold in an order of magnitude less cost, and its final error is roughly half that of the default model, demonstrating clear improvement over training without multigrid.

4.2.2 Double-Object Autoencoder

We repeat the above experiment with a slightly more difficult machine vision task - the network must learn to de-noise an image with two (non-overlapping) ‘objects’ in the visual field. We use the same network structure and training procedure, and note that we see again (plotted in Figure LABEL:fig:autofig and summarized in Table LABEL:tbl:twoobj) that the hierarchical model is more efficient, reaching lower error in the same amount of computational cost . The multigrid neural networks again typically learn much more rapidly than the non-multigrid models.

Best MsANN Worst MsANN Default Best MsANN params
Final MSE
Cost to MSE
Table 2: Best performance (on validation dataset for the two-object autoencoding task). Again the MsANN network demonstrates performance and accuracy gains over neural network training alone. See Table LABEL:tbl:oneobj.

4.3 Mnist

To supplement the above synthetic experiments with one using real-world data, we perform the same experiment with an autoencoder for the MNIST handwritten digit dataset [25, 26]. In this case, rather than the usual MNIST classification task, we use an autoencoder to map the MNIST images into a lower-dimensional space with good reconstruction. We use the same network structure as in the 1D vision example; also as in that example, each network in the hierarchy is constructed of fully connected layers with bias terms and sigmoid activation, and smoothing steps are performed with RMSProp [15] with learning rate 0.0005. The only difference is that in this example we do not add noise to the input images, since the dataset is larger by two orders of magnitude.

In this experiment, we see (in Figure LABEL:fig:mnistfig and Table LABEL:tbl:mnist) similar improvement in efficiency. Table LABEL:tbl:mnist summarizes these results: the best multilevel models learned more rapidly and achieved lower error than their single-level counterparts, whereas the worst multilevel models performed on par with the default model. Because the MNIST data is comprised of 2D images, we tried using matrices which were the optima of prolongation problems between grids of the appropriate sizes, in addition to the same 1D used in the prior two experiments. The difference in performance between these two choices of underlying structure for the prolongation maps can be seen in Figure LABEL:fig:mnistfig. With either approach, we see similar results to the synthetic data experiment, in that more training steps at the coarser layers results in improved learning performance of the finer networks in the hierarchy. However, the matrices optimized for 2D prolongation perform marginally better than their 1D cousins, - in particular, the multigrid hierarchy with 2D prolongations took 60% of the computational cost to reduce its error to of its original value, as compared to the 1D version. We explore the effect of varying the strategy used to pick in Subsection LABEL:subsec:p_choice.


Figure 6: Log-log plots of mean-squared error on MNIST autoencoding task as a function of computational cost ; the left plots represent multigrid performed with path graph prolongations for each layer while the right plots used grid-based prolongation. While both approaches show gains over default learning in both speed of learning and final error value, the one which respects the spatial structure of the input data improves more rapidly. Subplot explanations are the same as in Figure LABEL:fig:autofig.
Path-Based Matrices
Best MsANN Worst MsANN Default Best MsANN params
Final MSE
Cost to MSE N/A N/A
Grid-Based Matrices
Best MsANN Worst MsANN Default Best MsANN params
Final MSE
Cost to MSE N/A N/A
Table 3: Best performance (on validation dataset for the MNIST autoencoding task). See Table LABEL:tbl:oneobj. Upper section represents scores attained by a MsANN with path-based prolongation, lower section represents grid-based prolongation. Entries marked N/A did not reach of their initial error during training.

4.4 Experiments of Choice of

To further explore the role of the structure of in these machine learning models, we compare the performance of several MsANN models with generated according to various strategies. Our initial experiment on the MNIST dataset used the exact same hierarchical network structure and prolongation/restriction operators as the example with 1D data, and yielded marginal computational benefit. We were thus motivated to try this learning task with prolongations which are designed for for 2D grid-based model architectures, as well as trying unstructured (random orthogonal) matrices as a baseline. More precisely, our 1D experiments used matrices resembling those in column 3 of the “Cycle Graphs” section of Figure LABEL:fig:pspeciesfig. We instead, for the MNIST task, used matrices like those in column of the “Grid Graphs” section of the same figure. In Figure LABEL:fig:pcomp_mnist, we illustrate the difference in these choices for the MNIST training task, with the same choice of multigrid training parameters: . We compare the following strategies for generating :

  1. As local optima of a prolongation problem between 1D grids, with periodic boundary conditions;

  2. As local optima of a prolongation problem between 2D grids, with periodic boundary conditions;

  3. As in LABEL:misc:plist_elem, but shuffled along the first index of the array.

Strategy LABEL:misc:plist_elem_2 was chosen to provide the same degree of connectivity between each coarse variable and its related fine variables as strategy LABEL:misc:plist_elem, but in random order i.e. connected in a way which is unrelated to the 2D correlation between neighboring pixels. We see in Figure LABEL:fig:pcomp_mnist that the two strategies utilizing local optima outperform both the randomized strategy and default training (training only the finest scale). Furthermore, strategy LABEL:misc:plist_elem outperforms strategy LABEL:misc:plist_elem_0, although the latter eventually catches up at the end of training, when coarse-scale weight training has diminishing marginal returns. The random strategy is initially on par with the two optimized ones (we speculate that this is due to the ability to affect many fine-scale variables at once, even in random order, which may make the gradient direction easier to travel), but eventually falls behind, at times being less efficient than default training. We leave for further work the question of whether there are choices of prolongation problem which are even more efficient for this machine learning task. We also compare all of the preceeding models to a model which has the same structure as a MsANN model (a hierarchy of coarsened variables with Pro and Res operators between them), but which was trained by training all variables in the model simultaneously. This model performs on par with the default model, illustrating the need for the multilevel training schedule dictated by the choice of .

Figure 7: Comparison of several choices of Pro and Res operators for a Multiscale Neural Network training experiment, on MNIST data. Two choices for which are local optima of prolongation problems demonstrate more efficient training than default, while two strategies perform worse: multigrid training with random matrices, and training all varibles in the hierarchy simultaneously.

4.5 Summary

We see uniform improvement (as the parameters and are increased) in the rate of neural network learning when models are stacked in the type of multiscale hierarchy we define in equations LABEL:eqn:weightpro and LABEL:eqn:biaspro, despite the diversity of machine learning tasks we examine. Furthermore, this improvement is marked: the hierarchical models both learn more rapidly than training without multigrid and have final error lower than the default model. In many of our test cases, the hierarchical models reached the same level of MSE as the default in more than an order of magnitude fewer training examples, and continued to improve, surpassing the final level of error reached by the default network. Even in the worst case, our hierarchical model structure performed on par with neural networks which did not incorporate our weight prolongation and restriction operators. We leave the question of finding optimal for future work - see Section LABEL:sec:future for further discussion. Finally, we note that the model(s) in the experiments presented in section LABEL:subsec:auto were essentially the same MsANN models (same set of and same set of matrices), and showed similar performance gains on two different machine vision problems, indicating that it may be possible to develop general MsANN model-creation procedures that are applicable to a variety of problems (rather than needing to be hand-tuned).

5 Upper Bounds for Diffusion Term

In this section, we consider two theoretical concerns:

  1. Invariance in Frobenius norm of diffusion term solutions under transformation to a spectral basis; and

  2. Decoupling a prolongation problem between graph products into a sum of prolongation problems of the two sets of graph factors.

We will here rely heavily on various properties of the Kronecker sum and product of matrices which may be found in [17], Section 11.4.

5.1 Invariance of objective function evaluation of P under eigenspace transformation

For the purpose of the calculations in this section, we restrict ourselves to the “diffusion” term of our objective function LABEL:eqn:objfunction (the term which coerces two diffusion processes to agree), which we will write as

(5.1)

Because and are each real and symmetric, they may both be diagonalized as where is a rotation matrix and is a diagonal matrix with the eigenvalues of on the diagonal. Substituting into LABEL:eqn:distancedefn, and letting , we have

(5.2)

where is an orthogonal matrix if and only if is. Since the Frobenius norm is invariant under multiplication by rotation matrices, LABEL:eqn:spectral_form is a re-formulation of our original Laplacian matrix objective function in terms of the spectra of the two graphs. Optimization of this modified form of the objective function subject to orthogonality constraints on is upper-bounded by optimization over matchings of eigenvalues: for any fixed the eigenvalue-matching problem has the same objective function, but our optimization is over all real valued orthogonal . The orthogonality constraint is a relaxed version of the constraints on matching problems (Equation LABEL:eqn:matchingconstraints) discussed in subsection LABEL:subsub:definitions, since matching matrices M are also orthogonal . Many algorithms exist for solving the inner partial and 0-1 constrained minimum-cost assignment problems, such as the Munkres algorithm [28] (also in subsection LABEL:subsub:definitions).

We note three corollaries of the above argument. Namely, because the Frobenius norm is invariant under the mapping to and from eigenspace:

  1. Optimal or near-optimal in eigenvalue-space maintain their optimality through the mapping back to graph-space.

  2. Solutions which are within of the optimum in -space are also within of the optimum in -space; and

  3. More precisely, if they exist, zero-cost eigenvalue matchings correspond exactly with zero-cost .

A natural next question would be why it might be worthwhile to work in the original graph-space, rather than always optimizing this simpler eigenvalue-matching problem instead. In many cases (path graphs, cycle graphs) the spectrum of a member of a graph lineage is a subset of that of , guaranteeing that zero-cost eigenvalue matchings (and thus, by the argument above, prolongations with zero diffusion cost) exist. However, when this is not the case, the above argument only upper bounds the true distance, since the matching problem constraints are more strict. Thus, numerical optimization over , with orthogonality constraints only, may find a better bound on .

5.2 Decomposing Graph Product Prolongations

We next consider the problem of finding optimal prolongations between two graphs and when optimal prolongations are known between and , and and . We show that under some reasonable assumptions, these two prolongation optimizations decouple - we may thus solve them separately and combine the solutions to obtain the optimal prolongations between the two product graphs.

From the definition of graph box product, we have

where is the Kronecker sum of matrices as previously defined. See [12], Item 3.4 for more details on Laplacians of graph products. We calculate

Now we try out the assumption that , which restricts the search space over and may increase the objective function:
Since ,
Thus assuming