Resonator Circuits for factoring
University of California, Berkeley
Berkeley, CA 94702
We describe a type of neural network, called a Resonator Circuit, that factors high-dimensional vectors. Given a composite vector formed by the Hadamard product of several other vectors drawn from a discrete set, a Resonator Circuit can efficiently decompose the composite into these factors. This paper focuses on the case of “bipolar” vectors whose elements are and characterizes the solution quality, stability properties, and speed of Resonator Circuits in comparison to several benchmark optimization methods including Alternating Least Squares, Iterative Soft Thresholding, and Multiplicative Weights. We find that Resonator Circuits substantially outperform these alternative methods by leveraging a combination of powerful nonlinear dynamics and “searching in superposition”, by which we mean that estimates of the correct solution are, at any given time, formed from a weighted superposition of all possible solutions. The considered alternative methods also search in superposition, but the dynamics of Resonator Circuits allow them to strike a more natural balance between exploring the solution space and exploiting local information to drive the network toward probable solutions. Resonator Circuits can be conceptualized as a set of interconnected Hopfield Networks, and this leads to some interesting analysis. In particular, while a Hopfield Network descends an energy function and is guaranteed to converge, a Resonator Circuit is not. However, there exists a high-fidelity regime where Resonator Circuits almost always do converge, and they can solve the factorization problem extremely well. As factorization is central to many aspects of perception and cognition, we believe that Resonator Circuits may bring us a step closer to understanding how this computationally difficult problem is efficiently solved by neural circuits in brains.
One explanation for the brain’s ability to apply learned knowledge in seemingly novel contexts is that it is effective at decomposing complex patterns into ones that are simpler, more fundamental, and that can be recombined in new ways to explain situations never before encountered. That this is a fundamental property of perception and cognition has been pointed out by researchers for some time (see, for instance, Rosenblatt (1962), Barrow and Tenenbaum (1978), Kersten et al. (2004), and Mumford and Desolneux (2010)). When the combination of atomic patterns is via an operation which multiplies them together to generate a composite pattern, we can call the inverse of this process – the task of decomposing the composite pattern into its constituent atoms – factorization. Factorization problems are pervasive in perception, and in order to illustrate their fundamental and diverse nature, we will highlight a few examples of prior work that has grappled with factorization in visual scenes – this paper introduces a new algorithmic and representational tool that may be useful in revisiting some of these problems.
The signal measured by an optical sensor includes the conjunction of many different factors. For instance, luminance at a point on the sensor is a function of physical properties in the scene, such as the reflectance and orientation of object surfaces, as well as the direction and intensity of incident illumination. Barrow and Tenenbaum (1978) hypothesized that specific physical properties might be effectively represented in a series of ‘intrinsic images’, one each for object range, surface reflectance, surface orientation, and incident illumination. Among the many follow-on works, we highlight Adelson and Pentland (1996), which took this idea, specifically the factorization of measured luminance into reflectance and incident illumination, and put it on the firmer conceptual footing of Bayesian statistical inference. At another level of abstraction, one can consider the separation of visual object features defined in some canonical object-based reference frame from the pose of objects in a larger scene. Robust object recognition depends on the ability to separate these two factors. Hinton (1981) and Olshausen et al. (1993) both addressed the problem of inferring and separately representing visual object features from object pose, with a focus on both learning and neural mechanisms by which this computation might be carried out in brains. Related to this problem in static scenes is the factorization of dynamic scenes into object features (also called object ‘form’) and object motion. Cadieu and Olshausen (2012) presented a hierarchical, probabilistic generative model for learning a factorization of object form and motion from natural movies. Within the Computer Vision community, the intrinsic images perspective remains dominant, with many of the more recent attempts relying on deep neural network methods. Unfortunately, these approaches are often brittle, or inefficient, or both. Uncovering factors of variation in rich three-dimensional scenes is still an open problem.
One promising strategy for solving this kind of problem is to learn a mapping into some latent space that makes factorization more tractable for an iterative algorithm such as the one we propose in this work. Our focus for now is on how to solve factorization problems in a latent space of high-dimensional vectors, which are assumed to represent certain structures in the original signal space, and we propose a highly efficient algorithm for doing this. The objective of this paper is therfore to establish the theoretical underpinnings of this new approach – future work will be needed to develop applications to specific percept factorization problems. Early work along these lines (Kent and Olshausen, 2017; Frady et al., 2018a) leverages feedforward neural networks and random projections to generate high-dimensional vector representations of scenes, which can then be factored by Resonator Circuits.
Our main motivation for focusing on structure embedded in a latent space of high-dimensional vectors comes from a family of theories for cognitive representation called Vector Symbolic Algebras. This framework encodes concepts, be they visual scene concepts or language concepts etc., as high-dimensional vectors and defines an algebra over these vectors that can capture the structures necessary for high level cognitive processing, in particular sequences, hierarchy, and binding/multiplication. Inverting the binding operation of Vector Symbolic Algebras involves computing a factorization, and it was in this context that we first encountered the problem studied in this paper.
Building on ideas from Smolensky (Smolensky, 1990), Hinton (Hinton, 1990), and Touretzky (Touretzky, 1990), the first Vector Symbolic Algebra (VSA) was proposed by Plate in 1991 (Plate, 1991, 1995, 2003) under the name Holographic Reduced Representations. It underpins much of the power displayed by modern models under different names, for instance the Semantic Pointer Architecture (Eliasmith et al., 2012) and Holographic Embeddings of knowledge graphs (Nickel et al., 2016). Gayler’s Vector Symbolic Architecture (Gayler, 1998, 2004) and Kanerva’s Binary Spatter Codes (Kanerva, 1996, 2009) are also Vector Symbolic Algebras, which happen to operate with two-state vectors ( or ), and make particular use of element-wise vector multiplication to ‘bind’ factors together. We call vectors drawn from ‘bipolar’, and the multiplication operation between bipolar vectors is the Hadamard Product . This will be our focus for the rest of the paper.
There are a number of recent works establishing the representational power of Gayler and Kanerva’s models, and we highlight three of them here. Complex relations, such as multi-argument predicates, are necessary to solve reasoning tasks like Raven’s Progressive Matrices (Raven et al., 2003). Emruli et al. (2013) showed how these relations can be represented with VSAs, using the vector algebra framework to compute solutions to these challenging problems. This work was built on earlier ideas about computing analogies using the VSA framework (Kanerva, 1998). A completely different application of VSAs was shown by Joshi et al. (2016), who leveraged the vector operations specified by Vector Symbolic Algebras (Multiply, Add, and Permute) to generate novel vector encodings of n-gram structure in natural language and applied this encoding to a -class language identification problem. Their solution, which produced high-accuracy language classifications, was very efficient and produced geometric structure among the vectors reflecting known linguistic relationships between the languages. Another recent application of Vector Symbolic Algebras is in the encoding and classification of EEG, EMG, and ECoG biosignals. Rahimi et al. (2018) surveyed recent work on this topic and proposed a highly efficient and simple processor designed for these workloads.
The applications of Vector Symbolic Algebras are broad and exciting, although more theoretical and implementation work remains for them to reach their full potential. In particular, the result of computation in VSAs is often a composite vector that must be decomposed in order to decode what it actually represents (for instance, a particular n-gram). This commonly involves solving a factorization problem. Until now, there has been no efficient way to factor the composite vectors, representing a major limitation of these models. Our algorithm, called Resonator Circuits, can efficiently solve this factorization problem and makes the use of Vector Symbolic Architecture significantly more practical.
We now give a concrete example of the factorization problem before formalizing the general case in Section 2. Suppose that a visual scene containing a single object is generated by three factors – object shape, reflectance, and pose – each of which is represented by -dimensional bipolar vectors , , and , respectively. For now we sidestep the issue of how to get from pixels to these latent bipolar vectors, but as mentioned above please see Kent and Olshausen (2017), as well as Frady et al. (2018b), for some ideas along these lines. Assume there are ten possible shapes, six possible reflectances, and twenty-five possible poses – we write these possibilities as
The set contains all the possible shapes, contains all the possible reflectances, and contains all the possible poses. In this latent space, the generative model for data is the Hadamard product of three vectors, one from each of these sets. For instance, in a particular scene we might have
which contains the shape, the reflectance, and the pose. The structure of this dataset is in some sense quite simple – there are only sources of variation. Yet, these sources generate possible values for . The generative model may be simple, but the task of inference is more challenging – given only and the sets of possible vectors , , and , we desire a method for determining the correct factorization without having to search over too many of the possible factorizations.
Resonator Circuits do this by leveraging simple and powerful nonlinear dynamics as well as “search in superposition”, a notion that we make precise in the next section. There are in fact many ways to search in superposition, and we introduce a number of them in Section 2 as a way to build up to our model and to gain some deeper understanding of the problem. A Resonator Circuit is defined by equations (15) and (18), each representing two separate variants of the algorithm/network, and is named for the way in which correct factorizations seem to ‘resonate’ out of what is initially an uninformative circuit state. The size of the factorization problem that can be reliably solved, as well as the speed with which solutions are found, will characterize the performance of all the approaches we introduce – in these terms, Resonator Circuits are by far the most effective. Our main results are:
A characterization of solution stability compared to Hopfield Networks (Section 4.1).
The definition of “operational capacity” for a factorization algorithm and comparison of eight different algorithms on the basis of operational capacity (Section 4.2).
The discovery of a scaling law that relates the operational capacity of Resonator Circuits to various parameters of the factorization problem (also Section 4.2).
A theory for why Resonator Circuits perform well on this problem (Section 4.3).
We conclude with some discussion and future directions in Section 5.
2 Statement of the problem
We formalize the factorization problem in the following way: are sets of vectors called ‘codebooks’. The th codebook contains ‘codevectors’
and these vectors all live in . Composite vector is generated by computing the Hadamard product of vectors, one drawn from each of the codebooks .
The factorization problem we wish to study is
In our example from the introduction section, with , , and . Our assumption is generally that there are no repeated codevectors, so that the factorization of into codevectors, one from each codebook, is unique. Then, the total number of composite vectors that can be generated by the codebooks is :
The problem involves searching among possible factorizations to find the one that generates . We will refer to as the search space size, and at some level it captures the difficulty of the problem. The problem size is also affected by , the dimensionality of each vector.
Suppose we were to solve (1) using a brute force strategy. We might form all possible composite vectors from the sets , one at a time, until we generate the vector , which would indicate the appropriate factorization. Assuming no additional information is available, the number of trials taken to find the correct factorization is a uniform random variable and thus . If instead we could easily store all of the composite vectors ahead of time, we could compare them to any new composite vector via a single matrix-vector inner product, which, assuming the factorization is unique, will yield a value of for the correct factorization and values strictly less than for all other factorizations. The matrix containing all possible composite vectors requires bits to store. The core issue is that scales very poorly with the number of factors and number of possible codevectors to be entertained. If ( factors) and ( possible codevectors for each factor), then . In the context of Vector Symbolic Algebras, it is common to have . Therefore, the matrix with all possible composite vectors would require to store. We aspire to solve problems of this size (and much larger), which are clearly out of reach for brute-force approaches. Fortunately, they are solvable using Resonator Circuits.
2.1 Factoring by search in superposition
In our problem formulation (1) the factors interact multiplicatively to form , and this lies at the heart of what makes it hard to solve. One way to attempt a solution is to produce an estimate for each factor in turn, alternating between updates to a single factor on its own, with the others held fixed. In addition, it may make sense to simulatenously entertain all of the vectors in each , in some proportion that reflects our current confidence in each one being part of the correct solution. We call this searching in superposition and it is the general approach we take throughout the paper. What we mean by ‘superposition’ is that the estimate for the th factor, , is given by where is a matrix with each column a vector from . The vector contains the coefficients that define a linear combination of the elements of , and is a function from to , which we will call the activation function. ‘Search’ refers to the method by which we adapt over time. The estimate for each factor leads to an estimate for denoted by :
Suppose is the identity . Then becomes a multilinear function of the coefficients .
While this is a ‘nice’ relationship in the sense that it is linear in each of the coefficients separately (with the others held fixed), it is unfortunately not convex with respect to the coefficients taken all at once. We can rewrite it as a sum of different terms, one for each of the possible factorizations of :
Where ranges from to , ranges from to , and so on. The term in parentheses is a scalar that weights each of the possible Hadamard products. Our estimate is, at any given time, purely a superposition of all the possible factorizations. Moreover, the superposition weights can be approximately recovered from alone by computing the cosine similarity between and the vector . The source of ‘noise’ in this approximation is the fact that will have some nonzero inner product with the other vectors in the sum. When the codevectors are uncorrelated and high-dimensional, this noise is quite small: transparently reflects the proportion with which it contains each of the possible factorizations. When is a nonlinear activation function, for instance the sign function , this property is retained. The vector is no longer an exact superposition, but the scalar can still be decoded from in the same way – the vector is still an approximate superposition of all the possible factorizations, with the weight for each of these determined by the coefficients . This property, that thresholded superpositions retain relative similarity to each of their superimposed components, is heavily relied on throughout Kanerva’s and Gayler’s work on Vector Symbolic Algebras (Kanerva, 1996; Gayler, 1998).
2.2 Search as an optimization problem
The most natural strategy to find the correct factors is to define a loss function which compares the current estimate with the composite that is to be factored, . One can choose a loss function and a constraint set so that the global minimizer of this loss over the constraints yields the correct solution to (1). Let the loss function be and the feasible set for each be . This suggests a fairly generic optimization problem:
the details of which may be heavily dependent on our choices for , , , and the structure of the vectors in each codebook. Different optimization algorithms may be appropriate for this problem, depending on the particular form of (5), and we propose six candidate algorithms in this paper, which we refer to as the “benchmarks”. The dynamics for each of the benchmark algorithms are compiled in Table 1, briefly reviewed in Section 2.3, and discussed at some length in the Appendix. Our algorithm, called a Resonator Circuit, does not take this approach (i.e. defining a loss function which is consistently minimized by the dynamics). Instead, it is best understood as a nonlinear dynamical system whose fixed points are likely to yield the correct factorzation (by design), and whose dynamics approximately move the state towards .
Our notation will often include an explicit dependence on time like so: . Also, we define the vector to be the product of the estimates for the other factors:
This will come up in each of the algorithms under consideration and simplify the notation in what follows. Each of the algorithms considered in this paper updates one factor at a time, with the others held fixed so, at a given time , we will update the factors in order to , although this is a somewhat arbitrary choice. Including time dependence with , we have
which makes explicit that at the time of updating , the factors to have already been updated for this ‘iteration’ while the factors to have yet to be updated.
2.3 Benchmark algorithms
A common thread among all the algorithms that we introduce in this section is the fact that the function is the identity , making a multilinear function of the coefficients, as we discussed above. We consider two straightforward loss functions for comparing and . The first is one half the squared Euclidean norm of the error, , which we will call the squared error for short, and the second is the negative inner product . The squared error is minimized by , which is also true for the negative inner product when is constrained to . Both of these loss functions are convex, meaning that is a convex function of each separately111through the composition of an affine function with a convex function. Some of the benchmark algorithms constrain directly, and when that is the case, our focus is on three different convex sets, namely the simplex , the unit ball , and the closed zero-one hypercube . Therefore, solving (5) with respect to each separately is a convex optimization problem. In the case of the negative inner product loss and simplex constraints , it is a bonafide linear program. The correct factorization is given by such that , which we know to be vectors with a single entry and the rest – these are the standard basis vectors (where if and otherwise). According to the problem statement (1), any algorithm we choose to solve the superposition search must determine initial values for (which we notate as ) using only the vector and the sets . There is no additional information about how to set . A sensible thing to do would be to set each element of to the same value so that each of the potential codevectors is equally weighted in the superposition that forms the initial estimate .
2.3.1 Alternating Least Squares
Alternating Least Squares (ALS) locally minimizes the squared error loss in a fairly straightforward way: for each factor, one at a time, it solves a least squares problem for and updates the current state of the estimate to reflect this new value, then moves onto the next factor and repeats. Formally, the updates given by Alternating Least Squares are:
Alternating Least Squares is an algorithm that features prominently in the tensor decomposition literature (Kolda and Bader, 2009), but while ALS has been successful for a particular type of tensor decomposition, there are a few details which make our problem different from what is normally studied (see Appendix A). The updates in ALS are quite greedy – they exactly solve each least squares subproblem. It may make sense to more gradually modify the coefficients, a strategy that we turn to next.
2.3.2 Gradient-following algorithms
Another natural strategy for solving (5) is to make updates that incorporate the gradient of with respect to the coefficients – each of the next algorithms does this in a particular way. The squared error loss is globally minimized by , so one might be tempted to start from some initial values for the coefficients and make gradient updates . In Appendix B.1 we discuss why this does not work well – the difficulty is in being able to guarantee that the loss function is smooth enough that gradient descent iterates with a fixed stepsize will converge. Instead, the algorithms we apply to the squared error loss utilize a dynamic stepsize.
The global minimizers of (5) are maximally sparse, . If one aims to minimize the squared error loss while loosely constrained to sparse solutions, it may make sense to solve the problem with Iterative Soft Thesholding (ISTA). The dynamics for ISTA are given by equation (1). The dynamic stepsize can be set via backtracking line search or, as we did, by computing the Lipschitz constant of the function gradient. The only free parameter for ISTA is , which effectively sets the sparsity of solutions that are considered. We discuss this, as well as some context and motivation for ISTA, in Appendix C. We also considered Fast Iterative Soft Thesholding (FISTA), an enhancement due to Beck and Teboulle (2009), which utilizes Nesterov’s momentum for accelerating first-order methods in order to alleviate the sometimes slow convergence of ISTA (Bredies and Lorenz, 2008). Dynamics for FISTA are given in equation (1). Another benchmark algorithm we considered was Projected Gradient Descent on the negative inner product loss, where updates were projected onto either the simplex or unit ball (1). A detailed discussion of this approach can be found in Appendix D. Multiplicative Weights is an algorithm that can be applied to either loss function, although we found it worked best on the negative inner product. It very elegantly enforces a simplex constraint on by maintaining a set of auxilliary variables, the ’weights’, which are used to set at each iteration. Both Projected Gradient Descent and Multiplicative Weights have a single hyperparameter, , which sets the stepsize of each iteration. See equation (1) for the dynamics of Multiplicative Weights, as well as Appendix E. The final algorithm that we considered is called Map Seeking Circuits. Map Seeking Circuits are neural networks designed to solve invariant pattern recognition problems using the principle of superposition. Their dynamics are based on the gradient, but are different from what we have introduced so far – see equation (1) and Appendix F.
2.4 Summary of Benchmark Algorithms
The most natural way to approach vector factorization problem (1) is to cast it as a problem of optimization, centered around the minimization of a loss function . We have proposed six algorithms falling within this framework that all utilize the principle of search in superposition (see Section 2.1) but whose dynamics vary substantially. In the Appendix we prove, or refer to results proving, that each one of them converges for all initial conditions. That is, given any starting coefficients , their dynamics reach fixed points which are local minimizers of the loss function. The optimization problem being solved is fundamentally non-convex, so this fact alone may not tell us much about the quality of the local minima that are found in practice. In Section 4 we will establish a standard setting where we measure how likely it is that these local minima are actually global minima. We find that as long as – the size of the search space – is small enough, each of these algorithms can find the global minimizers reliably. The point at which the problem becomes too large to reliably solve is what we call the operational capacity of the algorithm, and will be the main point of comparison with our model, which we introduce next. The benchmarks we have proposed are a subset of the optimization-based approaches we tried for solving the factorization problem. This included experimenting with different activation functions, different constraints, and different dynamics. The algorithms presented in this section represent the strongest benchmarks we could find.
|\lrboxAlgorithm\endlrbox||\lrboxDynamics for updating factor coefficients \endlrbox||\lrboxEq.\endlrbox|
|\lrboxAlternating Least Squares\endlrbox||\lrbox\endlrbox||\lrbox(8)\endlrbox|
|\lrboxIterative Soft Thresholding\endlrbox||\lrbox\endlrbox||\lrbox(9)\endlrbox|
|\lrboxFast Iterative Soft Thresholding\endlrbox||\lrbox\endlrbox||\lrbox(10)\endlrbox|
|\lrboxProjected Gradient Descent\endlrbox||\lrbox\endlrbox||\lrbox(11)\endlrbox|
|\lrboxMap Seeking Circuits\endlrbox||\lrbox\endlrbox||\lrbox(13)\endlrbox|
3 Resonator Circuits
The benchmark algorithms we have so far introduced generate estimates for the factors that move through the interior of the hypercube. Our proposal, a neural network called a Resonator Circuit, does not. It uses the highly nonlinear activation function , which ‘bipolarizes’ inputs to the nearest vertex of the hypercube. It also does not use additive gradient-based updates but rather alternating projections onto the linear subspace spanned by the codevectors. We know the solutions exist at vertices of the hypercube and these points are very special geometrically in the sense that in high dimensions, most of the mass of is concentrated relatively far from the vertices – a fact we will not prove here but that is based on standard results from the study of concentration inequalities (Boucheron et al., 2013). It may be that moving through the interior of the hypercube during the search for a factorization is unwise, and instead the search should be limited to vertices of the hypercube.
It is the combination of projections and binarization that distinguishes our approach. This is not a trivial modification to what we have already covered – adding to the gradient-based algorithms a saturating nonlinearity (either the sign function or something else) does not yield improvement in their performance, and removing the activation function in a Resonator Circuit worsens its performance – it is the combination of these two choices that gives our model its significant advantage in operational capacity, a metric we define and measure in Section 4.2.
3.1 Resonator Circuit with Ordinary Least Squares weights
We will first write down the dynamics for this model and then discuss various algorithmic and neural network interpretations. A Resonator Circuit with Ordinary Least Squares (OLS) weights is given by:
Substituting this into the expression for , we can write
Recall that our definition of is given by (7).
3.1.1 Neural network interpretation
Suppose indicates the state of a population of neurons at time . Each neuron receives an input , modified by synapses modeled as a row of the “weight matrix” . This “synaptic current” is passed through the activation function in order to determine the output, which is either or . If the input to each neuron were instead , its own previous state, then what we would have is a Bipolar Hopfield Network (Hopfield, 1982) with a synaptic learning rule originally proposed by Personnaz, Guyon, and Dreyfus (Personnaz et al., 1986), the so-called “projection” rule. In our case however, rather than being autoassociative, in which is a direct function of , our dynamics are heteroassociative, basing the updates on the states of the other factors. Moreover, we imagine separate subpopulations of neurons which evolve together in time, each one responsible for estimating a different factor of . For now we are just specifying this as a discrete-time network in which updates are made one-at-a-time, but it can be extended as a continuous-valued, continuous-time dynamical system along the same lines as was done for Hopfield Networks (Hopfield, 1984). In that case, we can think about these subpopulations of neurons evolving in a truly parallel way. Connecting the subpopulations together as we have specified has a dramatic effect on the way that evolves compared to a classical Hopfield Network, which we discuss in the next section.
3.1.2 Algorithmic interpretation
The vector is a “suggestion” for based on the current states of the other factors. The matrix , when applied to any -dimensional vector, produces the Ordinary Least Squares projection of that vector onto the linear subspace spanned by the columns of , which is precisely the space within which we must search for a factor. Therefore, is a bipolarized Ordinary Least Squares projection of onto .
Another way of looking at the dynamics of a Resonator Circuit with OLS weights is that it is implementing a bipolarized version of the Alternating Least Squares algorithm we introduced in Section 2.3. Suppose we were to take the dynamics specified in (1) for making ALS updates to , but we also bipolarize the vector . When the estimates are bipolar, the vector is bipolar and we can simplify the term :
The ALS update is now the same as equation (14) – a Resonator Circuit with OLS weights is implementing a bipolarized version of Alternating Least Squares. It may be a bit of a misnomer to call this algorithm Bipolarized Alternating Least Squares because at each iteration it is not solving a least squares problem. To set according to (14) is to take the term present in and to treat the activation function as if it were linear. As a result, the iterates do not always descend a loss function and the dynamics are not guaranteed to converge. It is true that is a fixed point of these dynamics, but this is not particularly comforting given that our starting point will yield an estimate that is likely to be very far from . Unlike Hopfield Networks, which have a Lyapunov function certifying their global asymptotic stability, no such function (that we know of) exists for a Resonator Circuit. We have observed trajectories that collapse to limit cycles and trajectories that do not seem to converge in any reasonable time. There is, however, a large and practical regime of operation where is small enough that these non-converging trajectories are extremely rare. It is fairly simple to deal with these trajectories, making the model very useful in practice despite the lack of a convergence guarantee. It has also been argued in several places (see Van Vreeswijk and Sompolinsky (1996), for example) that cyclic or chaotic trajectories may be useful to a neural system, including in cases where there are multiple plausible states to entertain. This is just to say that we feel the lack of a convergence guarantee is not a critical weakness of our model but rather an interesting and potentially useful characteristic. We attempted many different modifications to the model’s dynamics which would provably cause it to converge, but these changes always hindered its ability to solve the factorization problem. We emphasize that unlike all of the models in Section 2.3, a Resonator Circuit is not descending a loss function. Rather, it makes use of the fact that:
Each iteration is a bipolarized ALS update – it approximately moves the state towards the Least Squares solution for each factor.
The correct solution is a fixed point.
There may be a sizeable ‘basin of attraction’ around this fixed point, which the iterates help us descend.
The number of spurious fixed points (which do not give the correct factorization) is relatively small.
All of these properties we will attempt to establish in Section 4. We believe that this makes the model more capable of exploring the search space (without getting stuck in spurious fixed points) and that it does so by pursuing a relatively intuitive strategy of making ‘noisy’ Alternating Least Squares updates. For this reason, it may be appropriate in the future to compare a Resonator Circuit to algorithms from the literature on stochastic optimization.
3.2 Resonator Circuit with outer product weights
The matrix is , which can be enormous. Implementing a Resonator Circuit on a conventional computer may require that we not store this matrix directly, but rather only store each codebook matrix (which is of size ) and instead compute on the fly. In this case, the computational expense of inverting the matrix may be a concern (this matrix is called the Gram matrix of and it contains the inner products between each of the codevectors). As an aside, if is quite large, it may be sensible to use a subroutine that computes based on the singular value decomposition of . This has the added benefit of being numerically more stable than computing directly. If the codevectors are orthonormal, the Gram matrix is the identity . When is large (roughly speaking ), and the codevectors are chosen randomly i.i.d. from , then they will be very nearly orthogonal and is a very close approximation to . Therefore it may make sense to set the weight matrix according to , instead of . Most readers will be familiar with the weight matrix as the so-called “outer product” learning rule of classical Hopfield Networks (Hopfield, 1982). This has the nice interpretation of Hebbian learning (Hebb, 1949) in which the strength of synapses between any two neurons (represented by this weight matrix) depends solely on their pairwise statistics over some dataset, in this case the codevectors. This yields the dynamics of a Resonator Circuit with outer product weights:
Whenever the codevectors deviate strongly from the approximate-orthogonality assumption, the behavior of the Ordinary Least Squares and outer product variants of Resonator Circuits may strongly differ. One immediate consequence is that the solution is no longer guaranteed to be a fixed point – it is only likely to be one when each is small enough. Other than that, it is hard to say which choice is appropriate for a particular set of codevectors. It may be the case that using weights given by the outer product will produce better trajectories for a particular problem. It is, however, almost always the case that using outer product weights will lead to much faster simulation and this is a clear choice when the codevectors are nearly orthogonal. We focus on this case in most of our results due to its relative analytical tractability and use in standard analysis of both Hopfield Networks and Vector Symbolic Algebras.
Another important note on practical implementation: the matrices consist of entries that are just and , so the only computations necessary to simulate this variant of Resonator Circuits are
Binary comparison and binary accumulation (for computing )
Integer - binary multiplication and integer sum (for computing the vector )
Integer sign (for computing )
Computing hardware with very large binary and integer arithmetic circuits can simulate this model very quickly. As we will discuss in Section 4.6, the factorization can still be done even with significant corruption to the vector . For these reasons, this model (and more generally, distributed cognitive architectures) are a good fit for emerging device nanotechnologies, a topic discussed in more detail in Rahimi et al. (2017).
We present a characterization of Resonator Circuits along three main directions. The first direction is the stability of the solutions , which we relate to the stability of classical Hopfield networks. The second is a fundamental measure of factorization capability we call the “operational capacity”. The third is the speed with which factorizations are found. We argue that the marked difference in factorization performance between Resonator Circuits and the benchmark algorithms lies in the relative scarcity of spurious fixed points enjoyed by Resonator Circuit dynamics.
In each of the simulations in this section we choose codevectors randomly i.i.d. from the discrete uniform distribution over the vertices of the hypercube – each element of each codevector is a Rademacher random variable (assuming the value with probability and with probability ). We generate by choosing one vector at random from each of the codebooks and then computing their Hadamard product. The reason we choose vectors randomly is because it makes the analysis of performance somewhat easier and more standardized, and it is the setting in which most of the well-known results on Hopfield Network capacity apply – we will make a few connections to these results. It is also often the setting in which we use Vector Symbolic Architecture (Gayler, 2004) and therefore these results on random vectors are immediately applicable to a variety of existing works using this model.
4.1 Stable-solution capacity and percolated noise
Suppose for all (we initialize it to the correct factorization; this will also apply to any at which the algorithm comes upon on its own). What is the probability that the state stays there – i.e. that the correct factorization is a fixed point of the dynamics? This is the basis of what researchers have called the “capacity” of Hopfield Networks, where are patterns that the network has been trained to store. We choose to call it the “stable-solution capacity” in order to distinguish it from operational capacity, which we define in Section 4.2. In general one may not care whether the state is completely stable – it may be tolerable that the dynamics flip some small fraction of the bits of as long as it does not move the state too far away from . Amit, Gutfreund, and Sompolinsky (Amit et al., 1985, 1987) established that in Hopfield Networks, an avalanche phenomenon occurs where bitflips accumulate and the network becomes essentially useless for values of , at which point the approximate bitflip probability is . While we don’t attempt any of this complicated analysis on Resonator Circuits, we do derive an expression for the bitflip probability of any particular factor that accounts for bitflips which “percolate” from factor to factor through the vector .
We first note that this analysis is necessary only for Resonator Circuits with outer product weights – Ordinary Least Squares weights guarantee that the solutions are stable. If for all , then factor “sees” an input at time . The vector is exactly by the definition of orthogonal projection. This is true as well for all subsequent factors, meaning that is a fixed point under these updates, which are the dynamics of a Resonator Circuit with Ordinary Least Squares weights. For a Resonator Circuit with outer product weights, we must examine the vector and the situation is significantly more complicated.
At issue is the probability that has a sign different from , i.e. that there is a bitflip in any particular component of the updated state. We start by noting that for factor , this bitflip probability is the same as a Hopfield network. Readers familiar with the literature on Hopfield Networks will know that with and reasonably large (approximately and ) can be well-approximated by a Gaussian with mean and variance ; see appendix G for a simple derivation. This is summarized as the Hopfield bitflip probability :
Where is the cumulative density function of the Normal distribution. Hopfield Networks are often specified with the diagonal of set to all zeros (having “no self-connections”), in which case the bitflip probability is . For large and this is often simplified to , which may be the expression most familiar to readers. Keeping the diagonal of makes the codevectors more stable (see appendix G) and while there are some arguments in favor of eliminating it, we have found Resonator Circuits to exhibit better performance by keeping these terms.
In Appendix G we derive the bitflip probability for an arbitrary factor in a Resonator Circuit with outer product weights. This probability depends on whether a component of the state has already been flipped by the previous factors, which is what we call percolated noise passed between the factors, and which increases the bitflip probability. The four relevant probabilities are:
Equation (20) is the probability of a bitflip compared to the correct value, the Resonator bitflip probability. Equation (21) gives the probability that the next factor will see a net bitflip, a bitflip which has percolated through the previous factors. Equations (22) and (23) give the probability of a bitflip conditioned on whether or not this factor sees a net bitflip, and they are different. It should be obvious that
and also that
We show via straightforward algebra in Appendix G that the conditional probabilities and can be written recursively in terms of :
The fact that the bitflip probability has to be split between these two conditions and that there is a dependence on is what makes this different from the bitflip probability for a Hopfield Network (compare equations (26) and (27) against equation (4.1)). But how much different? How much of an increased bitflip probability does any given factor suffer due to percolated noise from the other factors, compared to how much it would have had under its own dynamics as a Hopfield Network? Suppose there are Hopfield Networks all evolving under their own dynamics (they are not connected together). The probability is the bitflip probability for the th Hopfield Network, whereas the probability is the bitflip probability for the th factor of a Resonating Circuit. We are interested in the quantity . Let us first note that for a Hopfield network with self connections the maximum bitflip probability is , which occurs at . The ratio is what determines the bitflip probability. Please see Appendix G for an explanation. We plot for Hopfield Networks with in Figure 0(a). One can see that the curves all lie on top of one another – each factor has the same bitflip probability. Contrast this against Figure 0(b) where clearly the higher-numbered factors in a Resonator Circuit receive some percolated noise from lower-numbered factors and as a result have an increased bitflip probability. We show the difference between these these two sets of curves in Figure 2.
To see if there is some limiting behavior, we simulated , , and factors; the differences are shown in Figures 3 and 4. In Figure 4 there is clearly a phase change in residual bitflip probability which occurs at . In the Hopfield Network literature this is a very important number. It gives the point at which the codevectors transition away from being global minimizers of the Hopfield Network energy function. When falls in between and , the codevectors are only local minimizers, and there exist spin-glass states that have lower energy. What this result shows is that for , the stability of a Resonator Circuit with outer product weights is the same as the stability of a Hopfield Network. For , percolated noise between the factors causes the Resonator circuit to be strictly less stable than a Hopfield Network. We reiterate that this applies to outer product weights. When Resonator Circuit weights are set by Ordinary Least Squares, the solutions are always stable.
4.2 Operational Capacity
We now define a new, more useful notion of capacity. This performance measure, called the operational capacity, gives an expression for the maximum size of factorization problem that can be solved with high probability. This maximum problem size, which we denote by , varies as a function of the number of elements in each vector and the number of factors . It gives a very practical characterization of performance, and will form the basis of our comparison between Resonator Circuits and the benchmark algorithms we introduced in Section 2.3. When the problem size is below the operational capacity of the algorithm, one can be quite sure that the correct factorization will be found efficiently.
Each algorithm we have introduced attempts to solve the factorization problem (1) by initializing the state of the estimates, , and letting the dynamics evolve until some termination criterion is met. It is possible that the final state of may not equal the correct factors at each and every component, but we can ‘decode’ each by looking for its nearest neighbor (with respect to Hamming distance or cosine similarity) among the vectors in its respective codebook . This distance computation involves only vectors, rather than , which was what we encountered in one of the brute-force strategies of Section 2. Compared to the other computations involved in finding the correct factorization out of total possibilities, this last step of decoding has a very small cost. We define the total accuracy for any particular instantiation of the factorization problem to be the sum of accuracies for inferring each factor – if the nearest neighbor to is the correct factor, then this contributes to the total accuracy. For instance, correctly inferring one of three total factors gives a total accuracy of , two of three is , and three of three is .
Deriving the expected total accuracy appears to be quite challenging, especially for a Resonator Circuit, because it requires that we essentially predict how the nonlinear dynamics will evolve over time. There may be a region around each such that states in this region rapidly converge to , the so-called basin of attraction, but our initial estimate is likely not in the basin of attraction, and it is hard to predict when, if ever, the dynamics will enter this region. Even for Hopfield Networks, which obey much simpler dynamics than a Resonator Circuit, it is known that so-called “frozen noise” is built up in the network state, making the shapes of the basins highly anisotropic and difficult to analyze (Amari and Maginu, 1988). Essentially all of the analytical results on Hopfield Networks consider only the stability of the state as a (very poor) proxy for how the model behaves when it is initialized to other states. This less useful notion of capacity, the stable-solution capacity, was what we examined in the previous section.
We can, however, estimate the total accuracy by simulating many different factorization problems. This is just the fraction of factors that were correctly inferred over many, many trials. We remind the reader that our results in this paper pertain to factorization of randomly-drawn vectors which bear no particular correlational structure, but that notions of total accuracy and operational capacity would be relevant, and specific, to factorization of non-random vectors.
We first note that for fixed vector dimensionality , the empirical mean of the total accuracy depends strongly on , the search space size. We can see this clearly in Figure 5. We show this phenomenon for a Resonator Circuit with outer product weights, but this general behavior is true for all of the algorithms under consideration – one can always make the search space large enough that expected total accuracy goes to zero.
Our notion of operational capacity is concerned with the that causes expected total accuracy to drop below some value . We see here that there are a range of values for which the expected total accuracy is , beyond which this ceases to be the case. For all values of within this range, the algorithm essentially always solves the factorization problem. We now define the most important metric in this paper, the operational capacity of algorithms for solving (1):
The operational capacity of a factorization algorithm that solves (1) is the largest search space size such that the algorithm, when limited to a maximum number of iterations , gives a total accuracy .
In this paper we estimate operational capacity when ( of factors were inferred correctly) and (the model can search over at most of the entire search space). These choices are largely practical: accuracy makes the model very reliable in practice, and this operating point can be estimated from a reasonable number ( to ) of random trials. Setting allows the number of iterations to scale with the size of the problem, but restricts the algorithm to only consider a small fraction of the possible factorizations. While a Resonator Circuit has no guarantee of convergence, it almost always converges in far less than iterations, so long as we stay in this high-accuracy regime.
Operational capacity is in general a function of and , as well as how ‘balanced’ the codebooks are (the degree to which they each contain the same number of codevectors). We will deal with this dependence explicitly when we estimate a scaling law for the operational capacity of Resonator Circuits, a main result of this paper. First, however, we compare the different algorithms under consideration on the basis of operational capacity.
4.2.1 Resonator Circuits have superior operational capacity
We estimated the operational capacity of Alternating Least Squares, Iterative Soft Thresholding, Fast Iterative Soft Thresholding, Projected Gradient Descent, Multiplicative Weights, and Map Seeking Circuits, in addition to the two variants of our algorithm. What is shown in Figure 6 is the operational capacity estimated on several thousand random trials, where we display as a function of for both three-factor problems and four-factor problems. One can see that the operational capacity of Resonator Circuits is between two and three orders of magnitude greater than the operational capacity of the other algorithms. Each of the benchmark algorithms has a slightly different operational capacity (due to the fact that they each have different dynamics and will, in general, find different solutions) but they are all similarly poor compared to the two variants of Resonator Circuit.
As increases to and beyond, the performance difference between the two variants of the Resonator Circuit starts to disappear, ostensibly due to the fact that . The two variants are different in general (and we have found that when the codevectors have significant similarities the Ordinary Least Squares variant performs better), but the simulations in this paper do not particularly highlight the difference between the two.
Except for Alternating Least Squares, each of the benchmark algorithms has at least one hyperperparameter that must be chosen – we simulated many thousand random trials with a variety of hyperparameter settings for each algorithm and chose the hyperparameter values that performed best on average. We list these values for each of the algorithms in the Appendix. Each of the benchmark algorithms converge on their own and the tunable stepsizes make a comparison of the number of iterations non-standardized, so we did not impose a maximum number of iterations on these algorithms – the points shown represent the best the benchmark algorithms can do, even when not restricted to a maximum number of iterations. In fact, we experimented with many algorithms beyond those shown here in an attempt to find a competitive alternative to Resonator Circuits, but were unable to do so.
4.2.2 A scaling law for operational capacity of Resonator Circuits
The operational capacity is a very practical measure of performance, and we set out to estimate how it scales with parameters of the factorization problem. We are concerned with scaling in the limit of large (where cross-talk noise between codevectors is well-approximated as Gaussian, see Section 4.1). The capacity scaling we find does not apply to small values of .
We discovered a powerful and relatively simple scaling law for operational capacity which is illustrated by figure 7. The points are operational capacities estimated over many thousands of random trials and the lines give the best linear fits to a subset of these points – in the limit of large we find that increases as a linear function of . The slope of this relationship is heavily dependent on the number of factors , which we find follows a inverse power-law in (see figure 8). The linear dependence of on implies that scales according to
in the limit of large . The values and are parameters of this linear scaling and depend in general on , which one can see clearly in figure 7. From Figure 8 we estimate the expression for to be . Substituting this into the expression for , we have the main scaling law for operational capacity as a function of and :
Capacity is highest when the codebooks each have the same number of codevectors (), and this was the case for the operational capacity results we have shown so far. We chose this in order to have a simple standard for comparison among the different algorithms, but in general it is possible that the codebooks are unbalanced, so that we have the same but . In this case, capacity is lower than when the codebooks are balanced. We found that the most meaningful way to measure the degree of balance between codebooks was by the ratio of the smallest codebook to the largest codebook:
For we found that the effect on was simply an additive factor which can be absorbed into a (slightly smaller) parameter . It does not appear to affect the value of and therfore the scaling for large is still . For extreme values of , where there is one codebook that is for instance or times larger than another, then both and are reduced, sometimes significantly. It is important to note, however, that the linear dependence of on , and thus the power-law dependence of on , remains.
In summary, equation (28) is a scaling law for the operational capacity that holds for large , values of , and for a moderate degree of imbalance between the codebooks. The operational capacity of Resonator Circuits is expansive and has a strong power-law dependence on the dimensionality of each vector , allowing them to efficiently solve very large vector factorization problems.
4.3 An explanation for differences in operational capacity
We believe that the sizable advantage in operational capacity enjoyed by Resonator Circuits can be partially attributed to the fact that their dynamics have relatively few spurious fixed points (those yielding incorrect factorizations) compared to each of the benchmark algorithms. Among the benchmark algorithms, we focus on Projected Gradient Descent (applied to the negative inner product with the simplex constraint) to illustrate this point. We will show that the correct factorization is always stable under Projected Gradient Descent (as it is with the OLS variant of Resonator Circuits), but that incorrect factorizations are much more likely to be fixed points under Projected Gradient Descent.
4.3.1 Stability of the correct factorization
Coefficients are a fixed point of Projected Gradient Descent dynamics when the gradient at this point is exactly or when it is in the nullspace of the projection operator. We write
to denote this set of points. The nullspace of the projection operator is relatively small on the faces and edges of the simplex, but it becomes somewhat large at the vertices. The nullspace of the projection operator at a vertex of the simplex () is an intersection of halfspaces (each halfspace given by an edge of the simplex). We can compactly write the nullspace at a vertex with the following expression:
An equivalent way to express the nullspace is
In other words, for a vector to be in the nullspace at , the th element of the vector must be the largest by a margin of or more. This condition is met for the vector at the correct factorization, since . This vector has a value for the component corresponding to and values that are for all the other components. Thus, the correct factorization (the solution to (1) and global minimizer of (5)) is always a fixed point under the dynamics of Projected Gradient Descent (PGD).
4.3.2 Stability of incorrect factorizations
Suppose we initialize the algorithm to a factorization which does not solve (1) – we have chosen a set of vectors from the codebooks, but they don’t produce . Based on the assumption that each codevector was chosen randomly, the vector will be a completely random bipolar vector. So long as is not too close to , will be nearly orthogonal to every vector in and its projection onto will be small, with each component equally likely to be positive or negative. Therefore, under the dynamics of a Resonator Circuit with OLS weights, each component will flip its sign compared to with probability and therefore will remain unchanged with probability . This is obviously quite small for of any reasonable size – suboptimal factorizations are very unlikely to be a fixed points. The same is true for a Resonator Circuit with OP weights because has elements that are approximately Gaussian with mean zero – again each component is equally likely to flip or not flip compared to .
Contrast this against Projected Gradient Descent. We recall from (32) that the requirement for to be a fixed point is that the th component of the gradient at this point be largest by a margin of or more. This is a much looser condition than we had for Resonator Circuits – such a scenario will actually occur with probability for any of these arbitrary spurious factorizations. We can see from this argument that with Resonator Circuits, suboptimal factorizations will be fixed points with probability while with Projected Gradient Descent this will occur with probability . This does not directly address the case when is partially correlated with one of the codevectors, but it is fair to say that for a Resonator Circuit, such a state will be a fixed point with a probability in the interval while for Projected Gradient Descent the interval is . Therefore, Projected Gradient Descent is much more likely to exhibit spurious fixed points. While the details differ slightly, our experience indicates that this property is shared by each one of the benchmark algorithms.
4.3.3 Basins of attraction for benchmark algorithms
It may be that while there are sizable basins of attraction around the correct factorization, moving through the interior of the hypercube causes state trajectories to fall into the basin corresponding to a spurious fixed point. In a normal setting for several of the optimization-based approaches, we initialize to be at the center of the simplex, indicating that each of the factorizations is equally likely. Suppose we were to initialize so that it is just slightly nudged toward one of the simplex vertices. We might nudge it toward the correct vertex (the one given by ) or we might nudge it toward any of the other vertices, away from . We can parameterize this with a single scalar and chosen uniformly among the possible vertices:
We ran a simulation with and , at which Projected Gradient Descent and Multiplicative Weights both have a total accuracy around . We created random factorization problems, initializing the state slightly towards the correct factorization and then initializing it slightly towards a randomly chosen spurious factorization. The dynamics were allowed to run until convergence and then the implied factorization was decoded using the nearest-neighbor among the codevectors, as we introduced in Section 4.2.
What Figure 9 shows is that by moving just a small distance toward the correct vertex, we very quickly fall into its basin of attraction. However, moving toward any of the other vertices is actually somewhat likely to take us into a spurious basin of attraction (where the converged state is decoded into an incorrect factorization). The space is full of these bad directions. It would be very lucky indeed to start from the center of the simplex and move immediately toward the solution – it is far more likely that initial updates take us somewhere else in the space, toward one of the other vertices, and this plot shows that these trajectories often get pulled towards a spurious fixed point. What we are demonstrating here is that empirically, the interior of the hypercube is somewhat treacherous from an optimization perspective, and this might lie at the heart of why the benchmark algorithms fail.
4.4 Search efficiency
If a Resonator Circuit is not consistently descending an energy function, is it just aimlessly wandering around the space, trying every possible factorization until it finds the correct one? Figure 10 shows that it is not. We plot the mean