Convolutional Bipartite Attractor Networks
In human perception and cognition, a fundamental operation that brains perform is interpretation: constructing coherent neural states from noisy, incomplete, and intrinsically ambiguous evidence. The problem of interpretation is well matched to an early and often overlooked architecture, the attractor network—a recurrent neural net that performs constraint satisfaction, imputation of missing features, and clean up of noisy data via energy minimization dynamics. We revisit attractor nets in light of modern deep learning methods and propose a convolutional bipartite architecture with a novel training loss, activation function, and connectivity constraints. We tackle larger problems than have been previously explored with attractor nets and demonstrate their potential for image completion and super-resolution. We argue that this architecture is better motivated than ever-deeper feedforward models and is a viable alternative to more costly sampling-based generative methods on a range of supervised and unsupervised tasks.
Under ordinary conditions, human visual perception is quick and accurate. Studying circumstances that give rise to slow or inaccurate perception can help reveal the underlying mechanisms of visual information processing. Recent investigations of occluded (tang2018recurrent) and perceptually challenging (Kar2019) scenes have led to the conclusion that recurrent brain circuits can play a critical role in object recognition. Further, recurrence can improve the classification performance of deep nets (tang2018recurrent; Nayebi2018), specifically for the same images with which humans and animals have the most difficulty (Kar2019).
Recurrent dynamics allow the brain to perform pattern completion, constructing a coherent neural state from noisy, incomplete, and intrinsically ambiguous evidence. This interpretive process is well matched to attractor networks (ANs) (Hopfield1982; Hopfield1984; KrotovHopfield2016; Zemel2001), a class of dynamical neural networks that converge to fixed-point attractor states (Figure 1a). Given evidence in the form of a static input, an AN settles to an asymptotic state—an interpretation or completion—that is as consistent as possible with the evidence and with implicit knowledge embodied in the network connectivity. We show examples from our model in Figure 1b.
ANs have played a pivotal role in characterizing computation in the brain (amit1992; McClelland1981), not only perception (e.g., Sterzer2007), but also language (Stowe2018) and awareness (Mozer2009). We revisit attractor nets in light of modern deep learning methods and propose a convolutional bipartite architecture for pattern completion tasks with a novel training loss, activation function, and connectivity constraints.
2 Background and Related Research
Although ANs have been mostly neglected in the recent literature, attractor-like dynamics can be seen in many models. For example, clustering and denoising autoencoders are used to clean up internal states and improve the robustness of deep models (Liao2016; tang2018recurrent; Lamb2019). In a range of image-processing domains, e.g., denoising, inpainting, and super-resolution, performance gains are realized by constructing deeper and deeper architectures (e.g., Lai2018). State-of-the-art results are often obtained using deep recursive architectures that replicate layers and weights (KimLeeLee2016; Tai2017), effectively implementing an unfolded-in-time recurrent net. This approach is sensible because image processing tasks are fundamentally constraint satisfaction problems: the value of any pixel depends on the values of its neighborhood, and iterative processing is required to converge on mutually consistent activation patterns. Because ANs are specifically designed to address constraint-satisfaction problems, our goal is to re-examine them from a modern deep-learning perspective.
Interest in ANs seems to be narrow for two reasons. First, in both early (Hopfield1982; Hopfield1984) and recent (Li2015; Wu2018a; Wu2018b; Chaudhuri2017) work, ANs are characterized as content-addressable memories: activation vectors are stored and can later be retrieved with only partial information. However, memory retrieval does not well characterize the model’s capabilities: like its probabilistic sibling the Boltzmann machine (Hinton2007; Welling2005), the AN is a general computational architecture for supervised and unsupervised learning. Second, ANs have been limited by training procedures. In Hopfield’s work, ANs are trained with a simple procedure—an outer product (Hebbian) rule—which cannot accommodate hidden units and the representational capacity they provide. Recent explorations have considered stronger training procedures (e.g., Wu2018b; Liao2018); however, as for all recurrent nets, training is complicated by the issue of vanishing/exploding gradients. To facilitate training and increase the computational power of ANs, we propose a set of extensions to the architecture and training procedures.
ANs are related to several popular architectures. Autoencoding models such as the VAE (Kingma2013) and denoising autoencoders (Vincent2008) can be viewed as approximating one step of attractor dynamics, directing the input toward the training data manifold (Alain2012). These models can be applied recursively, though convergence is not guaranteed, nor is improvement in output quality over iterations. Flow-based generative models (FBGMs) (e.g., Dinh2016) are invertible density-estimation models that can map between observations and latent states. Whereas FBGMs require invertibility of mappings, ANs require only a weaker constraint that weights in one direction are the transpose of the weights in the other direction.
Energy-based models (EBMs) are also density-estimation models that learn a mapping from input data to energies and are trained to assign low energy values to the data manifold (LeCun2006; Han2018; Xie2016; Du2019). Whereas AN dynamics are determined by an implicit energy function, the EBM dynamics are driven by optimizing or sampling from an explicit energy function. In the AN, lowering the energy for some states raises it for others, whereas the explicit EBM energy function requires well-chosen negative samples to ensure it discriminates likely from unlikely states. Although the EBM and FBGM seem well suited for synthesis and generation tasks, due to their probabilistic underpinnings, we show that ANs can be used for conditional generation (maximum likelihood completion) tasks.
3 Convolutional Bipartite Attractor Nets
Various types of recurrent nets have been shown to converge to activation fixed points, including fully interconnected networks of asynchronous binary units (Hopfield1982) and networks of continuous-valued units operating in continuous time (Hopfield1984). Most relevant to modern deep learning, Koiran1994 identified convergence conditions for synchronous update of continuous-valued units in discrete time: given a network with state $x \in \mathbb{R}^n$, parallel updates of the full state with the standard activation rule,
$$x^{(t+1)} = f\!\left(W x^{(t)} + b\right), \tag{1}$$
will asymptote at either a fixed point or a limit cycle of length 2. Sufficient conditions for this result are: initial state $x^{(0)} \in [-1,+1]^n$, symmetric weights $W = W^{\top}$, nonnegative self-connections $w_{ii} \ge 0$, and $f$ piecewise continuous and strictly increasing with $|f(\cdot)| \le 1$. The proof is cast in terms of an energy function,
$$E(x) = -\tfrac{1}{2}\, x^{\top} W x - b^{\top} x + \sum_i \mathcal{B}(x_i), \qquad \mathcal{B}(x_i) \equiv \int_0^{x_i} f^{-1}(\xi)\, d\xi . \tag{2}$$
With $f = \tanh$, we have the barrier function:
$$\mathcal{B}(x_i) = x_i \tanh^{-1}(x_i) + \tfrac{1}{2} \ln\!\left(1 - x_i^2\right). \tag{3}$$
To ensure a fixed point (no limit cycle of length greater than 1), asynchronous updates are sufficient because the solution of $\partial E / \partial x_i = 0$ is the standard update for unit $i$ (Equation 1). Because the energy function additively factorizes for units that have no direct connections, parallel updates of these units still ensure non-increasing energy, and hence attainment of a fixed point.
We adopt the bipartite architecture of a stacked restricted Boltzmann machine (Hinton2006), with bidirectional symmetric connections between adjacent layers of units and no connectivity within a layer (Figure 1c). We distinguish between visible layers, which contain inputs and/or outputs of the net, and hidden layers. The bipartite architecture allows for units within a layer to be updated in parallel while guaranteeing strictly non-increasing energy and attainment of a local energy minimum. We thus perform layerwise updating of units, defining one iteration as a sweep from one end of the architecture to the other and back. The 8-step update sequence for the architecture in Figure 1c is shown above the network.
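The layerwise update schedule can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the layer sizes, random seed, weight scale, and clamping policy below are our own choices. Units within a layer are updated in parallel from their neighboring layers (with transposed weights in the top-down direction), and layers are swept from one end of the architecture to the other and back; the visible layer is held clamped, as in the hard-constraint scenario.

```python
import numpy as np

# Toy bipartite attractor net: visible layer + two hidden layers (sizes ours).
rng = np.random.default_rng(0)
sizes = [6, 4, 3]                        # visible, hidden 1, hidden 2
W = [0.1 * rng.standard_normal((sizes[l], sizes[l + 1])) for l in range(2)]
x = [np.zeros(n) for n in sizes]
x[0] = rng.uniform(-0.9, 0.9, sizes[0])  # clamped evidence on the visible layer

def update_hidden(l):
    """Parallel tanh update of hidden layer l from its neighboring layers."""
    net = x[l - 1] @ W[l - 1]            # bottom-up input
    if l + 1 < len(sizes):
        net += x[l + 1] @ W[l].T         # top-down input via transposed weights
    x[l] = np.tanh(net)

def sweep():
    for l in [1, 2, 1]:                  # up and back down; visible stays clamped
        update_hidden(l)

for _ in range(50):
    sweep()
prev = [v.copy() for v in x]
sweep()                                  # one extra sweep: state should be stable
drift = max(np.max(np.abs(a - b)) for a, b in zip(prev, x))
print(drift)
```

With symmetric (transposed) weights, the drift after settling is vanishingly small, consistent with convergence to a fixed point; dropping the transpose (as in the CBAN-asym variant discussed later) removes this guarantee.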
3.1 Convolutional Weight Constraints
Weight constraints required for convergence can be achieved within a convolutional architecture as well (Figure 1d). In a feedforward convolutional architecture, the connectivity from layer $l$ to $l+1$ is represented by weights $w^{(l)}_{jk\delta_r\delta_c}$, where $j$ and $k$ are channel indices in the destination ($l+1$) and source ($l$) layers, respectively, and $\delta_r$ and $\delta_c$ specify the relative coordinate within the kernel, such that the weight modulates the input to the unit in layer $l+1$, channel $j$, absolute position $(r,c)$—denoted $x^{(l+1)}_{jrc}$—from the unit $x^{(l)}_{k, r+\delta_r, c+\delta_c}$. If $\widetilde{w}^{(l)}_{kj\delta_r\delta_c}$ denotes the reverse weights to channel $k$ in layer $l$ from channel $j$ in layer $l+1$, symmetry requires that
$$\widetilde{w}^{(l)}_{kj\delta_r\delta_c} = w^{(l)}_{jk,-\delta_r,-\delta_c}. \tag{4}$$
This follows from the fact that the weights are translation invariant: the reverse mapping from $x^{(l+1)}_{jrc}$ to $x^{(l)}_{k, r+\delta_r, c+\delta_c}$ has the same weight as the forward mapping from $x^{(l)}_{k, r+\delta_r, c+\delta_c}$ to $x^{(l+1)}_{jrc}$, embodied in Equation 4. Implementation of the weight constraint is simple: $w$ is unconstrained, and $\widetilde{w}$ is obtained by transposing the first two tensor dimensions of $w$ and flipping the indices of the last two.
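The transpose-and-flip construction is mechanical; a short sketch makes it concrete (the kernel shapes below are arbitrary examples, not the paper's):

```python
import numpy as np

# Forward kernel in (out_channels, in_channels, kernel_rows, kernel_cols) layout.
rng = np.random.default_rng(1)
w_fwd = rng.standard_normal((8, 3, 5, 5))

# Reverse weights: transpose the first two tensor dimensions (swap channel
# roles) and flip the indices of the last two (negate the spatial offsets).
w_rev = np.transpose(w_fwd, (1, 0, 2, 3))[:, :, ::-1, ::-1]

# Spot-check the symmetry condition: for kernel index (i_r, i_c), the flipped
# index is (K-1-i_r, K-1-i_c), which realizes the offset negation.
j, k, dr, dc = 2, 1, 0, 3
assert w_rev[k, j, 4 - dr, 4 - dc] == w_fwd[j, k, dr, dc]

# Applying the construction twice recovers the original kernel.
assert np.allclose(np.transpose(w_rev, (1, 0, 2, 3))[:, :, ::-1, ::-1], w_fwd)
```

In practice only `w_fwd` is a trainable parameter; `w_rev` is derived from it on the fly, so symmetry holds by construction at every training step.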
The convolutional bipartite architecture has energy function
$$E = -\sum_{l} x^{(l+1)} \bullet \left( w^{(l)} * x^{(l)} \right) - \sum_{l} \sum_{j,r,c} b^{(l)}_{j}\, x^{(l)}_{jrc} + \sum_{l} \sum_{j,r,c} \mathcal{B}\!\left(x^{(l)}_{jrc}\right),$$
where $x^{(l)}$ is the activation in layer $l$, $b^{(l)}_j$ are the channel biases, $\mathcal{B}$ is the barrier function (Equation 3), '$*$' is the convolution operator, and '$\bullet$' is the element-wise sum of the Hadamard product of tensors. The factor of $\frac{1}{2}$ ordinarily found in energy functions is not present in the first term because, in contrast to Equation 2, each second-order term in $E$ appears only once. For a similar formulation in stacked restricted Boltzmann machines, see Lee2009.
3.2 Loss Functions
Evidence provided to the CBAN consists of activation constraints on a subset of the visible units. The CBAN is trained to fill-in or complete the activation pattern over the visible state. The manner in which evidence constrains activations depends on the nature of the evidence. In a scenario where all features are present but potentially noisy, one should treat them as soft constraints that can be overridden by the model; in a scenario where the evidence features are reliable but other features are entirely missing, one should treat the evidence as hard constraints.
We have focused on this latter scenario in our simulations, although we discuss the use of soft constraints in Appendix A. For a hard constraint, we clamp the visible units to the value of the evidence, meaning that activation is set to the observed value and not allowed to change. Energy is minimized conditioned on the clamped values. One extension to clamping is to replicate all visible units and designate one set as input, clamped to the evidence, and one set as output, which serves as the network read out. We considered using the evidence to initialize the visible state, but initialization is inadequate to anchor the visible state and it wanders. We also considered using the evidence as a fixed bias on the input to the visible state, but redundancy of the bias and top-down signals from the hidden layer can prevent the CBAN from achieving the desired activations.
An obvious loss function is squared error, $\mathcal{L} = \sum_i (v_i - t_i)^2$, where $i$ is an index over visible units, $v$ is the visible state, and $t$ is the target visible state. However, this loss misses out on a key source of error: the clamped units have zero error under it. Consequently, we replace $v_i$ with $\tilde{v}_i$, the value that unit $i$ would take were it unclamped, i.e., free to take on a value consistent with the hidden units driving it:
$$\tilde{\mathcal{L}} = \sum_i \left( \tilde{v}_i - t_i \right)^2 .$$
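The distinction between the two losses can be illustrated with a small sketch (layer sizes, weights, and the clamping pattern below are our own toy choices). The key point is that clamped units contribute zero error under the plain squared loss, while the unclamped-value loss scores every visible unit by what the hidden state would drive it to:

```python
import numpy as np

rng = np.random.default_rng(2)
n_vis, n_hid = 6, 4
W = 0.5 * rng.standard_normal((n_vis, n_hid))   # visible-hidden weights
h = np.tanh(rng.standard_normal(n_hid))          # some hidden configuration
target = np.tanh(rng.standard_normal(n_vis))     # target visible pattern

clamped = np.array([1, 1, 1, 0, 0, 0], dtype=bool)  # first half is evidence
v_free = np.tanh(W @ h)                 # values the hidden state would produce
v = np.where(clamped, target, v_free)   # actual visible state (evidence pinned)

plain_loss = np.sum((v - target) ** 2)       # clamped units contribute zero
free_loss = np.sum((v_free - target) ** 2)   # clamped units contribute too
assert plain_loss <= free_loss               # the free loss sees strictly more error
```

Training on the free-value loss penalizes the network whenever the hidden state fails to regenerate the evidence it was given, a signal the plain loss discards.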
An alternative loss, related to the contrastive loss of the Boltzmann machine (see Appendix B), explicitly aims to ensure that the energy of the current state is higher than that of the target state. With $s^+$ being the complete state with all visible units clamped at their target values and the hidden units in some configuration $h$, and $s^-$ being the complete state with the visible units unclamped, one can define the loss
$$\mathcal{L}_E = E(s^+) - E(s^-).$$
We apply this loss by allowing the net to iterate for some number of steps given a partially clamped input, yielding a hidden state $h$ that is a plausible candidate to generate the target visible state. Note that the clamped visible component of $s^+$ is constant, and although it does not factor into the gradient computation, it helps interpret $\mathcal{L}_E$: when $\mathcal{L}_E = 0$, $E(s^+) = E(s^-)$. This loss is curious in that it is a function not just of the visible state, but, through the energy terms, it directly depends on the hidden state in the adjacent layer and the weights between these layers. A variant on $\mathcal{L}_E$ is based on the observation that the goal of training is only to make the two energies equal, suggesting a soft hinge loss:
$$\mathcal{L}_h = \ln\!\left( 1 + e^{\mathcal{L}_E} \right).$$
Both energy-based losses have an interpretation under the Boltzmann distribution: $\mathcal{L}_E$ is related to the log conditional likelihood ratio of the clamped to unclamped visible state, and $\mathcal{L}_h$ is related to the conditional probability of the clamped versus unclamped visible state:
$$\mathcal{L}_h = -\ln \frac{e^{-E(s^+)/T}}{e^{-E(s^+)/T} + e^{-E(s^-)/T}} = \ln\!\left( 1 + e^{\left(E(s^+) - E(s^-)\right)/T} \right).$$
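The qualitative difference between the raw energy difference and the soft-hinge version can be seen with two stand-in scalar energies (the values below are arbitrary; only the shapes of the two losses matter):

```python
import numpy as np

def loss_diff(E_plus, E_minus):
    """Raw difference of energies: keeps pushing no matter how far apart."""
    return E_plus - E_minus

def loss_hinge(E_plus, E_minus, T=1.0):
    """Softplus of the scaled difference: the negative log Boltzmann
    probability of the clamped state in a two-way choice. Saturates near
    zero once the clamped state's energy is well below the unclamped one's."""
    return np.log1p(np.exp((E_plus - E_minus) / T))

# Once the target (clamped) state already has much lower energy, the hinge
# loss is near zero and stops pushing, whereas the raw difference grows
# without bound and keeps distorting the energy landscape.
assert loss_hinge(-10.0, 0.0) < 1e-4
assert loss_diff(-10.0, 0.0) == -10.0
```

This is the sense in which the hinge variant only asks for the two energies to cross, rather than to diverge indefinitely.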
3.3 Preventing vanishing/exploding gradients
Although gradient descent is a more powerful method to train the CBAN than Hopfield's Hebb rule or the Boltzmann machine's contrastive loss, vanishing and exploding gradients are a concern as with any recurrent net (Hochreiter2001), particularly in the CBAN, which may take 50 steps to fully relax. We address the gradient issue in two ways: through intermediate training signals and through a soft sigmoid activation function.
The aim of the CBAN is to produce a stable interpretation asymptotically. The appropriate way to achieve this is to apply the loss once activation converges. However, the loss can be applied prior to convergence as well, essentially training the net to achieve convergence as quickly as possible, while also introducing loss gradients deep inside the unrolled net. Assume a stability criterion that determines the iteration $\tau$ at which the net has effectively converged:
$$\tau = \min \left\{ t : \max_i \left| x_i^{(t)} - x_i^{(t-1)} \right| < \theta \right\},$$
for some small threshold $\theta$.
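A stability criterion of this kind is easy to state in code. The threshold and the toy contracting trajectory below are illustrative (the appendix describes using a per-unit change threshold of 0.01 in the simulations):

```python
import numpy as np

def converged(x_prev, x_curr, theta=0.01):
    """True once the largest per-unit change across one sweep is below theta."""
    return np.max(np.abs(x_curr - x_prev)) < theta

# Toy trajectory contracting toward a fixed point at 0.5, halving the
# remaining distance on every step.
x = np.zeros(4)
steps = 0
while True:
    x_next = 0.5 + 0.5 * (x - 0.5)
    steps += 1
    if converged(x, x_next):
        break
    x = x_next
print(steps)
```

The iteration at which this test first fires is the boundary between the transient and stationary training phases discussed next.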
Training can be logically separated into pre- and post-convergence phases, which we will refer to as transient and stationary. In the stationary phase, the Almeida/Pineda algorithm (Pineda1987; Almeida1987) leverages the fact that activation is constant over iterations, permitting a computationally efficient gradient calculation with low memory requirements. In the transient phase, the loss can be injected at each step, which is exactly the temporal-difference method TD(1) (Sutton1988). Casting training as temporal-difference learning, one might consider other values of $\lambda$ in TD($\lambda$); for example, TD(0) trains the model to predict the visible state at the next time step, encouraging the model to reach the target state as quickly as feasible while not penalizing it for being unable to get to the target immediately.
Any of the losses (the unclamped squared error and the two energy-based losses) can be applied with a weighted mixture of training in the stationary and transient phases. Although we do not report systematic experiments in this article, we consistently find that transient training with TD(1) is as efficient and effective as weighted mixtures including stationary-phase-only training, and that the unclamped squared-error loss outperforms the energy-based losses. In our results, we thus conduct simulations with transient-phase training and the unclamped squared-error loss.
We propose a second method of avoiding vanishing gradients specifically due to sigmoidal activation functions: a leaky sigmoid, analogous to a leaky ReLU, which allows gradients to propagate through the net more freely. Outside a central interval, the leaky sigmoid is piecewise linear with slope $m$ rather than flat; the corresponding barrier function is obtained, as before, by integrating the inverse activation function.
The parameter $m$ specifies the slope of the piecewise linear function outside the interval. As $m \to 0$, loss gradients become flat and the CBAN fails to train well. As $m$ grows, activation magnitudes can blow up and the CBAN fails to reach a fixed point. In Appendix C, we show that convergence to a fixed point is guaranteed when $m$ is below a bound determined by the weight magnitudes. In practice, we have found that restricting $m$ in this manner is unnecessary and a small constant slope works well.
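One hypothetical realization of such an activation (we do not reproduce the paper's exact functional form) is a hard-tanh-style function that is linear on $[-1, 1]$ and linear with small slope $m$ outside it, so the function is strictly increasing and its derivative never falls below $m$:

```python
import numpy as np

def leaky_sigmoid(eta, m=0.05):
    """Identity on [-1, 1]; slope-m linear segments outside. Strictly
    increasing for m > 0, so it satisfies the convergence conditions,
    and its derivative is bounded below by m, so gradients cannot
    fully vanish in saturation."""
    core = np.clip(eta, -1.0, 1.0)
    return core + m * (eta - core)

def leaky_sigmoid_grad(eta, m=0.05):
    return np.where(np.abs(eta) <= 1.0, 1.0, m)

assert leaky_sigmoid(0.5) == 0.5                   # identity inside the interval
assert abs(leaky_sigmoid(3.0) - 1.1) < 1e-12       # 1.0 + 0.05 * (3.0 - 1.0)
assert leaky_sigmoid_grad(10.0) == 0.05            # gradient bounded below by m
assert (1.0 - np.tanh(10.0) ** 2) < 1e-8           # tanh gradient, by contrast
```

The contrast with tanh in the last line is the motivation: deep in saturation, the tanh derivative underflows toward zero, while the leaky variant still passes a fraction $m$ of the gradient.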
We report on a series of simulation studies of increasing complexity. First, we explore a fully connected bipartite attractor net (FBAN) on a bar imputation task and then on supervised MNIST image completion and classification. Second, we apply the CBAN to unsupervised image completion tasks on Omniglot and CIFAR-10 and compare the CBAN to CBAN variants and denoising VAEs. Lastly, we adapt the CBAN to the task of super-resolution and report promising results against competing models, such as DRCN and LapSRN. Details of architectures, parameters, and training are in Appendix D.
4.1 Bar Task
We studied a simple inference task on partial images that have exactly one correct interpretation. Images are binary $5 \times 5$ pixel arrays consisting of two horizontal bars or two vertical bars. Twenty distinct images exist, shown in the top row of Figure 2. A subset of pixels is provided as evidence; examples are shown in the bottom row of Figure 2. The task is to fill in the masked pixels. Evidence is generated such that only one consistent completion exists. In some cases, a bar must be inferred without any white pixels as evidence (e.g., second column from the right). In other cases, the local evidence is consistent with both vertical and horizontal bars (e.g., first column from the left).
An FBAN with one layer of 50 hidden units is sufficient for the task. Evidence is generated randomly on each trial; evaluating on 10k random states after training, the model is 99.995% correct. The middle row in Figure 2 shows the FBAN response after one iteration. The net comes close to performing the task in a single shot, and after a second iteration of clean-up it reaches the asymptotic state, matching the targets shown in the top row.
Figure 3 shows some visible-hidden weights learned by the FBAN. Each array depicts the weights to/from one hidden unit. Weight sign and magnitude are indicated by the coloring and area of each square, respectively. Units appear to select one row and one column, either with the same or opposite polarity. Same-polarity weights within a row or column induce coherence among pixels. Opposite-polarity weights between a row and a column allow the pixel at their intersection to activate either the row or the column, depending on the sign of the unit's activation.
4.2 Supervised MNIST
We trained an FBAN with two hidden layers on a supervised version of MNIST in which the visible state consists of a $28 \times 28$ array for an MNIST digit and an additional 28-unit vector to code the class label. For the sake of graphical convenience, we allocate 28 units to the label, using the first 20 by redundantly coding the class label in pairs of units, and ignoring the final 8 units. Our architecture had 812 inputs, 200 units in the first hidden layer, and 50 units in the second. During training, all bits of the label were masked as well as one-third of image pixels. The image was masked with thresholded Perlin coherent noise (Perlin1985), which produces missing patches that are far more difficult to fill in than the isolated pixels produced by Bernoulli masking.
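The redundant label coding can be sketched as follows. The pairing of adjacent units and the $\pm 1$ on/off values are our illustrative assumptions (the text specifies only that the first 20 of 28 units code the 10 classes in pairs and the final 8 are unused):

```python
import numpy as np

def encode_label(digit, on=1.0, off=-1.0):
    """28-unit label row: class coded redundantly in a pair of units."""
    row = np.full(28, off)
    row[2 * digit] = on
    row[2 * digit + 1] = on
    return row

def decode_label(row):
    """Read out the class as the pair with the largest summed activation."""
    pair_sums = row[:20].reshape(10, 2).sum(axis=1)
    return int(np.argmax(pair_sums))

# Round-trip check over all ten classes.
for d in range(10):
    assert decode_label(encode_label(d)) == d
```

Because the label units are just 28 more visible units, classifying an image is the same operation as filling in any other missing pixels.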
The third row of Figure 4 shows the evidence provided to the FBAN for 20 random test set items. The red masks indicate unobserved pixels; the other pixels are clamped in the visible state. The unobserved pixels include those representing the class label, coded in the bottom row of the pixel array. The top row of the figure shows the target visible representation, with class labels indicated by the isolated white pixels. Even though the training loss treats all pixels as equivalent, the FBAN does learn to classify unlabeled images. On the test set, the model achieves a classification accuracy of 87.5% on Perlin-masked test images and 89.9% on noise-free test images. Note that the 20 pixels indicating class membership are no different from any other missing pixels in the input. The model learns to classify by virtue of the systematic relationship between images and labels. We can also train the model with fully observed images and fully unobserved labels, and its performance is then like that of any fully connected MNIST classifier, achieving an accuracy of 98.5%.
The FBAN does an excellent job of filling in missing features in Figure 4 and in further examples in Appendix E. The FBAN's interpretations of the input seem to be respectable in comparison to other recent recurrent associative memory models (Figures 8a,b). We mean no disrespect to other research efforts—which have very different foci than ours—but merely wish to indicate we are obtaining state-of-the-art results for associative memory models. Figure 8c shows some weights between visible and hidden units. Note that the weights link image pixels with multiple digit labels. These weights stand apart from the usual hidden representations found in feedforward classification networks.
4.3 Unsupervised Omniglot
We trained a CBAN on the Omniglot images (omniglot). Omniglot consists of multiple instances of 1623 characters from 50 different alphabets. The CBAN has one visible layer containing the character image and three successive hidden layers, with average pooling between the layers; filter sizes and other network parameters and training details are presented in Appendix D. To experiment with a different type of masking, we used random square patches of side length 3–6 pixels, which remove on average roughly 30% of the white pixels in the image.
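The patch masking can be sketched as follows. The image size, patch count, and placement policy below are our illustrative assumptions; the text specifies only the 3–6 side lengths and the roughly 30% coverage:

```python
import numpy as np

def patch_mask(shape, n_patches, rng):
    """Mark a few randomly placed square patches (side 3-6) as unobserved."""
    mask = np.zeros(shape, dtype=bool)          # True = unobserved pixel
    for _ in range(n_patches):
        side = rng.integers(3, 7)               # side length in {3, 4, 5, 6}
        r = rng.integers(0, shape[0] - side + 1)
        c = rng.integers(0, shape[1] - side + 1)
        mask[r:r + side, c:c + side] = True
    return mask

rng = np.random.default_rng(0)
m = patch_mask((32, 32), 4, rng)
assert m.shape == (32, 32)
assert 0 < m.sum() <= 4 * 36                    # at most four 6x6 patches
```

Masked pixels are left unclamped, so the net must fill them in from the surrounding evidence and its learned stroke structure.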
We compared our CBAN to variants with critical properties removed: one without weight symmetry (CBAN-asym) and one in which the TD(1) training procedure is replaced by a standard squared loss applied only at the final step (CBAN-noTD). We also compare to a convolutional denoising VAE (CD-VAE), which takes the masked image as input and outputs the completion. The CBAN with symmetric weights reaches a fixed point, whereas CBAN-asym appears to attain limit cycles of 2–10 iterations. Qualitatively, CBAN produces the best image reconstructions (Figure 5). CBAN-asym and CBAN-noTD tend to hallucinate additional strokes, and CBAN-noTD and CD-VAE produce less crisp edges. Quantitatively, we assess models with two measures of reconstruction quality, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM; ssim); larger is better on each measure. CBAN is strictly superior to the alternatives on both measures (Figure 5, right panel). CBAN completions are not merely memorized instances; the CBAN has learned structural regularities of the images, allowing it to fill in big gaps in images that—with the missing pixels—are typically uninterpretable by both classifiers and humans. Additional CBAN image completion examples for Omniglot can be found in Appendix E.
4.4 Unsupervised CIFAR-10
We trained a CBAN with one visible and three hidden layers on CIFAR-10 images. The visible layer is the size of the input image, $32 \times 32 \times 3$; the hidden layers again use average pooling between them, with further details of the architecture and training in Appendix D. Figure 6 shows qualitative and quantitative comparisons of alternative models. Here, CBAN-asym performs about the same as CBAN. However, CBAN-asym typically attains bi-phasic limit cycles, and it sometimes produces splotchy artifacts in background regions (e.g., third image from the left). CBAN-noTD and the CD-VAE are clearly inferior to CBAN. Additional CBAN image completions can be found in Appendix E.
4.5 Super-Resolution
Deep learning models have proliferated in many domains of image processing, perhaps none more than image super-resolution, which is concerned with recovering a high-resolution image from a low-resolution image. Many specialized architectures have been developed, and although common test data sets exist, comparisons are not as simple as one would hope due to subtle differences in methodology. (For example, even the baseline method, bicubic interpolation, yields different results depending on the implementation.) We set out to explore the feasibility of using CBANs for super-resolution. Our architecture processes color image patches, and the visible state includes both the low- and high-resolution images, with the low-resolution version clamped and the high-resolution version read out from the net. Details can be found in Appendix D.
Table 1 presents two measures of performance, SSIM and PSNR, for the CBAN and various published alternatives. CBAN beats the baseline, bicubic interpolation, on both measures, and performs well on SSIM against some leading contenders (even beating LapSRN and DRCN on Set14 and Urban100), but poorly on PSNR. It is common for PSNR and SSIM to be in opposition: SSIM rewards crisp edges, PSNR rewards averaging toward the mean. The border sharpening and contrast enhancement that produce good perceptual quality and a high SSIM score (see Figure 7) are due to the fact that CBAN comes to an interpretation of the images: it imposes edges and textures in order to make the features mutually consistent. We believe that CBAN warrants further investigation for super-resolution; regardless of whether it becomes the winner in this competitive field, one can argue that it is performing a different type of computation than feedforward models like LapSRN and DRCN.
Table 1: PSNR / SSIM on standard super-resolution benchmark sets.

| Model | Set5 | Set14 | BSD100 | Urban100 |
| --- | --- | --- | --- | --- |
| Bicubic (baseline) | 32.21 / 0.921 | 29.21 / 0.911 | 28.67 / 0.810 | 25.63 / 0.827 |
| DRCN (KimLeeLee2016) | 37.63 / 0.959 | 32.94 / 0.913 | 31.85 / 0.894 | 30.76 / 0.913 |
| LapSRN (Lai2018) | 37.52 / 0.959 | 33.08 / 0.913 | 31.80 / 0.895 | 30.41 / 0.910 |
| CBAN (ours) | 34.18 / 0.947 | 30.79 / 0.953 | 30.12 / 0.872 | 27.49 / 0.915 |
In comparison to recent published results on image completion with attractor networks, our CBAN is far more impressive (see Appendix, Figure 8, for a contrast). The computational cost and challenge of training CBANs is no greater than those of training deep feedforward nets. CBANs seem to produce crisp images, on par with those produced by generative (e.g., energy- and flow-based) models. CBANs have potential to be applied in many contexts involving data interpretation, with the virtue that the computational resources they bring to bear on a task are dynamic and dependent on the difficulty of interpreting a given input. Although this article has focused on convolutional networks that have attractor dynamics between levels of representation, we have recently recognized the value of architectures that are fundamentally feedforward with attractor dynamics within a level. Our current research explores this variant of the CBAN as a biologically plausible account of intralaminar lateral inhibition.
Appendix A Using evidence
The CBAN is probed with an observation—a constraint on the activation of a subset of visible units. For any visible unit, we must specify how an observation is used to constrain activation. The possibilities include:
The unit is clamped, meaning that the unit activation is set to the observed value and is not allowed to change. Convergence is still guaranteed, and the energy is minimized conditional on the clamped value. However, clamping a unit has the disadvantage that any error signal back propagated to the unit will be lost (because changing the unit’s input does not change its output).
The unit is initialized to the observed value, instead of 0. This scheme has the disadvantage that activation dynamics can cause the network to wander away from the observed state. This problem occurs in practice and the consequences are so severe it is not a viable approach.
In principle, we might try an activation rule that sets the visible unit's activation to be a convex combination of the observed value and the value that would be obtained via activation dynamics: $v_i \leftarrow \alpha\, o_i + (1 - \alpha)\, f(\mathrm{net}_i)$, where $o_i$ is the observed value. With $\alpha = 1$ this is simply the clamping scheme; with $\alpha = 0$ and an appropriate start state, this is just the initialization scheme.
The unit has an external bias proportional to the observation. In this scenario, the net input to a visible unit is
$$\mathrm{net}_i = \sum_j w_{ij} h_j + b_i + \beta\, o_i,$$
where $\beta > 0$ scales the strength of the observation and $h$ is the adjacent hidden layer. The initial activation can be either 0 or the observation. One concern with this scheme is that the ideal input to a unit will depend on whether or not the unit has this additional bias. For this reason, the magnitude of the bias should probably be small; however, in order to have an impact, the bias must be large enough to influence the settling process.
We might replicate all visible units and designate one set for input (clamped) and one set for output (unclamped). The input is clamped to the observation (which may be zero). The output is allowed to settle. The hidden layer(s) would synchronize the inputs and outputs, and this scheme could handle noisy inputs, which isn't possible with clamping. Essentially, the input would serve as a bias, but on the hidden units, not on the visible units directly.
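The convex-combination probing rule above reduces to the other schemes at its endpoints, which a two-line sketch makes explicit (names are illustrative):

```python
def probe(observed, dynamic, alpha):
    """Mix the observed value with the value activation dynamics would
    produce. alpha=1 recovers clamping; alpha=0 (with a matching start
    state) recovers the initialization scheme."""
    return alpha * observed + (1.0 - alpha) * dynamic

assert probe(0.8, -0.3, 1.0) == 0.8                 # pure clamping
assert probe(0.8, -0.3, 0.0) == -0.3                # pure dynamics
assert abs(probe(0.8, -0.3, 0.5) - 0.25) < 1e-12    # intermediate mixing
```

Intermediate values of `alpha` interpolate between anchoring the state to the evidence and letting the dynamics override it.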
In practice, we have found that external biases work but are not as effective as clamping. Partial clamping with an intermediate mixing coefficient is only partially effective relative to full clamping. Initialization is not effective; the state wanders from the initialized values. However, the replicated-visible scheme seems very promising and should be explored further.
Appendix B Loss functions
The training procedure for a Boltzmann machine aims to maximize the likelihood of the training data, which consist of a set of observations over the visible units. The complete states $s$ in a Boltzmann machine occur with probabilities specified by
$$P(s) = \frac{e^{-E(s)/T}}{\sum_{s'} e^{-E(s')/T}}, \tag{7}$$
where $T$ is a computational temperature, and the likelihood of a visible state is obtained by marginalizing over the hidden states. Raising the likelihood of a visible state is achieved by lowering its energy.
The Boltzmann machine learning algorithm has a contrastive loss: it tries to minimize the energy of states with the visible units clamped to training observations and maximize the energy of states with the visible units unclamped and free to take on whatever values they want. This contrastive loss is an example of an energy-based loss, which expresses the training objective in terms of the network energies.
In our model, we will define an energy-based loss via matched pairs of states: $s^+$ is a state with the visible units clamped to observed values, and $s^-$ is a state in which the visible units are unclamped, i.e., they are free to take on values consistent with the hidden units driving them. Although $s^-$ could be any unclamped state, it will be most useful for training if it is related to $s^+$ (i.e., if it is a good point of contrast). To achieve this relationship, we propose to compute pairs by:
Clamp some portion of the visible units with a training example.
Run the net to some iteration, at which point the full hidden state is $h$. (The point of this step is to identify a hidden state that is a plausible candidate to generate the target visible state.)
Set $s^+$ to be the complete state in which the hidden component is $h$ and the visible component is the target visible state.
Set $s^-$ to be the complete state in which the hidden component is $h$ and the visible component is the fully unclamped activation pattern that would be obtained by propagating activities from the hidden units to the (unclamped) visible units.
Note that the contrastive pair at this iteration, $(s^+, s^-)$, are states close to the activation trajectory that the network is following. We might train the net only after it has reached convergence, but we've found that defining the loss at every iteration up until convergence improves training performance.
B.1 Loss 1: The difference of energies
This reduction depends on $s^+$ and $s^-$ sharing the same hidden state, a bipartite architecture in which visible and hidden units are interconnected and all visible-to-visible connections are zero, a tanh activation function for all units, and symmetric weights.
B.2 Loss 2: The conditional probability of correct response
This loss aims to maximize the log probability of the clamped state conditional on the choice between unclamped and clamped states. Framed as a loss, we have a negative log likelihood:
$$\mathcal{L}_h = -\ln P\!\left(s^+ \mid \{s^+, s^-\}\right) = -\ln \frac{e^{-E(s^+)/T}}{e^{-E(s^+)/T} + e^{-E(s^-)/T}} = \ln\!\left( 1 + e^{\left(E(s^+) - E(s^-)\right)/T} \right).$$
The last step is attained using the Boltzmann distribution (Equation 7).
Appendix C Proof of convergence of CBAN with leaky sigmoid activation function
Appendix D Network architectures and hyperparameters
D.1 Bar task
Our architecture was a fully connected bipartite attractor net (FBAN) with one visible layer and two hidden layers having 48 and 24 channels. We trained with the transient TD(1) procedure, defining network stability as the condition in which all changes in unit activation on successive iterations are less than 0.01 for a given input, using tanh activation functions and batches of 20 examples (the complete data set), with masks randomly generated on each epoch subject to the constraint that only one completion is consistent with the evidence. Weights between layers $l$ and $l+1$ and the biases in layer $l+1$ are initialized from a mean-zero Gaussian whose standard deviation is scaled according to $n_l$, the number of units in layer $l$. Optimization is via stochastic gradient descent with an initial learning rate of 0.01, dropped to 0.001; the gradients in a given layer of weights are renormalized to have an L2 norm of 1.0 for a batch of examples, which we refer to as SGD-L2.
Our architecture was a fully connected bipartite attractor net (FBAN) with one visible layer, a first hidden layer of 200 units, and a second hidden layer of 50 units. We trained with the transient TD(1) procedure, defining network stability as the condition in which all changes in unit activation on successive iterations are less than 0.01 for a given input, using tanh activation functions and batches of 250 examples. Masks are generated randomly for each example on each epoch by thresholding Perlin noise of frequency 7 such that one third of the pixels are obscured. Weights between layers $l$ and $l+1$ and the biases in layer $l+1$ are initialized from a mean-zero Gaussian whose standard deviation is scaled according to $n_l$, the number of units in layer $l$. Optimization is via stochastic gradient descent with learning rate 0.01; the gradients in a given layer of weights are renormalized to have an L-infinity norm of 1.0 for a batch of examples, which we refer to as SGD-Linf. Target activations were scaled to lie in [-0.999, 0.999].
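The two gradient renormalization schemes (SGD-L2 above for the bar task, SGD-Linf here) can be sketched as follows. The function names are ours; SGD-L2 rescales each layer's gradient to unit L2 norm, and SGD-Linf rescales it to unit L-infinity (max-absolute-value) norm before the usual SGD update.

```python
import numpy as np

def renormalize(grad, mode="l2", eps=1e-12):
    """Rescale a layer's gradient to unit norm (L2 or L-infinity)."""
    if mode == "l2":
        return grad / (np.linalg.norm(grad) + eps)
    if mode == "linf":
        return grad / (np.max(np.abs(grad)) + eps)
    raise ValueError(mode)

def sgd_step(param, grad, lr=0.01, mode="l2"):
    """One SGD update with per-layer gradient renormalization."""
    return param - lr * renormalize(grad, mode)
```

Renormalizing per layer equalizes update magnitudes across layers regardless of raw gradient scale.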
The network architecture consists of four layers: one visible layer and three hidden layers. The visible layer dimensions match the input image dimensions, $32 \times 32 \times 3$. The three hidden layers have 40, 120, and 440 channels, respectively. The same filter size is used between all layers. Beyond the first hidden layer, we introduce a $2 \times 2$ average pooling operation followed by a half-padded convolution going from layer $l$ to layer $l+1$, and a half-padded convolution followed by a $2\times$ nearest-neighbor interpolation going from layer $l+1$ to layer $l$. Consequently, the spatial dimensions of the hidden states, from lowest to highest, are (32, 32), (16, 16), and (8, 8). A trainable bias is applied per channel in each layer. All biases are initialized to 0, whereas kernel weights are Gaussian initialized with a standard deviation of 0.0001. The CBAN used tanh activation functions and was trained with the transient TD(1) procedure, as described in the main text.
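The two resampling operations between hidden layers can be sketched as below. The $2\times2$ pooling and $2\times$ upsampling factors are inferred from the halving of the spatial dimensions (32 to 16 to 8); the convolutions themselves are omitted for brevity.

```python
import numpy as np

def avg_pool_2x2(x):
    """2x2 average pooling on an (H, W, C) array -> (H/2, W/2, C)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def nn_upsample_2x(x):
    """2x nearest-neighbor upsampling on an (H, W, C) array -> (2H, 2W, C)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

state = np.zeros((32, 32, 40))     # first hidden layer's spatial grid
pooled = avg_pool_2x2(state)       # (16, 16, 40): next layer's spatial grid
```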
We trained our model on the 50,000 training images of the CIFAR-10 dataset (test set: 10,000 images). The images are noised by online generation of Perlin noise that masks 40% of the image. We optimized our mean-squared error objective using Adam. The learning rate is initially set to 0.0005 and then decreased manually by a factor of 10 every 20 epochs beyond training epoch 150. For each batch, the network runs until the state stabilizes, where the condition for stabilization is that the maximum absolute difference of the full network state between stabilization steps $i$ and $i+1$ is less than 0.01. The maximum number of stabilization steps was set to 100; the average number of stabilization steps per batch over the course of training was 50.
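The stabilization criterion can be sketched as a simple loop; `step` stands in for one full settling update of the network state, and the tolerance and cap match the values above.

```python
import numpy as np

def run_to_stability(state, step, tol=0.01, max_steps=100):
    """Iterate `step` until the max absolute change falls below `tol`,
    capped at `max_steps`. Returns (final_state, steps_taken)."""
    for i in range(max_steps):
        new_state = step(state)
        if np.max(np.abs(new_state - state)) < tol:
            return new_state, i + 1
        state = new_state
    return state, max_steps
```

With a contractive update, the loop terminates well before the cap; the cap guards against non-converging batches early in training.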
The network architecture consists of four layers: one visible layer and three hidden layers. The visible layer dimensions match the input image dimensions, $28 \times 28 \times 1$. The three hidden layers have 128, 256, and 512 channels, respectively. The same filter size is used between all layers. Beyond the first hidden layer, we introduce a $2 \times 2$ average pooling operation followed by a half-padded convolution going from layer $l$ to layer $l+1$, and a half-padded convolution followed by a $2\times$ nearest-neighbor interpolation going from layer $l+1$ to layer $l$. Consequently, the spatial dimensions of the hidden states, from lowest to highest, are (28, 28), (14, 14), and (7, 7). A trainable bias is applied per channel in each layer. All biases are initialized to 0, whereas kernel weights are Gaussian initialized with a standard deviation of 0.01. The CBAN used tanh activation functions and was trained with the transient TD(1) procedure, as described in the main text.
We trained our model on 15,424 images from the Omniglot dataset (test set: 3,856 images). The images are noised by online generation of squares that mask 20-40% of the white pixels in the image. We optimized our mean-squared error objective using Adam. The learning rate is initially set to 0.0005 and then decreased manually by a factor of 10 every 20 epochs after training epoch 100. For each batch, the network runs until the state stabilizes, where the condition for stabilization is that the maximum absolute difference of the full network state between stabilization steps $i$ and $i+1$ is less than 0.01. The maximum number of stabilization steps was set to 100; the average number of stabilization steps per batch over the course of training was 50.
Masks were formed by selecting patches of diameter 3-6 (sampled uniformly) at random, possibly overlapping, locations, stopping when at least 25% of the white pixels had been masked.
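A sketch of this masking scheme, interpreting "diameter" as the side length of a square patch (an assumption on our part):

```python
import numpy as np

def make_mask(img, rng, frac=0.25):
    """Place random square patches until at least `frac` of the white
    pixels (img > 0.5) are covered. Returns a boolean mask."""
    white = img > 0.5
    n_white = white.sum()
    mask = np.zeros_like(img, dtype=bool)
    while (mask & white).sum() < frac * n_white:
        d = rng.integers(3, 7)                       # patch size 3..6
        r = rng.integers(0, img.shape[0] - d + 1)
        c = rng.integers(0, img.shape[1] - d + 1)
        mask[r:r + d, c:c + d] = True                # patches may overlap
    return mask

img = np.ones((28, 28))                              # stand-in binary image
mask = make_mask(img, np.random.default_rng(0))
```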
The network architecture consists of four layers: one visible layer and three hidden layers. The visible layer spatial dimensions match the input patch dimensions, but the visible layer consists of 6 channels. The low-resolution evidence patch is clamped to the bottom 3 channels of the visible state; the top 3 channels of the visible state serve as the unclamped output against which the high-resolution target patch is compared, with the loss computed as a mean-squared error. Each of the three hidden layers has 300 channels. The same filter size is used between all layers. All convolutions are half-padded, and no average pooling operations are introduced in the SR network scheme. Consequently, the spatial dimensions of the hidden states remain constant and match those of the input patches. A trainable bias is applied per channel in each layer. All biases are initialized to 0, whereas kernel weights are Gaussian initialized with a standard deviation of 0.001.
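The 6-channel visible layer can be sketched as below. The patch size is left unspecified in the text, so `P = 16` here is an arbitrary stand-in, and the zero arrays stand in for actual image patches.

```python
import numpy as np

P = 16                                     # hypothetical patch size
low_res = np.zeros((P, P, 3))              # low-resolution evidence patch
target = np.zeros((P, P, 3))               # high-resolution target patch

visible = np.zeros((P, P, 6))
visible[..., :3] = low_res                 # bottom 3 channels: clamped evidence
output = visible[..., 3:]                  # top 3 channels: unclamped output
loss = np.mean((output - target) ** 2)     # mean-squared error vs. target
```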
We trained our model on the 91 images of the T91 dataset [yang2010image] at $2\times$ scaling. We optimized our mean-squared error objective using Adam. The learning rate is initially set to 0.00005 and then decreased by a factor of 10 every 10 epochs. The stability conditions described for the CIFAR-10 and Omniglot CBAN models were repeated for the SR task, except that the stability threshold was set to 0.1 halfway through training. We evaluated on four test datasets at $2\times$ scaling: Set5 [bevilacqua2012low], Set14 [zeyde2010single], BSD100 [martin2001database], and Urban100 [huang2015single].