Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences
Abstract
We introduce Delay Pruning, a simple yet powerful technique to regularize dynamic Boltzmann machines (DyBM). The recently introduced DyBM provides a particularly structured Boltzmann machine as a generative model of a multidimensional time-series. This Boltzmann machine can have infinitely many layers of units but allows exact inference and learning based on its biologically motivated structure. DyBM uses the idea of conduction delays in the form of fixed-length first-in first-out (FIFO) queues, with a neuron connected to another via this FIFO queue, and spikes from a pre-synaptic neuron travel along the queue to the post-synaptic neuron with a constant period of delay. Here, we present Delay Pruning as a mechanism to truncate the lengths of the FIFO queues to zero by setting some delay lengths to one with a fixed probability, and finally selecting the best-performing model with fixed delays. The uniqueness of the structure and the non-sampling-based learning rule of DyBM make the application of previously proposed regularization techniques such as Dropout or DropConnect difficult, leading to poor generalization. First, we evaluate the performance of Delay Pruning by letting DyBM learn a multidimensional temporal sequence generated by a Markov chain. Finally, we show the effectiveness of Delay Pruning in learning high-dimensional sequences using the moving MNIST dataset, and compare it with the Dropout and DropConnect methods.
I Introduction
Deep neural networks [1], [2] have been successfully applied to a large number of image recognition and other machine learning tasks. However, neural-network (NN) based models are typically well suited to scenarios with large amounts of labelled data available. By increasing the network complexity (in terms of size or number of layers), one can achieve impressive levels of performance. A caveat is that this can lead to gross overfitting or generalization issues when training with a limited number of training samples. As a result, a wide range of techniques for regularizing NNs have been developed, such as adding a penalty term to the objective, Bayesian methods [3], and adding noise to the training data [4].
More recently, with a focus on NNs with a deep architecture, the Dropout [5] and DropConnect [6] techniques have been proposed as ways to prevent overfitting by randomly omitting some of the feature detectors on each training sample. Specifically, Dropout involves randomly deleting some of the activations (units) in each layer during a forward pass and then backpropagating the error only through the remaining units. DropConnect generalizes this by randomly omitting weights rather than activations (units). Both techniques have been shown to significantly improve performance on standard fully-connected deep neural network architectures.
In this work, we propose a novel regularization technique called Delay Pruning, designed for a recently introduced generative model called the dynamic Boltzmann machine (DyBM) [7]. Unlike the conventional Boltzmann machine (BM) [8], which is trained with a collection of static patterns, the DyBM is designed for unsupervised learning of temporal pattern sequences. The DyBM is motivated by postulates and observations from biological neural networks, allowing exact inference and learning of weights based on the timing of spikes (spike-timing dependent plasticity, STDP). Unlike the restricted Boltzmann machine (RBM) [9], the DyBM has no specific hidden units, and the network can be unfolded through time, allowing infinitely many layers [10]. Furthermore, the DyBM can be viewed as a fully-connected recurrent neural network with memory units and with conduction delays between units implemented in the form of fixed-length first-in first-out (FIFO) queues. A spike originating at a pre-synaptic neuron (unit) travels along this FIFO queue and reaches the post-synaptic neuron after a fixed delay. The length of each FIFO queue is equal to its maximum delay value minus one. Due to this completely novel architecture of the DyBM, applying existing regularization methods is difficult or does not lead to better generalization performance.
The Delay Pruning technique proposed here thus provides a method for regularized training of NNs with FIFO queues. Specifically, during training, it truncates the lengths of randomly selected FIFO queues to zero. We evaluate the performance of Delay Pruning on a stochastic multidimensional time-series and then compare it with Dropout and DropConnect for unsupervised learning on the high-dimensional moving MNIST dataset. In the following sections, we first give a brief overview of the DyBM and its learning rule, followed by the Delay Pruning algorithm, experimental results and conclusion.
II Dynamic Boltzmann Machine
II-A Overview
In this paper, we use the DyBM [7] for unsupervised learning of temporal sequences and show better generalization performance using our Delay Pruning algorithm. Unlike standard Boltzmann machines, the DyBM can be trained with a time-series of patterns. Specifically, the DyBM gives the conditional probability of the next values (patterns) of a time-series given its historical values. This conditional probability can depend on the whole history of the time-series, and the DyBM can thus be used iteratively as a generative model of a time-series.
The DyBM can be defined from a BM having multiple layers of units, where one layer represents the most recent values of a time-series, and the remaining layers represent its historical values. The most recent values are conditionally independent of each other given the historical values. The DyBM is equivalent to such a BM having an infinite number of layers, so that the most recent values can depend on the whole history of the time-series. We train the DyBM in such a way that the likelihood of a given time-series is maximized with respect to the conditional distribution of the next values given the historical values. Similar to a BM, a DyBM consists of a network of artificial neurons. Each neuron takes a binary value, 0 or 1, following a probability distribution that depends on the parameters of the DyBM. Unlike the BM, the values of the DyBM can change over time in a way that depends on its previous values. That is, the DyBM stochastically generates a multidimensional series of binary values.
Learning in conventional BMs is based on a Hebbian formulation, but is often approximated with a sampling-based strategy such as contrastive divergence. In this formulation, the concept of time is largely missing. In the DyBM, as in biological networks, learning depends on the timing of spikes. This is called spike-timing dependent plasticity, or STDP [11], which states that a synapse is strengthened if the spike of a pre-synaptic neuron precedes the spike of a post-synaptic neuron (long-term potentiation, LTP), and weakened if the temporal order is reversed (long-term depression, LTD). The DyBM uses an exact online learning rule that has the properties of LTP and LTD.
The learning rule of the DyBM exhibits some of the key properties of STDP due to its structure consisting of conduction delays and memory units, which are illustrated in Figure 1. A neuron is connected to another in such a way that a spike from a pre-synaptic neuron, $i$, travels along an axon and reaches a post-synaptic neuron, $j$, via a synapse after a delay consisting of a constant period, $d_{i,j}$. In the DyBM, a FIFO queue causes this conduction delay. The FIFO queue stores the values of the pre-synaptic neuron for the last $d_{i,j}-1$ units of time. Each stored value is pushed one position toward the head of the queue when the time is incremented by one unit. The value of the pre-synaptic neuron is thus given to the post-synaptic neuron after the conduction delay. Moreover, the DyBM aggregates information about past spikes into neural eligibility traces and synaptic eligibility traces, which are stored in the memory units. Each neuron is associated with a learnable parameter called bias. The strength of the synapse between a pre-synaptic and a post-synaptic neuron is represented by learnable parameters called weights, which are further divided into LTP and LTD components.
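As a minimal illustration, the conduction-delay mechanism described above can be sketched as a FIFO queue of length $d-1$. The class name `DelayLine` and its interface are our own illustrative choices, not part of any DyBM implementation:

```python
from collections import deque

class DelayLine:
    """FIFO queue modelling a conduction delay of d time steps.

    With a queue of length d - 1, the value returned at step t is the
    pre-synaptic value pushed at step t - d + 1, i.e. the value that
    reaches the post-synaptic neuron through the synapse.
    """

    def __init__(self, delay):
        assert delay >= 1
        self.delay = delay
        # Queue length is delay - 1; delay == 1 means no buffering.
        self.queue = deque([0] * (delay - 1))

    def step(self, spike):
        """Push the current pre-synaptic value; pop the delayed one."""
        if self.delay == 1:
            return spike  # zero-length queue: value passes straight through
        self.queue.append(spike)
        return self.queue.popleft()
```

For example, with `delay=3` a spike entered at step 0 emerges at step 2, matching a queue of length two.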
II-B Definition
The DyBM shown in Figure 2 (b) can be shown to be equivalent to a BM having infinitely many layers of units [10]. Similar to the RBM (Figure 2 (a)), the DyBM has no weights between the units in the rightmost layer of Figure 2 (b). Unlike the RBM [9], each layer of the DyBM has a common number, $N$, of units, and the bias and the weights of the DyBM can be shared among different units in a particular manner.
Formally, the DyBM is a BM having layers from $-T$ to $0$, where $T$ is a positive integer or infinity. Let $x \equiv (x^{[-T]}, \ldots, x^{[0]})$, where $x^{[-\delta]}$ denotes the values of the $N$ units in the $(-\delta)$-th layer, which we consider as the values at time $-\delta$. The units at the 0th layer (the rightmost layer of Figure 2 (b)) have an associated bias term $b$. For any $\delta \geq 1$, $W^{[\delta]}$ gives the matrix whose $(i,j)$-th element, $w_{i,j}^{[\delta]}$, denotes the weight between the $i$-th unit at time $t-\delta$ and the $j$-th unit at time $t$ for any $t$. This weight can in turn be divided into LTP and LTD components. As introduced in the previous section, each neuron stores a fixed number, $L$, of neural eligibility traces. For $\ell \in \{1, \ldots, L\}$ and $j \in \{1, \ldots, N\}$, $\gamma_{j,\ell}^{[t-1]}$ is the $\ell$-th neural eligibility trace of the $j$-th neuron immediately before time $t$. This is calculated as a weighted sum of the past values of that neuron, with recent values weighing more:

$\gamma_{j,\ell}^{[t-1]} = \sum_{s=-\infty}^{t-1} \mu_\ell^{\,t-s} \, x_j^{[s]}$ (1)

where $\mu_\ell \in (0,1)$ is the decay rate for the $\ell$-th neural eligibility trace. Each neuron also stores synaptic eligibility traces as weighted sums of the values that have reached neuron $j$ from a pre-synaptic neuron $i$ after the conduction delay $d_{i,j}$, with recent values weighing more. Namely, the post-synaptic neuron stores a fixed number, $K$, of synaptic eligibility traces. For $k \in \{1, \ldots, K\}$, $\alpha_{i,j,k}^{[t-1]}$ is the $k$-th synaptic eligibility trace of the neuron $j$ for the pre-synaptic neuron $i$ immediately before time $t$:

$\alpha_{i,j,k}^{[t-1]} = \sum_{s=-\infty}^{t-d_{i,j}} \lambda_k^{\,t-s} \, x_i^{[s]}$ (2)

where $\lambda_k \in (0,1)$ is the decay rate for the $k$-th synaptic eligibility trace. Both of the eligibility traces are updated locally in time as follows:

$\alpha_{i,j,k}^{[t]} = \lambda_k \left( \alpha_{i,j,k}^{[t-1]} + x_i^{[t-d_{i,j}+1]} \right)$ (3)

$\gamma_{j,\ell}^{[t]} = \mu_\ell \left( \gamma_{j,\ell}^{[t-1]} + x_j^{[t]} \right)$ (4)

for $k \in \{1, \ldots, K\}$ and $\ell \in \{1, \ldots, L\}$, and for each pre-synaptic neuron $i$ that is connected to $j$.
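The local trace updates of Eqs. 3 and 4 can be sketched in a few lines. The function below is a hypothetical helper of our own for a single synapse, assuming the standard DyBM trace updates of [7]: `mu` and `lam` are the decay rates, `x_j_t` is the current post-synaptic value, and `x_i_delayed` is the pre-synaptic value emerging from the FIFO queue:

```python
def update_traces(gamma, alpha, mu, lam, x_j_t, x_i_delayed):
    """One time step of the eligibility-trace updates:
       gamma <- mu  * (gamma + x_j^[t])            (neural trace, Eq. 4)
       alpha <- lam * (alpha + x_i^[t - d + 1])    (synaptic trace, Eq. 3)
    Both updates are local in time: only the previous trace value and the
    newest spike are needed.
    """
    gamma = mu * (gamma + x_j_t)
    alpha = lam * (alpha + x_i_delayed)
    return gamma, alpha
```

Because each update only decays the old trace and adds the newest spike, the traces summarize the entire spike history in constant memory.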
For a DyBM, $P_\theta(x^{[0]} \mid x^{[-T,-1]})$ is the conditional probability of $x^{[0]}$ given $x^{[-T,-1]}$, where $\theta$ denotes the parameters and we use $x^{[s,t]}$ for an interval $s \le t$ to denote $(x^{[s]}, \ldots, x^{[t]})$. Because the units in the 0th layer have no weights with each other, this conditional probability has the property of conditional independence analogous to RBMs.
The DyBM can be seen as a model of a time-series in the following sense. Specifically, given a history $x^{[t-T,t-1]}$ of a time-series, the DyBM gives the probability of its next values, $x^{[t]}$, with $P_\theta(x^{[t]} \mid x^{[t-T,t-1]})$. With a DyBM, the next values can depend on the whole history of the time-series. In principle, the DyBM can thus model any time-series, possibly with long-term dependency, as long as the values of the time-series at a moment are conditionally independent of each other given the values preceding that moment. Using the conditional probability given by a DyBM, the probability of a sequence, $x = (x^{[1]}, \ldots, x^{[n]})$, of length $n$ is given by

$P_\theta(x) = \prod_{t=1}^{n} P_\theta(x^{[t]} \mid x^{[t-T,t-1]})$ (5)

where we arbitrarily define $x^{[t]} \equiv 0$ for $t \le 0$. Namely, the values are set to zero where there is no corresponding history.
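The factorization of the sequence probability into per-step conditionals (Eq. 5) can be sketched generically. Here `cond_prob` is a placeholder standing in for the DyBM's conditional model, not an actual implementation:

```python
import math

def sequence_log_likelihood(sequence, cond_prob):
    """Log-likelihood of a sequence under a conditional model, i.e. the
    log of Eq. 5: a sum of per-step conditional log-probabilities.

    `cond_prob(value, history)` must return the probability of `value`
    given the preceding values `history` (empty at t = 1, corresponding
    to the all-zero padding of the missing history).
    """
    total = 0.0
    for t in range(len(sequence)):
        history = sequence[:t]  # values before time step t
        total += math.log(cond_prob(sequence[t], history))
    return total
```

Working in log-space turns the product of Eq. 5 into a sum, which is what the learning rule of the next subsection differentiates.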
The STDP-based learning rule for a DyBM is derived such that the log-likelihood of a given set $\mathcal{D}$ of time-series is maximized by maximizing the sum of the log-likelihoods $\log P_\theta(x)$ over $x \in \mathcal{D}$. Using Eq. 5, the log-likelihood of $\mathcal{D}$ has the following gradient:

$\nabla_\theta \sum_{x \in \mathcal{D}} \log P_\theta(x) = \sum_{x \in \mathcal{D}} \sum_{t=1}^{n} \nabla_\theta \log P_\theta(x^{[t]} \mid x^{[t-T,t-1]})$ (6)

Typically, the computation of this gradient can be intractable for large $T$; however, in the DyBM, using a specific form of weight sharing [7], exact and efficient gradient calculation is possible. Specifically, in the limit of $T \to \infty$, using the formulation of neural and synaptic eligibility traces, the parameters of the DyBM can be updated exactly using an online stochastic gradient rule that maximizes the log-likelihood of the given set $\mathcal{D}$:

$\theta \leftarrow \theta + \eta \, \nabla_\theta \log P_\theta(x^{[t]} \mid x^{[t-T,t-1]})$ (7)

where $\eta$ is the learning rate.
Due to space limitations, the weight update rules are not provided here. See [7] for details.
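Although the exact weight update rules are omitted here, the generic shape of the online stochastic gradient step of Eq. 7 can be sketched as follows. The parameter vector, gradient and learning rate `eta` are illustrative placeholders, not the DyBM's actual parameterization:

```python
def sgd_step(theta, grad_log_p, eta=0.01):
    """One online gradient-ascent step on the conditional log-likelihood
    of the newest observation (the shape of Eq. 7): each parameter is
    nudged along its component of the gradient. In the DyBM, this
    gradient is computable exactly from the eligibility traces.
    """
    return [th + eta * g for th, g in zip(theta, grad_log_p)]
```

Because the gradient depends only on the current traces and observation, the step runs online, with no backpropagation through time.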
III Regularization with Delay Pruning
Delay Pruning provides a method for training the DyBM, and in general neural networks with FIFO queues, with regularization, and then choosing the best-performing model for improved prediction on a test dataset. Specifically, it refers to truncating FIFO queue lengths to zero, by setting their respective delay values to unit length, for randomly selected axons with a probability $p$. Figure 3 displays the difference in architecture between two connected neurons with a FIFO queue in the original DyBM and in a delay-pruned version. The procedure is carried out as follows:
Initialize the DyBM parameters, with the delay length $d_{i,j}$ for the FIFO queue connecting neurons $i$ and $j$ selected randomly within a fixed range. Each neuron is connected to another neuron with two FIFO queues (outgoing and incoming axon) of lengths initialized to $d_{i,j}-1$. Calculate the negative log-likelihood (original negative log-likelihood, ONL) with respect to the true distribution of the temporal pattern from the training sample.
For each training sample and current training cycle:

Every fixed number of epochs, validate the previously learned DyBM for predicting a temporal sequence pattern. Calculate the negative log-likelihood (training negative log-likelihood, TNL) of the training (or validation) data with respect to the distribution defined by the trained DyBM. Update the performance evaluation measure by calculating the difference between ONL and TNL (any other appropriate performance measure, e.g. cross-entropy, can be used instead). Update a best-model pointer to point towards the learned network with the minimum difference so far.

Draw a random variable $r$ from a Bernoulli distribution with probability $p$. If $r = 0$, keep the original maximum delay (FIFO queue length); otherwise, set the current maximum delay $d_{i,j} = 1$, thus truncating the current FIFO queue length to zero.

Repeat till all training cycles are exhausted.

The best performing model (with all parameters fixed) from the training and validation process is selected for final testing.
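The training-and-pruning loop above can be sketched as follows. Here `train_epoch` and `neg_log_likelihood` are hypothetical stand-ins for the DyBM training and evaluation routines, and the network's delays are simplified to one value per directed connection:

```python
import random

def delay_pruning(delays, p, n_cycles, train_epoch, neg_log_likelihood):
    """Sketch of Delay Pruning (Section III): train, score the current
    "thinned" network, remember the best one, then with probability p
    reset each connection's maximum delay to 1 (queue length zero).

    `delays` maps a directed connection to its maximum delay d_ij.
    """
    best_delays, best_score = dict(delays), float("inf")
    for _ in range(n_cycles):
        train_epoch(delays)
        score = neg_log_likelihood(delays)   # evaluation measure (cf. ONL/TNL)
        if score < best_score:               # keep the best thinned model
            best_score, best_delays = score, dict(delays)
        for edge in delays:                  # Bernoulli(p) pruning step
            if random.random() < p:
                delays[edge] = 1             # FIFO queue length becomes zero
    return best_delays, best_score
```

Each cycle thus samples and trains a new thinned network, and the final model is the best-scoring member of the ensemble rather than an average.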
Similar to Dropout and DropConnect, applying the Delay Pruning algorithm amounts to sampling the best-performing "thinned" network. In this case, the thinned network consists of all FIFO queues that survived the pruning procedure. For each training cycle, a new thinned network is sampled and trained. As a result of this procedure, one trains across an ensemble of models and thus effectively regularizes the DyBM to prevent overfitting. Finally, instead of averaging across the ensemble, we select the best-performing model. This is analogous to bagging-based ensemble learning methods in other machine learning areas [12].
IV Experiments
We designed two experiments of increasing complexity in order to evaluate the effect of Delay Pruning on the DyBM. Given that the DyBM is a neural network suitable for learning a generative model of temporal sequences, the two tasks were chosen so as to show the effect of regularization on modelling and predicting high-dimensional temporal patterns. The experiments were conducted using a purely CPU-based Java® implementation of the DyBM on a MacBook Air with an Intel Core i5, 1.6 GHz.
IV-A Training
The DyBM was trained using mini-batches of samples from the training set in both cases. Each sample was trained for a maximum of fifty thousand time steps. Delay Pruning was carried out continuously. Every epoch, the DyBM was tested on a sample from a validation set (generating a validation temporal pattern sequence), and the currently best-performing model was updated. After every mini-batch, all the eligibility traces in the DyBM were reinitialized, with the learned weights from the previous mini-batch transferred to the next batch. Training stopped if the maximum time was reached or if the estimated negative log-likelihood of the trained DyBM matched the true negative log-likelihood of the validation set for an entire epoch. The learning rates were initially fixed to a small value and then adjusted during training using the optimisation technique of adaptive moment estimation (Adam) [13], while the parameters of the DyBM were learned using a stochastic gradient method.
The bias and weight parameters were initialized randomly from a normal distribution. The DyBM uses a fully-connected network, with each neuron having a self-connection via a FIFO queue. Each neuron held three ($L = 3$) neural and three ($K = 3$) synaptic eligibility traces, respectively.
IV-B Learning a multidimensional stochastic time-series
This task involved learning to model and predict the next values of a 7-dimensional stochastic time-series $x = (x^{[1]}, \ldots, x^{[n]})$, where $x^{[t]} \in \{0,1\}^7$ and $n$ is the length of the time-series. The time-series was synthetically generated using a discrete-time first-order Markov process, as depicted in Fig. 4(a). The probability of generating the same state, '0' or '1', was fixed, as was the transition probability to the other state ('0' to '1' or vice versa). A small training set was deliberately chosen, relative to the test set, in order to check the generalization ability of the original DyBM as compared with the DyBM regularized with Delay Pruning during training. Each FIFO queue connection delay was initialized randomly. Each of the $N = 7$ neurons encoded one of the input dimensions. The probability of pruning, $p$, was fixed. Fig. 4(b) shows an example plot with a section of the training dataset.
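A sketch of this data-generating process follows; the stay probability `p_stay` is an illustrative value, as the exact transition probabilities used in the experiment are not restated here:

```python
import random

def markov_series(length, dims=7, p_stay=0.8, seed=0):
    """Generate a binary multidimensional time-series in which each of
    the `dims` dimensions follows an independent two-state (0/1)
    first-order Markov chain: with probability p_stay the dimension
    keeps its state, otherwise it flips.
    """
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(dims)]
    series = []
    for _ in range(length):
        state = [s if rng.random() < p_stay else 1 - s for s in state]
        series.append(list(state))
    return series
```

A chain like this has short-range temporal structure that a DyBM's conditional model $P_\theta(x^{[t]} \mid x^{[t-T,t-1]})$ can capture.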
We first trained and tested the original DyBM without any regularization. In Fig. 4(c), we plot the negative log-likelihood with respect to the true data distribution against the estimated negative log-likelihood of the test data with respect to the distribution defined by the trained DyBM. As observed, it achieved poor generalization, with a low correlation coefficient. Keeping all parameters the same, retraining the DyBM with the Delay Pruning regularization method (as explained in Section III) resulted in significantly better generalization in the prediction of the test time-series. This is clearly observed from the high correlation coefficient between the estimated negative log-likelihood and the true negative log-likelihood of the data distribution.
IV-C Moving MNIST Prediction
Unsupervised learning of image sequences [14] is a difficult problem, and avoiding overfitting in order to predict future pattern sequences is considerably challenging. As such, this task was designed to test the ability of the delay-pruned DyBM to go through a temporal sequence of image frames and learn the underlying representation. We then test it on generating the original input sequences and on predicting future image frames in the correct temporal order.
Moving MNIST digits: This dataset consists of videos of MNIST digits. Each video consisted of two digits moving inside a patch. The digits were chosen randomly from the training set and placed initially at random locations inside the patch. As described in [14], each digit was assigned a velocity whose direction was chosen uniformly at random on a unit circle and whose magnitude was also chosen uniformly at random over a fixed range. The digits bounced off the edges of the patch and overlapped if they were at the same location. In order to reduce the learning time complexity of the task while preserving its spatial complexity, we reduced the resolution of the original image patches (see top panels of Fig. 5). This makes it considerably more difficult to recognize the original digits, but the patterns move between frames in the same temporal order. We binarized the image patches using a fixed RGB threshold value. The DyBM was trained on sample videos selected randomly from the original dataset.
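The preprocessing described above can be sketched as block-average down-sampling followed by thresholding. The down-sampling factor and threshold below are hypothetical, chosen only for illustration; the paper's exact settings are not reproduced:

```python
def preprocess(frame, factor=4, threshold=128):
    """Down-sample a grayscale frame (list of rows of pixel values) by
    averaging `factor` x `factor` blocks, then binarize each block
    against a fixed intensity threshold, yielding {0, 1} values suitable
    for a binary-valued DyBM.
    """
    h, w = len(frame), len(frame[0])
    out = []
    for r in range(0, h, factor):
        row = []
        for c in range(0, w, factor):
            block = [frame[i][j]
                     for i in range(r, min(r + factor, h))
                     for j in range(c, min(c + factor, w))]
            row.append(1 if sum(block) / len(block) >= threshold else 0)
        out.append(row)
    return out
```

The binarized, lower-resolution frames can then be flattened into the DyBM's binary input vector, one unit per pixel.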
Unlike in [14], we used a single, considerably smaller DyBM network to learn to reconstruct the original input sequences as well as to predict into the future. Here, reconstruction was tested by letting the DyBM trained on the input sequences run forward in time over the first frames. As observed from Fig. 5, the DyBM with Delay Pruning did a significantly good job not only of reconstructing the original input sequences but also of predicting the next frames. Despite the relatively small training dataset, the best-case test prediction accuracy compared with the ground truth was significantly high for the DyBM with Delay Pruning, clearly above the baseline performance of the standard DyBM (without Delay Pruning). As observed from the bottom panels of Fig. 5, randomizing the learned weights completely destroyed the ability of the network to either reconstruct or predict the sequences. Prediction beyond 5 frames into the future became considerably worse, with error starting to accumulate after the third predicted frame.
In order to compare the performance of Delay Pruning against other state-of-the-art regularization techniques, we trained the DyBM with Dropout and DropConnect on the same task. It should be noted that, due to the peculiarity of the structure of the DyBM (absence of hidden units), a straightforward application of Dropout and DropConnect is difficult. In this case, we apply these techniques by considering the time-unfolded DyBM, with regularization applied to units or connections in all layers except the units in the 0th layer. This layer acts analogously to the visible layer in standard RBMs. From Fig. 6 we see that the probability of deletion or pruning ($p$) affects the test prediction accuracy in all cases. However, the DyBM with Delay Pruning significantly outperformed both the DropConnect and Dropout regularization techniques. We thus confirmed that Delay Pruning allows robust unsupervised modelling of the video frame sequences.
V Conclusion
We have demonstrated a novel regularization technique called Delay Pruning for the dynamic Boltzmann machine, especially suitable for learning a generative model of multidimensional temporal pattern sequences. Even in the presence of relatively small training and test datasets, Delay Pruning prevents overfitting and gives good generalized performance. Due to the uniqueness of the structure of the DyBM, Delay Pruning, in the form of randomly truncating the lengths of FIFO queues, changes the spiking dynamics of the network by shortening the memory of spikes from a pre-synaptic to a post-synaptic neuron. Experimental results show that Delay Pruning significantly outperforms other state-of-the-art methods, enabling a 256-unit DyBM network to give a high prediction accuracy on the reduced moving MNIST dataset.
Acknowledgement: This work was supported by CREST, JST.
Footnotes
 The Moving MNIST dataset is available from http://www.cs.utoronto.ca/~nitish/unsupervised_video/.
References
 Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 448–455.
 P. M. Williams, “Bayesian regularization and pruning using a laplace prior,” Neural Computation, vol. 7, no. 1, pp. 117–143, 1995.
 C. M. Bishop, “Training with noise is equivalent to tikhonov regularization,” Neural Computation, vol. 7, no. 1, pp. 108–116, 1995.
 N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1058–1066.
 T. Osogami and M. Otsuka, “Seven neurons memorizing sequences of alphabetical images via spiketiming dependent plasticity,” Scientific Reports, vol. 5, 2015.
 D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A learning algorithm for boltzmann machines,” Cognitive Science, vol. 9, no. 1, pp. 147–169, 1985.
 R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted boltzmann machines for collaborative filtering,” in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 791–798.
 T. Osogami and M. Otsuka, “Learning dynamic boltzmann machines with spiketiming dependent plasticity,” arXiv preprint arXiv:1509.08634, 2015.
 S. Song and L. F. Abbott, “Cortical development and remapping through spike timingdependent plasticity,” Neuron, vol. 32, no. 2, pp. 339–350, 2001.
 L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 10, pp. 993–1001, 1990.
 D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using lstms,” arXiv preprint arXiv:1502.04681, 2015.