Training Generative Adversarial Networks with Binary Neurons by End-to-end Backpropagation
Abstract
We propose the BinaryGAN, a novel generative adversarial network (GAN) that uses binary neurons at the output layer of the generator. We employ the sigmoid-adjusted straight-through estimators to estimate the gradients for the binary neurons and train the whole network by end-to-end backpropagation. The proposed model is able to directly generate binary-valued predictions at test time. We implement such a model to generate binarized MNIST digits and experimentally compare its performance for different types of binary neurons, GAN objectives and network architectures. Although the results are still preliminary, we show that it is possible to train a GAN that has binary neurons and that the use of gradient estimators can be a promising direction for modeling discrete distributions with GANs. For reproducibility, the source code is available at https://github.com/salu133445/binarygan.
Hao-Wen Dong, Yi-Hsuan Yang
Research Center for IT Innovation, Academia Sinica, Taipei, Taiwan
{salu133445,yang}@citi.sinica.edu.tw
1 Introduction
Generative adversarial networks (GANs) [7] have enjoyed great success in modeling continuous distributions. However, applying GANs to discrete data has been shown to be non-trivial, as it is difficult to optimize the model distribution toward the target data distribution in a high-dimensional discrete space. Approaches adopted in the literature to apply GANs to discrete data can be roughly divided into three directions.
One direction is to replace the target discrete outputs with continuous relaxations. Kusner et al. [13] proposed to use the continuous Gumbel-Softmax distribution to approximate a categorical distribution and generate sequences of discrete elements using one-hot encoding. Using Wasserstein GANs [2], Gulrajani et al. [8] and Subramanian et al. [16] independently developed models that can handle discrete data by simply passing the continuous, probabilistic outputs (i.e., softmax relaxations) of the generator to the discriminator.
The second direction is to view the generator as an agent in reinforcement learning (RL) and introduce RL-based training strategies. Yu et al. [19] treated the generator as a stochastic parametrized policy and trained it via policy gradient and Monte Carlo search. Hjelm et al. [10] used an estimated difference measure from the discriminator to compute importance weights for generated samples, which provides a policy gradient for training the generator. More examples can be found in natural language processing (NLP), including dialogue generation [15] and machine translation [18].
The third direction is to introduce gradient estimators that estimate the gradients for the non-differentiable discretization operations in the generator. Since the discretization operations are used in the forward pass, the generator can provide discrete predictions to the discriminator during training and directly generate discrete predictions at test time without any further post-processing. Moreover, it can support conditional computation graphs [3, 4] that allow the system to make discrete decisions for more advanced designs. However, to the best of our knowledge and as pointed out by [10], none of these estimators has yet been shown to work with GANs.
In our previous work on generating binary-valued music piano-rolls [6], we proposed to append to the generator a refiner network that learns to binarize the generator's real-valued outputs. However, the whole network was trained in a two-stage setting. It remains unclear whether and how we can train a GAN that has binary neurons in an end-to-end manner. We study this issue in this paper and take the generation of binarized MNIST handwritten digits (see Figure 1) as a case study, assuming that readers are more familiar with the MNIST digits than with piano-rolls.
In this paper, we employ either stochastic or deterministic binary neurons at the output layer of the generator. In order to train the whole network by end-to-end backpropagation, we use the sigmoid-adjusted straight-through estimators [9, 3, 1] to estimate the gradients for the binary neurons. We experimentally compare the performance of the proposed model, which we dub BinaryGAN, using different types of binary neurons, GAN objectives and network architectures.
2 Background
2.1 Generative Adversarial Networks
A generative adversarial network (GAN) [7] is a generative model that can learn a data distribution in an unsupervised manner. It contains two components: a generator G and a discriminator D. The generator takes as input a random vector z sampled from a prior distribution p_z and outputs a fake sample. The discriminator takes as input either a real sample drawn from the data distribution p_d or a fake sample generated by G and outputs a scalar representing the genuineness of that sample.
The training is formulated as a two-player game: the discriminator aims to tell the fake data from the real ones, while the generator aims to fool the discriminator. Note that in this paper we refer to GAN as its non-saturating version, which can provide stronger gradients in the early stage of training, as suggested by [7]. The objectives for the generator and the discriminator are
\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))]    (1)
\mathcal{L}_D = -\mathbb{E}_{x \sim p_d}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]    (2)
Another form called WGAN [2] was later proposed with the intuition to estimate the Wasserstein distance between the real and the model distributions by a deep neural network and use it as a critic for the generator. It can be formulated as
\min_G \max_D \; \mathbb{E}_{x \sim p_d}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]    (3)
where the discriminator D must be Lipschitz continuous.
In [2], the Lipschitz constraint on the discriminator is imposed by weight clipping. However, this can lead to undesired behaviors as discussed in [8]. A gradient penalty (GP) term that punishes the discriminator when it violates the Lipschitz constraint is then proposed in [8]. The objectives become
\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[D(G(z))]    (4)
\mathcal{L}_D = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim p_d}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]    (5)
where \hat{x} is sampled uniformly along straight lines between pairs of points drawn from the data distribution p_d and the model distribution p_g.
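To make the sampling of \hat{x} concrete, here is a toy NumPy sketch. The linear critic is an assumption purely for illustration, because its input gradient is known in closed form; a real critic's gradient would come from automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: four real and four generated samples of dimension three.
x_real = rng.random((4, 3))
x_fake = rng.random((4, 3))

# Sample points uniformly along straight lines between real/fake pairs.
eps = rng.uniform(size=(4, 1))
x_hat = eps * x_real + (1.0 - eps) * x_fake

# For illustration only, use a linear critic D(x) = w . x, whose gradient
# with respect to x is simply w; a real critic would require autodiff.
w = np.array([0.5, -1.0, 2.0])
grad = np.broadcast_to(w, x_hat.shape)

# Gradient penalty: penalize deviations of the gradient norm from one.
lam = 10.0  # penalty weight; 10 is the value suggested in [8]
penalty = lam * np.mean((np.linalg.norm(grad, axis=1) - 1.0) ** 2)
```

Note that each interpolated point lies on the segment between its real/fake pair, so the penalty is evaluated in the region between the two distributions.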
2.2 Deterministic and Stochastic Binary Neurons
Binary neurons are neurons that output binary-valued predictions. In this work, we consider two types of them. A deterministic binary neuron (DBN) acts like a neuron with the hard thresholding function as its activation function. The output of a DBN for a real-valued input x is defined as
DBN(x) = u(\sigma(x) - 0.5)    (6)
where u(\cdot) denotes the unit step function and \sigma(\cdot) is the sigmoid function. A stochastic binary neuron (SBN), in contrast, binarizes its input stochastically, with a firing probability given by the sigmoid function. The output of an SBN for a real-valued input x is defined as
SBN(x) = u(\sigma(x) - v), \quad v \sim U[0, 1]    (7)
where U[0, 1] denotes a uniform distribution over [0, 1].
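The two neuron types can be sketched in NumPy as follows (a minimal sketch; the function names are ours, not taken from the released code):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def dbn(x):
    """Deterministic binary neuron: u(sigmoid(x) - 0.5)."""
    return (sigmoid(x) >= 0.5).astype(np.float32)

def sbn(x, rng):
    """Stochastic binary neuron: u(sigmoid(x) - v), v ~ U[0, 1]."""
    v = rng.uniform(size=np.shape(x))
    return (sigmoid(x) >= v).astype(np.float32)
```

For example, `dbn` fires exactly when the pre-activated value exceeds the threshold 0.5 (i.e., when x > 0), while `sbn` fires with probability sigmoid(x).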
2.3 Sigmoid-adjusted Straight-through Estimators
Backpropagating through binary neurons, however, is intractable. The reason is that, for a DBN, the hard thresholding function is non-differentiable, and that, for an SBN, it requires computing the expected gradient averaged over all possible combinations of values taken by the binary neurons, where the number of such combinations is exponential in the total number of binary neurons.
The straight-through estimator was first proposed as a regularizer in [9]. It simply ignores the gradient of a binary neuron and treats it as an identity function in the backward pass. A variant, the sigmoid-adjusted straight-through estimator [3], instead replaces the gradient of a binary neuron with the gradient of the sigmoid function. The latter was found to achieve better performance in a classification task presented in [1]. Hence, when training networks with binary neurons, we resort to the sigmoid-adjusted straight-through estimators to provide the gradients for the binary neurons.
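As a concrete sketch of how the estimator behaves, the following NumPy code spells out the forward and backward passes for a DBN with explicit gradients. This is an illustration of the idea only; in practice the trick is wired into a framework's automatic differentiation, e.g., by overriding the gradient of the thresholding op.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_forward(x):
    # Forward pass: hard thresholding (non-differentiable).
    return (sigmoid(x) >= 0.5).astype(np.float64)

def dbn_backward(x, grad_output):
    # Backward pass with the sigmoid-adjusted straight-through estimator:
    # pretend the neuron computed sigmoid(x) and backpropagate the
    # sigmoid's gradient, sigmoid(x) * (1 - sigmoid(x)), instead.
    s = sigmoid(x)
    return grad_output * s * (1.0 - s)
```

At x = 0 the estimated gradient is 0.25 (the maximum of the sigmoid derivative), even though the true derivative of the thresholding function is zero almost everywhere.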
3 BinaryGAN
We propose the BinaryGAN, a model that can generate binary-valued predictions without any further post-processing and can be trained by end-to-end backpropagation. The proposed model consists of a generator G and a discriminator D. The generator takes as input a random vector z drawn from a prior distribution p_z and generates a fake sample G(z). The discriminator takes as input either a real sample drawn from the data distribution p_d or a fake sample generated by the generator and outputs a scalar indicating the genuineness of that sample.
In order to handle binary data, we propose to use binary neurons, either deterministic or stochastic ones, at the output layer (i.e., the final layer) of the generator. Hence, the model space (i.e., the output space of the generator) is {0, 1}^n, where n is the number of visible binary neurons at the output layer. We employ the sigmoid-adjusted straight-through estimators to estimate the gradients for the binary neurons and train the whole network by end-to-end backpropagation. Figure 2 shows the system diagram of the proposed model implemented by multilayer perceptrons (MLPs).
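To make the data flow concrete, here is a minimal NumPy sketch of such a generator with a binary output layer. The layer sizes (128-dimensional noise, one hidden layer of 256 units, 784 visible binary neurons for flattened 28x28 digits) are our illustrative assumptions, not the exact architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical layer sizes, chosen for illustration only.
W1 = rng.normal(scale=0.02, size=(128, 256))
W2 = rng.normal(scale=0.02, size=(256, 784))

def generator(z, stochastic=False):
    h = np.maximum(z @ W1, 0.0)   # hidden layer with ReLU activation
    probs = sigmoid(h @ W2)       # sigmoid outputs before binarization
    if stochastic:                # SBN output layer
        return (probs >= rng.uniform(size=probs.shape)).astype(np.float32)
    return (probs >= 0.5).astype(np.float32)  # DBN output layer

z = rng.normal(size=(1, 128))
sample = generator(z)             # a binary-valued, flattened 28x28 sample
```

The discriminator then receives such binary-valued samples directly, rather than the continuous sigmoid outputs.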
4 Experiments and Results
4.1 Training Data—Binarized MNIST Database
4.2 Implementation Details

We use the same batch size for all the experiments.

We train the proposed model with the WGAN-GP objective. Note that other GAN objectives will be compared in Section 4.5.

We apply the slope annealing trick [4]. Specifically, we multiply the slopes of the sigmoid functions in the sigmoid-adjusted straight-through estimators by a constant factor after each epoch.
Our implementation of the binary neurons is mostly based on the code provided in the blog post “Binary Stochastic Neurons in Tensorflow” on the R2RT blog [1].
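The slope annealing trick mentioned above can be sketched as follows. The annealing factor of 1.1 is a hypothetical value for illustration; the trick only requires some constant greater than one.

```python
import numpy as np

def sigmoid_with_slope(x, slope):
    """Sigmoid whose steepness is controlled by a slope parameter."""
    return 1.0 / (1.0 + np.exp(-slope * x))

slope = 1.0
anneal_factor = 1.1  # hypothetical value; any constant > 1 hardens the sigmoid
for epoch in range(3):
    # ... train for one epoch, using sigmoid_with_slope(x, slope) inside the
    # sigmoid-adjusted straight-through estimators ...
    slope *= anneal_factor
```

As the slope grows, `sigmoid_with_slope` approaches the unit step function, so the estimated gradients gradually better match the hard thresholding used in the forward pass.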
4.3 Experiment I—Comparison of the proposed model using deterministic binary neurons and stochastic binary neurons
In the first experiment, we compare the performance of using deterministic binary neurons (DBNs) and stochastic binary neurons (SBNs) in the proposed model. We show in Figures 3(a) and 3(c) some sample generated digits for the two models. We can see that the proposed model with DBNs and that with SBNs achieve similar quality. However, from Figures 3(b) and 3(d) we can see that the pre-activated outputs (i.e., the real-valued, intermediate values right before the binarization operation; see Section 2.2) of the two models show distinct characteristics. In order to see how DBNs and SBNs work differently, we compute the histograms of their pre-activated outputs, as shown in Figure 4.
We can see from Figure 4 that the proposed model with DBNs produces more pre-activated values in between zero and one, which results in a flatter histogram. We attribute this phenomenon to the fact that the output of a DBN is less sensitive to the absolute value of its input, because it depends only on whether the pre-activated value is greater than the threshold. Moreover, we observe a notch around 0.5, the threshold value we use in our implementation. It seems that DBNs tend to avoid producing pre-activated values around the decision boundary (i.e., the threshold).
In contrast, the proposed model with SBNs produces more pre-activated values close to zero and one, which we attribute to the fact that the output of an SBN is more sensitive to the absolute value of its input, because it relies on Bernoulli sampling (e.g., an SBN may fire even with a tiny pre-activated value). As a result, in order to avoid false triggering, an SBN seems to favor pre-activated values close to either zero or one.
4.4 Experiment II—Comparison of the proposed model and the realvalued model
In the second experiment, we compare the proposed model with a variant that uses normal neurons at the output layer (with sigmoid functions as the activation functions; this is how we train the MuseGAN model in [5]). After the training, we binarize the real-valued predictions with a threshold of 0.5 to obtain the final binary-valued results. We refer to this model as the real-valued model. Figure 5(a) shows some sample raw, probabilistic predictions of this model. Figures 5(b) and 5(c) show the final binarized results using two common post-processing strategies: hard thresholding and Bernoulli sampling, respectively.
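The two post-processing strategies can be sketched in NumPy as follows, using a hand-made toy batch of probabilistic predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy probabilistic predictions from the real-valued model (values in [0, 1]).
probs = np.array([[0.1, 0.6, 0.9],
                  [0.4, 0.5, 0.2]])

# Strategy 1: hard thresholding at 0.5.
hard = (probs >= 0.5).astype(np.float32)

# Strategy 2: Bernoulli sampling, using the predictions as firing probabilities.
bernoulli = (rng.uniform(size=probs.shape) < probs).astype(np.float32)
```

Hard thresholding is deterministic, whereas Bernoulli sampling yields a different binary result on each draw; the contrast mirrors the DBN/SBN distinction in Section 2.2, except that here the binarization happens only after training.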
We also show in Figure 4 the histogram of its probabilistic predictions. We can see that the histogram of this real-valued model is more U-shaped than that of the proposed model with SBNs. Moreover, there is no notch in the middle of the curve, as compared to the proposed model with DBNs. From here we can see how different binarization strategies can shape the characteristics of the pre-activated outputs of binary neurons. This also emphasizes the importance of including the binarization operations in the training so that the binarization operations themselves can be optimized as well.
4.5 Experiment III—Comparison of the proposed model trained with the GAN, WGAN and WGAN-GP objectives
In the third experiment, we compare the proposed model trained with the WGAN-GP objective against that trained with the GAN objective and with the WGAN objective (using weight clipping). Implementation details are summarized as follows.

For the GAN model, we apply batch normalization to the generator and the discriminator.

For the WGAN model, we find it works better to apply batch normalization to the generator while omitting it for the discriminator.

For the GAN model, we employ the Adam optimizer [12].
As can be seen from Figure 6, the WGAN model is able to generate digits of similar quality to those of the WGAN-GP model, while the GAN model suffers from the so-called mode collapse issue.
4.6 Experiment IV—Comparison of the proposed model using multilayer perceptrons and convolutional neural networks
In the last experiment, we compare the performance of using multilayer perceptrons (MLPs) and convolutional neural networks (CNNs). For the CNN model, we implement both the generator and the discriminator as deep CNNs (see Tables 3 and 4 for the network architectures). Note that the numbers of trainable parameters for the MLP and CNN models are 0.53M and 1.4M, respectively.
Layer | Number of filters | Kernel | Strides | Activation
transconv | | | | ReLU
transconv | | | | ReLU
transconv | | | | ReLU
transconv | | | | sigmoid
Layer | Number of filters | Kernel | Strides | Activation
conv | | | | LeakyReLU
maxpool | | | |
conv | | | | LeakyReLU
maxpool | | | |
flatten | | | |
dense | | | | LeakyReLU
dense | | | | sigmoid^{*}
^{*}No activation for the WGAN and WGANGP models.
We present in Figure 7 some sample digits generated by the proposed model and the real-valued model implemented as CNNs. It can be clearly seen that the CNN model better captures the characteristics of different digits and generates fewer artifacts, even with a smaller number of trainable parameters as compared to the MLP model.
5 Discussions and Conclusions
We have presented a novel GAN-based model that can generate binary-valued predictions without any further post-processing and can be trained by end-to-end backpropagation. We have implemented such a model to generate binarized MNIST digits and experimentally compared the proposed model for different types of binary neurons, GAN objectives and network architectures. Although the results are still preliminary, we have shown that the use of gradient estimators can be a promising direction for modeling discrete distributions with GANs. A future direction is to examine the use of gradient estimators for training a GAN that has a conditional computation graph [3, 4], which would allow the system to make binary decisions by binary neurons for more advanced designs.
References
 [1] Binary stochastic neurons in tensorflow, 2016. Blog post on the R2RT blog. [Online] https://r2rt.com/binarystochasticneuronsintensorflow.
 [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proc. ICML, 2017.
 [3] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 [4] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In Proc. ICLR, 2017.
 [5] HaoWen Dong, WenYi Hsiao, LiChia Yang, and YiHsuan Yang. MuseGAN: Multitrack sequential generative adversarial networks for symbolic music generation and accompaniment. In Proc. AAAI, 2018.
 [6] HaoWen Dong and YiHsuan Yang. Convolutional generative adversarial networks with binary neurons for polyphonic music generation. In Proc. ISMIR, 2018.
 [7] Ian J. Goodfellow et al. Generative adversarial nets. In Proc. NIPS, 2014.
 [8] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Proc. NIPS, 2017.
 [9] Geoffrey Hinton. Neural networks for machine learning—Using noise as a regularizer (lecture 9c), 2012. Coursera, video lectures. [Online] https://www.coursera.org/lecture/neuralnetworks/usingnoiseasaregularizer7minwbw7b.
 [10] R. Devon Hjelm et al. Boundaryseeking generative adversarial networks. In Proc. ICLR, 2018.
 [11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, 2015.
 [12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] Matt J. Kusner and José Miguel HernándezLobato. GANS for sequences of discrete elements with the Gumbelsoftmax distribution. In Proc. NIPS Workshop on Adversarial Training, 2016.
 [14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
 [15] Jiwei Li et al. Adversarial learning for neural dialogue generation. In Proc. EMNLP, 2017.
 [16] Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. Adversarial generation of natural language. In Proc. ACL Workshop on Representation Learning for NLP, 2017.
 [17] Tijmen Tieleman and Geoffrey Hinton. Neural networks for machine learning—Rmsprop: Divide the gradient by a running average of its recent magnitude (lecture 6e), 2012. Coursera, video lectures. [Online] https://www.coursera.org/lecture/neuralnetworks/rmspropdividethegradientbyarunningaverageofitsrecentmagnitudeYQHki.
 [18] Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Improving neural machine translation with conditional sequence generative adversarial nets. In Proc. NAACL, 2018.
 [19] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proc. AAAI, 2017.
[Figure 6: sample generated digits and pre-activated outputs of the proposed model with DBNs and SBNs, trained with the WGAN-GP, WGAN and GAN objectives, each with and without batch normalization in the discriminator.]