# Neural Bayes

## Abstract

We introduce a parameterization method called Neural Bayes which allows computing statistical quantities that are in general difficult to compute and opens avenues for formulating new objectives for unsupervised representation learning. Specifically, given an observed random variable x and a latent discrete variable z, we can express p(z|x), p(x|z) and p(z) in closed form in terms of a sufficiently expressive function (e.g. a neural network) using our parameterization, without restricting the class of these distributions. To demonstrate its usefulness, we develop two independent use cases for this parameterization:

1. Mutual Information Maximization (MIM): MIM has become a popular means for self-supervised representation learning. Neural Bayes allows us to compute mutual information between observed random variables and latent discrete random variables in closed form. We use this for learning image representations and show its usefulness on downstream classification tasks.

2. Disjoint Manifold Labeling: Neural Bayes allows us to formulate an objective which can optimally label samples from disjoint manifolds present in the support of a continuous distribution. This can be seen as a specific form of clustering where each disjoint manifold in the support is a separate cluster. We design clustering tasks that obey this formulation and empirically show that the model optimally labels the disjoint manifolds. Our code is available at https://github.com/salesforce/NeuralBayes.


## 1 Introduction

Humans have the ability to automatically categorize objects and entities through sub-consciously defined notions of similarity, and often in the absence of any supervised signal. For instance, studies have shown that young infants are capable of automatically forming categories based on gender Younger and Fearing (1999); Johnston et al. (2001), types of animals Gopnik and Meltzoff (1987); Bornstein and Arterberry (2010), shapes Smith and Samuelson (2006), etc. It is generally hypothesized that such discrete categorizations result in efficient encoding of sensory input that reduces the amount of information processing required by the brain Rakison and Yermolayeva (2010). Therefore, unsupervised categorization can be seen as a means of learning useful encoding for real world data. This skill is extremely valuable since the majority of data available in the real world is unlabeled.

In this spirit, we introduce a generic parameterization that allows learning representations from unlabeled data by categorizing them. Specifically, our parameterization implicitly maps samples from an observed random variable to a latent discrete space where the distribution gets segmented into a finite number of arbitrary conditional distributions. Imposing different conditions on the latent space through different objective functions will result in learning qualitatively different representations.

We note that our parameterization may be used to compute statistical quantities involving observed variables and latent discrete variables that are in general difficult to compute, thus providing a flexible framework for unsupervised representation learning. To illustrate this aspect, we develop two independent use cases for the parameterization: mutual information maximization (Linsker, 1988) and disjoint manifold labeling, as described in the abstract. For the MIM task, we experiment with benchmark image datasets and show that the unsupervised representation learned by the network achieves good performance on downstream classification tasks. For the manifold labeling task, we show experiments on 2D datasets and their high-dimensional counterparts designed as per the problem formulation, and show that the proposed objective can optimally label disjoint manifolds. For both objectives we design regularizations necessary to achieve the desired behavior in practice.

The paper is organized as follows. We introduce the parameterization in section 2. We then develop the two applications of the parameterization, viz. mutual information maximization and disjoint manifold labeling, in sections 3 and 4 respectively. Finally, we show experiments in section 5, followed by related work and the conclusion. All proofs can be found in the appendix.

## 2 Neural Bayes

Consider a data distribution p(x) from which we have access to i.i.d. samples x ∈ R^n. We suppose that this marginal distribution is a union of K conditionals, where the density of the k-th conditional is denoted by p(x|z=k) and the corresponding probability mass is denoted by p(z=k). Here z is a discrete random variable with K states. We now introduce the parameterization that allows us to implicitly factorize any marginal distribution into conditionals as described above. Aside from the technical details, the key idea behind this parameterization is Bayes' rule.

###### Lemma 1

Let p(x|z) and p(z) be any conditional and prior distribution defined for continuous random variable x and discrete random variable z with K states. If p(z=k) > 0 for all k, then there exists a non-parametric function L: R^n → [0,1]^K for any given input x, with the property Σ_{k=1}^K L_k(x) = 1, such that,

$$p(x|z=k) = \frac{L_k(x)\, p(x)}{\mathbb{E}_{x\sim p(x)}[L_k(x)]}, \qquad p(z=k) = \mathbb{E}_x[L_k(x)], \qquad p(z=k|x) = L_k(x) \tag{1}$$

and this parameterization is consistent.

Thus the function L can be seen as performing a soft categorization of input samples. In practice, we use a neural network with sufficient capacity and a softmax output to realize this function L. We name our parameterization method Neural Bayes and replace L with L_θ to denote the parameters θ of the network. By imposing different conditions on the structure of L_θ through meaningful objectives, we get qualitatively different kinds of factorization of the marginal p(x), and therefore the function L_θ will encode the posterior p(z|x) for that factorization. In summary, if one formulates any objective that involves the terms p(x|z=k), p(z=k) or p(z=k|x), where x is an observed random variable and z is a discrete latent random variable, then these can be substituted with L_k(x)·p(x)/E_x[L_k(x)], E_x[L_k(x)] and L_k(x) respectively.
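To make the substitutions concrete, the parameterization can be sketched numerically. The toy linear-softmax model below is our own illustrative stand-in for L_θ (not the paper's encoder); the point is that any softmax-output network yields valid estimates of p(z=k|x) and p(z=k).

```python
import numpy as np

rng = np.random.default_rng(0)

def L(x, W):
    # Hypothetical stand-in for a softmax network L_theta: rows are samples,
    # columns are the K states; each row is non-negative and sums to 1.
    logits = x @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

X = rng.normal(size=(1000, 5))   # i.i.d. samples from p(x)
W = rng.normal(size=(5, 3))      # K = 3 latent states

posteriors = L(X, W)             # p(z=k|x)  = L_k(x)
prior = posteriors.mean(axis=0)  # p(z=k)    = E_x[L_k(x)]
```

Note that the density ratio p(x|z=k)/p(x) = L_k(x)/E_x[L_k(x)] is also immediately available per sample, without ever modeling p(x) itself.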

On an important note, the Neural Bayes parameterization requires using the term E_x[L_θ(x)], through which computing the gradient is infeasible in general. A general discussion around this can be found in appendix A. Nonetheless, we show that mini-batch gradients can have good fidelity for one of the objectives we propose using our parameterization. In the next two sections, we explore two different ways of factorizing p(x), resulting in qualitatively different goals of unsupervised representation learning.

## 3 Mutual Information Maximization (MIM)

### 3.1 Theory

Suppose we want to find a discrete latent representation z (with K states) for the distribution p(x) such that the mutual information I(x; z) is maximized (Linsker, 1988). Such an encoding must be very efficient, since it has to capture the maximum possible information about the continuous distribution p(x) in just K discrete states. Assuming we can learn such an encoding, we are interested in computing the posterior p(z|x), since it tells us the likelihood of x belonging to each discrete state of z, thereby performing a soft categorization which may be useful for downstream tasks. In the proposition below, we show an objective for computing p(z|x) for a discrete latent representation z that maximizes I(x; z).

###### Proposition 1

(Neural Bayes-MIM-v1) Let L: R^n → [0,1]^K be a non-parametric function for any given input x, with the property Σ_{k=1}^K L_k(x) = 1. Consider the following objective,

$$L^* = \arg\max_{L} \; \mathbb{E}_x\left[\sum_{k=1}^{K} L_k(x) \log \frac{L_k(x)}{\mathbb{E}_x[L_k(x)]}\right] \tag{2}$$

Then L*_k(x) = p(z=k|x) for all k, where z is the discrete latent representation maximizing I(x; z).

The proof essentially involves expressing the mutual information in terms of p(z|x), p(z) and p(x), which can be substituted using the Neural Bayes parameterization. However, the objective proposed above poses a challenge: it contains the term E_x[L_k(x)], for which computing a high fidelity gradient in a mini-batch setting is problematic (see appendix A). We can nevertheless overcome this problem for the MIM objective, because it turns out that the gradients through certain terms are zero, as shown by the following theorem.
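Before turning to the gradient question, the quantity in Eq. (2) itself is straightforward to estimate from the outputs of L. The helper below (`neural_bayes_mi` is our own illustrative name, not the authors' code) computes the plug-in estimate and shows its two extremes: confident, balanced one-hot assignments achieve the maximum log K, while uniform soft assignments carry zero information.

```python
import numpy as np

def neural_bayes_mi(L_vals):
    """Plug-in estimate of Eq. (2): E_x[sum_k L_k(x) log(L_k(x)/E_x[L_k(x)])].
    L_vals: (N, K) array of softmax outputs, one row per sample."""
    eps = 1e-12  # numerical guard for log(0)
    prior = L_vals.mean(axis=0)  # estimate of E_x[L_k(x)]
    return np.mean(np.sum(L_vals * (np.log(L_vals + eps) - np.log(prior + eps)), axis=1))

# Maximally informative encoding: balanced one-hot assignments, I(x;z) = log K.
onehot = np.tile(np.eye(3), (100, 1))
# Uninformative encoding: identical soft assignments, I(x;z) = 0.
uniform = np.full((300, 3), 1.0 / 3.0)
```

This also makes the role of the denominator E_x[L_k(x)] visible: it is a batch statistic, which is exactly why naive mini-batch gradients through it are problematic.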

###### Theorem 1

Define
$$J(\theta) = -\mathbb{E}_x\left[\sum_{k=1}^{K} L^\theta_k(x) \log \frac{L^\theta_k(x)}{\mathbb{E}_x[L^\theta_k(x)]}\right] \tag{3}$$
$$\hat{J}(\theta) = -\mathbb{E}_x\left[\sum_{k=1}^{K} L^\theta_k(x) \log \left\langle \frac{L^\theta_k(x)}{\mathbb{E}_x[L^\theta_k(x)]} \right\rangle\right] \tag{4}$$

where ⟨⋅⟩ indicates that gradients are not computed through the argument. Then ∇_θ J(θ) = ∇_θ Ĵ(θ).

The above theorem implies that as long as we plug in a decent estimate of E_x[L^θ_k(x)], unbiased gradients can be computed without the need to compute gradients through the entire dataset. Note that the objective Ĵ(θ) can be re-written as,

$$\min_\theta \; -\mathbb{E}_x\left[\sum_{k=1}^{K} L^\theta_k(x) \log \langle L^\theta_k(x) \rangle\right] + \sum_{k=1}^{K} \mathbb{E}_x[L^\theta_k(x)] \log \langle \mathbb{E}_x[L^\theta_k(x)] \rangle \tag{5}$$

The second term is the negative entropy of the discrete latent representation z, and it acts as a uniform prior. In other words, this term encourages learning a latent code such that all states of z activate uniformly over the marginal input distribution p(x). This is an attribute of distributed representations, which is a fundamental goal in deep learning. We can therefore further encourage this behavior by treating the coefficient of this term as a hyper-parameter. In our experiments we confirm both the distributed-representation behavior of this term and the benefit of tuning its coefficient.

### 3.2 Implementation Details

Alternative Formulation of Uniform Prior: In practice we found that an alternative formulation of the second term in Eq (5) results in better performance and more interpretable filters. Specifically, we replace it with the following cross-entropy formulation,

$$R_p(\theta) := -\sum_{k=1}^{K} \left[ \frac{1}{K} \log\left(\mathbb{E}_x[L_k(x)]\right) + \frac{K-1}{K} \log\left(1 - \mathbb{E}_x[L_k(x)]\right) \right] \tag{6}$$

While both the second term in Eq (5) and R_p(θ) are minimized when E_x[L_k(x)] = 1/K for all k, the latter formulation provides much stronger gradients during optimization when E_x[L_k(x)] approaches 1 (see appendix C.1 for details); E_x[L_k(x)] = 1 is undesirable since it discourages distributed representation. Finally, unbiased gradients can be computed through Eq (6) as long as a good estimate of E_x[L_k(x)] is plugged in. Also note that the condition in lemma 1 is met by the Neural Bayes-MIM objective implicitly during optimization, as discussed in the above paragraph in regards to distributed representation.
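A minimal sketch of this cross-entropy prior, assuming the batch statistics E_x[L_k(x)] have already been estimated (`R_p` is an illustrative helper of our own, not the released code):

```python
import numpy as np

def R_p(mean_activation):
    """Cross-entropy uniform prior of Eq. (6).
    mean_activation[k] approximates E_x[L_k(x)] over a batch."""
    m = np.asarray(mean_activation, dtype=float)
    K = len(m)
    return -np.sum((1.0 / K) * np.log(m) + ((K - 1.0) / K) * np.log(1.0 - m))

# Minimized at the uniform prior m_k = 1/K; skewed priors are penalized,
# and the log(1 - m) term blows up as any state approaches m_k -> 1.
balanced = R_p([0.25, 0.25, 0.25, 0.25])
skewed = R_p([0.97, 0.01, 0.01, 0.01])
```

The `log(1 - E_x[L_k(x)])` term is what supplies the strong gradient near 1 that the plain negative-entropy term lacks.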

Implementation: The final Neural Bayes-MIM-v2 objective is,

$$\min_\theta \; -\mathbb{E}_x\left[\sum_{k=1}^{K} L^\theta_k(x) \log \langle L^\theta_k(x) + \epsilon \rangle\right] + (1+\alpha)\cdot R_p(\theta) + \beta \cdot R_c \tag{7}$$

where α and β are hyper-parameters, R_c is a smoothness regularization introduced in section 4.2, and ε is a small scalar used to prevent numerical instability. Qualitatively, we find that the R_c regularization prevents filters from memorizing the input samples. Finally, we apply the first two terms in Eq (7) to all hidden layers of a deep network at different scales (computed by spatial average pooling followed by a softmax). These two regularizations gave a significant performance boost. Thorough implementation details are provided in appendix B. For brevity, we refer to our final objective as Neural Bayes-MIM in the rest of the paper.

On the other hand, to compute a good estimate of the gradients, we use the following trick. During optimization, we compute gradients using a sufficiently large mini-batch of size MBS that fits in memory (so that the estimate of E_x[L^θ_k(x)] is reasonable), accumulate these gradients until BS samples are seen (e.g. 2000), and average them before updating the parameters to further reduce the estimation error.
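The accumulation loop can be sketched as below. The quadratic stand-in loss is our own simplification: its gradient is linear in the batch, so averaging equal-sized mini-batch gradients recovers the full-batch gradient exactly, whereas for the actual MIM objective (nonlinear in the batch statistic E_x[L_θ(x)]) the averaging reduces estimation error rather than eliminating it.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 8))  # BS = 2000 samples seen per parameter update
MBS = 500                          # mini-batch size that fits in memory
theta = rng.normal(size=8)

def minibatch_grad(batch, theta):
    # Stand-in for the gradient of the objective on one mini-batch;
    # here, the gradient of 0.5 * mean((x . theta)^2) w.r.t. theta.
    return np.mean(batch * (batch @ theta)[:, None], axis=0)

# Accumulate mini-batch gradients until BS samples are seen, then average
# once before the parameter update.
grads = [minibatch_grad(data[i:i + MBS], theta) for i in range(0, len(data), MBS)]
accumulated = np.mean(grads, axis=0)
full = minibatch_grad(data, theta)
```

In a deep learning framework this corresponds to calling `backward()` per mini-batch and stepping the optimizer only once per BS samples.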

## 4 Disjoint Manifold Labeling (DML)

### 4.1 Theory

A distribution is defined over a support. In many cases, the support may be a set of disjoint manifolds. In this task, our goal is to label samples from each disjoint manifold with a distinct value. This formulation can be seen as a generalization of subspace clustering (Ma et al., 2008), in which affine manifolds are considered. To make the problem concrete, we first formalize the definition of a disjoint manifold.

###### Definition 1

(Connected Set) We say that a set S ⊆ R^n is a connected set (disjoint manifold) if for any x1, x2 ∈ S, there exists a continuous path between x1 and x2 such that all the points on the path also belong to S.

To identify such disjoint manifolds in a distribution, we exploit the observation that only partitions that separate one disjoint manifold from the others yield high divergence between the respective conditional distributions, while partitions that cut through a disjoint manifold result in conditional distributions with low divergence between them. Therefore, the objective we propose for this task is to partition the unlabeled data distribution p(x) into conditional distributions q_k(x) such that a divergence between them is maximized. By doing so we recover the conditional distributions defined over the disjoint manifolds (we prove its optimality in theorem 2). We begin with two disjoint manifolds and extend this idea to multiple disjoint manifolds in appendix G.

Let J(·||·) be a symmetric divergence (e.g. Jensen-Shannon divergence, Wasserstein divergence, etc.), and let q0(x) and q1(x) be the disjoint conditional distributions that we want to learn. Then the aforementioned objective can be written formally as follows:

$$\max_{q_0, q_1, \pi \in (0,1)} \; J(q_0(x) \,\|\, q_1(x)) \tag{8}$$
$$\text{s.t.} \quad \int_x q_0(x)\,dx = 1, \quad \int_x q_1(x)\,dx = 1, \quad q_1(x)\cdot\pi + q_0(x)\cdot(1-\pi) = p(x).$$

Since our goal is simply to assign labels to data samples according to which manifold they belong to, instead of learning the conditional distributions themselves as achieved by Eq. (8), we would like to learn a function L(x) which maps samples from disjoint manifolds to distinct labels. To do so, below we derive an objective equivalent to Eq. (8) that learns such a function L.

###### Proposition 2

(Neural Bayes-DML) Let L: R^n → [0,1] be a non-parametric function for any given input x, and let J be the Jensen-Shannon divergence. Define scalars η1 := E_x[L(x)] and η0 := 1 − E_x[L(x)], and let f1(x) := L(x)/η1 and f0(x) := (1 − L(x))/η0. Then the objective in Eq. (8) is equivalent to,

$$\max_L \; \frac{1}{2}\cdot\mathbb{E}_x\left[f_1(x)\cdot\log\left(\frac{f_1(x)}{f_1(x)+f_0(x)}\right)\right] + \frac{1}{2}\cdot\mathbb{E}_x\left[f_0(x)\cdot\log\left(\frac{f_0(x)}{f_1(x)+f_0(x)}\right)\right] + \log 2 \tag{9}$$
$$\text{s.t.} \quad \mathbb{E}_x[L(x)] \notin \{0, 1\}. \tag{10}$$

Optimality: We now prove the optimality of the proposed objective towards discovering disjoint manifolds present in the support of a probability density function p(x).

###### Theorem 2

(optimality) Let p(x) be a probability density function over R^n whose support is the union of two non-empty connected sets (definition 1) S1 and S2 that are disjoint, i.e. the distance between S1 and S2 is strictly positive. Let L belong to the class of continuous functions, learned by solving the objective in Eq. (9). Then the objective in Eq. (9) is maximized if and only if one of the following is true:

1. L(x) = 1 for all x ∈ S1 and L(x) = 0 for all x ∈ S2;
2. L(x) = 0 for all x ∈ S1 and L(x) = 1 for all x ∈ S2.

The above theorem proves that optimizing the derived objective over the space of continuous functions implicitly partitions the data distribution into maximally separated conditionals by assigning a distinct label to the points in each manifold. Most importantly, the theorem shows that the continuity condition on the function L plays an important role. Without this condition, the network cannot identify disjoint manifolds.

### 4.2 Implementation Details

Prior Collapse: The constraint E_x[L(x)] ∉ {0, 1} in proposition 2 is a boundary condition required for technical reasons in lemma 1. In practice we do not need to worry about it, because optimization itself avoids situations where E_x[L(x)] ∈ {0, 1}. To see why, note that unless L is initialized such that E_x[L(x)] ∈ {0, 1} exactly, the log terms are negative by definition. Since the denominators of f1 and f0 are E_x[L(x)] and 1 − E_x[L(x)] respectively, the objective is maximized when E_x[L(x)] moves away from 0 and 1. Thus, for any reasonable initialization, optimization itself pushes E_x[L(x)] away from 0 and 1.

Smoothness of L_θ: As shown in theorem 2, the proposed objectives can optimally recover disjoint manifolds only when the function L is continuous. In practice we found that enforcing the function L_θ to be smooth (and thus also continuous) helps significantly. Therefore, after experimenting with a handful of heuristics for regularizing L_θ, we found the following finite-difference Jacobian regularization to be effective (L_θ(x) can be scalar- or vector-valued),

$$R_c = \frac{1}{B}\sum_{i=1}^{B} \frac{\left\|L_\theta(x_i) - L_\theta(x_i + \zeta\cdot\hat{\delta}_i)\right\|_2^2}{\zeta^2} \tag{11}$$

where δ̂_i is a normalized noise vector computed independently for each sample x_i in a batch of size B as,

$$\delta_i := X v_i, \qquad \hat{\delta}_i := \frac{\delta_i}{\|\delta_i\|_2}. \tag{12}$$

Here X ∈ R^{n×B} is the matrix containing the batch of samples, and each dimension of v_i ∈ R^B is sampled i.i.d. from a standard Gaussian. This computation ensures that the perturbation lies in the span of the data, which we found to be important. Finally, ζ is the scale of the normalized noise added to all samples in a batch. In our experiments, since we always normalize the datasets to have zero mean and unit variance across all dimensions, we sample ζ on a comparable scale.
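The regularizer of Eqs. (11)-(12) can be sketched directly in numpy. `R_c` below is an illustrative implementation of our own, assuming batches are stored column-wise (one sample per column, matching the X ∈ R^{n×B} convention above):

```python
import numpy as np

rng = np.random.default_rng(2)

def R_c(L_theta, X, zeta=0.1):
    """Finite-difference smoothness penalty of Eq. (11).
    X: (n, B) batch matrix; perturbations lie in the span of the batch."""
    n, B = X.shape
    V = rng.normal(size=(B, B))   # one v_i ~ N(0, I_B) per sample (columns)
    delta = X @ V                 # delta_i = X v_i, shape (n, B)
    delta_hat = delta / np.linalg.norm(delta, axis=0, keepdims=True)
    diff = L_theta(X + zeta * delta_hat) - L_theta(X)
    return np.mean(np.sum(diff ** 2, axis=0)) / zeta ** 2

X = rng.normal(size=(5, 32))
# A linear map has a bounded finite-difference Jacobian; for the identity
# map the penalty is exactly 1 since each perturbation has unit norm.
r_linear = R_c(lambda Z: Z, X)
```

A rapidly oscillating L_θ incurs a much larger penalty than a linear one, which is exactly the pressure toward smoothness the objective needs.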

Implementation: We implement the binary-partition Neural Bayes-DML using the Monte-Carlo sampling approximation of the following objective,

$$\min_\theta \; \frac{1}{2}\cdot\mathbb{E}_x\left[f_1(x)\cdot\log\left(1 + \frac{f_0(x)}{f_1(x)}\right)\right] + \frac{1}{2}\cdot\mathbb{E}_x\left[f_0(x)\cdot\log\left(1 + \frac{f_1(x)}{f_0(x)}\right)\right] + \beta\cdot R_c \tag{13}$$

where f1(x) = L_θ(x)/(E_x[L_θ(x)] + ε) and f0(x) = (1 − L_θ(x))/(1 − E_x[L_θ(x)] + ε). Here ε is a small scalar used to prevent numerical instability, and β is a hyper-parameter that controls the smoothness of L_θ. The multi-partition case can be implemented in a similar way. Due to the need for computing E_x[L_θ(x)] in the objective, optimizing it using gradient descent methods with small batch sizes is not possible. Therefore we experiment with this method on datasets where gradients can be computed for the very large batch size needed to approximate the gradient through E_x[L_θ(x)] sufficiently well.

## 5 Experiments

### 5.1 Mutual Information Maximization

Instead of aiming for state-of-the-art results, our goal in this section is to conduct a preliminary (but thorough) set of experiments using Neural Bayes-MIM to understand the behavior of the algorithm and the hyper-parameters involved, and to make a fair comparison with popular existing methods for self-supervised learning. Therefore, we use a simple CNN encoder architecture in our experiments (the full specification is given in appendix B). The encoder is initialized using orthogonal initialization (Saxe et al., 2013), batch normalization (Ioffe and Szegedy, 2015) is used after each convolution layer, and ReLU non-linearities are used. All datasets are normalized to have dimension-wise zero mean and unit variance. Early stopping in all experiments is done using the test set (following previous work). We broadly follow the experimental setup of Hjelm et al. (2019). We do not use any data augmentation in our experiments. After training the encoder, we freeze its features and train a 1-hidden-layer (200 units) classifier to get the final test accuracy. Extending the algorithm to more complex architectures (e.g. ResNets, He et al. (2016)), the use of multiple data augmentation techniques, and other advanced regularizations (e.g. see Bachman et al. (2019)) is left as future work.

#### Ablation Studies

Behavior of Neural Bayes-MIM-v1 (Eq 2) vs Neural Bayes-MIM (v2, Eq 7): The experiments and details are discussed in appendix C.2. The main differences are: 1. the majority of the filters learned by the v1 objective are dead, as opposed to the v2 objective, which encourages distributed representation; 2. the performance of v2 is better than that of v1.

Visualization of Filters: We visualize the filters learned by the Neural Bayes-MIM objective on MNIST digits and qualitatively study the effects of the regularizations used. For this we train a deep fully connected network with 3 hidden layers, each of width 500, using Adam with learning rate 0.001, batch size 500 and no weight decay for 50 epochs (other Adam hyper-parameters are kept standard). We train three configurations: 1. no regularization; 2. the uniform prior regularization only; 3. both the uniform prior and smoothness regularizations. The learned filters are shown in figure 4. We find that the uniform prior regularization (α) prevents dead filters while the smoothness regularization (β) prevents input memorization.

Performance due to Regularizations and State Scaling: We now evaluate the effects of the various components involved in the Neural Bayes-MIM objective: the coefficients α and β, and applying the objective at different scales of the hidden states. We use the CIFAR-10 dataset for these experiments.

In the first experiment, for each number of scales considered, we vary α and β and record the final performance, thus capturing the variation in performance due to all three components. We consider two scaling configurations: 1. no pooling is applied to the hidden layers; 2. for each hidden layer, we additionally spatially average pool the state using a pooling filter with a stride of 2. For the encoder used in our experiments (which has 4 internal hidden layers post ReLU), this gives us 4 and 8 states respectively (including the original un-scaled hidden layers) to which the Neural Bayes-MIM objective is applied. After getting all the states, we apply the softmax activation to each state along the channel dimension so that the Neural Bayes parameterization holds. For states with spatial height and width, the objective is applied to each spatial location separately and averaged. Also, for states with height (or width) less than the pooling size, we use the height (or width) as the pooling size.
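The multi-scale state construction can be sketched for a single sample. `scaled_states` is a hypothetical helper of our own; it assumes a 2×2 average pool with stride 2 (the paper states only the stride), followed by a channel-wise softmax so that each spatial location becomes a valid posterior over K = C states.

```python
import numpy as np

def scaled_states(h, pool=2):
    """Spatially average-pool a post-ReLU hidden state, then apply softmax
    along the channel dimension. h: (C, H, W) state of one sample."""
    C, H, W = h.shape
    ph, pw = min(pool, H), min(pool, W)  # shrink the pool for tiny feature maps
    Hp, Wp = H // ph, W // pw
    pooled = h[:, :Hp * ph, :Wp * pw].reshape(C, Hp, ph, Wp, pw).mean(axis=(2, 4))
    e = np.exp(pooled - pooled.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)  # softmax over channels

state = np.maximum(np.random.default_rng(4).normal(size=(16, 8, 8)), 0)
s = scaled_states(state)  # (16, 4, 4): a K=16 posterior at each location
```

The MIM objective is then evaluated at each spatial location of each such state and averaged, as described above.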

We train Neural Bayes-MIM on the full training set for 100 epochs using Adam with learning rate 0.001 (other Adam hyper-parameters are standard), mini-batch size 500, batch size 2000 and no weight decay. In the first 32 experiments, α and β are sampled uniformly at random from their respective ranges. In the next 5 experiments, α is set to 0 while β is sampled uniformly; in the final 5 experiments, β is set to 0 while α is sampled uniformly. Thus in total we run 42 experiments for each scaling configuration considered.

Once we get a trained encoder L_θ, we train a 1-hidden-layer (200 units) MLP classifier on the frozen features from L_θ using the labels in the training set. This training is done for 100 epochs using Adam with learning rate 0.001 (other Adam hyper-parameters are standard), batch size 128 and no weight decay.

As a baseline for these experiments, we use a randomly initialized encoder. Since there are no tunable encoder hyper-parameters in this case, we perform a grid search over the classifier hyper-parameters: weight decay, batch size and learning rate, yielding a total of 16 configurations. We take the best test accuracy from these runs as our baseline.

The performance of encoders under the aforementioned configurations is shown in figure 5. It is clear that both hyper-parameters α and β play an important role in the quality of the representations learned. Also, applying Neural Bayes-MIM at different scales of the network states significantly improves the average and best performance.

Effect of Mini-batch Size (MBS) and Batch Size (BS): During implementation, we proposed to compute gradients using a reasonably large mini-batch of size MBS and accumulate gradients until BS samples are seen. This is done to overcome the gradient estimation problem due to the term E_x[L_θ(x)] in Neural Bayes-MIM. Here we evaluate the effect of these two hyper-parameters on the final test performance. For each combination of MBS and BS, we train the CNN encoder using Neural Bayes-MIM with α and β chosen by examining figure 5; the rest of the training settings are kept identical to those used for the figure 5 experiments. Table 1 shows the final test accuracy on CIFAR-10 for each combination of the hyper-parameters MBS and BS. We make two observations: 1. using a very small MBS (e.g. 50 or 100) typically results in poor performance (even worse than that of a random encoder), while a larger MBS significantly improves performance; 2. using a larger BS further improves performance in most cases (even when MBS is small).

Accuracy vs Epochs: Finally, we plot the evolution of accuracy over epochs for all the models learned in the experiments of figure 5. For Neural Bayes-MIM we use the models trained with state scaling (42 in total), and all 16 models for the random encoder. The convergence plot is shown in figure 6.

#### Final Classification Performance

We compare the final test accuracy of Neural Bayes-MIM with 3 baselines: a random encoder (described in the ablation studies), Deep Infomax (Hjelm et al., 2019), and rotation prediction based representation learning (Gidaris et al., 2018), on the benchmark image datasets CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) and STL-10 (Coates et al., 2011). Random Network refers to the use of a randomly initialized network; its experimental details are identical to those in our ablation studies, with the hyper-parameter search over 16 configurations done for each dataset separately.

DIM results are reported from Hjelm et al. (2019). We omit the STL-10 number for DIM because we resize images to a much smaller size in our runs than that used in DIM.

Rotation prediction refers to the algorithm in Gidaris et al. (2018), where the encoder is learned by training it to predict the rotation of unlabeled images. We use the same CNN architecture used in the previous experiments, with a linear classifier added on top, and train it to predict 4 rotation angles: 0, 90, 180 and 270 degrees. We run this pre-training with 8 configurations of hyper-parameters covering batch size (each batch further includes the rotated copies of each sample, making the total batch size 100 or 200), weight decay and learning rate. For each run, we then train a 1-hidden-layer (200 units) classifier on top of the frozen features. We report the best performance of all runs. Since Kolesnikov et al. (2019) report that lower layers in CNN architectures trained with rotation prediction have better performance on downstream tasks, we also train classifiers on an earlier layer and report their performance, which is significantly better.

The following describes the experimental details for Neural Bayes-MIM. We use α and β chosen roughly by examining figure 5, and MBS=500, BS=4000 in all the experiments. Note that these values are not tuned for STL-10 and CIFAR-100. For CIFAR-10 and STL-10, we run 4 configurations of Neural Bayes-MIM over the hyper-parameters learning rate and weight decay. For each run, we then train a 1-hidden-layer (200 units) classifier on top of the frozen features. We report the best performance of all runs. For CIFAR-100, we take the encoder that produces the best performance on CIFAR-10 and train a classifier with the 2 learning rates, reporting the best of the 2 runs. Similar to rotation prediction, we also train classifiers on an earlier layer and report their performance.

Table 2 reports the classification performance of all the methods. We note that all experiments were done with the same CNN architecture and without any data augmentation. Neural Bayes-MIM outperforms the baseline methods in general. However, when using earlier-layer features, rotation prediction (RP) performs better. We hope to further improve the performance of Neural Bayes-MIM with additional regularizations similar to Bachman et al. (2019).

### 5.2 Disjoint Manifold Labeling

Clustering in general is an ill-posed problem. In our problem setup, however, the goal is precise: to optimally label all the disjoint manifolds present in the support of a distribution. Since this is a unique goal that is not generally considered in the literature, as empirical verification we show qualitative results on 2D synthetic datasets in figure 7. The top 2 sub-figures have 2 clusters and the bottom 2 have 3 clusters. For all experiments we use a 4-layer MLP with 400 hidden units per layer, batch norm, ReLU activations, and a softmax activation at the last layer. In all cases we train using the Adam optimizer with a learning rate of 0.001, a batch size of 400 and no weight decay, until convergence. The regularization coefficient β was chosen as the value that resulted in optimal clustering. For generality, these 2D datasets were projected to high dimensions (512) by appending 510 dimensions of 0 entries to each sample and then randomly rotated before performing clustering. The datasets were then projected back to the original 2D space for visualizing predictions. Additional experiments can be found in appendix H.

## 6 Related Work

Neural Bayes-MIM maximizes mutual information for learning useful representations in a self-supervised way. Since its introduction in Linsker (1988) and Bell and Sejnowski (1995), a myriad of self-supervised methods have involved MIM. As discussed in Vincent et al. (2010), auto-encoder based methods achieve this goal implicitly by minimizing the reconstruction error of the input samples under an isotropic Gaussian assumption. Deep Infomax (DIM, Hjelm et al. (2019)) instead uses MINE (Belghazi et al., 2018) to estimate MI and maximize it, while applying it to both local and global features and imposing priors on the learned representation. Hjelm et al. (2019) have also shown that DIM performs better than representations learned by auto-encoder based methods such as VAE (Kingma and Welling, 2013), β-VAE (Higgins et al., 2017) and adversarial auto-encoders (Makhzani et al., 2015), among others such as noise as targets (Bojanowski and Joulin, 2017) and BiGAN (Donahue et al., 2016). Contrastive Predictive Coding (Oord et al., 2018) also maximizes MI by predicting lower layer representations from higher layers using a contrastive loss instead of a reconstruction loss.

Unlike the aforementioned methods that learn continuous latent representations, Neural Bayes-MIM implicitly learns discrete latent representations. We note that the estimation of mutual information due to the Neural Bayes parameterization in the Neural Bayes-MIM-v1 objective (Eq 2) turns out to be identical to the one proposed in IMSAT (Hu et al., 2017). However, there are important differences: 1. we provide theoretical justifications for the parameterization used (lemma 1) and show in theorem 1 why it is feasible to compute high fidelity gradients using this objective in the mini-batch setting even though it contains the term E_x[L_θ(x)]; on the other hand, the justification used in IMSAT is that optimizing using mini-batches is equivalent to optimizing an upper bound of the original objective; 2. while the MI part of IMSAT was introduced in the context of clustering, we improve the MI formulation (Eq 7) and introduce regularization terms and state scaling, which are important for learning useful representations using the Neural Bayes-MIM objective that perform well on downstream classification tasks; 3. we perform extensive ablation studies exposing the role of the introduced regularizations; 4. the goal of our paper is broader, i.e., to introduce the Neural Bayes parameterization that can be used for formulating new objectives. From the aspect of learning discrete latent representations, Neural Bayes-MIM has similarities with VQ-VAE (Oord et al., 2017). However, similar to other auto-encoder based methods, VQ-VAE imposes the isotropy assumption in its reconstruction loss.

In many self-supervised methods, the idea is to learn useful representations by predicting non-trivial information about the input. Examples of such methods are Rotation Prediction (Gidaris et al., 2018), Exemplar (Dosovitskiy et al., 2014), Jigsaw (Noroozi and Favaro, 2016) and Relative Patch Location (Doersch et al., 2015). Kolesnikov et al. (2019) have extensively compared these methods and found that Rotation Prediction (RP) in general outperforms or performs at par with the latter methods. For the aforementioned reasons, we compared Neural Bayes-MIM with RP and DIM.

Numerous recent papers have proposed clustering algorithms for unsupervised representation learning, such as Deep Clustering (Caron et al., 2018), information based clustering (Ji et al., 2019), Spectral Clustering (Shaham et al., 2018) and Associative Deep Clustering (Haeusser et al., 2018). Our goal in regards to clustering in Neural Bayes-DML is in general different from such methods: our objective is aimed at labeling disjoint manifolds in a distribution. Thus it can be seen as a generalization of traditional subspace clustering methods (Ma et al., 2008; Liu et al., 2010) from affine subspaces to arbitrary manifolds.

## 7 Conclusion

We proposed a parameterization method that can be used to express an arbitrary set of distributions p(x|z), p(z|x) and p(z) in closed form using a neural network with sufficient capacity, which can in turn be used to formulate new objective functions. We formulated two different objectives using this parameterization, aimed at different goals of self-supervised learning: learning deep network features using the infomax principle, and identifying disjoint manifolds in the support of continuous distributions. We presented theoretical and empirical analyses of both objectives, focusing especially on the former since it has broader applications.

## Acknowledgments

I (DA) was supported by IVADO during my time at MILA and am currently supported by Salesforce. There are many people who have directly or indirectly contributed to this work and we would like to thank them. During the early phase of research on Neural Bayes-DML (in the context of which the Neural Bayes parameterization was developed), Chen Xing pointed out an intuition which led me to simplify its optimization procedure. We thank Ali Madani and Ehsan Hosseini-Asl for exploring Neural Bayes-DML for unsupervised representation learning for images. We thank Min Lin for taking interest in the connection between Neural Bayes-DML and mutual information, which led me to the idea that mutual information can be computed using the parameterization. We thank Aadyot Bhatnagar and Weiran Wang for proof-checking the paper and providing helpful feedback. We thank Devon Hjelm and Alex Fedorov for discussing their algorithm Deep Infomax in great detail. Finally, we thank Aaron Courville, Sharan Vaswani, Nikhil Naik, Isabela Albuquerque, Lav Varshney, Yu Bai, Jonathan Binas, David Krueger, Tegan Maharaj and Govardana Sachithanandam Ramachandran for helpful discussions.

## Appendix A Gradient Computation Problem for the E_x[L_θ(x)] Term

The Neural Bayes parameterization contains the term E_x[L_θ(x)]. Computing unbiased gradients through this term is in general difficult without the use of very large batch sizes, even though the quantity itself may have a good estimate using very few samples. For instance, consider a scalar function of the form f(θ) := log E_x[L_θ(x)], and consider the scenario where L_θ(x) = c for all x. The quantity f(θ) = log c can be estimated very accurately using even one sample. Further, the exact gradient is ∂f(θ)/∂θ = E_x[∂L_θ(x)/∂θ] / E_x[L_θ(x)], which involves expectations over the full data distribution. However, when using a finite number of samples, the mini-batch approximation of ∂f(θ)/∂θ can have a very high variance due to improper cancelling of gradient terms from individual samples.

In the case of Neural Bayes-MIM we found that gradients through terms involving E_x[L_θ(x)] were 0 (theorem 1). This allows us to estimate gradients for this objective reliably in the mini-batch setting. But in general it may be challenging to do so, and solving objectives that use the Neural Bayes parameterization may require a customized work-around for each objective.
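To make the issue concrete, here is a minimal numpy sketch using a toy one-parameter sigmoid model of our own construction (not the network from the paper): the scalar quantity log E_x[L_θ(x)] is estimated well from modest amounts of data, while naive mini-batch estimates of its gradient fluctuate widely around the near-zero true gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5
x = rng.normal(size=100_000)

def L(x, t):
    # toy scalar "network output" in (0, 1); stand-in for L_theta(x)
    return 1.0 / (1.0 + np.exp(-t * x))

def dL_dt(x, t):
    # per-sample gradient dL/dtheta of the toy model
    s = L(x, t)
    return x * s * (1.0 - s)

# f(theta) = log E_x[L_theta(x)]; its exact gradient is
# E_x[dL/dtheta] / E_x[L], estimated here on the full data
true_grad = dL_dt(x, theta).mean() / L(x, theta).mean()

# the same gradient estimated independently on mini-batches of size 10
batches = x.reshape(-1, 10)
est = dL_dt(batches, theta).mean(axis=1) / L(batches, theta).mean(axis=1)

# the batch-to-batch spread of the gradient estimates dwarfs the
# (near-zero) true gradient
print(true_grad, est.std())
```

For this symmetric toy model the true gradient is essentially zero, yet the standard deviation of the per-batch estimates is orders of magnitude larger, illustrating the improper cancelling described above.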

## Appendix B Implementation Details of the Neural Bayes-MIM Objective

We apply the Neural Bayes-MIM objective (Eq 3.2) to all the hidden layers at different scales (using average pooling). We now discuss its implementation details. Consider the CNN architecture used in our experiments, which has 4 convolution layers. Denote (h^0, h^1, h^2, h^3) to be the 4 hidden layer ReLU outputs after the 4 convolution layers. For an image input, all these hidden states have height and width dimensions in addition to the channel dimension. For a mini-batch B, these hidden states are therefore 4 dimensional tensors. Let the 4 dimensions of the i-th state be denoted by (|B|, C_i, H_i, W_i), where the dimensions denote batch size, number of channels, height and width. Denote S(·) to be the Softmax function applied along the channel dimension, and S(h^i_{k,h,w}(x)) to be its output for the k-th channel at spatial location (h, w). Further, denote (h^4, h^5, h^6, h^7) as the scaled versions of the original states computed by average pooling, and define the numbers (C_i, H_i, W_i) for i = 4, …, 7 accordingly. Then the total Neural Bayes-MIM objective for this architecture is given by,

$$\min_\theta\; -\frac{1}{|B|}\sum_{x\in B}\left[\frac{1}{8}\sum_{i=0}^{7}\frac{1}{H_i W_i}\sum_{h,w=1}^{H_i,W_i}\sum_{k=1}^{C_i} S(h^i_{k,h,w}(x))\log\left\langle S(h^i_{k,h,w}(x))+\epsilon\right\rangle\right] + (1+\alpha)\cdot R_p(\theta) + \beta\cdot R_c \quad (14)$$

where,

$$R_p(\theta) := -\frac{1}{8}\sum_{i=0}^{7}\frac{1}{H_i W_i}\sum_{h,w=1}^{H_i,W_i}\sum_{k=1}^{C_i}\left[\frac{1}{C_i}\log\left[\frac{1}{|B|}\sum_{x\in B} S(h^i_{k,h,w}(x))\right]+\frac{C_i-1}{C_i}\log\left[1-\frac{1}{|B|}\sum_{x\in B} S(h^i_{k,h,w}(x))\right]\right] \quad (15)$$

and,

$$R_c = \frac{1}{|B|}\sum_{x\in B}\frac{\left\|P(h^3(x))-P(h^3(x+\zeta\cdot\hat{\delta}))\right\|^2}{\zeta^2} \quad (16)$$

where δ̂ is the normalized version of a noise vector δ computed independently for each sample in the batch as,

$$\delta := Xv \quad (17)$$

Here X is the matrix containing the batch of samples, and each dimension of v is sampled i.i.d. from a standard Gaussian. This computation ensures that the perturbation lies in the span of the data. Finally, ζ is the scale of the normalized noise added to all samples in a batch; in our experiments, since we always normalize the datasets to have zero mean and unit variance across all dimensions, we sample ζ accordingly. Note that for the architecture used, P(h^3(x)) results in an output with height and width equal to 1, hence the output is effectively a 2D matrix of size |B| × C_3. Finally, the gradient from this mini-batch is accumulated and averaged over multiple batches before updating the parameters for a more accurate estimate of gradients.
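The per-state computation above can be sketched in numpy as follows. `mim_terms` is a hypothetical helper of our own naming that operates on a single hidden state of shape (|B|, C_i, H_i, W_i) and evaluates the conditional-entropy term of Eq (14) and the prior R_p of Eq (15) for that state; the sum over the 8 states, the stop-gradient, and the consistency term R_c are omitted.

```python
import numpy as np

def softmax(h, axis=1):
    # numerically stable softmax along the channel dimension
    e = np.exp(h - h.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mim_terms(h, eps=1e-4):
    """Eq (14)/(15)-style terms for one hidden state h of shape (B, C, H, W).

    Returns the conditional-entropy term and the uniform-prior
    regularizer R_p for this state, averaged over spatial locations.
    """
    B, C, H, W = h.shape
    s = softmax(h, axis=1)                        # S(h_{k,h,w}(x))
    # first term of Eq (14): -E_x[sum_k S log(S + eps)], spatially averaged
    cond = -(s * np.log(s + eps)).sum(axis=1).mean()
    # Eq (15): cross-entropy prior on the batch marginal (1/|B|) sum_x S
    m = s.mean(axis=0)                            # (C, H, W) marginal
    rp = -((1.0 / C) * np.log(m)
           + ((C - 1.0) / C) * np.log(1.0 - m)).sum(axis=0).mean()
    return cond, rp

# uniform softmax output vs. a batch collapsed onto one channel
h_uniform = np.zeros((8, 4, 2, 2))
h_skew = np.zeros((8, 4, 2, 2)); h_skew[:, 0] += 5.0
c0, r0 = mim_terms(h_uniform)
c1, r1 = mim_terms(h_skew)
print(c0, r0, c1, r1)
```

As a quick check, the uniform state has maximal conditional entropy but minimal prior penalty, while the collapsed state has low conditional entropy but is penalized heavily by R_p, which is minimized exactly at a uniform batch marginal.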

## Appendix C Additional Analysis of Neural Bayes-MIM

### C.1 Gradient Strength of Uniform Prior in Neural Bayes-MIM-v1 (Eq 3.1) vs Neural Bayes-MIM-v2 (Eq 3.2)

As discussed in the main text, the term,

$$R_p^{v1}(\theta) := \sum_{k=1}^{K} E_x[L_k(x)]\log E_x[L_k(x)] \quad (18)$$

acts as a uniform prior encouraging the representations to be distributed across states. However, gradients are much stronger when E_x[L_k(x)] approaches 1 for the alternative cross-entropy formulation,

$$R_p^{v2}(\theta) := -\sum_{k=1}^{K}\left[\frac{1}{K}\log E_x[L_k(x)]+\frac{K-1}{K}\log\left(1-E_x[L_k(x)]\right)\right] \quad (19)$$

To see this, note that the gradient of R_p^{v1}(θ) is given by,

$$\frac{\partial R_p^{v1}(\theta)}{\partial\theta} = \sum_{k=1}^{K}\frac{\partial E_x[L^\theta_k(x)]}{\partial\theta}\log E_x[L^\theta_k(x)]+\sum_{k=1}^{K}\frac{\partial E_x[L^\theta_k(x)]}{\partial\theta} \quad (20)$$
$$= \sum_{k=1}^{K}\frac{\partial E_x[L^\theta_k(x)]}{\partial\theta}\log E_x[L^\theta_k(x)]+\frac{\partial E_x\left[\sum_{k=1}^{K}L^\theta_k(x)\right]}{\partial\theta} \quad (21)$$
$$= \sum_{k=1}^{K}\frac{\partial E_x[L^\theta_k(x)]}{\partial\theta}\log E_x[L^\theta_k(x)] \quad (22)$$

where the last equality holds due to the linearity of expectation and because ∑_{k=1}^K L^θ_k(x) = 1 by design. On the other hand, the gradient of R_p^{v2}(θ) is given by,

$$\frac{\partial R_p^{v2}(\theta)}{\partial\theta} = -\sum_{k=1}^{K}\frac{1}{K}\left(\frac{1}{E_x[L_k(x)]}-\frac{K-1}{1-E_x[L_k(x)]}\right)\frac{\partial E_x[L^\theta_k(x)]}{\partial\theta} \quad (23)$$

When the representation being learned is such that the marginal peaks along a single state k′, i.e., E_x[L_{k′}(x)] ≈ 1 (making the representation degenerate), the gradient of the k′-th term for v1 is given by,

$$\frac{\partial E_x[L^\theta_{k'}(x)]}{\partial\theta}\log E_x[L^\theta_{k'}(x)] \approx 0 \quad (24)$$

while that for v2 is given by,

$$-\frac{1}{K}\left(\frac{1}{E_x[L_{k'}(x)]}-\frac{K-1}{1-E_x[L_{k'}(x)]}\right)\frac{\partial E_x[L^\theta_{k'}(x)]}{\partial\theta} \approx \lim_{c\to 0}\frac{1}{c}\cdot\frac{\partial E_x[L^\theta_{k'}(x)]}{\partial\theta} \quad (25)$$

whose magnitude approaches infinity as c → 0, where c := 1 − E_x[L_{k′}(x)]. Thus R_p^{v2} is beneficial in terms of gradient strength.
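The contrast in gradient strength can be checked numerically. The sketch below evaluates the per-state scalar factors multiplying ∂E_x[L_{k′}^θ(x)]/∂θ in Eqs (22) and (23) as the marginal E_x[L_{k′}(x)] approaches 1 (K = 10 is an arbitrary choice for illustration).

```python
import numpy as np

K = 10
# marginal E_x[L_{k'}(x)] approaching 1 (the degenerate regime)
m = 1.0 - np.logspace(-1, -6, 6)

# Eq (22): the v1 gradient factor log E_x[L_{k'}] -> 0 as the marginal peaks
v1_factor = np.log(m)

# Eq (23): the v2 gradient factor blows up as 1 - E_x[L_{k'}] -> 0
v2_factor = -(1.0 / K) * (1.0 / m - (K - 1.0) / (1.0 - m))

print(np.abs(v1_factor))   # decays toward 0
print(np.abs(v2_factor))   # grows without bound
```

The v1 factor vanishes near the degenerate solution, providing almost no restoring force, while the v2 factor diverges, which is exactly the qualitative behavior argued above.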

### C.2 Empirical Comparison between Neural Bayes-MIM-v1 (Eq 3.1) and Neural Bayes-MIM-v2 (Eq 3.2)

To empirically understand the difference in behavior between the Neural Bayes-MIM objectives v1 and v2, we first plot the filters learned by the v1 objective and compare them with those learned by the v2 objective. The filters learned by the v1 objective are shown in figure 8. It can be seen that most filters are dead. We tried other hyper-parameter configurations as well without any change in the outcome. Since the v1 and v2 objectives differ only in the formulation of the uniform prior regularization, as explained in the previous section, we believe that v1 leads to dead filters because of weak gradients from its regularization term.

In the second set of experiments, we train many models using Neural Bayes-MIM-v1 and Neural Bayes-MIM-v2 objectives separately with different hyper-parameter configurations similar to the setting of figure 5. The performance scatter plot is shown in figure 13. We find that Neural Bayes-MIM-v2 has better average and best performance compared with Neural Bayes-MIM-v1.

## Appendix D Proof of Lemma 1

###### Lemma 1

Let p(x|z) and p(z) be any conditional and marginal distributions defined for a continuous random variable x and a discrete random variable z with K states. If p(x) > 0 wherever p(x|z=k) > 0 for every k, then there exists a non-parametric function L(x) for any given input x, with the property that L_k(x) ≥ 0 and ∑_{k=1}^K L_k(x) = 1, such that,

$$p(x|z{=}k)=\frac{L_k(x)\cdot p(x)}{E_{x\sim p(x)}[L_k(x)]},\qquad p(z{=}k)=E_x[L_k(x)],\qquad p(z{=}k|x)=L_k(x) \quad (26)$$

and this parameterization is consistent.

Proof: First we show existence. Notice that there exists a non-parametric function g_k(x) := p(x|z=k)/p(x). Denote G_k(x) := p(z=k)·g_k(x). Then,

$$E_x[G_k(x)]=E_x[p(z{=}k)\,g_k(x)]=p(z{=}k) \quad (27)$$

and,

$$\frac{G_k(x)}{E_x[G_k(x)]}=\frac{p(z{=}k)\,g_k(x)}{p(z{=}k)}=\frac{p(x|z{=}k)}{p(x)} \quad (28)$$

Thus L_k(x) := G_k(x) works. To verify that this parameterization is consistent, note that for any k,

$$\int_x p(x|z{=}k)\,dx=\int_x\frac{L_k(x)\cdot p(x)}{E_{x\sim p(x)}[L_k(x)]}\,dx=1 \quad (29)$$

where we use the identity ∫_x L_k(x)·p(x) dx = E_{x∼p(x)}[L_k(x)]. Secondly, we note that,

$$\sum_{k=1}^{K}p(x|z{=}k)\cdot p(z{=}k)=\sum_{k=1}^{K}\frac{L_k(x)\cdot p(x)}{E_x[L_k(x)]}\cdot E_x[L_k(x)] \quad (30)$$
$$=\sum_{k=1}^{K}L_k(x)\cdot p(x) \quad (31)$$
$$=p(x) \quad (32)$$

where the last equality is due to the condition ∑_{k=1}^K L_k(x) = 1. Thirdly,

$$\sum_{k=1}^{K}p(z{=}k)=\sum_{k=1}^{K}E_x[L_k(x)] \quad (33)$$
$$=E_x\left[\sum_{k=1}^{K}L_k(x)\right]=1 \quad (34)$$

Finally, we have from Bayes’ rule:

$$p(z{=}k|x)=\frac{p(x|z{=}k)\cdot p(z{=}k)}{p(x)} \quad (35)$$
$$=\frac{L_k(x)\cdot p(x)}{E_{x\sim p(x)}[L_k(x)]}\cdot\frac{E_{x\sim p(x)}[L_k(x)]}{p(x)} \quad (36)$$
$$=L_k(x) \quad (37)$$

where the second equality holds because of the existence and consistency proofs of p(x|z=k) and p(z=k) shown above.
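The identities proved above can be verified numerically on a finite (discretized) toy support, with p(x) and L chosen arbitrarily; this is only a sanity check on the algebra, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3
p_x = rng.dirichlet(np.ones(N))            # p(x) on a finite toy support
logits = rng.normal(size=(N, K))
L = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # rows sum to 1

p_z = (p_x[:, None] * L).sum(axis=0)                   # p(z=k) = E_x[L_k(x)]
p_x_given_z = L * p_x[:, None] / p_z[None, :]          # Eq (26)

assert np.allclose(p_x_given_z.sum(axis=0), 1.0)       # Eq (29): p(x|z=k) normalizes
assert np.allclose(p_x_given_z @ p_z, p_x)             # Eqs (30)-(32): mixture recovers p(x)
assert np.isclose(p_z.sum(), 1.0)                      # Eqs (33)-(34): prior normalizes
# Eqs (35)-(37): Bayes' rule returns L itself, p(z=k|x) = L_k(x)
assert np.allclose(p_x_given_z * p_z[None, :] / p_x[:, None], L)
```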

## Appendix E Proofs for Neural Bayes-MIM

###### Proposition 1

(Neural Bayes-MIM-v1; proposition 1 in main text) Let L(x) be a non-parametric function for any given input x with the property that L_k(x) ≥ 0 and ∑_{k=1}^K L_k(x) = 1. Consider the following objective,

$$L^*=\arg\max_L\,E_x\left[\sum_{k=1}^{K}L_k(x)\log\frac{L_k(x)}{E_x[L_k(x)]}\right] \quad (38)$$

Then L^*_k(x) = p^*(z=k|x), where p^*(z|x) := argmax_{p(z|x)} MI(x, z).

Proof: Using the Neural Bayes parameterization in lemma 1, we have,

$$MI(x,z)=\int_x\sum_{k=1}^{K}p(x,z{=}k)\log\frac{p(x,z{=}k)}{p(x)\,p(z{=}k)}\,dx \quad (39)$$
$$=\int_x\sum_{k=1}^{K}p(z{=}k|x)\,p(x)\log\frac{p(z{=}k|x)}{p(z{=}k)}\,dx \quad (40)$$
$$=\int_x\sum_{k=1}^{K}L_k(x)\,p(x)\log\frac{L_k(x)}{E_x[L_k(x)]}\,dx \quad (41)$$
$$=E_{x\sim p(x)}\left[\sum_{k=1}^{K}L_k(x)\log\frac{L_k(x)}{E_x[L_k(x)]}\right] \quad (42)$$

Therefore the two objectives are equivalent and we have a closed form estimate of mutual information. Given that L^* is a maximizer of the objective in Eq (38), since L is a non-parametric function, there exists p^*(z|x) such that p^*(z=k|x) = L^*_k(x) due to lemma 1.
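The closed-form expression can be sanity-checked on a finite support, where both the Neural Bayes form of Eq (42) and the direct definition of mutual information from the joint can be evaluated exactly (the toy distributions below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 5, 4
p_x = rng.dirichlet(np.ones(N))            # p(x) on a finite toy support
L = rng.dirichlet(np.ones(K), size=N)      # L_k(x) = p(z=k|x), rows sum to 1

# Neural Bayes closed form: E_x[ sum_k L_k(x) log( L_k(x) / E_x[L_k(x)] ) ]
p_z = p_x @ L                              # p(z=k) = E_x[L_k(x)]
mi_nb = (p_x[:, None] * L * np.log(L / p_z[None, :])).sum()

# Direct definition from the joint p(x, z=k) = p(z=k|x) p(x)
joint = p_x[:, None] * L
mi_joint = (joint * np.log(joint / (p_x[:, None] * p_z[None, :]))).sum()

assert np.isclose(mi_nb, mi_joint)         # the two expressions agree
assert mi_nb >= 0.0                        # MI is a KL divergence, hence non-negative
```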

###### Theorem 1

(Theorem 1 in main text) Denote,

$$J(\theta)=-E_x\left[\sum_{k=1}^{K}L^\theta_k(x)\log\frac{L^\theta_k(x)}{E_x[L^\theta_k(x)]}\right] \quad (43)$$
$$\hat{J}(\theta)=-E_x\left[\sum_{k=1}^{K}L^\theta_k(x)\log\left\langle\frac{L^\theta_k(x)}{E_x[L^\theta_k(x)]}\right\rangle\right] \quad (44)$$

where ⟨·⟩ denotes that gradients are not computed through its argument. Then ∂J(θ)/∂θ = ∂Ĵ(θ)/∂θ.

Proof: We note that,

$$J(\theta)=-E_x\left[\sum_{k=1}^{K}L^\theta_k(x)\log\frac{L^\theta_k(x)}{E_x[L^\theta_k(x)]}\right] \quad (45)$$
$$=-E_x\left[\sum_{k=1}^{K}L^\theta_k(x)\log L^\theta_k(x)\right]+E_x\left[\sum_{k=1}^{K}L^\theta_k(x)\log E_x[L^\theta_k(x)]\right] \quad (46)$$

Denote the first term by T_1 := −E_x[∑_{k=1}^K L^θ_k(x) log L^θ_k(x)]. Then, by the chain rule,

$$-\frac{\partial T_1}{\partial\theta}=E_x\left[\sum_{k=1}^{K}\frac{\partial L^\theta_k(x)}{\partial\theta}\log L^\theta_k(x)\right]+E_x\left[\sum_{k=1}^{K}\frac{L^\theta_k(x)}{L^\theta_k(x)}\cdot\frac{\partial L^\theta_k(x)}{\partial\theta}\right] \quad (47)$$
$$=E_x\left[\sum_{k=1}^{K}\frac{\partial L^\theta_k(x)}{\partial\theta}\log L^\theta_k(x)\right]+E_x\left[\sum_{k=1}^{K}\frac{\partial L^\theta_k(x)}{\partial\theta}\right] \quad (48)$$
$$=E_x\left[\sum_{k=1}^{K}\frac{\partial L^\theta_k(x)}{\partial\theta}\log L^\theta_k(x)\right]+E_x\left[\frac{\partial\sum_{k=1}^{K}L^\theta_k(x)}{\partial\theta}\right] \quad (49)$$
$$=E_x\left[\sum_{k=1}^{K}\frac{\partial L^\theta_k(x)}{\partial\theta}\log L^\theta_k(x)\right] \quad (50)$$
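Theorem 1 can also be checked numerically. In the sketch below (a toy softmax model L^θ(x) = softmax(Wx) of our own construction, with the empirical distribution standing in for p(x)), the analytic gradient of Ĵ, which treats the log factor as a constant, matches a finite-difference gradient of the full objective J.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 200, 3, 4
X = rng.normal(size=(N, d))
W = rng.normal(size=(K, d)) * 0.1

def probs(W):
    # L^theta(x) = softmax(W x); rows sum to 1
    z = X @ W.T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(W):
    # Eq (43) with expectations over the empirical distribution
    L = probs(W)
    m = L.mean(axis=0)                     # E_x[L_k]
    return -(L * np.log(L / m)).mean(axis=0).sum()

# Analytic gradient of J-hat (Eq 44): differentiate only the leading L factor,
# treating log(L / E[L]) as a constant (the <.> stop-gradient)
L = probs(W)
m = L.mean(axis=0)
G = np.log(L / m)                          # stop-gradient factor
inner = L * G
# dL_k/dW_ij = L_k (delta_ki - L_i) x_j, accumulated over samples and k
grad_hat = -((inner - L * inner.sum(axis=1, keepdims=True)).T @ X) / N

# Finite-difference gradient of the full objective J
eps = 1e-6
fd = np.zeros_like(W)
for i in range(K):
    for j in range(d):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        fd[i, j] = (J(Wp) - J(Wm)) / (2 * eps)

assert np.allclose(grad_hat, fd, atol=1e-5)   # Theorem 1: gradients coincide
```

The agreement holds exactly (up to finite-difference error) because the gradient contributions through the E_x[L^θ_k(x)] terms cancel, which is precisely the content of the theorem.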