Abstract

Unsupervised and semi-supervised learning are important problems that are especially challenging with complex data like natural images. Progress on these problems would accelerate if we had access to appropriate generative models under which to pose the associated inference tasks. Inspired by the success of Convolutional Neural Networks (CNNs) for supervised prediction in images, we design the Neural Rendering Model (NRM), a new probabilistic generative model whose inference calculations correspond to those in a given CNN architecture. The NRM uses the given CNN to design the prior distribution in the probabilistic model. Furthermore, the NRM generates images from coarse to finer scales. It introduces a small set of latent variables at each level and enforces dependencies among all the latent variables via a conjugate prior distribution. This conjugate prior yields a new regularizer for training CNNs based on paths rendered in the generative model: the Rendering Path Normalization (RPN). We demonstrate that this regularizer improves generalization, both in theory and in practice. In addition, likelihood estimation in the NRM yields training losses for CNNs, and inspired by this, we design a new loss termed the Max-Min cross-entropy, which outperforms the traditional cross-entropy loss for object classification. The Max-Min cross-entropy suggests a new deep network architecture, namely the Max-Min network, which can learn from less labeled data while maintaining good prediction performance. Our experiments demonstrate that the NRM with the RPN and the Max-Min architecture matches or exceeds the state of the art on benchmarks including SVHN, CIFAR10, and CIFAR100 for semi-supervised and supervised learning tasks.

Keywords: neural nets, generative models, semi-supervised learning, cross-entropy, statistical guarantee




Neural Rendering Model: Joint Generation

and Prediction for Semi-Supervised Learning



Nhat Ho, Tan Nguyen, Ankit Patel,
Anima Anandkumar, Michael I. Jordan, Richard G. Baraniuk


University of California at Berkeley, Berkeley, USA
Rice University, Houston, USA
Baylor College of Medicine, Houston, USA
California Institute of Technology, Pasadena, USA
NVIDIA, Santa Clara, USA


July 19, 2019

Footnote: Nhat Ho and Tan Nguyen contributed equally to this work. Anima Anandkumar, Michael I. Jordan, and Richard G. Baraniuk contributed equally to this work.

1 Introduction

Unsupervised and semi-supervised learning still lag behind the performance leaps we have seen in supervised learning over the last five years. This is partly due to a lack of good generative models that can capture all latent variations in complex domains such as natural images and provide useful structures that help learning. When it comes to probabilistic generative models, it is hard to design good priors for the latent variables that drive the generation.

Instead, recent approaches avoid the explicit design of image priors. For instance, Generative Adversarial Networks (GANs) use implicit feedback from an additional discriminator that distinguishes real from fake images [1]. Using such feedback helps GANs generate visually realistic images, but it is not clear whether this is the most effective form of feedback for predictive tasks. Moreover, due to the separation of generation and discrimination in GANs, there are typically more parameters to train, which might make it harder to obtain gains for semi-supervised learning in the low (labeled) sample setting.

We propose an alternative approach to GANs by designing a class of probabilistic generative models, such that inference in those models also has good performance on predictive tasks. This approach is well-suited for semi-supervised learning since it eliminates the need for a separate prediction network. Specifically, we answer the following question: what generative processes output Convolutional Neural Networks (CNNs) when inference is carried out? This is natural to ask since CNNs are state-of-the-art (SOTA) predictive models for images, and intuitively, such powerful predictive models should capture some essence of image generation. However, standard CNNs are not directly reversible and likely do not have all the information for generation since they are trained for predictive tasks such as image classification. We can instead invert the irreversible operations in CNNs, e.g., the rectified linear units (ReLUs) and spatial pooling, by assigning auxiliary latent variables to account for uncertainty in the CNN’s inversion process due to the information loss.

Figure 1: (a) The Neural Rendering Model (NRM) is a probabilistic generative model that captures latent variations in the data and yields CNNs as its inference algorithm for supervised learning. In particular, CNN architectures and training losses can be derived from joint maximum a posteriori (JMAP) inference and likelihood estimation in NRM, respectively. NRM can learn from unlabeled data, overcoming a weakness of CNNs, and achieves good performance on various semi-supervised learning tasks. (b) Graphical model depiction of NRM. Latent variables in NRM depend on each other. Starting from the top of the model with the object category, new latent variables are incorporated into the model at each layer, and intermediate images are rendered with finer details. At the bottom of NRM, pixel noise is added to render the final image.

Contribution 1 – Neural Rendering Model: We develop the Neural Rendering Model (NRM), whose bottom-up inference corresponds to a CNN architecture of choice (see Figure 1a). The “reverse” top-down process of image generation is coarse-to-fine rendering, which progressively increases the resolution of the rendered image (see Figure 1b). This is intuitive since the reverse process of bottom-up inference reduces the resolution (and dimension) through operations such as spatial pooling. We also introduce structured stochasticity in the rendering process through a small set of discrete latent variables, which capture the uncertainty in reversing the CNN feed-forward process. The rendering in NRM follows a product of linear transformations, which can be considered the transpose of the inference process in CNNs. In particular, the rendering weights in NRM are proportional to the transposes of the filters in CNNs. Furthermore, the bias terms in the ReLU units at each layer (after the convolution operator) make the latent variables in different network layers dependent (when the bias terms are non-zero). This design of the image prior has an interesting interpretation from a predictive-coding perspective in neuroscience: the dependency between latent variables can be considered a form of backward connections that capture prior knowledge from coarser levels in NRM and help adjust the estimation at the finer levels [2, 3]. The correspondence between NRM and CNNs is given in Figure 2 and Table 1 below.
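
To make the transpose relationship concrete, here is a minimal numpy sketch (all sizes and variable names are hypothetical): the same weight matrix is used for bottom-up template matching in the CNN-style inference and, transposed, for top-down rendering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16-dimensional patch and 8 latent feature activations.
W = rng.standard_normal((8, 16))   # CNN filter bank in matrix form (inference weights)
b = rng.standard_normal(8)         # bias terms, played by the conjugate-prior parameters
x = rng.standard_normal(16)        # an input patch

# Bottom-up inference (one CNN layer in matrix form): template matching + ReLU.
h = np.maximum(W @ x + b, 0.0)

# Top-down rendering uses (a scaled) transpose of the same filters:
# active latent features are mapped back to pixel space.
x_rendered = W.T @ h
print(x_rendered.shape)            # (16,) -- same space as the input patch
```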

Figure 2: The CNN is the E-step JMAP inference of the optimal latent variables in NRM. In particular, inferring the template selecting latent variables in NRM yields the ReLU non-linearities in the CNN. Similarly, inferring the local translation latent variables and inverting the zero-padding in NRM yield the MaxPool operators in the CNN. In addition, during inference, the rendering step using the templates in NRM becomes convolution with the corresponding weights in the CNN.

NRM is a likelihood-based framework, where unsupervised learning can be derived by maximizing the expected complete-data log-likelihood of the model while supervised learning is done through optimizing the class-conditional log-likelihood. Semi-supervised learning unifies both log-likelihoods into an objective cost for learning from both labeled and unlabeled data. The NRM prior has the desirable property of being a conjugate prior, which makes learning in NRM computationally efficient.

Interestingly, we derive the popular cross-entropy loss used to train CNNs for supervised learning as an upper bound of the NRM's negative class-conditional log-likelihood. A broadly accepted interpretation of the cross-entropy loss in training CNNs comes from the logistic regression perspective: given features extracted from the data by the network, logistic regression is applied to classify those features into different classes, which yields the cross-entropy. In this interpretation, there is a gap between feature extraction and classification. On the contrary, our derivation ties feature extraction and learning for classification in CNNs into an end-to-end optimization problem that estimates the conditional log-likelihood of NRM. This new interpretation of cross-entropy allows us to develop better losses for training CNNs. An example is the Max-Min cross-entropy discussed in Contribution 2 and Section 5.
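
As a small illustration of this end-to-end view, the following numpy sketch composes a one-layer feature extractor with a softmax classifier and evaluates the cross-entropy as a single objective in all parameters; the layer sizes and variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(32)                                    # input image (flattened)
W, b = rng.standard_normal((16, 32)), rng.standard_normal(16)  # one inference layer
C = rng.standard_normal((4, 16))                               # class templates / softmax weights
y = 2                                                          # ground-truth label

# Feature extraction (inference) and classification form one composed function,
# so the cross-entropy below is a single end-to-end objective in (W, b, C).
h = np.maximum(W @ x + b, 0.0)
loss = -np.log(softmax(C @ h)[y])                              # bound on -log p(y | x)
print(float(loss))
```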

NRM | CNN
Rendering templates | Transpose of filter weights
Class templates | Softmax weights
Parameters of the conjugate prior | Bias terms in the ReLUs after each convolution
Max-marginalize over template selecting latent variables | ReLU
Max-marginalize over local translation latent variables | MaxPool
Conditional log-likelihood | Cross-entropy loss
Expected complete-data log-likelihood | Reconstruction loss
Normalize the intermediate image | Batch Normalization
Table 1: Correspondence between NRM and CNN

Contribution 2 – New regularization, loss function, architecture and generalization bounds: The joint nature of generation, inference, and learning in NRM allows us to develop new training procedures for semi-supervised and supervised learning, as well as new theoretical (statistical) guarantees for learning. In particular, for training, we derive a new form of regularization termed the Rendering Path Normalization (RPN) from the NRM's conjugate prior. A rendering path is a set of latent variable values in NRM. Unlike the path-wise regularizer in [4], RPN uses information from a generative model to penalize the number of possible rendering paths and, therefore, encourages the network to be compact in terms of representing the image. It also helps enforce the dependency among different layers in NRM during training and improves classification performance.

We provide new theoretical bounds based on NRM. In particular, we prove that NRM is statistically consistent and derive a generalization bound of NRM for (semi-)supervised learning tasks. Our generalization bound is proportional to the number of active rendering paths that generate close-to-real images. This suggests that RPN regularization may help in generalization since RPN enforces the dependencies among latent variables in NRM and, therefore, reduces the number of active rendering paths. We observe that RPN helps improve generalization in our experiments.

Max-Min cross-entropy and network: We propose the new Max-Min cross-entropy loss function for learning, based on the negative class-conditional log-likelihood in NRM. It combines the traditional cross-entropy with another loss, which we term the Min cross-entropy. While the traditional (Max) cross-entropy maximizes the probability of the correct labels, the Min cross-entropy minimizes the probability of the incorrect labels. We show that the Max-Min cross-entropy is also an upper bound to the negative conditional log-likelihood of NRM, just like the cross-entropy loss. The Max-Min cross-entropy is realized through a new CNN architecture, namely the Max-Min network, which is a CNN with an additional branch sharing weights with the original CNN but containing minimum pooling (MinPool) operators and negative rectified linear units (NReLUs), i.e., NReLU(x) = min(x, 0) (see Figure 5). Although the Max-Min network is derived from NRM, it is a meta-architecture that can be applied independently to any CNN architecture. We show empirically that Max-Min networks and the Max-Min cross-entropy help improve the SOTA on object classification for supervised and semi-supervised learning.
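
The two operators that distinguish the Min branch can be sketched in a few lines of numpy; this is an illustrative implementation under the assumption that NReLU(x) = min(x, 0) and that MinPool takes the minimum over non-overlapping 2x2 blocks.

```python
import numpy as np

def nrelu(x):
    # Negative ReLU: keeps only the non-positive part of the activation.
    return np.minimum(x, 0.0)

def min_pool_2x2(x):
    # x: (H, W) feature map with even H, W; take the minimum over 2x2 blocks.
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).min(axis=(1, 3))

a = np.array([[ 1.0, -2.0,  0.5, 3.0],
              [-1.5,  4.0, -0.2, 0.1],
              [ 2.0, -3.0,  1.0, 1.0],
              [ 0.0,  0.5, -0.5, 2.0]])

print(nrelu(a))
print(min_pool_2x2(nrelu(a)))
```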

Contribution 3 – State-of-the-art empirical results for semi-supervised and supervised learning: We show strong results for semi-supervised learning on the CIFAR10, CIFAR100, and SVHN benchmarks in comparison with SOTA methods that do and do not use consistency regularization. Consistency regularization, such as that used in Temporal Ensembling [5] and Mean Teacher [6], encourages networks to learn representations invariant to realistic perturbations of the data. NRM alone outperforms most SOTA methods that do not use consistency regularization [7, 8] in most settings. The Max-Min cross-entropy then helps improve NRM's semi-supervised learning results significantly. When combining NRM, the Max-Min cross-entropy, and Mean Teacher, we achieve SOTA results, or results very close to SOTA, on CIFAR10, CIFAR100, and SVHN (see Tables 3, 4, and 5). Interestingly, compared to the other competitors, our method is consistently good, achieving either the best or second-best results in all experiments. Furthermore, the Max-Min cross-entropy also helps supervised learning. Using the Max-Min cross-entropy, we achieve a SOTA result for supervised learning on CIFAR10 (2.30% test error). Similarly, the Max-Min cross-entropy helps improve supervised training on ImageNet.

Despite its good classification results, one caveat is that NRM may not generate good-looking images, since that objective is not “baked” into its training. NRM is primarily aimed at improving semi-supervised and supervised learning through better regularization. Potentially, an adversarial loss could be added to NRM to improve the visual characteristics of the generated images, but that is beyond the scope of this paper.

Notation: To facilitate the presentation, the NRM's notations are explained in Table 2. Throughout this paper, for any vector v, we denote by ||v|| and v^T the Euclidean norm and the transpose of v, respectively. Additionally, for any matrices A and B of the same dimension, A ⊙ B denotes the Hadamard product between A and B. For any vectors u and v of the same dimension, ⟨u, v⟩ denotes the dot product between u and v.
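
For concreteness, a short numpy illustration of these two products (the array values are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
u, v = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])

print(A * B)         # Hadamard (elementwise) product of two same-shape matrices
print(np.dot(u, v))  # dot product of two same-length vectors: 32.0
```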

Variables
input image
object category
all latent variables in a given layer
switching (template selecting) latent variable at a pixel location in a given layer
local translation latent variable at a pixel location in a given layer
intermediate rendered image at a given layer
rendered image from NRM before adding noise
corresponding feature maps at a given layer in CNNs

Parameters
class template: the coarsest image, determined by the object category at the top of NRM before adding any fine detail; learned from the data
rendering matrix at a given layer
dictionary of rendering templates at a given layer; learned from the data
corresponding weights at a given layer in CNNs
set of zero-padding matrices at a given layer
set of local translation matrices at a given layer; chosen according to the value of the local translation latent variable
parameters of the conjugate prior at a given layer; they become the bias terms after the convolutions in CNNs and can be made independent of pixel location, which is equivalent to using the same bias in each feature map in CNNs; learned from the data
probability of an object category
pixel noise variance

Other Notations
RPN: the Rendering Path Normalization regularizer defined in Eqn. 10
(y, z(L), …, z(1)): rendering configuration (rendering path)
Table 2: Table of notations for NRM

2 Related Work

Deep Generative Models:

In addition to GANs, other recently developed deep generative models include the Variational Autoencoder (VAE) [9] and Deep Generative Networks [10]. Unlike these models, which replace complicated or intractable inference with CNNs, NRM derives CNNs as its inference. This advantage allows us to develop better learning algorithms for CNNs with statistical guarantees, as discussed in Sections 3.3 and 4. Recent works including Bidirectional GANs [11] and the Adversarially Learned Inference model [8] try to make the discriminators and generators in GANs inverses of each other, thereby providing an alternative way to invert CNNs. These approaches, nevertheless, still employ a separate network to bypass the irreversible operators in CNNs. Furthermore, flow-based generative models such as NICE [12], Real NVP [13], and Glow [14] are invertible. However, the inference algorithms of these models, although exact, do not match the CNN architecture. NRM is also close in spirit to the Deep Rendering Model (DRM) [15] but markedly different. Compared to NRM, DRM has several limitations. In particular, all the latent variables in DRM are assumed to be independent, which is rather unrealistic. This lack of dependency leads to missing bias terms in the ReLUs of the CNN derived from DRM. Furthermore, the cross-entropy loss used in training CNNs for supervised learning tasks is not captured naturally by DRM. Due to these limitations, model consistency and generalization bounds are not derived for DRM.

Semi-Supervised Learning:

In addition to the deep generative model approach, consistency regularization methods, such as Temporal Ensembling [5] and Mean Teacher [6], have recently been developed for semi-supervised learning and achieve state-of-the-art results. These methods encourage the baseline network to learn representations of the data that are invariant under realistic perturbations. Consistency regularization approaches are complementary to, and can be applied on top of, most deep generative models, including NRM, to further increase the baseline model's performance on semi-supervised learning tasks. Experiments in Section 6 demonstrate that NRM achieves better test accuracy on CIFAR10 and CIFAR100 when combined with Mean Teacher.

Explaining Architecture of CNNs:

The architectures and training losses of CNNs have been studied from other perspectives. [16, 17] employ principles from information theory, such as the Information Bottleneck Lagrangian introduced by [18], to show that stacking layers encourages CNNs to learn representations invariant to latent variations. They also study the cross-entropy loss to understand possible causes of over-fitting in CNNs and suggest a new regularization term for training that helps the trained CNNs generalize better. This regularization term relates to the amount of information about the labels memorized in the weights. Additionally, [19] suggests a connection between CNNs and convolutional sparse coding (CSC) [20, 21, 22, 23]. They propose a multi-layer CSC (ML-CSC) model and prove that CNNs perform the thresholding pursuit that serves the ML-CSC model. This thresholding pursuit framework implies alternatives to CNNs, which are related to deconvolutional and recurrent networks. The architecture of CNNs has also been investigated using the wavelet scattering transform. In particular, scattering operators help elucidate different properties of CNNs, including how image sparsity and geometry are captured by the networks [24, 25]. In addition, CNNs have been studied from the optimization perspective [26, 27, 28, 29], the statistical learning theory perspective [30, 31], the approximation theory perspective [32], and other approaches [33, 34].

Like these works, NRM helps explain different components of CNNs. It employs tools and methods from probabilistic inference to interpret CNNs from a probabilistic perspective. That said, NRM can potentially be combined with the aforementioned approaches to gain a better understanding of CNNs.

3 The Neural Rendering Model

We first define the Neural Rendering Model (NRM). Then we discuss the inference in NRM. Finally, we derive different learning losses from NRM, including the cross-entropy loss, the reconstruction loss, and the RPN regularization, for supervised learning, unsupervised learning, and semi-supervised learning.

3.1 Generative Model

NRM attempts to invert CNNs as its inference so that the information in the posterior can be used to inform the generation process of the model. NRM realizes this inversion by employing the structure of its latent variables. Furthermore, the joint prior distribution of latent variables in the model is parametrized so that it is the conjugate prior to the likelihood of the model. This conjugate prior is a function of the intermediate rendered images in NRM and implicitly captures the dependencies among latent variables in the model. More precisely, NRM is defined as follows:

Definition 3.1 (Neural Rendering Model (NRM)).

NRM is a deep generative model in which the latent variables at different layers are dependent. Let x be the input image and y be the target variable, e.g., the object category. Generation in NRM takes the form:

(1)
(2)

where:

(3)
(4)

The generation process in NRM can be summarized in the following steps:

  1. Given the class label, NRM first samples the latent variables from a categorical distribution given by the conjugate prior in Eqn. 1.

  2. Starting from the class label at the top layer of the model, NRM renders its coarsest image, which is also the object template of the class.

  3. At each subsequent layer, a set of latent variations is incorporated via a linear transformation to render a finer image. This process is repeated layer by layer to finally render the finest image at the bottom of NRM.

  4. Gaussian pixel noise is added to render the final image x.

Figure 3: Rendering process from one layer to the next in NRM. At each pixel in the intermediate image, NRM decides whether to render based on the template selecting latent variable at that pixel location: if the variable is active, NRM renders; otherwise, it does not. If rendering, the local template is scaled by the pixel value. A zero-padding matrix then pads the result to the size of the intermediate image at the next layer, after which a translation matrix locally translates the rendered patch to the location specified by the local translation latent variable. The same rendering process repeats at every other pixel, and NRM adds all rendered images to obtain the intermediate image at the next layer.

In the generation process above, the rendering matrix can be any linear transformation, and the latent variables, in their most generic form, can capture any latent variation in the data. While such a generative model can represent any possible imagery data, it cannot be learned to capture the properties of natural images in reasonable time due to its huge number of degrees of freedom. Therefore, it is necessary to impose more structure on NRM given our prior knowledge of natural images. One such source of prior knowledge is classification models. In particular, since classification models like CNNs have achieved excellent performance on a wide range of object classification tasks, we hypothesize that a generative model whose inference yields CNNs will also be a good model for natural images. As a result, we introduce new structures into NRM so that CNNs can be derived as NRM's inference. In other words, we use the posterior, i.e., CNNs, to inform the likelihood in designing NRM.

In our attempt to invert CNNs, we constrain the latent variables at each layer in NRM to a set of template selecting latent variables and local translation latent variables. As shown later in Section 3.2, during inference in NRM, the ReLU non-linearity at a layer “inverts” the template selecting latent variables to determine whether particular features are in the image or not. Similarly, the MaxPool operator “inverts” the local translation latent variables to locate where particular features, if present, are in the image. Both sets of latent variables are vectors indexed by pixel location.

The rendering matrix is now a function of the template selecting and local translation latent variables, and the rendering process from one layer to the next is described in the following equation:

(5)

Even though the rendering equation above seems complicated at first, it is quite intuitive, as illustrated in Figure 3. At each pixel in the intermediate image at a given layer, NRM decides whether or not to use that pixel to render according to the value of the template selecting latent variable at that pixel location: if the variable is active, NRM renders; otherwise, it does not. If rendering, the pixel value is used to scale the rendering template. This rendering template is local: it has the same number of feature maps as the next rendered image, but is of smaller spatial size, so the rendered result corresponds to a local patch in the next image. Next, the padding matrix pads the resultant patch with zeros to the size of the next image, and the translation matrix translates the result to a local location. NRM then keeps rendering at the other pixel locations following the same process. All rendered patches are added to form the rendered image at the next layer.
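
The following numpy sketch illustrates this per-pixel rendering step in a simplified, single-feature-map setting; the function and argument names are hypothetical, and per-pixel templates are used here even though NRM later ties them within a feature map.

```python
import numpy as np

def render_layer(h_coarse, s, t, templates, out_shape):
    """One coarse-to-fine rendering step (simplified, single feature map).

    h_coarse  : (H, W) intermediate image at the coarser layer
    s         : (H, W) binary template-selecting latent variables
    t         : (H, W, 2) local translation latents (row/col offsets of each patch)
    templates : (H, W, kH, kW) local rendering template at each coarse pixel
    out_shape : shape of the finer rendered image
    """
    out = np.zeros(out_shape)
    kH, kW = templates.shape[2:]
    for i in range(h_coarse.shape[0]):
        for j in range(h_coarse.shape[1]):
            if s[i, j] == 0:                               # "do not render" at this pixel
                continue
            patch = h_coarse[i, j] * templates[i, j]       # scale the local template
            r, c = t[i, j]                                 # translate to the chosen location
            out[r:r + kH, c:c + kW] += patch               # zero-padding is implicit here
    return out

# Toy usage with arbitrary sizes.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 4))
s = rng.integers(0, 2, size=(4, 4))
t = rng.integers(0, 6, size=(4, 4, 2))      # offsets into an 8x8 finer image (3x3 kernel fits)
T = rng.standard_normal((4, 4, 3, 3))
finer = render_layer(h, s, t, T, out_shape=(8, 8))
print(finer.shape)
```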

Note that in NRM there is, in principle, one rendering template at each pixel location of an intermediate image, so an image with many pixels would require as many templates at that layer alone. This is too many rendering templates and would require a very large amount of data to learn, considering all layers in NRM. Therefore, we further constrain NRM by enforcing that all pixels in the same feature map share the same rendering template. This constraint yields convolutions in CNNs during the inference in NRM, and the rendering templates in NRM now correspond to the convolution filters in CNNs.

While the template selecting and local translation latent variables could be left independent, we further constrain the model by enforcing dependency among them across the layers in NRM. This constraint is motivated by realistic rendering of natural objects: different parts of a natural object depend on each other. For example, in an image of a person, the locations of the eyes are restricted by the location of the head, and if the face is not rendered, then it is likely that we cannot find the eyes in the image either. Thus, NRM tries to capture such dependency in natural objects by imposing more structure on the joint prior of the latent variables at all layers in the model. In particular, the joint prior is given by Eqn. 1. The form of the joint prior might look mysterious at first, but NRM parametrizes it in this particular way so that it is the conjugate prior of the model likelihood, as proven in Appendix C.4. Specifically, in order to derive the conjugate prior, we would like the log conditional distribution to have the same piecewise-linear form as the CNNs that compute the posterior. This design criterion results in each term of the joint prior in Eqn. 1. The conjugate form of the prior allows efficient inference in NRM. Note that the parameters of the conjugate prior, due to its form, become the bias terms after the convolutions in CNNs during inference, as will be shown in Theorem 3.2. Furthermore, when training in an unsupervised setup, the conjugate prior results in the RPN regularization, as shown in Theorem 3.3(b). This RPN regularization helps enforce the dependencies among latent variables in the model and increases the likelihood of the latent configurations present in the data during training. For the sake of clarity, we summarize all the notations used in NRM in Table 2.

We summarize the NRM's rendering process in Algorithm 1 below. Reconstructed images at each layer of a 5-layer NRM trained on MNIST are visualized in Figure 4. NRM reconstructs the images in two steps. First, the bottom-up E-step inference in NRM, which has a CNN form, keeps track of the optimal latent variables inferred from the input image. Second, in the top-down E-step reconstruction, NRM uses these latent variables to render the reconstructed image according to Eqns. 2 and 5. The network is trained using the semi-supervised learning framework discussed in Section 3.3. The reconstructed images show that NRM renders images from coarse to fine. Early layers in the model, such as layers 4 and 3, capture coarse-scale features of the image, while later layers, such as layers 2 and 1, capture finer-scale features. Starting from layer 2, we begin to see the gist of the rendered digits, which become clearer at layer 1. Note that the images at layer 4 represent the class templates in NRM, which also serve as the softmax weights in the CNN.

[Figure 4 columns, left to right: Layer 4, Layer 3, Layer 2, Layer 1, Layer 0, Original]
Figure 4: Reconstructed images at each layer in a 5-layer NRM trained on MNIST with 50K labeled data. Original images are in the rightmost column. Early layers in the rendering, such as layers 4 and 3, capture coarse-scale features of the image, while later layers, such as layers 2 and 1, capture finer-scale features. From layer 2, we begin to see the gist of the rendered digits.

In this paper, in order to simplify the notation, we model only the conditional prior over the latent variables. An extension that also models the class prior can be achieved by adding the corresponding term inside the Softmax operator in Eqn. 1, and all theorems and proofs can easily be extended to that case. Furthermore, to facilitate the discussion, we use set notation for the collections of all rendering templates, zero-padding matrices, and local translation matrices at each layer, and we use the corresponding per-pixel and set-level notations interchangeably.

  Input: Object category.
  Output: Rendered image given the object category.
  Parameters: the class templates, the rendering templates at each layer, and the parameters of the conjugate prior at each layer, which turn out to be the bias terms in the ReLUs after the convolutions at each layer in CNNs.

  1. Use a Markov chain Monte Carlo method to sample the latent variables in NRM from the conjugate prior in Eqn. 1.

  2. Render the intermediate images using the recursion in Eqn. 5, in which the local translation matrices and zero-padding matrices at each pixel location of each layer are as described above.

  3. Add Gaussian pixel noise to the finest rendered image to obtain the final rendered image.
Algorithm 1 Rendering Process in NRM
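
A minimal Python sketch of Algorithm 1 is given below. It is only illustrative: the sampling from the conjugate prior is replaced by a stand-in Bernoulli draw, and each layer is abstracted as a callable that renders a finer image from the coarser one.

```python
import numpy as np

def sample_rendering(class_template, layer_ops, noise_std, rng):
    """Illustrative sketch of Algorithm 1: coarse-to-fine rendering plus pixel noise.

    class_template : coarsest image, determined by the object category
    layer_ops      : list of callables; layer_ops[l](image, z_l) renders one finer layer
                     given sampled latent variables z_l (e.g. via Eqn. 5)
    """
    image = class_template
    z = []
    for op in layer_ops:
        z_l = rng.integers(0, 2, size=image.shape)   # stand-in for sampling from the prior
        z.append(z_l)
        image = op(image, z_l)
    return image + noise_std * rng.standard_normal(image.shape), z

# Toy usage: each "layer" just doubles the resolution of the masked coarse image.
rng = np.random.default_rng(0)
upsample = lambda img, z_l: np.kron(img * z_l, np.ones((2, 2)))
x, z = sample_rendering(rng.standard_normal((4, 4)), [upsample, upsample], 0.1, rng)
print(x.shape)   # (16, 16)
```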

NRM with skip connections: In Section 3.2 below, we show that CNNs can be derived as joint maximum a posteriori (JMAP) inference in NRM. By introducing skip connections into the structure of the rendering matrices, we can also derive convolutional neural networks with skip connections, including ResNet and DenseNet. The detailed derivations of ResNet and DenseNet can be found in Appendix B.7.

3.2 Inference

We would like to show that inference in NRM has the CNN form (see Figure 2) and, therefore, is still tractable and efficient. This correspondence between NRM and CNNs helps us achieve two important goals when developing a new generative model. First, the desired generative model is complex enough, with the rich structures needed to capture the great diversity of forms and appearances of the objects surrounding us. Second, inference in the model is fast and efficient, so the model can be learned in reasonable time. These advantages justify our modeling choice of using classification models like CNNs to inform our design of generative models like NRM.

The following theorem establishes the aforementioned correspondence, showing that the JMAP inference of the optimal latent variables in NRM is indeed a CNN:

Theorem 3.2.

Consider the set of all parameters in NRM. The JMAP inference of the latent variables z in NRM is the feedforward step in CNNs. In particular, we have:

(6)

where the feature maps are computed recursively. In particular, starting from the input image, we have:

(7)

The equality in Eqn. 6 holds when the parameters in NRM satisfy the non-negativity assumption, namely that the intermediate rendered images are non-negative.

There are four takeaways from the results in Theorem 3.2:

  1. ReLU non-linearities in CNNs find the optimal value for the template selecting latent variables at each layer in NRM, detecting if particular features exist in the image or not.

  2. MaxPool operators in CNNs find the optimal value for the local translation latent variables at each layer in NRM, locating where particular features are rendered in the image.

  3. Bias terms after each convolution in CNNs are from the prior distribution of latent variables in the model. Those bias terms update the posterior estimation of latent variables from data using the knowledge encoded in the prior distribution of those latent variables.

  4. Convolutions in CNNs result from reversing the local rendering operators, which use the rendering template , in NRM. Instead of rendering as in NRM, convolutions in CNNs perform template matching. Particularly, it can be shown that convolution weights in CNNs are the transposes of the rendering templates in NRM.

Table 1 summarizes the correspondences between NRM and CNNs. The proofs for these correspondences are postponed to Appendix C. The non-negativity assumption on the intermediate rendered images allows us to apply max-product message passing and push the max over latent variables through the product of rendering matrices. Thus, given this assumption, the equality in Eqn. 6 holds. In Eqn. 7, we have removed the generative constraints inherited from NRM to derive the weights in CNNs, which become free parameters. As a result, when faced with training data that violates NRM's underlying assumptions, CNNs have more freedom to compensate. We refer to this process as a discriminative relaxation of a generative classifier [35, 36]. Finally, the dot product with the object template in Eqn. 6 corresponds to the fully connected layer before the Softmax non-linearity in CNNs.

Given Theorem 3.2 and the four takeaways above, NRM has successfully reverse-engineered CNNs. However, the impact of Theorem 3.2 goes beyond a reverse-engineering effort. First, it provides probabilistic semantics for components of CNNs, justifying their usage and providing an opportunity to employ probabilistic inference methods in the context of CNNs. In particular, convolution operators in CNNs can be seen as factor nodes in the factor graph associated with NRM. Similarly, activations from the convolutions in CNNs correspond to bottom-up messages in that factor graph. The bias terms added to the activations in CNNs, which come from the joint prior distribution of latent variables, are equivalent to top-down messages from the top layers of NRM. These top-down messages have receptive fields covering the whole image and are used to update the bottom-up messages, which are estimated from local information with smaller receptive fields. Finally, ReLU non-linearities and MaxPool operators in CNNs are max-marginalization operators over the template selecting and local translation latent variables in NRM, respectively. These max-marginalization operators come from the max-product message passing used to infer the latent variables in NRM.

Second, Theorem 3.2 provides a flexible framework to design CNNs. Instead of directly engineering CNNs for new tasks and datasets, we can modify NRM to incorporate our knowledge of the tasks and datasets into the model and then perform JMAP inference to obtain a new CNN architecture. For example, in Theorem 3.2, we show how ReLU can be derived from max-marginalization over the template selecting latent variables; by changing their distribution, we can derive Leaky ReLU. Furthermore, batch normalization in CNNs can be derived from NRM by normalizing the intermediate rendered images at each layer in NRM. Also, as mentioned above, by introducing skip connections into the rendering matrices, we can derive ResNet and DenseNet. Details of these derivations can be found in Appendices B and C.
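
The first two correspondences in Theorem 3.2 can be sanity-checked numerically: ReLU is the max over the binary template-selecting latent (max over s in {0, 1} of s times the pre-activation equals max(a, 0)), and MaxPool is the max over candidate local translations within a pooling window. The sketch below assumes a single feature map and 2x2 pooling.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))   # pre-activation (convolution output + bias), one feature map

# ReLU as max-marginalization over the binary template-selecting latent s in {0, 1}:
#   max_s s * a = max(a, 0)
relu_via_s = np.maximum(0.0 * a, 1.0 * a)
assert np.allclose(relu_via_s, np.maximum(a, 0.0))

# MaxPool as max-marginalization over the local translation latent
# (here: which position inside each 2x2 block the feature was rendered at).
def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(relu_via_s).shape)   # (3, 3)
```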

3.3 Learning

NRM learns from both labeled and unlabeled data. Learning in NRM can be posed as likelihood estimation problems that optimize the conditional log-likelihood and the expected complete-data log-likelihood for supervised and unsupervised learning, respectively. Interestingly, the cross-entropy loss used in training CNNs with labeled data is an upper bound of the NRM's negative conditional log-likelihood. NRM solves these likelihood optimization problems via the Expectation-Maximization (EM) approach. In the E-step, inference in NRM finds the optimal latent variables; this inference has the form of a CNN, as shown in Theorem 3.2. In the M-step, given these latent variables, NRM maximizes the corresponding likelihood objective functions or their lower bounds, as in the case of the cross-entropy loss. There is no closed-form M-step update for deep models like NRM, so NRM employs generalized EM instead [37, 38]. In generalized EM, the M-step seeks to increase the value of the likelihood objective function instead of maximizing it. In particular, in the M-step, NRM uses gradient-based methods such as Stochastic Gradient Descent (SGD) [39, 40] to update its parameters. The following theorem derives the learning objectives for NRM in both supervised and unsupervised settings.
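
The following sketch shows the shape of this generalized EM loop, with the E-step and the gradient of the chosen loss passed in as callables; the toy usage at the end is purely illustrative and not the NRM likelihood.

```python
import numpy as np

def generalized_em(data, params, e_step, grad, lr=0.1, n_iters=200):
    """Minimal sketch of NRM learning via generalized EM.

    e_step(x, params) -> z_hat : CNN-style bottom-up inference of the latent variables
    grad(x, y, z, params)      : gradient of the chosen loss (e.g. cross-entropy, or
                                 reconstruction + RPN) with respect to the parameters
    The M-step only decreases the loss with one SGD step, as in generalized EM.
    """
    for _ in range(n_iters):
        for x, y in data:
            z_hat = e_step(x, params)                         # E-step: infer latent variables
            params = params - lr * grad(x, y, z_hat, params)  # M-step: one gradient step
    return params

# Toy usage: a one-parameter "renderer" reconstructing x as params * z_hat.
data = [(np.array([2.0]), None), (np.array([4.0]), None)]
e_step = lambda x, p: np.sign(x)
grad = lambda x, y, z, p: -2.0 * z * (x - p * z)              # gradient of the squared error
print(generalized_em(data, np.array([0.0]), e_step, grad))    # settles near the data mean
```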

Theorem 3.3.

For any sample size n, let (x_1, y_1), …, (x_n, y_n) be i.i.d. samples from NRM. Assume that the final rendered template is normalized such that its norm is constant. The following holds:
(a) Cross-entropy loss for supervised training CNNs with labeled data:

(8)

where the posterior is the one estimated by the CNN, and the cross-entropy is taken between this estimated posterior and the true posterior given by the ground truth.
(b) Reconstruction loss with RPN for unsupervised training of CNNs with labeled and unlabeled data:

(9)

where the latent variables are estimated by the CNN as described in Theorem 3.2, the reconstructed image is rendered from them, and the RPN regularization is the negative log prior defined as follows:

(10)
Cross-Entropy Loss for Training Convolutional Neural Networks with Labeled Data:

Part (a) of Theorem 3.3 establishes the cross-entropy loss in the context of CNNs as an upper bound of the NRM's negative conditional log-likelihood. Different from other derivations of the cross-entropy loss via logistic regression, Theorem 3.3(a) derives the cross-entropy loss in conjunction with the architecture of CNNs, since the estimation of the optimal latent variables is part of the optimization in Eqn. 8. In other words, Theorem 3.3(a) ties feature extraction and learning for classification in CNNs into an end-to-end conditional likelihood estimation problem in NRM. This new interpretation of the cross-entropy loss suggests an interesting direction in which better losses for training CNNs with labeled data for supervised classification tasks can be derived from tighter upper bounds of the negative conditional log-likelihood. The Max-Min cross-entropy in Section 5 is an example. Note that the assumption that the rendered image has constant norm is solely for ease of presentation. Later, in Appendix B, we extend the result of Theorem 3.3(a) to the setting in which the norm of the rendered image is bounded.

In order to estimate how tight the cross-entropy upper bound is, we also prove a lower bound for the negative conditional log-likelihood. The gap between this lower bound and the cross-entropy upper bound indicates the quality of the estimation in Theorem 3.3(a). In particular, this gap is given by:

(11)

More details can be found in Appendix B, while the detailed proof is deferred to Appendix C.

Reconstruction Loss with the Rendering Path Normalization (RPN) Regularization for Unsupervised Learning with Both Labeled and Unlabeled Data:

Part (b) of Theorem 3.3 shows that NRM learns from both labeled and unlabeled data by maximizing its expected complete-data log-likelihood, which is the sum of a reconstruction loss and the RPN regularization. Deriving the E-step and M-step of generalized EM when the pixel noise variance is nonzero is rather complicated; therefore, for simplicity, we only focus on the setting in which the pixel noise variance goes to 0. Under that setting, in the M-step, NRM minimizes the objective function with respect to the parameters of the model. The first term in this objective function is the reconstruction loss between the input image and the reconstructed template. The second term is the Rendering Path Normalization (RPN) defined in Eqn. 10. RPN encourages the latent variables inferred in the bottom-up E-step to have higher prior probability among all possible latent configurations. Due to the parametric form of the prior in Eqn. 1, RPN also enforces the dependencies among latent variables at different layers in NRM. An approximation to this RPN regularization is discussed in Appendix B.2.

Semi-Supervised Learning with the Latent-Dependent Deep Rendering Model:

NRM learns from labeled and unlabeled data simultaneously by maximizing a weighted combination of the cross-entropy loss for supervised learning and the reconstruction loss with RPN regularization for unsupervised learning, as in Theorem 3.3. We now formulate the semi-supervised learning problem in NRM. In particular, let (x_1, y_1), …, (x_n, y_n) be i.i.d. samples from NRM and assume that the labels are unknown for some of the samples. NRM utilizes the following objective to determine the optimal parameters employed for the semi-supervised classification task:

(12)

where the two non-negative weights are associated with the reconstruction loss/RPN regularization and the cross-entropy loss, respectively. Again, the optimal latent variables are inferred in the E-step as in Theorem 3.2. For unlabeled data, only the unsupervised term is used.
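
A minimal sketch of this weighted objective is shown below; the loss callables are stand-ins, and the convention that unlabeled examples contribute only the unsupervised term is an assumption made for illustration.

```python
import numpy as np

def semi_supervised_objective(batch, recon_rpn_loss, cross_entropy_loss, alpha, beta):
    """Minimal sketch of the weighted semi-supervised objective (Eqn. 12).

    batch              : iterable of (x, y) pairs; y is None for unlabeled examples
    recon_rpn_loss     : callable x -> reconstruction loss + RPN regularization (Theorem 3.3b)
    cross_entropy_loss : callable (x, y) -> cross-entropy loss (Theorem 3.3a)
    alpha, beta        : non-negative weights on the unsupervised and supervised terms
    """
    total = 0.0
    for x, y in batch:
        total += alpha * recon_rpn_loss(x)            # used for labeled and unlabeled data
        if y is not None:
            total += beta * cross_entropy_loss(x, y)  # used for labeled data only
    return total

# Toy usage with stand-in losses.
batch = [(np.ones(4), 1), (np.zeros(4), None)]
loss = semi_supervised_objective(batch,
                                 recon_rpn_loss=lambda x: float(np.sum(x ** 2)),
                                 cross_entropy_loss=lambda x, y: 0.7,
                                 alpha=1.0, beta=0.5)
print(loss)   # 4.0 + 0.35 = 4.35
```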

In summary, combining Theorems 3.3(a) and 3.2, NRM allows us to derive CNNs with convolution layers, ReLU non-linearities, and MaxPool layers. These CNNs optimize the cross-entropy loss for supervised classification tasks with labeled data. Combining Theorems 3.3(b) and 3.2, NRM extends traditional CNNs to unsupervised learning tasks in which the networks optimize the reconstruction loss with the RPN regularization. NRM performs semi-supervised learning by optimizing the weighted combination of the losses in Theorems 3.3(a) and 3.3(b). Inference in the semi-supervised setup still follows Theorem 3.2. NRM can also be extended to explain other variants of CNNs, including ResNet and DenseNet, as well as other components of CNNs such as Leaky ReLU and batch normalization.

4 Statistical Guarantees for the Neural Rendering Model in the Supervised Setting

We provide statistical guarantees for NRM to establish that NRM is well defined statistically. First, we prove that NRM is consistent under a supervised learning setup. Second, we provide a generalization bound for NRM, which is proportional to the ratio of the number of active rendering paths to the total number of rendering paths in the trained NRM. A rendering path is a configuration of all latent variables in NRM, as defined in Table 2, and active rendering paths are those rendering paths whose corresponding rendered image is sufficiently close to one of the data points from the input data distribution. Our key results are summarized below. More details and proofs are deferred to Appendices B and C.

Governed by the connection between the cross-entropy and the posterior class probabilities under NRM, for the supervised setting with i.i.d. data drawn from the data distribution, we utilize the following objective to determine the optimal parameters employed for the classification task

(13)

where the non-negative weights are associated with the reconstruction loss and the cross-entropy loss, respectively. Here, the approximate posterior is chosen according to Theorem 3.3(a) in the regime where the pixel noise variance goes to 0. The optimal solutions of objective function (13) induce a corresponding set of optimal (active) rendering paths that play a central role in understanding the generalization bound for the classification tasks.

Before proceeding to the generalization bound, we first state the informal result regarding the consistency of optimal solutions of (13) when the sample size goes to infinity.

Theorem 4.1.

(Informal) Under appropriate conditions on the parameter spaces, the optimal solutions of objective function (13) converge almost surely to those of the corresponding population objective function, whose formal statement is given in Appendix C.

In Appendix C, we provide detailed formulations of Theorem 4.1 for supervised learning. Additionally, the detailed proof of this result is presented in Appendix D. The statistical guarantee regarding the optimal solutions of (13) validates their usage for the classification task. Given that NRM is consistent under a supervised learning setup, the following theorem establishes a generalization bound for the model.

Theorem 4.2.

(Informal) Consider the population and empirical losses of NRM on the data population and the training set, respectively. Under the margin-based loss, the generalization gap of the classification framework with optimal solutions from (13) is controlled by the following term

with high probability. Here, the bound depends on the ratio of active optimal rendering paths among all the optimal rendering paths, the total number of rendering paths, the lower bound on the prior probability over labels, and the radius of the sphere that the rendered images belong to.

The detailed formulations of the above theorem are postponed to Appendix C. The dependence of the generalization bound on the number of active rendering paths helps justify our modeling assumptions. In particular, NRM reduces the number of active rendering paths thanks to the dependencies among its latent variables, thereby tightening the generalization bound. Nevertheless, the current generalization bound has a limitation: it involves the total number of rendering paths, which is usually large. This is mainly because our bound has not fully taken into account the structures of CNNs, a limitation shared with other recent generalization bounds for CNNs. It is interesting to explore whether the techniques in [41] and [42] can be employed to improve this term in our bound.

Extension to unsupervised and semi-supervised settings

Apart from the statistical guarantee and generalization bound established for the supervised setting, we also provide careful theoretical studies as well as detailed proofs of the corresponding results for the unsupervised and semi-supervised settings in Appendices B, C, D, and E.

5 New Max-Min Cross Entropy From The Neural Rendering Model

In this section, we explore a particular way to derive an alternative to the cross-entropy, inspired by the results in Theorem 3.3(a). In particular, the new loss, which we call the Max-Min cross-entropy, is a weighted average of the cross-entropy losses computed from the Max and Min branches:

Here the Max cross-entropy maximizes the correct target posterior, while the Min cross-entropy minimizes the incorrect target posterior. Similar to the cross-entropy loss, the Max-Min cross-entropy can also be shown to be an upper bound on the negative conditional log-likelihood of NRM and has the same generalization bound derived in Section 4. The Max-Min networks in Figure 5 realize this new loss. These networks have two CNN-like branches that share weights. The Max branch uses ReLU and Max-Pooling, and the Min branch uses the Negative ReLU, i.e., NReLU(x) = min(x, 0), and Min-Pooling. The Max-Min networks can be interpreted as a form of knowledge distillation like the Born Again networks [43] and the Mean Teacher networks. However, instead of a student network learning from a teacher network, in Max-Min networks two student networks, the Max and the Min networks, cooperate and learn from each other during training.

Figure 5: The Max-Min network. The Max branch in the Max-Min network maximizes the correct target posterior while the Min branch minimizes the incorrect target posterior. These two networks share weights. The Max-Min cross-entropy loss is the weighted average of the cross-entropy losses from the Max and the Min networks.
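
Below is a hedged, one-layer sketch of the Max-Min cross-entropy. The Max and Min branches share weights and differ only in their non-linearities (ReLU versus NReLU; pooling is omitted for brevity); how exactly the Min branch's posterior enters the loss is an assumption here, chosen so that pushing up the correct-class probability of the Min-branch posterior pushes down the incorrect-class probabilities.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nrelu(x):
    return np.minimum(x, 0.0)

def max_min_cross_entropy(x, y, W, b, C, lam=0.5):
    """Hedged sketch of the Max-Min cross-entropy for a one-layer "network".

    Max branch: ReLU features; Min branch: NReLU features (W, b, C are shared).
    The combined loss is a weighted average of the two branches' cross-entropies.
    """
    p_max = softmax(C @ np.maximum(W @ x + b, 0.0))   # Max branch posterior
    p_min = softmax(C @ nrelu(W @ x + b))             # Min branch posterior (shared weights)
    ce_max = -np.log(p_max[y])                        # push the correct-class probability up
    ce_min = -np.log(p_min[y])                        # push the incorrect-class probabilities down
    return lam * ce_max + (1.0 - lam) * ce_min

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W, b, C = rng.standard_normal((8, 16)), rng.standard_normal(8), rng.standard_normal((3, 8))
print(max_min_cross_entropy(x, y=0, W=W, b=b, C=C))
```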

6 Experiments

6.1 Semi-Supervised Learning

We show that NRM, armed with the Max-Min cross-entropy and the Mean Teacher regularizer, achieves SOTA on benchmark datasets. We discuss the experimental results for CIFAR10 and CIFAR100 here. The results for SVHN, the training losses, and the training details can be found in Appendices A and F.

CIFAR-10:
     1K labels / 50K images      2K labels / 50K images      4K labels / 50K images      50K labels / 50K images

Adversarially Learned Inference [8]
Improved GAN [7]
Ladder Network [44]
Π model [5]
Temporal Ensembling [5]
Mean Teacher [6]
VAT+EntMin [45]
DRM [15, 46]

Supervised-only
NRM without RPN
NRM+RPN
NRM+RPN+Max-Min
NRM+RPN+Max-Min+Mean Teacher

Table 3: Error rate percentage on CIFAR-10 over 3 runs.

Table 3 shows that NRM's results are comparable to SOTA methods. NRM is also better than the best methods that do not use consistency regularization, such as GAN-based methods, the Ladder network, and ALI, when using only 2K and 4K labeled images. NRM outperforms DRM in all settings. Also, among the methods in our comparison, NRM achieves the best test accuracy when using all available labeled data (50K). We hypothesize that NRM has an advantage over consistency regularization methods like Temporal Ensembling and Mean Teacher when there is enough labeled data because the consistency regularization in those methods tries to match the activations in the network but does not take into account the available class labels. On the contrary, NRM employs the class labels, when available, in its reconstruction loss and RPN regularization, as in Eqns. 9 and 10. In all settings, the RPN regularizer improves NRM's performance; even though the improvement from RPN is small, it is consistent across the experiments. Furthermore, using the Max-Min cross-entropy significantly reduces the test errors. When combined with Mean Teacher, our Max-Min NRM improves upon Mean Teacher and consistently achieves either SOTA or second-best results in all settings. This consistency in performance is only observed in our method and Mean Teacher. Also, as with Mean Teacher, NRM can potentially be combined with other consistency regularization methods, e.g., Virtual Adversarial Training (VAT) [45], to obtain better results.

CIFAR-100:

Table 4 shows that NRM's results are comparable to the Π model and Temporal Ensembling, and better than DRM. As with CIFAR10, using the RPN regularizer results in slightly better test accuracy, and NRM achieves better results than the Π model and Temporal Ensembling when using all available labeled data. Notice that combining with Mean Teacher only slightly improves NRM's performance when training with 10K labeled data. This is again because consistency regularization methods like Mean Teacher do not add much advantage when there is enough labeled data. However, NRM+Max-Min still yields better test errors and achieves SOTA results in all settings. Note that since combining with Mean Teacher does not help much here, we only show results for NRM+Max-Min.

  10K labels / 50K images   50K labels / 50K images

Π model [5]
Temporal Ensembling [5]
DRM [15, 46]

Supervised-only
NRM without RPN
NRM+RPN
NRM+RPN+Mean Teacher
NRM+RPN+Max-Min

Table 4: Error rate percentage on CIFAR-100 over 3 runs.

6.2 Supervised Learning with Max-Min Cross-Entropy

The Max-Min cross-entropy can be applied not only to improve semi-supervised learning but also to enhance the supervised learning performance of deep models, including CNNs. In our experiments, we indeed observe that the Max-Min cross-entropy reduces the test error for supervised object classification on CIFAR10. In particular, using the Max-Min cross-entropy loss on a 29-layer ResNeXt [47] trained with the Shake-Shake regularization [48] and Cutout data augmentation [49], we achieve a SOTA test error of 2.30% on CIFAR10, an improvement of 0.26% over the test error of the baseline architecture trained with the traditional cross-entropy loss. While a 0.26% improvement seems small, it is a meaningful enhancement given that our baseline architecture (ResNeXt + Shake-Shake + Cutout) is the second-best model for supervised learning on CIFAR10. Such a small improvement over an already very accurate model is significant in applications where high accuracy is demanded, such as self-driving cars or medical diagnostics. Similarly, we observe that Max-Min improves the top-5 test error of the Squeeze-and-Excitation ResNeXt-50 network [50] on ImageNet by 0.17% compared to the baseline (7.04% vs. 7.21%). For a fair comparison, we re-train the baseline models and report the scores from our re-implementation.

7 Discussion

We present the NRM, a general and effective framework for semi-supervised learning that combines generation and prediction in an end-to-end optimization. Using NRM, we can explain operations used in CNNs and develop new features that help learning in CNNs. For example, we derive the new Max-Min cross-entropy loss for training CNNs, which outperforms the traditional cross-entropy.

In addition to the results discussed above, there are still many open problems related to NRM that have not been addressed in the paper. We give several examples below:

  • An adversarial loss like in GANs can be incorporated into the NRM so that the model can generate realistic images. Furthermore, more knowledge of image generation from graphics and physics can be integrated in NRM so that the model can employ more structures to help learning and generation.

  • The unsupervised and (semi-)supervised models that we consider throughout the paper are under the assumption that the pixel noise of NRM goes to 0. Governed by this assumption, we are able to derive efficient inference algorithms as well as rigorous statistical guarantees for these models. For the setting in which the noise is not close to 0, inference with these models will rely on the vanilla Expectation-Maximization (EM) algorithm for mixture models to obtain reliable estimators of the rendering templates. Since the parameters of interest are shared among different rendering templates and have high-dimensional structure, it is of practical interest to develop efficient EM algorithms that capture these properties of the parameters under that noise setting in NRM.

  • Thus far, the statistical guarantees for parameter estimation in the paper are established under the ideal assumption that the optimal global solutions are obtained. However, in practice, SGD-based inference algorithms for the unsupervised and (semi-)supervised models often converge to (bad) local minima. As a consequence, investigating sufficient conditions under which the inference algorithms avoid being trapped at bad local minima is an important avenue for future work.

  • NRM hinges upon the assumption that the data are generated from a mixture of Gaussian distributions whose mean parameters characterize the complex rendering templates. However, in reality, the underlying distribution of each mixture component may not be Gaussian. Therefore, extending the current understanding of NRM under the Gaussian assumption to other choices of underlying distributions is an interesting direction to explore.

We would like to end the paper with the remark that NRM is a flexible framework: new components can be introduced in the generative process, and the corresponding features for CNNs can then be derived in the inference. This hallmark of NRM provides a more fundamental and systematic way to design and study CNNs.

8 Acknowledgements

First of all, we are very grateful to Amazon AI for providing a highly stimulating research environment for us to start this research project and further supporting our research through their cloud credits program. We would also like to express our sincere thanks to Gautam Dasarathy for great discussions. Furthermore, we would also like to thank Doris Y. Tsao for suggesting and providing references for connections between our model and feedforward and feedback connections in the brain.

Many people during Tan Nguyen’s internship at Amazon AI have helped by providing comments and suggestions on our work, including Stefano Soatto, Zack C. Lipton, Yu-Xiang Wang, Kamyar Azizzadenesheli, Fanny Yang, Jean Kossaifi, Michael Tschannen, Ashish Khetan, and Jeremy Bernstein. We also wish to thank Sheng Zha who has provided immense help with MXNet framework to implement our models.

Finally, we would like to thank members of the DSP group at Rice, the Machine Learning group at UC Berkeley, and Anima Anandkumar's TensorLab at Caltech, who have always been supportive throughout the time it has taken to finish this project.

References

  • [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, pp. 2672–2680, 2014.
  • [2] R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79–87, 1999.
  • [3] K. Friston, “Does predictive coding have a future?,” Nature Neuroscience, vol. 21, no. 8, pp. 1019–1021, 2018.
  • [4] B. Neyshabur, R. R. Salakhutdinov, and N. Srebro, “Path-sgd: Path-normalized optimization in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2422–2430, 2015.
  • [5] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” in International Conference on Learning Representations, 2017.
  • [6] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in Neural Information Processing Systems 30, pp. 1195–1204, 2017.
  • [7] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems 29, pp. 2234–2242, 2016.
  • [8] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially learned inference,” in International Conference on Learning Representations, 2017.
  • [9] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [10] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Advances in Neural Information Processing Systems 27, pp. 3581–3589, 2014.
  • [11] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” in International Conference on Learning Representations, 2017.
  • [12] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent components estimation,” in International Conference on Learning Representations Workshop, 2015.
  • [13] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” in International Conference on Learning Representations, 2017.
  • [14] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Advances in Neural Information Processing Systems, 2018.
  • [15] A. B. Patel, M. T. Nguyen, and R. Baraniuk, “A probabilistic framework for deep learning,” in Advances in Neural Information Processing Systems 29, pp. 2558–2566, 2016.
  • [16] A. Achille and S. Soatto, “Emergence of invariance and disentanglement in deep representations,” The Journal of Machine Learning Research, vol. 19, pp. 1–34, 2018.
  • [17] A. Achille and S. Soatto, “Information dropout: Learning optimal representations through noisy computation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [18] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Annual Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.
  • [19] V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 2887–2938, 2017.
  • [20] H. Bristow, A. Eriksson, and S. Lucey, “Fast convolutional sparse coding,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 391–398, IEEE, 2013.
  • [21] B. Wohlberg, “Efficient convolutional sparse coding,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7173–7177, IEEE, 2014.
  • [22] F. Heide, W. Heidrich, and G. Wetzstein, “Fast and flexible convolutional sparse coding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5135–5143, 2015.
  • [23] V. Papyan, J. Sulam, and M. Elad, “Working locally thinking globally: Theoretical guarantees for convolutional sparse coding,” IEEE Transactions on Signal Processing, vol. 65, no. 21, pp. 5687–5701, 2017.
  • [24] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
  • [25] S. Mallat, “Understanding deep convolutional networks,” Phil. Trans. R. Soc. A, vol. 374, no. 2065, p. 20150203, 2016.
  • [26] S. Arora, N. Cohen, and E. Hazan, “On the optimization of deep networks: Implicit acceleration by overparameterization,” arXiv preprint arXiv:1802.06509, 2018.
  • [27] C. D. Freeman and J. Bruna, “Topology and geometry of half-rectified network optimization,” in International Conference on Learning Representations, 2017.
  • [28] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in Artificial Intelligence and Statistics, pp. 192–204, 2015.
  • [29] K. Kawaguchi, “Deep learning without poor local minima,” in Advances in Neural Information Processing Systems 29, pp. 586–594, 2016.
  • [30] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, “Generalization in deep learning,” arXiv preprint arXiv:1710.05468, 2017.
  • [31] B. Neyshabur, S. Bhojanapalli, D. Mcallester, and N. Srebro, “Exploring generalization in deep learning,” in Advances in Neural Information Processing Systems 30, pp. 5947–5956, 2017.
  • [32] R. Balestriero and R. G. Baraniuk, “A spline theory of deep networks,” in Proceedings of the 34th International Conference on Machine Learning, 2018.
  • [33] R. Vidal, J. Bruna, R. Giryes, and S. Soatto, “Mathematics of deep learning,” arXiv preprint arXiv:1712.04741, 2017.
  • [34] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning, pp. 1050–1059, 2016.
  • [35] A. Y. Ng and M. I. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes,” in Advances in Neural Information Processing Systems, pp. 841–848, 2002.
  • [36] J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, “Generative or discriminative? getting the best of both worlds,” Bayesian statistics, vol. 8, no. 3, pp. 3–24, 2007.
  • [37] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 39, pp. 1–38, 1977.
  • [38] C. M. Bishop, Pattern recognition and machine learning (information science and statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
  • [39] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers, pp. 102–109, Springer, 1985.
  • [40] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
  • [41] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrally-normalized margin bounds for neural networks,” Advances in Neural Information Processing Systems (NIPS), 2017.
  • [42] N. Golowich, A. Rakhlin, and O. Shamir, “Size-independent sample complexity of neural networks,” Proceedings of the Conference On Learning Theory (COLT), 2018.
  • [43] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Born-again neural networks,” Proceedings of the International Conference on Machine Learning (ICML), 2018.
  • [44] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Advances in Neural Information Processing Systems 28, pp. 3546–3554, 2015.
  • [45] T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama, “Virtual adversarial training: a regularization method for supervised and semi-supervised learning,” IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [46] A. B. Patel, T. Nguyen, and R. G. Baraniuk, “A probabilistic theory of deep learning,” arXiv preprint arXiv:1504.00641, 2015.
  • [47] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5987–5995, IEEE, 2017.
  • [48] X. Gastaldi, “Shake-shake regularization of 3-branch residual networks,” in International Conference on Learning Representations, 2017.
  • [49] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
  • [50] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [51] A. Kumar, P. Sattigeri, and T. Fletcher, “Semi-supervised learning with gans: Manifold invariance with improved inference,” in Advances in Neural Information Processing Systems 30, pp. 5534–5544, 2017.
  • [52] T. Nguyen, W. Liu, E. Perez, R. G. Baraniuk, and A. B. Patel, “Semi-supervised learning with the deep rendering mixture model,” arXiv preprint arXiv:1612.01942, 2016.
  • [53] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
  • [54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • [55] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [56] S. van de Geer, Empirical Processes in M-estimation. Cambridge University Press, 2000.
  • [57] R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” arXiv:1011.3027v7, 2011.
  • [58] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability, Springer, 1996.
  • [59] R. Dudley, “Central limit theorems for empirical measures,” Annals of Probability, vol. 6, 1978.
  • [60] V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” Annals of Statistics, vol. 30, 2002.
  • [61] A. W. van der Vaart and J. Wellner, Weak Convergence and Empirical Processes. New York, NY: Springer-Verlag, 1996.
  • [62] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations, 2016.
  • [63] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Supplementary Material

Appendix A

This appendix contains semi-supervised learning results of NRM on SVHN compared to other methods.

Semi-Supervised Learning Results on SVHN:

Table 5 reports the test error rate (%) on SVHN under four settings: 250, 500, 1000, and all 73,257 labels out of the 73,257 training images. The baselines compared are ALI [8], Improved GAN [7], Improved GAN + Jacob.-reg + Tangents [51], the Π model [5], Temporal Ensembling [5], Mean Teacher [6], VAT+EntMin [45], and DRM [52]. Our entries are the supervised-only baseline, NRM without RPN, NRM+RPN, and NRM+RPN+Max-Min+MeanTeacher.

Table 5: Error rate percentage on SVHN in comparison with other state-of-the-art methods. All results are averaged over 2 runs (except for NRM+RPN when using all labels, which uses 1 run).

Appendix B

In this appendix, we give further connections of NRM to the cross entropy, as well as additional derivations relating NRM to the various models, under both the unsupervised and (semi-)supervised settings mentioned in the main text. We also formally present our results on consistency and generalization bounds for NRM in the supervised and semi-supervised learning settings. In addition, we explain how to extend NRM to derive ResNet and DenseNet. For simplicity of presentation, we collect all the parameters to be estimated in NRM into a single parameter, and we write the set of all possible values of the latent (nuisance) variables accordingly. Additionally, for each pair of label and latent variable, we denote the subset of parameters corresponding to that specific label and latent variable. Furthermore, to stress the dependence of the rendered template on the parameters, we define a rendering function for each such pair, in which a masking matrix associated with the latent variable enters the composition (a hedged illustration of this construction is sketched below).

Throughout this supplement, we use the two notations interchangeably whenever the context is clear. Furthermore, we assume that each block of parameters lies in a prescribed subset of its ambient space. We say that the parameters satisfy the non-negativity assumption if the intermediate rendered images are non-negative at every layer. Finally, a superscript ⊤ denotes the transpose of a matrix.
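As a rough illustration of the role played by the masking matrices above, the following Python sketch composes per-layer parameters with binary masks selected by the latent configuration to render a template for a fixed label. The layer sizes, the diagonal (elementwise) form of the masks, and the top-level class template are assumptions of this sketch rather than the paper's exact parameterization.

```python
# Hedged illustration of masking matrices in a coarse-to-fine rendering: for a fixed
# label y and latent configuration z, each layer applies its parameters and then a
# binary mask M(z) that switches individual units on or off. All sizes and values
# below are placeholders, not the paper's learned quantities.
import numpy as np

rng = np.random.default_rng(1)
layer_dims = [16, 64, 256]                # coarse-to-fine feature dimensions (assumed)
weights = [rng.normal(scale=0.1, size=(layer_dims[i + 1], layer_dims[i]))
           for i in range(len(layer_dims) - 1)]
class_template = rng.normal(size=layer_dims[0])   # hypothetical top-level template for label y

def render(masks):
    """Render a template given one binary mask per layer (one rendering path)."""
    h = class_template
    for W, m in zip(weights, masks):
        h = m * (W @ h)                   # mask M(z) acts elementwise on the layer output
    return h

# One latent configuration z corresponds to one choice of masks, i.e., one rendering path.
z_masks = [rng.integers(0, 2, size=layer_dims[i + 1]) for i in range(len(weights))]
template = render(z_masks)
print(template.shape)  # (256,)
```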

B.1 Connection between NRM and cross entropy

As established in part (a) of Theorem 3.3, the cross entropy provides a lower bound for the problem of maximizing the conditional log-likelihood. In the following full theorem, we show both an upper bound and a lower bound on the maximization of the conditional log-likelihood in terms of the cross entropy.
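The two-sided relationship can be anticipated from the generic log-sum-exp sandwich inequality below, stated here only as intuition; the scores $s_z$ stand in for per-configuration scores and $|\mathcal{Z}|$ for the number of latent configurations (notation assumed for this sketch), while the actual constants appearing in Theorem B.1 are those derived in the paper.

```latex
% Generic log-sum-exp sandwich: the max over configurations (cross-entropy-style objective)
% and the log of the sum over configurations (likelihood-style objective) differ by at most
% log of the number of configurations.
\[
\max_{z \in \mathcal{Z}} s_z \;\le\; \log \sum_{z \in \mathcal{Z}} \exp(s_z) \;\le\; \max_{z \in \mathcal{Z}} s_z + \log |\mathcal{Z}|.
\]
```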

Theorem B.1.

Given any , we denote . For any and , let be i.i.d. samples from the NRM. Then, the following holds
(a) (Lower bound)

where for all , for all