Abstract
Unsupervised and semi-supervised learning are important problems that are especially challenging with complex data like natural images. Progress on these problems would accelerate if we had access to appropriate generative models under which to pose the associated inference tasks. Inspired by the success of Convolutional Neural Networks (CNNs) for supervised prediction in images, we design the Neural Rendering Model (NRM), a new probabilistic generative model whose inference calculations correspond to those in a given CNN architecture. The NRM uses the given CNN to design the prior distribution in the probabilistic model. Furthermore, the NRM generates images from coarse to finer scales. It introduces a small set of latent variables at each level and enforces dependencies among all the latent variables via a conjugate prior distribution. This conjugate prior yields a new regularizer for training CNNs based on the paths rendered in the generative model: the Rendering Path Normalization (RPN). We demonstrate that this regularizer improves generalization, both in theory and in practice. In addition, likelihood estimation in the NRM yields training losses for CNNs, and inspired by this, we design a new loss termed the MaxMin cross-entropy, which outperforms the traditional cross-entropy loss for object classification. The MaxMin cross-entropy suggests a new deep network architecture, namely the MaxMin network, which can learn from less labeled data while maintaining good prediction performance. Our experiments demonstrate that the NRM with the RPN and the MaxMin architecture exceeds or matches the state-of-the-art on benchmarks including SVHN, CIFAR10, and CIFAR100 for semi-supervised and supervised learning tasks.
Keywords: neural nets, generative models, semi-supervised learning, cross-entropy, statistical guarantees
Neural Rendering Model: Joint Generation
and Prediction for Semi-Supervised Learning
Nhat Ho, Tan Nguyen, Ankit Patel, 
Anima Anandkumar, Michael I. Jordan, Richard G. Baraniuk 
University of California at Berkeley, Berkeley, USA 
Rice University, Houston, USA 
Baylor College of Medicine, Houston, USA 
California Institute of Technology, Pasadena, USA 
NVIDIA, Santa Clara, USA 
July 19, 2019
† Nhat Ho and Tan Nguyen contributed equally to this work. Anima Anandkumar, Michael I. Jordan, and Richard G. Baraniuk contributed equally to this work.
1 Introduction
Unsupervised and semi-supervised learning still lag behind the performance leaps we have seen in supervised learning over the last five years. This is partly due to a lack of good generative models that can capture all latent variations in complex domains such as natural images and provide useful structures that help learning. When it comes to probabilistic generative models, it is hard to design good priors for the latent variables that drive the generation.
Instead, recent approaches avoid the explicit design of image priors. For instance, Generative Adversarial Networks (GANs) use implicit feedback from an additional discriminator that distinguishes real from fake images [1]. Using such feedback helps GANs generate visually realistic images, but it is not clear whether this is the most effective form of feedback for predictive tasks. Moreover, due to the separation of generation and discrimination in GANs, there are typically more parameters to train, and this might make it harder to obtain gains for semi-supervised learning in the low (labeled) sample setting.
We propose an alternative approach to GANs by designing a class of probabilistic generative models such that inference in those models also performs well on predictive tasks. This approach is well-suited for semi-supervised learning since it eliminates the need for a separate prediction network. Specifically, we answer the following question: what generative processes yield Convolutional Neural Networks (CNNs) when inference is carried out? This is natural to ask since CNNs are state-of-the-art (SOTA) predictive models for images, and intuitively, such powerful predictive models should capture some essence of image generation. However, standard CNNs are not directly reversible and likely do not have all the information for generation since they are trained for predictive tasks such as image classification. We can instead invert the irreversible operations in CNNs, e.g., the rectified linear units (ReLUs) and spatial pooling, by assigning auxiliary latent variables to account for the uncertainty in the CNN's inversion process due to the information loss.
Contribution 1 – Neural Rendering Model: We develop the Neural Rendering Model (NRM), whose bottom-up inference corresponds to a CNN architecture of choice (see Figure 1a). The “reverse” top-down process of image generation is through coarse-to-fine rendering, which progressively increases the resolution of the rendered image (see Figure 1b). This is intuitive since the reverse process, bottom-up inference, reduces the resolution (and dimension) through operations such as spatial pooling. We also introduce structured stochasticity into the rendering process through a small set of discrete latent variables, which capture the uncertainty in reversing the CNN feedforward process. Rendering in NRM follows a product of linear transformations, which can be considered the transpose of the inference process in CNNs. In particular, the rendering weights in NRM are proportional to the transposes of the filters in CNNs. Furthermore, the bias terms in the ReLU units at each layer (after the convolution operator) make the latent variables in different network layers dependent (when the bias terms are nonzero). This design of the image prior has an interesting interpretation from a predictive-coding perspective in neuroscience: the dependency between latent variables can be considered a form of backward connections that captures prior knowledge from coarser levels in NRM and helps adjust the estimation at the finer levels [2, 3]. The correspondence between NRM and CNNs is given in Figure 2 and Table 1 below.
NRM is a likelihood-based framework, where unsupervised learning can be derived by maximizing the expected complete-data log-likelihood of the model, while supervised learning is done through optimizing the class-conditional log-likelihood. Semi-supervised learning unifies both log-likelihoods into one objective cost for learning from both labeled and unlabeled data. The NRM prior has the desirable property of being a conjugate prior, which makes learning in NRM computationally efficient.
Interestingly, we derive the popular cross-entropy loss used to train CNNs for supervised learning as an upper bound of the NRM's negative class-conditional log-likelihood. The broadly accepted interpretation of the cross-entropy loss in training CNNs comes from the logistic regression perspective: given features extracted from the data by the network, logistic regression is applied to classify those features into different classes, which yields the cross-entropy. In this interpretation, there is a gap between feature extraction and classification. By contrast, our derivation ties feature extraction and learning for classification in CNNs into an end-to-end optimization problem that estimates the conditional log-likelihood of NRM. This new interpretation of the cross-entropy loss allows us to develop better losses for training CNNs; an example is the MaxMin cross-entropy discussed in Contribution 2 and Section 5.
Table 1: The correspondence between NRM and CNNs. Inference over NRM's latent variables maps onto CNN operators such as ReLU and MaxPool, and NRM's learning objectives map onto the cross-entropy loss and the reconstruction loss.
Contribution 2 – New regularization, loss function, architecture, and generalization bounds: The joint nature of generation, inference, and learning in NRM allows us to develop new training procedures for semi-supervised and supervised learning, as well as new theoretical (statistical) guarantees for learning. In particular, for training, we derive a new form of regularization, termed the Rendering Path Normalization (RPN), from the NRM's conjugate prior. A rendering path is a set of latent variable values in NRM. Unlike the path-wise regularizer in [4], RPN uses information from a generative model to penalize the number of possible rendering paths and therefore encourages the network to be compact in terms of representing the image. It also helps enforce the dependency among different layers in NRM during training and improves classification performance.
We provide new theoretical bounds based on NRM. In particular, we prove that NRM is statistically consistent and derive a generalization bound of NRM for (semi-)supervised learning tasks. Our generalization bound is proportional to the number of active rendering paths that generate close-to-real images. This suggests that RPN regularization may help generalization, since RPN enforces the dependencies among latent variables in NRM and therefore reduces the number of active rendering paths. We observe that RPN helps improve generalization in our experiments.
MaxMin cross-entropy and network: We propose the new MaxMin cross-entropy loss function for learning, based on the negative class-conditional log-likelihood in NRM. It combines the traditional cross-entropy with another loss, which we term the Min cross-entropy. While the traditional (Max) cross-entropy maximizes the probability of the correct labels, the Min cross-entropy minimizes the probability of the incorrect labels. We show that the MaxMin cross-entropy is also an upper bound to the negative conditional log-likelihood of NRM, just like the cross-entropy loss. The MaxMin cross-entropy is realized through a new CNN architecture, namely the MaxMin network, which is a CNN with an additional branch that shares weights with the original CNN but contains minimum pooling (MinPool) operators and negative rectified linear units (NReLUs) (see Figure 5). Although the MaxMin network is derived from NRM, it is a meta-architecture that can be applied to any CNN architecture. We show empirically that the MaxMin network and cross-entropy help improve the SOTA on object classification for supervised and semi-supervised learning.
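To make the two-branch structure concrete, the sketch below shows a MaxMin block in NumPy. The elementwise product stands in for the shared convolution, and the exact definition NReLU(x) = min(x, 0), with MinPool(x) = -MaxPool(-x), is our assumption about the mirrored operators; only the nonlinearity and pooling differ between the two weight-sharing branches.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def nrelu(x):
    # Negative ReLU; we assume NReLU(x) = min(x, 0) here.
    return np.minimum(x, 0.0)

def max_pool(x, k=2):
    # Non-overlapping k x k max pooling over a 2-D map.
    H, W = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))

def min_pool(x, k=2):
    # MinPool mirrors MaxPool: MinPool(x) = -MaxPool(-x).
    return -max_pool(-x, k)

def maxmin_block(x, w):
    # Both branches share the same weights w; only the
    # nonlinearity and pooling differ between them.
    a = x * w  # elementwise product stands in for the shared convolution
    return max_pool(relu(a)), min_pool(nrelu(a))
```

The Max branch keeps the most positive evidence for each feature, while the Min branch keeps the most negative evidence, which is what the Min cross-entropy penalizes.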
Contribution 3 – State-of-the-art empirical results for semi-supervised and supervised learning: We show strong results for semi-supervised learning on the CIFAR10, CIFAR100, and SVHN benchmarks in comparison with SOTA methods that do and do not use consistency regularization. Consistency regularization, such as that used in Temporal Ensembling [5] and Mean Teacher [6], encourages the networks to learn representations invariant to realistic perturbations of the data. NRM alone outperforms most SOTA methods that do not use consistency regularization [7, 8] in most settings. The MaxMin cross-entropy then improves NRM's semi-supervised learning results significantly. When combining NRM, the MaxMin cross-entropy, and Mean Teacher, we achieve results at or very close to the SOTA on CIFAR10, CIFAR100, and SVHN (see Tables 3, 4, and 5). Interestingly, compared to the other competitors, our method is consistently good, achieving either the best or second-best results in all experiments. Furthermore, the MaxMin cross-entropy also helps supervised learning: using it, we achieve a SOTA result for supervised learning on CIFAR10 (2.30% test error). Similarly, the MaxMin cross-entropy helps improve supervised training on ImageNet.
Despite good classification results, there is a caveat that NRM may not generate good-looking images, since that objective is not “baked” into its training. NRM is primarily aimed at improving semi-supervised and supervised learning through better regularization. Potentially, an adversarial loss could be added to NRM to improve the visual quality of the generated images, but that is beyond the scope of this paper.
Notation: To facilitate the presentation, the NRM's notation is explained in Table 2. Throughout this paper, for any vector v, we denote by ||v|| and v^T the Euclidean norm and the transpose of v, respectively. Additionally, for any matrices A and B of the same dimension, A ⊙ B denotes the Hadamard product of A and B. For any vectors u and v of the same dimension, ⟨u, v⟩ denotes the dot product of u and v.
Table 2: Notation used in NRM.
Variables: the input image; the object category y; the latent variables z^(l) in layer l; the switching (template-selecting) latent variable at each pixel location in layer l; the local translation latent variable at each pixel location in layer l; the intermediate rendered image in layer l; the rendered image produced by NRM before adding noise; and the corresponding feature maps in layer l of the CNN.
Parameters: the template of class y, i.e., the coarsest image, determined by the category at the top of NRM before any fine detail is added (learned from the data); the rendering matrix at layer l; the dictionary of rendering templates at layer l (learned from the data); the corresponding weights at layer l in the CNN; the set of zero-padding matrices at layer l; the set of local translation matrices at layer l, chosen according to the value of the translation latent variable; the parameters of the conjugate prior at layer l, which become the bias terms after convolutions in CNNs and can be made independent of pixel location, equivalent to using the same bias in each feature map of the CNN (learned from the data); the prior probability of object category y; and the pixel noise variance.
Other notations: RPN, the Rendering Path Normalization; and (y, z^(L), …, z^(1)), a rendering configuration.
2 Related Work
Deep Generative Models:
In addition to GANs, other recently developed deep generative models include Variational Autoencoders (VAEs) [9] and Deep Generative Networks [10]. Unlike these models, which replace complicated or intractable inference with CNNs, NRM derives CNNs as its inference. This advantage allows us to develop better learning algorithms for CNNs with statistical guarantees, as discussed in Sections 3.3 and 4. Recent works, including Bidirectional GANs [11] and the Adversarially Learned Inference model [8], try to make the discriminators and generators in GANs inverses of each other, thereby providing an alternative way to invert CNNs. These approaches, nevertheless, still employ a separate network to bypass the irreversible operators in CNNs. Furthermore, flow-based generative models such as NICE [12], Real NVP [13], and Glow [14] are invertible. However, the inference algorithms of these models, although exact, do not match the CNN architecture. NRM is also close in spirit to the Deep Rendering Model (DRM) [15] but markedly different. Compared to NRM, DRM has several limitations. In particular, all the latent variables in DRM are assumed to be independent, which is rather unrealistic. This lack of dependency accounts for the missing bias terms in the ReLUs of the CNN derived from DRM. Furthermore, the cross-entropy loss used in training CNNs for supervised learning tasks is not captured naturally by DRM. Due to these limitations, model consistency and generalization bounds were not derived for DRM.
Semi-Supervised Learning:
In addition to the deep generative model approach, consistency regularization methods, such as Temporal Ensembling [5] and Mean Teacher [6], have recently been developed for semi-supervised learning and achieved state-of-the-art results. These methods encourage the baseline network to learn invariant representations of the data under different realistic perturbations. Consistency regularization approaches are complementary to, and can be applied to, most deep generative models, including NRM, to further increase the baseline model's performance on semi-supervised learning tasks. Experiments in Section 6 demonstrate that NRM achieves better test accuracy on CIFAR10 and CIFAR100 when combined with Mean Teacher.
Explaining Architecture of CNNs:
The architectures and training losses of CNNs have been studied from other perspectives. [16, 17] employ principles from information theory, such as the Information Bottleneck Lagrangian introduced by [18], to show that stacking layers encourages CNNs to learn representations invariant to latent variations. They also study the cross-entropy loss to understand possible causes of overfitting in CNNs and suggest a new regularization term, related to the amount of information about the labels memorized in the weights, that helps trained CNNs generalize better. Additionally, [19] suggests a connection between CNNs and convolutional sparse coding (CSC) [20, 21, 22, 23]. They propose a multi-layer CSC (ML-CSC) model and prove that CNNs are the thresholding pursuit serving the ML-CSC model. This thresholding pursuit framework implies alternatives to CNNs, related to deconvolutional and recurrent networks. The architecture of CNNs has also been investigated using the wavelet scattering transform. In particular, scattering operators help elucidate different properties of CNNs, including how image sparsity and geometry are captured by the networks [24, 25]. In addition, CNNs have been studied from the optimization perspective [26, 27, 28, 29], the statistical learning theory perspective [30, 31], the approximation theory perspective [32], and other approaches [33, 34].
Like these works, NRM helps explain different components of CNNs. It employs tools and methods from probabilistic inference to interpret CNNs from a probabilistic perspective. That said, NRM can potentially be combined with the aforementioned approaches to gain a better understanding of CNNs.
3 The Neural Rendering Model
We first define the Neural Rendering Model (NRM). Then we discuss inference in NRM. Finally, we derive different learning losses from NRM, including the cross-entropy loss, the reconstruction loss, and the RPN regularization, for supervised, unsupervised, and semi-supervised learning.
3.1 Generative Model
NRM attempts to invert CNNs as its inference so that the information in the posterior can be used to inform the generation process of the model. NRM realizes this inversion by employing the structure of its latent variables. Furthermore, the joint prior distribution of the latent variables in the model is parametrized so that it is the conjugate prior to the likelihood of the model. This conjugate prior is a function of the intermediate rendered images in NRM and implicitly captures the dependencies among the latent variables. More precisely, NRM can be defined as follows:
Definition 3.1 (Neural Rendering Model (NRM)).
NRM is a deep generative model in which the latent variables at different layers are dependent. Let the input image and the target variable (e.g., the object category) be given. Generation in NRM takes the form:
(1)  
(2)  
where:
(3)  
(4) 
The generation process in NRM can be summarized in the following steps:

Given the class label, NRM first samples the latent variables from their categorical prior distribution.

Starting from the class label at the top layer of the model, NRM renders its coarsest image, which is also the object template of the class.

At each layer, a set of latent variations is incorporated via a linear transformation to render the finer image. The same process is repeated at each subsequent layer to finally render the finest image at the bottom of NRM.

Gaussian pixel noise is added to render the final observed image.
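The coarse-to-fine generation steps above can be sketched as follows. The matrices, sizes, and noise level are hypothetical placeholders: each matrix A stands in for the latent-variable-dependent rendering matrix of Eqn. 2, and the 2-pixel class template is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(class_template, layer_matrices, noise_std=0.1):
    # Start from the coarsest image (the class template), then apply
    # one linear rendering map per layer, coarse to fine.
    h = class_template
    for A in layer_matrices:  # A stands in for the latent-dependent A(z)
        h = A @ h             # each layer increases the resolution
    # Finally, add Gaussian pixel noise to produce the observed image.
    return h + noise_std * rng.normal(size=h.shape)

# Hypothetical sizes: a 2-pixel class template rendered up to 8 pixels.
mu_y = np.ones(2)
mats = [np.ones((4, 2)), np.ones((8, 4))]
image = render(mu_y, mats)
```

Note how the dimension grows at each layer, mirroring how spatial pooling shrinks it during bottom-up inference.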
In the generation process above, the rendering matrices can be any linear transformations, and the latent variables, in their most generic form, can capture any latent variation in the data. While such a generative model can represent any possible imagery data, it cannot be learned to capture the properties of natural images in reasonable time due to its huge number of degrees of freedom. Therefore, it is necessary to impose more structure on NRM given our prior knowledge of natural images. One such source of prior knowledge is classification models. In particular, since classification models like CNNs have achieved excellent performance on a wide range of object classification tasks, we hypothesize that a generative model whose inference yields CNNs will also be a good model for natural images. As a result, we would like to introduce new structures into NRM so that CNNs can be derived as NRM's inference. In other words, we use the posterior, i.e., CNNs, to inform the likelihood in designing NRM.
In our attempt to invert CNNs, we constrain the latent variables at each layer in NRM to a set of template-selecting latent variables and local translation latent variables. As shown later in Section 3.2, during inference in NRM, the ReLU nonlinearity at each layer “inverts” the template-selecting variables to find whether particular features are in the image or not. Similarly, the MaxPool operator “inverts” the translation variables to locate where particular features, if they exist, are in the image. Both sets of latent variables are vectors indexed by pixel location.
The rendering matrix at each layer is now a function of the template-selecting and translation latent variables, and the rendering process from layer to layer is described in the following equation:
(5) 
Even though the rendering equation above seems complicated at first, it is quite intuitive, as illustrated in Figure 3. At each pixel in the intermediate image, NRM decides whether to use that pixel to render or not according to the value of the template-selecting latent variable at that pixel location: if it is on, NRM renders; otherwise, it does not. If rendering, the pixel value is used to scale the rendering template. This rendering template is local: it has the same number of feature maps as the next rendered image but a smaller spatial size. As a result, the rendered template corresponds to a local patch in the finer image. Next, the padding matrix pads the resultant patch to the size of the image with zeros, and the translation matrix translates the result to a local location. NRM then keeps rendering at the other pixel locations following the same process. All rendered patches are added to form the final rendered image at the layer below.
Note that in NRM, there is one rendering template at each pixel location of the intermediate image. This would require far too many rendering templates to learn from a reasonable amount of data, considering all layers in NRM. Therefore, we further constrain NRM by enforcing that all pixels in the same feature map share the same rendering template. In other words, the templates are identical for pixels in the same feature map. This constraint yields convolutions in CNNs during the inference of NRM, and the rendering templates in NRM now correspond to the convolution filters in CNNs.
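A minimal one-dimensional sketch of this local rendering step, with a single shared template per layer, may help. The thresholding rule used for the switching variable (render iff the coarse pixel value is positive) and the fixed stride are simplifying assumptions, not the paper's exact parametrization.

```python
import numpy as np

def render_layer(h_coarse, template, stride=2):
    # One layer of local rendering (cf. Eqn. 5), in 1-D for clarity.
    fine = np.zeros(len(h_coarse) * stride)
    for i, v in enumerate(h_coarse):
        # Switching latent variable: render this pixel only if selected.
        # (Thresholding on the pixel value is a simplifying assumption.)
        if v > 0:
            patch = v * template        # pixel value scales the shared template
            start = i * stride          # translation latent variable places it
            fine[start:start + len(patch)] += patch  # zero-padded sum
    return fine
```

Sharing one template across all coarse pixels is exactly the weight sharing that becomes a convolution when this operation is transposed during inference.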
While the template-selecting and translation latent variables could be left independent, we further constrain the model by enforcing dependency among them at different layers in NRM. This constraint is motivated by realistic rendering of natural objects: different parts of a natural object are dependent on each other. For example, in an image of a person, the locations of the eyes are restricted by the location of the head; likewise, if the face is not painted, then it is likely that we cannot find the eyes in the image either. Thus, NRM tries to capture such dependency in natural objects by imposing more structure on the joint prior of the latent variables at all layers of the model. In particular, the joint prior is given by Eqn. 1. The form of the joint prior might look mysterious at first, but NRM parametrizes it in this particular way so that it is the conjugate prior of the model likelihood, as proven in Appendix C.4. Specifically, in order to derive the conjugate prior, we would like the log conditional distribution to have the same piecewise-linear form as the CNNs that compute the posterior. This design criterion results in each term of the joint prior in Eqn. 1. The conjugate form of the prior allows efficient inference in NRM. Note that the parameters of the conjugate prior, due to its form, become the bias terms after convolutions in CNNs during inference, as will be shown in Theorem 3.2. Furthermore, when training in an unsupervised setup, the conjugate prior results in the RPN regularization, as shown in Theorem 3.3(b). This RPN regularization helps enforce the dependencies among latent variables in the model and increases the likelihood of the latent configurations present in the data during training. For the sake of clarity, we summarize all the notations used in NRM in Table 2.
We summarize the NRM's rendering process in Algorithm 1 below. Reconstructed images at each layer of a five-layer NRM trained on MNIST are visualized in Figure 4. NRM reconstructs the images in two steps. First, the bottom-up E-step inference in NRM, which has a CNN form, keeps track of the optimal latent variables from the input image. Second, in the top-down E-step reconstruction, NRM uses those latent variables to render the reconstructed image according to Eqns. 2 and 5. The network is trained using the semi-supervised learning framework discussed in Section 3.3. The reconstructed images show that NRM renders images from coarse to fine. Early layers in the model, such as layers 4 and 3, capture coarse-scale features of the image, while later layers, such as layers 2 and 1, capture finer-scale features. Starting from layer 2, we begin to see the gist of the rendered digits, which become clearer at layer 1. Note that the images at layer 4 represent the class templates in NRM, which are also the softmax weights in the CNN.
Figure 4: Reconstructed images at layers 4, 3, 2, 1, and 0 of the NRM, alongside the original inputs.
In this paper, in order to simplify the notation, we only model the class-conditional likelihood. An extension that also models the class prior can be achieved by adding the corresponding term inside the Softmax operator in Eqn. 1, and all theorems and proofs extend easily to that case. Furthermore, to facilitate the discussion, we will use single symbols to denote the sets of all rendering templates, zero-padding matrices, and local translation matrices at each layer, and we will use the corresponding notations interchangeably.
NRM with skip connections: In Section 3.2 below, we will show that CNNs can be derived as joint maximum a posteriori (JMAP) inference in NRM. By introducing skip connections into the structure of the rendering matrices, we can also derive convolutional neural networks with skip connections, including ResNet and DenseNet. The detailed derivations of ResNet and DenseNet can be found in Appendix B.7.
3.2 Inference
We would like to show that inference in NRM has the CNN form (see Figure 2) and is therefore still tractable and efficient. This correspondence between NRM and CNNs helps us achieve two important goals when developing a new generative model. First, the desired generative model is complex enough, with the rich structure needed to capture the great diversity of forms and appearances of the objects surrounding us. Second, inference in the model is fast and efficient, so that the model can be learned in reasonable time. These advantages justify our modeling choice of using classification models like CNNs to inform our design of generative models like NRM.
The following theorem establishes the aforementioned correspondence, showing that the JMAP inference of the optimal latent variables in NRM is indeed a CNN:
Theorem 3.2.
Let the set of all parameters in NRM be given. The JMAP inference of the latent variables z in NRM is the feedforward step in CNNs. Particularly, we have:
(6) 
where the right-hand side is computed recursively. In particular:
(7) 
The equality in Eqn. 6 holds when the parameters in NRM satisfy the nonnegativity assumption that the intermediate rendered images are nonnegative.
There are four takeaways from the results in Theorem 3.2:

ReLU nonlinearities in CNNs find the optimal values for the template-selecting latent variables at each layer in NRM, detecting whether particular features exist in the image or not.

MaxPool operators in CNNs find the optimal value for the local translation latent variables at each layer in NRM, locating where particular features are rendered in the image.

Bias terms after each convolution in CNNs are from the prior distribution of latent variables in the model. Those bias terms update the posterior estimation of latent variables from data using the knowledge encoded in the prior distribution of those latent variables.

Convolutions in CNNs result from reversing the local rendering operators, which use the rendering templates, in NRM. Instead of rendering as in NRM, convolutions in CNNs perform template matching. In particular, it can be shown that the convolution weights in CNNs are the transposes of the rendering templates in NRM.
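The four takeaways above can be condensed into a small sketch of the bottom-up inference pass. A matrix product stands in for convolution, the maps are one-dimensional, and the pool size of 2 is illustrative only; the structure (linear map plus prior-derived bias, then ReLU, then MaxPool) is what Theorem 3.2 derives.

```python
import numpy as np

def infer(x, weights, biases):
    # Bottom-up JMAP inference sketch (cf. Theorem 3.2), 1-D maps.
    h = x
    for W, b in zip(weights, biases):
        a = W @ h + b                       # "convolution" plus prior-derived bias
        a = np.maximum(a, 0.0)              # ReLU: optimal switching variables
        h = a.reshape(-1, 2).max(axis=1)    # MaxPool (size 2): optimal translations
    return h
```

Each ReLU and MaxPool here is a max-marginalization over one family of latent variables, which is why the forward pass recovers the optimal latent configuration.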
Table 1 summarizes the correspondences between NRM and CNNs. The proofs for these correspondences are postponed to Appendix C. The nonnegativity assumption on the intermediate rendered images allows us to apply max-product message passing and push the maximization over latent variables through the product of rendering matrices. Thus, given this assumption, the equality in Eqn. 6 holds. In Eqn. 7, we have removed the generative constraints inherited from NRM to derive the weights in CNNs, which are free parameters. As a result, when faced with training data that violates NRM's underlying assumptions, CNNs have more freedom to compensate. We refer to this process as a discriminative relaxation of a generative classifier [35, 36]. Finally, the dot product with the object template in Eqn. 6 corresponds to the fully connected layer before the Softmax nonlinearity is applied in CNNs.
Given Theorem 3.2 and the four takeaways above, NRM has successfully reverse-engineered CNNs. However, the impact of Theorem 3.2 goes beyond a reverse-engineering effort. First, it provides probabilistic semantics for components of CNNs, justifying their usage and providing an opportunity to employ probabilistic inference methods in the context of CNNs. In particular, convolution operators in CNNs can be seen as factor nodes in the factor graph associated with NRM. Similarly, activations from the convolutions in CNNs correspond to bottom-up messages in that factor graph. The bias terms added to the activations in CNNs, which come from the joint prior distribution of the latent variables, are equivalent to top-down messages from the top layers of NRM. These top-down messages have receptive fields covering the whole image and are used to update the bottom-up messages, which are estimated from local information with smaller receptive fields. Finally, ReLU nonlinearities and MaxPool operators in CNNs are max-marginalization operators over the template-selecting and local translation latent variables in NRM, respectively. These max-marginalization operators come from the max-product message passing used to infer the latent variables in NRM.
Second, Theorem 3.2 provides a flexible framework to design CNNs. Instead of directly engineering CNNs for new tasks and datasets, we can modify NRM to incorporate our knowledge of the tasks and datasets into the model and then perform JMAP inference to obtain a new CNN architecture. For example, in Theorem 3.2, we show how ReLU can be derived from max-marginalization over the template-selecting latent variables; by changing the distribution of these variables, we can derive Leaky ReLU. Furthermore, batch normalization in CNNs can be derived from NRM by normalizing the intermediate rendered images at each layer. Also, as mentioned above, by introducing skip connections into the rendering matrices, we can derive ResNet and DenseNet. Details of these derivations can be found in Appendices B and C.
3.3 Learning
NRM learns from both labeled and unlabeled data. Learning in NRM can be posed as likelihood estimation problems that optimize the conditional log-likelihood and the expected complete-data log-likelihood for supervised and unsupervised learning, respectively. Interestingly, the cross-entropy loss used in training CNNs with labeled data is an upper bound of the NRM's negative conditional log-likelihood. NRM solves these likelihood optimization problems via the Expectation-Maximization (EM) approach. In the E-step, inference in NRM finds the optimal latent variables; this inference has the form of a CNN, as shown in Theorem 3.2. In the M-step, given the optimal latent variables, NRM maximizes the corresponding likelihood objective functions or their lower bounds, as in the case of the cross-entropy loss. There is no closed-form M-step update for deep models like NRM, so NRM employs generalized EM instead [37, 38]. In generalized EM, the M-step seeks to increase the value of the likelihood objective function instead of maximizing it. In particular, in the M-step, NRM uses gradient-based methods such as Stochastic Gradient Descent (SGD) [39, 40] to update its parameters. The following theorem derives the learning objectives for NRM in both supervised and unsupervised settings.
Theorem 3.3.
Denote .
For i.i.d. samples drawn from the NRM, assume that the final rendered template is normalized so that its norm is constant. Then the following holds:
(a) Cross-entropy loss for supervised training of CNNs with labeled data:
(8) 
where the posterior is estimated by the CNN, and the cross-entropy is taken between this estimate and the true posterior given by the ground-truth labels.
(b) Reconstruction loss with RPN for unsupervised training of CNNs with both labeled and unlabeled data:
(9) 
where the latent variables are estimated by the CNN as described in Theorem 3.2, the reconstructed image is rendered from them, and the RPN regularization is the negative log prior, defined as follows:
(10) 
Cross-Entropy Loss for Training Convolutional Neural Networks with Labeled Data:
Part (a) of Theorem 3.3 establishes the cross-entropy loss in the context of CNNs as an upper bound of the NRM's negative conditional log-likelihood. Unlike other derivations of the cross-entropy loss via logistic regression, Theorem 3.3(a) derives the cross-entropy loss in conjunction with the architecture of CNNs, since the estimation of the optimal latent variables is part of the optimization in Eqn. 8. In other words, Theorem 3.3(a) ties feature extraction and learning for classification in CNNs into an end-to-end conditional likelihood estimation problem in the NRM. This new interpretation of the cross-entropy loss suggests an interesting direction in which better losses for training CNNs with labeled data on supervised classification tasks can be derived from tighter upper bounds on the negative conditional log-likelihood; the MaxMin cross-entropy in Section 5 is one example. Note that the assumption that the rendered image has constant norm is solely for ease of presentation. Later, in Appendix B, we extend the result of Theorem 3.3(a) to the setting in which the norm of the rendered image is merely bounded.
To estimate how tight the cross-entropy upper bound is, we also prove a lower bound on the negative conditional log-likelihood. The gap between this lower bound and the cross-entropy upper bound indicates the quality of the estimation in Theorem 3.3(a). In particular, this gap is given by:
(11) 
where the constants are as defined in Appendix B. More details can be found in Appendix B, while the detailed proof is deferred to Appendix C.
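Numerically, the relationship between the exact negative conditional log-likelihood (a log-sum-exp over rendering paths) and the CNN's cross-entropy (computed after max-marginalizing the paths) can be checked on toy scores. The elementary inequality NLL ≤ CE + log(#paths) used below is a crude version of the bounds above; the path scores here are hypothetical and randomly drawn, not the paper's quantities.

```python
import numpy as np

def logsumexp(a, axis=None):
    """Numerically stable log-sum-exp."""
    m = a.max(axis=axis, keepdims=True)
    out = m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis) if axis is not None else out.item()

rng = np.random.default_rng(0)
num_classes, num_paths = 3, 5
scores = rng.normal(size=(num_classes, num_paths))   # hypothetical a(y, z)

# Exact: marginalize the latent path z with a log-sum-exp.
log_marginal = logsumexp(scores, axis=1)             # log sum_z e^{a(y,z)}
nll = -(log_marginal[0] - logsumexp(log_marginal))   # exact -log p(y=0 | x)

# CNN: max-marginalize z (the MaxPool/ReLU inference), then softmax.
m = scores.max(axis=1)
ce = -(m[0] - logsumexp(m))                          # cross-entropy term
```

Since each marginal dominates its max term while the full normalizer is at most #paths times the max-marginalized one, the exact NLL never exceeds the cross-entropy by more than log(#paths).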
Reconstruction Loss with the Rendering Path Normalization (RPN) Regularization for Unsupervised Learning with Both Labeled and Unlabeled Data:
Part (b) of Theorem 3.3 shows that the NRM learns from both labeled and unlabeled data by maximizing its expected complete-data log-likelihood, which is the sum of a reconstruction loss and the RPN regularization. Deriving the E-step and M-step of generalized EM in the general noise setting is rather involved; for simplicity, we therefore focus on the setting in which the noise level goes to 0. In that setting, in the M-step, the NRM minimizes the objective function with respect to the parameters of the model. The first term in this objective function is the reconstruction loss between the input image and the reconstructed template. The second term is the Rendering Path Normalization (RPN) defined in Eqn. 10. RPN encourages the latent variables inferred in the bottom-up E-step to have higher prior probability among all possible configurations. Due to the parametric form of the prior in Eqn. 1, RPN also enforces the dependencies among latent variables at different layers in the NRM. An approximation to this RPN regularization is discussed in Appendix B.2.
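A small sketch of this unsupervised objective, with hypothetical unnormalized log-priors over latent configurations standing in for the NRM's conjugate prior (function names and the weight `lam` are ours):

```python
import numpy as np

def rpn(path_log_prior, chosen_path):
    """Rendering Path Normalization: negative log prior of the rendering
    path chosen by the bottom-up E-step. `path_log_prior` holds
    hypothetical unnormalized log-priors over all latent configurations."""
    m = path_log_prior.max()
    log_z = m + np.log(np.exp(path_log_prior - m).sum())   # normalizer
    return -(path_log_prior[chosen_path] - log_z)

def unsupervised_loss(x, x_recon, path_log_prior, chosen_path, lam=0.1):
    """Reconstruction loss plus the RPN regularizer, in the spirit of
    Eqn. 9 (lam is an illustrative weight)."""
    return ((x - x_recon) ** 2).sum() + lam * rpn(path_log_prior, chosen_path)
```

Because the regularizer is a negative log-probability, it is nonnegative, and paths with higher prior incur a smaller penalty, which is exactly the pressure RPN exerts on the inferred latents.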
Semi-Supervised Learning with the Latent-Dependent Deep Rendering Model:
The NRM learns from both labeled and unlabeled data simultaneously by maximizing a weighted combination of the cross-entropy loss for supervised learning and the reconstruction loss with RPN regularization for unsupervised learning, as in Theorem 3.3. We now formulate the semi-supervised learning problem in the NRM. In particular, given i.i.d. samples from the NRM for which the labels of some samples are unknown, the NRM uses the following model to determine the optimal parameters for the semi-supervised classification task:
(12) 
where the two coefficients are nonnegative weights associated with the reconstruction loss/RPN regularization and the cross-entropy loss, respectively. Again, the optimal latent variables are inferred in the E-step as in Theorem 3.2. For unlabeled data, the cross-entropy term is dropped, i.e., its weight is set to zero.
In summary, combining Theorems 3.2 and 3.3(a), the NRM allows us to derive CNNs with convolutional layers, the ReLU nonlinearity, and MaxPool layers; these CNNs optimize the cross-entropy loss for supervised classification tasks with labeled data. Combining Theorems 3.2 and 3.3(b), the NRM extends traditional CNNs to unsupervised learning tasks in which the networks optimize the reconstruction loss with the RPN regularization. The NRM performs semi-supervised learning by optimizing the weighted combination of the losses in Theorems 3.3(a) and 3.3(b); inference in the semi-supervised setup still follows Theorem 3.2. The NRM can also be extended to explain other variants of CNNs, including ResNet and DenseNet, as well as other components of CNNs such as the Leaky ReLU and batch normalization.
4 Statistical Guarantees for the Neural Rendering Model in the Supervised Setting
We provide statistical guarantees for the NRM to establish that it is statistically well defined. First, we prove that the NRM is consistent in the supervised learning setup. Second, we provide a generalization bound for the NRM that is proportional to the ratio of the number of active rendering paths to the total number of rendering paths in the trained NRM. A rendering path is a configuration of all latent variables in the NRM, as defined in Table 2, and active rendering paths are those whose corresponding rendered image is sufficiently close to one of the data points from the input data distribution. Our key results are summarized below; more details and proofs are deferred to Appendices B and C.
Governed by the connection between the cross-entropy and the posterior class probabilities under the NRM, for the supervised setting with i.i.d. data drawn from the data distribution, we use the following model to determine the optimal parameters for the classification task:
(13) 
where the coefficients are nonnegative weights associated with the reconstruction loss and the cross-entropy loss, respectively. Here, the approximate posterior is chosen according to Theorem 3.3(a) in the zero-noise regime. The optimal solutions of objective function (13) induce a corresponding set of optimal (active) rendering paths that play a central role in understanding the generalization bound for the classification task.
Before proceeding to the generalization bound, we first state an informal result on the consistency of the optimal solutions of (13) as the sample size goes to infinity.
Theorem 4.1.
(Informal) Under appropriate conditions on the parameter spaces, the optimal solutions of objective function (13) converge almost surely to those of the following population objective function
where
In Appendix C, we provide detailed formulations of Theorem 4.1 for supervised learning; the detailed proof of this result is presented in Appendix D. This statistical guarantee for the optimal solutions of (13) validates their use for the classification task. Given that the NRM is consistent in the supervised learning setup, the following theorem establishes a generalization bound for the model.
Theorem 4.2.
(Informal) Let the population and empirical losses of the NRM be defined on the data population and the training set, respectively. Under the margin-based loss, the generalization gap of the classification framework with optimal solutions from (13) is, with high probability, controlled by a term that depends on: the ratio of active optimal rendering paths among all optimal rendering paths, the total number of rendering paths, a lower bound on the prior probability of the labels, and the radius of the sphere on which the rendered images lie.
The detailed formulations of the above theorem are postponed to Appendix C. The dependence of the generalization bound on the number of active rendering paths helps justify our modeling assumptions: the NRM reduces the number of active rendering paths thanks to the dependencies among its latent variables, thereby tightening the bound. Nevertheless, the current generalization bound has a limitation: it involves the total number of rendering paths, which is usually large. This is mainly because our bound does not fully take into account the structure of CNNs, a limitation shared by other recent generalization bounds for CNNs. It would be interesting to explore whether the techniques of [41] and [42] can be employed to improve this term in our bound.
Extension to unsupervised and semisupervised settings
Apart from the statistical guarantees and generalization bound established for the supervised setting, we also provide careful theoretical studies, together with detailed proofs, of the corresponding results for the unsupervised and semi-supervised settings in Appendices B, C, D, and E.
5 New MaxMin Cross Entropy From The Neural Rendering Model
In this section, we explore a particular way to derive an alternative to the cross-entropy loss, inspired by the results in Theorem 3.3(a). In particular, denoting by the max and min branches the two posterior estimates described below, the new loss, which we call the MaxMin cross-entropy, is the weighted average of the cross-entropy losses computed from these two posteriors:
Here, the Max cross-entropy maximizes the correct target posterior, while the Min cross-entropy minimizes the incorrect target posteriors. Like the cross-entropy loss, the MaxMin cross-entropy can be shown to be an upper bound of the negative conditional log-likelihood of the NRM, and it admits the same generalization bound derived in Section 4. The MaxMin networks in Figure 5 realize this new loss. These networks have two CNN-like branches that share weights: the max branch estimates its posterior using ReLU and MaxPooling, and the min branch estimates its posterior using the negative ReLU, i.e., min(x, 0), and MinPooling. The MaxMin networks can be interpreted as a form of knowledge distillation, like the Born-Again networks [43] and the Mean Teacher networks. However, instead of a student network learning from a teacher network, in MaxMin networks two student networks, the Max and the Min networks, cooperate and learn from each other during training.
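A toy numpy sketch of the two branches and the combined loss follows. The Min term `-log(1 - p_min[y])` is one plausible instantiation of "minimizing the incorrect target posterior"; the paper's exact form, and the weight `alpha`, are not specified here, so treat these as illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def max_branch(pre):
    # ReLU followed by MaxPool over the latent (pooling) axis.
    return np.maximum(pre, 0.0).max(axis=-1)

def min_branch(pre):
    # Negative ReLU, min(x, 0), followed by MinPool over the same axis.
    return np.minimum(pre, 0.0).min(axis=-1)

def maxmin_cross_entropy(logits_max, logits_min, y, alpha=0.5):
    """Weighted average of the Max and Min cross-entropies: the Max term
    pushes up the true-class posterior of the max branch, while the Min
    term penalizes true-class probability mass in the min branch."""
    n = np.arange(len(y))
    p_max = softmax(logits_max)
    p_min = softmax(logits_min)
    ce_max = -np.log(p_max[n, y] + 1e-12).mean()
    ce_min = -np.log(1.0 - p_min[n, y] + 1e-12).mean()
    return alpha * ce_max + (1.0 - alpha) * ce_min
```

With shared weights, the two branches see the same pre-activations but summarize them oppositely, which is how the two "students" supply complementary training signals.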
6 Experiments
6.1 SemiSupervised Learning
We show that the NRM, armed with the MaxMin cross-entropy and the Mean Teacher regularizer, achieves state-of-the-art (SOTA) results on benchmark datasets. We discuss the experimental results for CIFAR-10 and CIFAR-100 here; the results for SVHN, the training losses, and the training details can be found in Appendices A and F.
CIFAR-10:
1K labels 50K images  2K labels 50K images  4K labels 50K images  50K labels 50K images  


Adversarial Learned Inference [8]  
Improved GAN [7]  
Ladder Network [44]  
Π model [5]  
Temporal Ensembling [5]  
Mean Teacher [6]  
VAT+EntMin [45]  
DRM [15, 46]  


Supervisedonly  
NRM without RPN  
NRM+RPN  
NRM+RPN+MaxMin  
NRM+RPN+MaxMin+Mean Teacher  

Table 3 shows that the NRM is comparable to SOTA methods. The NRM also outperforms the best methods that do not use consistency regularization, such as Improved GAN, the Ladder network, and ALI, when using only 2K and 4K labeled images, and it outperforms the DRM in all settings. Moreover, among the methods in our comparison, the NRM achieves the best test accuracy when using all available labeled data (50K). We hypothesize that the NRM has an advantage over consistency-regularization methods like Temporal Ensembling and Mean Teacher when there is enough labeled data because the consistency regularization in those methods tries to match activations in the network but does not take the available class labels into account. In contrast, the NRM employs the class labels, when available, in its reconstruction loss and RPN regularization, as in Eqns. 9 and 10. In all settings, the RPN regularizer improves NRM performance; even though the improvement from RPN is small, it is consistent across the experiments. Furthermore, using the MaxMin cross-entropy significantly reduces the test errors. When combined with Mean Teacher, our MaxMin NRM improves upon Mean Teacher and consistently achieves either SOTA or second-best results in all settings; this consistency in performance is observed only for our method and Mean Teacher. Also, as with Mean Teacher, the NRM can potentially be combined with other consistency-regularization methods, e.g., Virtual Adversarial Training (VAT) [45], to obtain better results.
CIFAR-100:
Table 4 shows that the NRM is comparable to the Π model and Temporal Ensembling and better than the DRM. As with CIFAR-10, using the RPN regularizer yields slightly better test accuracy, and the NRM achieves better results than the Π model and Temporal Ensembling when using all available labeled data. Notice that combining with Mean Teacher only slightly improves the NRM's performance when training with 10K labeled data; again, this is because consistency-regularization methods like Mean Teacher add little advantage when there is enough labeled data. Nevertheless, NRM+MaxMin still yields lower test errors and achieves SOTA results in all settings. Since combining with Mean Teacher does not help much here, we only show results for NRM+MaxMin.
6.2 Supervised Learning with MaxMin CrossEntropy
The MaxMin cross-entropy can be applied not only to improve semi-supervised learning but also to deep models, including CNNs, to enhance their supervised learning performance. In our experiments, we indeed observe that the MaxMin cross-entropy reduces the test error for supervised object classification on CIFAR-10. In particular, using the MaxMin cross-entropy loss on a 29-layer ResNeXt [47] trained with Shake-Shake regularization [48] and Cutout data augmentation [49], we achieve a SOTA test error of 2.30% on CIFAR-10, an improvement of 0.26% over the test error of the baseline architecture trained with the traditional cross-entropy loss. While a 0.26% improvement seems small, it is meaningful given that our baseline architecture (ResNeXt + Shake-Shake + Cutout) is the second-best model for supervised learning on CIFAR-10; such an improvement over an already very accurate model is significant in applications that demand high accuracy, such as self-driving cars or medical diagnostics. Similarly, we observe that MaxMin improves the top-5 test error of the Squeeze-and-Excitation ResNeXt-50 network [50] on ImageNet by 0.17% compared to the baseline (7.04% vs. 7.21%). For a fair comparison, we retrained the baseline models and report the scores from our reimplementation.
7 Discussion
We have presented the NRM, a general and effective framework for semi-supervised learning that combines generation and prediction in an end-to-end optimization. Using the NRM, we can explain the operations used in CNNs and develop new features that aid learning in CNNs. For example, we derived the new MaxMin cross-entropy loss for training CNNs, which outperforms the traditional cross-entropy.
In addition to the results discussed above, there are still many open problems related to NRM that have not been addressed in the paper. We give several examples below:

An adversarial loss, as in GANs, can be incorporated into the NRM so that the model generates realistic images. Furthermore, more knowledge of image generation from graphics and physics can be integrated into the NRM so that the model can exploit more structure to aid learning and generation.

The unsupervised and (semi-)supervised models that we consider throughout the paper assume that the noise level of the NRM goes to 0. Under this assumption, we are able to derive efficient inference algorithms as well as rigorous statistical guarantees for these models. When the noise level is not close to 0, inference with these models relies on the vanilla Expectation-Maximization (EM) algorithm for mixture models to obtain reliable estimators of the rendering templates. Since the parameters of interest are shared among different rendering templates and have high-dimensional structure, it is of practical interest to develop efficient EM algorithms that capture these properties of the parameters in this setting.

Thus far, the statistical guarantees for parameter estimation in the paper are established under the ideal assumption that globally optimal solutions are obtained. In practice, however, SGD-based inference algorithms for the unsupervised and (semi-)supervised models often converge to (bad) local minima. Consequently, investigating sufficient conditions under which these algorithms avoid being trapped at bad local minima is an important avenue for future work.

The NRM hinges on the assumption that the data are generated from a mixture of Gaussian distributions whose mean parameters characterize the complex rendering templates. In reality, however, the underlying distribution of each mixture component may not be Gaussian. Therefore, extending the current understanding of the NRM under Gaussian distributions to other choices of underlying distributions is an interesting direction to explore.
We would like to end the paper with the remark that the NRM is a flexible framework that enables us to introduce new components into the generative process, from which the corresponding features for CNNs can be derived in the inference. This hallmark of the NRM provides a more fundamental and systematic way to design and study CNNs.
8 Acknowledgements
First of all, we are very grateful to Amazon AI for providing a highly stimulating research environment for us to start this research project and further supporting our research through their cloud credits program. We would also like to express our sincere thanks to Gautam Dasarathy for great discussions. Furthermore, we would also like to thank Doris Y. Tsao for suggesting and providing references for connections between our model and feedforward and feedback connections in the brain.
Many people during Tan Nguyen’s internship at Amazon AI have helped by providing comments and suggestions on our work, including Stefano Soatto, Zack C. Lipton, YuXiang Wang, Kamyar Azizzadenesheli, Fanny Yang, Jean Kossaifi, Michael Tschannen, Ashish Khetan, and Jeremy Bernstein. We also wish to thank Sheng Zha who has provided immense help with MXNet framework to implement our models.
Finally, we would like to thank members of the DSP group at Rice, the Machine Learning group at UC Berkeley, and Anima Anandkumar's TensorLab at Caltech, who have always been supportive throughout the time it has taken to finish this project.
References
 [1] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, pp. 2672–2680, 2014.
 [2] R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: a functional interpretation of some extraclassical receptivefield effects,” Nature Neuroscience, vol. 2, pp. 79–87, 1999.
 [3] K. Friston, “Does predictive coding have a future?,” Nature Neuroscience, vol. 21, no. 8, pp. 1019–1021, 2018.
 [4] B. Neyshabur, R. R. Salakhutdinov, and N. Srebro, “Pathsgd: Pathnormalized optimization in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2422–2430, 2015.
 [5] S. Laine and T. Aila, “Temporal ensembling for semisupervised learning,” in International Conference on Learning Representations, 2017.
 [6] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weightaveraged consistency targets improve semisupervised deep learning results,” in Advances in Neural Information Processing Systems 30, pp. 1195–1204, 2017.
 [7] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems 29, pp. 2234–2242, 2016.
 [8] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially learned inference,” in International Conference on Learning Representations, 2017.
 [9] D. P. Kingma and M. Welling, “Autoencoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
 [10] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, “Semisupervised learning with deep generative models,” in Advances in Neural Information Processing Systems 27, pp. 3581–3589, 2014.
 [11] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” in International Conference on Learning Representations, 2017.
 [12] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Nonlinear independent components estimation,” in International Conference on Learning Representations Workshop, 2015.
 [13] L. Dinh, J. SohlDickstein, and S. Bengio, “Density estimation using real nvp,” in International Conference on Learning Representations, 2017.
 [14] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Advances in Neural Information Processing Systems, 2018.
 [15] A. B. Patel, M. T. Nguyen, and R. Baraniuk, “A probabilistic framework for deep learning,” in Advances in Neural Information Processing Systems 29, pp. 2558–2566, 2016.
 [16] A. Achille and S. Soatto, “Emergence of invariance and disentanglement in deep representations,” The Journal of Machine Learning Research, vol. 19, pp. 1–34, 2018.
 [17] A. Achille and S. Soatto, “Information dropout: Learning optimal representations through noisy computation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [18] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Annual Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.
 [19] V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 2887–2938, 2017.
 [20] H. Bristow, A. Eriksson, and S. Lucey, “Fast convolutional sparse coding,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 391–398, IEEE, 2013.
 [21] B. Wohlberg, “Efficient convolutional sparse coding,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7173–7177, IEEE, 2014.
 [22] F. Heide, W. Heidrich, and G. Wetzstein, “Fast and flexible convolutional sparse coding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5135–5143, 2015.
 [23] V. Papyan, J. Sulam, and M. Elad, “Working locally thinking globally: Theoretical guarantees for convolutional sparse coding,” IEEE Transactions on Signal Processing, vol. 65, no. 21, pp. 5687–5701, 2017.
 [24] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
 [25] S. Mallat, “Understanding deep convolutional networks,” Phil. Trans. R. Soc. A, vol. 374, no. 2065, p. 20150203, 2016.
 [26] S. Arora, N. Cohen, and E. Hazan, “On the optimization of deep networks: Implicit acceleration by overparameterization,” arXiv preprint arXiv:1802.06509, 2018.
 [27] C. D. Freeman and J. Bruna, “Topology and geometry of halfrectified network optimization,” in International Conference on Learning Representations, 2017.
 [28] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in Artificial Intelligence and Statistics, pp. 192–204, 2015.
 [29] K. Kawaguchi, “Deep learning without poor local minima,” in Advances in Neural Information Processing Systems 29, pp. 586–594, 2016.
 [30] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, “Generalization in deep learning,” arXiv preprint arXiv:1710.05468, 2017.
 [31] B. Neyshabur, S. Bhojanapalli, D. Mcallester, and N. Srebro, “Exploring generalization in deep learning,” in Advances in Neural Information Processing Systems 30, pp. 5947–5956, 2017.
 [32] R. Balestriero and R. G. Baraniuk, “A spline theory of deep networks,” in Proceedings of the 34th International Conference on Machine Learning, 2018.
 [33] R. Vidal, J. Bruna, R. Giryes, and S. Soatto, “Mathematics of deep learning,” arXiv preprint arXiv:1712.04741, 2017.
 [34] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning, pp. 1050–1059, 2016.
 [35] A. Y. Ng and M. I. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes,” in Advances in Neural Information Processing Systems, pp. 841–848, 2002.
 [36] J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, “Generative or discriminative? getting the best of both worlds,” Bayesian statistics, vol. 8, no. 3, pp. 3–24, 2007.
 [37] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 39, pp. 1–38, 1977.
 [38] C. M. Bishop, Pattern recognition and machine learning (information science and statistics). Berlin, Heidelberg: SpringerVerlag, 2006.
 [39] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers, pp. 102–109, Springer, 1985.
 [40] J. Kiefer, J. Wolfowitz, et al., “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
 [41] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrallynormalized margin bounds for neural networks,” Advances in Neural Information Processing Systems (NIPS), 2017.
 [42] N. Golowich, A. Rakhlin, and O. Shamir, “Sizeindependent sample complexity of neural networks,” Proceedings of the Conference On Learning Theory (COLT), 2018.
 [43] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Bornagain neural networks,” Proceedings of the International Conference on Machine Learning (ICML), 2018.
 [44] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semisupervised learning with ladder networks,” in Advances in Neural Information Processing Systems 28, pp. 3546–3554, 2015.
 [45] T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama, “Virtual adversarial training: a regularization method for supervised and semisupervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [46] A. B. Patel, T. Nguyen, and R. G. Baraniuk, “A probabilistic theory of deep learning,” arXiv preprint arXiv:1504.00641, 2015.
 [47] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5987–5995, IEEE, 2017.
 [48] X. Gastaldi, “Shakeshake regularization of 3branch residual networks,” in International Conference on Learning Representations, 2017.
 [49] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
 [50] J. Hu, L. Shen, and G. Sun, “Squeezeandexcitation networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [51] A. Kumar, P. Sattigeri, and T. Fletcher, “Semisupervised learning with gans: Manifold invariance with improved inference,” in Advances in Neural Information Processing Systems 30, pp. 5534–5544, 2017.
 [52] T. Nguyen, W. Liu, E. Perez, R. G. Baraniuk, and A. B. Patel, “Semisupervised learning with the deep rendering mixture model,” arXiv preprint arXiv:1612.01942, 2016.
 [53] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
 [54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 [55] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [56] S. van de Geer, Empirical Processes in Mestimation. Cambridge University Press, 2000.
 [57] R. Vershynin, “Introduction to the nonasymptotic analysis of random matrices,” arXiv:1011.3027v7, 2011.
 [58] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability, Springer, 1996.
 [59] R. Dudley, “Central limit theorems for empirical measures,” Annals of Probability, vol. 6, 1978.
 [60] V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” Annals of Statistics, vol. 30, 2002.
 [61] A. W. van der Vaart and J. Wellner, Weak Convergence and Empirical Processes. New York, NY: SpringerVerlag, 1996.
 [62] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations, 2016.
 [63] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
Supplementary Material
Appendix A
This appendix contains the semi-supervised learning results of the NRM on SVHN, compared with other methods.
Semi-Supervised Learning Results on SVHN:
250 labels 73257 images  500 labels 73257 images  1000 labels 73257 images  73257 labels 73257 images  


ALI [8]  
Improved GAN [7]  
+ Jacob.reg + Tangents [51]  
Π model [5]  
Temporal Ensembling [5]  
Mean Teacher [6]  
VAT+EntMin [45]  
DRM [52]  


Supervisedonly  
NRM without RPN  
NRM+RPN  
NRM+RPN+MaxMin+Mean Teacher  

Appendix B
In this appendix, we provide further connections between the NRM and the cross-entropy, as well as additional derivations relating the NRM to various models in both the unsupervised and (semi-)supervised settings mentioned in the main text. We also formally present our results on consistency and generalization bounds for the NRM in the supervised and semi-supervised learning settings. In addition, we explain how to extend the NRM to derive ResNet and DenseNet. For simplicity of presentation, we use a single symbol to represent all the parameters that we estimate in the NRM, indexed over all possible values of the latent (nuisance) variables. Additionally, for each label and latent configuration, we consider the subset of parameters corresponding to that specific label and latent variable. Furthermore, to stress the dependence of the rendered template on these parameters, we define the following function
for each configuration, where the masking matrix is associated with the corresponding latent variable. Throughout this supplement, we use the two notations interchangeably as long as the context is clear. Furthermore, we assume that the parameter spaces are appropriate subsets of Euclidean spaces for all choices of layer, label, and latent variable. We say that the parameters satisfy the nonnegativity assumption if the intermediate rendered images are nonnegative at every layer. Finally, we use the superscript ⊤ to denote the transpose of a matrix.
b.1 Connection between NRM and cross entropy
As established in part (a) of Theorem 3.3, the cross-entropy is an upper bound of the negative conditional log-likelihood. In the following full theorem, we establish both upper and lower bounds on the negative conditional log-likelihood in terms of the cross-entropy.
Theorem B.1.
Given any , we denote . For any and , let be i.i.d. samples from the NRM. Then, the following holds
(a) (Lower bound)
where for all , for all