Spatial Transformer Introspective Neural Network
Abstract
Natural images contain many variations such as illumination differences, affine transformations, and shape distortions. Correctly classifying images under these variations is a long-standing problem. The most commonly adopted solution is to build large-scale datasets that contain objects under different variations. However, this approach is not ideal since it is computationally expensive and it is hard to cover all variations in a single dataset. To address this difficulty, we propose the spatial transformer introspective neural network (STINN), which explicitly generates samples with affine transformation variations that are unseen in the training set. Experimental results indicate that STINN improves classification accuracy on several benchmark datasets, including MNIST, affNIST, SVHN and CIFAR10. We further extend our method to cross-dataset classification and few-shot learning problems to verify it under extreme conditions, and observe substantial improvements in the experimental results.
Yunhan Zhao* (yzhao83@jhu.edu) 1
Ye Tian* (ytian27@jhu.edu) 1
Wei Shen (wei.shen@t.shu.edu.cn) 2, 3
Alan Yuille (ayuille1@jhu.edu) 3
1 Laboratory for Computational Sensing and Robotics, Johns Hopkins University, Baltimore, USA
2 Key Laboratory of Specialty Fiber Optics and Optical Access Networks, Shanghai University, Shanghai, China
3 The Department of Computer Science, Johns Hopkins University, Baltimore, USA
1 Introduction
Classification has progressed rapidly with advances in convolutional neural networks (CNNs) [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] and the advent of large visual recognition datasets. CNNs are capable of learning complex features that are informative and discriminative [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Simonyan and Zisserman(2014), Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich, et al., He et al.(2016)He, Zhang, Ren, and Sun, Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten]. Even though CNNs outperform traditional machine learning algorithms, the learning process is cumbersome. CNNs generally require large training sets to learn high-quality features, and many networks still suffer from variations in the test data even after training on large numbers of samples. Moreover, it is impossible to find a dataset that spans the entire image space so that CNNs capture all possible features. Therefore, we turn our attention to finding an effective method to handle discrepancies between training data and test data [Elhamifar and Vidal(2011), Wang and Wang(2014), Gao et al.(2010)Gao, Tsang, and Chia].
Many works have been proposed to address this issue. One of the most common approaches is to adopt data augmentation techniques [Antoniou et al.(2017)Antoniou, Storkey, and Edwards, Polson et al.(2011)Polson, Scott, et al., Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] to enrich the variations in the training set. This is certainly more efficient than building a large training dataset, but it is still not optimal. Data augmentation applies random operations such as rotation, scaling and cropping to the input images before the training step. However, the number of possible variations is unlimited, and it is hard to find beneficial samples with data augmentation alone. It would be better if the model itself could generate variations unseen in the training set and use them to strengthen the classifier.
The image space is huge, so we concentrate on affine transformation variations of images in this work. Inspired by [Welling et al.(2003)Welling, Zemel, and Hinton, Tu(2007), Antoniou et al.(2017)Antoniou, Storkey, and Edwards], in which self-generated samples are utilized, as well as by the hard-example training strategy [Shrivastava et al.(2016)Shrivastava, Gupta, and Girshick, Wang et al.(2017)Wang, Shrivastava, and Gupta], we propose a novel method named spatial transformer introspective neural network (STINN). Our approach utilizes the advantages of generative models and boosts classification performance by generating novel variations that are not covered in the training set. Instead of generating with generative adversarial nets (GANs) [Goodfellow et al.(2014)Goodfellow, PougetAbadie, Mirza, Xu, WardeFarley, Ozair, Courville, and Bengio], we adopt introspective neural networks (INNs) [Tu(2007), Lazarow et al.(2017)Lazarow, Jin, and Tu]. INNs maintain a single CNN discriminator that is itself also a generator, while GANs have separate discriminators and generators. Moreover, INNs are easier to train than GANs with gradient descent algorithms because they avoid adversarial learning. To generate novel variations, we use spatial transformers [Jaderberg et al.(2015)Jaderberg, Simonyan, Zisserman, et al.] to learn new affine transformation parameters and then apply them to the input images. The spatial transformers and the classifiers act as adversaries: the spatial transformers try to produce unseen variations that are hard for the discriminators to classify, while the discriminators try to correctly classify both the original training images and the transformed images. Therefore, the newly generated images are determined by the classifiers. In our experiments, we show performance gains not only on classification problems but also on cross-dataset classification and few-shot learning problems.
2 Related Work
In recent years, a significant number of works have built strong classifiers with data augmentation techniques [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] that produce more variations by applying simple pixel-level operations to the training samples. The performance gain from this method has been validated by many state-of-the-art algorithms [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer, Simonyan and Zisserman(2014), Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich, et al., He et al.(2016)He, Zhang, Ren, and Sun, Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten]. However, this method is not ideal since the manually produced samples are not guaranteed to benefit the classifiers. Moreover, the possible pixel-level operations are specified before training, which further limits the range of samples that can be produced. It is more efficient to directly generate samples that are advantageous to the classifiers.
GANs [Goodfellow et al.(2014)Goodfellow, PougetAbadie, Mirza, Xu, WardeFarley, Ozair, Courville, and Bengio] have led a wave of research into generative adversarial structures. Combining this structure with deep convolutional networks produces models with strong generative ability. In GANs, generators and discriminators are trained simultaneously: generators try to generate fake images that fool the discriminators, while discriminators try to distinguish real from fake images. Many variants of GANs have emerged in the past three years, such as DCGAN [Radford et al.(2015)Radford, Metz, and Chintala], WGAN [Arjovsky et al.(2017)Arjovsky, Chintala, and Bottou] and WGAN-GP [Gulrajani et al.(2017)Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville]. These variants show stronger learning ability, which enables them to generate complex images. Techniques have been proposed to improve adversarial learning for image generation [Salimans et al.(2016)Salimans, Goodfellow, Zaremba, Cheung, Radford, and Chen, Gulrajani et al.(2017)Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville, Denton et al.(2015)Denton, Chintala, Fergus, et al.] as well as for training better image generative models [Radford et al.(2015)Radford, Metz, and Chintala, Isola et al.(2017)Isola, Zhu, Zhou, and Efros]. [Radford et al.(2015)Radford, Metz, and Chintala] also highlights that adversarial learning can improve image classification in a semi-supervised setting.
INNs [Tu(2007), Lazarow et al.(2017)Lazarow, Jin, and Tu, Jin et al.(2017)Jin, Lazarow, and Tu, Lee et al.(2018)Lee, Xu, Fan, and Tu] provide an alternative approach to generating samples. INNs are closely related to GANs since both have generative and discriminative abilities, but they differ in several ways. INNs keep a single model that is both discriminative and generative, while GANs have distinct generators and discriminators. INNs focus on introspective learning, which synthesizes samples from the model's own classifier. GANs, on the other hand, emphasize adversarial learning, which guides generators with separate discriminators.
Both GANs and INNs are designed to generate images that are similar to the input images. However, we want to generate images that differ from the existing training images while still remaining in the same category. This motivation leads us to explore spatial transformer networks (STNs) [Jaderberg et al.(2015)Jaderberg, Simonyan, Zisserman, et al.]. STNs first showed that affine transformation parameters can be learned with CNNs. STNs locate the region of interest in the original images and apply the computed affine transformation parameters to that region, which makes it possible to generate different variations.
3 Method
We now describe the details of our approach in this section. We first briefly review the introspective learning framework [Tu(2007)]. This is followed by a detailed mathematical explanation of our generative and discriminative steps. In particular, we focus on explaining how our model generates unseen examples that complement the training datasets.
3.1 Learning Framework: Introspective Learning
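Let $x$ be a data sample and $y \in \{-1, +1\}$ be its label, indicating either a negative or a positive sample. A discriminative classifier computes $p(y \mid x)$, the probability of $x$ being positive or negative, with $p(y=-1 \mid x) + p(y=+1 \mid x) = 1$. The primary goal is to learn $p(x \mid y=+1)$, which captures the underlying generation process of positive samples. For the binary classification case, the discriminative model can be written as:
$$p(y=+1 \mid x) = \frac{p(x \mid y=+1)\,p(y=+1)}{p(x \mid y=-1)\,p(y=-1) + p(x \mid y=+1)\,p(y=+1)} \tag{1}$$
This equation can be further simplified by assuming $p(y=+1) = p(y=-1)$,
$$p(x \mid y=+1) = \frac{p(y=+1 \mid x)}{1 - p(y=+1 \mid x)}\,p(x \mid y=-1) \tag{2}$$
The generative model $p(x \mid y=+1)$ is thus connected with the discriminative model $p(y=+1 \mid x)$. For notational simplicity, we denote $p(x \mid y=+1)$ as $p^+(x)$ and let $p_t^-(x)$ represent $p(x \mid y=-1)$ in the $t$-th iteration. It has been proven in [Tu(2007)] that $KL(p^+(x) \,\|\, p_{t+1}^-(x)) \le KL(p^+(x) \,\|\, p_t^-(x))$, where KL denotes the Kullback-Leibler divergence. Therefore, the negative distribution iteratively converges to the positive distribution under the following update equation
$$p_t^-(x) = \frac{1}{Z_t}\,\frac{q_t(y=+1 \mid x)}{q_t(y=-1 \mid x)}\,p_{t-1}^-(x) \tag{3}$$
where $Z_t$ is the normalizing factor, $p_0^-(x)$ represents the initial negative distribution, and $q_t(y \mid x)$ is the discriminative model learned by the classifier.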
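The reweighting update of Eqn. (3) can be illustrated on a small discrete distribution. The following is a minimal sketch, not the authors' implementation: it assumes equal class priors, so that an ideal classifier's ratio equals the ratio of positive to negative densities, and it introduces a hypothetical `strength` exponent below 1 to mimic an imperfectly trained classifier. The function names and the toy distributions are assumptions made for illustration.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def introspective_step(p_neg, p_pos, strength=0.5):
    """One reweighting step: new negative ∝ ratio**strength * old negative.

    With equal priors, the ideal classifier ratio q(+|x)/q(-|x) equals
    p_pos/p_neg. `strength` < 1 is a hypothetical stand-in for a classifier
    that has only partly learned this ratio; strength = 1 is the exact update.
    """
    ratio = p_pos / p_neg
    unnormalized = ratio ** strength * p_neg
    return unnormalized / unnormalized.sum()  # divide by the normalizing factor

# Toy positive distribution and a uniform initial negative distribution.
p_pos = np.array([0.70, 0.20, 0.05, 0.05])
p_neg = np.full(4, 0.25)

kls = [kl(p_pos, p_neg)]
for _ in range(10):
    p_neg = introspective_step(p_neg, p_pos)
    kls.append(kl(p_pos, p_neg))

print([round(v, 6) for v in kls])
```

In this toy setting the recorded divergences never increase from one iteration to the next, which is the behaviour the convergence result cited above describes.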
Several works extend this learning framework. [Lazarow et al.(2017)Lazarow, Jin, and Tu] adapts it to neural networks and shows that the discriminative model can be learned efficiently with a CNN classifier that applies the sigmoid function to the output of the last activation function for the binary label of the input, with classifier parameters that are updated in every iteration. The synthesis step can be done by standard backpropagation. [Jin et al.(2017)Jin, Lazarow, and Tu] extends this work to multi-class classification problems with CNNs and proves that CNN classifiers can learn multiple classes at the same time with the softmax loss function. In this case, the sigmoid function is replaced by the softmax function evaluated at the predicted class of the given sample. The Wasserstein loss is integrated by [Lee et al.(2018)Lee, Xu, Fan, and Tu] for synthesis and classification tasks.
3.2 STINN
In this section, we present our formulation, which builds upon the introspective learning framework of the previous section. Even large training datasets cannot fully cover the entire image space. Our goal is to explore the part of the image space that is not covered by the training set. As shown in Figure 1, we actively generate affine-transformed examples that help learn a discriminator robust to affine transformations. We keep the notation consistent with the previous section: the classifier here is the learned classifier in Eqn. (3), with its own model parameters in each iteration. The update rule in Eqn. (3) holds under the assumption of equal class priors, $p(y=+1) = p(y=-1)$; therefore, the numbers of positive and negative samples drawn in every step are always the same.
Classification steps The classification step can be viewed as training a normal classifier with positive samples drawn from the positive distribution and negative samples drawn from the current negative distribution. The objective function of the classification step is defined in Eqn. (4). The first part of the objective encourages the model to correctly classify positive images as well as transformed positive images, so that the classifier not only preserves the features learned from the original images but also captures more information from the transformed images. The second part of the objective maximizes the Wasserstein distance between transformed positive images and negative images in the feature space. The features used for classification and for calculating the Wasserstein distance are slightly different. Therefore, we introduce two functions that compute features at different levels of our network, one for the classification task and one for the Wasserstein distance. The objective function can be represented as
(4) 
where each term has its own loss function and weight, the affine transformation parameters are introduced in the next part, and the samples are drawn from the positive and negative distributions, respectively. The classification loss is computed against the ground-truth labels of the positive samples. The last term is the gradient penalty, which enables stable training with the Wasserstein loss function. As shown in Eqn. (4), this function is parameterized only by the classifier parameters. In other words, we update only the classifier in the classification step and keep the spatial transformer fixed, which ensures convergence of the training procedure. The decision boundary is expected to be reshaped into a more robust boundary that has high tolerance to affine transformations.
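As a rough illustration of how the terms described for the classification step could be combined, the sketch below adds a softmax cross-entropy on original and transformed positives, a Wasserstein-style critic difference, and a gradient penalty. It is a sketch under stated assumptions rather than the exact form of Eqn. (4): the function names, the argument names, the default weights, and the choice of a penalty of the form (gradient norm minus 1) squared are all hypothetical, and the critic scores and gradient norms are taken as precomputed inputs.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits has shape (N, C), labels has shape (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def classification_objective(logits_pos, logits_pos_transformed, labels,
                             critic_pos_transformed, critic_neg, grad_norms,
                             weights=(1.0, 1.0, 10.0)):
    """Hypothetical combination of the three parts described in the text."""
    w_cls, w_wass, w_gp = weights  # assumed weights, not values from the paper
    # Part 1: classify both the original and the transformed positives correctly.
    cls = (softmax_cross_entropy(logits_pos, labels)
           + softmax_cross_entropy(logits_pos_transformed, labels))
    # Part 2: minimizing this difference raises critic scores on transformed
    # positives and lowers them on negatives, widening their separation.
    wass = float(np.mean(critic_neg) - np.mean(critic_pos_transformed))
    # Part 3: gradient penalty on precomputed critic gradient norms (assumed form).
    gp = float(np.mean((np.asarray(grad_norms, dtype=float) - 1.0) ** 2))
    return w_cls * cls + w_wass * wass + w_gp * gp

# Tiny usage example with made-up numbers.
labels = np.array([0, 1])
value = classification_objective(
    logits_pos=np.zeros((2, 2)),
    logits_pos_transformed=np.zeros((2, 2)),
    labels=labels,
    critic_pos_transformed=np.array([1.0, 1.0]),
    critic_neg=np.array([0.0, 0.0]),
    grad_norms=np.array([1.0, 1.0]),
)
print(value)
```

Only the classifier would be updated with this objective; the transformed positives are treated as fixed inputs, matching the description that the spatial transformer is kept fixed during the classification step.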
Spatial transformer To actively expand the sample space, we adopt spatial transformers (STs) to generate novel samples. As suggested in [Jaderberg et al.(2015)Jaderberg, Simonyan, Zisserman, et al.], the affine transformation parameters can be learned by localization networks that take the form of CNNs. The data dependent affine transformations are predicted at the top layer of the localization networks. Moreover, the networks are differentiable, which means the network parameters can be learned with standard backpropagation. The pointwise affine transformation can be represented as follows:
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{5}$$
where $(x_i^s, y_i^s)$ and $(x_i^t, y_i^t)$ represent the $i$-th pixel in the source and target coordinates, respectively. We use $A$ to denote the six affine transformation parameters for simplicity. The transformation parameters allow rotation, translation, scale, and shear to be applied to the input feature map. The affine transformation is introduced in this work to create unseen examples that are hard for the discriminators to classify. The generated images are expected to include patterns that are not covered by the training set. Therefore, the classifiers become more robust to affine variations after being trained with these hard examples. The localization network is trained by minimizing the following loss function
(6) 
where the affine transformation function takes the output of the localization network and transforms the input features. This loss is the negative of the classification loss, so minimizing it is equivalent to maximizing the softmax loss of the transformed images.
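The pointwise affine transformation of Eqn. (5) can be illustrated with a small NumPy sketch that maps every target pixel to a source location and samples the image there. This is a simplified illustration under stated assumptions: coordinates are normalized to [-1, 1], sampling is nearest-neighbour rather than the bilinear sampling used in spatial transformer networks, and the names `affine_grid` and `warp_nearest` and the 2x3 matrix `A` are chosen here for illustration.

```python
import numpy as np

def affine_grid(A, height, width):
    """Source coordinates for every target pixel, in normalized [-1, 1] units."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                         np.linspace(-1.0, 1.0, width), indexing="ij")
    # Homogeneous target coordinates (x_t, y_t, 1), one column per pixel.
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(height * width)])
    source = A @ target  # shape (2, height * width): rows are x_s and y_s
    return source[0].reshape(height, width), source[1].reshape(height, width)

def warp_nearest(image, A):
    """Warp a 2-D image with nearest-neighbour sampling; out-of-range pixels become 0."""
    h, w = image.shape
    xs, ys = affine_grid(A, h, w)
    cols = np.rint((xs + 1.0) * (w - 1) / 2.0).astype(int)
    rows = np.rint((ys + 1.0) * (h - 1) / 2.0).astype(int)
    valid = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    out = np.zeros_like(image)
    out[valid] = image[rows[valid], cols[valid]]
    return out

img = np.arange(12.0).reshape(3, 4)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
flip = np.array([[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(warp_nearest(img, identity))
print(warp_nearest(img, flip))
```

The identity matrix leaves the image unchanged, while negating the first entry mirrors it horizontally. In the method described above, the six entries would instead be predicted by the localization network.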
Synthesis steps In the synthesis step, we want to obtain effective negative samples from the most recent classifier. Random samples are drawn from the initial negative distribution and updated by increasing the classifier's score using backpropagation. The ratio in Eqn. (3) can be modeled directly as the exponential of the network output; taking the logarithm of both sides turns the ratio into the network output itself. This conversion allows us to update the samples with stochastic gradient descent (SGD) based algorithms. In practice, we start each update from the samples generated in previous iterations to reduce time and memory complexity. High-quality negative samples are important for tightening the decision boundary, so an update threshold is introduced to guarantee that the generated negative images meet a certain criterion. We modify the update threshold proposed in [Lee et al.(2018)Lee, Xu, Fan, and Tu] and keep track of the network output in every iteration. In particular, we build a set by recording the outputs for the transformed positive images in every iteration, and form a normal distribution whose mean and standard deviation are computed from this set. The stop threshold is set to a random number sampled from this normal distribution. The rationale is to make sure that the generated negative images are close to the majority of transformed positive images in the feature space.
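The synthesis loop with a sampled stop threshold can be sketched as follows. This is a toy illustration, not the authors' code: the network output is replaced by a hypothetical quadratic score with an analytic gradient (the real method would backpropagate through the classifier), and the names `sample_threshold` and `synthesize`, the step size, and the step limit are assumptions.

```python
import numpy as np

def sample_threshold(recorded_scores, rng):
    """Draw a stop threshold from a normal fitted to the recorded scores."""
    scores = np.asarray(recorded_scores, dtype=float)
    return float(rng.normal(scores.mean(), scores.std()))

def synthesize(x0, score_fn, grad_fn, threshold, lr=0.1, max_steps=100):
    """Gradient ascent on the score until it reaches the threshold or steps run out."""
    x = np.array(x0, dtype=float)
    steps = 0
    while score_fn(x) < threshold and steps < max_steps:
        x = x + lr * grad_fn(x)  # move the sample towards higher score
        steps += 1
    return x, steps

# Toy stand-in for the network output: highest at `center` (assumption).
center = np.zeros(2)

def score_fn(x):
    return -float(np.sum((x - center) ** 2))

def grad_fn(x):
    return -2.0 * (x - center)

rng = np.random.default_rng(0)
# Scores recorded for (toy) transformed positive samples.
positive_scores = [score_fn(center + 0.3 * rng.standard_normal(2)) for _ in range(64)]
threshold = sample_threshold(positive_scores, rng)
x_neg, n_steps = synthesize(3.0 * rng.standard_normal(2), score_fn, grad_fn, threshold)
print(threshold, score_fn(x_neg), n_steps)
```

Because the threshold is drawn from the distribution of scores of transformed positives, the updated negatives stop once they score roughly like a typical positive, which is the stated purpose of the threshold.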
4 Experiments
In this section, we include three types of experiments to validate our proposed method. First, we conduct classification experiments on four datasets: MNIST [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner], affNIST [Tieleman(2013)], SVHN [Netzer et al.(2011)Netzer, Wang, Coates, Bissacco, Wu, and Ng] and CIFAR10 [Krizhevsky and Hinton(2009)], to show that our method boosts performance not only on simple datasets like MNIST but also on datasets with real-world images like SVHN and CIFAR10. We also run experiments that perform classification across different datasets. The purpose of this type of experiment is to examine the robustness of the classifier when the test dataset contains significantly different images. Lastly, we consider few-shot learning problems. In the few-shot learning experiments, we provide the same number of samples for every category to test the ability to generate novel samples from very limited variations.
We compare our method against CNNs, DCGAN [Radford et al.(2015)Radford, Metz, and Chintala], WGAN-GP [Gulrajani et al.(2017)Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville], INN [Lazarow et al.(2017)Lazarow, Jin, and Tu] and WINN [Lee et al.(2018)Lee, Xu, Fan, and Tu]. DCGAN experimentally shows the potential of GANs with deep convolutional networks. WGAN-GP stabilizes the training of WGAN [Arjovsky et al.(2017)Arjovsky, Chintala, and Bottou] with a gradient penalty. INN shows strong generative ability while being discriminative at the same time. WINN connects the Wasserstein distance with INNs and shows even better performance. All of the compared methods except WINN were proposed in an unsupervised setting. To compare them with our method in a supervised setting, we adopt the evaluation metric proposed in [Jin et al.(2017)Jin, Lazarow, and Tu]. The training phase becomes a two-step procedure: we first generate negative samples with the original implementation, and then use the generated negative images to augment the original training set. We train a simple CNN with the same structure as our method on the augmented training set. All results reported in this section are averages over multiple repetitions.
All experiments are conducted with a simple network that contains 4 convolutional layers, each having a filter size with 64 channels and stride 2 in all layers. We apply batch normalization [Ioffe and Szegedy(2015)] and swish activation function [Ramachandran et al.(2018)Ramachandran, Zoph, and Le] after the convolutional layers. The last convolutional layer is followed by two consecutive fully connected layers to compute logits and Wasserstein distance. We train our method and other baselines for 200 epochs. The optimizer used is Adam optimizer [Kingma and Ba(2014)] with parameters and .
4.1 Classification
We use the standard MNIST as the simplest benchmark. In this dataset, 55000, 5000 and 10000 images are used as the training, validation and testing splits, respectively. The affNIST dataset is used to show our results on deformed images. This dataset is built by taking images from MNIST and applying various reasonable affine transformations to them. To match the MNIST setup, we also take 55000, 5000 and 10000 images for training, validation and testing, respectively. SVHN is a real-world dataset that contains house-number images from Google Street View and is significantly harder than MNIST. We follow its training and testing split without augmenting the training set with extra images. Lastly, we conduct experiments on the CIFAR10 dataset. CIFAR10 contains 60000 natural images of ten different object classes from real-world scenes; 50000 images are used for training and 10000 for testing.
Table 1: Classification error rates.
Method          MNIST   affNIST  SVHN    CIFAR10
CNN (baseline)  0.89%   2.82%    9.86%   31.31%
CNN + DCGAN     0.79%   2.78%    9.78%   31.22%
CNN + WGAN-GP   0.74%   2.76%    9.73%   31.08%
CNN + INN       0.72%   2.97%    9.72%   32.34%
WINN            0.67%   2.56%    9.84%   30.72%
Ours            0.64%   2.37%    8.95%   28.75%
As shown in Table 1, our method achieves the lowest error on all four datasets. The improvement on MNIST is marginal, which meets our expectation because the difference between the training and test splits of MNIST is tiny; the potential for improving classifiers by generating hard examples is therefore very limited on this dataset. On the other hand, the improvement is clearly larger on affNIST, which contains more variations than MNIST. The overall improvements can be explained by the fact that our method generates novel and reliable negative images (shown in Figure 2) that effectively tighten the decision boundary. The spatial transformers tend to find examples unseen by the classifier, and the generated images directly target the weaknesses of the current classifier. The images generated by our method on MNIST are clearly different from the original images (shown in Figure 3). After training with unseen examples, the classifiers generalize to different affine transformations while preserving the original features. Therefore, STINN has a lower error rate than not only the methods that use generated samples to augment the training set, like DCGAN and WGAN-GP, but also the introspective methods INN and WINN.
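The size of the improvement over the baseline CNN can be read off Table 1 directly. The short snippet below, included only as a worked arithmetic check, computes the absolute error reduction in percentage points for each dataset; the dictionary layout and variable names are illustrative, and the numbers are copied from Table 1.

```python
# Error rates (in %) of the baseline CNN and of our method, copied from Table 1.
errors = {
    "MNIST":   {"CNN": 0.89,  "Ours": 0.64},
    "affNIST": {"CNN": 2.82,  "Ours": 2.37},
    "SVHN":    {"CNN": 9.86,  "Ours": 8.95},
    "CIFAR10": {"CNN": 31.31, "Ours": 28.75},
}

# Absolute reduction in percentage points, rounded to two decimals.
abs_gain = {name: round(e["CNN"] - e["Ours"], 2) for name, e in errors.items()}

for name, gain in abs_gain.items():
    print(f"{name}: {gain:.2f} percentage points")
```

In absolute terms the reduction is smallest on MNIST (0.25 points) and grows through affNIST (0.45), SVHN (0.91) and CIFAR10 (2.56).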
4.2 Cross Dataset Classification
As mentioned in the previous section, with the introduction of the spatial transformers, our method can generate novel variations that differ from the existing types in the training data, and thus helps the classifiers become more robust. To further verify this claim, we design a challenging cross-dataset classification task between two significantly different datasets. The training set in this experiment is MNIST, while the test set is affNIST, which includes many more variations than MNIST. CNNs with standard data augmentation are also included in the comparison. Table 2 shows that our method improves significantly over the other methods. Moreover, our method outperforms CNNs with standard data augmentation, which further demonstrates that it improves performance more efficiently than simple data augmentation.
In addition, we want to analyze the relationship between the performance improvement and the magnitude of the affine transformation in cross-dataset classification. Therefore, we manually create three test sets by applying different magnitudes of affine transformation to the MNIST dataset. All three test sets have the same number of samples as the test set in the cross-dataset classification task above. The purpose of this experiment is to test the performance of all methods under a more controlled setting, since the affNIST dataset is a mixture of all types of affine-transformed images. The detailed setting of each test split is reported in Table 3. We compare our method with the baseline and WINN, and the results are plotted in Figure 3. The results show that our method yields greater improvements on afftest2 and afftest3, which means that it can tolerate strong affine transformations.
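Test sets with a controlled transformation magnitude can be produced by sampling affine matrices whose rotation, scale, and translation are bounded by chosen limits. The sketch below shows one way to do this; the function name `random_affine`, its parameters, and the example bounds are placeholders chosen for illustration and are not the settings of Table 3.

```python
import numpy as np

def random_affine(rng, max_rot_deg=0.0, max_log_scale=0.0, max_shift=0.0):
    """Sample a 2x3 affine matrix with bounded rotation, scale and translation.

    All three bounds are hypothetical knobs for the transformation magnitude;
    setting them to zero returns the identity transformation.
    """
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    scale = np.exp(rng.uniform(-max_log_scale, max_log_scale))
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty]])

rng = np.random.default_rng(0)
mild = random_affine(rng, max_rot_deg=10.0, max_log_scale=0.05, max_shift=0.05)
strong = random_affine(rng, max_rot_deg=30.0, max_log_scale=0.2, max_shift=0.1)
print(mild)
print(strong)
```

Each sampled matrix can then be applied to an MNIST image with any affine warping routine to build a test split of the desired magnitude.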
Table 2: Cross-dataset classification error rates (trained on MNIST, tested on affNIST).
Method  CNN     CNN (w/ DA)  DCGAN   WGAN-GP  INN     WINN    Ours
Error   76.26%  69.16%       74.94%  74.60%   74.16%  73.36%  65.35%
4.3 Few-Shot Learning
Lastly, we want to generalize STINN to few-shot learning problems, in which the number of training samples is strictly limited. Many works have been proposed to solve this extremely challenging problem [Qiao et al.(2018)Qiao, Liu, Shen, and Yuille, Vinyals et al.(2016)Vinyals, Blundell, Lillicrap, Wierstra, et al., Santoro et al.(2016)Santoro, Bartunov, Botvinick, Wierstra, and Lillicrap, Wong and Yuille(2015), Zhang et al.(2017)Zhang, Qiao, Xie, Shen, Wang, and Yuille]. The purpose of this experiment is to explore the potential of STINN for generating unseen variations from very few training samples, so we mainly compare with generative models. We introduce one more baseline here, the data augmentation generative adversarial network (DAGAN) [Antoniou et al.(2017)Antoniou, Storkey, and Edwards], which improves performance on few-shot learning problems by using generative models for data augmentation. In these experiments, the training set is the MNIST dataset with only 10, 25 or 50 samples per class, while the test set is the whole MNIST test set. We repeat the same experiments on the affNIST dataset to further verify the results.
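The few-shot training sets described above can be built by drawing a fixed number of examples from every class. A minimal sketch follows; the helper name `sample_k_per_class`, the random seed, and the toy label array standing in for the real training labels are assumptions.

```python
import numpy as np

def sample_k_per_class(labels, k, rng):
    """Return sorted indices that select exactly k examples from every class."""
    labels = np.asarray(labels)
    picked = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # all examples of this class
        picked.append(rng.choice(idx, size=k, replace=False))
    return np.sort(np.concatenate(picked))

rng = np.random.default_rng(0)
toy_labels = np.repeat(np.arange(10), 100)  # stand-in for the training labels
subset = sample_k_per_class(toy_labels, k=10, rng=rng)
print(len(subset), np.bincount(toy_labels[subset], minlength=10))
```

With `k` set to 10, 25 or 50, the selected indices give the corresponding few-shot training split, while the test set is left untouched.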
Table 4: Few-shot learning error rates; (M) denotes MNIST and (A) denotes affNIST.
Method        CNN     DCGAN   WGAN-GP  INN     DAGAN   WINN    Ours
10-shots (M)  25.81%  22.43%  22.03%   23.28%  22.07%  22.89%  20.02%
25-shots (M)  11.08%  9.86%   9.74%    9.97%   9.78%   9.67%   9.01%
50-shots (M)  6.68%   6.03%   5.98%    6.12%   5.86%   6.23%   5.26%
10-shots (A)  84.07%  82.92%  82.84%   82.92%  80.45%  81.92%  78.53%
25-shots (A)  67.04%  61.88%  61.58%   61.08%  61.07%  61.67%  59.71%
50-shots (A)  52.72%  51.67%  51.71%   51.98%  50.47%  51.13%  49.04%

As shown in Table 4, our method has the best performance on all few-shot learning tasks. With 25 and 50 shots, the gain over the CNN baseline is smaller on MNIST than on affNIST (2.07 vs. 7.33 and 1.42 vs. 3.68 percentage points), while with 10 shots the two gains are comparable (5.79 vs. 5.54 percentage points). One possible reason is that the number of variations is limited in the MNIST dataset, while the affNIST dataset includes many more variations. Therefore, our method can generate more useful variations on affNIST under few-shot settings, which leads to greater improvements.
5 Conclusion
In this work, we proposed STINN, which strengthens classifiers by generating novel affine transformation variations. Our method shows consistent performance improvements not only on the classification tasks but also on the cross-dataset classification tasks, which indicates that it successfully generates variations unseen by the classifiers. Moreover, STINN also shows great potential in handling few-shot learning problems. In future work, we would like to apply our method to large-scale datasets and extend it to generate more types of variations.
References
 [Antoniou et al.(2017)Antoniou, Storkey, and Edwards] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
 [Arjovsky et al.(2017)Arjovsky, Chintala, and Bottou] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 [Denton et al.(2015)Denton, Chintala, Fergus, et al.] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
 [Elhamifar and Vidal(2011)] Ehsan Elhamifar and René Vidal. Robust classification using structured sparse representation. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1873–1879. IEEE, 2011.
 [Gao et al.(2010)Gao, Tsang, and Chia] Shenghua Gao, Ivor Wai-Hung Tsang, and Liang-Tien Chia. Kernel sparse representation for image classification and face recognition. In European Conference on Computer Vision, pages 1–14. Springer, 2010.
 [Goodfellow et al.(2014)Goodfellow, PougetAbadie, Mirza, Xu, WardeFarley, Ozair, Courville, and Bengio] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [Gulrajani et al.(2017)Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
 [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.
 [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
 [Ioffe and Szegedy(2015)] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [Isola et al.(2017)Isola, Zhu, Zhou, and Efros] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
 [Jaderberg et al.(2015)Jaderberg, Simonyan, Zisserman, et al.] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
 [Jin et al.(2017)Jin, Lazarow, and Tu] Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, pages 823–833, 2017.
 [Kingma and Ba(2014)] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [Krizhevsky and Hinton(2009)] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [Lazarow et al.(2017)Lazarow, Jin, and Tu] Justin Lazarow, Long Jin, and Zhuowen Tu. Introspective neural networks for generative modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2783, 2017.
 [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
 [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [Lee et al.(2018)Lee, Xu, Fan, and Tu] Kwonjoon Lee, Weijian Xu, Fan Fan, and Zhuowen Tu. Wasserstein introspective neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [Netzer et al.(2011)Netzer, Wang, Coates, Bissacco, Wu, and Ng] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [Polson et al.(2011)Polson, Scott, et al.] Nicholas G Polson, Steven L Scott, et al. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1–23, 2011.
 [Qiao et al.(2018)Qiao, Liu, Shen, and Yuille] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan Yuille. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [Radford et al.(2015)Radford, Metz, and Chintala] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [Ramachandran et al.(2018)Ramachandran, Zoph, and Le] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. 2018.
 [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
 [Salimans et al.(2016)Salimans, Goodfellow, Zaremba, Cheung, Radford, and Chen] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
 [Santoro et al.(2016)Santoro, Bartunov, Botvinick, Wierstra, and Lillicrap] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.
 [Shrivastava et al.(2016)Shrivastava, Gupta, and Girshick] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
 [Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich, et al.] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. CVPR, 2015.
 [Tieleman(2013)] Tijmen Tieleman. affnist, 2013. URL https://www.cs.toronto.edu/~tijmen/affNIST/.
 [Tu(2007)] Zhuowen Tu. Learning generative models via discriminative approaches. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
 [Vinyals et al.(2016)Vinyals, Blundell, Lillicrap, Wierstra, et al.] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
 [Wang and Wang(2014)] Haoxiang Wang and Jingbin Wang. An effective image representation method using kernel classification. In Tools with Artificial Intelligence (ICTAI), 2014 IEEE 26th International Conference on, pages 853–858. IEEE, 2014.
 [Wang et al.(2017)Wang, Shrivastava, and Gupta] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. arXiv preprint arXiv:1704.03414, 2017.
 [Welling et al.(2003)Welling, Zemel, and Hinton] Max Welling, Richard S Zemel, and Geoffrey E Hinton. Self supervised boosting. In Advances in neural information processing systems, pages 681–688, 2003.
 [Wong and Yuille(2015)] Alex Wong and Alan L Yuille. One shot learning via compositions of meaningful patches. In Proceedings of the IEEE International Conference on Computer Vision, pages 1197–1205, 2015.
 [Zhang et al.(2017)Zhang, Qiao, Xie, Shen, Wang, and Yuille] Zhishuai Zhang, Siyuan Qiao, Cihang Xie, Wei Shen, Bo Wang, and Alan L Yuille. Single-shot object detection with enriched semantics. arXiv preprint arXiv:1712.00433, 2017.