Tangent-Normal Adversarial Regularization for Semi-supervised Learning
The ever-increasing size of modern datasets, combined with the difficulty of obtaining label information, has made semi-supervised learning of significant practical importance in modern machine learning applications. Compared with supervised learning, the key difficulty in semi-supervised learning is how to make full use of the unlabeled data. To utilize the manifold information provided by unlabeled data, we propose a novel regularization called the tangent-normal adversarial regularization, which is composed of two parts. One part is applied along the tangent space of the data manifold, aiming to enforce local invariance of the classifier on the manifold, while the other is performed in the normal space orthogonal to the tangent space, intending to impose robustness on the classifier against the noise that causes the observed data to deviate from the underlying data manifold. The two terms complement each other and jointly enforce smoothness along two different directions that are crucial for semi-supervised learning. Both regularizers are realized by the strategy of virtual adversarial training. Our method achieves state-of-the-art performance on semi-supervised learning tasks on both an artificial dataset and the FashionMNIST dataset.
The recent success of supervised learning (SL) models, like deep convolutional neural networks, relies heavily on huge amounts of labeled data. However, though obtaining data itself might be relatively effortless in various circumstances, acquiring the annotated labels is still costly, limiting further applications of SL methods to practical problems. Semi-supervised learning (SSL) models, which require only a small part of the data to be labeled, do not suffer from such restrictions. The fact that SSL depends less on well-annotated datasets makes it of crucial practical importance and draws considerable research interest. The common setting in SSL is that we have access to a relatively small amount of labeled data and a much larger amount of unlabeled data, and we need to train a classifier utilizing both. Compared to SL, the main challenge of SSL is how to make full use of the huge amount of unlabeled data, i.e., how to utilize the marginal distribution of the input to improve the prediction model, i.e., the conditional distribution of the supervised target given the input. To solve this problem, there are mainly three lines of research.
The first approach, based on probabilistic models, recognizes the SSL problem as a specialized missing data imputation task for classification. The common scheme of this method is to establish a hidden variable model capturing the relationship between the input and label, and then apply Bayesian inference techniques to optimize the model (Kingma et al., 2014; Zhu et al., 2003; Rasmus et al., 2015). Since the estimation of the posterior is either inaccurate or computationally inefficient, this approach performs less well, particularly on high-dimensional datasets (Kingma et al., 2014).
The second line tries to construct proper regularization using the unlabeled data, to impose the desired smoothness on the classifier. One kind of useful regularization is achieved by adversarial training (Goodfellow et al., 2014b), or virtual adversarial training (VAT) when applied to unlabeled data (Miyato et al., 2016, 2017). Such regularization leads to robustness of the classifier to adversarial examples, thus inducing smoothness of the classifier in the input space where the observed data is presented. Although the input space is high dimensional, the data itself is concentrated on an underlying manifold of much lower dimensionality (Cayton, 2005; Narayanan & Mitter, 2010; Chapelle et al., 2009; Rifai et al., 2011). Thus directly performing VAT in the input space might overly regularize the classifier and do it potential harm. Another kind of regularization, called manifold regularization, aims to encourage invariance of the classifier on the manifold (Simard et al., 1998; Belkin et al., 2006; Niyogi, 2013; Kumar et al., 2017; Rifai et al., 2011), rather than in the whole input space as VAT does. Such manifold regularization is implemented by tangent propagation (Simard et al., 1998; Kumar et al., 2017) or the manifold Laplacian norm (Belkin et al., 2006; Lecouat et al., 2018), requiring evaluation of the Jacobian of the classifier (with respect to the manifold representation of the data) and thus being highly computationally inefficient.
The third way is related to generative adversarial networks (GANs) (Goodfellow et al., 2014a). Most GAN-based approaches modify the discriminator to include a classifier, by splitting the real class of the original discriminator into subclasses, one for each class of the labeled data (Salimans et al., 2016; Odena, 2016; Dai et al., 2017; Qi et al., 2018). The features extracted for distinguishing real from fake examples, which can be viewed as a kind of coarse label, implicitly benefit the supervised classification task. Besides that, there are also works jointly training a classifier, a discriminator and a generator (Li et al., 2017).
Our work mainly follows the second line. We first sort out three important assumptions that motivate our idea:
- The manifold assumption
The observed data presented in the high-dimensional space is, with high probability, concentrated in the vicinity of some underlying manifold of much lower dimensionality (Cayton, 2005; Narayanan & Mitter, 2010; Chapelle et al., 2009; Rifai et al., 2011). We denote this underlying manifold by M. We further assume that the classification task concerned relies, and only relies, on M (Rifai et al., 2011).
- The noisy observation assumption
The observed data can be decomposed into two parts as x = x0 + n, where x0 is exactly supported on the underlying manifold M and n is some noise independent of x0 (Bengio et al., 2013; Rasmus et al., 2015). Under the assumption that the classifier only depends on the underlying manifold M, the noise part n might have undesired influences on the learning of the classifier.
- The semi-supervised learning assumption
The true conditional distribution of the label given the input varies smoothly along the underlying manifold M, but not necessarily in the whole input space (Belkin et al., 2006; Rifai et al., 2011; Niyogi, 2013).
Inspired by the three assumptions, we introduce a novel regularization called the tangent-normal adversarial regularization (TNAR), which is composed of two parts. The tangent adversarial regularization (TAR) induces the smoothness of the classifier along the tangent space of the underlying manifold, to enforce the invariance of the classifier along the manifold. The normal adversarial regularization (NAR) penalizes the deviation of the classifier along directions orthogonal to the tangent space, to impose robustness on the classifier against the noise carried in the observed data.
To realize our idea, we must overcome two challenges: how to estimate the underlying manifold and how to efficiently perform TNAR.
For the first issue, we take advantage of generative models equipped with an extra encoder, to characterize the coordinate chart of the manifold (Kumar et al., 2017; Lecouat et al., 2018; Qi et al., 2018). More specifically, in this work we choose the variational autoencoder (VAE) (Kingma & Welling, 2013) and localized GAN (Qi et al., 2018) to estimate the underlying manifold from data.
For the second problem, we develop an adversarial regularization approach based on virtual adversarial training (VAT) (Miyato et al., 2017). Different from VAT, we perform virtual adversarial training in the tangent space and the normal space separately, as illustrated in Figure 1, which leads to a number of new technical difficulties; we elaborate the corresponding solutions later. Compared with traditional manifold regularization methods based on tangent propagation (Simard et al., 1998; Kumar et al., 2017) or the manifold Laplacian norm (Belkin et al., 2006; Lecouat et al., 2018), our realization does not require explicitly evaluating the Jacobian of the classifier. All we need is to compute the derivatives of matrix-vector products, which only costs a few extra backward or forward passes through the network.
We denote the labeled and unlabeled datasets as D_l and D_ul respectively, where x denotes an observed data point and y denotes its label. D = D_l ∪ D_ul is the full dataset. For simplicity, the output of the classification model is written as p(y|x, θ), where θ denotes the model parameters to be trained, e.g., the weights in neural networks. We use ℓ to represent the supervised loss function, e.g., the KL divergence in most practical cases. The regularization terms are denoted as R with specific subscripts for distinction.
The space in which the observed data is presented is written as R^n, where n denotes its dimensionality. The underlying manifold of the observed data is written as M, with d the dimensionality of M (d < n). We use z for the manifold representation of a data point x. We denote the decoder, or generator, as g and the encoder as h, which together form the coordinate chart of the manifold. If not stated otherwise, we always assume x and z correspond to the coordinates of the same data point in the observed space and on the manifold M, i.e., x = g(z) and z = h(x). The tangent space of M at point x is denoted as

T_x M = { J_z u : u ∈ R^d }, (1)

where J_z = ∂g(z)/∂z is the Jacobian of g at the point z. Note that the tangent space is also the span of the columns of J_z.
The perturbation in the observed space is represented as r ∈ R^n, while the perturbation of the manifold representation is denoted as u ∈ R^d. Hence the perturbation on the manifold is g(z + u) − g(z). When the perturbation is small enough for the first-order Taylor expansion to hold, the perturbation on the manifold is approximately equal to a perturbation in its tangent space:

g(z + u) − g(z) ≈ J_z u ∈ T_x M.

Therefore we say a perturbation r is actually on the manifold if there is a perturbation u ∈ R^d such that r = J_z u.
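The first-order relationship above can be checked numerically. Below is a minimal sketch with a hypothetical toy decoder g mapping 2-D manifold coordinates into R^3; the decoder, the point z and the step size u are illustrative assumptions, not the paper's trained generator:

```python
import numpy as np

# Hypothetical toy decoder g: R^2 -> R^3, mapping manifold coordinates z
# to observed space (a cylinder-like surface).
def g(z):
    return np.array([np.cos(z[0]), np.sin(z[0]), z[1]])

def jacobian(f, z, eps=1e-6):
    """Numerical Jacobian of f at z via central differences."""
    z = np.asarray(z, dtype=float)
    cols = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((f(z + dz) - f(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

z = np.array([0.3, -0.5])
x = g(z)
J = jacobian(g, z)           # columns of J span the tangent space T_x M

u = np.array([0.01, 0.02])   # small perturbation of the manifold coordinates
r_tangent = J @ u            # first-order on-manifold perturbation in input space

# Check the first-order approximation g(z + u) - g(z) ~ J u.
err = np.linalg.norm(g(z + u) - (x + r_tangent))
```

For small u the residual err is of second order in the perturbation size, confirming that tangent-space perturbations track on-manifold perturbations.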
2.2 Virtual adversarial training
VAT (Miyato et al., 2017) is an effective regularization method for SSL. The virtual adversarial loss introduced in VAT is defined via the robustness of the classifier against local perturbations in the input space R^n. Hence VAT imposes a kind of smoothness condition on the classifier. Mathematically, the virtual adversarial loss in VAT for SSL is

L(D_l, D; θ) = ℓ(D_l; θ) + R_vadv(D; θ).

The VAT regularization is defined as

R_vadv(D; θ) = E_{x ∈ D} max_{||r|| ≤ ε} dist( p(y|x, θ), p(y|x + r, θ) ),

where dist(·, ·) is some distribution distance measure, like the KL divergence, and ε is a hyperparameter controlling the magnitude of the adversarial example. For simplicity, define

D(r; x, θ) = dist( p(y|x, θ), p(y|x + r, θ) ).

Then R_vadv(D; θ) = E_{x ∈ D} max_{||r|| ≤ ε} D(r; x, θ). The so-called virtual adversarial example is

r_vadv = argmax_{||r|| ≤ ε} D(r; x, θ).

Once we have r_vadv, the VAT loss can be optimized with the objective E_{x ∈ D} D(r_vadv; x, θ).
To obtain the virtual adversarial example r_vadv, Miyato et al. (2017) suggested applying a second-order Taylor expansion to D(r; x, θ) around r = 0,

D(r; x, θ) ≈ (1/2) r^T H r,

where H denotes the Hessian of D(r; x, θ) with respect to r at r = 0. The first two terms of the Taylor expansion vanish because D is a distance measure with minimum zero and r = 0 is the corresponding optimal value, indicating that at r = 0, both the value and the gradient of D are zero. Therefore, for small enough ε,

r_vadv ≈ argmax_{||r|| ≤ ε} r^T H r,

which is an eigenvalue problem, and the direction of r_vadv can be found by power iteration,

r ← H r / ||H r||.

Note that evaluating the matrix-vector product H r is enough for power iteration, and this product can be computed efficiently, since differentiation is a linear operator and commutes with multiplication by a fixed vector.
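The power-iteration step above can be sketched in a few lines. Here H is a small hypothetical fixed matrix standing in for the classifier's Hessian; in VAT the product H v would instead be obtained by differentiating the scalar (∇D)^T v, at the cost of one extra backward pass:

```python
import numpy as np

# Hypothetical stand-in for the Hessian H of the local distance D at r = 0.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def hvp(v):
    # In VAT this matrix-vector product is computed matrix-free via
    # automatic differentiation; here we multiply directly.
    return H @ v

def power_iteration(matvec, dim, n_iter=50, seed=0):
    """Find the dominant eigendirection using only matrix-vector products."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=dim)
    for _ in range(n_iter):
        d = matvec(d)
        d /= np.linalg.norm(d)
    return d

d = power_iteration(hvp, 2)   # direction of the virtual adversarial example
eigval = d @ hvp(d)           # Rayleigh quotient: the dominant eigenvalue
```

The dominant eigenvalue of this toy H is (5 + sqrt(5)) / 2, which the Rayleigh quotient recovers after the iteration converges.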
2.3 Generative models for data manifold
We take advantage of a generative model with both an encoder and a decoder to estimate the underlying data manifold M and its tangent space T_x M. As assumed by previous works (Kumar et al., 2017; Lecouat et al., 2018), a perfect generative model with both a decoder and an encoder can describe the data manifold, where the decoder g and the encoder h together serve as the coordinate chart of the manifold M. Note that the encoder is indispensable, for it helps to identify the manifold coordinate z = h(x) for a point x. With the trained generative model, the tangent space is given by Eq. (1), i.e., the span of the columns of J_z.
VAE (Kingma & Welling, 2013) is a well-known generative model consisting of both an encoder and a decoder. A VAE is trained by optimizing the variational lower bound of the log likelihood,

log p(x) ≥ E_{q(z|x)} [ log p(x|z) ] − KL( q(z|x) || p(z) ).

Here p(z) is the prior of the hidden variable z, and q(z|x), p(x|z) model the encoder and decoder in the VAE, respectively. The derivative of the lower bound with respect to the model parameters is well defined thanks to the reparameterization trick (Kingma & Welling, 2013), thus it can be optimized by gradient-based methods. The lower bound can also be interpreted as a reconstruction term plus a regularization term (Kingma & Welling, 2013). With a trained VAE at hand, the encoder and decoder are given by the means of q(z|x) and p(x|z) accordingly.
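A single-sample Monte Carlo estimate of this lower bound can be written out directly for Gaussian distributions. The encoder parameters, the data point and the linear decoder below are hypothetical toy values chosen only to make the two terms concrete; the KL term has the familiar closed form for diagonal Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data point and Gaussian encoder q(z|x) = N(mu, sigma^2 I).
x = np.array([0.5, -1.0, 0.2])
mu = np.array([0.1, 0.3])
log_sigma = np.array([-0.5, -0.2])
sigma = np.exp(log_sigma)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# which makes the sample differentiable with respect to (mu, sigma).
eps = rng.normal(size=mu.shape)
z = mu + sigma * eps

def decode(z):
    # Hypothetical linear decoder mapping z in R^2 back to R^3.
    W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
    return W @ z

# Reconstruction term: log N(x | decode(z), I), up to an additive constant.
recon = -0.5 * np.sum((x - decode(z)) ** 2)

# KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)

elbo = recon - kl   # the variational lower bound estimate
```

The reconstruction term rewards faithful decoding while the KL term regularizes the posterior toward the prior, matching the interpretation quoted above.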
Localized GAN (Qi et al., 2018) suggests using a localized generator G(x, z) to replace the global generator in vanilla GAN. The key difference between localized GAN and previous generative models for manifolds is that localized GAN learns a distinct local coordinate chart for each point x, given by G(x, ·), rather than one global coordinate chart. To model the local coordinate chart of the data manifold, localized GAN requires the localized generator to satisfy two more regularity conditions:

locality: G(x, 0) = x, so that G(x, z) is localized around x;

orthogonality: J_z^T J_z = I, to ensure that J_z is non-degenerate.

The two conditions are achieved by the following penalty during the training of the localized GAN:

μ_1 || G(x, 0) − x ||^2 + μ_2 || J_z^T J_z − I ||_F^2,

where μ_1 and μ_2 are weighting hyperparameters. Since G(x, ·) defines a local coordinate chart around each x separately, in which the latent encoding of x itself is z = 0, there is no need for an extra encoder to provide the manifold representation of x.
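The two regularity conditions can be sketched numerically. Below, the weighting hyperparameters are dropped, both toy generators are hypothetical stand-ins (not trained networks), and the Jacobian is evaluated by finite differences:

```python
import numpy as np

def penalty(G, x, dim_z, eps=1e-6):
    """Locality + orthogonality penalty for a localized generator G(x, z)."""
    z0 = np.zeros(dim_z)
    locality = np.sum((G(x, z0) - x) ** 2)          # wants G(x, 0) = x
    cols = []
    for i in range(dim_z):                          # numerical Jacobian dG/dz at z = 0
        dz = np.zeros(dim_z)
        dz[i] = eps
        cols.append((G(x, z0 + dz) - G(x, z0 - dz)) / (2 * eps))
    J = np.stack(cols, axis=1)
    ortho = np.sum((J.T @ J - np.eye(dim_z)) ** 2)  # wants J^T J = I
    return locality + ortho

# A hypothetical well-behaved localized generator: orthonormal tangent
# directions U plus a curvature term that vanishes at z = 0.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
c = np.array([0.0, 0.0, 1.0])
G_good = lambda x, z: x + U @ z + 0.1 * (z @ z) * c
G_bad = lambda x, z: x + 2.0 * (U @ z)   # violates orthogonality: J^T J = 4 I

x = np.array([0.5, -0.3, 1.2])
p_good = penalty(G_good, x, 2)   # ~0: both conditions hold
p_bad = penalty(G_bad, x, 2)     # large: columns of J are not orthonormal
```

The good generator incurs (numerically) zero penalty, while rescaling the tangent directions immediately shows up in the orthogonality term.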
In this section we elaborate our proposed tangent-normal adversarial regularization (TNAR) strategy. The TNAR loss to be minimized for SSL is

L(D_l, D; θ) = ℓ(D_l; θ) + α_1 R_tan(D; θ) + α_2 R_norm(D; θ),

where α_1 and α_2 are weighting hyperparameters. The first term is the commonly used supervised loss; R_tan and R_norm are the so-called tangent adversarial regularization (TAR) and normal adversarial regularization (NAR) accordingly, jointly forming the proposed TNAR.
We assume that we already have a well trained generative model for the underlying data manifold , with encoder and decoder , which can be obtained as described in Section 2.3.
3.1 Tangent adversarial regularization
Vanilla VAT penalizes the variation of the classifier against local perturbations in the full input space (Miyato et al., 2017), which might overly regularize the classifier, since the semi-supervised learning assumption only indicates that the true conditional distribution varies smoothly along the underlying manifold M, but not in the whole input space (Belkin et al., 2006; Rifai et al., 2011; Niyogi, 2013). To avoid this shortcoming of vanilla VAT, we propose the tangent adversarial regularization (TAR), which restricts virtual adversarial training to the tangent space of the underlying manifold M, to enforce the manifold invariance property of the classifier:

R_tan(x; θ) = max_{r ∈ T_x M, ||r|| ≤ ε} D(r; x, θ).

Writing r = J_z u and applying the same second-order approximation as in vanilla VAT, the inner maximization becomes

max_{||J_z u|| ≤ ε} u^T (J_z^T H J_z) u,

whose optimum is characterized by

J_z^T H J_z u = λ J_z^T J_z u.

This is a classic generalized eigenvalue problem, the optimal solution of which can be obtained by power iteration and conjugate gradient (and scaling). The iteration framework is

v_k = J_z^T H J_z u_k;  solve J_z^T J_z u_{k+1} = v_k by conjugate gradient;  u_{k+1} ← u_{k+1} / ||u_{k+1}||.
Now we elaborate the detailed implementation of each step of the above iteration.
- Computing J_z^T H J_z u.

Note that J_z^T H J_z is the Hessian with respect to u, at u = 0, of the composed function D(J_z u; x, θ). Define D̃(u) = D(J_z u; x, θ). For a fixed vector u_0, we have

∇_u ( (∇_u D̃(u))^T u_0 ) |_{u=0} = ( ∇²_u D̃(0) ) u_0 = J_z^T H J_z u_0.

On the other hand, since D is some distance measure with minimum zero and u = 0 is the corresponding optimal value, we have ∇_u D̃(0) = 0, so the first-order term contributes nothing. Therefore, the targeted matrix-vector product can be efficiently computed as

J_z^T H J_z u = ∇_u ( (∇_u D̃(u))^T u ) |_{u=0},

with the second occurrence of u treated as a constant. Note that (∇_u D̃(u))^T u is a scalar, so its gradient can be obtained by back-propagating through the network once. Hence it only costs two backward passes through the network to compute J_z^T H J_z u.
- Solving J_z^T J_z u = v.

Similarly, define R̃(u) = (1/2) || g(z + u) − x ||^2. We have

∇_u R̃(u) = J_{z+u}^T ( g(z + u) − x ).

Since g(z) = x, we have ∇_u R̃(0) = 0, and the Hessian of R̃ at u = 0 reduces to J_z^T J_z. Thus the matrix-vector product can be evaluated similarly as

J_z^T J_z w = ∇_u ( (∇_u R̃(u))^T w ) |_{u=0}.

The extra cost for evaluating J_z^T J_z w is still two backward passes through the network. Since J_z^T J_z is positive definite (as J_z is non-degenerate), we can apply several steps of conjugate gradient to solve J_z^T J_z u = v efficiently.
Compared with manifold regularization based on tangent propagation (Simard et al., 1998; Kumar et al., 2017) or the manifold Laplacian norm (Belkin et al., 2006; Lecouat et al., 2018), which is computationally inefficient due to the explicit evaluation of the Jacobian, our proposed TAR can be implemented efficiently, thanks to the low computational cost of virtual adversarial training.
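The power-iteration-plus-conjugate-gradient scheme above can be sketched with small explicit matrices. Here J and H are hypothetical stand-ins (in TNAR both products would be computed matrix-free through the network), and the conjugate gradient solver is a plain textbook implementation:

```python
import numpy as np

# Hypothetical small stand-ins: J is a tangent basis, H an SPD "Hessian".
J = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])
H = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])

A = J.T @ H @ J          # plays the role of J^T H J
B = J.T @ J              # plays the role of J^T J (positive definite)

def cg(matvec, b, n_iter=50, tol=1e-12):
    """Plain conjugate gradient for solving matvec(x) = b."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol ** 2:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Power iteration for the generalized problem A u = lam * B u:
# multiply by A, solve against B via CG, then normalize.
u = np.array([1.0, 0.3])
for _ in range(100):
    u = cg(lambda v: B @ v, A @ u)
    u /= np.linalg.norm(u)

lam = (u @ A @ u) / (u @ B @ u)   # generalized Rayleigh quotient
```

For these toy matrices the dominant generalized eigenvalue is 2 + 1/sqrt(3), which the iteration recovers; only matrix-vector products with A and B are ever needed.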
3.2 Normal adversarial regularization
Motivated by the noisy observation assumption, which indicates that the observed data contains noise driving it off the underlying manifold, we propose the normal adversarial regularization (NAR) to enforce the robustness of the classifier against such noise, by performing virtual adversarial training in the normal space. The mathematical description is

R_norm(x; θ) = max_{r ⊥ T_x M, ||r|| ≤ ε} D(r; x, θ).

Note that T_x M is spanned by the columns of J_z, thus the constraint r ⊥ T_x M is equivalent to J_z^T r = 0. Therefore, after the same second-order approximation as before, we can reformulate the problem as

max_{||r|| ≤ ε, J_z^T r = 0} r^T H r.

However, this problem is not easy to optimize, since the projection onto the null space of J_z^T cannot be efficiently computed. To overcome this, instead of requiring r to be orthogonal to the whole tangent space T_x M, we take a step back and demand that r be orthogonal to only one specific tangent direction, namely the tangent-space adversarial perturbation r_tan. Thus the constraint is relaxed to r_tan^T r = 0, and the problem becomes

max_{||r|| ≤ ε, r_tan^T r = 0} r^T H r.

We further replace the constraint by a penalty term,

max_{||r|| ≤ ε} r^T H r − η (r_tan^T r)^2 = max_{||r|| ≤ ε} r^T ( H − η r_tan r_tan^T ) r,

where η is a hyperparameter introduced to control the orthogonality of r.
Since the relaxed problem is again an eigenvalue problem, we can apply power iteration to solve it as before. Note that a small multiple of the identity matrix, ε_0 I, needs to be added to keep H − η r_tan r_tan^T positive semi-definite; this shift does not change the optimal solution of the eigenvalue problem. The power iteration is then

r ← ( H − η r_tan r_tan^T + ε_0 I ) r / || ( H − η r_tan r_tan^T + ε_0 I ) r ||,

and the evaluation of the matrix-vector product is by

( H − η r_tan r_tan^T + ε_0 I ) r = H r − η (r_tan^T r) r_tan + ε_0 r,

where H r is computed as in Section 2.2 and the remaining terms involve only inner products. After finding the optimal solution of the relaxed problem as r_norm, the proposed NAR becomes R_norm(x; θ) = D(r_norm; x, θ).
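The relaxed iteration can be checked on a toy problem. H, r_tan and the hyperparameters below are hypothetical stand-ins chosen so the effect of the penalty is visible: without it, the dominant direction of H would coincide with r_tan, and the penalty pushes the iteration into the orthogonal complement:

```python
import numpy as np

H = np.diag([3.0, 2.0, 1.0])          # pretend Hessian of the local distance
r_tan = np.array([1.0, 0.0, 0.0])     # unit tangent adversarial direction
eta, eps0 = 10.0, 10.0                # orthogonality strength and identity shift

def matvec(r):
    # (H - eta * r_tan r_tan^T + eps0 * I) r, using only inner products.
    return H @ r - eta * (r_tan @ r) * r_tan + eps0 * r

rng = np.random.default_rng(0)
r = rng.normal(size=3)
for _ in range(300):                   # power iteration
    r = matvec(r)
    r /= np.linalg.norm(r)
# r now approximates the normal-space adversarial direction: the dominant
# eigendirection of the penalized operator, orthogonal to r_tan.
```

For this toy operator the shifted eigenvalues along the three axes are 3, 12 and 11, so the iteration converges to the second axis: the largest remaining curvature direction once r_tan is penalized away.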
Finally, as in Miyato et al. (2017), we add entropy regularization to our loss function. It encourages the neural network to output more determinate predictions and has implicit benefits for performing virtual adversarial training.
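The entropy regularizer is simply the average entropy of the predictive distribution over a batch; minimizing it pushes predictions toward one-hot. A minimal sketch, with hypothetical softmax outputs as input:

```python
import numpy as np

def entropy_regularizer(probs, eps=1e-12):
    """Mean entropy (in nats) of a batch of categorical distributions."""
    probs = np.clip(probs, eps, 1.0)   # guard against log(0)
    return -np.mean(np.sum(probs * np.log(probs), axis=1))

# Hypothetical predictive distributions over 3 classes.
confident = np.array([[0.98, 0.01, 0.01]])
uniform = np.array([[1 / 3, 1 / 3, 1 / 3]])

h_conf = entropy_regularizer(confident)
h_unif = entropy_regularizer(uniform)   # log(3): the maximum for 3 classes
```

A confident prediction incurs a much smaller penalty than a uniform one, which is exactly the pressure toward determinate outputs described above.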
Our final loss for SSL is

L(D_l, D; θ) = ℓ(D_l; θ) + α_1 R_tan(D; θ) + α_2 R_norm(D; θ) + α_3 R_ent(D; θ),

where R_ent denotes the average conditional entropy of the classifier's predictions and α_1, α_2, α_3 are weighting hyperparameters.
TAR inherits the computational efficiency of VAT and the manifold invariance property of traditional manifold regularization, while NAR makes the classifier for SSL robust against the off-manifold noise contained in the observed data. Combining the two regularization terms, these advantages make our proposed TNAR a reasonable regularization method for SSL, whose superiority will be shown in the experimental part in Section 4.
Since our regularization strategy TNAR requires an estimate of the underlying manifold given the observed data, a generative model such as a VAE or a localized GAN needs to be trained before applying TNAR to SSL. This directly brings some extra computational cost for training the specified generative models with encoders. To avoid training a generative model for manifold approximation, we propose an alternative solution that uses the secant space to approximate the tangent space T_x M. Coarsely but conveniently, the secant space can be obtained from linear combinations of differences between randomly sampled observed data points.
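A minimal sketch of this secant approximation is shown below on hypothetical data from the unit circle (a 1-D manifold in R^2); the neighbour count and the use of an SVD to orthonormalize the secants are illustrative choices, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data on the unit circle, and a query point x on it.
theta = rng.uniform(0, 2 * np.pi, size=200)
data = np.stack([np.cos(theta), np.sin(theta)], axis=1)
x = np.array([1.0, 0.0])

# Take the k nearest neighbours of x; the secants x_i - x span a rough
# substitute for the tangent space at x.
k = 5
idx = np.argsort(np.sum((data - x) ** 2, axis=1))[:k]
secants = data[idx] - x

# Orthonormalize the secants via SVD to obtain an approximate tangent basis.
U, s, _ = np.linalg.svd(secants.T, full_matrices=False)
tangent_dir = U[:, 0]   # leading direction of the secant span

# For the circle at x = (1, 0), the true tangent direction is (0, +/-1);
# the secant estimate should be close to it.
```

No generative model is needed: the approximation uses nothing but nearby samples, at the cost of a coarser tangent estimate.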
In practice, when applying TNAR to SSL, this provides an alternative, allowing the user to decide which manifold approximation to use, considering the trade-off between classification accuracy and computational efficiency. We also conduct experiments comparing the secant-based method with the generative-model-based ones in Section 4.
To demonstrate the advantages of our proposed TNAR for SSL, we conduct a series of experiments on both artificial and real datasets.
The compared methods for SSL include:
Supervised learning using only the labeled data.
Vanilla VAT (Miyato et al., 2017).
The proposed TNAR method, with the tangent space coarsely approximated by secant space (Eq. (30)).
The proposed TNAR method, with the underlying manifold estimated by VAE.
The proposed TNAR method, with the underlying manifold estimated by localized GAN.
The proposed TNAR method with the oracle underlying manifold for the observed data, only used for the artificial dataset.
The proposed TNAR method, with the underlying manifold estimated roughly by an autoencoder, only used for the artificial dataset.
If not stated otherwise, all the above methods contain entropy regularization term (ent).
4.1 Two-rings artificial dataset
We introduce experiments on a two-rings artificial dataset to show the effectiveness of our proposed methods intuitively.
The underlying manifold for the two-rings data consists of two concentric circles, which represent the two different classes. The observed data is sampled by adding Gaussian noise to points drawn uniformly from the two circles. We sample a small set of labeled training data for each class, together with a much larger set of unlabeled training data, as shown in Figure 2.
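This sampling scheme can be sketched in a few lines. The radii, noise scale and sample counts below are hypothetical choices for illustration, not necessarily the exact values used in the experiments:

```python
import numpy as np

def sample_two_rings(n_per_class, noise=0.05, radii=(1.0, 2.0), seed=0):
    """Two concentric circles (1-D manifolds) observed with Gaussian noise."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for label, r in enumerate(radii):
        theta = rng.uniform(0, 2 * np.pi, size=n_per_class)
        ring = r * np.stack([np.cos(theta), np.sin(theta)], axis=1)
        xs.append(ring + noise * rng.normal(size=ring.shape))  # off-manifold noise
        ys.append(np.full(n_per_class, label))
    return np.concatenate(xs), np.concatenate(ys)

X, y = sample_two_rings(500)   # 500 noisy points per ring
```

The noise term is exactly the off-manifold component that NAR is designed to be robust against, while the circles themselves are the tangent directions that TAR regularizes along.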
The performance of each compared method is shown in Table 1, and the corresponding classification boundaries are demonstrated in Figure 2. We can easily observe that Manifold-TNAR notably outperforms the other methods. TNAR under the true underlying manifold (Manifold-TNAR) perfectly classifies the two-rings dataset with merely the small labeled set, while the other methods fail to predict the correct decision boundary. Even with the underlying manifold roughly approximated by an autoencoder, our approach (AE-TNAR) outperforms VAT on this artificial dataset. However, the improvement of AE-TNAR is much smaller than that of Manifold-TNAR, indicating that the effectiveness of TNAR relies on the quality of the estimate of the underlying manifold.
|Method||Accuracy|
|VAT (without ent)||76.2%|
|AE-TNAR (without ent)||87.55%|
|Manifold-TNAR (without ent)||90.1%|
We also conduct experiments on the FashionMNIST dataset.
The corresponding results are shown in Table 2, from which we observe at least two phenomena. The first is that our proposed TNAR methods (Secant-TNAR, VAE-TNAR, LGAN-TNAR) achieve higher classification accuracy than VAT, especially when the number of labeled data is small. The second is that the performance of our methods depends on the quality of the estimate of the underlying manifold of the observed data. While the more accurate manifold estimates from VAE and localized GAN (VAE-TNAR, LGAN-TNAR) bring larger improvements to our TNAR methods, the coarse estimation of the tangent space using the secant space (Secant-TNAR) still outperforms VAT. As generative models develop to capture the underlying manifold more accurately, our proposed regularization strategy is expected to benefit SSL even more.
|Method||100 labels||200 labels||1000 labels|
We present the tangent-normal adversarial regularization, a novel regularization strategy for semi-supervised learning, composed of regularization on the tangent and normal spaces separately. The tangent adversarial regularization enforces manifold invariance of the classifier, while the normal adversarial regularization imposes robustness of the classifier against the noise contained in the observed data. Experiments on an artificial dataset and the FashionMNIST dataset demonstrate that our approach outperforms other state-of-the-art methods for semi-supervised learning, especially when the labeled data is scarce. The performance of our method relies on the quality of the estimate of the underlying manifold, hence breakthroughs in modeling data manifolds could also benefit our strategy for semi-supervised learning, which we leave for future work.
- Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399–2434, 2006.
- Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pp. 899–907, 2013.
- Lawrence Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 12(1-17):1, 2005.
- Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (book review). IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
- Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pp. 6510–6520, 2017.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014a.
- Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.
- Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- Abhishek Kumar, Prasanna Sattigeri, and Tom Fletcher. Semi-supervised learning with gans: Manifold invariance with improved inference. In Advances in Neural Information Processing Systems, pp. 5540–5550, 2017.
- Bruno Lecouat, Chuan-Sheng Foo, Houssam Zenati, and Vijay R Chandrasekhar. Semi-supervised learning with gans: Revisiting manifold regularization. arXiv preprint arXiv:1805.08957, 2018.
- Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291, 2017.
- Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.
- Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.
- Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pp. 1786–1794, 2010.
- Partha Niyogi. Manifold regularization and semi-supervised learning: Some theoretical analyses. The Journal of Machine Learning Research, 14(1):1229–1250, 2013.
- Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
- Guo-Jun Qi, Liheng Zhang, Hao Hu, Marzieh Edraki, Jingdong Wang, and Xian-Sheng Hua. Global versus localized generative adversarial nets. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
- Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pp. 2294–2302, 2011.
- Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
- Patrice Y Simard, Yann A LeCun, John S Denker, and Bernard Victorri. Transformation invariance in pattern recognition – tangent distance and tangent propagation. In Neural networks: tricks of the trade, pp. 239–274. Springer, 1998.
- Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pp. 912–919, 2003.