Adversarial Transformations for Semi-Supervised Learning
We propose a Regularization framework based on Adversarial Transformations (RAT) for semi-supervised learning. RAT is designed to enhance the robustness of the output distribution of class predictions for given data against input perturbation. RAT extends Virtual Adversarial Training (VAT): whereas VAT simply produces adversarial additive noise, RAT adversarially transforms data along the underlying data distribution using a rich set of data transformation functions that leave the class label invariant. In addition, we verify that gradually increasing the perturbation region further improves robustness. In experiments, we show that RAT significantly improves classification performance on CIFAR-10 and SVHN compared to existing regularization methods under standard semi-supervised image classification settings.
Semi-supervised learning (SSL) [ssl] is an effective learning framework for datasets that have large amounts of data and few labels. In practical situations, datasets are often only partially labeled, because labeling is in many cases more costly than collecting data. Thus, a powerful SSL framework that enhances model performance is needed.
Figure 1: Toy moons dataset. (a) Input data. (b) VAT with a small $\epsilon$. (c) VAT with a large $\epsilon$. (d) RAT with rotation.
Among the many SSL algorithms, virtual adversarial training (VAT) [vat2016, vat2018] is a successful one. It regularizes local distributional smoothness, by which we mean the robustness of the output distribution around each input datapoint against local perturbation, by using a virtual adversarial perturbation. The perturbation is computed as the noise that adversarially changes the output of the model, and VAT imposes consistency between the model outputs for the data and for the perturbed data. In other words, VAT forces the output distribution to be robust to perturbations within the $\epsilon$-ball centered on the data, where $\epsilon$ denotes the norm of the perturbation vector. The specific advantage of VAT is that it leverages the adversarial perturbation.
Although VAT demonstrates remarkable accuracy on standard benchmarks, it smooths only within the $\epsilon$-ball. This means that two datapoints belonging to the same class are not guaranteed to be classified into the same class if the distance between them is greater than $\epsilon$ (see Figure 1 (b)). However, if $\epsilon$ is too large, the noise-added data may cross the true class boundary, and inconsistency within the $\epsilon$-ball may occur (see Figure 1 (c)). To summarize, the $\epsilon$-ball, an isotropic hypersphere in the input space, is too generic to take the underlying data distribution into account. Our basic idea is to use adversarial transformations under which a datapoint is transformed to another point within the underlying distribution of the same class.
We propose a regularization framework, RAT (Regularization based on Adversarial Transformations), which regularizes the smoothness of the output distribution by utilizing adversarial transformations. The aim of RAT is to bring the outputs for data of the same class close together.
We justify RAT as a regularization for smoothing the output distribution, and propose the use of composite functions and a technique, $\epsilon$-rampup, that enlarge the area in which the smoothing is effective. To demonstrate the effectiveness of RAT, we compare it with baseline methods in experiments following the realistic evaluation setting proposed by Oliver et al. (2018), in which RAT outperforms the baselines.
We summarize our contributions as follows:
We propose the regularization framework based on adversarial transformations, which includes VAT as a special case. Unlike VAT, RAT can impose distributional smoothness along the underlying data distribution.
We claim that use of composite transformations, each of which leaves class label invariant, can further improve the smoothing effect, thanks to the higher degrees of freedom in adversarial transformation.
Moreover, we provide a technique that enhances the smoothing effect by ramping up $\epsilon$. The technique applies to both VAT and RAT, and enlarges the area in which the smoothing is effective.
RAT outperforms baseline methods in semi-supervised image classification tasks. In particular, RAT is robust against reduction of labeled samples, compared to other methods.
Virtual Adversarial Training: A Review
In this section, we review virtual adversarial training (VAT) [vat2016, vat2018]. VAT is similar to adversarial training [adv_exm_ian], but its aim is to regularize distributional smoothness. Miyato et al. (2016; 2018) argued for the importance of adversarial noise and supported the claim empirically by comparing VAT with random perturbation training. VAT has indeed shown state-of-the-art results on realistic benchmarks [realisticeval].
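Let $x \in \mathbb{R}^n$ be a data sample, where $n$ is the dimension of the input, and let $p(y \mid x; \theta)$ be a model distribution parameterized by $\theta$. The objective function of VAT in the SSL scenario is:
$$\mathcal{L}(X_l, Y_l, X_{ul}; \theta) = \mathcal{L}_{\mathrm{sup}}(X_l, Y_l; \theta) + \lambda\, \mathcal{L}_{\mathrm{smooth}}(X_{ul}; \theta)$$
where $\mathcal{L}_{\mathrm{sup}}$ denotes a supervised loss term. $X_l$, $Y_l$, and $X_{ul}$ are the labeled data, their labels, and the unlabeled data. $\lambda$ is a scaling parameter for regularization. $\mathcal{L}_{\mathrm{smooth}}$, the local distributional smoothness, is defined as follows:
$$\mathcal{L}_{\mathrm{smooth}}(X_{ul}; \theta) = \frac{1}{|X_{ul}|} \sum_{x \in X_{ul}} D(x, r_{\mathrm{vadv}}; \theta)$$
where $D$ and $r_{\mathrm{vadv}}$ are the KL divergence between the output distributions for an original datum and its noise-added counterpart, and the virtual adversarial perturbation, respectively. They are defined as follows:
$$D(x, r; \theta) = \mathrm{KL}\left[\, p(y \mid x; \theta) \,\|\, p(y \mid x + r; \theta) \,\right]$$
$$r_{\mathrm{vadv}} = \arg\max_{r} \left\{ D(x, r; \theta) \; ; \; \|r\|_2 \le \epsilon \right\}$$
where $\mathrm{KL}[\cdot \,\|\, \cdot]$ denotes the KL divergence and $\|\cdot\|_2$ denotes the L2 norm. $\epsilon$ is a hyperparameter that controls the range of the smoothing.
$r_{\mathrm{vadv}}$ emerges as an eigenvector of a Hessian matrix from the following derivation. Since $D(x, r; \theta)$ takes its minimum value at $r = 0$, $\nabla_r D(x, r; \theta)|_{r=0}$ is 0. Therefore, the second-order Taylor approximation of $D$ around $r = 0$ is:
$$D(x, r; \theta) \approx \frac{1}{2}\, r^\top H r, \qquad H = \nabla\nabla_r D(x, r; \theta)\big|_{r=0}$$
By this approximation, $r_{\mathrm{vadv}}$ is parallel to the eigenvector corresponding to the largest eigenvalue of $H$. Note that VAT assumes that $p(y \mid x; \theta)$ is differentiable with respect to $\theta$ and $x$ almost everywhere, and we also assume it in this paper.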
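As a concrete reading of the quantities reviewed above, the following minimal sketch evaluates the KL divergence between the model outputs for a datapoint and for its perturbed counterpart. The linear-softmax model, its weights `W`, and all function names are illustrative assumptions and are not the network used in the experiments.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, weights):
    """p(y | x) for a toy linear-softmax model; `weights` holds one row per class."""
    return softmax([sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights])

def kl_divergence(p, q):
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def smoothness(x, r, weights):
    """KL[p(y|x) || p(y|x+r)]: zero at r = 0 and non-negative elsewhere."""
    x_perturbed = [x_i + r_i for x_i, r_i in zip(x, r)]
    return kl_divergence(predict(x, weights), predict(x_perturbed, weights))

W = [[1.0, -1.0], [-1.0, 1.0]]   # hypothetical weights
x = [0.5, 0.2]
print(smoothness(x, [0.0, 0.0], W))    # no perturbation
print(smoothness(x, [0.1, -0.1], W))   # a perturbation that changes the output
```

Because the divergence is zero and minimal at the unperturbed point, its gradient there vanishes, which is the property the Taylor argument above relies on.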
Regularization Based on Adversarial Transformations
We propose the regularization framework, RAT, which imposes the consistency between output distributions with respect to datapoints belonging to the same class. Leveraging the power of adversariality, we introduce adversarial transformations that replace additive adversarial noises in VAT.
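To impose this consistency, we assume that each datapoint belonging to the $c$-th class lies in a class-specific subspace $\mathcal{M}_c$. We consider a generic transformation $T(x; \phi)$ parameterized by $\phi$: for any $x \in \mathcal{M}_c$, $T(x; \phi) \in \mathcal{M}_c$ for $\phi$ such that $\|\phi\| \le \epsilon$, where $\|\cdot\|$ denotes a valid norm with respect to $\phi$, such as the L1, L2, or operator norm. Our strategy is to regularize the output distribution by utilizing $T(x; \phi)$ instead of $x + r$. Note that we mainly consider image classification tasks in this work. Therefore, we deal with transformations such as spatial transformations or color distortion.
We define local distributional smoothness utilizing a transformation as follows:
$$\mathcal{L}^{T}_{\mathrm{smooth}}(X_{ul}; \theta) = \frac{1}{|X_{ul}|} \sum_{x \in X_{ul}} D_T(x, \phi_{\mathrm{adv}}; \theta)$$
where $D_T$ and $\phi_{\mathrm{adv}}$ are defined as follows:
$$D_T(x, \phi; \theta) = \mathrm{KL}\left[\, p(y \mid x; \theta) \,\|\, p(y \mid T(x; \phi); \theta) \,\right]$$
$$\phi_{\mathrm{adv}} = \arg\max_{\phi} \left\{ D_T(x, \phi; \theta) \; ; \; \|\phi\| \le \epsilon \right\} \tag{9}$$
We refer to $T(x; \phi_{\mathrm{adv}})$ as an adversarial transformation in this paper. Some adversarial attacks utilize functions instead of additive noise [adef, fourier_attack], and the relation between adversarial transformations and these attacks is similar to the relation between virtual adversarial perturbation and adversarial perturbation [adv_exm_ian].
We utilize $\mathcal{L}^{T}_{\mathrm{smooth}}$ for imposing the consistency. Thus, the objective function of RAT for the SSL scenario is:
$$\mathcal{L}(X_l, Y_l, X_{ul}; \theta) = \mathcal{L}_{\mathrm{sup}}(X_l, Y_l; \theta) + \lambda\, \mathcal{L}^{T}_{\mathrm{smooth}}(X_{ul}; \theta)$$
We can identify VAT as a special case of RAT: when $T(x; \phi) = x + \phi$ and $\|\cdot\|$ is the L2 norm, RAT is equal to VAT.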
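To compute the adversarial parameter $\phi_{\mathrm{adv}}$ of the transformation $T(x; \phi)$, we have to solve the maximization problem (9). However, it is difficult to solve exactly. Therefore, we approximate it using a Taylor approximation.
To efficiently approximate and solve (9), $T(x; \phi)$ needs to satisfy the following two conditions:
C1: $T(x; \phi)$ is differentiable with respect to $\phi$ almost everywhere.
C2: There is a parameter $\phi_{\mathrm{id}}$ that makes $T$ the identity transformation, $T(x; \phi_{\mathrm{id}}) = x$.
Under C1 and C2, the KL divergence $D_T(x, \phi; \theta)$ between the outputs for $x$ and $T(x; \phi)$ takes its minimum at $\phi = \phi_{\mathrm{id}}$, so its gradient there is 0. Taking $\phi$ to be measured from $\phi_{\mathrm{id}}$ (so that $\phi_{\mathrm{id}} = 0$, which holds for additive noise and can be arranged for other transformations by reparameterization), the second-order Taylor approximation gives:
$$\phi_{\mathrm{adv}} \approx \arg\max_{\phi} \left\{ \frac{1}{2}\, \phi^\top H_T\, \phi \; ; \; \|\phi\| \le \epsilon \right\}, \qquad H_T = \nabla\nabla_\phi D_T(x, \phi; \theta)\big|_{\phi = \phi_{\mathrm{id}}} \tag{11}$$
As in VAT, $\phi_{\mathrm{adv}}$ is also parallel to the eigenvector corresponding to the largest eigenvalue of $H_T$.
$T(x; \phi)$ can be any transformation that maps each class-specific subspace into itself and satisfies the conditions C1 and C2. In this paper, we use hand-crafted transformations that depend on the input data domain; for an image classification task, examples include image coordinate shifts, image resizing, and global color changes. In the next section, we propose the use of composite transformations to further enhance the smoothing effect.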
Use of Composite Transformations
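Let $\mathcal{T}$ be a set of functions, $T^{(1)}, \dots, T^{(K)}$, satisfying the conditions C1 and C2. $\mathcal{T}$ may contain composite functions of the form $T^{(1)} \circ T^{(2)} \circ \cdots \circ T^{(K)}$, where $T^{(k)} \in \mathcal{T}$ and $\phi^{(k)}$ denotes the parameter of $T^{(k)}$. It is obvious that these composite functions satisfy the conditions C1 and C2.
By including such composite functions, one obtains a much richer set of transformations that are still class-invariant. Each component is recovered as a special case, as given by the relation $T^{(1)} \circ \cdots \circ T^{(K)}(x) = T^{(j)}(x; \phi^{(j)})$ when $\phi^{(k)} = \phi^{(k)}_{\mathrm{id}}$ for all $k$ except for $k = j$. It is reasonable for RAT to utilize composite functions, because they lead to richer transformations and impose consistency over a wider range.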
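The reduction of a composite transformation to one of its components can be illustrated with a small sketch. It composes additive noise injection with channel-wise weighting on a toy image stored as a list of channels. The offset form `(1 + w_c)` for the channel weights is an assumption made here so that all identity parameters are zero, and the function names are hypothetical.

```python
def noise_injection(x, phi):
    """Additive noise: T(x; phi) = x + phi (identity at phi = 0)."""
    return [[v + p for v, p in zip(ch, ph)] for ch, ph in zip(x, phi)]

def channel_weighting(x, w):
    """Channel-wise scaling by (1 + w_c); the offset form is an assumption
    made here so that w = 0 is the identity."""
    return [[(1.0 + w_c) * v for v in ch] for ch, w_c in zip(x, w)]

def compose(x, w, phi):
    """Composite transformation: noise injection applied after channel weighting."""
    return noise_injection(channel_weighting(x, w), phi)

# A toy "image" with 2 channels and 3 pixels per channel.
x = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
zero_w = [0.0, 0.0]
zero_phi = [[0.0] * 3 for _ in range(2)]

identity = compose(x, zero_w, zero_phi)          # both components at identity
only_color = compose(x, [0.5, -0.5], zero_phi)   # reduces to channel weighting
```

Setting one component to its identity parameter leaves the other component unchanged, so the composite family contains each individual transformation.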
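Fast Approximation of $\phi_{\mathrm{adv}}$
Although the adversarial parameter $\phi_{\mathrm{adv}}$ emerges as the eigenvector corresponding to the largest eigenvalue of the Hessian matrix $H_T = \nabla\nabla_\phi D_T(x, \phi; \theta)|_{\phi = \phi_{\mathrm{id}}}$, as already described, the computational cost of an exact eigendecomposition is $O(m^3)$, where $m$ is the number of transformation parameters. There is a way to approximate the eigenvector with small computational cost [power_iteration, vat2016, vat2018].
We approximate the parameters with the power iteration and the finite difference method, just as VAT does. For each transformation $T^{(k)}$, we sample a random unit vector $u$ as the initial parameter and iteratively compute $u \leftarrow \overline{H_T u}$, where the overline denotes normalization to unit length. This makes $u$ converge to the eigenvector corresponding to the largest eigenvalue of $H_T$.
The Hessian-vector product is calculated with the finite difference method as follows:
$$H_T u \approx \frac{\nabla_\phi D_T(x, \phi; \theta)\big|_{\phi = \phi_{\mathrm{id}} + \xi u} - \nabla_\phi D_T(x, \phi; \theta)\big|_{\phi = \phi_{\mathrm{id}}}}{\xi} = \frac{\nabla_\phi D_T(x, \phi; \theta)\big|_{\phi = \phi_{\mathrm{id}} + \xi u}}{\xi}$$
with $\xi \neq 0$. We used the fact that $\nabla_\phi D_T(x, \phi; \theta)|_{\phi = \phi_{\mathrm{id}}} = 0$. We approximate $\phi_{\mathrm{adv}}$ by normalizing the norm of the approximated eigenvector as described in the next section. Note that we run just one power iteration in this paper, because it is reported in [vat2016, vat2018] that one iteration is sufficient for computing an accurate perturbation and that further iterations have no effect.
Although $\phi_{\mathrm{adv}}$ should satisfy $\|\phi_{\mathrm{adv}}\| \le \epsilon$, $\phi_{\mathrm{adv}}$ is actually given as a parameter satisfying $\|\phi_{\mathrm{adv}}\| = \epsilon$ when solving (11), because the Hessian of the KL divergence is positive semidefinite.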
In the case of VAT, smoothing with the virtual adversarial perturbation means that the model should satisfy consistency between the outputs for the original data and for data on the surface of the $\epsilon$-ball centered on the original data.
We propose a technique, $\epsilon$-rampup, that enhances the smoothing effect not only on the boundary but also inside it, at small computational cost. The technique ramps up $\epsilon$ from 0 to a predefined value during training, and the parameters of the adversarial transformation are determined by solving the maximization problem under the equality constraint that their norm equals the current value of $\epsilon$. One can approximately solve this by normalizing the approximated eigenvector so that its norm equals $\epsilon$ after the power iteration.
Although there are many techniques to ramp up or anneal parameters [meanteacher, cyclr], we adopt the procedure used in Mean Teacher [meanteacher]. Mean Teacher utilizes a sigmoid-shaped function, $\exp(-5(1 - s)^2)$ with $s \in [0, 1]$ the training progress, for ramping up the regularization coefficient, and we adopt it for $\epsilon$.
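The ramp-up schedule can be sketched as below. The function names and the example values (a maximum of 10 ramped up over 400,000 iterations) are illustrative; the schedule itself is the sigmoid-shaped function from Mean Teacher.

```python
import math

def sigmoid_rampup(step, rampup_length):
    """exp(-5 (1 - s)^2) with s = step / rampup_length clipped to [0, 1]."""
    if rampup_length <= 0:
        return 1.0
    s = min(max(step / rampup_length, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - s) ** 2)

def epsilon_at(step, max_epsilon, rampup_length):
    """Perturbation range used at a given training step."""
    return max_epsilon * sigmoid_rampup(step, rampup_length)

for step in (0, 100_000, 200_000, 400_000):
    print(step, epsilon_at(step, max_epsilon=10.0, rampup_length=400_000))
```

Note that this function starts at $\exp(-5)$ times the maximum, which is small but not exactly 0, and reaches the maximum at the end of the ramp-up period.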
We show the pseudocode of the generation process of the adversarial transformation parameters in Algorithm 1.
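The generation process described in this section (random unit vector, power iteration with a finite-difference Hessian-vector product, normalization to the current $\epsilon$) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: it uses the L2 norm, takes the identity parameters to be zero, and expects a user-supplied gradient function `grad_fn`, which in practice would come from automatic differentiation. The quadratic toy objective is hypothetical and chosen so that the dominant eigenvector is known.

```python
import math
import random

def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def adversarial_parameters(grad_fn, dim, epsilon, xi=1e-6, n_iter=1, rng=None):
    """Approximate the maximizer of D over ||phi||_2 = epsilon.

    grad_fn(phi) returns the gradient of D at phi, where D has its minimum
    (zero gradient) at the identity parameters phi = 0.
    """
    rng = rng or random.Random(0)
    u = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for _ in range(n_iter):
        # Finite-difference Hessian-vector product: H u ~= grad D(xi * u) / xi.
        hu = [g / xi for g in grad_fn([xi * c for c in u])]
        u = l2_normalize(hu)
    # Rescale the approximated eigenvector so that its norm equals epsilon.
    return [epsilon * c for c in u]

# Toy check on D(phi) = 0.5 * phi^T H phi with H = diag(3, 1),
# whose dominant eigenvector is (1, 0) up to sign.
def toy_grad(phi):
    return [3.0 * phi[0], 1.0 * phi[1]]

phi_adv = adversarial_parameters(toy_grad, dim=2, epsilon=0.5, n_iter=20)
```

Twenty iterations are used in the toy check only to make convergence visible; the text above uses a single iteration.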
Evaluation on Synthetic Dataset
We show the smoothing effect of RAT with a toy problem, a moons dataset, in Figure 1. We make 10 labeled samples and 30 unlabeled samples for each moon. The model $p(y \mid x; \theta)$ consists of a three-layer neural network with ReLU non-linearity [relu]. All hidden layers have 128 units. $\theta$ is optimized with the Adam optimizer for 500 iterations with the default parameters suggested in [adam]. In each iteration, we use all samples for updating $\theta$. We treat this toy problem as if we already knew appropriate class-invariant transformations; we adopt class-wise rotation along each moon as the transformation for illustration purposes. Note that we do not ramp up $\epsilon$ in this experiment.
VAT with a small $\epsilon$ draws a decision boundary that crosses the class-specific subspace $\mathcal{M}_c$ (each moon), as shown in Figure 1 (b). When we adopt a larger $\epsilon$, VAT cannot smooth the output distribution within the $\epsilon$-ball, as shown in Figure 1 (c), because the larger $\epsilon$ allows unexpected perturbations that move a datapoint out of $\mathcal{M}_c$, which causes inconsistency. On the other hand, RAT draws the decision boundary along $\mathcal{M}_c$.
In this toy problem, we utilize knowledge of $\mathcal{M}_c$, which is equivalent to using the label information implicitly. Therefore, in the next section, we evaluate RAT in realistic situations, where we do not know $\mathcal{M}_c$.
We evaluate the effectiveness of RAT and $\epsilon$-rampup through three experiments on a semi-supervised image classification task: (i) evaluation of composite transformations, (ii) evaluation of $\epsilon$-rampup for VAT and RAT, and (iii) comparison of RAT to baseline methods. As the baseline methods, we use $\Pi$-Model [pimodel1, pimodel2], Pseudo-label [pseudo_label], Mean Teacher [meanteacher], and VAT [vat2018]. Note that VAT utilizes entropy regularization in all experiments. To evaluate in realistic situations, we follow the setting proposed by Oliver et al. (2018). We use PyTorch [pytorch] to implement and evaluate the SSL algorithms, and we carefully reproduced the results of Oliver et al. (2018). All hyperparameters for the SSL algorithms are the same as in Oliver et al. (2018), except that we do not use L1 and L2 regularization.
For all experiments, we used the same Wide ResNet architecture, depth 28 and width 2 [wrn], and we use the CIFAR-10 [cifar] and SVHN [svhn] datasets for evaluation. CIFAR-10 has 50,000 training data and 10,000 test data, and we split training data into a train/validation set, 45,000 data for training and 5,000 data for validation. SVHN has 73,257 data for training and 26,032 data for testing. We also split training data into 65,931 data for training and 7,326 data for validation. For the semi-supervised setting, we further split training data into labeled data and unlabeled data.
We utilize standard preprocessing and data augmentations for training, following Oliver et al. (2018). For SVHN, we normalize the pixel values into the range $[-1, 1]$, and use random translation by up to 2 pixels as data augmentation. For CIFAR-10, we apply ZCA normalization [cifar] and global contrast normalization as normalization, and random horizontal flipping, random translation by up to 2 pixels, and Gaussian noise injection with standard deviation 0.15 as data augmentation.
We report the mean and standard deviation of error rates over five trials with test sets. The test error rates are evaluated with the model that has the minimum error rate on validation sets. The evaluation on validation set is executed every 25,000 training iterations.
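The reporting protocol can be sketched as below: within each trial, the test error is read at the evaluation point with the lowest validation error, and the mean and standard deviation are then taken over trials. The function names and the numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics

def select_test_error(checkpoints):
    """checkpoints: (validation_error, test_error) pairs, one per evaluation point."""
    best = min(checkpoints, key=lambda c: c[0])
    return best[1]

def summarize(trials):
    """Mean and sample standard deviation of the selected test errors over trials."""
    errors = [select_test_error(t) for t in trials]
    return statistics.mean(errors), statistics.stdev(errors)

# Two hypothetical trials with two evaluation points each (error rates in %).
trials = [
    [(12.0, 12.5), (11.0, 11.4)],
    [(11.5, 11.8), (11.9, 11.2)],
]
mean_error, std_error = summarize(trials)
```

Selecting by validation error rather than by test error keeps the reported test numbers free of selection bias; in the second hypothetical trial the lower test error (11.2) is not reported because its validation error is higher.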
Implementation Details of RAT
We tested three types of data transformations, all of which are commonly used in data augmentation in image classification tasks. All transformations discussed below satisfy the conditions C1 and C2. We evaluated different types of composite transformations as is discussed below.
The noise injection is represented as $T(x; \phi) = x + \phi$. We define the norm for the parameters of the noise injection as the L2 norm, $\|\phi\|_2$. This is equal to the formulation of VAT.
We consider three spatial transformations with different degrees of freedom: affine transformation, thin plate spline [tps], and flow field. All these transformations shift the pixel position $(u, v)$ by an offset vector $(\delta u, \delta v)$ to give $(u + \delta u, v + \delta v)$. The pixel values of the transformed image are calculated by bilinear interpolation. The details of these transformations are provided in [stn].
The difference between transformations is the degrees of freedom to calculate the offset vectors as follows: affine transformation has six parameters, thin plate spline has parameters proportional to the number of control points , and the flow field directly has the local offset vectors as parameters, meaning that the number of parameters of the flow field is proportional to the spatial resolution of the image. We set to in all experiments, which means we employ a grid.
We define the norm for the parameters of the flow field transformations as the L2 norm, , where is the number of pixels. The norm for thin plate spline is also defined as the L2 norm of offset vectors for the control points.
Affine transformation is a linear operator in homogeneous coordinates, and the norm is given as the operator norm. Thus, the norm of its parameters is calculated as the maximum singular value of an affine transformation matrix.
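For illustration, the maximum singular value of a small matrix can be computed without external libraries by power iteration on $A^\top A$, as in the sketch below. Which matrix is measured (the full affine matrix or its deviation from the identity) is not specified here, so the inputs are plain example matrices; the function name is hypothetical.

```python
import math

def max_singular_value(a, n_iter=100):
    """Largest singular value of `a` (list of rows) via power iteration on a^T a."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        av = [sum(r_j * v_j for r_j, v_j in zip(row, v)) for row in a]
        atav = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in atav))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in atav]
    av = [sum(r_j * v_j for r_j, v_j in zip(row, v)) for row in a]
    return math.sqrt(sum(c * c for c in av))

print(max_singular_value([[0.0, 2.0], [1.0, 0.0]]))   # singular values are 2 and 1
```

The fixed uniform starting vector keeps the sketch deterministic; it would fail to converge only if that vector were orthogonal to the dominant right singular vector, in which case a random start is the usual remedy.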
Color distortion is an effective augmentation method for image classification tasks. Among many methods for color distortion, we use a simple one, channel-wise weighting, $\tilde{x}_{c,i} = w_c\, x_{c,i}$, where $x_{c,i}$ is the pixel value of the $c$-th channel of the $i$-th pixel, and $w_c$ is the scalar value for each channel. This transformation is a linear operator, and we define the norm as the operator norm. Note that channel-wise weighting is represented as the multiplication of a diagonal matrix and a pixel value, and the operator norm is calculated as $\max_c |w_c|$.
Evaluation of Composite Transformations
Since the performance of RAT depends on the combination of the transformations, we report the effect of a combination of functions by adding transformations to VAT.
We first seek a good $\epsilon$ for each transformation with a grid search on CIFAR-10 with 4,000 labeled data: from 0.001 to 0.01 with a 0.001 step size for channel-wise weighting, from 0.1 to 1 with a 0.1 step size for affine transformation and thin plate spline, and from 0.01 to 0.1 with a 0.01 step size for flow field. We show the grid search results in Table 1. Other parameters, including those for optimization, are the same as those suggested for VAT in [realisticeval]. We summarize the parameters in Table 2. Note that $\epsilon$-rampup is not utilized in this experiment, because this experiment explores the effect of composite transformations.
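The search ranges stated above can be enumerated as in the following sketch, which gives ten candidate values per transformation. The helper name `epsilon_grid` is hypothetical, and rounding is used only to suppress floating-point drift.

```python
def epsilon_grid(start, stop, step):
    """Inclusive grid from start to stop with the given step size."""
    n = round((stop - start) / step) + 1
    return [round(start + i * step, 10) for i in range(n)]

grids = {
    "channel-wise weighting": epsilon_grid(0.001, 0.01, 0.001),
    "affine transformation": epsilon_grid(0.1, 1.0, 0.1),
    "thin plate spline": epsilon_grid(0.1, 1.0, 0.1),
    "flow field": epsilon_grid(0.01, 0.1, 0.01),
}

for name, values in grids.items():
    print(name, len(values), values[0], values[-1])
```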
|Thin plate spline||1|
|Initial learning rate||0.003|
|Max consistency coefficient||0.3|
|$\epsilon$ for noise injection||6.0 or 1.0|
|Entropy penalty multiplier||0.06|
The results of adopting various transformations with CIFAR-10 with 4,000 labeled data and SVHN with 1,000 labeled data are shown in Table 3.
|Methods||Channel-wise||Affine||Thin Plate Spline||Flow Field||CIFAR-10||SVHN|
All transformations and combinations improve on supervised learning. However, channel-wise weighting and flow field increase the test error rates relative to VAT on CIFAR-10. For CIFAR-10, since we apply the normalizations, channel-wise weighting in the space of normalized data is an unnatural transformation for natural images. Flow field, on the other hand, is a transformation that can break object structure. Unlike the simple structure of the data in SVHN, detailed structure is an important feature of general objects. Thus, flow field has an unfavorable effect.
Affine transformation achieves the best performance of all spatial transformations, which we attribute to its low degrees of freedom, the lowest among the three. In particular, the affine transformation preserves points, straight lines, and planes. In other words, it preserves the basic structure of objects. Therefore, except for extreme cases, the affine transformation ensures that the class of the transformed data is the same as the class of the original data. This matches the strategy of RAT, namely that the output distributions of data belonging to the same class should be close.
Figure 2: Test error rates with a varying number of labeled data. (a) SVHN. (b) CIFAR-10.
Combining channel-wise weighting with each spatial transformation has little additional effect. This means that adding channel-wise weighting does not expand the smoothing effect to a meaningful range. Indeed, the difference made by adding channel-wise weighting is within the standard deviation.
Evaluation of $\epsilon$-Rampup
We evaluate the effectiveness of $\epsilon$-rampup on CIFAR-10 with 4,000 labeled data. Since $\epsilon$-rampup is a technique for both VAT and RAT, we compare the results with and without ramping up for VAT and RAT. We ramp up $\epsilon$ for 400,000 iterations. We utilize the composite transformation consisting of affine transformation and noise injection for RAT, and the hyperparameters are as in Table 2.
The results are shown in Table 4. For both VAT and RAT, $\epsilon$-rampup has a positive effect. Interestingly, $\epsilon$-rampup allows for a large $\epsilon$. Since ramping up extends the smoothing effect throughout the range of $\epsilon$, VAT and RAT with ramping up work well with a relatively large $\epsilon$.
|VAT w/ $\epsilon$-rampup||10||13.26 ± 0.20%|
|RAT w/ $\epsilon$-rampup||(6, 1)||11.26 ± 0.34%|
Comparison of RAT to Baselines
We show the effectiveness of RAT by comparing it with the baseline methods and with a non-adversarial version of RAT (Random Transformation). We use the composite transformation consisting of affine transformation and noise injection for RAT. The hyperparameters of RAT and $\epsilon$ for the transformations are the same as shown in Tables 2 and 1, respectively, except that we set $\epsilon$ for affine transformation to 1 for RAT with ramping up on CIFAR-10, and set $\epsilon$ for noise injection to 5 for RAT and VAT with $\epsilon$-rampup on SVHN. Note that all the parameters of the random transformation are the same as those of RAT.
In Table 5, we show the comparison results with standard SSL settings, CIFAR-10 with 4,000 labeled data and SVHN with 1,000 labeled data.
|Methods||CIFAR-10, 4,000 Labels||SVHN, 1,000 Labels|
|VAT w/ $\epsilon$-rampup||13.26 ± 0.20%||5.17 ± 0.26%|
|RAT w/ $\epsilon$-rampup||11.26 ± 0.34%||2.86 ± 0.07%|
On both datasets, RAT improves test error rates by more than 2% over the best baseline method, VAT. Furthermore, RAT also improves the error rates over the random transformation, especially on the SVHN dataset. The results demonstrate the importance of adversariality.
We evaluate the test error rates of each method with a varying number of labeled data from 250 to 8,000. The results are shown in Figure 2. RAT consistently outperforms the other methods over the whole range on both datasets. Remarkably, RAT significantly improves the test error rate on CIFAR-10 with 250 labeled data compared with the best result of the baseline methods, from 50.20 ± 1.88% to 36.31 ± 2.03%. We believe that the improvement results from appropriately smoothed model prediction along the underlying data distribution. The experimental fact of VAT’s underperformance, namely a 15% degraded CIFAR-10 test error rate compared to RAT, is a clear indication that adversarial, class-invariant transformation provides far better consistency regularization than isotropic, adversarial noise superposition.
Lastly, we compare our results with the very recently reported results of a Mixup-based SSL method, MixMatch [mixmatch]. Interestingly, RAT is comparable or superior to MixMatch on SVHN, while MixMatch is superior to RAT on CIFAR-10. However, we should point out that this comparison does not seem fair, because the experimental settings of MixMatch differ in several respects from ours and from the others shown in Figure 2. In the MixMatch paper, the authors took an exponential moving average of models, and the final results were given by the median of the last 20 test error rates measured at regular intervals during training. These settings, which are absent from our experiments, seem to partly boost their performance.
There are many SSL algorithms such as graph-based methods [gcn, label_prop], generative model-based methods [semi_vae, semi_gan, mgan], and regularization-based methods [vat2016, pimodel1, pimodel2, meanteacher, mixmatch].
Label propagation [label_prop] is a representative graph-based SSL method. Modern graph-based approaches utilize neural networks for graphs [gcn]. Graph-based methods are effective when an elaborate graph structure is given.
Generative models such as VAE and GAN are now popular frameworks for the SSL setting [semi_vae, semi_gan]. Although generative model-based SSL methods typically have to train additional models, they tend to show remarkable gain in test performance.
Regularization-based methods are comparatively much more tractable, and can be utilized in arbitrary models. Next, we review three regularization-based methods closely related to RAT.
Consistency regularization is a method that imposes consistency between the outputs of one model for a typically unlabeled datum and its perturbed counterpart, or between the outputs of two models for the same input. One of the simplest ways of constructing consistency regularization is to add a stochastic perturbation $\delta$ to the data, as follows:
$$\mathbb{E}_{\delta}\left[\, d\big(p(y \mid x; \theta),\; p(y \mid x + \delta; \theta)\big) \,\right]$$
where $d(\cdot, \cdot)$ is some distance function, e.g., the Euclidean distance or the KL divergence. VAT [vat2016, vat2018] and RAT are classified in this category.
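A single-sample estimate of this regularizer can be sketched as follows, using Gaussian noise as the stochastic perturbation and the squared Euclidean distance as the distance function. The two-class `toy_model`, the noise level, and the function names are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(x):
    """Hypothetical two-class model used only for illustration."""
    return softmax([x[0] - x[1], x[1] - x[0]])

def consistency_loss(model, x, noise_std, rng):
    """Squared Euclidean distance between outputs for x and a noisy copy of x."""
    x_noisy = [v + rng.gauss(0.0, noise_std) for v in x]
    p, q = model(x), model(x_noisy)
    return sum((a - b) ** 2 for a, b in zip(p, q))

rng = random.Random(0)
loss = consistency_loss(toy_model, [0.5, 0.2], noise_std=0.15, rng=rng)
```

Replacing the random draw with an adversarially chosen perturbation or transformation parameter turns this random-perturbation regularizer into the VAT or RAT form.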
Among the random transformation-based consistency regularization techniques, the work of [pimodel1] and the “$\Pi$-Model” [pimodel2] are very similar and well known. We refer to both as $\Pi$-Model in this paper. One can view $\Pi$-Model as a non-adversarial version of RAT. Therefore, by comparing RAT with $\Pi$-Model, we can see the importance of the adversarial transformations.
The other way of constructing consistency regularization is to utilize dropout [dropout]. Let $\theta'$ and $\theta''$ be the randomly selected parameters through dropout. Dropout as consistency regularization is represented as follows:
$$d\big(p(y \mid x; \theta'),\; p(y \mid x; \theta'')\big)$$
Mean Teacher [meanteacher] is a successful method that employs consistency regularization between two models. It makes a teacher model by exponential moving average of the parameters of a student model, and imposes the consistency between the teacher and the student. Although Mean Teacher can be combined with other SSL methods, the combination sometimes impairs the model performance as reported in [mixmatch].
VAT [vat2016, vat2018] is a very effective consistency regularization method. The advantage of VAT lies in the generation of adversarial noise, and the adversariality leads to isotropic smoothness around sampled datapoints. VAT has also been shown to be effective in natural language processing tasks [nvat].
Entropy regularization is a way to impose low entropy on $p(y \mid x; \theta)$ to make the model prediction more discriminative, and is known to give low-density separation [entmin, ssl, ict]. The entropy regularization term is:
$$-\sum_{y} p(y \mid x; \theta) \log p(y \mid x; \theta)$$
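As a small numerical illustration of the entropy term, the sketch below computes the entropy of a predicted class distribution; the function name is hypothetical. A confident (one-hot) prediction has zero entropy, while a uniform prediction over $K$ classes attains the maximum $\log K$.

```python
import math

def prediction_entropy(p):
    """Entropy of a class distribution; terms with zero probability contribute 0."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)

print(prediction_entropy([1.0, 0.0, 0.0]))           # confident prediction
print(prediction_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform prediction
```

Minimizing this quantity on unlabeled data therefore pushes predictions away from the uniform distribution and toward confident outputs.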
This regularization is often combined with other SSL algorithms [pimodel1, vat2018] and a combined method, VAT+entropy regularization, shows the state-of-the-art results in [realisticeval].
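Mixup [mixup] is a powerful regularization method that has very recently been used for SSL [manifold_mixup, mixmatch]. Mixup blends two different data, $x_1$ and $x_2$, and their labels, $y_1$ and $y_2$, as follows:
$$\tilde{x} = \beta x_1 + (1 - \beta)\, x_2, \qquad \tilde{y} = \beta y_1 + (1 - \beta)\, y_2$$
where $\beta$ is a scalar value sampled from a Beta distribution. In a semi-supervised setting, $\tilde{y}$ is calculated as a blend between a label and a prediction, $\beta y_1 + (1 - \beta)\, p(y \mid x_2; \theta)$, or between predictions, $\beta\, p(y \mid x_1; \theta) + (1 - \beta)\, p(y \mid x_2; \theta)$, and the regularization term is described as consistency regularization, $d\big(p(y \mid \tilde{x}; \theta),\; \tilde{y}\big)$.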
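The blending step can be sketched as below for vector-valued inputs and one-hot labels. The symmetric Beta(alpha, alpha) choice and the value `alpha=0.75` are assumptions for illustration, since the text only states that the coefficient is sampled from a Beta distribution; the function names are hypothetical.

```python
import random

def sample_coefficient(alpha, rng):
    """Mixing coefficient drawn from a symmetric Beta(alpha, alpha) distribution."""
    return rng.betavariate(alpha, alpha)

def mixup(x1, y1, x2, y2, beta):
    """Convex combination of two inputs and of their label (or prediction) vectors."""
    x = [beta * a + (1.0 - beta) * b for a, b in zip(x1, x2)]
    y = [beta * a + (1.0 - beta) * b for a, b in zip(y1, y2)]
    return x, y

rng = random.Random(0)
beta = sample_coefficient(alpha=0.75, rng=rng)
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], beta)
```

In the semi-supervised case, `y1` or `y2` would be replaced by model predictions for unlabeled inputs, and the blended target would be compared with the prediction for the blended input.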
We proposed an SSL framework called RAT, a Regularization framework based on Adversarial Transformations. RAT aims to smooth the model output along the underlying data distribution within a given class, building on recent advances in the generation of adversarial inputs from unlabeled data. Instead of just superposing adversarial noise, RAT uses a wider range of data transformations, each of which leaves the class label invariant. We further proposed the use of composite transformations and a technique called $\epsilon$-rampup to enlarge the area in which the smoothing is effective without sacrificing computational cost. We experimentally showed that RAT significantly outperforms the baseline methods, including VAT, in semi-supervised image classification tasks. RAT is especially robust against reduction of labeled samples compared to other methods. As future work, we would like to replace the manual design of composite functions with black-box function optimization.
Appendix: Adversarial Transformations for Semi-Supervised Learning
Hyperparameters for all baseline methods except for VAT are shown in Table 6. VAT and RAT hyperparameters are given in the main text.
Table 6: Hyperparameters for the baseline methods, grouped by method as in Oliver et al. (2018).
|Shared|
|Learning rate decayed by a factor of||0.2|
|at training iteration||400,000|
|Consistency coefficient rampup||200,000|
|Supervised|
|Initial learning rate||0.003|
|$\Pi$-Model|
|Initial learning rate||0.0003|
|Max consistency coefficient||20|
|Mean Teacher|
|Initial learning rate||0.0004|
|Max consistency coefficient||8|
|Exponential moving average decay||0.95|
|Pseudo-label|
|Initial learning rate||0.003|
|Max consistency coefficient||1.0|
B. Comparison of RAT to MixMatch
We compare RAT with a Mixup-based method, MixMatch [mixmatch], in Tables 7 and 8. MixMatch is an unrefereed state-of-the-art method at the time of submission. MixMatch combines various semi-supervised learning (SSL) mechanisms, such as consistency regularization, pseudo-labeling, and mixup. As mentioned in the main text, the comparisons in Tables 7 and 8 are not exactly fair. In the MixMatch paper, the authors took an exponential moving average of models, and the final results were given by the median of the last 20 test error rates measured at regular intervals during training. These settings are absent from our experiments.
We show the results on SVHN, varying the number of labeled data in Table 7. MixMatch results are taken from [mixmatch]. Although the experiment setting is slightly different, RAT outperforms MixMatch except for the 4000 labeled setting. The test error rates of RAT increase from 2,000 labeled data to 8,000 labeled data settings. The hyperparameters of RAT were determined on CIFAR-10 with 4,000 labeled data and did not consider SVHN. Nevertheless, the test error rates of RAT on SVHN 4000 labeled are the same as MixMatch.
We also show the results on CIFAR-10 in Table 8. MixMatch demonstrated remarkable results.
Since RAT and Mixup are compatible, RAT has the potential to improve test error rates by Mixup. However, exploring the combination of SSL methods is outside the scope of this paper. We will evaluate RAT+Mixup in future work.
|labels||MixMatch||RAT w/ $\epsilon$-rampup|
|MixMatch w/o Mixup||39.11||10.97|
|RAT w/ $\epsilon$-rampup||36.31 ± 2.03||11.26 ± 0.34|
C. Test error rates of Figure 2
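|Methods||250||500||1,000||2,000||4,000||8,000|
|RAT w/ $\epsilon$-rampup (CIFAR-10)||36.31 ± 2.03||27.17 ± 0.95||21.38 ± 1.74||15.38 ± 0.42||11.26 ± 0.34||10.51 ± 0.21|
|RAT w/ $\epsilon$-rampup (SVHN)||3.30 ± 0.08||3.14 ± 0.11||2.86 ± 0.07||2.78 ± 0.11||2.89 ± 0.05||2.99 ± 0.05|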