Adversarial Transformations for Semi-Supervised Learning


Teppei Suzuki and Ikuro Sato
DENSO IT LABORATORY, INC.
Tokyo, Japan
{tsuzuki,isato}@d-itlab.co.jp
Abstract

We propose a Regularization framework based on Adversarial Transformations (RAT) for semi-supervised learning. RAT is designed to enhance the robustness of the output distribution of class predictions for given data against input perturbations. RAT extends Virtual Adversarial Training (VAT) in the sense that RAT adversarially transforms data along the underlying data distribution using a rich set of data transformation functions that leave the class label invariant, whereas VAT simply produces adversarial additive noise. In addition, we verify that gradually increasing the perturbation region during training further improves robustness. In experiments, we show that RAT significantly improves classification performance on CIFAR-10 and SVHN compared to existing regularization methods under standard semi-supervised image classification settings.

Introduction

Semi-supervised learning (SSL) [ssl] is an effective learning framework for datasets that have large amounts of data but few labels. In practical situations, datasets are often only partially labeled, because labeling is usually more costly than collecting data. Thus, a powerful SSL framework that enhances model performance is needed.

Figure 1: Comparison between VAT and RAT on the moons dataset: (a) input data, (b) VAT with a small ε, (c) VAT with a large ε, (d) RAT with rotation. Colored regions of (b), (c), and (d) correspond to prediction confidence. Each moon has 10 labeled data points (dark points in (a)) and 30 unlabeled data points (light points in (a)), and each point is randomly generated from the semicircles in (a) with Gaussian noise. RAT uses noise injection and rotation along the blue and red semicircles in (a) as transformations. The settings of (b) and (d) are the same except that (d) uses rotation.

Among the many SSL algorithms, virtual adversarial training (VAT) [vat2016, vat2018] is a successful one. It regularizes local distributional smoothness, by which we mean the robustness of the output distribution around each input datapoint against local perturbation, by using a virtual adversarial perturbation. The perturbation is calculated as the noise that adversarially changes the output of the model, and VAT imposes consistency between the outputs of the model for the data and for the perturbed data. This means that VAT enforces the output distribution to be robust with respect to perturbations within the ε-ball centered on the data, where ε denotes the norm of the perturbation vector. The specific advantage of VAT is that it leverages the adversarial perturbation.

Although VAT demonstrates remarkable accuracy on standard benchmarks, VAT allows smoothing only within the ε-ball. This means that there is no guarantee that two datapoints are classified into the same class even if they belong to the same class, if the distance between them is greater than ε (see Figure 1 (b)). However, if ε is too large, the noise-added data may penetrate the true class boundary, and inconsistency within the ε-ball may occur (see Figure 1 (c)). To summarize, the ε-ball, which is an isotropic hypersphere in the input space, is too generic to take the underlying data distribution into account. Our basic idea is to use adversarial transformations under which a datapoint transforms to another point within the underlying distribution of the same class.

We propose a regularization framework, RAT (Regularization based on Adversarial Transformations), which regularizes the smoothness of the output distribution by utilizing adversarial transformations. The aim of RAT is to make the outputs for data of the same class close to each other.

We justify RAT as a regularization for smoothing the output distribution, and propose the use of composite functions and a technique, ε-rampup, that enlarges the area in which the smoothing is effective. To demonstrate the effectiveness of RAT, we compare it with baseline methods in experiments following the evaluation setting proposed by Oliver et al. (2018), and RAT outperforms the baseline methods.

We summarize our contributions as follows:

  • We propose the regularization framework based on adversarial transformations, which includes VAT as a special case. Unlike VAT, RAT can impose distributional smoothness along the underlying data distribution.

  • We claim that use of composite transformations, each of which leaves class label invariant, can further improve the smoothing effect, thanks to the higher degrees of freedom in adversarial transformation.

  • Moreover, we provide a technique to enhance the smoothing effect by ramping up ε. The technique applies to both VAT and RAT, and enlarges the area in which the smoothing is effective.

  • RAT outperforms baseline methods in semi-supervised image classification tasks. In particular, RAT is robust against reduction of labeled samples, compared to other methods.

Virtual Adversarial Training: A Review

In this section, we review virtual adversarial training (VAT) [vat2016, vat2018]. VAT is similar to adversarial training [adv_exm_ian], but the aim of VAT is to regularize distributional smoothness. Miyato et al. (2016; 2018) claimed the importance of the adversarial noise and demonstrated it by comparing VAT with random perturbation training. VAT has indeed shown state-of-the-art results on realistic benchmarks [realisticeval].

Let $x \in \mathbb{R}^{I}$ be a data sample, where $I$ is the dimension of the input, and let $p(y|x, \theta)$ be a model distribution parameterized by $\theta$. The objective function of VAT in the SSL scenario is:

$$\mathcal{L} = \ell(\mathcal{D}_l, \mathcal{Y}_l, \theta) + \lambda \frac{1}{|\mathcal{D}_l \cup \mathcal{D}_{ul}|} \sum_{x \in \mathcal{D}_l \cup \mathcal{D}_{ul}} \mathrm{LDS}(x, \theta), \qquad (1)$$

where $\ell$ denotes a supervised loss term; $\mathcal{D}_l$, $\mathcal{Y}_l$, and $\mathcal{D}_{ul}$ are the labeled data, their labels, and the unlabeled data, respectively; and $\lambda$ is a scaling parameter for the regularization. $\mathrm{LDS}(x, \theta)$, the local distributional smoothness, is defined as follows:

$$\mathrm{LDS}(x, \theta) = D(x, r_{\mathrm{adv}}, \theta), \qquad (2)$$

where $D(x, r, \theta)$ and $r_{\mathrm{adv}}$ are the KL divergence between the output distributions with respect to an original data point and a noise-added data point, and the virtual adversarial perturbation, respectively. They are defined as follows:

$$D(x, r, \theta) = D_{\mathrm{KL}}\big(p(y|x, \theta)\,\|\,p(y|x + r, \theta)\big), \qquad (3)$$
$$r_{\mathrm{adv}} = \arg\max_{r;\,\|r\|_2 \le \epsilon} D(x, r, \theta), \qquad (4)$$

where $D_{\mathrm{KL}}(\cdot\|\cdot)$ denotes the KL divergence and $\|\cdot\|_2$ denotes the L2 norm. $\epsilon$ is a hyperparameter that controls the range of the smoothing.

$r_{\mathrm{adv}}$ emerges as the eigenvector of a Hessian matrix through the following derivation. Since $D(x, r, \theta)$ takes its minimum value at $r = 0$, the gradient $\nabla_r D(x, r, \theta)|_{r=0}$ is 0. Therefore, the second-order Taylor approximation of $D(x, r, \theta)$ around $r = 0$ is:

$$D(x, r, \theta) \approx \frac{1}{2} r^\top H(x, \theta)\, r, \quad H(x, \theta) = \nabla_r^2 D(x, r, \theta)\big|_{r=0}. \qquad (5)$$

We can describe (4) using (5) as follows:

$$r_{\mathrm{adv}} \approx \arg\max_{r;\,\|r\|_2 \le \epsilon} \frac{1}{2} r^\top H(x, \theta)\, r. \qquad (6)$$

By this approximation, $r_{\mathrm{adv}}$ is parallel to the eigenvector corresponding to the largest eigenvalue of $H(x, \theta)$. Note that VAT assumes that $p(y|x, \theta)$ is differentiable with respect to $x$ and $\theta$ almost everywhere, and we also assume this in this paper.
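To make the approximation above concrete, the following is a minimal PyTorch sketch (ours, not the authors' implementation) of the standard one-step power-iteration estimate of the virtual adversarial perturbation; `model`, `xi`, and the helper names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kl_div_from_logits(p_logits, q_logits):
    # KL(p || q) for categorical distributions given by two logit tensors.
    p = F.softmax(p_logits, dim=1)
    return (p * (F.log_softmax(p_logits, dim=1) - F.log_softmax(q_logits, dim=1))).sum(dim=1).mean()

def _l2_normalize(d):
    # Normalize each sample's perturbation to unit L2 norm.
    return d / (d.flatten(1).norm(dim=1).view(-1, *([1] * (d.dim() - 1))) + 1e-12)

def virtual_adversarial_perturbation(model, x, eps, xi=1e-6):
    """Approximate r_adv in (4)-(6) with one power iteration and finite differences."""
    with torch.no_grad():
        p_logits = model(x)                      # outputs for the clean data (fixed target)
    d = _l2_normalize(torch.randn_like(x))       # random unit initial direction
    d.requires_grad_(True)
    dist = kl_div_from_logits(p_logits, model(x + xi * d))
    grad = torch.autograd.grad(dist, d)[0]       # proportional to H d, since the gradient at r = 0 is 0
    return eps * _l2_normalize(grad.detach())    # scale to the surface of the eps-ball
```

In training, the consistency term $D(x, r_{\mathrm{adv}}, \theta)$ would then be evaluated with this perturbation held fixed, i.e., without backpropagating through $r_{\mathrm{adv}}$.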

Regularization Based on Adversarial Transformations

We propose the regularization framework RAT, which imposes consistency between the output distributions of datapoints belonging to the same class. Leveraging the power of adversariality, we introduce adversarial transformations that replace the additive adversarial noise of VAT.

To impose this consistency, we assume that each datapoint belonging to the $c$-th class lies in a class-specific subspace $X_c$. We consider a generic transformation $T(x, \theta_T)$ parameterized by $\theta_T$ such that $T(x, \theta_T) \in X_c$ for any $x \in X_c$ and $\|\theta_T\| \le \epsilon$, where $\|\cdot\|$ denotes a valid norm with respect to $\theta_T$, such as the L1, L2, or operator norm. Our strategy is to regularize the output distribution by utilizing $T(x, \theta_T)$ instead of the additive perturbation $x + r$. Note that we mainly consider image classification tasks in this work; therefore, we deal with transformations such as spatial transformations and color distortions.

We define the local distributional smoothness utilizing a transformation $T$ as follows:

$$\mathrm{LDS}_T(x, \theta) = D_T(x, \theta_{\mathrm{adv}}, \theta), \qquad (7)$$

where $D_T$ and $\theta_{\mathrm{adv}}$ are defined as follows:

$$D_T(x, \theta_T, \theta) = D_{\mathrm{KL}}\big(p(y|x, \theta)\,\|\,p(y|T(x, \theta_T), \theta)\big), \qquad (8)$$
$$\theta_{\mathrm{adv}} = \arg\max_{\theta_T;\,\|\theta_T\| \le \epsilon} D_T(x, \theta_T, \theta). \qquad (9)$$

We refer to $T(\cdot, \theta_{\mathrm{adv}})$ as an adversarial transformation in this paper. There are some adversarial attacks utilizing functions instead of additive noises [adef, fourier_attack], and the relation between the adversarial transformation and these attacks is similar to the relation between the virtual adversarial perturbation and the adversarial perturbation [adv_exm_ian].

We utilize $\mathrm{LDS}_T$ for imposing the consistency. Thus, the objective function of RAT for the SSL scenario is represented as follows:

$$\mathcal{L} = \ell(\mathcal{D}_l, \mathcal{Y}_l, \theta) + \lambda \frac{1}{|\mathcal{D}_l \cup \mathcal{D}_{ul}|} \sum_{x \in \mathcal{D}_l \cup \mathcal{D}_{ul}} \mathrm{LDS}_T(x, \theta). \qquad (10)$$

We can identify VAT as a special case of RAT; when $T(x, \theta_T) = x + \theta_T$ and $\|\cdot\|$ is the L2 norm, RAT is equal to VAT.

To compute $\theta_{\mathrm{adv}}$, we have to solve the maximization problem (9). However, it is difficult to solve exactly. Therefore, we approximate it by using a Taylor approximation.

To efficiently approximate and solve (9), $T$ needs to satisfy the following two conditions:

  • (C1) $T(x, \theta_T)$ is differentiable with respect to $x$ and $\theta_T$ almost everywhere.

  • (C2) There is a parameter $\theta_0$ that makes $T$ the identity transformation, $T(x, \theta_0) = x$.

If $T$ satisfies these conditions, $D_T(x, \theta_T, \theta)$ takes its minimum value at $\theta_T = \theta_0$, and (8) can be written in the same form as (5) by the second-order Taylor approximation around $\theta_0$ as $D_T(x, \theta_T, \theta) \approx \frac{1}{2}(\theta_T - \theta_0)^\top H_T(x, \theta)(\theta_T - \theta_0)$, where $H_T(x, \theta) = \nabla_{\theta_T}^2 D_T(x, \theta_T, \theta)|_{\theta_T = \theta_0}$. Thus, (9) is approximated as follows:

$$\theta_{\mathrm{adv}} \approx \arg\max_{\theta_T;\,\|\theta_T - \theta_0\| \le \epsilon} \frac{1}{2}(\theta_T - \theta_0)^\top H_T(x, \theta)(\theta_T - \theta_0). \qquad (11)$$

$\theta_{\mathrm{adv}} - \theta_0$ is also parallel to the eigenvector corresponding to the largest eigenvalue of $H_T(x, \theta)$.

RAT allows any transformation satisfying $T(x, \theta_T) \in X_c$ for $x \in X_c$ and the conditions C1 and C2. In this paper, we use hand-crafted transformations depending on the input data domain, such as image coordinate shifts, image resizing, or global color changes, to name a few for the case of an image classification task. In the next section, we propose the use of composite transformations as $T$ to further enhance the effect of the smoothing.
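As a concrete illustration of conditions C1 and C2 (our sketch, not the authors' code), the coordinate-shift transformation below is differentiable in its parameters and reduces to the identity at θ = 0; the function name and the use of normalized coordinates are our assumptions.

```python
import torch
import torch.nn.functional as F

def translate(x, theta):
    """Shift an image batch x of shape (B, C, H, W) by continuous offsets theta of shape (B, 2).

    Differentiable in theta (C1), and translate(x, torch.zeros(B, 2)) returns x up to
    interpolation error, so theta_0 = 0 acts as the identity parameter (C2).
    """
    B = x.size(0)
    # 2x3 affine matrices: identity linear part, translation part given by theta
    # (offsets are in normalized [-1, 1] image coordinates).
    mat = torch.zeros(B, 2, 3, device=x.device, dtype=x.dtype)
    mat[:, 0, 0] = 1.0
    mat[:, 1, 1] = 1.0
    mat[:, :, 2] = theta
    grid = F.affine_grid(mat, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)
```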

Use of Composite Transformations

Let $\mathcal{T} = \{T_k\}_{k=1}^{K}$ be a set of functions satisfying the conditions C1 and C2. $\mathcal{T}$ may contain composite functions of the form $T_j \circ T_k$, where $T_j, T_k \in \mathcal{T}$ and $j \neq k$. It is obvious that these composite functions also satisfy the conditions C1 and C2.

By having such composite functions, one can obtain a much richer set of transformations that still leave the class label invariant, since each constituent transformation maps $X_c$ into itself. It is reasonable for RAT to utilize composite functions, because they lead to richer transformations and impose the consistency over a wider range; a sketch of such a composition is given below.
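The following minimal sketch (ours) shows one way such composites can be represented in code: each base transformation takes its own parameter block and is the identity when that block is zero, so the composition remains a candidate class-invariant transformation. It reuses the `translate` sketch above; the helper names are illustrative.

```python
def compose(transforms):
    """Build T_K( ... T_2(T_1(x, th_1), th_2) ..., th_K) from a list of base transforms.

    Each element of `transforms` is a callable T(x, theta) with T(x, 0) = x, so the
    composite is again the identity when every parameter block is zero (C2), and it
    stays differentiable in all parameter blocks (C1).
    """
    def composite(x, thetas):
        for T, theta in zip(transforms, thetas):
            x = T(x, theta)
        return x
    return composite

# Example: noise injection followed by a coordinate shift.
# T = compose([lambda x, th: x + th, translate])
# x_transformed = T(x, [noise_params, shift_params])
```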

Fast Approximation of $\theta_{\mathrm{adv}}$

Although $\theta_{\mathrm{adv}}$ emerges as the eigenvector corresponding to the largest eigenvalue of a Hessian matrix as already described, the computational cost of computing this eigenvector exactly is prohibitive. There is a way to approximate the eigenvector with small computational costs [power_iteration, vat2016, vat2018].

We approximate the parameters with the power iteration and the finite difference method, just as VAT does. For each transformation $T_k$, we sample a random unit vector $d$ as the initial parameter and calculate $d \leftarrow \overline{H_{T_k} d}$ iteratively, where $\overline{\cdot}$ denotes normalization to unit norm. This makes $d$ converge to the eigenvector corresponding to the largest eigenvalue of $H_{T_k}$.

The Hessian-vector product is calculated with the finite difference method as follows:

$$H_T d \approx \frac{\nabla_{\theta_T} D_T(x, \theta_T, \theta)\big|_{\theta_T = \theta_0 + \xi d} - \nabla_{\theta_T} D_T(x, \theta_T, \theta)\big|_{\theta_T = \theta_0}}{\xi} = \frac{\nabla_{\theta_T} D_T(x, \theta_T, \theta)\big|_{\theta_T = \theta_0 + \xi d}}{\xi}, \qquad (12)$$

with a small constant $\xi \neq 0$. We used the fact that $\nabla_{\theta_T} D_T(x, \theta_T, \theta)|_{\theta_T = \theta_0} = 0$. We approximate $\theta_{\mathrm{adv}}$ by normalizing the norm of the approximated eigenvector, as described in the next section. Note that we run just one power iteration in this paper, because it is reported in [vat2016, vat2018] that one iteration is sufficient for computing an accurate adversarial direction and increasing the number of iterations has no further effect.
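Putting the pieces together, the sketch below (our illustration, reusing the `kl_div_from_logits` helper from the VAT sketch above) estimates θ_adv for a generic transformation T(x, θ) with θ_0 = 0 via one power-iteration step and the finite-difference identity (12); treating the whole parameter block as a single vector for normalization is our choice.

```python
import torch

def adversarial_transform_params(model, x, T, theta_shape, eps, xi=1e-6):
    """Approximate theta_adv for a transformation T(x, theta) with T(x, 0) = x."""
    with torch.no_grad():
        p_logits = model(x)                      # outputs for the untransformed data
    d = torch.randn(theta_shape, device=x.device)
    d = d / d.norm()                             # random unit initial direction
    d.requires_grad_(True)
    # The gradient at theta = xi * d is proportional to H_T d by Eq. (12),
    # because the gradient of D_T vanishes at the identity parameter.
    dist = kl_div_from_logits(p_logits, model(T(x, xi * d)))
    grad = torch.autograd.grad(dist, d)[0]
    direction = grad.detach() / grad.detach().norm().clamp_min(1e-12)
    return eps * direction                       # place theta_adv on the eps-boundary
```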

ε-Rampup

Although $\theta_{\mathrm{adv}}$ should satisfy $\|\theta_{\mathrm{adv}} - \theta_0\| \le \epsilon$, it is actually obtained as a parameter satisfying $\|\theta_{\mathrm{adv}} - \theta_0\| = \epsilon$ when solving (11), because the Hessian of the KL divergence is positive semi-definite and the quadratic form is therefore maximized on the boundary.

In the case of VAT, smoothing with $\|r_{\mathrm{adv}}\|_2 = \epsilon$ means that the model should satisfy consistency between the outputs with respect to the original data and data on the surface of the ε-ball centered on the original data.

We propose a technique, ε-rampup, that extends the smoothing effect not only to the boundary but also to the inside of the boundary with small computational cost. The technique is to ramp up ε from 0 to a predefined value during training, and the parameters of the adversarial transformation are determined by solving (11) with the current ε. One can approximately solve this by normalizing the approximated eigenvector to satisfy $\|\theta_{\mathrm{adv}} - \theta_0\| = \epsilon$ after the power iteration.

Although there are many techniques to ramp up or anneal parameters [meanteacher, cyclr], we adopt the procedure used in Mean Teacher [meanteacher]. Mean Teacher utilizes a sigmoid-shaped function, $\exp(-5(1 - t)^2)$ with $t \in [0, 1]$ denoting the ramp-up progress, for ramping up the regularization coefficient, and we adopt the same schedule for ε.
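A minimal sketch of this schedule (ours; the iteration counts and eps_max value below are placeholders), assuming the exp(−5(1 − t)²) ramp-up used by Mean Teacher:

```python
import math

def eps_rampup(step, rampup_steps, eps_max):
    """Sigmoid-shaped ramp-up of eps from ~0 to eps_max over `rampup_steps` iterations."""
    t = min(max(step / float(rampup_steps), 0.0), 1.0)
    return eps_max * math.exp(-5.0 * (1.0 - t) ** 2)

# e.g., the eps used at training step k: eps_k = eps_rampup(k, 400_000, 6.0)
```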

We show the pseudocode of the generation process of $\theta_{\mathrm{adv}}$ in Algorithm 1.

1:  Input: data $x$; transformation functions $\{T_k\}_{k=1}^{K}$; norms $\{\|\cdot\|_k\}_{k=1}^{K}$; scalar parameters $\xi$ and $\{\epsilon_k\}_{k=1}^{K}$
2:  Output: function parameters $\{\theta_{\mathrm{adv},k}\}_{k=1}^{K}$
3:  Make a copy of the data: $\hat{x} \leftarrow x$
4:  for $k = 1, \ldots, K$ do
5:     initialize $d_k$ as a random unit vector
6:     $g_k \leftarrow \nabla_{\theta_k} D_{T_k}(\hat{x}, \theta_k, \theta)\big|_{\theta_k = \theta_{0,k} + \xi d_k}$   (finite difference, Eq. (12))
7:     $d_k \leftarrow g_k / \|g_k\|_k$   (one-step power iteration)
8:  end for
9:  for $k = 1, \ldots, K$ do
10:     $\theta_{\mathrm{adv},k} \leftarrow \theta_{0,k} + d_k$
11:     Normalize $\theta_{\mathrm{adv},k} - \theta_{0,k}$ to satisfy $\|\theta_{\mathrm{adv},k} - \theta_{0,k}\|_k = \epsilon_k$
12:     $\hat{x} \leftarrow T_k(\hat{x}, \theta_{\mathrm{adv},k})$   (apply transformations sequentially for composites)
13:  end for
Algorithm 1 Generation of $\{\theta_{\mathrm{adv},k}\}$

Evaluation on Synthetic Dataset

We show the smoothing effect of RAT on a toy problem, the moons dataset, in Figure 1. We generate 10 labeled samples and 30 unlabeled samples for each moon. The model consists of a three-layer neural network with ReLU non-linearity [relu]. All hidden layers have 128 units. The model is optimized with the Adam optimizer for 500 iterations with the default parameters suggested in [adam]. In each iteration, we use all samples for updating the parameters. We treat this toy problem as if we already knew appropriate class-invariant transformations; we adopt class-wise rotation along each moon as $T$ for illustration purposes. Note that we do not ramp up ε in this experiment.

VAT with a small ε draws a decision boundary crossing the moons, as shown in Figure 1 (b). When we adopt a larger ε, VAT cannot smooth the output distribution within the ε-ball, as shown in Figure 1 (c), because the larger ε allows perturbations that cross the true class boundary and thus causes inconsistency. On the other hand, RAT draws the decision boundary along the underlying data distribution.

In this toy problem, we utilize class-wise rotation along each moon as $T$, which implicitly amounts to using the label information. Therefore, in the next section, we evaluate RAT in realistic situations, where we do not know such class-specific transformations.

Experiments

We evaluate the effectiveness of RAT and ε-rampup through three experiments on a semi-supervised image classification task: (i) evaluation of composite transformations, (ii) evaluation of ε-rampup for VAT and RAT, and (iii) comparison of RAT to baseline methods. As baseline methods, we use Π-Model [pimodel1, pimodel2], Pseudo-Label [pseudo_label], Mean Teacher [meanteacher], and VAT [vat2018]. Note that VAT utilizes entropy regularization in all experiments. To evaluate in realistic situations, we follow the setting proposed by Oliver et al. (2018). We use PyTorch [pytorch] to implement and evaluate the SSL algorithms, and we carefully reproduced the results of Oliver et al. (2018). All hyperparameters for the SSL algorithms are the same as in Oliver et al. (2018), except that we do not use L1 and L2 regularization.

For all experiments, we use the same Wide ResNet architecture, with depth 28 and width 2 [wrn], and we use the CIFAR-10 [cifar] and SVHN [svhn] datasets for evaluation. CIFAR-10 has 50,000 training data and 10,000 test data, and we split the training data into a train/validation set, 45,000 data for training and 5,000 data for validation. SVHN has 73,257 data for training and 26,032 data for testing. We also split its training data into 65,931 data for training and 7,326 data for validation. For the semi-supervised setting, we further split the training data into labeled data and unlabeled data.

We utilize standard preprocessing and data augmentation for training, following Oliver et al. (2018). For SVHN, we normalize the pixel values and use random translation by up to 2 pixels as data augmentation. For CIFAR-10, we apply ZCA normalization [cifar] and global contrast normalization, and use random horizontal flipping, random translation by up to 2 pixels, and Gaussian noise injection with standard deviation 0.15 as data augmentation.

We report the mean and standard deviation of error rates over five trials with test sets. The test error rates are evaluated with the model that has the minimum error rate on validation sets. The evaluation on validation set is executed every 25,000 training iterations.

Implementation Details of RAT

We tested three types of data transformations, all of which are commonly used for data augmentation in image classification tasks. All transformations discussed below satisfy the conditions C1 and C2. We evaluated different types of composite transformations, as discussed below.

Noise Injection

Noise injection is represented as $T(x, \theta_T) = x + \theta_T$. We define the norm for the parameters of the noise injection as the L2 norm, $\|\theta_T\|_2$. This is equal to the formulation of VAT.

Spatial Transformation

We consider three spatial transformations with different degrees of freedom: affine transformation, thin plate spline [tps], and flow field. All of these transformations shift each pixel position $p_i$ by an offset vector $\delta_i$ to give $p_i + \delta_i$. The pixel values of the transformed image are calculated by bilinear interpolation. The details of these transformations are provided in [stn].

The difference between the transformations is the degrees of freedom used to calculate the offset vectors: the affine transformation has six parameters, the thin plate spline has a number of parameters proportional to the number of control points, and the flow field directly has the local offset vectors as parameters, meaning that the number of parameters of the flow field is proportional to the spatial resolution of the image. We arrange the control points of the thin plate spline on a regular grid in all experiments.

We define the norm for the parameters of the flow field transformation as the L2 norm of the offset vectors, $\|(\delta_1, \ldots, \delta_N)\|_2$, where $N$ is the number of pixels. The norm for the thin plate spline is likewise defined as the L2 norm of the offset vectors for the control points.

The affine transformation is a linear operator in homogeneous coordinates, and its norm is given as the operator norm. Thus, $\|\theta_T\|$ is calculated as the maximum singular value of the affine transformation matrix.
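As a small illustration of this operator-norm constraint (our sketch; treating the affine parameters as a deviation from the identity is an assumption on our part), the largest singular value can be computed and rescaled as follows:

```python
import torch

def affine_param_norm(delta):
    """Operator norm (largest singular value) of a batch of 2x3 affine parameter
    matrices `delta`, interpreted here as deviations from the identity transform."""
    # torch.linalg.svdvals returns singular values in descending order.
    return torch.linalg.svdvals(delta)[..., 0]

def rescale_to_eps(delta, eps):
    """Rescale delta so that its operator norm equals eps, i.e., place theta_adv
    on the boundary of the eps-ball after the power-iteration step."""
    norm = affine_param_norm(delta).clamp_min(1e-12)
    return delta * (eps / norm).view(-1, 1, 1)
```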

Color Distortion

Color distortion is an effective augmentation method for image classification tasks. Among the many methods for color distortion, we use a simple one, channel-wise weighting, $x'_{c,i} = \theta_c\, x_{c,i}$, where $x_{c,i}$ is the pixel value of the $c$-th channel at the $i$-th pixel, and $\theta_c$ is a scalar value for each channel. This transformation is a linear operator, and we define its norm as the operator norm. Note that channel-wise weighting is represented as the multiplication of a diagonal matrix and the per-pixel channel vector, so the operator norm is calculated as $\max_c |\theta_c|$.
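A brief sketch (ours) of channel-wise weighting as a diagonal operator; writing the weights as 1 + θ_c so that θ = 0 is the identity parameter is our parameterization choice, and the operator norm of a diagonal matrix is its largest absolute entry.

```python
import torch

def channel_weighting(x, theta):
    """Scale each channel of x (B, C, H, W) by 1 + theta_c; theta has shape (B, C)."""
    return x * (1.0 + theta).view(x.size(0), x.size(1), 1, 1)

def channel_param_norm(theta):
    """Operator norm of the deviation diag(theta) from the identity weighting:
    its largest absolute entry per sample."""
    return theta.abs().amax(dim=1)
```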

Evaluation of Composite Transformations

Since the performance of RAT depends on the combination of transformations, we report the effect of combinations of functions by adding transformations to VAT.

We first seek a good ε for each transformation with a grid search on CIFAR-10 with 4,000 labeled data: from 0.001 to 0.01 with a 0.001 step size for channel-wise weighting, from 0.1 to 1 with a 0.1 step size for the affine transformation and the thin plate spline, and from 0.01 to 0.1 with a 0.01 step size for the flow field. We show the grid search results in Table 1. Other parameters, such as the consistency coefficient, ε for noise injection, and the optimization parameters, are the same as those suggested for VAT in [realisticeval]. We summarize the parameters in Table 2. Note that ε-rampup is not utilized in this experiment, because this experiment explores the effect of composite transformations.

Transformations ε
Channel-wise weighting 0.001
Affine transformation 0.6
Thin plate spline 1
Flow field 0.01
Table 1: The ε parameter selected for each transformation.
Parameters Values
Initial learning rate 0.003
Max consistency coefficient 0.3
ε for noise injection 6.0 or 1.0
Entropy penalty multiplier 0.06
Table 2: Parameters shared by VAT and RAT. ε for noise injection is 6.0 for CIFAR-10 and 1.0 for SVHN.

The results of adopting various transformations with CIFAR-10 with 4,000 labeled data and SVHN with 1,000 labeled data are shown in Table 3.

Methods Channel-wise Affine Thin Plate Spline Flow Field CIFAR-10 SVHN
Supervised — — — — 20.35±0.14% 12.33±0.25%
VAT — — — — 13.68±0.25% 5.32±0.25%
RAT ✓ — — — 14.33±0.44% 5.19±0.29%
RAT — ✓ — — 11.70±0.32% 3.10±0.12%
RAT — — ✓ — 13.06±0.44% 4.12±0.11%
RAT — — — ✓ 14.27±0.41% 4.93±0.35%
RAT ✓ ✓ — — 11.32±0.44% 3.14±0.12%
RAT ✓ — ✓ — 13.46±0.37% 4.06±0.16%
RAT ✓ — — ✓ 14.17±0.27% 4.85±0.12%
Table 3: Test error rates of RAT with various transformations on CIFAR-10 with 4,000 labeled data and SVHN with 1,000 labeled data. All RAT settings include noise injection as a transformation, and all hyperparameters and experimental settings of VAT and RAT are the same, except for the parameters specific to RAT.

All transformations and combinations improve the performance over supervised learning. However, channel-wise weighting and the flow field increase the test error rates relative to VAT on CIFAR-10. For CIFAR-10, since we apply normalization, channel-wise weighting in the space of normalized data is an unnatural transformation for natural images. The flow field, on the other hand, is a transformation that can break object structure. Unlike the simple structures of SVHN digits, detailed structure is an important feature of general objects. Thus, the flow field induces an unfavorable effect.

The affine transformation achieves the best performance of all the spatial transformations, and we attribute this result to its low degree of freedom: the affine transformation has the lowest degree of freedom among the three. In particular, the affine transformation preserves points, straight lines, and planes; in other words, it preserves the basic structure of objects. Therefore, except for extreme cases, the affine transformation ensures that the class of the transformed data is the same as the class of the original data. This matches the strategy of RAT, which is that the output distributions of data belonging to the same class should be close.

Figure 2: Test error rates obtained by varying the number of labeled data on (a) SVHN and (b) CIFAR-10. Shaded regions indicate the standard deviation over five trials.

Combining channel-wise weighting with each spatial transformation is less effective. This means that adding channel-wise weighting does not expand the smoothing effect to a meaningful range. Indeed, the differences caused by adding channel-wise weighting are within the standard deviation.

Evaluation of ε-Rampup

We evaluate the effectiveness of ε-rampup on CIFAR-10 with 4,000 labeled data. Since ε-rampup is a technique for both VAT and RAT, we compare the results with and without ramping up for VAT and RAT. We ramp up ε for 400,000 iterations. We utilize the composite transformation consisting of the affine transformation and noise injection for RAT, and the hyperparameters are as in Table 2.

The results are shown in Table 4. For both VAT and RAT, ε-rampup brings a positive effect. Interestingly, ε-rampup also allows for a larger ε. Since the smoothing effect covers the inside of the ε-ball when ramping up, VAT and RAT with ramping up work well with a relatively large ε.

Methods Maximum ε CIFAR-10
VAT 6 13.68±0.25%
VAT 10 14.31±0.33%
VAT w/ ε-rampup 10 13.26±0.20%
RAT (6, 0.6) 11.70±0.32%
RAT (6, 1) 12.68±0.29%
RAT w/ ε-rampup (6, 1) 11.26±0.34%
Table 4: Comparison with and without ε-rampup on CIFAR-10 with 4,000 labeled data. RAT has two ε values, one for the noise injection and one for the affine transformation.

Comparison of RAT to Baselines

We show the effectiveness of RAT by comparing it with the baseline methods and with a non-adversarial version of RAT (Random Transformation). We use the composite transformation consisting of the affine transformation and noise injection for RAT. The hyperparameters of RAT and the ε values for the transformations are the same as shown in Tables 2 and 1, respectively, except that we set ε for the affine transformation to 1 for RAT with ramping up on CIFAR-10, and set ε for noise injection to 5 for RAT and VAT with ε-rampup on SVHN. Note that all the parameters of the random transformation are the same as those of RAT.

In Table 5, we show the comparison results with standard SSL settings, CIFAR-10 with 4,000 labeled data and SVHN with 1,000 labeled data.

CIFAR-10 SVHN
Methods 4,000 Labels 1,000 Labels
Supervised 20.35±0.14% 12.33±0.25%
Π-Model 16.24±0.38% 7.81±0.39%
Pseudo-Label 14.78±0.26% 7.26±0.27%
Mean Teacher 15.77±0.22% 6.48±0.44%
VAT 13.68±0.25% 5.32±0.25%
VAT w/ ε-rampup 13.26±0.20% 5.17±0.26%
Random Transformation 12.71±0.82% 6.06±0.50%
RAT 11.70±0.32% 3.10±0.12%
RAT w/ ε-rampup 11.26±0.34% 2.86±0.07%
Table 5: Test error rates of RAT and the baseline methods on CIFAR-10 with 4,000 labeled data and SVHN with 1,000 labeled data.

On both datasets, RAT improves the test error rates by more than 2% over the best baseline method, VAT. Furthermore, RAT also improves the error rates over the random transformation, especially on the SVHN dataset. These results demonstrate the importance of adversariality.

We evaluate the test error rates of each method while varying the number of labeled data from 250 to 8,000. The results are shown in Figure 2. RAT consistently outperforms the other methods over the entire range on both datasets. Remarkably, RAT significantly improves the test error rate on CIFAR-10 with 250 labeled data compared with the best baseline result, from 50.20±1.88% to 36.31±2.03%. We believe that this improvement results from appropriately smoothed model predictions along the underlying data distribution. The fact that VAT underperforms RAT by roughly 15 percentage points in CIFAR-10 test error in this setting is a clear indication that adversarial, class-invariant transformations provide far better consistency regularization than isotropic adversarial noise.

Lastly, we compare our results with the very recently reported results of a Mixup-based SSL method, MixMatch [mixmatch]. Interestingly, RAT is comparable or superior to MixMatch on SVHN, while MixMatch is superior to RAT on CIFAR-10. However, we should point out that this comparison is not entirely fair, because the experimental settings of MixMatch differ in several respects from ours and from the others shown in Figure 2. In the MixMatch paper, the authors took an exponential moving average of the models, and the final results were given by the median of the last 20 test error rates measured periodically during training. These settings, which are missing in our experiments, seem to partly boost their performance.

Related Work

There are many SSL algorithms such as graph-based methods [gcn, label_prop], generative model-based methods [semi_vae, semi_gan, mgan], and regularization-based methods [vat2016, pimodel1, pimodel2, meanteacher, mixmatch].

Label propagation [label_prop] is a representative graph-based SSL method. Modern graph-based approaches utilize neural networks for graphs [gcn]. Graph-based methods are effective when an elaborate graph structure is given.

Generative models such as VAE and GAN are now popular frameworks for the SSL setting [semi_vae, semi_gan]. Although generative model-based SSL methods typically have to train additional models, they tend to show remarkable gain in test performance.

Regularization-based methods are comparatively much more tractable, and can be utilized in arbitrary models. Next, we review three regularization-based methods closely related to RAT.

Consistency Regularization

Consistency regularization is a method that imposes consistency between the outputs of one model with respect to a (typically unlabeled) data point and its perturbed counterpart, or between the outputs of two models with respect to the same input. One of the simplest ways of constructing consistency regularization is to add a stochastic perturbation to the data $x$, yielding a perturbed sample $\tilde{x}$, as follows:

$$\mathcal{R}_{\mathrm{cons}} = d\big(p(y|x, \theta),\, p(y|\tilde{x}, \theta)\big), \qquad (13)$$

where $d(\cdot, \cdot)$ is some distance function, e.g., the Euclidean distance or the KL divergence. VAT [vat2016, vat2018] and RAT fall into this category.
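A hedged sketch of the generic consistency term (13) with the KL divergence as the distance; `perturb` stands for any stochastic perturbation or transformation, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x, perturb):
    """d(p(y|x), p(y|perturb(x))) with d taken as the KL divergence."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)            # target distribution (no gradient)
    log_q = F.log_softmax(model(perturb(x)), dim=1)
    return F.kl_div(log_q, p, reduction="batchmean")
```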

The random-transformation-based consistency regularization techniques of [pimodel1] and the "Π-Model" [pimodel2] are very similar and well known. We refer to both as Π-Model in this paper. One can view the Π-Model as a non-adversarial version of RAT. Therefore, by comparing RAT with the Π-Model, we can see the importance of adversarial transformations.

Another way of constructing consistency regularization is to utilize dropout [dropout]. Let $\tilde{\theta}$ and $\tilde{\theta}'$ be parameters randomly sub-sampled through dropout. Dropout as consistency regularization is represented as follows:

$$\mathcal{R}_{\mathrm{drop}} = d\big(p(y|x, \tilde{\theta}),\, p(y|x, \tilde{\theta}')\big). \qquad (14)$$

Mean Teacher [meanteacher] is a successful method that employs consistency regularization between two models. It makes a teacher model by exponential moving average of the parameters of a student model, and imposes the consistency between the teacher and the student. Although Mean Teacher can be combined with other SSL methods, the combination sometimes impairs the model performance as reported in [mixmatch].

VAT [vat2016, vat2018] is a very effective consistency regularization method. The advantage of VAT lies in the generation of adversarial noise, and this adversariality leads to isotropic smoothness around sampled datapoints. VAT has also shown effectiveness in natural language processing tasks [nvat].

Entropy Regularization

Entropy regularization is a way to impose low entropy on $p(y|x, \theta)$ to make the model prediction more discriminative, and is known to encourage low-density separation [entmin, ssl, ict]. The entropy regularization term is:

$$\mathcal{R}_{\mathrm{ent}} = -\sum_{y} p(y|x, \theta) \log p(y|x, \theta). \qquad (15)$$

This regularization is often combined with other SSL algorithms [pimodel1, vat2018] and a combined method, VAT+entropy regularization, shows the state-of-the-art results in [realisticeval].
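A minimal sketch of the entropy term (15) computed from model logits (the helper name is ours):

```python
import torch.nn.functional as F

def entropy_regularizer(logits):
    """Mean entropy of p(y|x, theta) over the batch; minimizing it sharpens predictions."""
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)
    return -(p * log_p).sum(dim=1).mean()
```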

Mixup

Mixup [mixup] is a powerful regularization method that has very recently been used for SSL [manifold_mixup, mixmatch]. Mixup blends two different data points, $x_i$ and $x_j$, and their labels, $y_i$ and $y_j$, as follows:

$$\tilde{x} = \beta x_i + (1 - \beta) x_j, \qquad \tilde{y} = \beta y_i + (1 - \beta) y_j, \qquad (16)$$

where $\beta$ is a scalar value sampled from a Beta distribution. In a semi-supervised setting, $\tilde{y}$ is calculated as a blend between a label and a prediction, or between two predictions, and the regularization term is described as consistency regularization, $d\big(p(y|\tilde{x}, \theta), \tilde{y}\big)$.
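A brief sketch of the Mixup blend in (16); the Beta concentration parameter `alpha` and the function name are placeholders of ours.

```python
import torch

def mixup(x_i, y_i, x_j, y_j, alpha=1.0):
    """Blend two batches of inputs and (one-hot or soft) targets as in Eq. (16)."""
    beta = torch.distributions.Beta(alpha, alpha).sample()   # scalar mixing weight
    x_tilde = beta * x_i + (1.0 - beta) * x_j
    y_tilde = beta * y_i + (1.0 - beta) * y_j
    return x_tilde, y_tilde
```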

Conclusion

We proposed an SSL framework, called RAT: Regularization framework based on Adversarial Transformations. RAT aims to smooth the model output along the underlying data distribution within a given class, building on recent advances in the generation of adversarial inputs from unlabeled data. Instead of just superposing adversarial noise, RAT uses a wider range of data transformations, each of which leaves the class label invariant. We further proposed the use of composite transformations and a technique, called ε-rampup, to enlarge the area in which the smoothing is effective without sacrificing computational cost. We experimentally showed that RAT significantly outperforms the baseline methods, including VAT, in semi-supervised image classification tasks. RAT is especially robust against reduction of labeled samples compared to the other methods. As future work, we would like to replace the manual design of composite functions with black-box function optimization.

References

Appendix: Adversarial Transformations for Semi-Supervised Learning

A. Hyperparameters

Hyperparameters for all baseline methods except for VAT are shown in Table 6. VAT and RAT hyperparameters are given in the main text.

Shared
Training iteration 500,000
Learning rate decayed by a factor of 0.2 at training iteration 400,000
Consistency coefficient rampup 200,000
Supervised
Initial learning rate 0.003
Π-Model
Initial learning rate 0.0003
Max consistency coefficient 20
Mean Teacher
Initial learning rate 0.0004
Max consistency coefficient 8
Exponential moving average decay 0.95
Pseudo-Label
Initial learning rate 0.003
Max consistency coefficient 1.0
Pseudo-label threshold 0.95
Table 6: Hyperparameter settings. All hyperparameters were determined following Oliver et al. (2018), except for L1 and L2 regularization. In our experiments, we do not use L1 and L2 regularization, because the original implementation of Oliver et al. (2018) does not use them.

B. Comparison of RAT to MixMatch

We compare RAT with a Mixup-based method, MixMatch [mixmatch], in Tables 7 and 8. MixMatch was an unrefereed state-of-the-art method at the time of submission. MixMatch combines various semi-supervised learning (SSL) mechanisms, such as consistency regularization, pseudo-labeling, and Mixup. As mentioned in the main text, the comparisons in Tables 7 and 8 are not exactly fair. In the MixMatch paper, the authors took an exponential moving average of the models, and the final results were given by the median of the last 20 test error rates measured periodically during training. These settings are missing in our experiments.

We show the results on SVHN while varying the number of labeled data in Table 7. The MixMatch results are taken from [mixmatch]. Although the experimental setting is slightly different, RAT outperforms MixMatch except in the 4,000-label setting. The test error rates of RAT increase from the 2,000-label setting to the 8,000-label setting. The hyperparameters of RAT were determined on CIFAR-10 with 4,000 labeled data and were not tuned for SVHN. Nevertheless, the test error rate of RAT on SVHN with 4,000 labels is the same as that of MixMatch.

We also show the results on CIFAR-10 in Table 8. MixMatch demonstrated remarkable results.

Since RAT and Mixup are compatible, RAT has the potential to further improve test error rates when combined with Mixup. However, exploring combinations of SSL methods is outside the scope of this paper. We will evaluate RAT+Mixup in future work.

Labels MixMatch RAT w/ ε-rampup
250 3.78±0.26 3.30±0.08
500 3.64±0.46 3.14±0.11
1000 3.27±0.31 2.86±0.07
2000 3.04±0.13 2.78±0.11
4000 2.89±0.06 2.89±0.05
8000 N/A 2.99±0.05
Table 7: Test error rates (%) for SVHN
Labels 250 4000
MixMatch w/o Mixup 39.11 10.97
MixMatch 11.80 6.00
RAT w/ ε-rampup 36.31±2.03 11.26±0.34
Table 8: Test error rates (%) for CIFAR-10

C. Test error rates of Figure 2

We show the test error rates of Figure 2 in Tables 9 and 10.

Labels 250 500 1000 2000 4000 8000
Π-Model 52.24±0.94 43.92±1.08 34.19±1.48 24.31±0.55 16.24±0.38 12.74±0.25
Pseudo-Label 50.20±1.88 41.86±0.68 30.42±1.24 18.23±0.33 14.78±0.26 12.61±0.19
Mean Teacher 50.78±0.93 42.96±0.25 33.31±1.30 21.64±0.63 15.77±0.22 12.84±0.18
VAT 51.12±1.54 30.52±0.93 24.23±0.82 16.04±0.43 13.68±0.25 11.39±0.27
RAT w/ ε-rampup 36.31±2.03 27.17±0.95 21.38±1.74 15.38±0.42 11.26±0.34 10.51±0.21
Table 9: Test error rates (%) for CIFAR-10
Labels 250 500 1000 2000 4000 8000
Π-Model 25.51±1.67 12.21±0.61 7.81±0.39 6.39±0.34 5.33±0.12 4.51±0.11
Pseudo-Label 13.66±0.96 7.78±0.21 7.26±0.27 5.67±0.29 5.03±0.28 4.43±0.12
Mean Teacher 12.35±0.60 7.80±0.40 6.48±0.44 5.40±0.32 5.05±0.21 4.33±0.21
VAT 7.24±0.30 6.37±0.13 5.32±0.25 5.22±0.11 4.76±0.28 4.28±0.13
RAT w/ ε-rampup 3.30±0.08 3.14±0.11 2.86±0.07 2.78±0.11 2.89±0.05 2.99±0.05
Table 10: Test error rates (%) for SVHN