Data Interpolating Prediction: Alternative Interpretation of Mixup

Takuya Shimada¹ (The University of Tokyo, Tokyo, Japan), Shoichiro Yamaguchi (Preferred Networks, Tokyo, Japan), Kohei Hayashi (Preferred Networks, Tokyo, Japan), Sosuke Kobayashi (Preferred Networks, Tokyo, Japan)

¹ Work done during an internship program at Preferred Networks.
Abstract

Data augmentation by mixing samples, such as Mixup, has been widely used, typically for classification tasks. However, this strategy is not always effective due to the gap between augmented samples used for training and original samples used for testing. This gap may prevent a classifier from learning the optimal decision boundary and thus increase the generalization error. To overcome this problem, we propose an alternative framework called Data Interpolating Prediction (DIP). Unlike common data augmentations, we encapsulate the sample-mixing process in the hypothesis class of a classifier so that train and test samples are treated equally. We derive a generalization bound and show that DIP helps to reduce the original Rademacher complexity. We also empirically demonstrate that DIP can outperform the existing Mixup.

1 Introduction

Data augmentation (Simard et al., 1998) has played an important role in training deep neural networks by preventing overfitting and improving generalization performance. Recently, sample-mix data augmentation (Zhang et al., 2018; Tokozume et al., 2018a, b; Verma et al., 2018; Guo et al., 2019; Inoue, 2018) has attracted attention; it combines two samples linearly to generate augmented samples. The effectiveness of this approach has been shown especially for image classification and sound recognition tasks.

Many traditional data augmentations (e.g., slight deformations for image data (Taylor and Nitschke, 2017)) rely on specific properties of the target domain, such as invariance to certain transformations. On the other hand, sample-mix augmentation can be applied to any dataset due to its simplicity. However, its effectiveness depends on the structure of the data. There is an inherent difference between original clean samples and augmented samples: augmented samples are not drawn directly from the underlying distribution. Thus, a classifier trained with sample-mix augmentation may learn a biased decision boundary. In fact, we can easily construct a distribution on which sample-mix deteriorates the classification performance (see Fig. 1).

(a) w/o sample-mix
(b) Mixup ()
(c) DIP ()
Figure 1: Test accuracy and visualization of the decision area on 2D spirals data. The neural networks are trained (left) without sample-mix, (center) with the existing sample-mix method, or (right) with the proposed sample-mix training plus output approximation with Monte Carlo sampling. Although standard sample-mix training deteriorates classification performance due to the biased decision boundary it induces, our method mitigates this problem. See Section 2 for the details of the beta-distribution hyper-parameter $\alpha$ and the number of samples $k$.

To overcome this problem, we propose an alternative framework called Data Interpolating Prediction (DIP), in which the sample-mixing process is encapsulated in a classifier. More specifically, we consider sample-mix as a stochastic perturbation inside a function and obtain the prediction by computing the expected value over the random variable. Note that we apply sample-mix to both train and test samples in our framework. This procedure is similar to existing studies such as Monte Carlo dropout (Gal and Ghahramani, 2016) and Augmented PAttern Classification (Sato et al., 2015). Furthermore, we derive a generalization error bound for our algorithm via Rademacher complexity and find that sample-mix helps to reduce the Rademacher complexity of a hypothesis class. Through experiments on benchmark image datasets, we confirm that the generalization gap can be reduced by sample-mix and demonstrate the effectiveness of the proposed method.

2 Proposed Method

2.1 Data Interpolating Prediction

Let $\mathcal{X} \subset \mathbb{R}^d$ be a $d$-dimensional input space and $\mathcal{Y} = \{0, 1\}^K$ be a $K$-dimensional one-hot label space. Denote a classifier by $f: \mathcal{X} \to \mathbb{R}^K$. The standard goal of classification problems is to obtain a classifier $f$ that minimizes the classification risk defined as

$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)}\left[\ell(f(x), y)\right]$,   (1)

where $p(x, y)$ is the joint density of an underlying distribution and $\ell$ is a loss function. Here, we consider a function in which the sample-mixing process is encapsulated as a random variable. We describe sample-mix between $x$ and $x'$ with a function $m_\lambda$ as follows:

$m_\lambda(x, x') = \lambda x + (1 - \lambda) x'$,   (2)

where $\lambda \in [0, 1]$ is a parameter that controls the mixing ratio between the two samples.

Let $\{(x_i, y_i)\}_{i=1}^{n}$ be i.i.d. samples drawn from $p(x, y)$, $\hat{p}(x)$ be the density of the empirical distribution of $\{x_i\}_{i=1}^{n}$, and $f$ be a specified classifier. Using the function $m_\lambda$, we redefine the deterministic function $F$ by

$F(x) = \mathbb{E}_{\lambda \sim p(\lambda),\, x' \sim \hat{p}(x')}\left[f(m_\lambda(x, x'))\right]$,   (3)

where $p(\lambda)$ is some density function over $[0, 1]$. Note that the function $F$ is equivalent to the base function $f$ when we set $p(\lambda) = \delta(\lambda - 1)$, where $\delta$ denotes the Dirac delta.
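To make equation 3 concrete, the expectation can be approximated by Monte Carlo sampling: draw $\lambda$ from $p(\lambda)$ and $x'$ from the empirical training data, mix the inputs, and average the base classifier's outputs. The following is a minimal NumPy sketch under these assumptions; the classifier f, the training matrix X_train, and the beta parameter alpha are illustrative placeholders, not the authors' implementation.

import numpy as np

def mix(x, x_prime, lam):
    # Equation 2: convex combination of two inputs.
    return lam * x + (1.0 - lam) * x_prime

def dip_predict(f, x, X_train, alpha=1.0, n_samples=100, rng=None):
    # Monte Carlo approximation of F(x) = E_{lam, x'}[ f(mix(x, x', lam)) ].
    rng = np.random.default_rng() if rng is None else rng
    lams = rng.beta(alpha, alpha, size=n_samples)            # lam ~ p(lam)
    idx = rng.integers(0, X_train.shape[0], size=n_samples)  # x' ~ empirical distribution
    outputs = [f(mix(x, X_train[i], lam)) for lam, i in zip(lams, idx)]
    return np.mean(outputs, axis=0)                          # averaged prediction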

2.2 Practical Optimization

Since the expected value in equation 3 is usually intractable, we train the classifier by minimizing an upper bound of an empirical version of $R(F)$,

$\hat{R}(F) = \frac{1}{n} \sum_{i=1}^{n} \ell(F(x_i), y_i)$.   (4)

By applying Jensen's inequality, we have

$\hat{R}(F) \le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\{\lambda_j\}_{j=1}^{k},\, \{x'_j\}_{j=1}^{k}}\left[\ell\!\left(\frac{1}{k} \sum_{j=1}^{k} f(m_{\lambda_j}(x_i, x'_j)),\; y_i\right)\right]$,   (5)

where $k$ is a positive integer representing the number of samples used to estimate the expectation. We denote the RHS of equation 5 by $\hat{R}_k(F)$. The tightness of the above bound is related to the value of $k$ as

$\hat{R}(F) \le \hat{R}_{k+1}(F) \le \hat{R}_k(F)$.   (6)

We can prove this in a similar manner to Burda et al. (2016). Since $\hat{R}(F) \le \hat{R}_k(F)$, a larger $k$ gives a more precise risk estimation.
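For reference, the $k$-sample objective on the RHS of equation 5 averages the base classifier's outputs over $k$ mixed copies of a training point before applying the loss. Below is a hedged NumPy sketch of a single-example estimate; f (assumed to output class probabilities), cross_entropy, and the sampling choices are illustrative stand-ins, not the authors' training code.

import numpy as np

def cross_entropy(p, y, eps=1e-12):
    # y is a one-hot label vector, p a vector of predicted class probabilities.
    return -np.sum(y * np.log(p + eps))

def dip_k_sample_loss(f, x, y, X_train, alpha=1.0, k=5, rng=None):
    # One stochastic estimate of the summand in equation 5 (label-preserving form):
    # average f over k mixed copies of x, then evaluate the loss against y.
    rng = np.random.default_rng() if rng is None else rng
    lams = rng.beta(alpha, alpha, size=k)
    idx = rng.integers(0, X_train.shape[0], size=k)
    mixed_outputs = [f(lam * x + (1.0 - lam) * X_train[i]) for lam, i in zip(lams, idx)]
    return cross_entropy(np.mean(mixed_outputs, axis=0), y)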

2.3 Label-Mixing or Label-Preserving

There are two types of sample-mix data augmentation, namely, the label-mixing approach and the label-preserving approach. We can show that the objective functions of both approaches are consistent under some conditions.

Proposition 1.

Suppose that $\ell$ is a linear function with respect to the second argument and that $p(\lambda) = \mathrm{Beta}(\alpha, \alpha)$ for some constant $\alpha > 0$. Then we have the following equation:

$\mathbb{E}_{(x, y),\, (x', y'),\, \lambda \sim \mathrm{Beta}(\alpha, \alpha)}\left[\ell\big(f(m_\lambda(x, x')),\, m_\lambda(y, y')\big)\right] = \mathbb{E}_{(x, y),\, (x', y'),\, \lambda \sim \mathrm{Beta}(\alpha + 1, \alpha)}\left[\ell\big(f(m_\lambda(x, x')),\, y\big)\right]$.   (7)

The proof of this proposition can be found in the blog post by inFERENCe (https://www.inference.vc/mixup-data-dependent-data-augmentation). Many label-mixing approaches (Zhang et al., 2018; Verma et al., 2018) use a beta distribution as the prior of $\lambda$. Thus, the optimization of such approaches can be considered a special case of our framework, because an empirical version of the RHS of equation 7 corresponds to $\hat{R}_k(F)$ with $k$ set to $1$ and $p(\lambda)$ set to $\mathrm{Beta}(\alpha + 1, \alpha)$. We experimentally investigate the behaviors of both label-mixing and label-preserving training in Sec. 3.
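The key condition in Proposition 1 is linearity of the loss in its label argument; cross-entropy with soft targets satisfies it because the loss is a weighted sum over label entries. The short check below is an illustrative example (not from the paper) that numerically verifies $\ell(p, m_\lambda(y, y')) = \lambda\, \ell(p, y) + (1 - \lambda)\, \ell(p, y')$.

import numpy as np

def soft_cross_entropy(p, target, eps=1e-12):
    # Linear in `target`: a weighted sum of -log p over classes.
    return -np.sum(target * np.log(p + eps))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))             # some predicted class probabilities
y, y_prime = np.eye(5)[1], np.eye(5)[3]   # two one-hot labels
lam = rng.beta(1.0, 1.0)

lhs = soft_cross_entropy(p, lam * y + (1.0 - lam) * y_prime)                          # label-mixing loss
rhs = lam * soft_cross_entropy(p, y) + (1.0 - lam) * soft_cross_entropy(p, y_prime)   # mixture of losses
assert np.isclose(lhs, rhs)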

2.4 Generalization Bound via Rademacher Complexity

In this section, we present a generalization bound for a function equipped with sample-mix. Let $\mathcal{F}$ be a function class of the specified model and $\hat{\mathfrak{R}}_n(\mathcal{F})$ be the empirical Rademacher complexity of $\mathcal{F}$. Then we have the following inequality.

Proposition 2.

Let $\{(x_i, y_i)\}_{i=1}^{n}$ be i.i.d. random variables drawn from an underlying distribution with the density $p(x, y)$. Suppose that $\ell$ is bounded by some constant $C > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $F \in \mathcal{F}$:

$R(F) \le \hat{R}(F) + 2 \hat{\mathfrak{R}}_n(\ell \circ \mathcal{F}) + 3C \sqrt{\frac{\log(2/\delta)}{2n}}$.   (8)

The proof of this proposition can be found in textbooks such as Mohri et al. (2012). Now we analyze the Rademacher complexity of the proposed function class. Let $\mathcal{F}$ be a specified function class and $\mathcal{F}_{\mathrm{DIP}}$ be the class of functions $F$ defined in equation 3. Suppose that the empirical Rademacher complexity of $\mathcal{F}$ can be bounded with some constant $B > 0$ as follows:

$\hat{\mathfrak{R}}_n(\mathcal{F}) \le \frac{B}{\sqrt{n}}$.   (9)

We can prove that this assumption holds for neural network models in a manner similar to Gao and Zhou (2016). Then we have the following theorem.

Theorem 1.

Suppose that $\ell$ is a $\mu$-Lipschitz function with respect to the first argument and that $\mathcal{F}$ satisfies the assumption in equation 9. Let $\mathcal{F}_{\mathrm{DIP}}$ be the function class of $F$ defined in equation 3. Then we have the following inequality:

$\hat{\mathfrak{R}}_n(\ell \circ \mathcal{F}_{\mathrm{DIP}}) \le \frac{\mu\, c_\lambda B}{\sqrt{n}}$,   (10)

where $c_\lambda \in (0, 1]$ is a constant determined by the mixing distribution $p(\lambda)$.

Note that $c_\lambda \le 1$ always holds from Jensen's inequality and $\lambda \in [0, 1]$. Thus, sample-mix can reduce the empirical Rademacher complexity of the function class, which reduces the generalization gap (i.e., $R(F) - \hat{R}(F)$). For example, when $p(\lambda) = \mathrm{Beta}(\alpha, \alpha)$ in equation 3, the resulting $c_\lambda$ is a monotonically decreasing function with respect to $\alpha$. Hence, we claim that a larger $\alpha$ can be effective for a smaller generalization gap. We experimentally analyze the behavior with respect to $\alpha$ in Sec. 3.
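The monotonicity claim can be checked numerically. Assuming, for illustration, that $c_\lambda$ is driven by the second moment of the mixing weight, i.e., $c_\lambda = \sqrt{\mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha, \alpha)}[\lambda^2]} = \sqrt{\frac{\alpha + 1}{2(2\alpha + 1)}}$ (an assumed form for this sketch, not the paper's exact constant), the snippet below confirms that this quantity decreases as $\alpha$ grows.

import numpy as np

def c_lambda(alpha):
    # E[lam^2] under Beta(alpha, alpha) = variance + mean^2 = 1/(4*(2*alpha + 1)) + 1/4.
    return np.sqrt((alpha + 1.0) / (2.0 * (2.0 * alpha + 1.0)))

rng = np.random.default_rng(0)
for alpha in [0.2, 0.5, 1.0, 2.0, 8.0]:
    mc = np.sqrt(np.mean(rng.beta(alpha, alpha, size=200_000) ** 2))
    print(f"alpha={alpha:4.1f}  closed form={c_lambda(alpha):.4f}  monte carlo={mc:.4f}")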

3 Experiments on CIFAR Datasets

In this section, we analyze the behavior of our proposed framework through experiments on the CIFAR10/100 datasets (Krizhevsky and Hinton, 2009). We evaluated the classification performance with two neural network architectures, VGG16 (Simonyan and Zisserman, 2015) and PreActResNet18 (He et al., 2016). The details of the experimental setting are described in Appendix B. For our proposed DIP, the output after the final fully-connected layer is used as $f$ in equation 3, and the expected output is approximated by 500 Monte Carlo samples at test time. As discussed in Sec. 2.3, there are two types of optimization process, and we evaluated both label-preserving and label-mixing style training, setting the beta-distribution prior $p(\lambda)$ separately for each (see Appendix B). Note that, at test time, the prediction was computed from the expectation in equation 3 even when the label-mixing style was used for training.

For the two baseline methods, we trained a classifier with (i) standard training (without sample-mix) and (ii) Mixup (Zhang et al., 2018) training (label-mixing style). To evaluate the performance of these methods, we computed the prediction only from the original clean samples. For Mixup training, $\lambda$ was drawn from a beta distribution.

We show the classification performance in Table 1 and the generalization gap (i.e., the gap between train and test performance) in Fig. 2. Note that magnified versions of the experimental results are deferred to Appendix C. As can be seen in Table 1, our proposed method tends to outperform the existing Mixup approach.

Remarks: For all approaches including the existing Mixup, a larger $\alpha$ leads to a smaller generalization gap, which is consistent with the discussion in Sec. 2.4. In addition, we found that a larger $k$ is likely to enlarge the gap and deteriorate the performance on test samples. This might be because the variance of the empirical loss function computed with $k$-times sampling plays the role of a regularizer.

Model Method CIFAR10 CIFAR100
VGG16 without sample-mix 6.78 (0.057) 28.68 (0.169)
Mixup (Zhang et al., 2018) 5.81 (0.031) 26.58 (0.044)
DIP (, , label-mixing) 5.74 (0.100) 25.48 (0.034)
DIP (, , label-preserving) 6.05 (0.015) 26.57 (0.155)
DIP (, ) 5.52 (0.041) 26.73 (0.054)
PreActResNet18 without sample-mix 5.68 (0.015) 25.25 (0.272)
Mixup () 4.46 (0.082) 22.58 (0.074)
DIP (, , label-mixing) 4.36 (0.079) 21.97 (0.052)
DIP (, , label-preserving) 4.83 (0.125) 23.33 (0.052)
DIP (, ) 4.40 (0.036) 22.04 (0.067)
Table 1: Mean misclassification rate and standard error over three trials on CIFAR10/100 datasets.
(a) VGG16 / CIFAR10
(b) PreActResNet18 / CIFAR10
Figure 2: Mean generalization gap and standard error over three trials on the CIFAR10 dataset; standard training without sample-mix is shown for reference.

4 Conclusion

In this paper, we proposed a novel framework called DIP, in which sample-mix is encapsulated in the hypothesis class of a classifier. We theoretically derived a generalization error bound via Rademacher complexity and showed that sample-mix is effective in reducing the generalization gap. Through experiments on CIFAR datasets, we demonstrated that our approach can outperform the existing Mixup data augmentation.

References

  • Burda et al. (2016) Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.
  • Gal and Ghahramani (2016) Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
  • Gao and Zhou (2016) W. Gao and Z.-H. Zhou. Dropout Rademacher complexity of deep neural networks. Science China Information Sciences, 59(7):072104, 2016.
  • Guo et al. (2019) H. Guo, Y. Mao, and R. Zhang. Mixup as locally linear out-of-manifold regularization. In AAAI, 2019.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • Inoue (2018) H. Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
  • Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Mohri et al. (2012) M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  • Sato et al. (2015) I. Sato, H. Nishimura, and K. Yokoi. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.
  • Simard et al. (1998) P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural networks: tricks of the trade. 1998.
  • Simonyan and Zisserman (2015) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • Taylor and Nitschke (2017) L. Taylor and G. Nitschke. Improving deep learning using generic data augmentation. arXiv preprint arXiv:1708.06020, 2017.
  • Tokozume et al. (2018a) Y. Tokozume, Y. Ushiku, and T. Harada. Between-class learning for image classification. In CVPR, 2018a.
  • Tokozume et al. (2018b) Y. Tokozume, Y. Ushiku, and T. Harada. Learning from between-class examples for deep sound recognition. In ICLR, 2018b.
  • Verma et al. (2018) V. Verma, A. Lamb, C. Beckham, A. Najafi, A. Courville, I. Mitliagkas, and Y. Bengio. Manifold mixup: Learning better representations by interpolating hidden states. arXiv preprint arXiv:1806.05236, 2018.
  • Zhang et al. (2018) H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Appendix A Proof of Theorem 1

In this section, we give a complete proof of Theorem 1. The empirical Rademacher complexity is defined as follows.

Definition 1.

Let $n$ be a positive integer, $x_1, \dots, x_n$ be i.i.d. random variables drawn from $p(x)$, $\mathcal{F}$ be a class of measurable functions, and $\sigma_1, \dots, \sigma_n$ be Rademacher random variables, namely, random variables taking $+1$ and $-1$ with equal probability. Then the empirical Rademacher complexity of $\mathcal{F}$ is defined as

$\hat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right]$.   (11)
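As an illustrative aside (not part of the original proof), the quantity in equation 11 can be estimated directly from the definition for a small, finite function class by sampling Rademacher vectors and taking the supremum over the class; the toy function class below is an arbitrary placeholder.

import numpy as np

def empirical_rademacher(outputs, n_draws=10_000, rng=None):
    # outputs[j, i] = f_j(x_i); estimate E_sigma[ sup_j (1/n) sum_i sigma_i f_j(x_i) ].
    rng = np.random.default_rng() if rng is None else rng
    n = outputs.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher variables
    correlations = sigma @ outputs.T / n                 # shape (n_draws, num_functions)
    return correlations.max(axis=1).mean()               # expected supremum over the class

rng = np.random.default_rng(0)
toy_outputs = rng.normal(size=(3, 50))                   # three functions evaluated on n = 50 points
print(empirical_rademacher(toy_outputs, rng=rng))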

We assume that $\ell$ is a $\mu$-Lipschitz function with respect to the first argument. Here we have the following useful lemma. The proof of this lemma can be found in [Mohri et al., 2012].

Lemma 1 (Talagrand’s lemma).

Let $\phi: \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz function. Then for any hypothesis set $\mathcal{H}$ of real-valued functions, the following inequality holds:

$\hat{\mathfrak{R}}_n(\phi \circ \mathcal{H}) \le L\, \hat{\mathfrak{R}}_n(\mathcal{H})$.   (12)

From this lemma, we have

$\hat{\mathfrak{R}}_n(\ell \circ \mathcal{F}_{\mathrm{DIP}}) \le \mu\, \hat{\mathfrak{R}}_n(\mathcal{F}_{\mathrm{DIP}})$.   (13)

Let $\mathcal{F}_{\mathrm{DIP}}$ be the function class of $F$ defined in equation 3. In equation 9, we assume that $\hat{\mathfrak{R}}_n(\mathcal{F}) \le B / \sqrt{n}$.

Now we can bound $\hat{\mathfrak{R}}_n(\mathcal{F}_{\mathrm{DIP}})$ as follows.

By combining the above result and equation 13, we complete the proof of this theorem. ∎

Appendix B Details of Experimental Setting

In this section, we describe the details of the training procedure for the experiments in Section 3.

B.1 Training

VGG16 [Simonyan and Zisserman, 2015] and PreActResNet18 [He et al., 2016] were used for the experiments. We did not apply Dropout, similarly to Mixup [Zhang et al., 2018]. For all experiments, we trained a neural network for 200 epochs. The learning rate was set at the beginning of training and decayed at the 100th and 150th epochs. We applied standard augmentation such as cropping and flipping. The mini-batch size was fixed. We set the beta-distribution prior $p(\lambda)$ separately for label-preserving and label-mixing sample-mix training. $\lambda$ was generated for each sample in the mini-batch, and $x'$ was obtained by a permutation of the samples in the mini-batch.
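The per-batch sampling described above can be sketched as follows: one $\lambda$ per example drawn from a beta distribution, with the partner $x'$ taken from a random permutation of the same mini-batch. This is a minimal NumPy sketch of that pipeline, not the authors' training script; the batch array and the parameter alpha are placeholders.

import numpy as np

def make_mixed_batch(x_batch, alpha, rng=None):
    # Draw one lam per example from Beta(alpha, alpha) and mix each example with
    # its partner under a random permutation of the mini-batch.
    rng = np.random.default_rng() if rng is None else rng
    batch_size = x_batch.shape[0]
    lam = rng.beta(alpha, alpha, size=(batch_size,) + (1,) * (x_batch.ndim - 1))
    perm = rng.permutation(batch_size)
    x_mixed = lam * x_batch + (1.0 - lam) * x_batch[perm]
    return x_mixed, lam.reshape(batch_size), perm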

B.2 Prediction

For the standard method without sample-mix and the Mixup method, we predicted the labels of test samples from the original clean samples. For the proposed method, we predicted labels from the expectation over mixed samples computed by Monte Carlo approximation. In the same manner as in the training process, we sampled $\lambda$ and $x'$ 500 times and calculated the average to obtain the final output. In the evaluation stage, data augmentation other than sample-mix was turned off.

Appendix C Magnified Versions of Experimental Results

In this section, we present magnified versions of the experimental results in Section 3.

CIFAR10 CIFAR100
Model Method Train Acc. Test Acc. Train Acc. Test Acc.
VGG16 without mix 0.00 (0.000) 6.78 (0.057) 0.03 (0.003) 28.68 (0.169)
Mixup () 0.05 (0.007) 5.81 (0.031) 0.27 (0.006) 26.58 (0.044)
Mixup () 0.26 (0.029) 5.73 (0.042) 1.77 (0.108) 26.34 (0.225)
DIP (, , label-mixing) 0.13 (0.000) 5.74 (0.100) 0.48 (0.012) 25.48 (0.034)
DIP (, , label-mixing) 0.72 (0.035) 5.85 (0.015) 3.08 (0.147) 25.45 (0.179)
DIP (, , label-preserving) 0.47 (0.012) 6.05 (0.015) 1.26 (0.072) 26.57 (0.155)
DIP (, , label-preserving) 2.08 (0.026) 6.81 (0.046) 7.38 (0.152) 27.73 (0.140)
DIP (, ) 0.02 (0.003) 5.57 (0.093) 0.12 (0.006) 25.87 (0.200)
DIP (, ) 0.30 (0.015) 5.63 (0.032) 1.15 (0.038) 25.72 (0.042)
DIP (, ) 0.00 (0.000) 5.85 (0.041) 0.04 (0.003) 27.20 (0.067)
DIP (, ) 0.01 (0.006) 5.52 (0.041) 0.10 (0.009) 26.73 (0.054)
PreActResNet18 without mix 0.00 (0.000) 5.68 (0.015) 0.02 (0.000) 25.25 (0.272)
Mixup () 0.02 (0.004) 4.46 (0.082) 0.09 (0.006) 22.58 (0.074)
Mixup () 0.18 (0.013) 4.32 (0.098) 0.50 (0.027) 22.87 (0.100)
DIP (, , label-mixing) 0.09 (0.006) 4.36 (0.079) 0.26 (0.009) 21.97 (0.052)
DIP (, , label-mixing) 0.51 (0.013) 4.66 (0.125) 1.43 (0.029) 22.31 (0.127)
DIP (, , label-preserving) 0.40 (0.032) 4.83 (0.125) 0.78 (0.003) 23.33 (0.052)
DIP (, , label-preserving) 1.74 (0.022) 5.84 (0.059) 3.87 (0.023) 23.75 (0.156)
DIP (, ) 0.02 (0.000) 4.50 (0.116) 0.09 (0.003) 21.85 (0.231)
DIP (, ) 0.27 (0.010) 4.75 (0.110) 0.57 (0.003) 21.94 (0.197)
DIP (, ) 0.00 (0.000) 4.75 (0.046) 0.03 (0.000) 22.37 (0.136)
DIP (, ) 0.01 (0.003) 4.40 (0.036) 0.08 (0.007) 22.04 (0.067)
Table 2: Mean misclassification rate and standard error over three trials on the CIFAR10/100 datasets.
(a) VGG16 / CIFAR10
(b) PreActResNet18 / CIFAR10
(c) VGG16 / CIFAR100
(d) PreActResNet18 / CIFAR100
Figure 3: Mean generalization gap and standard error over three trials on the CIFAR10/100 datasets; standard training without sample-mix is shown for reference.