Data Valuation using Reinforcement Learning
Abstract
Quantifying the value of data is a fundamental problem in machine learning. Data valuation has multiple important use cases: (1) building insights about the learning task, (2) domain adaptation, (3) corrupted sample discovery, and (4) robust learning. To adaptively learn data values jointly with the target task predictor model, we propose a meta learning framework which we name Data Valuation using Reinforcement Learning (DVRL). We employ a data value estimator (modeled by a deep neural network) to learn how likely each datum is to be used in training of the predictor model. We train the data value estimator using a reinforcement signal of the reward obtained on a small validation set that reflects performance on the target task. We demonstrate that DVRL yields superior data value estimates compared to alternative methods across different types of datasets and in a diverse set of application scenarios. The corrupted sample discovery performance of DVRL is close to optimal in many regimes (i.e. as if the noisy samples were known a priori), and for domain adaptation and robust learning DVRL significantly outperforms the state of the art by 14.6% and 10.8%, respectively.
1 Introduction
Data is an essential ingredient in machine learning. Machine learning models are well-known to improve when trained on large-scale and high-quality datasets (Hestness et al., 2017; Najafabadi et al., 2015). However, collecting such large-scale and high-quality datasets is costly and challenging. One needs to determine the samples that are most useful for the target task and then label them correctly. Recent work (Toneva et al., 2019) suggests that not all samples are equally useful for training, particularly in the case of deep neural networks. In some cases, similar or even higher test performance can be obtained by removing a significant portion of training data, i.e. low-quality or noisy data may be harmful (Ferdowsi et al., 2013; Frenay & Verleysen, 2014). There are also some scenarios where train-test mismatch cannot be avoided because the training dataset only exists for a different domain. Different methods (Ngiam et al., 2018; Zhu et al., 2019) have demonstrated the importance of carefully selecting the most relevant samples to minimize this mismatch.
Accurately quantifying the value of data has great potential for improving model performance on real-world training datasets, which commonly contain incorrect labels and where the input samples differ in relatedness, sample quality, and usefulness for the target task. Instead of treating all data samples equally, a lower priority can be assigned to a datum to obtain a higher-performance model, for example in the following scenarios:

Incorrect label (e.g. human labeling errors).

Input comes from a different distribution (e.g. different location or time).

Input is noisy or low quality (e.g. noisy capturing hardware).

Usefulness for target task (label is very common in the training dataset but not as common in the testing dataset).
In addition to improving performance in such scenarios, data valuation also enables many new use cases. It can suggest better practices for data collection, e.g. what kinds of additional data the model would benefit from the most. For organizations that sell data, it determines the correct value-based pricing of data subsets. Finally, it enables new possibilities for constructing very large-scale training datasets in a much cheaper way, e.g. by searching the Internet using the labels and filtering away less valuable data.
How does one evaluate the value of a single datum? This is a crucial and challenging question. It is straightforward to address at the full dataset granularity: one could naively train a model on the entire dataset and use its prediction performance as the value. However, evaluating the value of each datum is far more difficult, especially for complex models such as deep neural networks that are trained on large-scale datasets. In this paper, we propose a meta learning-based data valuation method which we name Data Valuation using Reinforcement Learning (DVRL). Our method integrates data valuation with the training of the target task predictor model. DVRL determines a reward by quantifying the performance on a small validation set, and uses it as a reinforcement signal to learn the likelihood of each datum being used in training of the predictor model. In a wide range of use cases, including domain adaptation, corrupted sample discovery and robust learning, we demonstrate significant improvements compared to permutation-based strategies (such as Leave-one-out and Influence Function (Koh & Liang, 2017)) and game theory-based strategies (such as Data Shapley (Ghorbani & Zou, 2019)). The main contributions can be summarized as follows:

We propose a novel meta learning framework for data valuation that is jointly optimized with the target task predictor model.

We demonstrate multiple use cases of the proposed data valuation framework and show DVRL significantly outperforms competing methods on many image, tabular and language datasets.

Unlike previous methods, DVRL is scalable to large datasets and complex models, and its computational complexity is not directly dependent on the size of the training set.
2 Related Work
Data valuation: A commonly-used method for data valuation is leave-one-out (LOO). It quantifies the performance difference when a specific sample is removed and assigns it as that sample's data value. The computational cost is a major concern for LOO: it scales linearly with the number of training samples, which means its cost becomes prohibitively high for large-scale datasets and complex models. In addition, there are fundamental limitations in the approximation. For example, if there are two exactly equivalent samples, LOO underestimates the value of each of them even though the sample may be very important. The method of Influence Function (Koh & Liang, 2017) was proposed to approximate LOO in a computationally-efficient manner. It uses the gradient of the loss function with small perturbations to estimate the data value. In order to compute the gradient, Hessian values are needed; however, these are prohibitively expensive to compute for neural networks. Approximations for Hessian computations are possible, although they generally result in performance limitations. From a data quality assessment perspective, the method of Influence Function inherits the major limitations of LOO.
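To make the leave-one-out idea concrete, the following minimal sketch computes LOO values on a toy dataset. The 1-nearest-neighbor predictor, the `accuracy` utility and the data are illustrative assumptions, not the models or datasets used in this paper. The sketch also reproduces the duplicate-sample limitation: removing either of two identical samples leaves the performance unchanged, so both receive zero value even though removing both would hurt.

```python
def predict_1nn(train, x):
    """Return the label of the training point nearest to x (1-D features)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, valid):
    """Validation accuracy of the 1-NN predictor fit on `train`."""
    if not train:
        return 0.0
    hits = sum(predict_1nn(train, x) == y for x, y in valid)
    return hits / len(valid)

def loo_values(train, valid):
    """LOO value of sample i = performance(all) - performance(all without i)."""
    full = accuracy(train, valid)
    return [full - accuracy(train[:i] + train[i + 1:], valid)
            for i in range(len(train))]

# Hypothetical toy data: two identical class-1 samples at x = 2.0.
train = [(0.0, 0), (2.0, 1), (2.0, 1)]
valid = [(0.1, 0), (1.9, 1), (2.1, 1)]
values = loo_values(train, valid)

# Each call to `accuracy` stands in for one full retraining, which is why the
# cost of LOO grows linearly with the number of training samples.
```

Here the two duplicated samples each get a LOO value of zero, while dropping both of them at once would reduce the validation accuracy from 1 to 1/3.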
Data Shapley (Ghorbani & Zou, 2019) is another relevant work. Shapley values are motivated by game theory (Shapley, 1953) and are commonly used in feature attribution problems such as relating predictions to input features (Lundberg & Lee, 2017). For Data Shapley, the prediction performance of all possible subsets is considered and the marginal performance improvement is used as the data value. The computational complexity for computing the exact Shapley value is exponential in the number of samples. Therefore, Monte Carlo sampling and gradient-based estimation are used to approximate them. However, even with these approximations, the computational complexity still remains high (indeed much higher than LOO) due to retraining for each test combination. In addition, the approximations may result in fundamental limitations in data valuation performance, e.g. with Monte Carlo approximation, the ratio of tested combinations compared to all possible combinations decreases exponentially. Moreover, in all the aforementioned methods data valuation is decoupled from predictor model training, which limits the performance due to lack of joint optimization.
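The Monte Carlo approximation can be sketched as permutation sampling: average each sample's marginal gain over random orderings of the dataset. This is a minimal illustration under stated assumptions, not the implementation of Ghorbani & Zou (2019); the `coverage` utility and the three-sample dataset are hypothetical stand-ins for "train a model on the subset and measure its performance".

```python
import random

def monte_carlo_shapley(n, utility, num_permutations, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset, prev = [], utility([])
        for i in order:
            subset.append(i)
            cur = utility(subset)  # in practice: one model retraining per call
            values[i] += (cur - prev) / num_permutations
            prev = cur
    return values

# Hypothetical utility: fraction of the two classes covered by the subset.
sample_class = ["a", "b", "b"]  # samples 1 and 2 are duplicates

def coverage(subset):
    return len({sample_class[i] for i in subset}) / 2.0

estimates = monte_carlo_shapley(3, coverage, num_permutations=2000)
```

Every sampled ordering needs one utility evaluation per training sample, which is the retraining cost mentioned above. In this toy case the two duplicated samples share the credit (about 0.25 each) instead of both receiving zero.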
Meta learning-based adaptive learning: There are relevant studies that utilize meta learning for adaptive weight assignment while training for various use cases such as robust learning, domain adaptation, and corrupted sample discovery. ChoiceNet (Choi et al., 2018) explicitly models output distributions and uses the correlations of the output to improve robustness. Xue et al. (2019) estimates uncertainty of predictions to identify the corrupted samples. Li et al. (2019) combines meta learning with standard stochastic gradient update with generated synthetic noise for robust learning. Shen & Sanghavi (2019) alternates the processes of selecting the samples with low loss and model training to improve robustness. Shu et al. (2019) uses neural networks to model the relationship between current loss and the corresponding sample weights, and utilizes a meta-learning framework for robust weight assignment. Köhler et al. (2019) estimates the uncertainty to discover the noisy labeled data and relabels mislabeled samples to improve the prediction performance of the predictor model. Gold Loss Correction (Hendrycks et al., 2018) uses a clean validation set to recover the label corruption matrix to retrain the predictor model with corrected labels. Learning to Reweight (Ren et al., 2018) proposes a single gradient descent step guided with validation set performance to reweight the training batch. Domain Adaptive Transfer Learning (Ngiam et al., 2018) introduces importance weights (based on the prior label distribution match) to scale the training samples for transfer learning. MentorNet (Jiang et al., 2018) proposes a curriculum learning framework that learns the order of mini-batches for training of the corresponding predictor model. Our method, DVRL, differs from the aforementioned as it directly models the value of the data using learnable neural networks (which we refer to as a data value estimator).
To train the data value estimator, we use reinforcement learning with a sampling process. DVRL is model-agnostic and even applicable to non-differentiable target objectives. Learning is jointly performed for the data value estimator and the corresponding predictor model, yielding superior results in all of the use cases we consider.
3 Proposed Method
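Framework: Let us denote the training dataset as $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N} \sim \mathcal{P}$, where $\mathbf{x}_i \in \mathcal{X}$ is a feature vector in the $d$-dimensional feature space $\mathcal{X}$, e.g. $\mathbb{R}^d$, and $y_i \in \mathcal{Y}$ is a corresponding label in the label space $\mathcal{Y}$, e.g. $\Delta[0,1]^c$ for classification, where $c$ is the number of classes and $\Delta$ is the simplex. We consider a disjoint testing dataset $\mathcal{D}^t = \{(\mathbf{x}_j^t, y_j^t)\}_{j=1}^{M} \sim \mathcal{P}^t$, where the target distribution $\mathcal{P}^t$ does not need to be the same as the training distribution $\mathcal{P}$. We assume the availability of a (often small; Section 4.5 reports empirical results on how small it can be) validation dataset $\mathcal{D}^v = \{(\mathbf{x}_k^v, y_k^v)\}_{k=1}^{L} \sim \mathcal{P}^t$ that comes from the target distribution $\mathcal{P}^t$.
The DVRL method (overview in Fig. 1) consists of two learnable functions: (1) the target task predictor model $f_\theta$, (2) the data value estimator model $h_\phi$. The predictor model $f_\theta$ is trained to minimize a certain weighted loss function $\mathcal{L}_f$ (e.g. Mean Squared Error (MSE) for regression or cross entropy for classification) on the training set $\mathcal{D}$:
$$f_\theta = \arg\min_{\hat{f} \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} h_\phi(\mathbf{x}_i, y_i) \cdot \mathcal{L}_f(\hat{f}(\mathbf{x}_i), y_i) \qquad (1)$$
$f_\theta$ can be any trainable function with parameters $\theta$, such as a neural network. The data value estimator model $h_\phi$, on the other hand, is optimized to output weights that determine the distribution of selection likelihood of the samples to train the predictor model $f_\theta$. We formulate the corresponding optimization problem as:
$$\min_{h_\phi} \; \mathbb{E}_{(\mathbf{x}^v, y^v) \sim \mathcal{P}^t} \left[ \mathcal{L}_h(f_\theta(\mathbf{x}^v), y^v) \right] \quad \text{s.t.} \quad f_\theta = \arg\min_{\hat{f} \in \mathcal{F}} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{P}} \left[ h_\phi(\mathbf{x}, y) \cdot \mathcal{L}_f(\hat{f}(\mathbf{x}), y) \right] \qquad (2)$$
where $h_\phi(\mathbf{x}, y)$ represents the value of the training sample $(\mathbf{x}, y)$. The data value estimator $h_\phi$ is also a trainable function, such as a neural network. Similar to $\mathcal{L}_f$, we use MSE or cross entropy for $\mathcal{L}_h$.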
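The weighted objective in Eq. (1) can be illustrated with the simplest possible predictor. The sketch below is an assumption-laden toy, not the paper's model: the predictor is a single constant fit with a weighted MSE, whose minimizer is the weighted mean of the labels, and the labels and weights are hypothetical. It shows how per-sample weights from a data value estimator change what the predictor learns.

```python
def weighted_mse(pred, labels, weights):
    """(1/N) * sum_i w_i * (pred - y_i)^2 for a constant predictor `pred`."""
    n = len(labels)
    return sum(w * (pred - y) ** 2 for y, w in zip(labels, weights)) / n

def fit_constant(labels, weights):
    """Closed-form minimizer of weighted_mse: the weighted mean of the labels."""
    return sum(w * y for y, w in zip(labels, weights)) / sum(weights)

labels = [1.0, 1.0, 1.0, 9.0]    # hypothetical: the last label is corrupted
uniform = [1.0, 1.0, 1.0, 1.0]   # conventional training: equal weights
valued = [1.0, 1.0, 1.0, 0.0]    # data values that suppress the bad sample

pred_uniform = fit_constant(labels, uniform)  # pulled towards the corrupted label
pred_valued = fit_constant(labels, valued)    # recovers the clean value
```

With equal weights the fitted constant is 3.0; with the corrupted sample weighted to zero it is 1.0, matching the clean labels.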
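Training: To encourage exploration based on uncertainty, we model training sample selection stochastically. Let $w_i = h_\phi(\mathbf{x}_i, y_i)$ denote the probability that $(\mathbf{x}_i, y_i)$ is used to train the predictor model $f_\theta$. $h_\phi(\mathcal{D}) = \{h_\phi(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ is the probability distribution for inclusion of each training sample. $\mathbf{s} \in \{0, 1\}^N$ is a binary vector that represents the selected samples. If $s_i = 1$ / $s_i = 0$, $(\mathbf{x}_i, y_i)$ is selected/not selected for training the predictor model. $\pi_\phi(\mathcal{D}, \mathbf{s}) = \prod_{i=1}^{N} \left[ h_\phi(\mathbf{x}_i, y_i)^{s_i} \cdot (1 - h_\phi(\mathbf{x}_i, y_i))^{1 - s_i} \right]$ is the probability that a certain selection vector $\mathbf{s}$ is selected based on $h_\phi(\mathcal{D})$. We assign the outputs of the data value estimator model, $h_\phi(\mathbf{x}_i, y_i)$, as the data values. We can use the data values to rank the dataset samples (e.g. to determine a subset of the training dataset) and to do sample-adaptive training (e.g. for domain adaptation).
The predictor model can be trained using standard stochastic gradient descent because it is differentiable with respect to the input. However, gradient descent-based optimization cannot be used for the data value estimator because the sampling process is non-differentiable. There are multiple ways to handle the non-differentiable optimization bottleneck, such as Gumbel-softmax (Jang et al., 2017) or stochastic back-propagation (Rezende et al., 2014). In this paper, we consider reinforcement learning instead, which directly encourages exploration of the policy towards the optimal solution of Eq. (2). We use the REINFORCE algorithm (Williams, 1992) to optimize the policy gradients, with the rewards obtained from a small validation set that approximates performance on the target task. For the loss function $\hat{l}(\phi)$:
$$\hat{l}(\phi) = \mathbb{E}_{(\mathbf{x}^v, y^v) \sim \mathcal{P}^t} \left[ \mathbb{E}_{\mathbf{s} \sim \pi_\phi(\mathcal{D}, \cdot)} \left[ \mathcal{L}_h(f_\theta(\mathbf{x}^v), y^v) \right] \right]$$
we directly compute the gradient $\nabla_\phi \hat{l}(\phi)$ as:
$$\nabla_\phi \hat{l}(\phi) = \mathbb{E}_{(\mathbf{x}^v, y^v) \sim \mathcal{P}^t} \left[ \mathbb{E}_{\mathbf{s} \sim \pi_\phi(\mathcal{D}, \cdot)} \left[ \mathcal{L}_h(f_\theta(\mathbf{x}^v), y^v) \, \nabla_\phi \log \pi_\phi(\mathcal{D}, \mathbf{s}) \right] \right]$$
where $\nabla_\phi \log \pi_\phi(\mathcal{D}, \mathbf{s})$ is
$$\nabla_\phi \log \pi_\phi(\mathcal{D}, \mathbf{s}) = \sum_{i=1}^{N} \left[ s_i \nabla_\phi \log h_\phi(\mathbf{x}_i, y_i) + (1 - s_i) \nabla_\phi \log (1 - h_\phi(\mathbf{x}_i, y_i)) \right]$$
To improve the stability of the training, we use the moving average of the previous losses ($\delta$), with a window size ($T$), as the baseline for the current loss. The pseudo-code is shown in Algorithm 1.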
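The training procedure described above (sample a selection vector, train the predictor on the selected samples, measure the validation loss, and apply a REINFORCE update with a moving-average baseline) can be sketched on a toy problem. This is a simplified illustration, not Algorithm 1: as an assumption, the data value estimator is replaced by one free logit per training sample instead of a neural network over features and labels, the "predictor" is just the mean of the selected labels, and all data and hyper-parameter values in the code are hypothetical.

```python
import math
import random
from collections import deque

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_values(labels, valid_labels, iters=5000, beta=0.02, window=20, seed=0):
    """Toy REINFORCE loop: one logit per training sample stands in for the estimator."""
    rng = random.Random(seed)
    phi = [0.0] * len(labels)      # logits of the selection probabilities
    recent = deque(maxlen=window)  # previous losses for the moving-average baseline
    for _ in range(iters):
        h = [sigmoid(p) for p in phi]
        s = [1 if rng.random() < hi else 0 for hi in h]  # sample the selection vector
        if not any(s):
            continue               # nothing selected: skip this draw
        # "Train" the predictor on the selected samples (here: their mean).
        chosen = [y for y, si in zip(labels, s) if si]
        pred = sum(chosen) / len(chosen)
        # The validation loss serves as the reinforcement signal.
        loss = sum((pred - y) ** 2 for y in valid_labels) / len(valid_labels)
        delta = sum(recent) / len(recent) if recent else 0.0
        # For h = sigmoid(phi), d/dphi_i of the log selection probability is s_i - h_i.
        for i in range(len(phi)):
            phi[i] -= beta * (loss - delta) * (s[i] - h[i])
        recent.append(loss)
    return [sigmoid(p) for p in phi]

labels = [1.0] * 7 + [9.0]      # hypothetical: the last label is corrupted
valid_labels = [1.0, 1.0, 1.0]  # small clean validation set
values = train_values(labels, valid_labels)
```

In this toy setting the corrupted sample ends up with the lowest selection probability, because including it raises the validation loss above the moving-average baseline. Subtracting the baseline does not change the expected gradient; it only reduces its variance.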
Computational complexity: DVRL models the mapping between an input and its value with a learnable function. The training time of DVRL is not directly proportional to the dataset size, but rather dominated by the required number of iterations and per-iteration complexity in Algorithm 1. One way to minimize the computational overhead is to use pre-trained models to initialize the predictor networks at each iteration. Unlike alternative methods like Data Shapley, we demonstrate the scalability of DVRL to large-scale datasets such as CIFAR-100, and complex models such as ResNet-32 (He et al., 2016) and WideResNet-28-10 (Zagoruyko & Komodakis, 2016). Instead of being exponential in terms of the dataset size, the training time overhead of DVRL is only twice that of conventional training. Please see Appendix D for further analysis on learning dynamics of DVRL and Appendix B for additional computational complexity discussions.
4 Experiments
We evaluate the data value estimation quality of DVRL on multiple types of datasets and for various use cases. The source code can be found at https://github.com/google-research/google-research/tree/master/dvrl.
Benchmark methods: We consider the following benchmarks: (1) Randomly-assigned values (Random), (2) Leave-one-out (LOO), (3) Data Shapley Value (Data Shapley) (Ghorbani & Zou, 2019). For some experiments, we also compare with (4) Learning to Reweight (Ren et al., 2018), (5) MentorNet (Jiang et al., 2018), and (6) Influence Function (Koh & Liang, 2017).
Datasets: We consider 12 public datasets (3 tabular datasets, 7 image datasets, and 2 language datasets) to evaluate DVRL in comparison to multiple benchmark methods. The 3 public tabular datasets are (1) Blog, (2) Adult, (3) Rossmann; the 7 public image datasets are (4) HAM 10000, (5) MNIST, (6) USPS, (7) Flower, (8) Fashion-MNIST, (9) CIFAR-10, (10) CIFAR-100; the 2 public language datasets are (11) Email Spam, (12) SMS Spam.
Baseline predictor models: We consider various machine learning models as the baseline predictor model to highlight that the proposed data valuation framework is model-agnostic. For the Adult and Blog datasets, we use LightGBM (Ke et al., 2017), and for the Rossmann dataset, we use XGBoost and multi-layer perceptrons due to their superior performance on tabular datasets. For the Flower, HAM 10000, and CIFAR-10 datasets, we use Inception-v3 with top-layer fine-tuning (pre-trained on ImageNet; Szegedy et al., 2016) as the baseline predictor model. For the Fashion-MNIST, MNIST, and USPS datasets, we use multinomial logistic regression, and for the Email and SMS datasets, we use a Naive Bayes model. We also use ResNet-32 (He et al., 2016) and WideResNet-28-10 (Zagoruyko & Komodakis, 2016) as the baseline models for the CIFAR-10 and CIFAR-100 datasets in Section 4.3 to demonstrate the scalability of DVRL. For the data value estimation network, we use multi-layer perceptrons with ReLU activation as the base architecture. The number of layers and hidden units are optimized with cross-validation.
Experimental details: In all experiments, we use a standard normalizer to normalize all features to have zero mean and unit standard deviation. We transform categorical variables into one-hot encoded embeddings. We set the inner iteration count ($N_I = 200$) for the predictor network, the moving average window ($T = 20$), the mini-batch size ($B_p = 256$) for the predictor network, and the mini-batch size ($B_s = 2000$) for the data value estimation network (a large batch size often improves the stability of reinforcement learning model training (McCandlish et al., 2018)). We set the learning rate to 0.01 ($\beta$) for the data value estimation network and 0.001 ($\alpha$) for the predictor network.
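The two preprocessing steps can be sketched in a few lines. This is a standard-library illustration under stated assumptions, not the preprocessing code used for the experiments: the function names and toy columns are hypothetical, and the population standard deviation is assumed for scaling.

```python
import math

def standardize(column):
    """Scale a numeric column to zero mean and unit standard deviation."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)  # population variance
    std = math.sqrt(var) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in column]

def one_hot(column):
    """Map each category to a 0/1 indicator vector (categories in sorted order)."""
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]

ages = [20.0, 30.0, 40.0]            # hypothetical numeric feature
store_types = ["a", "c", "a", "b"]   # hypothetical categorical feature
z = standardize(ages)
encoded = one_hot(store_types)
```

In practice the mean and standard deviation would be computed on the training set and reused for the validation and testing sets, so that all three are scaled consistently.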
4.1 Removing high/low value samples
Removing low value samples from the training dataset can improve the predictor model performance, especially in the cases where the training dataset contains corrupted samples. On the other hand, removing high value samples, especially if the dataset is small, would decrease the performance significantly. Overall, the performance after removing high/low value samples is a strong indicator for the quality of data valuation.
Initially, we consider the conventional supervised learning setting, where all training, validation and testing datasets come from the same distribution (without sample corruption or domain mismatch). We use two tabular datasets (Adult and Blog) with 1,000 training samples and one image dataset (Flower) with 2,000 training samples. (We use small training datasets in this experiment in order to compare with LOO and Data Shapley, which have high computational complexities. DVRL is scalable to larger datasets as shown in Section 4.3.) We use 400 validation samples for tabular datasets and 800 validation samples for the image dataset. Then, we report the prediction performance on the disjoint testing set after removing the high/low value samples based on the estimated data values.
As shown in Fig. 2, even in the absence of sample corruption or domain mismatch, DVRL can marginally improve the prediction performance after removing some portion of the least important samples. Using only 60% to 70% of the training set (the highest valued samples), DVRL can obtain a similar performance compared to training on the entire dataset. After removing a small portion (10% to 20%) of the most important samples, the prediction performance significantly degrades, which indicates the importance of the high valued samples. Qualitatively looking at these samples, we observe them to typically be representative of the target task, which can be insightful. Overall, DVRL shows the fastest performance degradation after removing the most important samples and the slowest performance degradation after removing the least important samples in most cases, underlining the superiority of DVRL in data valuation quality compared to competing methods (LOO and Data Shapley).
Next, we focus on the setting of removing high/low value samples in the presence of label noise in the training data. We consider three image datasets: Fashion-MNIST, HAM 10000, and CIFAR-10. As noisy samples hurt the performance of the predictor model, an optimal data value estimator with a clean validation dataset should assign the lowest values to the noisy samples. With the removal of samples with noisy labels ('Least' setting), the performance should either increase, or at least decrease much slower, compared to removal of samples with correct labels ('Most' setting). In this experiment, we introduce label noise to 20% of the samples by replacing true labels with random labels. As can be seen in Fig. 10, for all data valuation methods the prediction performance tends to first slowly increase and then decrease in the 'Least' setting, and tends to rapidly decrease in the 'Most' setting. Yet, DVRL achieves the slowest performance decrease in the 'Least' setting and the fastest performance decrease in the 'Most' setting, reflecting its superiority in data valuation.
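The experimental protocol of this subsection can be sketched as two helpers: one that injects label noise and one that removes a fraction of the samples in the 'Least' or 'Most' order. The function names, the seed handling and the rounding of fractions are illustrative assumptions rather than the authors' code.

```python
import random

def corrupt_labels(labels, num_classes, noise_rate, seed=0):
    """Replace the labels of a random `noise_rate` fraction of samples with random labels."""
    rng = random.Random(seed)
    n = len(labels)
    noisy_idx = set(rng.sample(range(n), int(round(noise_rate * n))))
    # A random label may coincide with the true one; it is still counted as noisy here.
    noisy = [rng.randrange(num_classes) if i in noisy_idx else y
             for i, y in enumerate(labels)]
    return noisy, noisy_idx

def remove_by_value(indices, values, fraction, least=True):
    """Drop `fraction` of the samples, lowest-valued first ('Least') or highest first ('Most')."""
    order = sorted(indices, key=lambda i: values[i], reverse=not least)
    return sorted(order[int(round(fraction * len(order))):])

noisy, noisy_idx = corrupt_labels([0] * 100, num_classes=10, noise_rate=0.2)
toy_values = [0.9, 0.1, 0.5, 0.3, 0.7]  # hypothetical data values
kept_least = remove_by_value(range(5), toy_values, 0.4, least=True)
kept_most = remove_by_value(range(5), toy_values, 0.4, least=False)
```

Retraining the predictor on the kept indices for increasing removal fractions, and plotting test performance against the fraction removed, yields curves of the kind discussed above.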
4.2 Corrupted sample discovery
There are some scenarios where training samples may contain corrupted samples, e.g. due to cheap label collection methods. An automated corrupted sample discovery method would be highly beneficial for distinguishing samples with clean vs. noisy labels. Data valuation can be used in this setting by having a small clean validation set to assign low data values to the potential samples with noisy labels. With an optimal data value estimator, all noisy labels would get the lowest data values.
We consider the same experimental setting as in the previous subsection, with a 20% noisy label ratio on 6 datasets. Fig. 4 shows that DVRL consistently outperforms all benchmarks (Data Shapley, LOO and Influence Function). Note that Influence Function is a computationally-efficient approximation of LOO; thus, its performance is similar to or slightly worse than that of LOO in most cases. The trend of noisy label discovery for DVRL can be very close to optimal (as if we perfectly knew which samples have noisy labels), particularly for the Adult, CIFAR-10 and Flower datasets. To highlight the stability of DVRL, we provide the confidence intervals of DVRL performance on the corrupted sample discovery in Appendix E.
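One way to quantify corrupted sample discovery is to inspect the samples in order of increasing data value and record what fraction of the corrupted samples has been found. The sketch below assumes this is the quantity of interest; the function name, the toy values and the corruption flags are hypothetical.

```python
def discovery_curve(values, is_corrupted, fractions):
    """Fraction of corrupted samples found after inspecting the lowest-valued samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(is_corrupted)
    curve = []
    for frac in fractions:
        k = int(round(frac * len(values)))  # number of samples inspected
        found = sum(is_corrupted[i] for i in order[:k])
        curve.append(found / total)
    return curve

# Hypothetical data values for 10 samples, 2 of which (20%) are corrupted.
values = [0.9, 0.1, 0.8, 0.2, 0.7, 0.6, 0.95, 0.85, 0.3, 0.75]
is_corrupted = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
curve = discovery_curve(values, is_corrupted, [0.1, 0.2, 0.5])
```

In this toy case both corrupted samples have the lowest values, so all of them are found after inspecting 20% of the data, which is what an optimal valuation would achieve at a 20% noise ratio. A random valuation would find about 20% of them at that point.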
4.3 Robust learning with noisy labels
In this section, we consider how reliably DVRL can learn with noisy data in an end-to-end way, without removing the low-value samples as in the previous section. Ideally, noisy samples should get low data values as DVRL converges and a high performance model can be returned. To compare DVRL with two recently-proposed benchmarks, MentorNet (Jiang et al., 2018) and Learning to Reweight (Ren et al., 2018), for this use case, we focus on two complex deep neural networks as the baseline predictor models, ResNet-32 (He et al., 2016) and WideResNet-28-10 (Zagoruyko & Komodakis, 2016), trained on the CIFAR-10 and CIFAR-100 datasets. Additional results on other image datasets are in Appendix C.1. Furthermore, the results on robust learning with noisy features are in Appendix C.2.
Noise (predictor model) | Uniform (WideResNet-28-10) | Uniform (WideResNet-28-10) | Background (ResNet-32) | Background (ResNet-32)
Dataset | CIFAR-10 | CIFAR-100 | CIFAR-10 | CIFAR-100
Validation Set Only | 46.64 ± 3.90 | 9.94 ± 0.82 | 15.90 ± 3.32 | 8.06 ± 0.76
Baseline | 67.97 ± 0.62 | 50.66 ± 0.24 | 59.54 ± 2.16 | 37.82 ± 0.69
Baseline + Fine-tuning | 78.66 ± 0.44 | 54.52 ± 0.40 | 82.82 ± 0.93 | 54.23 ± 1.75
MentorNet + Fine-tuning | 78.00 | 59.00 | - | -
Learning to Reweight | 86.92 ± 0.19 | 61.34 ± 2.06 | 86.73 ± 0.48 | 59.30 ± 0.60
DVRL | 89.02 ± 0.27 | 66.56 ± 1.27 | 88.07 ± 0.35 | 60.77 ± 0.57
Clean Only (60% Data) | 94.08 ± 0.23 | 74.55 ± 0.53 | 90.66 ± 0.27 | 63.50 ± 0.33
Zero Noise | 95.78 ± 0.21 | 78.32 ± 0.45 | 92.68 ± 0.22 | 68.12 ± 0.21
We consider the same experimental setting as Ren et al. (2018) on the CIFAR-10 and CIFAR-100 datasets. For the first experiment, we use WideResNet-28-10 as the baseline predictor model and apply 40% label noise uniformly across all classes. We use 1,000 clean (noise-free) samples as the validation set and test the performance on the clean testing set. For the second experiment, we use ResNet-32 as the baseline predictor model and apply 40% background noise (same-class noise to 40% of the samples). In this case, we only use 10 clean samples per class as the validation set. We consider five additional benchmarks: (1) Validation Set Only, which only uses the clean validation set for training, (2) Baseline, which only uses the noisy training set for training, (3) Baseline + Fine-tuning, which is initialized with the baseline model trained on the noisy training set and fine-tuned on the clean validation set, (4) Clean Only (60% Data), which is trained on the clean training set after removing the training samples with flipped labels, (5) Zero Noise, which uses the original noise-free training set for training (100% clean training data). We exclude Data Shapley and LOO in this experiment due to their prohibitively high computational complexities.
As shown in Table 1, DVRL outperforms other robust learning methods in all cases. The performance improvements with DVRL are larger with Uniform noise. With Uniform noise, Learning to Reweight loses 7.16% and 13.21% accuracy on CIFAR-10 and CIFAR-100, respectively, compared to training on the clean samples only (Clean Only (60% Data)); DVRL, on the other hand, only loses 5.06% and 7.99% accuracy.
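The accuracy gaps quoted above can be recomputed directly from the Uniform-noise columns of Table 1. The short check below uses only numbers copied from the table; the dictionary layout and helper name are illustrative. The quoted figures (7.16, 13.21, 5.06, 7.99) match the gaps to the Clean Only (60% Data) row, while the gaps to the Zero Noise row are larger.

```python
# Accuracy (%) with Uniform noise, copied from Table 1: (CIFAR-10, CIFAR-100).
table1 = {
    "Learning to Reweight": (86.92, 61.34),
    "DVRL": (89.02, 66.56),
    "Clean Only (60% Data)": (94.08, 74.55),
    "Zero Noise": (95.78, 78.32),
}

def gaps(reference, method):
    """Accuracy lost by `method` relative to `reference`, per dataset."""
    return tuple(round(r - m, 2) for r, m in zip(table1[reference], table1[method]))

clean_gap_l2r = gaps("Clean Only (60% Data)", "Learning to Reweight")  # (7.16, 13.21)
clean_gap_dvrl = gaps("Clean Only (60% Data)", "DVRL")                 # (5.06, 7.99)
zero_gap_l2r = gaps("Zero Noise", "Learning to Reweight")              # (8.86, 16.98)
zero_gap_dvrl = gaps("Zero Noise", "DVRL")                             # (6.76, 11.76)
```

Under either reference, DVRL loses less accuracy than Learning to Reweight on both datasets.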
4.4 Domain adaptation
In this section, we consider the scenario where the training dataset comes from a substantially different distribution from the validation and testing sets. Naive training methods (i.e. equal treatment of all training samples) often fail in this scenario (Ganin et al., 2016; Glorot et al., 2011). Data valuation is expected to be beneficial for this task by selecting the samples from the training dataset that best match the distribution of the validation dataset.
Source | Target | Task | Baseline | Data Shapley | DVRL
Google | HAM 10000 | Skin Lesion Classification | .296 | .378 | .448
MNIST | USPS | Digit Recognition | .308 | .391 | .472
Email | SMS | Spam Detection | .684 | .864 | .903
We initially focus on the three cases from Ghorbani & Zou (2019), shown in Table 2: (1) using Google image search results (a cheaply collected dataset) to predict skin lesion classification on HAM 10000 data (clean), (2) using MNIST data to recognize digits in the USPS dataset, and (3) using Email spam data to detect spam in an SMS dataset. The experimental settings are exactly the same as in Ghorbani & Zou (2019). Table 2 shows that DVRL significantly outperforms Baseline and Data Shapley in all three tasks. One primary reason is that DVRL jointly optimizes the data value estimator and the corresponding predictor model; Data Shapley, on the other hand, needs a two-step process to construct the predictor model in the domain adaptation setting.
Next, we focus on a real-world tabular data learning problem where the domain differences are significant. We consider the sales forecasting problem with the Rossmann Store Sales dataset, which consists of sales data from four different store types. Simple statistical investigation shows a significant discrepancy between the input feature distributions across different store types, meaning there is a large domain mismatch across store types (see Appendix F). To further illustrate the distribution difference across the store types, we show the t-SNE analysis on the final layer of a discriminative neural network trained on the entire dataset in Appendix Fig. 11. We consider three different settings: (1) training on all store types (Train on All), (2) training on store types excluding the store type of interest (Train on Rest), and (3) training only on the store type of interest (Train on Specific). In all cases, we evaluate the performance on each store type separately. For example, to evaluate the performance on store type D, the Train on All setting uses all four store type datasets for training, the Train on Rest setting uses store types A, B and C for training, and the Train on Specific setting only uses store type D for training. Train on Rest is expected to yield the largest domain mismatch between training and testing sets, and Train on Specific the smallest.
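The three training settings amount to three filters over the training records. The following sketch assumes each record is a dictionary with a `store_type` field; the field name, the setting keys and the toy records are hypothetical.

```python
def training_subset(records, target_type, setting):
    """Select the training records for one of the three settings described above."""
    if setting == "all":
        return list(records)
    if setting == "rest":
        return [r for r in records if r["store_type"] != target_type]
    if setting == "specific":
        return [r for r in records if r["store_type"] == target_type]
    raise ValueError(f"unknown setting: {setting}")

# Hypothetical records, one per store type.
records = [{"store_type": t, "sales": s}
           for t, s in [("A", 10), ("B", 20), ("C", 30), ("D", 40)]]

train_all = training_subset(records, "D", "all")            # A, B, C, D
train_rest = training_subset(records, "D", "rest")          # A, B, C
train_specific = training_subset(records, "D", "specific")  # D only
```

In every setting the evaluation is done on held-out data of the target store type only, so the filter changes the training distribution but not the test distribution.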
Predictor model (metric: RMSPE) | Store type | Train on All: Baseline | Train on All: DVRL | Train on Rest: Baseline | Train on Rest: DVRL | Train on Specific: Baseline | Train on Specific: DVRL
XGBoost | A | 0.1736 | 0.1594 | 0.2369 | 0.2109 | 0.1454 | 0.1430
XGBoost | B | 0.1996 | 0.1422 | 0.7716 | 0.3607 | 0.0880 | 0.0824
XGBoost | C | 0.1839 | 0.1502 | 0.2083 | 0.1551 | 0.1186 | 0.1170
XGBoost | D | 0.1504 | 0.1441 | 0.1922 | 0.1535 | 0.1349 | 0.1221
Neural Networks | A | 0.1531 | 0.1428 | 0.3124 | 0.2014 | 0.1181 | 0.1066
Neural Networks | B | 0.1529 | 0.0979 | 0.8072 | 0.5461 | 0.0683 | 0.0682
Neural Networks | C | 0.1620 | 0.1437 | 0.2153 | 0.1804 | 0.0682 | 0.0677
Neural Networks | D | 0.1459 | 0.1295 | 0.2625 | 0.1624 | 0.0759 | 0.0708
We evaluate the prediction performance of Baseline (training the predictor model without data valuation) and DVRL in 3 different settings with 2 different predictor models (XGBoost (Chen & Guestrin, 2016) and Neural Networks (3-layer perceptrons)). As shown in Table 3, DVRL improves the performance in all settings. The improvements are most significant in the Train on Rest setting due to the large domain mismatch. For instance, DVRL reduces the error by more than 50% for store type B predictions with XGBoost in comparison to Baseline. In the Train on All setting, the performance improvement is still significant, showing that DVRL can distinguish the samples from the target distribution. In Appendix G, we demonstrate that DVRL actually prioritizes selection of the samples from the target store type. In the Train on Specific setting, the performance improvements are smaller; even without domain mismatch, DVRL can marginally improve the performance by accurately prioritizing the important samples within the same store type. These results further support the conclusions from Fig. 2 in the conventional supervised learning setting that DVRL learns high quality data value scores.
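Table 3 reports RMSPE. The paper does not spell the metric out, so the sketch below assumes the usual definition of root mean squared percentage error, with zero-valued targets skipped; the function name and the toy sales numbers are hypothetical. The last lines recompute the relative error reduction for store type B with XGBoost in the Train on Rest setting from the values in Table 3.

```python
import math

def rmspe(y_true, y_pred):
    """Root mean squared percentage error; zero-valued targets are skipped."""
    terms = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred) if t != 0]
    return math.sqrt(sum(terms) / len(terms))

# Hypothetical sales and forecasts: both forecasts are off by 10%.
example = rmspe([100.0, 200.0], [110.0, 180.0])

# Store type B, XGBoost, Train on Rest (values copied from Table 3).
baseline_b, dvrl_b = 0.7716, 0.3607
relative_reduction = (baseline_b - dvrl_b) / baseline_b
```

The relative reduction comes out at roughly 53%, consistent with the "more than 50%" statement above.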
4.5 Discussion: How many validation samples are needed?
DVRL requires a validation dataset from the target distribution that the testing dataset comes from. Depending on the task, the requirements for the validation dataset may involve being noisefree in labels, being from the same domain, or being high quality. Acquiring such a dataset can be costly in some scenarios and it is desirable to minimize its size requirements.
We analyze the impact of the size of the validation dataset on DVRL with 3 different datasets: Adult, Blog, and Fashion-MNIST for the use case of corrupted sample discovery. Similar to Section 4.2, we add 20% noise to the training samples and try to find the corrupted samples with DVRL. As shown in Fig. 5, DVRL achieves reasonable performance with 100 to 400 validation samples. In the Adult dataset, even 10 validation samples are sufficient to achieve a reasonable data valuation quality. Both of these settings are often realistic in real-world scenarios.
5 Conclusions
In this paper, we propose a meta learning framework, named DVRL, that adaptively learns data values jointly with a target task predictor model. The value of each datum determines how likely it will be used in training of the predictor model. We model this data value estimation task using a deep neural network, which is trained using reinforcement learning with a reward obtained from a small validation set that represents the target task performance. With a small validation set, DVRL can provide computationally highly efficient and high quality ranking of data values for the training dataset that is useful for domain adaptation, corrupted sample discovery and robust learning. We show that DVRL significantly outperforms other techniques for data valuation in various applications on diverse types of datasets.
6 Acknowledgements
Discussions with Kihyuk Sohn, Amirata Ghorbani and Zizhao Zhang are gratefully acknowledged.
References
 Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, 2016.
 Choi et al. (2018) Sungjoon Choi, Sanghoon Hong, and Sungbin Lim. Choicenet: Robust learning by revealing output correlations. arXiv preprint arXiv:1805.06431, 2018.
 Ferdowsi et al. (2013) Hasan Ferdowsi, Sarangapani Jagannathan, and Maciej Zawodniok. An online outlier identification and removal scheme for improving fault detection performance. IEEE Transactions on Neural Networks and Learning Systems, 25(5):908–919, 2013.
 Frenay & Verleysen (2014) B. Frenay and M. Verleysen. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.
 Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
 Ghorbani & Zou (2019) Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251, 2019.
 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In International Conference on Machine Learning, pp. 513–520, 2011.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Hendrycks et al. (2018) Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pp. 10456–10465, 2018.
 Hestness et al. (2017) Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory F. Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv:1712.00409, 2017.
 Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.
 Jiang et al. (2018) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2309–2318, 2018.
 Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154, 2017.
 Koh & Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894, 2017.
 Köhler et al. (2019) Jan M Köhler, Maximilian Autenrieth, and William H Beluch. Uncertainty based detection and relabeling of noisy image labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 33–37, 2019.
 Li et al. (2019) Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankanhalli. Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5051–5059, 2019.
 Lundberg & Lee (2017) Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.
 McCandlish et al. (2018) Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.
 Najafabadi et al. (2015) Maryam M Najafabadi, Flavio Villanustre, Taghi M Khoshgoftaar, Naeem Seliya, Randall Wald, and Edin Muharemagic. Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1):1, 2015.
 Ngiam et al. (2018) Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V Le, and Ruoming Pang. Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056, 2018.
 Ren et al. (2018) Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343, 2018.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.
 Shapley (1953) Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
 Shen & Sanghavi (2019) Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pp. 5739–5748, 2019.
 Shu et al. (2019) Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. arXiv preprint arXiv:1902.07379, 2019.
 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
 Toneva et al. (2019) Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 Xue et al. (2019) Cheng Xue, Qi Dou, Xueying Shi, Hao Chen, and Pheng Ann Heng. Robust learning at noisy labeled medical images: Applied to skin lesion classification. arXiv preprint arXiv:1901.07759, 2019.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Zhu et al. (2019) Linchao Zhu, Sercan O. Arik, Yi Yang, and Tomas Pfister. Learning to Transfer Learn. arXiv preprint arXiv:1908.11406, 2019.
Appendix A Block diagrams for inference
Appendix B Computational complexity
DVRL first trains the baseline model using the entire dataset (without reweighting). Afterwards, we use this pretrained baseline model to initialize the predictor network and apply fine-tuning with DVRL update steps. The fine-tuning process converges much faster than training from scratch.
We quantify the computational overhead of DVRL on the CIFAR-100 dataset (consisting of 50k training samples) with ResNet-32 as a representative example. Overall, DVRL training takes less than 8 hours (given a ResNet-32 model pretrained on the entire dataset) on a single NVIDIA Tesla V100 GPU without any hardware optimizations. The pretraining time of ResNet-32 on the entire dataset (without reweighting) is less than 4 hours; thus the total training time of DVRL from scratch is less than 12 hours. In contrast, the training time of Data Shapley (the most competitive benchmark) is more than a week on Fashion-MNIST (which has lower-dimensional inputs and fewer classes) with a much simpler predictor model architecture (a 2-layer convolutional neural network).
At inference, the data value estimator can be used to obtain the data value of each sample. Data valuation typically runs much faster (less than 1 ms per sample) than inference with the predictor model (e.g. ResNet-32).
Appendix C Additional results
C.1 Additional results on robust learning with noisy labels
We evaluate how DVRL can provide robustness for learning with noisy labels. We add various levels of label noise, ranging from 0% to 50%, to the training sets and evaluate how robust the proposed model (DVRL) is to this noise. In this experiment, we use three image datasets (CIFAR-10, Flower, and HAM 10000). Note that we initialize the predictor model with Inception-v3 networks pretrained on ImageNet and only fine-tune the top layer (transfer learning setting).
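The label corruption step can be sketched as follows. This is a minimal sketch, assuming that each corrupted label is replaced by a different class drawn uniformly at random; the exact noise model is not restated in this appendix, and the function name `corrupt_labels` and its parameters are illustrative.

```python
import random


def corrupt_labels(labels, noise_ratio, num_classes, seed=0):
    """Return (noisy_labels, corrupted_indices) for a given noise ratio.

    Assumption: a corrupted sample always receives a class different from its
    original one, chosen uniformly among the remaining classes.
    """
    rng = random.Random(seed)
    n = len(labels)
    n_noisy = int(round(noise_ratio * n))
    corrupted = sorted(rng.sample(range(n), n_noisy))
    noisy = list(labels)
    for i in corrupted:
        alternatives = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy, corrupted


if __name__ == "__main__":
    clean = [i % 10 for i in range(100)]
    for ratio in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
        noisy, idx = corrupt_labels(clean, ratio, num_classes=10)
        changed = sum(a != b for a, b in zip(clean, noisy))
        print(f"noise ratio {ratio:.0%}: {changed} labels changed")
```

Returning the corrupted indices alongside the labels keeps the ground truth available for later evaluation, for example in corrupted sample discovery.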
Table 4: Prediction performance on CIFAR-10, Flower, and HAM 10000 at label noise ratios from 0% to 50%.
Dataset      CIFAR-10                      Flower                        HAM 10000
Noise ratio  Clean     DVRL      Baseline  Clean     DVRL      Baseline  Clean     DVRL      Baseline
0%           .8297     .8305     .8297     .9090     .9292     .9090     .7129     .7148     .7129
10%          .8281     .8306     .7713     .9057     .9158     .7441     .7094     .7142     .6746
20%          .8285     .8271     .6883     .9026     .9152     .5960     .7098     .7126     .6199
30%          .8283     .8262     .5897     .8889     .8901     .4546     .7063     .7005     .5508
40%          .8259     .8255     .4887     .8620     .8787     .2929     .7028     .6968     .4819
50%          .8236     .8225     .3832     .8542     .8678     .2962     .7009     .6814     .4132
Noisy labels significantly degrade the prediction performance when they are included in the training dataset (see the increasing differences between Baseline and Clean in Table 4). DVRL remains robust up to a high noisy label ratio (50%). In some cases, even without noisy labels (0% noise ratio), DVRL outperforms the Clean case, as it prioritizes some clean samples over others. Overall, the DVRL framework is promising for maintaining high prediction performance even as the amount of noisy labels increases significantly.
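The gaps discussed above can be read directly off Table 4. The following sketch copies the table values and computes, for each dataset and noise ratio, how far Baseline and DVRL fall below Clean; the names `TABLE4`, `gaps_to_clean`, and `count_dvrl_above_clean` are illustrative.

```python
# Table 4 values as (Clean, DVRL, Baseline), keyed by noise ratio in percent.
TABLE4 = {
    "CIFAR-10": {
        0: (.8297, .8305, .8297), 10: (.8281, .8306, .7713),
        20: (.8285, .8271, .6883), 30: (.8283, .8262, .5897),
        40: (.8259, .8255, .4887), 50: (.8236, .8225, .3832),
    },
    "Flower": {
        0: (.9090, .9292, .9090), 10: (.9057, .9158, .7441),
        20: (.9026, .9152, .5960), 30: (.8889, .8901, .4546),
        40: (.8620, .8787, .2929), 50: (.8542, .8678, .2962),
    },
    "HAM 10000": {
        0: (.7129, .7148, .7129), 10: (.7094, .7142, .6746),
        20: (.7098, .7126, .6199), 30: (.7063, .7005, .5508),
        40: (.7028, .6968, .4819), 50: (.7009, .6814, .4132),
    },
}


def gaps_to_clean(table):
    """Map dataset -> noise -> (Clean - Baseline, Clean - DVRL), rounded to 4 decimals."""
    return {
        name: {
            noise: (round(clean - base, 4), round(clean - dvrl, 4))
            for noise, (clean, dvrl, base) in rows.items()
        }
        for name, rows in table.items()
    }


def count_dvrl_above_clean(table):
    """Number of (dataset, noise ratio) cells where DVRL exceeds Clean."""
    return sum(dvrl > clean for rows in table.values() for clean, dvrl, _ in rows.values())


if __name__ == "__main__":
    for name, rows in gaps_to_clean(TABLE4).items():
        for noise, (base_gap, dvrl_gap) in rows.items():
            print(f"{name:10s} {noise:2d}%  Clean-Baseline={base_gap:.4f}  Clean-DVRL={dvrl_gap:+.4f}")
```

For instance, on CIFAR-10 the Clean minus Baseline gap grows from 0 at 0% noise to 0.4404 at 50% noise, while the Clean minus DVRL gap at 50% noise is 0.0011.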
C.2 Additional results on robust learning with noisy features
In this section, we consider training with noisy input features and a clean validation set. We independently add Gaussian noise with zero mean and a given standard deviation to each feature in the training set. We use two tabular datasets (Adult and Blog) to evaluate the robustness of DVRL to input noise. As can be seen in Table 5, DVRL is robust to noise on the features, and the performance gains over Baseline (i.e. treating all the noisy training samples equally) tend to be higher with larger noise. This is because DVRL can discover the training samples that are less corrupted by the additive noise and assign higher weights to them.
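Table 5: Prediction performance of Baseline and DVRL on Blog and Adult with noisy input features.
Dataset     Blog                Adult
Noise std.  Baseline  DVRL      Baseline  DVRL
0.1         0.733     0.819     0.802     0.820
0.2         0.647     0.798     0.753     0.788
0.3         0.626     0.766     0.699     0.771
0.4         0.623     0.717     0.652     0.725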
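The feature corruption used in this experiment can be sketched as below. This is a minimal sketch, assuming features are stored as rows of floats and that the same standard deviation is applied to every feature; the function name `add_gaussian_noise` and the `seed` parameter are illustrative, and any feature scaling applied beforehand is not covered here.

```python
import random


def add_gaussian_noise(features, std, seed=0):
    """Add independent zero-mean Gaussian noise to every feature value.

    `features` is a list of rows (lists of floats); a new list is returned and
    the input is left unchanged.
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, std) for x in row] for row in features]


if __name__ == "__main__":
    clean = [[0.5, 1.0, -0.25], [0.0, 2.0, 1.5]]
    for std in (0.1, 0.2, 0.3, 0.4):
        noisy = add_gaussian_noise(clean, std)
        print(std, [[round(v, 3) for v in row] for row in noisy])
```

Only the training features would be perturbed this way; the validation set stays clean, as stated above.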
C.3 Additional results on removing high/low value samples with 20% label noise
C.4 Additional results on corrupted sample discovery with 20% label noise
Appendix D Learning curves of DVRL
Fig. 9 shows the learning curves of DVRL in the noisy data setting (20% label noise), compared with the validation log loss without DVRL (training directly on the noisy data without reweighting), on 2 tabular datasets (Adult and Blog) and 4 image datasets (Fashion-MNIST, Flower, HAM 10000, and CIFAR-10).
Appendix E Confidence intervals of DVRL performance on corrupted sample discovery experiments
Appendix F Rossmann data statistics & t-SNE analysis
Store Type    A               B               C               D
# of Samples  457042 (54.1%)  15560 (1.8%)    112968 (13.4%)  258768 (30.6%)
Sales         1390-1660-1854  2052-2459-2661  1753-1974-2178  2109-2355-2524
Customers     169-203-221     436-492-543     192-232-259     224-246-259
Appendix G Further analysis on Rossmann dataset in Train on All setting
To further understand the results in the Train on All setting, we sort the training samples in decreasing order of their data values estimated by DVRL and examine the distribution of the training samples that come from the target store type. As can be seen in Fig. 12, DVRL prioritizes the training samples that come from the target store type.
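The analysis described above can be sketched as follows: rank samples by estimated data value and measure, within each rank bin, the fraction that belongs to the target store type. This is a minimal sketch; the function name `target_fraction_by_rank`, the equal-size binning, and the toy inputs are assumptions for illustration and do not reproduce Fig. 12.

```python
def target_fraction_by_rank(values, store_types, target_type, num_bins=5):
    """Fraction of target-type samples in each bin of the value ranking.

    Samples are sorted by data value in decreasing order and split into
    `num_bins` consecutive bins of (nearly) equal size; bin 0 holds the
    highest-valued samples.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    n = len(order)
    fractions = []
    for b in range(num_bins):
        start, end = b * n // num_bins, (b + 1) * n // num_bins
        chunk = order[start:end]
        hits = sum(1 for i in chunk if store_types[i] == target_type)
        fractions.append(hits / len(chunk) if chunk else 0.0)
    return fractions


if __name__ == "__main__":
    # Toy example with hypothetical data values and store types.
    values = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]
    types = ["B", "B", "A", "B", "A", "A"]
    print(target_fraction_by_rank(values, types, target_type="B", num_bins=3))
```

If the estimator prioritizes the target store type, the fractions should be highest in the first bins and decrease toward the last ones.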