# Importance weighted generative networks

###### Abstract

Deep generative networks can simulate from a complex target distribution, by minimizing a loss with respect to samples from that distribution. However, often we do not have direct access to our target distribution—our data may be subject to sample selection bias, or may be from a different but related distribution. We present methods based on importance weighting that can estimate the loss with respect to a target distribution, even if we cannot access that distribution directly, in a variety of settings. These estimators, which differentially weight the contribution of data to the loss function, offer both theoretical guarantees and impressive empirical performance.

## 1 Introduction

Deep generative models have important applications in many fields: we can automatically generate illustrations for text [32]; simulate video streams [30] or molecular fingerprints [17]; and create privacy-preserving versions of medical time-series data [9]. Such models use a neural network to parametrize a function $G(Z)$, which maps random noise $Z$ to a target probability distribution $\mathbb{P}$. This is achieved by minimizing a loss function between simulations and data, which is equivalent to learning a distribution over simulations that is indistinguishable from $\mathbb{P}$ under an appropriate two-sample test. In this paper we focus on Generative Adversarial Networks (GANs) [11, 2, 3, 19], which incorporate an adversarially learned neural network in the loss function; however, the results are also applicable to non-adversarial networks [8, 20].

An interesting challenge arises when we do not have direct access to i.i.d. samples from the target distribution $\mathbb{P}$. This could arise either because observations are obtained via a biased sampling mechanism [4, 33], or in a transfer learning setting where our target distribution differs from our training distribution. As an example of the former, a dataset of faces generated as part of a university project may contain disproportionately many young adult faces relative to the population. As an example of the latter, a Canadian hospital system might want to customize simulations to its population while still leveraging a training set of patients from the United States (which has a different statistical distribution of medical records). In both cases, and more generally, we want to generate data from a target distribution $\mathbb{P}$ but only have access to representative samples from a modified distribution $M\mathbb{P}$. We give a pictorial example of this setting in Fig. 1.

In some cases, we can approach this problem using existing methods. For example, if we can reduce our problem to a conditional data-generating mechanism, we can employ Conditional Generative Adversarial Networks (C-GANs) or related models [22, 25], which enable conditional sampling given one or more latent variables. However, this requires that a) the modifier function $M$ can be described on a low-dimensional space, and b) we can sample from our target distribution over that latent space. Further, C-GANs rely on a large, labeled dataset of training samples with diversity over the conditioning variable (within each batch), which becomes a challenge when conditioning on a high-dimensional variable. For example, if we wish to modify a distribution over faces with respect to age, gender and hair length, there may be few exemplars of 80-year-old men with long hair with which to learn the corresponding conditional distribution.

In this paper, we propose an alternative approach based on importance sampling [26]. Our method modifies an existing GAN by rescaling the observed data distribution during training, or equivalently by reweighting the contribution of each data point to the loss function. For concreteness, consider as the loss function the Maximum Mean Discrepancy (MMD) [12], a statistical distance between two distributions. When training a GAN with samples from the modified distribution $M\mathbb{P}$, the standard MMD estimator equally weights the contribution of each point, yielding an estimator of the distance between the generator distribution and $M\mathbb{P}$, as shown in Fig. (c).

In order to yield the desired estimator of the distance between our generator distribution and our target distribution $\mathbb{P}$, we reweight the MMD estimator by scaling the kernel evaluation for each sample. When the Radon-Nikodym derivative between the target and observed distributions (also known as the modifier function $M$) is known, we inversely scale each evaluation by that derivative, yielding the finite-sample importance sampling transform on the MMD estimate, which we call the importance weighted estimator. This reweighting asymptotically guarantees that MMD discrimination, and the corresponding GAN update, occurs with respect to $\mathbb{P}$ instead of $M\mathbb{P}$ (Fig. (d)).

This approach has multiple advantages and extensions. First, if the modifier function $M$ is known, we can now estimate importance weighted losses using robust estimators like the median-of-means estimator, which is crucial for controlling variance in settings where the modifier function has a large dynamic range. Second, even when the modifier function is only known up to a constant scaling, we can construct an alternative estimator using self-normalized sampling [27, 26] to use this partial information, while still maintaining asymptotic correctness. Finally, if the modifier function is unknown, we demonstrate techniques for estimating it from partially labeled data.

Our contributions are as follows: 1) We provide the first application of traditional importance weighting to deep generative models. This has connections to many types of GAN loss functions through the theory of U-statistics. 2) We propose several variants of our importance weighting framework for different practical scenarios. When dealing with particularly difficult functions , we propose to use robust median-of-means estimation and show that it has similar theoretical guarantees under weaker assumptions, i.e. bounded second moment. When is not known fully (only up to a scaling factor), we propose a self-normalized estimator. 3) We conduct extensive experimental evaluation of the proposed methods on both synthetic and real-world datasets. This includes estimating when less than of the data is labeled with user-provided exemplars.

### 1.1 Related work

Our method aims to generate samples from a target distribution $\mathbb{P}$, given access to samples from a modified distribution $M\mathbb{P}$. While to the best of our knowledge this has not been explicitly addressed in the GAN literature, several approaches have related goals.

Inverse probability weighting: Inverse probability weighting (IPW), originally proposed by [15] and still in wide use in the field of survey statistics [21], can be seen as a special case of importance sampling. IPW is a weighting scheme used to correct for biased treatment assignment methods in survey sampling. In such settings, the target distribution is known and the sampling distribution is typically finite and discrete, and can easily be estimated from data.

Conditional GANs: Conditional GANs (C-GANs) are an extension of GANs that aim to simulate from a conditional distribution, given some covariate. If our modifier function $M$ can be represented in terms of a low-dimensional covariate space, and if we can generate samples from the marginal distribution of the target $\mathbb{P}$ on that space, then we can, in theory, use a C-GAN to generate samples from $\mathbb{P}$, by conditioning on the sampled covariates. This strategy suffers from two limitations. First, it assumes we can express $M$ in terms of a sampleable distribution on a low-dimensional covariate space. For settings where $M$ varies across many data dimensions or across a high-dimensional latent embedding, this ability to sample becomes untenable. Second, learning a family of conditional distributions is typically more difficult than learning a single joint distribution. As we show in our experiments, C-GANs often fail if there are too few real exemplars for a given covariate setting.

Weighted loss: In the context of domain adaptation for data with discrete class labels, the strategy of reweighting the MMD metric based on class probabilities has been proposed by [31]. This approach, however, differs from ours in several ways: It is limited to class imbalance problems, as opposed to changes in continuous-valued latent features; it requires access to the non-conforming target dataset; it provides no theoretical guarantees about the weighted estimator; and it is not in the generative model setting.

## 2 Problem formulation and technical approach

The problem: Given training samples from a modified distribution $M\mathbb{P}$, our goal is to construct (train) a generator function $G(\cdot)$ that produces i.i.d. samples from a target distribution $\mathbb{P}$.

To train $G(\cdot)$, we follow the methodology of a Generative Adversarial Network (GAN) [11]. In brief, a GAN consists of a pair of interacting and evolving neural networks: a generator neural network with outputs that approximate the desired distribution, and a discriminator neural network that distinguishes between increasingly realistic outputs from the generator and samples from a training dataset.

The loss function is a critical feature of the GAN discriminator, and evaluates the closeness between the samples of the generator and those of the training data. Designing good loss functions remains an active area of research [2, 19]. One popular loss function is the Maximum Mean Discrepancy (MMD) [12], a distributional distance that is zero if and only if all moments of the two distributions are the same. As such, MMD can be used to guard against mode collapse [28, 5] during training.

Our approach: We are able to train a GAN to generate samples from the target distribution $\mathbb{P}$ using a simple reweighting modification to the MMD loss function. Reweighting forces the loss function to apply greater penalties in areas of the support where the target and observed distributions differ most.

Below, we formally describe the MMD loss function, and describe its importance weighted variants. We note that while we discuss and evaluate our results in the context of MMD loss, much of this study naturally generalizes to other loss functions, as we describe in Remark 1.

### 2.1 Maximum mean discrepancy between two distributions

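The MMD projects two distributions $\mathbb{P}$ and $\mathbb{Q}$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and looks at the maximum mean distance between the two projections, i.e.

$$\mathrm{MMD}(\mathbb{P}, \mathbb{Q}) := \sup_{f \in \mathcal{H},\ \|f\|_{\mathcal{H}} \le 1} \Big( \mathbb{E}_{X \sim \mathbb{P}}[f(X)] - \mathbb{E}_{Y \sim \mathbb{Q}}[f(Y)] \Big).$$

If we specify the kernel mean embedding $\mu_{\mathbb{P}}$ of $\mathbb{P}$ as $\mu_{\mathbb{P}} = \int k(x, \cdot)\, d\mathbb{P}(x)$, where $k(\cdot, \cdot)$ is the characteristic kernel defining the RKHS, then we can write the square of this distance as

$$\mathrm{MMD}^2(\mathbb{P}, \mathbb{Q}) = \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}}^2 = \mathbb{E}_{X, X' \sim \mathbb{P}}[k(X, X')] + \mathbb{E}_{Y, Y' \sim \mathbb{Q}}[k(Y, Y')] - 2\, \mathbb{E}_{X \sim \mathbb{P},\, Y \sim \mathbb{Q}}[k(X, Y)]. \tag{1}$$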

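Since projections are into infinite-dimensional space, the MMD (and hence the squared MMD) between the two distributions is zero if and only if all of their moments are the same [12]. In order to be a useful loss function for training a neural network, we must be able to estimate $\mathrm{MMD}^2(\mathbb{P}, \mathbb{Q})$ from data, and compute gradients of this estimate with respect to the network parameters. Let $\{x_i\}_{i=1}^{n}$ be a sample $X \sim \mathbb{P}$, and $\{y_j\}_{j=1}^{m}$ be a sample $Y \sim \mathbb{Q}$. We can construct an unbiased estimator $\widehat{\mathrm{MMD}}^2(X, Y)$ of $\mathrm{MMD}^2(\mathbb{P}, \mathbb{Q})$ [12] using these samples as

$$\widehat{\mathrm{MMD}}^2(X, Y) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i \neq j}^{m} k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j). \tag{2}$$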
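As a minimal sketch of the unbiased squared-MMD estimator described above (within-sample kernel averages over pairs with $i \neq j$, minus twice the cross-sample average), the following NumPy code may help. The Gaussian RBF kernel, its `bandwidth` parameter, and the function names are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix with entries k(a_i, b_j)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased squared-MMD estimate from samples x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    # Within-sample sums run over i != j, so the diagonals are removed.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    term_xy = 2.0 * k_xy.sum() / (n * m)
    return term_xx + term_yy - term_xy


# Toy usage: two samples from the same Gaussian, then a mean-shifted pair.
rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
shifted = mmd2_unbiased(
    rng.normal(size=(200, 2)), rng.normal(3.0, 1.0, size=(200, 2))
)
print(same, shifted)
```

Because the estimator is unbiased rather than non-negative, the value for two samples from the same distribution can be slightly below zero.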

### 2.2 Importance weighted estimator for known $M$

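We begin with the case where the modifier function $M$ (which relates the distribution of the samples and the desired distribution; formally the Radon-Nikodym derivative) is known. Here, the reweighting of our loss function can be framed as an importance sampling problem: we want to estimate $\mathrm{MMD}^2(\mathbb{P}, \mathbb{Q})$, which is in terms of the target distribution $\mathbb{P}$ and the distribution $\mathbb{Q}$ implied by our generator, but we have samples from the modified distribution $M\mathbb{P}$. Importance sampling [26] provides a method for constructing an estimator for the expectation of a function $\phi(X)$ with respect to a distribution $\mathbb{P}$, by taking an appropriately weighted sum of evaluations of $\phi$ at values sampled from a different distribution. Given samples $\{x_i\}_{i=1}^{n}$ from $M\mathbb{P}$ and $\{y_j\}_{j=1}^{m}$ from $\mathbb{Q}$, we can therefore modify the estimator in (2) by weighting each term in the estimator involving data point $x_i$ using the likelihood ratio $1/M(x_i)$, yielding an unbiased importance weighted estimator that takes the form

$$\widehat{\mathrm{MMD}}^2_{IW}(X, Y) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \frac{k(x_i, x_j)}{M(x_i) M(x_j)} + \frac{1}{m(m-1)} \sum_{i \neq j}^{m} k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{k(x_i, y_j)}{M(x_i)}. \tag{3}$$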

Note that this unbiased estimator is superficially similar to an estimator obtained by upsampling each data point $x$ in our training set by $1/M(x)$ and then using the estimator from (2). For example, for a particular value $x_i$, if $1/M(x_i) = 3$, then two duplicates of $x_i$ would be added to the dataset, for a total of three. However, such an estimator would be biased: if two or more copies of the data point $x_i$ appear in a minibatch, then $k(x_i, x_i)$ will appear in the first term of (2). Further, upsampling in this manner would increase the computational and memory requirements over importance weighting. We compare our methods to this upsampling estimator in Section 3.
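The importance weighted estimator, in which every kernel evaluation involving an observed point is divided by the modifier function at that point, can be sketched as follows. This is an illustrative NumPy version under assumptions of our own: a fixed Gaussian RBF kernel, a hypothetical argument `modifier_x` holding the modifier values, and a made-up toy problem (target Uniform(0, 1), observed density $0.5 + x$).

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix with entries k(a_i, b_j)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def mmd2_importance_weighted(x, y, modifier_x, bandwidth=1.0):
    """Importance weighted squared-MMD estimate.

    x          : (n, d) samples from the modified (observed) distribution
    y          : (m, d) samples from the generator
    modifier_x : (n,) modifier function evaluated at each row of x
    """
    n, m = len(x), len(y)
    w = 1.0 / np.asarray(modifier_x, dtype=float)  # importance weights 1 / M(x_i)
    k_xx = rbf_kernel(x, x, bandwidth) * np.outer(w, w)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth) * w[:, None]
    # Sums over i != j: subtract the diagonal of each within-sample matrix.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    term_xy = 2.0 * k_xy.sum() / (n * m)
    return term_xx + term_yy - term_xy


# Toy example (hypothetical): target Uniform(0, 1), observed density 0.5 + x.
rng = np.random.default_rng(0)
u = rng.uniform(size=(1000, 1))
x_obs = (-1.0 + np.sqrt(1.0 + 8.0 * u)) / 2.0  # inverse-CDF draw from 0.5 + x
y_gen = rng.uniform(size=(1000, 1))  # stand-in for generator samples
weighted = mmd2_importance_weighted(x_obs, y_gen, 0.5 + x_obs[:, 0])
unweighted = mmd2_importance_weighted(x_obs, y_gen, np.ones(len(x_obs)))
print(weighted, unweighted)
```

In this toy setting one would expect the weighted value to sit closer to zero than the unweighted one, although both fluctuate with the random draw. Passing a vector of ones for `modifier_x` recovers the standard estimator.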

While importance weighting using the likelihood ratio yields an unbiased estimator (3), the estimator may not concentrate well because the weights may be large or even unbounded. We now provide a concentration bound for the estimator in (3) for the case where weights are upper-bounded by some maximum value.

###### Theorem 1.

Let be the unbiased, importance weighted estimator for defined in (3), given i.i.d samples from and , and maximum kernel value K. Further assume that for all . Then

where .

### 2.3 Robust importance weighted estimator for known $M$

Theorem 1 is sufficient to guarantee good concentration of our importance weighted estimator only when is uniformly bounded by some constant , which is not too large. Many class imbalance problems fall into this setting. However, may be unbounded in practice. Therefore, we now introduce a different estimator, which enjoys good concentration even when only is bounded, while may be unbounded for many values of .

The estimator is based on the classical idea of median of means, which goes back to [24, 16, 1]. Given $n$ samples from each of $M\mathbb{P}$ and $\mathbb{Q}$, we divide these samples uniformly at random into $g$ equal sized groups, indexed $\{1, \dots, g\}$. Let $\eta_\ell$ be the value obtained when the estimator in (3) is applied on the $\ell$-th group of samples. Then our median of means based estimator is given by

$$\widehat{\mathrm{MMD}}^2_{MIW}(X, Y) = \mathrm{median}\{\eta_1, \dots, \eta_g\}. \tag{4}$$
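The median of means construction (random split into groups, importance weighted estimate per group, median across groups) can be sketched as below. The helper names, the RBF kernel, and the use of `np.array_split` to form near-equal groups are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix with entries k(a_i, b_j)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def iw_mmd2(x, y, w, bandwidth=1.0):
    """Importance weighted squared MMD with weights w_i = 1 / M(x_i)."""
    n, m = len(x), len(y)
    k_xx = rbf_kernel(x, x, bandwidth) * np.outer(w, w)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth) * w[:, None]
    return (
        (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
        + (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
        - 2.0 * k_xy.sum() / (n * m)
    )


def mmd2_median_of_means(x, y, modifier_x, num_groups, rng, bandwidth=1.0):
    """Median over groups of the importance weighted squared-MMD estimate.

    Each group must contain at least two points from x and two from y.
    """
    w = 1.0 / np.asarray(modifier_x, dtype=float)
    # Random partition of both samples into (near-)equal sized groups.
    x_groups = np.array_split(rng.permutation(len(x)), num_groups)
    y_groups = np.array_split(rng.permutation(len(y)), num_groups)
    estimates = [
        iw_mmd2(x[ix], y[iy], w[ix], bandwidth)
        for ix, iy in zip(x_groups, y_groups)
    ]
    # For an even number of groups np.median averages the two middle values.
    return float(np.median(estimates))


rng = np.random.default_rng(0)
x_obs = rng.normal(size=(120, 2))
y_gen = rng.normal(size=(120, 2))
modifier = rng.uniform(0.5, 2.0, size=120)  # made-up modifier values
print(mmd2_median_of_means(x_obs, y_gen, modifier, num_groups=5, rng=rng))
```

With `num_groups=1` the function reduces to the importance weighted estimator on the full sample.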

###### Theorem 2.

Let be the unbiased median of means estimator defined in (4) using groups. Further assume that and let be bounded. Then

(5) |

where .

We defer the proof of this theorem to Appendix B. Note that the confidence bound in Theorem 2 depends on the term being bounded. This is the second moment of where . Thus, unlike in Theorem 1, this confidence bound may still hold even if is not uniformly bounded. In addition to increased robustness, the median of means MMD estimator is more computationally efficient: since calculating scales quadratically in the batch size, using the median of means estimator introduces a speed-up that is linear in the number of groups.
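The linear speed-up can be checked with a short count of kernel evaluations. As a simplifying assumption, suppose the cost is dominated by the roughly $n^2$ kernel evaluations needed for a batch of size $n$, and write $g$ for the number of groups (the numbers below are chosen purely for illustration):

$$
\underbrace{n^2}_{\text{one batch of size } n}
\quad \text{versus} \quad
\underbrace{g \cdot \left(\frac{n}{g}\right)^2 = \frac{n^2}{g}}_{g \text{ groups of size } n/g},
\qquad \text{e.g. } n = 1024,\ g = 8: \quad 1024^2 = 1{,}048{,}576 \ \text{ versus } \ 8 \cdot 128^2 = 131{,}072 .
$$

The second count is smaller by a factor of $g = 8$, matching the claimed speed-up that is linear in the number of groups.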

### 2.4 Self-normalized importance weights for unknown $M$

To specify the modifier function $M$, we must know the forms of our target and observed distributions along any marginals where the two differ. In some settings this is available: consider for example a class rebalancing setting where we have class labels and a desired class ratio, and can estimate the observed class ratio from data. This, however, may be infeasible if $M$ is continuous and/or varies over several dimensions, particularly if data are arriving in a streaming manner. In such a setting it may be easier to specify a thinning function $T(x)$ that is proportional to $M(x)$, i.e. $M(x) = T(x)/Z$ for some unknown $Z$, than to estimate $M(x)$ directly. This is because $T(x)$ can be directly obtained from an estimate of how much a given location is underestimated, without any knowledge of the underlying distribution.

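This setting, where the weights used in Section 2.2 are only known up to a normalizing constant, motivates the use of a self-normalized importance sampling scheme, where the weights are normalized to sum to one [27, 26]. For example, by letting $w(x) = 1/T(x)$ for a thinning function $T$, and given samples $\{x_i\}_{i=1}^{n}$ from the observed distribution and $\{y_j\}_{j=1}^{m}$ from the generator, the resulting self-normalized estimator for the squared MMD takes the form

$$\widehat{\mathrm{MMD}}^2_{SN}(X, Y) = \sum_{i \neq j}^{n} \frac{w(x_i) w(x_j)}{\sum_{i' \neq j'}^{n} w(x_{i'}) w(x_{j'})}\, k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i \neq j}^{m} k(y_i, y_j) - \frac{2}{m} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{w(x_i)}{\sum_{i'=1}^{n} w(x_{i'})}\, k(x_i, y_j). \tag{6}$$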

While use of self-normalized weights means this self-normalized estimator is biased, it is asymptotically unbiased, with the bias decreasing at a rate of $O(1/n)$ [18]. Also, although we have motivated self-normalized weights out of necessity, in practice they often trade off bias for reduced variance, making them preferable in some practical applications [26].
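One way to realise self-normalization in code is to normalize the pairwise weights (over pairs with $i \neq j$) and the single-point weights so that each set sums to one. The sketch below follows that reading. It is an assumption-laden illustration (RBF kernel, hypothetical argument `thinning_x` holding values of a function proportional to the modifier), not a transcription of the paper's implementation.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix with entries k(a_i, b_j)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def mmd2_self_normalized(x, y, thinning_x, bandwidth=1.0):
    """Self-normalized squared-MMD estimate.

    thinning_x : (n,) values proportional to the modifier function at x;
                 the unknown proportionality constant cancels out.
    """
    m = len(y)
    w = 1.0 / np.asarray(thinning_x, dtype=float)  # unnormalized weights
    pair_w = np.outer(w, w)
    np.fill_diagonal(pair_w, 0.0)  # keep only pairs with i != j
    pair_w = pair_w / pair_w.sum()  # pair weights sum to one
    single_w = w / w.sum()  # single-point weights sum to one
    k_yy = rbf_kernel(y, y, bandwidth)
    term_xx = np.sum(pair_w * rbf_kernel(x, x, bandwidth))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    term_xy = 2.0 * np.sum(single_w[:, None] * rbf_kernel(x, y, bandwidth)) / m
    return term_xx + term_yy - term_xy


rng = np.random.default_rng(0)
x_obs = rng.normal(size=(100, 2))
y_gen = rng.normal(size=(80, 2))
thinning = rng.uniform(0.5, 2.0, size=100)  # made-up thinning values
# Rescaling the thinning function leaves the estimate unchanged.
print(mmd2_self_normalized(x_obs, y_gen, thinning))
print(mmd2_self_normalized(x_obs, y_gen, 10.0 * thinning))
```

The design choice to normalize inside each sum is what removes the dependence on the unknown constant, at the cost of the small bias discussed above.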

More generally, in addition to not knowing the normalizing constant $Z$, we might also not know the thinning function $T$. For example, $T$ might vary along some latent dimension; perhaps we want to have more images of people fitting a certain aesthetic, rather than corresponding to a certain observed covariate or class. In this setting, a practitioner may be able to estimate $T(x_i)$, or equivalently $w(x_i) = 1/T(x_i)$, for a small number of training points $x_i$, by considering how much those training points are under- or over-represented. Continuous latent preferences can therefore be expressed by applying higher weights to points deemed more appealing. From here, we can use function estimation techniques, such as neural network regression, to estimate $T$ from a small number of labeled data points.
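To illustrate propagating a handful of hand-assigned weights to the full dataset, the sketch below fits a regressor on a small labeled subset and predicts weights everywhere. For brevity it swaps in ridge regression on log-weights where the text suggests neural network regression; the feature matrix, function names, and synthetic data are all hypothetical.

```python
import numpy as np


def fit_log_weight_regressor(features, weights, ridge=1e-6):
    """Ridge regression of log-weights on features (with an intercept).

    A simple stand-in for the neural network regressor mentioned in the text.
    """
    design = np.hstack([features, np.ones((len(features), 1))])
    gram = design.T @ design + ridge * np.eye(design.shape[1])
    return np.linalg.solve(gram, design.T @ np.log(weights))


def predict_weights(features, coef):
    """Predicted positive weights for every row of features."""
    design = np.hstack([features, np.ones((len(features), 1))])
    return np.exp(design @ coef)


# Synthetic illustration: 500 points, only the first 50 carry manual weights.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))  # e.g. encoder outputs (assumed)
true_weights = np.exp(features @ np.array([0.5, -0.2, 0.1]) + 0.3)
labeled = slice(0, 50)
coef = fit_log_weight_regressor(features[labeled], true_weights[labeled])
all_weights = predict_weights(features, coef)
print(all_weights[:5])
```

Regressing on log-weights keeps the predictions positive, which suits their later use as self-normalized importance weights.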

###### Remark 1 (Extension to other losses).

While this paper focuses on the MMD loss, we note that the above estimators can be extended to any estimator that can be expressed as the expectation of some function with respect to one or more distributions. This class includes losses such as squared mean difference between two distributions, cross entropy loss, and autoencoder losses [29, 13, 23]. Such losses can be estimated from data using a combination of U-statistics, V-statistics and sample averages. Each of these statistics can be reweighted, in a manner analogous to the treatment above. We provide more comprehensive details in Appendix D, and in Section 3.1, we evaluate all three importance weighting techniques as applied to the standard cross entropy GAN objective.
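As one hedged illustration of reweighting a sample average, the sketch below applies importance weights to the real-data term of the standard cross entropy GAN discriminator objective. The function name, arguments, and clipping constant are assumptions introduced here; the exact form used in the paper's experiments is given in its Appendix D and may differ.

```python
import numpy as np


def weighted_discriminator_loss(d_real, d_fake, modifier_real=None,
                                self_normalize=False, eps=1e-12):
    """Negative cross entropy GAN objective with a reweighted real-data term.

    d_real        : discriminator probabilities on observed data points
    d_fake        : discriminator probabilities on generated samples
    modifier_real : modifier (or thinning) values at the observed points;
                    None means equal weights, i.e. the standard loss
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    if modifier_real is None:
        w = np.ones_like(d_real)
    else:
        w = 1.0 / np.asarray(modifier_real, dtype=float)
    if self_normalize:
        # Weights normalized to sum to one; an unknown scaling cancels.
        real_term = np.sum(w * np.log(d_real)) / np.sum(w)
    else:
        # Plain importance weighted sample average.
        real_term = np.mean(w * np.log(d_real))
    fake_term = np.mean(np.log(1.0 - d_fake))
    return -(real_term + fake_term)


d_real = np.array([0.5, 0.8])
d_fake = np.array([0.5])
print(weighted_discriminator_loss(d_real, d_fake, modifier_real=[0.5, 2.0]))
```

Only the term that averages over observed data is reweighted; the term over generated samples is left unchanged because those samples come directly from the generator.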

## 3 Evaluation

In this section, we show that our estimators, in conjunction with an appropriate generator network, allow us to generate simulations that are close in distribution to our target distribution, even when we only have access to this distribution via a biased sampling mechanism. Further, we show that our method performs comparably with, or better than, a number of reasonable alternative approaches.

Most of our weighted GAN models are based on the implementation of MMD-GAN [19], replacing the original MMD loss with either our importance weighted loss (IW-MMD-GAN), our median of means loss (MIW-MMD-GAN), or our self-normalized loss (SN-MMD-GAN). Other losses used in [19] were also appropriately weighted, following the results in Appendix D. In the synthetic data examples of Section 3.1, the kernel is a fixed radial basis function, while in all other sections it is adversarially trained using a discriminator network as in [19].

To demonstrate that our method is applicable to other losses, in Section 3.1 we also create models that use the standard cross entropy GAN loss, replacing this loss with either an importance weighted estimator (IW-CE-GAN), a median of means estimator (MIW-CE-GAN) or a self-normalized estimator (SN-CE-GAN). These models used a two-layer feedforward neural network with ten nodes per layer. Details of applying these estimators to other losses are given in Appendix D.

We consider two comparison methods. If the modifier function $M$ is known exactly and expressible in terms of a lower-dimensional covariate space, a conditional GAN (C-GAN) offers an alternative method to sample from the target distribution $\mathbb{P}$: learn the appropriate conditional distributions given each covariate value, sample new covariate values, and then sample from $\mathbb{P}$ using each conditional distribution. Additionally, we evaluate a baseline upsampling scheme: add exact copies of the underrepresented data until the augmented dataset follows $\mathbb{P}$, and then train a standard cross entropy GAN. This baseline is asymptotically equivalent to importance weighting. However, it requires more computation to process the same amount of data. Further, it is no longer unbiased (since we may see terms $k(x_i, x_i)$ due to replication of a data point $x_i$), and lacks theoretical guarantees.

### 3.1 Can GANs with importance weighted estimators recover target distributions, given $M$?

To evaluate whether using importance weighted estimators can recover target distributions, we aim to recover samples from a synthetically generated distribution that has been manipulated along a latent dimension. Under the target distribution, a latent representation of each data point lives in a ten-dimensional latent space, with each dimension being independently Uniform(0,1). The observed data points are then obtained as , where represents a fixed mapping between the latent space and -dimensional observed space. In the training data, the first dimension of has distribution . We assume that the modifying function is observed, but that the remaining latent dimensions are unobserved.

We attempt to generate samples from the target distribution using each of the methods described above. We include weighted versions of the cross entropy GAN to ensure that our results are comparable with C-GAN, which also uses cross entropy. To compare the methods, we report the distance (measured in terms of squared MMD) between the true target distribution and the generated samples. The results are shown in Table 1, for varying real dimensions . We see that C-GAN performs well in two dimensions, but as the problem becomes more challenging by adding more dimensions, the importance weighted methods dominate. Note the high variance of the C-GAN estimates, suggesting the method is unstable. In fact, many runs either ran into numerical issues or diverged; in these cases we report the best score among runs, before training failure.

Model | 2D | 4D | 10D |
---|---|---|---|
IW-CE-GAN | 0.0430 ± 0.0049 | 0.0491 ± 0.0024 | 0.0266 ± 0.0064 |
MIW-CE-GAN | 0.0455 ± 0.0032 | 0.0537 ± 0.0045 | 0.0236 ± 0.0025 |
SN-CE-GAN | 0.0391 ± 0.0048 | 0.0538 ± 0.0047 | 0.0269 ± 0.0018 |
IW-MMD-GAN | 0.0575 ± 0.0044 | 0.0694 ± 0.0024 | 0.0213 ± 0.0008 |
MIW-MMD-GAN | 0.0552 ± 0.0045 | 0.0676 ± 0.0037 | 0.0216 ± 0.0018 |
SN-MMD-GAN | 0.0437 ± 0.0052 | 0.0436 ± 0.0015 | 0.0192 ± 0.0015 |
C-GAN | 0.0377 ± 0.0047 | 0.0947 ± 0.0119 | 0.0409 ± 0.0141 |
Upsampling | 0.0364 ± 0.0028 | 0.0498 ± 0.0052 | 0.0237 ± 0.0024 |

^{1} Distributional distance is measured as the squared MMD estimated (using the standard estimator) between samples from each.

### 3.2 In a high dimensional image setting, how does importance weighting compare with conditional generation?

Next we evaluate performance of importance weighted MMD on image generation. In this section we address two questions: Can our estimators generate simulations from in such a setting, and how do the resulting images compare with those obtained using a C-GAN? To do so, we evaluate several generative models on the Yearbook dataset [10], which contains over high school yearbook photos across over years and demonstrates evolving styles and demographics. The goal is to produce images uniformly across each half decade. Each GAN, however, is trained on the original dataset, which contains many more photos from recent decades.

Since we have specified the modifier function $M$ in terms of a single covariate (time), we can compare with both upsampling and C-GANs. For the C-GAN, we use a conditional version of the standard DCGAN architecture (C-DCGAN).

Fig. 2 shows generated images from each network. All networks were trained until convergence. The images show a diversity across hairstyles, demographics and facial expressions. Since some covariates have fewer than images, C-DCGAN cannot learn the conditional distributions. We note that while upsampling is competitive with our proposed methods, it lacks theoretical guarantees. We conjecture that our discriminator is powerful enough to learn to ignore the additional terms introduced by upsampling, even for finite samples. Implementation details and additional experiments are given in Appendix C.

### 3.3 When $M$ is unknown, but can be estimated up to a normalizing constant on a subset of data, are we able to sample from our target distribution?

In many settings, we will not know the functional form of the modifier function $M$, or even the corresponding thinning function $T$. For such a setting, we propose labeling a small subset of our training data using an estimated weighting function. We train a neural network to propagate the estimated weighting function to the full dataset. The MMD-GAN uses an autoencoder, so we use the encoded images as input to this weighting function. Since the resulting continuous-valued weighting function exists in a high-dimensional space that changes as the encoder is updated, and since we do not know the full observed distribution on this space, we use self-normalized estimators for all losses, and are in a setting unsuitable for conditional methods.

We evaluate using a collection of sevens from the MNIST dataset, where the goal is to generate more European-style sevens with horizontal bars. Out of 5915 images, 200 were manually labeled with a weight (reciprocal of a thinning function value), where a seven with no horizontal bar was assigned a 1, and sevens with horizontal bars were assigned weights between 2 and 200 based on the width of the bar.

Fig. (a) shows 100 real images, sorted in terms of their predicted weights; note that the majority have no horizontal bar. Fig. (b) shows 100 generated simulations, sorted in the same manner, clearly showing an increase in the number of horizontal-bar sevens. This qualitative result is supported by Fig. (c), which shows the predicted weights for both real and simulated data.

## 4 Conclusions and future work

We present three estimators for the MMD (and a wide class of other loss functions) between the target distribution $\mathbb{P}$ and the distribution $\mathbb{Q}$ implied by our generator. These estimators can be used to train a GAN to simulate from the target distribution $\mathbb{P}$, given samples from a modified distribution $M\mathbb{P}$. We present solutions for when the modifier function $M$ is potentially unbounded, is unknown, or is known only up to a scaling factor.

While the median of means estimator offers a more robust estimate of the MMD, we may still experience high variance in our estimates, for example if we rarely see data points from a class we want to boost. An interesting future line of research is exploring how variance-reduction techniques [7] or adaptive batch sizes [6] could be used to overcome this problem.

## References

- [1] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In ACM symposium on Theory of Computing, pages 20–29. ACM, 1996.
- [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In ICML, 2017.
- [3] M. Bińkowski, D.J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In ICLR, 2018.
- [4] T. Bolukbasi, K.-W. Chang, J.Y. Zou, V. Saligrama, and A.T. Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In NIPS, 2016.
- [5] T. Che, Y. Li, A.P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. In ICLR, 2017.
- [6] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In AISTATS, pages 1504–1513, 2017.
- [7] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
- [8] G.K. Dziugaite, D.M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
- [9] C. Esteban, S.L. Hyland, and G. Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv:1706.02633, 2017.
- [10] S. Ginosar, K. Rakelly, S. M. Sachs, B. Yin, C. Lee, P. Krähenbühl, and A. A. Efros. A century of portraits: A visual historical record of American high school yearbooks. IEEE Transactions on Computational Imaging, 3(3):421–431, Sept 2017.
- [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
- [12] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13(Mar):723–773, 2012.
- [13] Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The annals of mathematical statistics, pages 293–325, 1948.
- [14] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. JASA, 58(301):13–30, 1963.
- [15] D.G. Horvitz and D.J. Thompson. A generalization of sampling without replacement from a finite universe. JASA, 47(260):663–685, 1952.
- [16] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.
- [17] A. Kadurin, A. Aliper, A. Kazennov, P. Mamoshina, Q. Vanhaelen, K. Khrabrov, and A. Zhavoronkov. The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology. Oncotarget, 8(7):10883, 2017.
- [18] Augustine Kong. A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep, 348, 1992.
- [19] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: Towards deeper understanding of moment matching network. In NIPS, 2017.
- [20] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015.
- [21] M.A. Mansournia and D.G. Altman. Inverse probability weighting. BMJ, 352:i189, 2016.
- [22] M. Mehdi and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
- [23] R v Mises. On the asymptotic distribution of differentiable statistical functions. The Annals of Mathematical Statistics, 18(3):309–348, 1947.
- [24] Arkadii Nemirovskii, David Borisovich Yudin, and Edgar Ronald Dawson. Problem complexity and method efficiency in optimization. Wiley, 1983.
- [25] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
- [26] A.B. Owen. Monte Carlo theory, methods and examples. Book draft, 2013.
- [27] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, 2 edition, 2004.
- [28] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
- [29] Gábor J Székely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249–1272, 2013.
- [30] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
- [31] Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. arXiv preprint arXiv:1705.00609, 2017.
- [32] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
- [33] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017.

## Appendix A Proof of Theorem 1

Before we prove Theorem 1, we will define some notation. Suppose , and are the empirical samples obtained from , and , respectively. We use the following quantity as in [12], with samples and :

(7) |

Here, denotes a pair of i.i.d. samples from . The estimator can be written as

###### Proof.

Now consider the setting with samples and . For a modifying function with values on , the weights are therefore bounded above, i.e. . We rewrite the function , now including weights, as

(8) |

Assuming the kernel is bounded between and , we can infer function bounds such that .

Using Theorem 10 from Gretton et al. [12], we have that

(9) |

where , as the MMD requires two samples to evaluate . ∎

## Appendix B Proof of Theorem 2

Before we prove Theorem 2, we prove two functional lemmas.

###### Lemma 1.

The variance of the estimator given samples each from and is upper bounded by , where and .

###### Proof.

Let and let . Using Hoeffding’s Theorem and the fact that [13], we bound the variance of the unbiased MMD U-statistic by

∎

###### Lemma 2.

We have the following bound:

where the expectation is with respect to the distribution .

###### Proof.

Let . Note that . Therefore, we have the following chain,

This implies the lemma as are independent and generated from . The first inequality follows from the fact that , if lies on the simplex. The last inequality follows from the assumption that . ∎

###### Proof of Theorem 2.

Define to be the variance upper bound in Lemma 2. Suppose we have samples from and , for . We divide the samples into groups, where . We form the estimators of type for each of the groups indexed . Let be the estimator for group .

Note that by Lemma 1 the variance of is bounded by . Therefore, with probability at least , is within distance of its mean. As such, the probability that the median is not within the distance is at most , which is exponentially small in . Substituting the value of yields the result. ∎

## Appendix C Implementation and Additional Experiments

### C.1 Yearbook

The C-DCGAN is trained for epochs using the ADAM optimizer with , , and , and a batch size of . The latent variable has dimension , and we condition on a -dimensional vector corresponding to each half-decade in the dataset.

Networks for the importance weighted and median of means estimator are trained using and RMSprop optimizer with learning rate . We use the same regularizers and schedule of generator-discriminator updates as [19]. For a batch size of was used, and for , a large batch of was split randomly into groups of samples.

Fig. 4 shows interpolation in the latent for the half-decade experiment in Section 3.2. Fig. 5 shows another Yearbook experiment with larger imbalance between time periods: Old (1930) and New (1980-2013). MMD-GANs are trained for generator iterations.

Fig. 6 shows a related experiment in which we produce more older images given a dataset with equal amounts of old (1925-1944) and new (2000-2013) photos. Here, each time period contains over images, which increases the stability of conditional GAN training. MMD-GANs are trained until convergence ( generator iterations).

### C.2 MNIST

Analogous to Section 3.3, we use our self-normalized estimator to manipulate the distribution over twos from the MNIST dataset, where we aim to have fewer curly twos and more twos with a flat bottom. As before, 200 images were manually labeled with weights. Fig. (a) shows 100 real images, sorted in terms of their inferred weight. Fig. (b) shows 100 generated simulations, sorted in the same manner, clearly showing a decrease in the proportion of curly twos. Fig. (c) shows the inferred weights for both real and simulated data.

## Appendix D Extension to other losses

The importance weighted and self-normalized estimation techniques described in this paper can be applied to any loss that can be expressed as the expectation of some function with respect to one or both distributions and . Estimators for such functions can be expressed in terms of some combination of U-statistics, V-statistics and sample averages.

To construct an importance weighted estimator , we replace any U-statistics, V-statistics or sample averages in according to Table 2. To construct a self-normalized estimator , we replace those statistics according to Table 2. Construction of a median of means estimator proceeds analogously to (4).

U-statistic | ||
---|---|---|

V-statistic | ||

Average |

U-statistic | ||
---|---|---|

V-statistic | ||

Average |